Title: Profiling Stylistic Variations in Dickens and Smollett through Correspondence Analysis of Low Frequency Words

Author: Tomoji Tabata
Statement of responsibility:
Marked up by Martin Holmes
Patricia Baer
Marked up to be included in the ACH/ALLC 2005 Conference Abstracts book.
Source(s):
None
Text classification:
Keywords:
paper
Keywords:
  • stylistic variation
  • stylometry
  • correspondence analysis
  • MDH: Created from John Bradley's XML March 2005
  • MDH: Proofed and passed without changes by RS 25 May 2005

Profiling Stylistic Variations in Dickens and Smollett through Correspondence Analysis of Low Frequency Words

Tomoji Tabata    tabata@lang.osaka-u.ac.jp

Osaka University

The aim of this paper is to present the results of a corpus-driven, quantitative analysis of the style of Dickens in comparison with the style of Smollett. The particular problem discussed is the differing distribution of -ly adverbs in the texts written by the two authors. By applying a multivariate stylo-statistical model, this study illustrates how sharply the two authors differ in their use of adverbs as well as how texts are differentiated according to genre and chronology within authorial groups.
On the relationship between linguistic registers and adverbs, Biber et al. (1999, 541) present interesting findings from a large-scale corpus:
It is interesting to note that, overall, fiction ... uses many different descriptive -ly adverbs, although few of these are notably common (occurring over 50 times per million words). Rather, fiction shows great diversity in its use of -ly adverbs. In describing fictional events and the actions of fictional characters, writers often use adverbs with specific descriptive meanings.
In fact, -ly adverbs found in Dickens are quite diverse. In the 23 Dickens texts used in this study, the number of types amounts to 1,728, whereas Smollett employed 634 types. Among those, a few types are highly frequent, such as really and certainly, occurring more than one thousand times. Conversely, a large number of adverbs occur only once. Such hapax legomena include a few types which sound very much Dickensian, such as evil-adverbiously, patientissamentally, and Shakespearianly. Although tokens of -ly adverbs account for only a little more than 1% of total word-tokens in the texts, the findings by Biber et al. suggest that -ly adverbs deserve special attention in the stylistic study of fiction.
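The type, token, and hapax counts described above can be sketched as follows. This is only an illustration, not the procedure actually used in the study: the function name `ly_adverb_profile`, the input format of (word, tag) pairs, and the assumption that adverb tags begin with "RB" (a Penn Treebank-style tagset) are all hypothetical.

```python
from collections import Counter

def ly_adverb_profile(tagged_tokens):
    """Count -ly adverb tokens, types, and hapax legomena in one tagged text.

    tagged_tokens: iterable of (word, tag) pairs. Tags starting with "RB"
    are treated as adverbs (an assumed Penn Treebank-style tagset).
    """
    counts = Counter(
        word.lower()
        for word, tag in tagged_tokens
        if tag.startswith("RB") and word.lower().endswith("ly")
    )
    # Hapax legomena: types that occur exactly once.
    hapaxes = sorted(w for w, n in counts.items() if n == 1)
    return {
        "tokens": sum(counts.values()),
        "types": len(counts),
        "hapaxes": hapaxes,
        "counts": counts,
    }

# Invented toy input: "family" ends in -ly but is a noun, "well" is an
# adverb without -ly, so neither is counted.
sample = [
    ("He", "PRP"), ("really", "RB"), ("spoke", "VBD"),
    ("Shakespearianly", "RB"), ("and", "CC"), ("really", "RB"),
    ("family", "NN"), ("well", "RB"),
]
profile = ly_adverb_profile(sample)
print(profile["tokens"], profile["types"], profile["hapaxes"])
# prints: 3 2 ['shakespearianly']
```

Filtering on both the tag and the suffix keeps nouns and adjectives in -ly out of the count, which is why manually corrected tags matter for a profile of this kind.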
This study deals with a corpus of texts comprising Dickens' and Smollett's major works. Dickens' set includes fifteen 'serial fictions', six 'sketches', one 'miscellany', and one 'history'. Smollett's contains six 'fictions' and one 'sketch'. The total word-tokens in the corpus amount to 5.8 million, with the Dickens component containing 4.7 million tokens and the Smollett component totalling 1.1 million word-tokens. The present project was initiated as a study based on a comprehensive collection, not a sample corpus, of texts by the targeted authors. Therefore, the imbalance in the number of texts as well as tokens is inevitable. However, due attention will be paid in the choice of variables to minimize the potential effect of the difference in size between the two sets. All the texts in the corpus have been annotated with POS tags, using Eric Brill's Rule-Based Tagger (also known as the Brill Tagger). Manual post-editing has been conducted to correct a number of wrongly assigned tags.
In an early successful attempt at a computational description of literary style, Milic compared the style of Jonathan Swift with the writings of his contemporaries, with special reference to the relative frequencies of word-classes in the texts and to grammatical features such as seriation and connection. Cluett (1971 & 1976) adopted a similar approach to conduct a diachronic study of prose style across 4 centuries: from the 16th to the 20th centuries. Brainerd's works (1979 & 1980) are ambitious attempts to apply discriminant analysis to the question of genre and chronology in Shakespeare's plays. Takefuta's approach to text typology, or register variation, is among the first to successfully apply factor/cluster analysis to the lexical differences between registers. His pioneering work, however, is not widely acknowledged because it was written in Japanese. Since Burrows (1987) and Biber (1988), it has become popular practice to employ multivariate techniques in quantitative studies of texts. Biber carried out factor analysis (FA) on 67 linguistic features to identify co-occurring linguistic features that account for dimensions of register variation. A series of research projects based on Biber's Multi-Feature/Multi-Dimensional approach have been successful in elucidating many interesting aspects of linguistic variation, such as language acquisition, ESP, diachronic change of prose style, and differences between conversational styles in British and American English, to give a handful of examples (Biber & Finegan; Conrad & Biber eds.).
The Biber model is one of the most sophisticated approaches by far. Yet it is not without its critics. Nakamura (1995) raises a major objection. He argues that Biber's variables are "quite arbitrarily selected with no definite criterion and mixed levels" (1995, 77-86). Further, Sigley (1997) notes that almost half of Biber's 67 linguistic features are too rare in texts of 2,000 words.
Burrows (1987), on the other hand, applied a Principal Component Analysis (PCA) to the thirty most common words in the language of Jane Austen. The method demonstrates that differing frequency patterns in these very common words show significant differentiations among Austen's characters, and that the statistical analysis of literary style may lead not only to a deeper understanding of the novel itself but may also contribute to our deeper appreciation of it. In this use of a PCA, the frequencies of common words are used as variables. The Burrows method seems to have higher replicability and feasibility; since it focuses on common words, most of the variables are frequent enough to produce stable statistical results. In addition, it does not require a multi-layered tagging scheme optimised for Biber's MF/MD approach.
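A PCA of common-word frequencies of the kind described above can be sketched with a singular value decomposition. This is a generic illustration rather than Burrows's own procedure: the function name `pca_scores`, the choice to standardize each word's relative frequency, and the random toy matrix are all assumptions introduced here.

```python
import numpy as np

def pca_scores(freq, n_components=2):
    """PCA of a (texts x words) matrix of relative frequencies.

    Each column is centred and scaled to unit variance (assumption),
    which corresponds to a PCA on the correlation matrix.
    """
    X = np.asarray(freq, dtype=float)
    X = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    X = X / np.where(std > 0, std, 1.0)  # leave constant columns at zero
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # text coordinates
    loadings = Vt[:n_components].T                   # word loadings
    explained = (s ** 2) / np.sum(s ** 2)            # share of variance
    return scores, loadings, explained[:n_components]

# Invented toy data: 8 texts (or characters) by 5 common words.
rng = np.random.default_rng(0)
toy_freq = rng.random((8, 5))
scores, loadings, explained = pca_scores(toy_freq)
print(scores.shape, loadings.shape)
```

Plotting the first two columns of `scores` gives a text-map, and the rows of `loadings` show which words pull texts towards each end of an axis.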
A particular strength of the Burrows methodology is in testing cases of disputed authorship and national differences in the English first-person retrospective narrative, known as 'history'. Among the most successful applications are Burrows (1989, 1992 & 1996), Craig (1999a, b, & c). The Burrows approach or similar methodology has been applied to Bible stylometry. Some scholars like Linmans, Merriam, and Mealand use Correspondence Analysis (CA) instead of PCA. In the context of text typology, Nakamura (1993) applied CA to the frequency distribution of personal pronouns to visualize association between personal pronouns and 15 text categories in the LOB corpus.
My earlier work (Tabata) also used CA to analyse the distribution patterns of Part-of-Speech in Dickens's 23 texts and identified a contrast between serial fiction and sketches. The present study differs from the Burrows model in that it extends the range of variables to include low-frequency words, or rare words, by applying CA to the analysis of -ly adverbs. CA is one of the techniques for data reduction alongside PCA and FA. Unlike PCA and FA, however, CA does not require the intervening step of calculating a correlation matrix or covariance matrix, and can therefore process the data directly to obtain a solution. CA allows graphical examination, in a multi-dimensional space, of the complex interrelationships between row cases (i.e., texts), the interrelationships between column variables (i.e., adverbs), and the association between the row cases and column variables. It computes the row coordinates (text scores) and column coordinates (word scores) in a way that permutes the original data matrix so that the correlation between the word variables and text profiles is maximized. In a permuted data matrix, adverbs with a similar pattern of distribution become the closest neighbours, and so do texts of similar profile. When the row/column scores are projected in multi-dimensional charts like Figures 1 to 4, the relative distance between entries indicates the degree of affinity, similarity, or association between them. One advantage CA has over PCA and FA is that PCA and FA cannot be computed on a rectangular matrix where the number of columns exceeds the number of rows, a concern of the present study. Yet CA can handle such a data table with, for example, the row cases consisting of thirty texts and the column variables consisting of hundreds of adverbs.
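The computation of row and column scores can be sketched with the standard SVD formulation of correspondence analysis. This is a minimal illustration under stated assumptions, not the software used for the study: the function name `correspondence_analysis`, the use of principal coordinates for both rows and columns, and the toy table are introduced here, and the table is assumed to have no empty rows or columns.

```python
import numpy as np

def correspondence_analysis(counts, n_dims=2):
    """Simple CA of a (texts x adverbs) contingency table via SVD."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                 # correspondence matrix
    r = P.sum(axis=1)               # row masses (texts)
    c = P.sum(axis=0)               # column masses (adverbs)
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: text scores and word scores.
    row_scores = (U[:, :n_dims] * s[:n_dims]) / np.sqrt(r)[:, None]
    col_scores = (Vt.T[:, :n_dims] * s[:n_dims]) / np.sqrt(c)[:, None]
    inertia = s ** 2                # principal inertias per dimension
    return row_scores, col_scores, inertia

# Invented toy table: 4 texts (rows) by 6 adverbs (columns), i.e. more
# columns than rows. Texts 0-1 and texts 2-3 form two contrasting groups.
toy = np.array([
    [10, 8, 1, 0, 2, 1],
    [ 9, 7, 2, 1, 1, 0],
    [ 1, 0, 9, 8, 7, 6],
    [ 0, 2, 8, 9, 6, 7],
])
row_scores, col_scores, inertia = correspondence_analysis(toy)
print(np.round(row_scores, 3))
```

The frequency table enters the computation directly, and texts with similar adverb profiles receive similar coordinates on the first axis, which is the property the text-maps and word-maps rely on.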
Figure 1: Correspondence Analysis of -ly adverbs in Dickens & Smollett based on 1,278 types that appear in two or more texts: Text-map showing interrelationships between 30 texts
Figure 2: CA: Word-map showing interrelationships between 1,278 types of -ly adverbs
Figure 3: Correspondence Analysis of -ly adverbs in Dickens & Smollett based on the most common 99 types: Text-map showing interrelationships between 30 texts
Figure 4: CA: Word-map showing interrelationships between 99 types of -ly adverbs
Figures 1-4 summarise the results of applying a CA model to the frequency analysis of -ly adverbs in the texts. Figures 1 and 2, based on 1,278 -ly adverbs which occur in more than one text, clearly differentiate between the Dickens and Smollett sets. The pattern along the horizontal axis allows quite a straightforward interpretation. A more sceptical mind, however, might attribute it to the imbalance in the number of texts between the authorial sets as well as in the number of types of adverbs, with the Dickens corpus at four times the size of the Smollett corpus. One might respond to such scepticism with Figures 3 and 4, which are based on the most common 99 -ly adverbs used by both Dickens and Smollett. Despite the decrease in the number of variables from 1,278 to 99, the configuration of Figure 3 is remarkably similar to that of Figure 1. Of further interest is that, in each of the two authors' sets, earlier works tend to be found towards the bottom of the chart with later works in the upper half of the diagram. Additionally, in the Dickensian territory of Figures 1 and 3, serial fiction texts occupy the right end while other genres, such as sketches and history, are located slightly towards the left. The series of results seems to illustrate how authorial difference, text genre, and chronology are reflected in the frequency pattern of -ly adverbs in the texts written by Dickens and Smollett. This pilot study suggests the effectiveness of a stylo-statistical approach based on correspondence analysis of lower-frequency words in texts.
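The two variable sets compared above (types occurring in two or more texts, and the most common types used by both authors) can be sketched as column selections on a texts-by-adverbs count matrix. The function names, the author labels, and the reading of "used by both" as "at least one occurrence in each author's set" are assumptions made for illustration only.

```python
import numpy as np

def types_in_min_texts(counts, min_texts=2):
    """Indices of adverb columns attested in at least `min_texts` texts."""
    counts = np.asarray(counts)
    return np.flatnonzero((counts > 0).sum(axis=0) >= min_texts)

def most_common_shared(counts, author_labels, top_n):
    """Indices of the `top_n` most frequent adverbs used by every author.

    'Used by every author' is read here as at least one occurrence in
    each author's set of texts (assumption).
    """
    counts = np.asarray(counts)
    labels = np.asarray(author_labels)
    shared = np.ones(counts.shape[1], dtype=bool)
    for author in np.unique(labels):
        shared &= counts[labels == author].sum(axis=0) > 0
    idx = np.flatnonzero(shared)
    # Sort shared columns by total frequency, highest first.
    order = np.argsort(-counts[:, idx].sum(axis=0), kind="stable")
    return idx[order[:top_n]]

# Invented toy data: 4 texts by 5 adverbs; "D" and "S" are hypothetical labels.
toy_counts = np.array([
    [5, 0, 1, 0, 3],
    [4, 1, 0, 0, 2],
    [6, 0, 0, 2, 0],
    [3, 0, 0, 0, 1],
])
toy_authors = ["D", "D", "S", "S"]
multi_text = types_in_min_texts(toy_counts)
top_shared = most_common_shared(toy_counts, toy_authors, top_n=2)
print(multi_text, top_shared)
# prints: [0 4] [0 4]
```

Restricting the variables to adverbs attested in both authors' sets removes columns that could separate the authors merely because one corpus is larger, which is the concern the smaller variable set is meant to address.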

Bibliography