Some forty years ago, Mosteller and Wallace suggested in their influential work on the Federalist
Papers that a small number of the most frequent words in a language ('function words')
could usefully serve as indicators of authorial style. The decades since have seen this work taken
up in many ways, including the use of new analysis techniques (discriminant analysis, PCA,
neural networks, and more) and the search for more sophisticated features with which to
capture the stylistic properties of texts. Interestingly, while the use of more sophisticated models and
algorithms has often led to more reliable and generally applicable results, it has proven quite
difficult to improve on the general usefulness of function words for stylistic attribution. Indeed,
John F. Burrows, in his seminal work on Jane Austen, has demonstrated that function
words can be quite effectively used for attributing text passages to different authors, novels, or
individual characters.
The intuition behind the utility of function words for stylistic attribution is as follows.
Due to their high frequency in the language and highly grammaticalized roles, function words are
very unlikely to be subject to conscious control by the author. At the same time, the frequencies
of different function words vary greatly across different authors and genres of text; hence the
expectation that modeling the dependence of function word frequencies on style
will result in effective attribution. However, the highly reductionistic nature of such features
is unsatisfying, as they rarely give good insight into the underlying stylistic issues; hence the
various efforts to develop more complex textual features that remain computationally feasible.
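To make the basic feature type concrete, the sketch below computes relative function-word frequencies for a text. The short word list and whitespace tokenizer are illustrative stand-ins only; actual studies use hundreds of the most frequent words in the corpus and more careful tokenization.

```python
from collections import Counter

# Illustrative function-word list; a real feature set would contain
# the few hundred most frequent words of the corpus under study.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def function_word_profile(text: str) -> list[float]:
    """Relative frequency of each function word in a text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

# Passages by different hands tend to yield different profiles.
print(function_word_profile("It was the best of times, it was the worst of times."))
```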
One especially promising line of work in this regard has been the examination of frequent
word sequences and collocations for stylistic attribution, particularly Hoover's recent (2004)
systematic work on clustering analysis of several text collections using frequent word
collocations. A "word collocation" is defined as a certain pair of words occurring within a given
threshold distance of each other (such as "is" and "certain" appearing within 5 words of each
other in this sentence). Given such a threshold, the most frequent such collocations are
determined over the entire corpus, and their frequencies in each text constitute its features for
analysis. Hoover's analyses show the superiority, for his data set, of using frequent word
collocations (for certain window sizes) over using frequent words or pairs of adjacent words.
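As a sketch of how such features might be extracted (the exact tokenization and counting conventions of Hoover's study may differ), the following counts unordered word pairs co-occurring within a window of k tokens:

```python
from collections import Counter

def collocation_counts(tokens: list[str], k: int = 5) -> Counter:
    """Count unordered word pairs occurring within k tokens of each other."""
    counts: Counter = Counter()
    for i, w in enumerate(tokens):
        # Pair the current token with each of the next k tokens.
        for v in tokens[i + 1 : i + 1 + k]:
            counts[tuple(sorted((w, v)))] += 1
    return counts

tokens = "a word collocation is a certain pair of words near each other".split()
# Over a whole corpus, the most frequent such pairs become the feature set.
print(collocation_counts(tokens, k=5).most_common(3))
```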
We contend, however, that by using such a small data set (twenty samples of 10,000
words each, in one case), the discriminating power of a model based on function words will be
much reduced, and so the comparison may not be fair. As has been shown for other
computational linguistic tasks (see, e.g., Banko & Brill), even simple language modeling
techniques can greatly improve in effectiveness when trained on larger quantities of data. We
have therefore compared the relative effectiveness of frequent words against frequent pairs
and collocations for attributing both author identity and national origin, considering many
more text passages than earlier work.
We performed classification experiments on the twenty novels considered by Hoover,
treating each separate chapter of each book as a separate text (rather than using just the first
10,000 words of each novel as a single text). Table 1 gives the full list with numbers of chapters
and average number of words per chapter. We used a standard state-of-the-art machine learning
technique to derive linear discrimination models between pairs of authors. This procedure gave
results that clearly show the superiority of function words over collocations as stylistic features.
Qualitatively similar results were obtained for the two-class problem of attributing the national
origin (American or British) of a text's author. We conclude from this that larger and more
detailed studies need to be done to effectively validate the use of a given feature type for
authorship attribution.
| Author  | Book                        | # Chapters | Avg. Words / Chapter |
|---------|-----------------------------|------------|----------------------|
| Cather  | My Antonia                  | 45         | 1826                 |
| Cather  | Song of the Lark            | 60         | 2581                 |
| Cather  | The Professor's House       | 28         | 2172                 |
| Conrad  | Lord Jim                    | 45         | 2913                 |
| Conrad  | The Nigger of the Narcissus | 5          | 10592                |
| Hardy   | Jude the Obscure            | 53         | 2765                 |
| Hardy   | The Mayor of Casterbridge   | 45         | 2615                 |
| Hardy   | Tess of the d'Urbervilles   | 58         | 2605                 |
| James   | The Europeans               | 12         | 5003                 |
| James   | The Ambassadors             | 36         | 4584                 |
| Kipling | The Jungle Book             | 13         | 3980                 |
| Kipling | Kim                         | 15         | 7167                 |
| Lewis   | Babbitt                     | 34         | 3693                 |
| Lewis   | Main Street                 | 34         | 4994                 |
| Lewis   | Our Mr. Wrenn               | 19         | 4126                 |
| London  | The Call of the Wild        | 7          | 4589                 |
| London  | The Sea Wolf                | 39         | 2739                 |
| London  | White Fang                  | 25         | 2917                 |
| Wells   | The Invisible Man           | 28         | 1756                 |
| Wells   | The War of the Worlds       | 27         | 2241                 |

Table 1. Corpus composition.
Given each particular feature set (frequent words, pairs, or collocations), the method was to
represent each document as a numerical vector, each of whose elements is the frequency of a
particular feature of the text. We then applied the SMO learning algorithm (Platt) with
default parameters, which gives a model linearly weighting the various text features. SMO is a
support vector machine (SVM) algorithm; SVMs have been applied successfully to a wide variety
of text categorization problems (Joachims).
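A minimal sketch of this pipeline follows. The study used Weka's SMO implementation; scikit-learn's LinearSVC serves here as a stand-in linear SVM, and the four toy documents and labels are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for chapter texts and author labels.
docs = ["the whale moved of itself", "a ship sailed to the east",
        "the whale sang in the deep", "a storm drove the ship home"]
labels = ["A", "B", "A", "B"]

# Relative frequencies of the 200 most frequent words in the corpus
# (use_idf=False, norm="l1" yields plain term frequencies).
features = TfidfVectorizer(max_features=200, use_idf=False, norm="l1")
model = make_pipeline(features, LinearSVC())  # linear weighting of features
model.fit(docs, labels)
print(model.predict(["the whale rose from the deep"]))
```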
Generalization accuracy was measured using 20-fold cross-validation, in which the 633
chapters were divided into 20 subsets of nearly equal size (31 or 32 chapters per subset). Training was
performed 20 times, each time leaving out one of the subsets, and then using the omitted subset
for testing. The overall classification error rate was estimated as the average error rate over all 20
runs. This method gives a reasonable estimate of the expected error rate of the learning method
for each given feature set and target task (Goutte).
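The evaluation protocol can be sketched as below. The feature matrix is random stand-in data with the corpus's dimensions (633 chapters, 200 features), so only the cross-validation procedure itself mirrors the text.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import LinearSVC

# Random stand-ins shaped like the real data: 633 chapter vectors of
# 200 feature frequencies each, with binary (e.g. nationality) labels.
rng = np.random.default_rng(0)
X = rng.random((633, 200))
y = rng.integers(0, 2, size=633)

# 20-fold CV: train on 19 subsets, test on the held-out one, and
# average the accuracy over the 20 runs.
cv = KFold(n_splits=20, shuffle=True, random_state=0)
scores = cross_val_score(LinearSVC(), X, y, cv=cv)
print(f"estimated accuracy: {scores.mean():.3f}")
```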
Results of measuring generalization accuracy for different feature sets are summarized in Tables
2 and 3, which clearly show that using the most frequent words in the corpus as features for
stylistic text classification gives the highest overall discrimination for both author and nationality
attribution tasks.
| Feature Set        | Author | Nationality |
|--------------------|--------|-------------|
| Freq. Words        | 99.00% | 93.50%      |
| Freq. Pairs        | 91.60% | 91.30%      |
| Freq. Coll. (k=5)  | 88.94% | 90.20%      |
| Freq. Coll. (k=10) | 84.00% | 87.20%      |

Table 2. 20-fold cross-validation results for the 200 most frequent words, pairs, and collocations (window size k = 5 or 10).
| Feature Set        | Author | Nationality |
|--------------------|--------|-------------|
| Freq. Words        | 93.20% | 93.50%      |
| Freq. Pairs        | 90.00% | 88.60%      |
| Freq. Coll. (k=5)  | 91.50% | 92.10%      |
| Freq. Coll. (k=10) | 94.00% | 92.10%      |

Table 3. 20-fold cross-validation results for the 500 most frequent words, pairs, and collocations (window size k = 5 or 10).
Our study reinforces many others over the years in showing the surprising resilience of
frequently occurring words as indicators of the stylistic character of a text. Our results show
frequent words enabling more accurate text attribution than features such as word pairs or
collocations, contradicting both recent results and the intuition that pairs or
collocations should be more informative. We attribute the success of frequent words in this
study mainly to the use of more data, in the form of entire novels broken
down by chapter. The more fine-grained breakdown of text samples for each author enables
more accurate determination of a good decision surface for the problem, thus better utilizing the
power of all features in the feature set. Furthermore, using more training texts than features
seriously reduces the likelihood of overfitting the model to the training data, improving the
reliability of results.
It is indeed possible that collocations may be better than function words for other
stylistic classification tasks; however, such a claim remains to be proven. A more general
interpretation of our results is that since a set of frequent collocations of a given size will contain
fewer different words than a set of frequent words of the same size, it may possess less
discriminatory power. At the same time, though, such a feature set will be less subject to
overfitting, and so may appear better when very small sets of texts are studied (as in previous
studies). Our results thus lead us to believe that most of the discriminating power of collocations
is due to the frequent words they contain (and not to the collocations themselves), which is
why frequent words outperformed collocations once sufficient data was available.
Function words still prove surprisingly useful as features for stylistic text attribution, even after
many decades of research on features and algorithms for stylometric analysis. We believe that
significant progress is likely to come from fundamental advances in computational linguistics
that allow automated extraction of more linguistically motivated features, such as recent work
on extracting rhetorical relations in a text (Marcu).
More generally, our results argue for the importance of using larger data sets for
evaluating the relative utility of different attribution feature sets or techniques. As in our case of
comparing frequent words with frequent collocations, changing the scale of the data set may
affect the relative power of different techniques, thus leading to different conclusions. We
suggest that the authorship attribution community should now work towards developing a large
suite of corpora and testbed tasks, to allow more rigorous and standardized comparisons of
alternative approaches.