Measuring the Usefulness of Function Words for Authorship Attribution

Shlomo Argamon (argamon@iit.edu), Illinois Institute of Technology
Shlomo Levitan (levishl@iit.edu), Illinois Institute of Technology

Introduction

Some forty years ago, Mosteller and Wallace suggested in their influential work on the Federalist Papers that a small number of the most frequent words in a language ('function words') could usefully serve as indicators of authorial style. The decades since have seen this work taken up in many ways, including both the use of new analysis techniques (discriminant analysis, PCA, neural networks, and more) and the search for more sophisticated features by which to capture the stylistic properties of texts. Interestingly, while the use of more sophisticated models and algorithms has often led to more reliable and generally applicable results, it has proven quite difficult to improve on the general usefulness of function words for stylistic attribution. Indeed, John F. Burrows, in his seminal work on Jane Austen, demonstrated that function words can be used quite effectively to attribute text passages to different authors, novels, or individual characters.

The intuition behind the utility of function words for stylistic attribution is as follows. Due to their high frequency in the language and their highly grammaticalized roles, function words are very unlikely to be subject to conscious control by the author. At the same time, the frequencies of different function words vary greatly across different authors and genres of text; hence the expectation that modeling the interdependence of function word frequencies with style will yield effective attribution. However, the highly reductive nature of such features is unsatisfying, as they rarely give good insight into the underlying stylistic issues; hence the various efforts at developing more complex textual features that still respect the constraints of computational feasibility.

One especially promising line of work in this regard has been the examination of frequent word sequences and collocations for stylistic attribution, particularly Hoover's recent (2004) systematic work on clustering analysis of several text collections using frequent word collocations. A "word collocation" is defined as a particular pair of words occurring within a given threshold distance of each other (such as "is" and "certain" appearing within 5 words of each other in this sentence). Given such a threshold, the most frequent collocations are determined over the entire corpus, and their frequencies in each text constitute its features for analysis. Hoover's analyses show the superiority, for his data set, of frequent word collocations (for certain window sizes) over frequent words or pairs of adjacent words. We contend, however, that with such a small data set (twenty samples of 10,000 words each, in one case), the discriminating power of a model based on function words is much reduced, and so the comparison may not be fair. As has been shown for other computational linguistic tasks (see, e.g., Banko & Brill), even simple language modeling techniques can greatly improve in effectiveness when trained on larger quantities of data. We have therefore explored the relative effectiveness of frequent words compared to frequent pairs and collocations, for attribution of both author identity and national origin, increasing the number of text passages considered over earlier work.
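The paper gives no implementation, but the feature construction just described is simple to sketch. Below is a minimal Python illustration under our own assumptions about details the text leaves open (ordered pairs within a window of k following words; frequencies normalized by document length); the function names are ours. Adjacent word pairs are the special case k = 1, and frequent-word features simply count single tokens in the same way.

```python
from collections import Counter

def collocations(tokens, k=5):
    """Yield ordered pairs (w1, w2) where w2 occurs within k words after w1."""
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + 1 + k]:
            yield (w1, w2)

def collocation_vectors(docs, n=200, k=5):
    """Find the n most frequent collocations over the whole corpus, then
    represent each document by its relative frequencies of those features."""
    corpus_counts = Counter()
    for doc in docs:
        corpus_counts.update(collocations(doc, k))
    vocab = [pair for pair, _ in corpus_counts.most_common(n)]
    vocab_set = set(vocab)
    vectors = []
    for doc in docs:
        counts = Counter(c for c in collocations(doc, k) if c in vocab_set)
        # Normalizing by document length is our assumption; the paper says
        # only that feature values are frequencies.
        vectors.append([counts[pair] / max(len(doc), 1) for pair in vocab])
    return vocab, vectors
```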
We performed classification experiments on the twenty novels considered by Hoover, treating each chapter of each book as a separate text (rather than using just the first 10,000 words of each novel as a single text). Table 1 gives the full list, with the number of chapters and the average number of words per chapter.

Table 1. Corpus composition.

Author     Book                            # Chapters    Avg. Words
Cather     My Antonia                      45            1826
Cather     Song of the Lark                60            2581
Cather     The Professor's House           28            2172
Conrad     Lord Jim                        45            2913
Conrad     The Nigger of the Narcissus     5             10592
Hardy      Jude the Obscure                53            2765
Hardy      The Mayor of Casterbridge       45            2615
Hardy      Tess of the d'Urbervilles       58            2605
James      The Europeans                   12            5003
James      The Ambassadors                 36            4584
Kipling    The Jungle Book                 13            3980
Kipling    Kim                             15            7167
Lewis      Babbitt                         34            3693
Lewis      Main Street                     34            4994
Lewis      Our Mr. Wrenn                   19            4126
London     The Call of the Wild            7             4589
London     The Sea Wolf                    39            2739
London     White Fang                      25            2917
Wells      The Invisible Man               28            1756
Wells      The War of the Worlds           27            2241

Methodology

Given each particular feature set (frequent words, pairs, or collocations), we represented each document as a numerical vector whose elements are the frequencies of the individual features in that text. We then applied the SMO learning algorithm (Platt) with default parameters, which produces a model that linearly weights the text features. SMO is a support vector machine (SVM) algorithm; SVMs have been applied successfully to a wide variety of text categorization problems (Joachims). Generalization accuracy was measured by 20-fold cross-validation: the 633 chapters were divided into 20 subsets of nearly equal size (31 or 32 chapters per subset), training was performed 20 times, each time leaving out one of the subsets, and the omitted subset was then used for testing. The overall classification error rate was estimated as the average error rate over the 20 runs. This method gives a reasonable estimate of the expected error rate of the learning method for each given feature set and target task (Goutte).
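The paper names SMO (Platt) with default parameters but not a specific implementation. As a rough sketch, assuming scikit-learn (whose SVC solver is an SMO-style algorithm) and random stand-in data in place of the real chapter vectors, the pipeline would look like this:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
# Stand-in data: 633 "chapters" x 200 features. In the actual study each
# row would hold a chapter's frequencies of the 200 most frequent words
# (or pairs, or collocations), and y the author (or nationality) label.
X = rng.random((633, 200))
y = rng.integers(0, 8, size=633)  # 8 authors

# SVC handles multiclass problems one-vs-one, i.e., as linear
# discrimination models between pairs of authors, as in the paper.
# The linear kernel is our assumption for "default parameters".
clf = SVC(kernel="linear")

# 20-fold cross-validation: train on 19 subsets, test on the held-out
# one, and report accuracy averaged over the 20 runs.
cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"mean accuracy: {scores.mean():.2%}")
```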
Results

Measured generalization accuracies for the different feature sets are summarized in Tables 2 and 3, which clearly show that using the most frequent words in the corpus as features for stylistic text classification gives the highest overall discrimination for both the author and the nationality attribution tasks.

Table 2. 20-fold cross-validation results for the 200 most frequent words, pairs, and collocations (window size k = 5 or 10).

Feature Set           Author     Nationality
Freq. Words           99.00%     93.50%
Freq. Pairs           91.60%     91.30%
Freq. Coll. (k=5)     88.94%     90.20%
Freq. Coll. (k=10)    84.00%     87.20%

Table 3. 20-fold cross-validation results for the 500 most frequent words, pairs, and collocations (window size k = 5 or 10).

Feature Set           Author     Nationality
Freq. Words           93.20%     93.50%
Freq. Pairs           90.00%     88.60%
Freq. Coll. (k=5)     91.50%     92.10%
Freq. Coll. (k=10)    94.00%     92.10%

Discussion

Our study reinforces many others over the years in showing the surprising resilience of frequently occurring words as indicators of the stylistic character of a text. Our results show that frequent words enable more accurate attribution than features such as word pairs or collocations, contradicting, surprisingly, both recent results and the intuition that pairs or collocations should be more informative. We attribute the success of frequent words in this study mainly to the use of more data, in the form of entire novels broken down by chapters. The finer-grained breakdown of text samples for each author enables more accurate determination of a good decision surface for the problem, and thus better use of the power of all the features in the feature set. Furthermore, using more training texts than features greatly reduces the likelihood of overfitting the model to the training data, improving the reliability of the results. It is possible that collocations are better than function words for other stylistic classification tasks, but such a claim remains to be demonstrated. A more general interpretation of our results is that since a set of frequent collocations of a given size will contain fewer distinct words than a set of frequent words of the same size, it may possess less discriminatory power. At the same time, such a feature set is less subject to overfitting, and so may appear better when very small sets of texts are studied (as in previous studies). Our results thus lead us to believe that most of the discriminating power of collocations is due to the frequent words they contain, rather than to the collocations themselves, which is why frequent words outperformed collocations once sufficient data was available.
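A toy illustration of this set-size argument (the feature lists below are invented for the example, not drawn from the corpus):

```python
# n word features span n distinct word types, while n collocation
# features, built by pairing a small core of very frequent words,
# can span far fewer.
top_words = ["the", "of", "and", "a", "to", "in", "is", "was"]  # 8 types

top_colls = [("the", "of"), ("of", "the"), ("the", "and"), ("and", "the"),
             ("a", "the"), ("the", "a"), ("to", "the"), ("of", "a")]
coll_types = {w for pair in top_colls for w in pair}

print(len(set(top_words)), len(coll_types))  # 8 word types vs. 5
```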
Conclusions

Function words still prove surprisingly useful as features for stylistic text attribution, even after many decades of research on features and algorithms for stylometric analysis. We believe that significant progress is likely to come from fundamental advances in computational linguistics that allow the automated extraction of more linguistically motivated features, such as recent work on extracting the rhetorical relations in a text (Marcu). More generally, our results argue for the importance of using larger data sets when evaluating the relative utility of different attribution feature sets or techniques. As in our comparison of frequent words with frequent collocations, changing the scale of the data set may affect the relative power of different techniques and thus lead to different conclusions. We suggest that the authorship attribution community should now work towards developing a large suite of corpora and testbed tasks, to allow more rigorous and standardized comparisons of alternative approaches.

Bibliography

Baayen, H., van Halteren, H., and Tweedie, F. "Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution." Literary and Linguistic Computing 11.3 (1996): 121-132.

Banko, M., and Brill, E. "Scaling to Very Very Large Corpora for Natural Language Disambiguation." Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (2001): 26-33.

Biber, D., Conrad, S., and Reppen, R. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press, 1998.

Burrows, J. Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method. Oxford: Clarendon Press, 1987.

Goutte, C. "Note on Free Lunches and Cross-Validation." Neural Computation 9.6 (1997): 1246-1249.

Hoover, D.L. "Frequent Collocations and Authorial Style." Literary and Linguistic Computing 18.3 (2004): 261-286.

Joachims, T. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features." Machine Learning: ECML-98, Tenth European Conference on Machine Learning (1998): 137-142.

Marcu, D. "The Rhetorical Parsing of Unrestricted Texts: A Surface-Based Approach." Computational Linguistics 26.3 (2000): 395-448.

Matthews, R., and Merriam, T. "Neural Computation in Stylometry: An Application to the Works of Shakespeare and Fletcher." Literary and Linguistic Computing 8.4 (1993): 203-209.

Mosteller, F., and Wallace, D.L. Inference and Disputed Authorship: The Federalist. Reading, Mass.: Addison-Wesley, 1964.

Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research Technical Report MSR-TR-98-14, 1998.

Stamatatos, E., Fakotakis, N., and Kokkinakis, G. "Computer-Based Authorship Attribution Without Lexical Measures." Computers and the Humanities 35 (2001): 193-214.