A Prototype for Authorship Attribution Software

Patrick Juola

juola@mathcs.duq.edu

Duquesne University


The task of computationally inferring the author of a document based on
its internal statistics (sometimes called stylometry, authorship attribution,
or, for the completists, non-traditional authorship attribution) is an active
and vibrant research area, but at present one largely without practical use.
For example, unearthing the author of the anonymously
written Primary Colors became a substantial issue in 1996. In 2004, Anonymous published Imperial Hubris, a followup to his (her?) earlier work Through Our Enemies' Eyes. Who wrote these books?  Does the author
actually have the expertise claimed on the dust cover (a senior U.S.
intelligence official with nearly two decades of experience)? And why
haven't our computers already given us the answer?

Part of this lack of use can be attributed to simple unfamiliarity on the
part of the relevant communities, combined with a perceived history of
inaccuracy
       [Note 1: See, for example, the discussion of the cusum technique
(Farringdon) in (Holmes 1998).]. Since 1996, however, the popularity
of corpus linguistics as a field of study and the vast increase in the amount
of data available on the Web (Nerbonne) have made it practical to use
much larger sets of data for inference. During the same period, new and
increasingly sophisticated techniques have improved the quality (and accuracy)
of the judgements the computers make.

As a recent example, in June 2004, ALLC/ACH hosted an Ad-hoc Authorship
Attribution Competition (Juola 2004a). By providing a standardized test
corpus for authorship attribution, the competition made it possible not only
to demonstrate the basic ability of statistical methods to determine
authorship, but also to distinguish the merely successful methods from the
very successful ones, and to analyze the particular areas in which individual
methods succeed.

The contest (and its results) proved surprising at many levels; some
researchers initially refused to participate, given the admittedly difficult
tasks included among the corpora. For example, Problem F consisted of a set of
letters extracted from the Paston letters. Aside from the very real issue
of applying methods designed and tested, for the most part, on modern English
to documents in Middle English, the size of these documents (very few letters,
today or in centuries past, exceed 1000 words) makes statistical inference
difficult. Similarly, Problem A was a realistic exercise in the analysis
of student essays (gathered in a freshman writing class during the fall of
2003) — as is typical, no essay exceeded 1200 words. Despite this extreme
paucity of data, results could be stunningly accurate. The highest-scoring
participant was the research group of Vlado Keselj, with an average success
rate of approximately 69%. (Juola's solutions, in the interests of
fairness, averaged 65% correct.)  In particular, Keselj's methods
achieved 85% accuracy on Problem A and 90% accuracy on Problem F, both
acknowledged to be difficult and considered by many to be unsolvable.

However, the increased accuracy has come at the price of decreased clarity;
the statistics used
       [Note 2: E.g., linear discriminant analysis of common function
words (Burrows; Baayen et al.; Juola & Baayen), orthographic
cross-entropy (Juola 1997), common byte N-grams (Keselj 2004).]
can be hard
to understand, and perhaps more importantly, difficult to implement or to
use by a non-technical scholar. At the same time, the sheer number of
techniques proposed (and therefore, the number of possibilities available
to confuse) has exploded. This limits the pool of available users, making
it less likely that a casual scholar — let alone a journalist, lawyer, or
interested layman — would be able to apply these new methods to a problem
of real interest.

I present here a prototype and framework for a user-friendly software system
(Juola & Sofko) allowing the casual user to apply authorship
attribution technologies to her own purposes. It combines a generalized
theoretical model (Juola, 2004b) built on an inference task over event
sequences with an extensible, object-oriented inference engine that makes
the system easily updatable to incorporate new technologies or to mix and
match combinations of existing ones. The model treats linguistic (or
paralinguistic) data as a sequence of separable user-defined events,
for instance, as a sequence of letters, phonemes, morphemes, or words.
These sequences are then subjected to a three-phase process (a minimal
code sketch of this pipeline follows the list below):

•Canonicization — No two physical realizations of events will ever
be exactly identical. We choose to treat similar realizations as identical
to restrict the event space to a finite set.

•Determination of the event set — The input stream is partitioned into
individual non-overlapping events. At the same time, uninformative
events can be eliminated from the event stream.

•Statistical inference — The remaining events can be subjected to a
variety of inferential statistics, ranging from simple analysis of event
distributions through complex pattern-based analysis. The results of this
inference determine the conclusions (and the confidence) of the final report.
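
To make the division of labor concrete, the following minimal sketch (in
Java, the language of the prototype described below) expresses the three
phases as separate interfaces wired together by a small pipeline class. The
class and method names are illustrative assumptions for exposition, not the
prototype's actual API, and the toy inference in main() merely measures
vocabulary overlap.

    import java.util.List;

    // Phase 1: reduce variant realizations to a finite, canonical event space.
    interface Canonicizer {
        String canonicize(String raw);
    }

    // Phase 2: partition the canonicized stream into non-overlapping events,
    // discarding uninformative ones.
    interface EventSetDeterminer {
        List<String> determineEvents(String text);
    }

    // Phase 3: inferential statistics over the event sequences, expressed here
    // as a distance (smaller means more similar).
    interface InferenceMethod {
        double distance(List<String> known, List<String> unknown);
    }

    public class AttributionPipeline {
        private final Canonicizer canonicizer;
        private final EventSetDeterminer eventer;
        private final InferenceMethod inference;

        public AttributionPipeline(Canonicizer c, EventSetDeterminer e, InferenceMethod i) {
            canonicizer = c;
            eventer = e;
            inference = i;
        }

        public double compare(String knownText, String unknownText) {
            List<String> known = eventer.determineEvents(canonicizer.canonicize(knownText));
            List<String> unknown = eventer.determineEvents(canonicizer.canonicize(unknownText));
            return inference.distance(known, unknown);
        }

        public static void main(String[] args) {
            Canonicizer lowerCase = raw -> raw.toLowerCase();
            EventSetDeterminer words = text -> List.of(text.trim().split("\\s+"));
            // Toy statistic: fraction of unknown-document events unseen in the known sample.
            InferenceMethod unseen = (known, unknown) -> {
                long misses = unknown.stream().filter(e -> !known.contains(e)).count();
                return unknown.isEmpty() ? 0.0 : (double) misses / unknown.size();
            };
            AttributionPipeline pipeline = new AttributionPipeline(lowerCase, words, unseen);
            System.out.println(pipeline.compare("The quick brown fox", "the quick red fox"));
        }
    }

Replacing any one component leaves the other two untouched, which is
precisely the mix-and-match property described above.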


As an illustration, the implementation of these phases for the Burrows
method would involve, first, canonicization by normalizing the documents
of interest. For example, words with variant capitalization (the, The, THE)
would be treated as a single type. More sophisticated canonicization
procedures could regularize spelling, eliminate extraneous material such
as chapter headings, or even "de-edit" (Rudman) the invisible hand
of the editor. During the second phase, the appropriate set of function
words would be determined and presented as a sequence of events, eliminating
words not in the set of interest. Finally, the appropriate function words are
tabulated (without regard to ordering) and the appropriate inferential
statistics (principal component analysis) performed. However, replacement
of the third stage (and only the third stage) by a linear discriminant analysis
would produce a different technique (Baayen et al.).
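
For concreteness, a minimal sketch of the second phase and the tabulation
step for such a Burrows-style analysis might look as follows. The tiny
function-word list is an illustrative assumption, not Burrows's actual word
set, and the principal component analysis itself is left to whatever
statistical routine the third phase wraps.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class FunctionWordTabulator {
        // Illustrative sample only; a real Burrows-style analysis uses a much
        // larger, empirically chosen set of function words.
        private static final Set<String> FUNCTION_WORDS =
                Set.of("the", "and", "of", "to", "in", "a", "that", "it");

        /** Counts function-word events in an already-canonicized (lower-cased) text. */
        public static Map<String, Integer> tabulate(String canonicizedText) {
            Map<String, Integer> counts = new HashMap<>();
            for (String token : canonicizedText.split("\\s+")) {
                if (FUNCTION_WORDS.contains(token)) {      // discard uninformative events
                    counts.merge(token, 1, Integer::sum);  // ordering is deliberately ignored
                }
            }
            return counts;
        }

        public static void main(String[] args) {
            System.out.println(tabulate("the cat sat on the mat and it purred"));
        }
    }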

This framework fits well into the now-standard modular software design
paradigm. In particular, the software to be demonstrated uses the Java
programming language and object-oriented design to separate the generic
functions of the three phases into individual classes, with specific
techniques implemented as individual subclasses.

The user can select from a variety of options at each phase, and the system
as a whole is easily extensible to allow for new developments. For example,
the result of event processing is simply a Vector (Java class) of events.
Likewise, similarity judgement is a function of the Processor class,
which can be instantiated in a variety of different ways. At present,
the Processor class is defined with a number of different methods
       [Note 3: For example, crossEntDistance() and LZWDistance().].
A planned improvement is to define a single calculateDistance() function
as part of the Processor class. The Processor class, in turn, can be
subclassed into various types, each of which calculates distance in a
slightly different way.
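
That planned refactoring might look something like the sketch below: an
abstract Processor exposing a single calculateDistance(), with each technique
supplied by a subclass. The concrete subclass shown here computes a simple
Jaccard-style distance over event types purely as a stand-in; it is not the
prototype's actual crossEntDistance() or LZWDistance() code.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // One calculateDistance() per technique, selected by subclassing.
    abstract class Processor {
        /** Smaller values indicate more similar event sequences. */
        public abstract double calculateDistance(List<String> eventsA, List<String> eventsB);
    }

    // Stand-in technique: 1 - (shared event types / total event types).
    class SharedVocabularyProcessor extends Processor {
        @Override
        public double calculateDistance(List<String> eventsA, List<String> eventsB) {
            Set<String> shared = new HashSet<>(eventsA);
            Set<String> union = new HashSet<>(eventsA);
            union.addAll(eventsB);
            shared.retainAll(eventsB);
            return union.isEmpty() ? 0.0 : 1.0 - (double) shared.size() / union.size();
        }

        public static void main(String[] args) {
            Processor p = new SharedVocabularyProcessor();
            System.out.println(p.calculateDistance(
                    List.of("the", "of", "and"), List.of("the", "and", "it")));
        }
    }

Calling code then needs to know only the Processor supertype; adding a new
distance measure means adding one subclass, with no change to the rest of
the system.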

Similarly, preprocessing can be handled by separate instantiations and
subclasses. Even data input and output can be modularized and separated.
As written, the program only reads files from a local disk, but a relatively
easy modification would allow files to be read from the network as well
(for instance, Web pages from a site such as Project Gutenberg
or literature.org). Users can therefore select functionality as needed
on a module-by-module basis, both in terms of features and of inference
method; the current system incorporates four different approaches (Burrows;
Juola 1997; Kukushkina et al.; Juola 2003).
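
One way to express that separation, sketched here under assumed names rather
than the prototype's own classes, is a single DocumentSource interface with
local-file and network-backed implementations; the rest of the system asks
only for document text, not for where it lives.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    interface DocumentSource {
        String readDocument() throws IOException;
    }

    // What the current prototype does: read a file from the local disk.
    class LocalFileSource implements DocumentSource {
        private final Path path;
        LocalFileSource(Path path) { this.path = path; }
        public String readDocument() throws IOException {
            return Files.readString(path);
        }
    }

    // The easy extension described above: fetch a page over the network,
    // e.g. a public-domain text from Project Gutenberg or literature.org.
    class UrlSource implements DocumentSource {
        private final URL url;
        UrlSource(URL url) { this.url = url; }
        public String readDocument() throws IOException {
            try (InputStream in = url.openStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }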

From a broader perspective, this program provides a uniform framework under
which competing theories of authorship attribution can both be compared
and combined (to their hopefully mutual benefit). It also forms the basis of
a simple user-friendly tool to allow users without special training to apply
technologies for authorship attribution and to take advantage of new
developments and methods as they become available. From a standpoint of
practical epistemology, the existence of this tool should provide a starting
point for improving the quality of authorship attribution as a forensic
examination — by allowing the widespread use of the technology, and at the
same time providing an easy method for testing and evaluating different
approaches to determine the necessary empirical validation and limitations.

On the other hand, this tool is also clearly a research-quality prototype,
and additional work will be needed to implement a wide variety of methods,
to determine and implement additional features, and to establish a
sufficiently user-friendly interface. Even questions such as the preferred
method of output — dendrograms? MDS subspace projections? Fixed attribution
assignments as in the present system? — are in theory open to discussion
and revision. It is hoped that the input of researchers and users,
such as those at the present meeting, will help guide this development.



Bibliography


Baayen, R. H., H. Van Halteren, A. Neijt, and F. Tweedie. "An experiment in
authorship attribution." Proceedings of JADT 2002. St. Malo, 2002. 29-37.

Baayen, R. H., H. Van Halteren, and F. Tweedie. "Outside the cave of shadows:
Using syntactic annotation to enhance authorship attribution." Literary and
Linguistic Computing 11 (1996): 121-131.

Burrows, J. "Questions of authorship: Attribution and beyond." Computers and
the Humanities 37.1 (2003): 5-32.

Burrows, J. "'An Ocean where each Kind...': Statistical analysis and some
major determinants of literary style." Computers and the Humanities 23.4-5
(1989): 309-21.

Farringdon, J.M. Analyzing for Authorship: A Guide to the Cusum Technique.
Cardiff: University of Wales Press, 1996.

Holmes, D. I. "Authorship attribution." Computers and the Humanities 28.2
(1994): 87-106.

Holmes, D. I. "The evolution of stylometry in humanities computing." Literary
and Linguistic Computing 13.3 (1998): 111-7.

Juola, P. "What can we do with small corpora? Document categorization via
cross-entropy." Proceedings of an Interdisciplinary Workshop on Similarity
and Categorization. Department of Artificial Intelligence, University of
Edinburgh, Edinburgh, UK, 1997. N. pag.

Juola, P. "The time course of language change." Computers and the Humanities
37.1 (2003): 77-96.

Juola, P. "Ad-hoc authorship attribution competition." Proceedings of the 2004
Joint International Conference of the Association for Literary and Linguistic
Computing and the Association for Computers and the Humanities (ALLC/ACH
2004). Goteborg, Sweden, 2004a. 175-176.

Juola, P. "On composership attribution." Proceedings of the 2004 Joint
International Conference of the Association for Literary and Linguistic
Computing and the Association for Computers and the Humanities (ALLC/ACH
2004). Goteborg, Sweden, 2004b. 79-80.

Juola, P., and H. Baayen. "A controlled-corpus experiment in authorship
attribution by cross-entropy." Proceedings of ACH/ALLC-2003. Athens, GA, 2003.
N. pag.

Juola, P., and J. Sofko. "Proving and Improving Authorship Attribution
Technologies." Proceedings of CaSTA-2004. Hamilton, ON, 2004. N. pag.

Keselj, V., and N. Cercone. "CNG Method with Weighted Voting." Ad-hoc
Authorship Attribution Contest Technical Report.

Kukushkina, O.V., A.A. Polikarpov, and D.V. Khmelev. "Using literal and
grammatical statistics for authorship attribution." Problemy Peredachi
Informatsii 37.2 (2000): 172-184. Translated in Problems of Information
Transmission.

Kucera, H., and W.N. Francis. Computational Analysis of Present-day American
English. Providence: Brown University Press, 1967.

Nerbonne, J. "The data deluge." Literary and Linguistic Computing,
forthcoming. [In Proceedings of the 2004 Joint International Conference of
the Association for Literary and Linguistic Computing and the Association for
Computers and the Humanities (ALLC/ACH 2004).]