Title: Towards an Automatic Index Generation Tool

Author: Patrick Juola
Statement of responsibility:
Marked up by Martin Holmes
Patricia Baer
Marked up to be included in the ACH/ALLC 2005 Conference Abstracts book.
Source(s):
None
Text classification:
Keywords:
paper
Keywords:
  • index generation
  • text-processing software
  • text analytics
  • MDH: Created from John Bradley's XML April 2005
  • MDH: Marked up 11 April 2005
  • MDH: Proofed and passed without changes by RS 25 May 2005
  • MDH: Entered proofing corrections from PGL 26 May 2005

Towards an Automatic Index Generation Tool

Patrick Juola    juola@mathcs.duq.edu

Duquesne University

Almost every non-fiction author has been faced from time to time with the generation of an index. Most novice authors (myself included) are taken aback by the magnitude of the task and the limited amount of computational and software support available.
The current state of the art is significantly improved from the days of 3-by-5 'index cards', (a telling term?), but only in mechanical, not intellectual terms. Modern publishing practice typically involves the author delivering a machine-readable 'manuscript', written in a document-processing system such as LaTeX. Index entries are defined as specific term/location pairs by the author. For example, an index entry written in LaTeX, might look as follows
The \index{Pittsburgh!University of} University of Pittsburgh was established in \index{Pittsburgh!city of} Pittsburgh, Pennsylvania, in the year....
This will create an index entry on the 'current page', under the heading "Pittsburgh, University of" (as opposed to "Pittsburgh, city of," which would be the second entry, a related but separate subentry). Although guidelines for a good index (Northrup; University of Chicago Press Staff) are commonly available, the process of producing a good index is still largely unsupported, even by major and relatively sophisticated publishing companies such as Prentice-Hall.
What differentiates an index from a mere concordance?
There are at least six cognitive tasks (Maislin; Saranchuk) related to the production of a good index, as follows. Current standard support covers only the last.
I will present a framework for the development of a 'machine-aided index generation system'. This bears the same relationship to an automatic indexer that machine-aided translation (MAT) does to machine translation (MT), in that it provides suggestions and reduces the overall workload for the human, but post-editing will still be necessary. Specifically, recent results in corpus linguistics (Charniak; Manning & Schuetze), including the development of taggers for part of speech (Cutting et al.; Schmid) the availability of ontologies and semantics networks, plus the light semantic analysis capabilities of latent semantic analysis (Landauer et al.), can be combined in a multi-phased iterative framework and implemented as user-level software. This paper presents some aspects of "good" indices (Northrup; University of Chicago Press Staff) and illustrates how they can be achieved computationally.
In general, following the University of Chicago's dictum that "it is always easier to drop entries than to add them, err on the side of inclusiveness," (rule 18.120) we start by assuming that every term is a potential index entry and look for criteria by which to eliminate enough terms to produce a reasonably-sized index. (5-15 references/page, between 2% and 5% of the length of the final work, according to rule 18.120.) For example, rule 18.8 states that "the main heading of an index entry is normally a noun or noun phrase---the name of a person, a place, and object, or an abstraction." A first pass, then, can use the results of a part-of-speech tagger and eliminate all terms that do not appear as a noun in the document. Within this set of nouns, I suggest two possible heuristics for further pruning; first, common nouns that are too common or too rare are unlikely to be useful index terms, and second, words that are too uniformly distributed are unlikely to be useful index terms. On the other hand, a case can be made that all proper nouns should be included. Other suggested heuristic will be discussed.
Within a single index term, "an entry that requires more than five or six locators is usually broken up into subentries" (rule 18.9). This can be treated as an example of word-sense disambiguation, for example, between Pittsburgh (University of) and Pittsburgh (city of). Again, I conjecture (and present supporting evidence) that existing technology can provide a useful and helpful basis for later human editing. Specifically, existing semantic representation techniques can model the context, and therefore the meaning, of each index token. For truly polysemous terms, cluster analysis of the set of token representations should yield a set of clusters equivalent to the degree of polysemy; by setting the separation threshhold to an appropriate level, the analysis can be forced to produce clusters of maximum size at most 5-6. At the same time, passing and uninformative references can be expected to produce isolated 'clusters' containing a single outlier — a strong candidate for omission. Once a list of index terms is collected, tokens not on that list can be compared in their semantic representation for similarity with existing index terms; any word with near-identical meaning is a potential synonym and a candidate for a cross-reference.
Unfortunately, the evidence to be presented is largely heuristic and exploratory in nature. We are currently developing a prototype system, using LSA (Landauer et al.) and elementary corpus statistics such as TF-IDF to identify index terms. We also have a well-developed and intuitive GUI wizard for ease of use by a non-technical user. At present, the planned heuristics may or may not be sufficiently reliable to use without a human post-editor. However, if they can be shown to substantially reduce the work load on the human author, the resulting tool may still be of interest. I present the results of some prototype-scale experiments, plus some ideas about usability testing and the directions of future development.

Bibliography