Marked up to be included in the ACH/ALLC 2005 Conference Abstracts book.
Encoding standards such as TEI give scholars a great deal of flexibility in annotating texts to meet the particular needs of a study or project. Researchers necessarily make choices about which features of a text to highlight, what kinds of additional information to add, and which facts to leave to be inferred from sources of evidence other than markup.
Among the factors to be considered in designing or adopting text encoding procedures are the prospects for:
This paper discusses considerations motivating the ongoing design of a software framework for analysis of rhyming schemes in 19th century Russian poetry. At its simplest, the application receives as input poems marked up like the example in Figure 1 and produces output like the following:
Figure 1
Programs producing output such as the example above could be written in any of a variety of programming languages, and they might employ different strategies for integrating linguistic and orthographic processing rules with evidence encoded more directly in markup. For example, we discuss an earlier approach to the current project in Adams & Birnbaum. Our current implementation, however, is written in Prolog as an application of the BECHAMEL system for markup semantics analysis (Dubin et al.). We made this choice because we wanted to plan from the beginning for extensions to other encoding schemes and generalization to other kinds of analysis.
Prolog is a declarative language of rules and assertions, and BECHAMEL is a collection of predicates supporting the declaration of object classes, properties, relations among objects, and the execution of inference rules based on information extracted from XML documents. An example of a Prolog clause is shown below: it is part of our application's logic for determining that two sequences of phonemes at the ends of a pair of orthographic lines all agree with each other (i.e., that the sounds at the end of the lines rhyme with each other).
In the clause, P1, P2, P3, and P4 are variables representing phoneme
objects, and C1, C2, C3, and C4 are variables representing character
objects. The predicate all_agree(P1,P2) will be satisfied if each of the predicates following the implication sign can be satisfied. The
logic of the clause can be read as follows: Phonemes P1 and P2
A major advantage of Prolog's declarative approach is the flexibility to define logic for separate cases in separate clauses for the same rule. For example, in the clause shown above, it is presupposed that in both lines there will be a simple one-to-one mapping from characters in sequence to phonemes. But accommodating more complex cases need not complicate the expression of the simple case: if the simpler clause cannot be satisfied then Prolog's inference engine will search for a different clause of the same rule that can be satisfied.
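The fallback between clauses of one rule can be illustrated with a minimal sketch in plain Prolog. It is not taken from the application: the predicates char_phoneme/2, pair_phoneme/3, and transcribe/2, the transliterated atoms, and the toy facts are all assumptions made for illustration. The simple clause handles a one-to-one mapping from a character to a phoneme; a separate clause handles a consonant letter followed by a soft sign.

```prolog
% Toy facts (assumed for illustration only).
% char_phoneme(Char, Phoneme): one character cues one phoneme.
char_phoneme(a, a).
char_phoneme(t, t).

% pair_phoneme(Char1, Char2, Phoneme): two characters cue one phoneme,
% here a consonant letter followed by the soft sign.
pair_phoneme(t, soft_sign, t_pal).

% transcribe(Chars, Phonemes): map a list of characters to phonemes.
transcribe([], []).
% Simple case: one-to-one mapping.
transcribe([C|Cs], [P|Ps]) :-
    char_phoneme(C, P),
    transcribe(Cs, Ps).
% More complex case, tried only when the simple clause cannot be satisfied.
transcribe([C1, C2|Cs], [P|Ps]) :-
    pair_phoneme(C1, C2, P),
    transcribe(Cs, Ps).

% ?- transcribe([a, t, soft_sign], Ps).
% Ps = [a, t_pal].
```

In the query, the simple clause first maps t to the phoneme t, fails on the soft sign that follows, and Prolog backtracks to the pair clause. The simple clause is written without any knowledge of the complex one.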
Reasoning about poems like the one in the example above requires that we model their contents at both a phonemic and orthographic level. The rules for Russian pronunciation include not only the way that particular vowels and consonants sound, but also how those phonemes are cued by the way the text is written (as with, for example, the palatalizing effect of the soft sign and the soft vowel letters). The BECHAMEL system's definition predicates for object classes, properties, and relations give us the ability to model each of these levels with declarations such as the following:
In our approach, the phonemic properties of vowels and consonants are distinct from the orthographic properties of characters, written words, and lines. But it can occasionally be convenient for users to ignore these distinctions, particularly in comparing different proposed models for the same data. We therefore aim to let the models govern as much processing of our raw data as possible. For example, both phonemic and orthographic properties of letters are recorded in a data file using the same predicate as shown below:
In this example, id, name, case, and charclass are all properties of characters, while voicing, place, and manner are
phonemic properties. As individual characters and phonemes are
instantiated, they acquire only those properties that are appropriate
for their class. This is accomplished through general-purpose rules
that match on the basis of our property declarations. So if we were to
decide (for example) that name should be a property of the phoneme
rather than the character, we need only change the declaration, and
the property value recorded in the data file would be assigned to
phoneme objects rather than character objects.
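The idea of letting declarations decide which object receives a recorded value can be sketched in plain Prolog. This is a sketch under stated assumptions, not BECHAMEL's actual interface: property_of/2, record/1, and has_property/4 are hypothetical names, and the single data record is invented for illustration.

```prolog
:- use_module(library(lists)).

% Hypothetical declarations: property_of(Class, Property).
property_of(character, id).
property_of(character, name).
property_of(character, case).
property_of(character, charclass).
property_of(phoneme, voicing).
property_of(phoneme, place).
property_of(phoneme, manner).

% Hypothetical data record mixing orthographic and phonemic properties.
record([id-be, name-be, case-lower, charclass-letter,
        voicing-voiced, place-bilabial, manner-stop]).

% has_property(Class, Id, Property, Value): an object of Class built from
% the record identified by Id acquires only the properties declared for Class.
has_property(Class, Id, Property, Value) :-
    record(Pairs),
    member(id-Id, Pairs),
    member(Property-Value, Pairs),
    property_of(Class, Property).

% ?- findall(P, has_property(phoneme, be, P, _), Ps).
% Ps = [voicing, place, manner].
```

Replacing the fact property_of(character, name) with property_of(phoneme, name) would move the recorded name to the phoneme without any change to the data record or to has_property/4.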
BECHAMEL supports superclass and subclass relations, which allows us
to declare that vowels and consonants are subclasses of phoneme, and
that letters, marks, and spaces are subclasses of character. Since the
conventional superclass/subclass relation can prove awkward in some
situations, BECHAMEL includes class declaration expressions similar to
those found in ontology languages such as OWL (W3C).
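A conventional superclass/subclass hierarchy of this kind can be sketched in plain Prolog as follows. The predicate names subclass_of/2, instance_of/2, is_a/2, and member_of_class/2 are assumptions for illustration and are not BECHAMEL's own declaration predicates; the two instances are invented.

```prolog
% Hypothetical subclass declarations.
subclass_of(vowel, phoneme).
subclass_of(consonant, phoneme).
subclass_of(letter, character).
subclass_of(mark, character).
subclass_of(space, character).

% Hypothetical instances: instance_of(Object, DirectClass).
instance_of(ph_a, vowel).
instance_of(ch_a, letter).

% is_a(Class, Ancestor): Ancestor is Class itself or one of its superclasses.
is_a(Class, Class).
is_a(Class, Ancestor) :-
    subclass_of(Class, Parent),
    is_a(Parent, Ancestor).

% member_of_class(Object, Class): Object belongs to its direct class
% and to every superclass of that class.
member_of_class(Object, Class) :-
    instance_of(Object, Direct),
    is_a(Direct, Class).

% ?- member_of_class(ph_a, phoneme).
% true.
```

A strict hierarchy like this handles the vowel/consonant and letter/mark/space divisions well; classifications that cut across it are where it becomes awkward.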
For example, place and manner of articulation in consonants may be used to describe classes of phonemes, not merely features of them (the class of stops, the class of velar consonants, etc.). It would be awkward to declare each consonant as a subclass of both its place and manner of articulation. Instead we use a BECHAMEL predicate that permits us to declare membership in a class based on the value assigned to a particular property:
An alveolar, therefore, is anything that takes the value alveolar on
its place property, a nasal anything that takes nasal for its
manner property, and so on. These class identities are in addition
to the one that instantiated the object. A related feature of BECHAMEL
is the ability to define class membership based on a Boolean
expression. The following example declares that an obstruent is either
a stop, a fricative, or an affricate:
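As an illustration only, both kinds of class definition (by property value and by Boolean expression) can be approximated in plain Prolog. The predicates class_by_value/3, class_by_expr/2, member_of/2, and satisfies/2, and the toy property facts, are assumptions of this sketch rather than BECHAMEL's actual predicates.

```prolog
% Toy facts (assumed): property(Object, Property, Value).
property(n,  place, alveolar).   property(n,  manner, nasal).
property(t,  place, alveolar).   property(t,  manner, stop).
property(s,  place, alveolar).   property(s,  manner, fricative).
property(ts, place, alveolar).   property(ts, manner, affricate).
property(m,  place, bilabial).   property(m,  manner, nasal).

% Hypothetical declaration: membership follows from a property value.
class_by_value(alveolar,  place,  alveolar).
class_by_value(nasal,     manner, nasal).
class_by_value(stop,      manner, stop).
class_by_value(fricative, manner, fricative).
class_by_value(affricate, manner, affricate).

% Hypothetical declaration: membership follows from a Boolean expression.
class_by_expr(obstruent, or(stop, or(fricative, affricate))).

% member_of(Object, Class): Object belongs to Class by either declaration.
member_of(Object, Class) :-
    class_by_value(Class, Property, Value),
    property(Object, Property, Value).
member_of(Object, Class) :-
    class_by_expr(Class, Expr),
    satisfies(Object, Expr).

% satisfies(Object, Expr): evaluate a Boolean class expression.
satisfies(Object, or(A, B)) :-
    ( satisfies(Object, A) ; satisfies(Object, B) ).
satisfies(Object, and(A, B)) :-
    satisfies(Object, A),
    satisfies(Object, B).
satisfies(Object, Class) :-
    atom(Class),
    member_of(Object, Class).

% ?- setof(O, member_of(O, obstruent), Os).
% Os = [s, t, ts].
```

No consonant is declared as a subclass of anything here: n counts as both an alveolar and a nasal purely because of the values recorded on its place and manner properties.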
All of these features are employed with the aim of making our
understandings, models, and simplifications regarding the rules of
Russian pronunciation as clear and as explicit as possible. We express
them in the form of declarative rules so as not to entangle
implementation details of our code with aspects of our model that
should be open to criticism, revision, and extension. For example, our
rule governing the devoicing of word-final obstruents states that if
an obstruent O is written with word-final letter L, then O should take
a value of voiceless on its voicing property (unless it already has
that property value):
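A rule of this shape can be sketched in plain Prolog as follows. The sketch derives a surface voicing value instead of updating an object in place, which is a simplification, and every predicate name and fact in it is an assumption for illustration, not the application's code. The toy objects model the word зуб ("tooth"), whose final б is pronounced voiceless.

```prolog
:- use_module(library(lists)).

% Toy phoneme objects p1..p3 for z-u-b, written with characters c1..c3.
manner(p1, fricative).
manner(p3, stop).
lexical_voicing(p1, voiced).
lexical_voicing(p2, voiced).     % the vowel
lexical_voicing(p3, voiced).
written_with(p1, c1).
written_with(p2, c2).
written_with(p3, c3).
word_final(c3).

% obstruent(P): stops, fricatives, and affricates.
obstruent(P) :-
    manner(P, M),
    memberchk(M, [stop, fricative, affricate]).

% devoices(O): obstruent O is written with a word-final letter and is
% not already voiceless.
devoices(O) :-
    obstruent(O),
    written_with(O, L),
    word_final(L),
    lexical_voicing(O, voiced).

% surface_voicing(O, V): voicing after the devoicing rule has applied.
surface_voicing(O, voiceless) :-
    devoices(O).
surface_voicing(O, V) :-
    lexical_voicing(O, V),
    \+ devoices(O).

% ?- surface_voicing(p3, V).
% V = voiceless.
```

The rule mentions only obstruents, word-final letters, and voicing, so it can be criticized or revised as a statement about Russian pronunciation without reference to how the objects were built from the XML input.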
Taking this approach, even a limited application, such as determining
which lines of a poem rhyme, requires a large number of these
declarations and rules; there is a very real sense in which we are
doing it