I'm pretty sure we're going to be able to have nesting/hierarchy both in our feature structure declarations and in our feature structures inside entries. I have one possible simple instance relating to clitic particles which will make a good test case, if ECH confirms what I believe to be the intended analysis based on her text.
Following ECH's analysis of the feature structures, I'm refactoring them, breaking out the general categories into smaller ones. Once that's done, we can see about how we might recombine them to create feature structures at higher levels, both in the declaration and in the documents.
There's a proliferation of groups of bibls which apply to the same preceding item (seg, phr, etc.), so I've written some XSLT to collapse them to a single bibl. My plan to move from bibl/@corresp to @resp on the preceding element has been thwarted by the fact that <phr> cannot take @resp. I'll need to re-think. I don't like <bibl>, which doesn't really make sense, and for quotes we can use @ref instead, but I'd like to use a unified mechanism for <seg> and <phr>.
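Before collapsing anything, it may help to list where these groups occur. The following is only a rough sketch, not the XSLT mentioned above: it assumes the bibls sit as immediately following siblings of the seg or phr they apply to, the sample data and the `local:bibl-run` helper are hypothetical, and the `xquery version "3.0"` declaration assumes a reasonably recent processor.

```xquery
xquery version "3.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Hypothetical sample data; in practice the query would run over the project files. :)
declare variable $sample :=
  <TEI xml:id="sample">
    <entry xml:id="e1">
      <cit>
        <phr>sample phrase</phr>
        <bibl corresp="#a"/>
        <bibl corresp="#b"/>
      </cit>
      <cit>
        <seg>sample segment</seg>
        <bibl corresp="#a"/>
      </cit>
    </entry>
  </TEI>;

(: Hypothetical helper: the run of bibl siblings directly following $item,
   stopping at the first sibling element that is not a bibl. :)
declare function local:bibl-run($item as element()) as element(bibl)* {
  let $next := $item/following-sibling::*[1]
  return
    if ($next instance of element(bibl))
    then ($next, local:bibl-run($next))
    else ()
};

(: Report each seg or phr followed by more than one bibl:
   entry id, element name, and the @corresp values of the run. :)
for $item in $sample//(seg | phr)
let $run := local:bibl-run($item)
where count($run) gt 1
return concat($item/ancestor::entry/@xml:id, '&#9;', local-name($item), '&#9;',
              string-join($run/@corresp/string(), ' '))
```

With the sample data, only the phr should be reported, since the seg has a single bibl. Replacing `$sample` with the context document would be the obvious next step.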
Read through the latest version of LR's paper and sent comments. Now working on ECH's feature structure paper, and it looks as though we're going to break down our feature/value sets into smaller groups.
I went through ECH's document, Moses Feature System.odt, and noted several cases where we may need to further refine the feature system.
Is it our goal that every affix (well, besides lexical suffixes and lexical prefixes) have a unique feature structure? If so, we will need additional features to disambiguate the following:
- the applicative suffixes ɬ-DIR, tuɬ, and xit. (MLW 2003 could not identify differences in usage for these suffixes.)
- ɬ-DIR and ɬ-EP can be distinguished with the symbol values "applicative" and "external-possession", respectively, but their meanings overlap somewhat. They are currently combined into one entry, "ɬt", in the database.
- all locative prefixes (kat-LOC, k-LOC, kɬ-LOC, kʼɬ-LOC, n-LOC, niʔ, t-LOC)
- the directional prefixes (ʔal, c, lc)
- the nominalizer-instrumentals min-INST and tn.
- the "count" prefixes: kʼɬ-DER (for counting sacks or bags), kɬ-DER (for ordinal numbers), and k (for counting people)
- the irrealis mood prefixes: kas for verb stems, and kaɬ-PR for noun stems. (Should these be allomorphs within one entry? Or two entries, cross-referenced?)
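One way to check whether affixes really do end up with unique feature structures is to reduce each entry's features to a sorted signature string and report the signatures shared by more than one entry. This is a minimal sketch under several assumptions: features are encoded as f elements with a @name and symbol children (only the symbol element is attested in the queries in this log), the sample entries, ids, and feature names are hypothetical, and the `group by` clause needs an XQuery 3.0 processor.

```xquery
xquery version "3.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Hypothetical sample data; ids and feature names are invented for illustration. :)
declare variable $sample :=
  <TEI xml:id="sample">
    <entry xml:id="kas">
      <fs><f name="mood"><symbol value="irrealis"/></f></fs>
    </entry>
    <entry xml:id="kalPR">
      <fs><f name="mood"><symbol value="irrealis"/></f></fs>
    </entry>
    <entry xml:id="min">
      <fs><f name="category"><symbol value="instrumental"/></f></fs>
    </entry>
  </TEI>;

(: Hypothetical helper: a sorted "name=value" list standing in for the
   feature structure, so that feature order does not affect the comparison. :)
declare function local:signature($e as element(entry)) as xs:string {
  string-join(
    for $f in $e//f
    let $pair := concat($f/@name, '=', string-join($f/symbol/@value, '|'))
    order by $pair
    return $pair,
    ';')
};

(: Group entries by signature and keep only signatures used more than once. :)
for $e in $sample//entry
let $sig := local:signature($e)
group by $sig
where count($e) gt 1
order by $sig
return concat($sig, '&#9;', string-join($e/@xml:id/string(), ', '))
```

With the sample data, the two irrealis entries should be reported together and the third entry omitted. Any signature that shows up in the real data marks a set of affixes that would need an extra feature, or a merged entry, to be told apart.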
Snippet to list every entry tagged as a proper noun, producing tab-delimited data for pasting into a spreadsheet:
declare default element namespace "http://www.tei-c.org/ns/1.0"; for $e in //entry[descendant::symbol[@value='proper-noun']] order by $e/@xml:id return concat($e/@xml:id, '&#9;', $e/ancestor::TEI/@xml:id, '.xml')
Snippet to list cross-reference targets that do not match the id of any entry, generating data for pasting into a spreadsheet:
declare default element namespace "http://www.tei-c.org/ns/1.0"; for $t in //@target[parent::ref[ancestor::xr]] let $id := substring-after($t, 'm:') where not(//entry[@xml:id=$id]) order by $t/ancestor::TEI/@xml:id, $t/ancestor::entry/@xml:id return concat($t/ancestor::TEI/@xml:id, '.xml', '&#9;', $t/ancestor::entry/@xml:id, '&#9;', $t)
To make this (and other things) simpler, I added @xml:id attributes to the root elements of all our files. For the record, to save a couple of minutes, here's the code to get duplicate ids for pasting into a spreadsheet. Don't forget to set the number of results high enough.
declare default element namespace "http://www.tei-c.org/ns/1.0"; for $id in distinct-values(//entry/@xml:id) let $c := count(//entry[@xml:id=$id]), $docs := if ($c gt 1) then //TEI[descendant::entry[@xml:id=$id]]/@xml:id else () where $c gt 1 order by $id return concat($id, '&#9;', $c, '&#9;', string-join($docs, ', '))
Wrote some XSLT to generate new @xml:id values from the phonemic rather than phonetic representations, and ran it on all the unedited or rescued documents. This saves the editor a lot of work, but actually has resulted in an increase in the number of duplicate ids across the project, which is now pegged at 1401. Gawd.
SK found a new informant, VH, so I've added her to the ODD file and schema.