This took a bit longer than I expected because of some problems encountered with missing data, but I think I've completed all the instructions SMK provided on Tuesday night for the new dictionary entry format. Generally I think it looks pretty good. Some notes:
- The test PDF, which runs to 185 pages right now, is built using the same test set that SMK and I have been using for developing the root-based index -- the full list of included files is below. It's basically all the completed entry files along with any incomplete files containing morphemes required by items in the completed files.
- "Name" entries have been excluded from the dictionary. At present, this means all entries which have the "name" feature set to true. This is too crude, because it also excludes entries for flora and fauna; it looks as though the feature structures will need to be made a bit more sophisticated to allow us to exclude people's names more reliably and keep the other ones.
- I've set it up so that it automatically generates orthographical forms where required, based on the phonemic pron, and it also sorts based on these forms. This saves having to preprocess all the files to add orths before generating the dictionary. If an orth already exists, it will use it (so when we get around to adding orths, they'll be used in place of the generated ones).
- Where "orth?" appears in the middle of an entry, it's from a quotation which has no phonemic <phr>, so there's nothing to generate an orth from. There are 2,204 of these in lex-suf.xml alone. Perhaps auto-phonemicization can help here.
- There are problems with cross-references which contain refs pointing at entries which are excluded from the dictionary (see the note on "Name" entries above), so any such cross-references are ignored. This means that some legitimate cross-references are excluded because they share an <xr> tag with unusable ones.
- I think that enclosing the two versions of the hyph (the regular hyph and the "translated" hyph) in the same set of paired square brackets is a bit confusing; I think there needs to be a more obvious delimiter between them, or perhaps they should be bracketed separately:
[[c-ka-√ƛʼaʔá-s-n c-kas-√ƛʼaʔa-stu-(3Obj, 1SgSubj)]]
should be something like:
[[c-ka-√ƛʼaʔá-s-n ◆ c-kas-√ƛʼaʔa-stu-(3Obj, 1SgSubj)]]
or
[[c-ka-√ƛʼaʔá-s-n]] [[c-kas-√ƛʼaʔa-stu-(3Obj, 1SgSubj)]]
(I do think the translated hyph is a great idea though.)
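
To make the orth fallback described in the notes above a bit more concrete, here is a minimal Python sketch of the logic: use an existing orth if there is one, otherwise generate one from the phonemic pron, otherwise show "orth?". The element paths (form/orth, form/pron), the function names and the two-character mapping table are all placeholders I've made up for illustration; the real mapping and the real build may well work differently, and the real sort order is probably not plain code-point order as it is here.

```python
import xml.etree.ElementTree as ET

# Made-up placeholder mapping, NOT the project's real orthography rules.
PLACEHOLDER_MAP = {"ʔ": "7", "ə": "e"}


def generate_orth(pron: str) -> str:
    """Transliterate a phonemic string using greedy longest-match lookup."""
    keys = sorted(PLACEHOLDER_MAP, key=len, reverse=True)
    out = []
    i = 0
    while i < len(pron):
        for key in keys:
            if pron.startswith(key, i):
                out.append(PLACEHOLDER_MAP[key])
                i += len(key)
                break
        else:
            # No mapping for this character: copy it through unchanged.
            out.append(pron[i])
            i += 1
    return "".join(out)


def display_orth(entry: ET.Element) -> str:
    """Prefer an existing orth, then a generated one, then the 'orth?' marker."""
    orth = entry.findtext("form/orth")  # assumed path
    if orth and orth.strip():
        return orth.strip()
    pron = entry.findtext("form/pron")  # assumed path
    if pron and pron.strip():
        return generate_orth(pron.strip())
    return "orth?"


def sort_entries(entries):
    """Sort entries by the displayed orth (simple code-point order)."""
    return sorted(entries, key=display_orth)


SAMPLE = """<entries>
  <entry id="b"><form><pron>ʔab</pron></form></entry>
  <entry id="a"><form><pron>ʔəb</pron><orth>aab</orth></form></entry>
  <entry id="c"><form/></entry>
</entries>"""

if __name__ == "__main__":
    for entry in sort_entries(ET.fromstring(SAMPLE).findall("entry")):
        print(entry.get("id"), display_orth(entry))
```

Because the lookup happens at build time, adding a real orth to a file later simply takes precedence over the generated one, with no preprocessing step.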
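
The cross-reference problem in the notes above can also be sketched in Python. The first function mirrors the behaviour as I've described it (an <xr> is dropped entirely if any of its refs points at an excluded entry); the second shows one possible finer-grained alternative that keeps the usable refs from each <xr>. The <ref target="#id"> structure and all the function names are assumptions for illustration, not the actual build code.

```python
import xml.etree.ElementTree as ET


def ref_is_usable(ref: ET.Element, included_ids: set) -> bool:
    """A ref is usable if its target (assumed to be '#id') is in the dictionary."""
    return ref.get("target", "").lstrip("#") in included_ids


def whole_xrs_kept(entry: ET.Element, included_ids: set) -> list:
    """Described behaviour: keep an <xr> only if every ref in it is usable."""
    return [
        xr for xr in entry.iter("xr")
        if all(ref_is_usable(r, included_ids) for r in xr.iter("ref"))
    ]


def usable_targets_per_xr(entry: ET.Element, included_ids: set) -> list:
    """Possible alternative: keep the usable refs from each <xr> individually."""
    return [
        [r.get("target") for r in xr.iter("ref") if ref_is_usable(r, included_ids)]
        for xr in entry.iter("xr")
    ]


XR_SAMPLE = """<entry id="e1">
  <xr><ref target="#a"/><ref target="#name1"/></xr>
  <xr><ref target="#b"/></xr>
</entry>"""

if __name__ == "__main__":
    entry = ET.fromstring(XR_SAMPLE)
    included = {"a", "b"}  # "name1" stands in for an excluded "name" entry
    print(len(whole_xrs_kept(entry, included)))
    print(usable_targets_per_xr(entry, included))
```

In the sample, the first <xr> is lost under the whole-tag rule even though its ref to "a" is legitimate, which is the loss the note describes.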