Skyped with LR about the possible next steps for TEI as an LMF serialization.
Tweaked the XSLT for actual use (adding utility templates to preserve whitespace between PIs etc.), and SK has now tested it with real data. She has some suggestions for improvements; one of them might be achievable, although it will require a structural rewrite. Results are good so far: most deviations from the desired results occur where there is no mechanically detectable context that could be used to change the outcome. Can this now be extended to pron/segs, where there's even less context because there is no hyphenation?
Still working on this, with a slow but fruitful discussion on the LMF list helping me to confirm that what I thought were limitations in LMF interoperability really are so. I'm coming round to LR's view that TEI would be a better serialization format.
Much frustration involved in my belated (re?)discovery that neither word-boundaries nor lookarounds are supported in the XPath implementation of regular expressions. Grrr. But now working fine, with lots of help and a test set from SK. We can start testing it on whole files tomorrow.
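For future reference, here is a minimal sketch of one way to get whole-word matching without `\b` or lookarounds. It is written in Python purely to show the shape of the pattern; it is not the XSLT we actually use, and the function names and the ASCII-only letter class are assumptions for illustration. The idea is to consume the delimiter on each side with a plain group and then put it back in the replacement (in XPath `replace()` the backreferences would be written `$1`, `$2`, `$3`).

```python
import re

def whole_word_pattern(word: str) -> str:
    """Build a whole-word pattern using only anchors, alternation and
    negated character classes (no \\b and no lookarounds)."""
    # Group 1: start of string or one non-letter character.
    # Group 2: the word itself.
    # Group 3: end of string or one non-letter character.
    # Assumption: "letter" means ASCII A-Z/a-z; real data needs a wider class.
    return r"(^|[^A-Za-z])(" + re.escape(word) + r")($|[^A-Za-z])"

def tag_whole_word(text: str, word: str) -> str:
    """Wrap whole-word occurrences of `word` in square brackets."""
    # The delimiters are part of the match, so the replacement restores them.
    return re.sub(whole_word_pattern(word), r"\1[\2]\3", text)

print(tag_whole_word("fat father, fat.", "fat"))
```

One known limitation of this approach: because each match consumes its trailing delimiter, two occurrences separated by a single character ("fat fat") are not both caught in one pass, so a second pass (or doubling the delimiters first) would be needed.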
Note to selves for the future: how will we deal with English homophones for sorting the Eng-Nx word list? If we remember, let's list the ones we come across here. So far we have:
fire - "flames" vs. "dismiss employee"
fast - "quick" vs. "abstain from food"
hide - "skin" vs. "conceal"
cold - adj or noun "coldness" vs. "illness"
saw - "tool" vs. past tense of see (replaced with <gloss subtype="i">see</gloss>, SMK 21May13)
close (near) vs. close (shut)
stern (back of boat)
game (recreation)
watch (observe) vs. watch (wristwatch)
back (body part)
top (toy) vs. top (of something)
fish vs. fish (catch fish)
ECH's goal for the search engine in the web database is that, if a user searches for "fat", s/he will get results including fat, fatten, fattening, fatty.
Our current settings, and our policies for adding inferred glosses, seem to be accomplishing this nicely. An entry which has "fatty" in its def is found by a search for "fat", because it also has an inferred gloss "fat".
Searching for "fat*" also returns defs including fat, fatten, fattening, fatty ... but also fatal, fathom, father.
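The difference between the two searches can be sketched roughly as follows. This is an illustration only, not the web database's actual search code: the entry structure, the sample entries and the `search` function are all hypothetical, and it assumes that a plain query is matched against whole gloss keys while a trailing `*` turns the query into a prefix match.

```python
# Hypothetical entries: each has a def and its gloss keys,
# including inferred glosses such as "fat" for "fatty".
ENTRIES = [
    {"def": "fatty", "glosses": ["fatty", "fat"]},
    {"def": "to fatten", "glosses": ["fatten", "fat"]},
    {"def": "father", "glosses": ["father"]},
    {"def": "fatal wound", "glosses": ["fatal"]},
]

def search(query, entries):
    """Return the defs of entries whose gloss keys match the query."""
    if query.endswith("*"):
        # Wildcard search: any gloss key starting with the prefix matches.
        prefix = query[:-1]
        return [e["def"] for e in entries
                if any(g.startswith(prefix) for g in e["glosses"])]
    # Plain search: the query must equal a whole gloss key.
    return [e["def"] for e in entries if query in e["glosses"]]

print(search("fat", ENTRIES))   # found via the inferred gloss "fat"
print(search("fat*", ENTRIES))  # also picks up father, fatal
```

Under these assumptions the inferred gloss does the work for the plain search, which is why it avoids the false hits that the wildcard search produces.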
We reviewed our gloss-tagging policies yesterday, and concluded that yes, we are placing inferred gloss tags correctly for the purposes of generating the English-Nxa'amxcin word list, both in the web display and for the future print dictionary.
I summarize our notes about the Eng-Nx section of the print dictionary here, so we can remind ourselves in the future!
-The Eng-Nx section in the print dictionary should be considered a (fairly detailed) index to the Nx-Eng side, not a full Eng-Nx dictionary. It will be comparable to what MDK did in his Chehalis dictionary.
-Ours will go one step further than the Chehalis dictionary, in that, for example, a Nxa'amxcin word with "fattening" in its def will be found under fat, fatten, and fattening (not just the lemma, fat).
-Our print version will be like our current Eng-Nx wordlist view in the web interface, expanded to the first level of detail - e.g.
fatten
kn sacqʼʷúcnctəxʷ fattening
ʔacqʼʷúcn fattening
ʔacqʼʷúcts fattened
-Inferred glosses will be hidden in both the web view and the print dictionary, although they are important for the "behind the scenes" generation of the Eng-Nx wordlist.
-Our gloss-tagging process should provide at least one English key for each Nxa'amxcin word. It currently almost accomplishes this. There is just the occasional def in which it is impossible to figure out what the gloss-tag should be - e.g.
<seg>someone who goes fishing or hunting and does not get anything; poor fisherman; poor hunter</seg>
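As a reminder of how the notes above fit together, here is a rough sketch of generating the Eng-Nx index from gloss keys. It is not our actual build process: the tuple layout, the function names and the particular gloss keys assigned to each word are assumptions made for illustration. Each English key (explicit or inferred) points to every entry tagged with it, and only the Nxa'amxcin form and its def are displayed, so the inferred glosses stay hidden.

```python
from collections import defaultdict

# (Nxa'amxcin form, def, gloss keys). The gloss keys are hypothetical.
ENTRIES = [
    ("kn sacqʼʷúcnctəxʷ", "fattening", ["fat", "fatten", "fattening"]),
    ("ʔacqʼʷúcn", "fattening", ["fat", "fatten", "fattening"]),
    ("ʔacqʼʷúcts", "fattened", ["fat", "fatten", "fattened"]),
]

def build_index(entries):
    """Map each English key to the (form, def) pairs filed under it."""
    index = defaultdict(list)
    for form, definition, glosses in entries:
        for key in glosses:
            index[key].append((form, definition))
    return index

def render(index):
    """Render the index as text: a headword line, then one line per entry."""
    lines = []
    for key in sorted(index):
        lines.append(key)
        for form, definition in index[key]:
            # Only the form and its def are shown; gloss keys are not.
            lines.append(f"{form} {definition}")
    return "\n".join(lines)

print(render(build_index(ENTRIES)))
```

An entry with no gloss keys at all, like the def above, would simply never appear in the index, which is why every word needs at least one English key.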
Got some good work in today, and it feels like it's coming together. 8 pages done, about another 8 to do, I think, and some diagrams required.
See autophonemicizer2.doc in moses/trunk/docs. It can be done with XSLT and regular expressions.
For every paragraph I write, I seem to have to find and read two more papers...