ECH and I spent some time today looking at the complexities of orthing, with the assistance of the document SMK has been working on. We added a few simple rules to the existing algorithm (removing secondary stresses), and there's one more I can add (removing unstressed schwas at the beginning of the process).
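For concreteness, here is a rough Python sketch of those two cleanup rules. The function name and the stress-marking conventions are assumptions for the sake of the example (primary stress as a combining acute, secondary stress as a combining grave, on the vowel); the real transcription conventions may differ.

```python
import re

def preprocess(phonemic: str) -> str:
    """Cleanup rules applied before orthing (a rough sketch).

    Assumes primary stress is a combining acute (U+0301) and secondary
    stress a combining grave (U+0300) on the vowel; adjust to the
    project's real transcription conventions.
    """
    # Rule: drop unstressed schwas at the beginning of the process
    # (here, any schwa carrying no stress mark at all).
    s = re.sub("\u0259(?![\u0300\u0301])", "", phonemic)
    # Rule: remove secondary-stress marks.
    return s.replace("\u0300", "")
```

For a made-up form like ʔə̀ləqə́b, this keeps both stressed schwas (stripping the grave from the first), and deletes the plain one, giving ʔəlqə́b.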
But beyond that, the basic conclusion we came to is this:
Most of the hard decisions for the awkward cases have to be made on the basis of hyph or feature structure information. Right now, I'm not inputting that information into the orthing algorithm; it just operates on a plain phonemic transcription. In the case of
<pron> orthing, it would be possible to pass in the relevant hyph and fs data. However, the orthing algorithm also has to be applied to
<cit>s, and in that case, there is no hyph or feature structure info; any phr might contain a word from a completely different entry, and we wouldn't have any way of knowing about it.
So we think the best approach is this:
- We do the best we can to improve the simple algorithm.
- We run it against the existing files and create hard-coded orths (<phr type="orth"> elements) in the actual entry files.
- We also write code that detects (but does not handle) cases where a manual fix is likely to be required (for instance, where the sequence əxʷ appears, which might indicate m:mix or m:ulˀəxW). When we detect such a case, we add a comment saying that someone needs to fix the result.
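A minimal sketch of that detector, in Python. Only the əxʷ case comes from our discussion; the table layout, function name, and comment format are placeholders we would flesh out as we find more ambiguous sequences.

```python
# Sequences whose presence suggests the automatically generated orth
# may be wrong. Only the first entry is from our discussion; the table
# would grow as we discover more ambiguous cases.
SUSPECT_SEQUENCES = [
    ("\u0259x\u02b7", "might be m:mix or m:ul\u02c0\u0259xW -- needs a human to decide"),
]

def flag_for_review(phonemic: str) -> list[str]:
    """Return one XML comment per suspect sequence found in the form."""
    notes = []
    for seq, reason in SUSPECT_SEQUENCES:
        if seq in phonemic:
            notes.append(f"<!-- FIXME (orth): contains {seq}: {reason} -->")
    return notes
```

The returned comments would then be dropped into the entry file next to the generated <phr type="orth"> so they surface when someone reviews the entry.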
Our quick tests suggest that the problem entries number in the low hundreds rather than the thousands, and much of the checking and fixing could be done by a good work-study student.