Greg and I have just spent a little while with the IPA and Unicode specs, and I think we have a good strategy for this.
First, you're absolutely right that the IPA specifies a raised comma to indicate an ejective. Therefore these consonants should be written with a comma: c, k, p, q, t, epiglottal.
Sonorants/resonants should be written with the glottal diacritic in the data, to conform with IPA. We're not sure of the exact list of consonants that fall into this category, though -- we guess: belted l, barred lambda, m, n, r, w and y.
If we've got any consonants in the wrong sets above, let us know. Now, even though we're using the raised glottal diacritic in the data, that doesn't mean we have to show it in the output; it's trivial to replace the raised glottal with a raised comma in the output. This way, we end up with a traditional "orthography" without sacrificing IPA conformance in the data.
The next question is which raised comma character to use. Unicode has these two candidates:
- \u02bc: MODIFIER LETTER APOSTROPHE, "glottal stop, glottalization, ejective"
- \u0313: COMBINING COMMA ABOVE, "Americanist ejective or glottalization"
The first is clearly the better option, and it's the one specified by the IPA. The description of the second shows that it is frequently used in Americanist transcriptions. We want a modifier letter that follows the consonant, not a diacritic above it, so we should choose the first. Incidentally, on the question of why we used the comma-above character in the data, Greg says he based the choice on your handwritten alphabet, which clearly shows the w and y with commas above them, not following them.
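For anyone who wants to check the difference between the two candidates, here is a small Python sketch that looks them up in the Unicode character database. The helper name `describe` is just something made up for illustration; the point is that U+02BC is a spacing modifier letter (category Lm), while U+0313 is a non-spacing combining mark (category Mn) that attaches above the preceding character.

```python
import unicodedata

# The two raised-comma candidates discussed above.
CANDIDATES = ["\u02bc", "\u0313"]

def describe(ch):
    """Return (code point, Unicode name, general category, combining class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),   # "Lm" = modifier letter, "Mn" = non-spacing mark
        unicodedata.combining(ch),  # 0 means the character does not combine
    )

for ch in CANDIDATES:
    print(describe(ch))
```

A non-zero combining class for U+0313 is what makes it sit above the consonant instead of following it.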
I'll summarize by showing the character combos I believe we should have in the data, and those we will generate in the output routines (hoping that they'll show up correctly in the browser font):
In the data:
cʼ, kʼ, pʼ, qʼ, tʼ, ʕʼ
ɬˀ, ƛˀ, mˀ, nˀ, rˀ, wˀ, yˀ
In the output for readers:
cʼ, kʼ, pʼ, qʼ, tʼ, ʕʼ
ɬʼ, ƛʼ, mʼ, nʼ, rʼ, wʼ, yʼ
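The data-to-output step could be as simple as the following Python sketch. It assumes the raised glottal in the data is U+02C0 (MODIFIER LETTER GLOTTAL STOP) and that the output should use U+02BC everywhere; the function name `to_display` is hypothetical, and the real output routines may well do this in XSLT or something else entirely.

```python
MODIFIER_GLOTTAL = "\u02c0"     # raised glottal, as stored in the data (assumed code point)
MODIFIER_APOSTROPHE = "\u02bc"  # raised comma, as shown to readers

def to_display(text):
    """Swap every raised glottal for a raised comma; leave everything else alone."""
    return text.replace(MODIFIER_GLOTTAL, MODIFIER_APOSTROPHE)

# Sonorants change, ejectives are already in their display form.
sample = "m\u02c0 n\u02c0 w\u02c0 k\u02bc"
print(to_display(sample))
```

Because the replacement only goes one way, the data keeps the ejective/glottalized distinction even though the output does not show it.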
If you agree, the next question is how best to go back and normalize all the data. Some of this can be done programmatically or with simple search-and-replace -- for example, we can just replace all instances of comma-above with modifier-glottal. Similarly, we can replace sequences of consonant + modifier-glottal with consonant + modifier-raised-comma. Slightly more problematic will be any instances of REAL apostrophes in the data, since these will always be wrong in Moses text, but may be fine in English text. I can show you some techniques for searching with XPath in oXygen to find all those instances.
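To make the clean-up steps concrete, here is a rough Python sketch of how they might look outside oXygen. Several things in it are assumptions on my part: that the second replacement applies only to the ejective consonants listed above (c, k, p, q, t, ʕ), that "real" apostrophes means U+0027 and possibly the curly U+2019, and the function names themselves. It works on plain strings, so deciding whether a hit sits in Moses or English text would still need the XML context.

```python
import re

COMBINING_COMMA_ABOVE = "\u0313"
MODIFIER_GLOTTAL = "\u02c0"
MODIFIER_APOSTROPHE = "\u02bc"
# Assumed ejective set, taken from the listing above; \u0295 is the ʕ character.
EJECTIVE_BASES = "ckpqt\u0295"

_EJECTIVE_WITH_GLOTTAL = re.compile(f"([{EJECTIVE_BASES}]){MODIFIER_GLOTTAL}")
# Assumption: straight and curly apostrophes both count as "real" apostrophes.
_REAL_APOSTROPHE = re.compile("['\u2019]")

def normalize_data(text):
    """Step 1: comma-above -> modifier-glottal everywhere.
    Step 2: ejective consonant + modifier-glottal -> consonant + raised comma."""
    text = text.replace(COMBINING_COMMA_ABOVE, MODIFIER_GLOTTAL)
    return _EJECTIVE_WITH_GLOTTAL.sub(r"\1" + MODIFIER_APOSTROPHE, text)

def find_apostrophes(text, context=10):
    """Return (offset, surrounding text) for each apostrophe, for manual review."""
    hits = []
    for m in _REAL_APOSTROPHE.finditer(text):
        start = max(0, m.start() - context)
        hits.append((m.start(), text[start:m.end() + context]))
    return hits

print(normalize_data("k\u0313a w\u0313a"))
print(find_apostrophes("don't"))
```

The apostrophe search only reports candidates and changes nothing, since whether each one is wrong depends on the language of the surrounding text.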
Let me know what you think.