Over the last few weeks we have had much discussion about where to place gloss tags for generating the Eng-Nx wordlist. I attempt to summarize our conclusions here for future reference.
1) Why do we place inferred glosses (<gloss subtype="i">)?
At various times, we have placed inferred glosses for augmenting the search engine on the website, and for generating the English word list.
We concluded that from here on, we ONLY need to place gloss tags for generating the English word list. Inferred glosses do sometimes enhance the web search engine, but now that the stemming analyzer is in place, we don't need to do any further markup to help it out.
2) How should we tag inflected English words?
Until last week, we had been inferring the root word (or stem where relevant) when a def is an inflected or derived form of an English word, e.g.
<def>
<seg>he is <gloss>fattening</gloss> it up</seg>
<bibl corresp="psn:JM">JM 1.2.3</bibl>
<seg><gloss subtype="i">fatten</gloss></seg>
<bibl corresp="psn:ECH">ECH</bibl>
<seg><gloss subtype="i">fat</gloss></seg>
<bibl corresp="psn:ECH">ECH</bibl>
</def>
This encoding means that this entry will show up three times in the English-Nxa’amxcin wordlist: under fat, under fatten, and under fattening. This seems like overkill, especially when these three words will sort one after the other in the English wordlist anyway.
ECH and SMK decided we would like to see the “fat” entries as follows in the print dictionary:
fat: fat
fatten: fatten, fattened, fattening
fatty: fatty
To accomplish this, we need to reduce the number of gloss tags we place in each entry. Inflected English forms (-ed, -ing) should not be gloss tagged; only their root or stem should be gloss tagged.
So “fattening” would now be gloss-tagged as:
<seg>he is <gloss>fatten</gloss>ing it up</seg>
MDH confirmed that the search engine is ignoring gloss tags, so the stemmer will operate on <gloss>fatten</gloss>ing the same as it would on <gloss>fattening</gloss>. (That is, it will continue to return all results with the stem “fatten” when someone searches for fatten, fattened, or fattening.)
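As a rough illustration of the effect on the word list, the sketch below counts the headwords that each encoding would produce. It is not the project's actual word-list generator: the function names are hypothetical, the sample is assumed to have no XML namespace, and the "new" sample assumes the two inferred glosses are dropped from this entry.

```python
import xml.etree.ElementTree as ET

# Old encoding: the inflected form plus two inferred glosses.
OLD = """<def>
<seg>he is <gloss>fattening</gloss> it up</seg>
<bibl corresp="psn:JM">JM 1.2.3</bibl>
<seg><gloss subtype="i">fatten</gloss></seg>
<bibl corresp="psn:ECH">ECH</bibl>
<seg><gloss subtype="i">fat</gloss></seg>
<bibl corresp="psn:ECH">ECH</bibl>
</def>"""

# New encoding (assumed): only the stem is gloss-tagged.
NEW = """<def>
<seg>he is <gloss>fatten</gloss>ing it up</seg>
<bibl corresp="psn:JM">JM 1.2.3</bibl>
</def>"""


def headwords(def_xml):
    """Return the sorted set of gloss texts found in one <def> element."""
    root = ET.fromstring(def_xml)
    return sorted({g.text.strip() for g in root.iter("gloss") if g.text})


def visible_text(def_xml):
    """Return the text of the first <seg> as a reader would see it, tags removed."""
    seg = ET.fromstring(def_xml).find("seg")
    return "".join(seg.itertext())


print(headwords(OLD))     # ['fat', 'fatten', 'fattening']
print(headwords(NEW))     # ['fatten']
print(visible_text(NEW))  # he is fattening it up
```

The definition text that readers see is unchanged by the new tagging; only the number of English headwords the entry generates goes down.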
MDH has created two sample Eng-Nx word lists based on the 6 files with “complete” status, one using all the gloss tags, and one omitting the inferred gloss tags. They are in moses/trunk/docs/glosses. We concluded that we don't want to programmatically ignore the inferred glosses, because many of them – especially the synonyms – are worth including. But we can refer to these lists to identify the inflected English words whose gloss tags need to be revised.
3) How should we tag English phrasal verbs?
Where appropriate, English phrasal verbs will be enclosed in a single gloss tag, e.g. <gloss>go after</gloss>. This will allow us to organize the headwords in the Eng-Nx word list as follows:
go
go after
go down
go up
etc.
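The ordering above falls out of a plain alphabetical sort once each phrasal verb is a single gloss. The sketch below is illustrative only: the <seg> fragments are invented sample data, and the real word list may well be sorted by different rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample segments; each phrasal verb sits in ONE gloss tag.
SEGS = [
    "<seg>to <gloss>go up</gloss> a hill</seg>",
    "<seg><gloss>goat</gloss></seg>",
    "<seg>to <gloss>go</gloss></seg>",
    "<seg>to <gloss>go down</gloss> to the river</seg>",
    "<seg>to <gloss>go after</gloss> someone</seg>",
]

# Collect the text of every gloss, then sort case-insensitively.
glosses = {g.text for s in SEGS for g in ET.fromstring(s).iter("gloss")}
headwords = sorted(glosses, key=str.lower)

for word in headwords:
    print(word)
```

Because a space sorts before any letter, the phrasal verbs stay grouped directly under "go" and ahead of longer words such as "goat".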
4) How can we distinguish English homophones in glosses?
English homophones in glosses will be distinguished with a secondary word (or phrase) in an @n attribute on the <gloss> tag, e.g. <gloss n="conflagration">fire</gloss>, <gloss n="back of boat">stern</gloss>. These will then be rendered as follows in the print dictionary:
fire (conflagration):
stern (back of boat):
We decided not to use parts of speech for @n values; we will always use synonyms, chosen so that they will be clear to readers in the community.
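A minimal sketch of how a headword label could be built from a gloss and its optional @n value follows. The function name is hypothetical and this is not the project's rendering code; it only mirrors the print format shown above.

```python
import xml.etree.ElementTree as ET


def headword_label(gloss_xml):
    """Build a print-dictionary headword label from a single <gloss> element."""
    gloss = ET.fromstring(gloss_xml)
    word = (gloss.text or "").strip()
    hint = gloss.get("n")  # None when the gloss carries no @n attribute
    return f"{word} ({hint}):" if hint else f"{word}:"


print(headword_label('<gloss n="conflagration">fire</gloss>'))  # fire (conflagration):
print(headword_label('<gloss n="back of boat">stern</gloss>'))  # stern (back of boat):
print(headword_label('<gloss>fatten</gloss>'))                  # fatten:
```

A gloss without @n is rendered as a bare headword, which matches the decision to mark only the less common member of a pair.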
I have now disambiguated the English homophones listed here, and updated the Notes on Definitions and Gloss Tagging document accordingly. Where one homophone was far more common in the data than the other, I added an @n value only on the less common one, e.g. watch (wristwatch).