More on inferred glosses
I am posting this exchange about inferred glosses so that I don't have to think it through all over again in the future!
SMK wrote:
Regarding the search engine, I blogged on 12/12/12:
"ECH's goal for the search engine in the web database is that, if a user searches for "fat", s/he will get results including fat, fatten, fattening, fatty. Our current settings, and our policies for adding inferred glosses, seem to be accomplishing this nicely. An entry which has "fatty" in its def is found by a search for "fat", because it also has an inferred gloss "fat". Searching for "fat*" also returns defs including fat, fatten, fattening, fatty ... but also fatal, fathom, father."
However, we also noticed the converse on 16/04/13:
When I searched for the inflected form “fired”, I also I got all the entries with “fire”.
BUT when I search for “fatty” or “fatten”, I don’t get all the entries with “fat”. What is the difference here?
MDH replied:
I think you're just discovering that a stemming analyzer is not an educated human. It doesn't understand semantics; it just knows how to strip off (some) inflectional endings and index the resulting stems, and then how to stem the search input and search the stemmed index with it. You will never find an automated search engine that gives you perfect results.
Right now, the search is paying no attention to whether things are in gloss tags or not; as I understand it, the purpose of the gloss tags is to construct and English-Nxa’amxcin list, not to aid in searching.
The situation with "fatty" is definitely a bit odd; it appears that if you search for that word, you it doesn't get stemmed prior to the search, whereas if you search for "fired" it does. Perhaps the stemmer avoids stemming -tty inputs because there are many which shouldn't be stemmed? ("batty", "natty", "patty", for instance.)
SMK continued:
OK, so when I search for fatten, fattened, or fattening, I get the same 5 hits – 3 for “fattening”, one for “fattened”, and one for “fatten” – i.e. everything with the stem “fatten”. It doesn't go all the way down to the root “fat”, and that's fine.
When I search for “fatty”, all I get is the one entry for “fatty”, as you explained above. That's fine too.
We had been adding inferred glosses for the uninflected English stems and roots of attested glosses, e.g.
<def>
<seg>I am <gloss>fattening</gloss> it up</seg><bibl corresp="psn:W">W10.138</bibl>
<seg><gloss subtype="i">fatten</gloss></seg><bibl corresp="psn:ECH">ECH</bibl>
<seg><gloss subtype="i">fat</gloss></seg><bibl corresp="psn:ECH">ECH</bibl>
</def>
Here, <gloss subtype="i">fatten</gloss> adds nothing to the search capabilities, because the stemmer can find “fatten” within “fattening”.
But does this entry with “fattening” get found when I search for “fat” because of the stemmer, or because of the <gloss subtype="i">fat</gloss>? It must be because of the inferred gloss, because the stemmer only stems as far as “fatten”.
In the case of “fatty”, where we know the stemmer doesn't operate on it, it still gets found when I search for “fat” because of the <gloss subtype="i">fat.
(“fattening” and “fatty” do NOT get found when I search for “fat” just because they contain the string f-a-t, because “fatal” and “father” are NOT found by a search for “fat”. To find anything with the string f-a-t, I would need to search for “fat*”.)
So the inferred glosses do play a role in improving the search. That said, I don't think we should be going out of our way to add inferred glosses for this reason.