Search Engine

Posted on 18 Nov 2010 in Activity log

Martin and I have been discussing what the search engine looks for. I'm posting our discussion here for reference, because I know I will be confused about it again in the future!

SMK:

The search engine doesn't always find a string within a longer string. For example, some, but not all, instances of "dropped" come up if I search for "drop". I'm not sure why this would be, if it's searching in all text fields. Or does it only find "dropped" if there is a <gloss type="u"> included with "drop"?

MDH:

It searches for words, so if it's looking for "drop", it won't find
"dropped". You don't want to search for "bed" and find "bedazzled".

Currently, the search uses Lucene's StandardAnalyzer to tokenize the
text, which means that it splits the text at word boundaries and does
no stemming, so each indexed term is a whole word exactly as written.
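
For future reference, here is a rough sketch of what whole-word matching without stemming means in practice. It is plain Python, not Lucene: the regex tokenizer only approximates StandardAnalyzer (lowercasing and splitting at word boundaries), and the function names `tokenize` and `matches` are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it at word boundaries; no stemming."""
    return re.findall(r"\w+", text.lower())


def matches(query, text):
    """Return True only if the query is identical to a whole token."""
    return query.lower() in tokenize(text)


print(matches("drop", "He dropped the cup"))  # False: "dropped" is a different token
print(matches("drop", "a drop of water"))     # True: "drop" is a token on its own
print(matches("bed", "She was bedazzled"))    # False: no substring matching
```

The point is that the comparison is token against token, never substring against string, which is why "bed" does not find "bedazzled".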

SMK:

I see. There are a couple of cases in the result set for "drop"
where "drop" is highlighted within "dropped". I'm guessing this is
because it's gloss-tagged? If this is the case, won't we have
effectively stemmed everything on the English side once we have
edited and gloss-tagged the rest of the entries?

MDH:

If you have this:

<gloss>drop</gloss>ped

then it would see the tag as a word-boundary, and find "drop". That's not a bad thing, I suppose.
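
A small sketch of that effect, again in plain Python rather than Lucene. It assumes each tag is simply treated as a break between words (modelled here by replacing tags with a space before tokenizing); how the real indexing pipeline handles markup may differ, and `tokenize_markup` is a hypothetical name.

```python
import re

# Assumption: any XML tag acts as a word boundary.
TAG = re.compile(r"<[^>]+>")


def tokenize_markup(text):
    """Replace tags with spaces, then split at word boundaries (no stemming)."""
    without_tags = TAG.sub(" ", text)
    return re.findall(r"\w+", without_tags.lower())


print(tokenize_markup("He <gloss>drop</gloss>ped the cup"))
# ['he', 'drop', 'ped', 'the', 'cup']
```

Under this assumption "drop" becomes a token of its own, and the leftover "ped" becomes a separate token.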

SMK:

Yeah, so when we have gloss-tagged all the entries, most cases of a given stem, like "drop", will be split off from their affixes. So someone might search for "dropped" and not get any hits, or perhaps only the entries where "dropped" is in a dicteg. And then the user would have to be clever enough to work backwards and realize they should be searching for "drop" too.
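
To make the consequence concrete, here is a toy index in plain Python. Everything in it is hypothetical: the two entries, their ids, and the tag handling (tags treated as word breaks) are assumptions for illustration, not the real data or the real Lucene index.

```python
import re
from collections import defaultdict


def tokens(text):
    """Treat tags as word breaks, lowercase, split at word boundaries."""
    return re.findall(r"\w+", re.sub(r"<[^>]+>", " ", text).lower())


# Hypothetical entries: one gloss-tagged, one with untagged text
# (standing in for something like a dicteg).
entries = {
    "entry1": "He <gloss>drop</gloss>ped the cup",
    "entry2": "He dropped the cup",
}

# Build a simple inverted index: token -> set of entry ids.
index = defaultdict(set)
for entry_id, text in entries.items():
    for token in tokens(text):
        index[token].add(entry_id)


def search(word):
    """Return the ids of entries containing the word as a whole token."""
    return sorted(index.get(word.lower(), set()))


print(search("dropped"))  # ['entry2']: the gloss-tagged entry is missed
print(search("drop"))     # ['entry1']: only the gloss-tagged entry is found
```

In this toy setup neither query finds both entries, which is exactly the situation where a user would have to think of searching for the stem as well.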
