Research and planning for search
This post represents the results of research Greg and I have been doing on the issue of searching complex Unicode data.
The search interface for this project presents a range of interesting challenges. Searching the English components of entries is no problem at all, but when it comes to searching the Moses text fields, we have to deal with the issue of diacritics. For instance, take the word "ṣạpḷị́ḷ". If your browser is using an appropriate font, you'll see this sequence of characters:
- s with dot below
- a with dot below
- p
- l with dot below
- i with acute accent and dot below
- l with dot below
We can envisage two extremes of user. On the one hand, a researcher familiar with the language might be searching for exactly this string of characters, with all the accents in place. In this case, we have only one type of problem. Consider the "i with acute accent and dot below". This could legitimately be created through any of the following sequences:
- i + combining acute accent + combining dot below
- i + combining dot below + combining acute accent
- (composite) i with acute accent + combining dot below
- (composite) i with dot below + combining acute accent
These all look the same, and are equivalent; there's no way to know which one the user typed in, and there's also no way to be sure which form might be in the database. This makes matching a search string against fields in the database rather difficult.
Unicode provides a solution for this, which will work pretty well for Moses. In Unicode, each combining character is assigned a combining class -- a number between 0 and 255. The classes are based on the position of the diacritic with respect to the character it's joined to. (See the Unicode Standard's documentation on canonical combining classes for more info.) Now, Unicode provides a set of normalization forms for any string. Among them is NFKD, or "Compatibility Decomposition". What this process does is:
- Decompose all composite characters into character + diacritic.
- Apply canonical ordering to diacritics, based on their combining class.
Applying NFKD normalization to any of the sequences above produces the same output:
- i + combining dot below + combining acute accent
because they would all be decomposed into three components, and then the i would come first (as the base character), then the dot below (combining class 220), and finally the acute accent (because its combining class is 230, which is higher than that of the dot below).
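Java's built-in java.text.Normalizer can be used to check this behaviour directly. Here is a minimal sketch, assuming I've picked the right code points for the precomposed characters (U+00ED and U+1ECB):

```java
import java.text.Normalizer;

public class NfkdCheck {
    public static void main(String[] args) {
        // Four visually identical spellings of "i with acute accent and dot below"
        String[] variants = {
            "i\u0301\u0323",   // i + combining acute (U+0301) + combining dot below (U+0323)
            "i\u0323\u0301",   // i + combining dot below + combining acute
            "\u00ED\u0323",    // precomposed i-with-acute (U+00ED) + combining dot below
            "\u1ECB\u0301"     // precomposed i-with-dot-below (U+1ECB) + combining acute
        };
        for (String v : variants) {
            String norm = Normalizer.normalize(v, Normalizer.Form.NFKD);
            // Every variant prints the same sequence: U+0069 U+0323 U+0301
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < norm.length(); i++) {
                sb.append(String.format("U+%04X ", (int) norm.charAt(i)));
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```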
Therefore we have a solution to the problem of our advanced searcher: if we perform NFKD normalization on BOTH the strings in the db against which we're searching, AND the search string entered by the user, then we'll be able to do a useful search.
The second type of user, a casual surfer, or someone who is not linguistically sophisticated or familiar with the language, presents a different type of problem. They most likely have no idea what diacritics should go where, and even if we provide onscreen buttons or keystroke methods for entering diacritics or modifier letters, they won't be able or willing to use them. Their search for "ṣạpḷị́ḷ" is likely to be entered as "saplil". Nevertheless, they'll still want to get results.
Another application of Unicode normalization form NFKD, followed by some extra processing, can solve this problem. First of all, it will split off the combining diacritics. We can then remove them from the string, turning "ṣạpḷị́ḷ" into "saplil". If we do this for the search string entered by the user, and for the db strings against which we're searching, then we can obtain meaningful matches whatever the user enters.
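A minimal sketch of this step in Java, using a hypothetical helper name (stripDiacritics): NFKD does the decomposition, and a regular expression then removes anything in the Unicode "non-spacing mark" category.

```java
import java.text.Normalizer;

public class DiacriticStripper {
    // Hypothetical helper: decompose with NFKD, then discard combining marks.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // \p{Mn} matches non-spacing combining marks (dot below, acute accent, etc.)
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("ṣạpḷị́ḷ")); // prints "saplil"
    }
}
```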
In addition to splitting out the combining diacritics, compatibility decomposition will also convert some characters into their "compatibility" equivalents. For example, modifier letter w (raised w) will be converted to a regular w. This solves yet another problem: people will tend to type a w rather than finding the keystroke or button for raised w. Some characters, however, do not have compatibility equivalents where we would want them. Modifier raised glottal, for instance, doesn't have a compatibility equivalent, even though there is a regular glottal character. When we process the string to strip out the diacritics, though, we could do that conversion too.
These are the conversions we would need to make in order to create an "ascii representation" of a Moses string (sketched in code after the list):
- Split out combining diacritics (NFKD does this).
- Convert characters to their compatibility equivalents (NFKD does this for some characters).
- Discard combining diacritics.
- Convert raised w to w.
- Convert raised glottal to glottal.
- Convert belted l to l.
- Convert barred lambda to l.
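Putting the list above together, here is a rough Java sketch of the whole ascii-izing step. The specific code points (U+02C0 for the raised glottal, U+0294 for the plain glottal, U+026C for belted l, U+019B for barred lambda) are my assumptions about which characters the project uses, and the glottal/epiglottal question is left open pending the decision discussed below.

```java
import java.text.Normalizer;

public class AsciiForm {
    // Sketch of the "ascii representation" conversion. The code points used here
    // are assumptions about the project's orthography and may need adjusting.
    static String asciiForm(String moses) {
        // NFKD splits off combining diacritics and already converts the raised w
        // (U+02B7 MODIFIER LETTER SMALL W) to a plain w via its compatibility mapping.
        String s = Normalizer.normalize(moses, Normalizer.Form.NFKD);
        // Discard the combining diacritics themselves.
        s = s.replaceAll("\\p{Mn}", "");
        // Conversions NFKD does not do for us:
        s = s.replace('\u02C0', '\u0294'); // raised glottal -> glottal (assumed code points)
        s = s.replace('\u026C', 'l');      // belted l -> l
        s = s.replace('\u019B', 'l');      // barred lambda -> l
        // Glottal and epiglottal are left as-is for now; see the discussion below
        // about replacing them with an apostrophe or X-SAMPA equivalents.
        return s;
    }
}
```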
Now we have a decision to make. There are two non-ascii (potentially "confusing-to-the-layman") characters still in the mix: glottal and epiglottal. We could either leave them there, or we could replace them with something more bland. If we replace them, we need appropriate replacements. One option would be to replace both by the apostrophe; another would be to use a convention such as X-SAMPA, which replaces the glottal by a question mark, and the epiglottal by "?\". A decision on this should be guided by our sense of what semi-sophisticated users (such as band members familiar with the basic orthography) might be expected to use.
So we have a situation where we need to map two representations of a search string against two representations of the data. The data itself does not contain any normalized representations at all, and we would prefer to avoid cluttering it up with them; furthermore, because they can be generated programmatically, it makes no sense to have them in the data anyway. However, generating two representations of every Moses string in the db every time a search is done makes no sense either; it would make searching ridiculously slow.
The solution to this is to create an index to the entries, consisting of a series of parallel entries. Each one would have the same xml:id attribute as the entry it's based on, as well as a copy of the representative string for that entry (which is currently the first <seg> in the first <pron> in the first <form> element). It would then have two blocks of tags, one for the NFKD forms of all Moses strings in that entry, and one for the ascii-ized forms. This index can be loaded into a special subcollection on the database. Searching can be done against the index, to retrieve the xml:id and the representative string directly from the index, instead of from the db itself. A list of such hits can be returned to the user, who can then click on any entry to retrieve the real data for that entry from the database.
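To make the idea concrete, an index entry might look something like this; the element names (indexEntry, repr, nfkd, ascii) and the overall shape are purely hypothetical placeholders for whatever structure we settle on.

```xml
<!-- Hypothetical index entry: element names, structure, and the id are placeholders. -->
<indexEntry xml:id="entry001">
  <!-- copy of the representative string (first seg in first pron in first form) -->
  <repr>ṣạpḷị́ḷ</repr>
  <!-- NFKD forms of all Moses strings in the entry -->
  <nfkd>
    <str>ṣạpḷị́ḷ</str>
  </nfkd>
  <!-- ascii-ized forms of the same strings -->
  <ascii>
    <str>saplil</str>
  </ascii>
</indexEntry>
```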
This index would have to be generated ahead of time from the existing data. The best way to do this would probably be to write a Java application which can be passed a list of markup files. It would load each XML file into a DOM, generate an index entry for each entry in the source file, and then use a node iterator to run through each of the tags containing Moses text, generating the NFKD and ascii-ized forms for each one. The resulting index file can go into the database, and eXist range indexes can be defined for it, making searching even more efficient.
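A skeleton of such an application might look like the following; the element names it looks for ("entry" and "seg") and the way it would write out the index are placeholders, and asciiForm() is the conversion sketched earlier.

```java
import java.io.File;
import java.text.Normalizer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class IndexBuilder {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        for (String path : args) {                                  // one markup file per argument
            Document doc = builder.parse(new File(path));
            NodeList entries = doc.getElementsByTagName("entry");   // placeholder element name
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                String id = entry.getAttribute("xml:id");
                // run through the tags containing Moses text (assumed here to be <seg>)
                NodeList segs = entry.getElementsByTagName("seg");
                for (int j = 0; j < segs.getLength(); j++) {
                    String moses = segs.item(j).getTextContent();
                    String nfkd = Normalizer.normalize(moses, Normalizer.Form.NFKD);
                    String ascii = asciiForm(moses);
                    // ...append the <nfkd> and <ascii> values to the index entry for this id...
                }
            }
        }
    }

    // stub so the sketch compiles; see the asciiForm sketch earlier in the post
    static String asciiForm(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD).replaceAll("\\p{Mn}", "");
    }
}
```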
That solves the problem of creating searchable indexes; now we have to deal with the issue of the search string, which cannot be processed ahead of time, because we don't know what the user will type in. We need to render this string into two forms as well, to search against the two types of item in the index. There are two possible ways to handle this:
- We could try to produce the two forms using XSLT, by predicting all possible variants of characters and diacritics, and searching/replacing on them. This sounds like it would be inefficient and complicated, but since search strings will typically be very short, it might be a perfectly good approach.
- We could write a new Transformer for Cocoon using Java, which does the same job, and which can be called from a Cocoon pipeline. This would most likely be a little faster, and more reliable since we could depend on Java to do the NFKD normalization for us. However, it would involve learning how to write and deploy a customized transformer, which would take some time. On the other hand, knowing how to do this would be very handy in the future.
Whatever method we use to massage the search string, we'll have to integrate it into a pipeline, which will pass the results as two parameters to an XQuery page. The XQuery code will then perform two searches: the normalized form against the equivalent forms in the index, and the ascii-ized form against its equivalent. This will result in two sets of hits; the first set should be prioritized, because those are likely to be more precise matches. Duplicates would need to be discarded, and then the resulting set of hits (xml:id and representative form for each entry) would be returned to the pipeline to be styled as hits and inserted into the search page. This would be done through AJAX.
This will obviously need more thought to work out the fine details, but I think we have the basis of an excellent strategy here, enabling us to provide fast searching which combines precision with fuzziness, and which should suit any type of user.