How the current search engine works
Posted by mholmes on 14 Oct 2011 in Activity log
I've analyzed the search engine on the current site to identify which aspects of it should be maintained, and which missing features or limitations we might be able to overcome.
Key features:
- Long s substitution. The search engine does a simultaneous search for the search term itself, and a version of it in which s has been replaced by ſ. However, see limitations below.
- Wild-card searches using * and ?. These are provided by the original eXist search engine.
- Special handling for variants in name elements. A cron job that runs every day analyzes all the name elements in the system and creates a separate index entry for every word token, as well as for every distinct complete element content, linking each entry to the @xml:id of the target element. When a search term matches an entry in this index, all other tokens from name elements pointing to the same @xml:id are added to the search list. This is quite effective.
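To make the name-variant lookup concrete, here is a minimal sketch in Python of how such an index could be built and used to expand a search term. This is not the code on the current site (which runs inside eXist); the function names, the sample names and the ids SHAK1 and JONS1 are all hypothetical, and the sketch assumes case-insensitive matching on whitespace-separated tokens.

```python
from collections import defaultdict

def build_name_index(names):
    """Build the variant index from (xml_id, text) pairs, one per name element.

    Hypothetical structure: every word token and every complete name
    string is recorded against the @xml:id its element points to.
    """
    token_to_ids = defaultdict(set)   # form -> ids it has been seen with
    id_to_forms = defaultdict(set)    # id -> every form recorded for it
    for xml_id, text in names:
        full = " ".join(text.lower().split())   # normalized complete content
        forms = set(full.split()) | {full}      # word tokens plus full string
        for form in forms:
            token_to_ids[form].add(xml_id)
            id_to_forms[xml_id].add(form)
    return token_to_ids, id_to_forms

def expand_term(term, token_to_ids, id_to_forms):
    """Return the term plus every form that shares an @xml:id with it."""
    key = " ".join(term.lower().split())
    variants = {key}
    for xml_id in token_to_ids.get(key, ()):
        variants |= id_to_forms[xml_id]
    return sorted(variants)

# Invented sample data, for illustration only.
names = [
    ("SHAK1", "Shakespeare"),
    ("SHAK1", "Shakspere"),
    ("SHAK1", "William Shakespeare"),
    ("JONS1", "Ben Jonson"),
]
token_to_ids, id_to_forms = build_name_index(names)
print(expand_term("Shakspere", token_to_ids, id_to_forms))
```

A term that is not in the index comes back unchanged, so the ordinary search still runs for words that never appear in a name element.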
Limitations:
- Long s substitution is crude and does not create the variations that are likely to occur. For instance, if you search for "sinfulness", the search will also look for "ſinfulneſſ". However, such a form is unlikely to occur anywhere, since long s is not typically used at the end of a word; more likely is "ſinfulneſs". A more realistic substitution algorithm can be created, which avoids replacing final s with long s, although that still may not allow for all the possible variants that may occur.
- The nightly cron job for building the variant spelling index is a bit fragile, and it also means that changes to the database are not reflected in the variant spelling index until the next run, which may be several hours later. It might be easier to build Lucene indexes that will enable us to do the same sort of search on-the-fly.
- No keyword-in-context results are returned; all that comes back is the title of the document as a link. In the case of the long documents which are split into sections for display, clicking on this only shows you a TOC or an introduction, which may not be the part of the document in which the hit appears. This problem would be obviated by my plan to stop breaking these documents up, but JJ hasn't yet responded to that.
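As an illustration of the long s limitation described above, the following Python sketch contrasts the crude substitution with one that leaves a word-final s alone. It is only a sketch under stated assumptions: the function names are hypothetical, only lowercase s is handled, and "final" is taken to mean "not followed by another word character". As noted above, this still may not cover all the variants that occur.

```python
import re

LONG_S = "\u017f"  # the character ſ

def crude_long_s(term):
    """Mimic the current behaviour: replace every s with long s."""
    return term.replace("s", LONG_S)

def non_final_long_s(term):
    """Replace s with long s unless it is the last letter of a word.

    The lookahead only matches an s that is followed by another word
    character, so a word-final s is left as it is.
    """
    return re.sub(r"s(?=\w)", LONG_S, term)

def search_terms(term):
    """Return the original term and its long s variant, without duplicates."""
    return list(dict.fromkeys([term, non_final_long_s(term)]))

print(crude_long_s("sinfulness"))      # ſinfulneſſ
print(non_final_long_s("sinfulness"))  # ſinfulneſs
```

Both forms returned by search_terms could then be searched simultaneously, as the current engine already does with its cruder variant.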