Developing a local search engine
Posted by mholmes on 16 Mar 2018 in Activity log
One of our project goals is to investigate whether it is practical to build a local search "engine" that requires no server-side support and, if so, how large a site can get before the approach becomes impractical. Today I did the first half of this work, using the Keats site as a pilot: it's in modern(ish) English, it's of a size that is neither huge nor trivial, and it doesn't have a back end and probably shouldn't. This is what I've got so far:
- XSLT tokenizes all the content files, duplicating some bits to create simplistic weighting. I attempt to preserve proper names by retaining capitalization for all words which don't appear in a small English word-list (40,000 words).
- A Python Porter stemmer stems all the non-proper-name tokens.
- XSLT amalgamates all the token-counts and their source documents.
- XSLT generates a separate JSON file for each token, listing every document that contains it and the number of hits in each document.
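To make the shape of the index concrete, here is a rough Python sketch of the steps above. It is only an illustration, not the XSLT pipeline itself: the tiny `WORDLIST`, the placeholder `stem` function (which merely strips a trailing "s" rather than running the Porter algorithm), the sample documents and the JSON layout are all assumptions made for the example, and the weighting-by-duplication step is left out.

```python
import json
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for the small English word-list (the real one has about 40,000 words).
WORDLIST = {"the", "poet", "wrote", "odes", "ode", "letters", "letter", "to", "a", "in"}


def stem(word):
    """Placeholder for a real Porter stemmer: only strips a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def tokenize(text):
    """Yield index tokens, keeping capitalization for words not in WORDLIST."""
    for word in re.findall(r"[A-Za-z]+", text):
        if word[0].isupper() and word.lower() not in WORDLIST:
            yield word  # treated as a proper name, so not stemmed
        else:
            yield stem(word.lower())


def build_index(docs):
    """Map each token to {document id: hit count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token, count in Counter(tokenize(text)).items():
            index[token][doc_id] = count
    return index


def token_files(index):
    """Return {file name: JSON text}, one entry per token (layout is assumed)."""
    return {
        f"{token}.json": json.dumps({"token": token, "docs": hits}, sort_keys=True)
        for token, hits in index.items()
    }


# Made-up sample content, not taken from the Keats site.
docs = {
    "doc1.html": "The poet Keats wrote odes. Keats wrote letters.",
    "doc2.html": "Letters to Fanny Brawne.",
}
index = build_index(docs)
files = token_files(index)
```

In this sketch "Keats" keeps its capital and is counted twice in `doc1.html`, while "Letters" and "letters" both collapse to the stemmed token "letter" because the lower-case form is in the word-list.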
Next, we write a search engine interface in which we use JavaScript to:
- stem each search term (unless it's a proper name). I've found a JS implementation of the Porter stemmer.
- retrieve the JSON files for each of the search terms
- unify them to get hit counts for each individual document
- display (paged?) results
Should be doable in a few hours.
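The search side can be sketched the same way. The real interface would be JavaScript running in the browser and fetching the JSON files by URL; the Python below only illustrates the merge logic. The in-memory `TOKEN_FILES`, the JSON layout, the placeholder `stem`, and the rule that a capitalized search term counts as a proper name are all assumptions for the example.

```python
import json

# Hypothetical stand-in for the per-token JSON files; in the browser
# these would be retrieved by URL rather than read from a dict.
TOKEN_FILES = {
    "Keats.json": '{"token": "Keats", "docs": {"doc1.html": 2, "doc3.html": 1}}',
    "letter.json": '{"token": "letter", "docs": {"doc1.html": 1, "doc2.html": 4}}',
}


def stem(word):
    """Placeholder for a real Porter stemmer: only strips a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def normalize(term):
    """Leave capitalized terms alone (assumed proper names); stem the rest."""
    return term if term[0].isupper() else stem(term.lower())


def load_hits(token):
    """Return {document id: hit count} for a token, or {} if it has no file."""
    raw = TOKEN_FILES.get(f"{token}.json")
    return json.loads(raw)["docs"] if raw else {}


def search(query):
    """Sum hit counts per document across all search terms, best first."""
    totals = {}
    for term in query.split():
        for doc_id, count in load_hits(normalize(term)).items():
            totals[doc_id] = totals.get(doc_id, 0) + count
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def page(results, number, size=10):
    """Return one page of results; pages are numbered from 1."""
    start = (number - 1) * size
    return results[start:start + size]


results = search("Keats letters")
first_page = page(results, 1, size=2)
```

Summing the counts is the simplest way to unify the per-term lists; it ranks a document matching any term, so requiring all terms to match would need an extra intersection step.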
This entry was posted by Martin and filed under Activity log.