It's mighty simple and it's fast. More features I'm thinking about:
- In addition to the stemmed version of a word, we could also index the unstemmed version; that would give documents which contain the exact unstemmed form a higher score than those which only contain a related form, at the cost of probably doubling (or more) the number of JSON files.
- When I rank documents for relevance right now, I'm just counting every occurrence of any of the query words; we could give extra points to documents which include more of the distinct words.
- It would actually be possible to support + and - (must include and must not include) with the current infrastructure; is it worth it?
- It is theoretically possible to build an ngram index in the same way, which would let us support quoted phrases, but of course the size of the JSON index would increase hugely. If we're not supporting quoted phrases, we should warn users about that, or perhaps simply strip quotes out before doing the search.
- Should we warn about stopwords, or strip them out?
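
Several of these ideas could plausibly be combined in one query step. What follows is only a rough sketch in Python, not the actual search code: it assumes the per-word JSON files have already been loaded into a dictionary mapping each term to `{document: occurrence count}`, and it skips stemming entirely. The names `parse_query`, `rank`, `STOPWORDS` and the `bonus` weight are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical stopword list; a real one should match whatever the indexer uses.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}


def parse_query(query):
    """Split a raw query into (optional, required, excluded, dropped) term lists."""
    # Phrase search is not supported, so double quotes are simply stripped.
    cleaned = query.replace('"', " ")
    optional, required, excluded, dropped = [], [], [], []
    for token in cleaned.lower().split():
        prefix = token[0] if token[0] in "+-" else ""
        word = token.lstrip("+-")
        if not word:
            continue  # a bare "+" or "-" carries no term
        if word in STOPWORDS:
            dropped.append(word)  # returned so the caller can warn about them
        elif prefix == "+":
            required.append(word)
        elif prefix == "-":
            excluded.append(word)
        else:
            optional.append(word)
    return optional, required, excluded, dropped


def rank(index, optional, required, excluded, bonus=10):
    """Return document ids, best first.

    index maps term -> {doc_id: occurrence_count} (an assumed in-memory
    form of the per-term JSON files).
    """
    scores = defaultdict(int)
    matched = defaultdict(set)
    for term in optional + required:
        for doc, count in index.get(term, {}).items():
            scores[doc] += count       # current behaviour: count occurrences
            matched[doc].add(term)     # remember which distinct terms hit
    results = []
    for doc, score in scores.items():
        if any(doc not in index.get(term, {}) for term in required):
            continue  # a "+" term is missing from this document
        if any(doc in index.get(term, {}) for term in excluded):
            continue  # a "-" term is present in this document
        # Extra points for every distinct query term beyond the first.
        results.append((score + bonus * (len(matched[doc]) - 1), doc))
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc for _, doc in results]
```

With `bonus=0` this reduces to plain occurrence counting; a larger `bonus` lets a document that mentions two query words once each outrank one that repeats a single word many times. The right weight would need tuning against real queries.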