Next-gen search: first steps
One of the problems we face in building our next-generation search engine is the gap between archaic spellings and their modern equivalents. To understand the scale of the problem before we start tackling it, I've written some scripts that are compiling a list of every distinct word-like token in the corpus that does not appear in a modern spelling dictionary. Right now the run is up to the Rs, with the count at around 35,000. I'll stay this evening until it completes, because I want to see the final tally.
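For the curious, the compilation step is essentially a set-difference job: tokenize the corpus, look each token up in the dictionary, and keep the misses. Here's a minimal sketch in Python of how such a script might work; the file paths and the letters-only token pattern are illustrative assumptions, not the actual scripts:

```python
import re
from pathlib import Path

# Hypothetical locations; the real corpus and dictionary live elsewhere.
CORPUS_DIR = Path("corpus")
DICT_PATH = Path("modern_dictionary.txt")

# Load the modern spelling dictionary into a set for O(1) membership tests.
modern_words = {line.strip().lower() for line in DICT_PATH.open(encoding="utf-8")}

# "Word-like token" here means a run of letters; digits and underscores
# are excluded. The real definition may well differ.
token_pattern = re.compile(r"[^\W\d_]+")

unknown_tokens = set()
for path in sorted(CORPUS_DIR.glob("**/*.txt")):
    for token in token_pattern.findall(path.read_text(encoding="utf-8")):
        token = token.lower()
        if token not in modern_words:
            unknown_tokens.add(token)

# Emit the distinct unknowns in alphabetical order.
with open("unknown_tokens.txt", "w", encoding="utf-8") as out:
    out.writelines(t + "\n" for t in sorted(unknown_tokens))

print(f"{len(unknown_tokens)} distinct unknown tokens")
```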
Once we have the complete list, we'll be able to work out how many of the tokens can be handled by normalization algorithms (such as converting long s to s, and applying other spelling-variant patterns known to be common). That will tell us how many tokens will actually need a human reader to supply modern equivalents.
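To make that concrete, here's a sketch of what a normalization pass might look like, building on the sets from the snippet above. Only the long-s rule is taken from the plan here; the u/v, i/j, and vv/w swaps are my guesses at the "other common patterns" and should be treated as placeholders:

```python
def normalize(token: str, modern_words: set[str]) -> str | None:
    """Try a few mechanical respellings of an archaic token and return
    the first candidate found in the modern dictionary, else None."""
    candidates = [
        token.replace("\u017f", "s"),  # long s (ſ) -> s
        token.replace("vv", "w"),      # assumed: vv -> w
        token.replace("v", "u"),       # assumed: u/v interchange
        token.replace("u", "v"),
        token.replace("j", "i"),       # assumed: i/j interchange
    ]
    for candidate in candidates:
        if candidate in modern_words:
            return candidate
    return None


# Given unknown_tokens and modern_words from the earlier sketch, the
# tokens no rule rescues are the ones that need a human reader.
needs_human = [t for t in unknown_tokens if normalize(t, modern_words) is None]
```

A rule table like this is cheap to extend as more variant patterns turn up, which is another reason to wait for the full tally before committing to an approach.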