More research on historical spelling variance
Posted by mholmes on 09 Mar 2012 in Activity log, Academic
I now have a collection of a dozen or so papers I'm reading and annotating, and some ideas are getting clearer. At the moment (although I still have a lot of reading and consulting to do), this kind of approach looks promising:
- Run XSLT on the collection to create a parallel collection in which each significant block (it's not yet clear what a block is) is converted to a modernized textual representation, with an XPath pointer back to the original block in the original document. Line breaks would be dealt with in this process.
- Each modernized block includes the original variants as attributes or elements (if the latter, the modern indexer can be instructed to ignore them).
- Modern blocks may also be stemmed.
- Search is done on modern blocks.
- KWIC hits from the search can be shown EITHER as the modern sequence OR as the original sequence (reconstructed from the original variants stored in the modern block).
- Clicking on the hit takes you to the original text, with hits highlighted based on a new search done using the original tokens stored in the modern block as search terms.
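
To make the data flow in the list above concrete, here is a rough sketch in Python (chosen only for brevity; the actual plan is XSLT plus a search engine). Everything in it is an assumption for illustration: the tiny `MODERN` lookup table, the `Token` and `Block` structures, the `modernize` and `search` helpers, the sample text, and the pointer string are all hypothetical. It also assumes that "dealing with linebreaks" means rejoining words hyphenated across a line, which has not been decided. Stemming is left out.

```python
import re
from dataclasses import dataclass

# Hypothetical variant-to-modern lookup; a real project would need a far
# larger resource or a rule set.
MODERN = {"loue": "love", "vertue": "virtue", "doth": "does", "haue": "have"}


@dataclass
class Token:
    modern: str    # normalized form, the one that gets indexed
    original: str  # spelling taken from the source text


@dataclass
class Block:
    pointer: str   # XPath-style pointer back to the source block
    tokens: list


def modernize(pointer, text):
    """Build a modern block that keeps each original spelling beside it."""
    # Assumption: a hyphen at a line end marks a word split across lines,
    # so the two halves are rejoined before tokenizing.
    joined = re.sub(r"-\s*\n\s*", "", text)
    tokens = []
    for word in joined.split():
        key = word.lower().strip(".,;:!?")
        # Unknown words fall through unchanged.
        tokens.append(Token(MODERN.get(key, key), word))
    return Block(pointer, tokens)


def search(blocks, term, width=2):
    """Yield (pointer, modern KWIC, original KWIC, original hit) per match."""
    term = term.lower()
    for block in blocks:
        for i, tok in enumerate(block.tokens):
            if tok.modern == term:
                window = block.tokens[max(0, i - width): i + width + 1]
                yield (
                    block.pointer,
                    " ".join(t.modern for t in window),
                    " ".join(t.original for t in window),
                    tok.original,  # term for the follow-up search on the original
                )


# Made-up sample block and pointer.
blocks = [modernize("/TEI/text/body/div[1]/p[1]",
                    "Ver-\ntue doth not loue\nidle handes")]
hits = list(search(blocks, "love"))
for pointer, modern_kwic, original_kwic, original_hit in hits:
    print(pointer, "|", modern_kwic, "|", original_kwic, "|", original_hit)
```

The search matches only on the modern forms, while each hit carries the pointer and the original spelling needed to display the original sequence and to highlight the hit in the source document. One wrinkle the sketch exposes: once a hyphenated word is rejoined, the stored "original" token no longer matches the source character for character, so the highlighting step would need to account for that.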