Research on Tesseract OCR (for VPN)
Posted by mholmes on 25 Jan 2013 in Activity log
Following the first few steps on the Google Code wiki for Tesseract to learn how to train it for a new language, I've used the moshpytt box editor on a sample file, and read through the other sample data. It looks like we may be able to do something like the following for any sufficiently large run of a journal that has consistent page images, fonts, print quality, etc.:
- Create an imagemagick script which optimizes the images for OCR.
- OCR some sample poems and use moshpytt to correct the results.
- Go through the rest of the training process, to create a complete training set for "Victorian English".
- Use a standard dictionary, but tweak it to remove distracting modern vocabulary, etc.
- Add XSLT for markup to the end of the toolchain.
- Add a step to pull in metadata from the db.
- Run it on the whole set, and get decent TEI-encoded transcriptions out the other end.
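
As a rough sketch of how the image clean-up and OCR steps above might be chained, the Python below only builds the command lines for each page, so the plan can be inspected before anything is run. Everything specific in it is an assumption: the language code "vic" is a made-up name for the trained "Victorian English" data, the ImageMagick options are illustrative rather than tuned for these scans, and the file layout is hypothetical. The moshpytt correction, XSLT markup and database metadata steps are left out.

```python
from __future__ import annotations

from pathlib import Path


def preprocess_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ImageMagick command that cleans one page image for OCR."""
    return [
        "convert", str(src),
        "-colorspace", "Gray",  # drop colour information
        "-deskew", "40%",       # straighten slightly rotated scans
        "-normalize",           # stretch contrast across the full range
        str(dst),
    ]


def ocr_cmd(image: Path, lang: str = "vic") -> list[str]:
    """Build a Tesseract command; output goes to <image stem>.txt.

    "vic" is a hypothetical code for our own trained data.
    """
    out_base = image.with_suffix("")  # Tesseract appends .txt itself
    return ["tesseract", str(image), str(out_base), "-l", lang]


def plan(pages: list[Path], work_dir: Path, lang: str = "vic") -> list[list[str]]:
    """Return the commands for every page, in the order they should run."""
    cmds = []
    for page in pages:
        cleaned = work_dir / (page.stem + ".tif")
        cmds.append(preprocess_cmd(page, cleaned))
        cmds.append(ocr_cmd(cleaned, lang))
    return cmds


if __name__ == "__main__":
    for cmd in plan([Path("scans/p001.png")], Path("work")):
        print(" ".join(cmd))
```

Each command list could then be handed to subprocess.run(cmd, check=True) once the options have been checked against a few sample pages.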