Cleaning up the codebase and web materials following the migration to the new Allura system on SourceForge.
Worked on handout documents as we Skyped; next meeting tomorrow to discuss inclusion of MK book chapter in handout.
Took minutes.
Following the conversion of the TEI SourceForge project to the new system, which brought new URLs, reworked the Jenkins setup and build script, and started setting up to test the build script.
Skype call with SB, and then spent an hour preparing a second version of a handout for Oxygen because we cannot get a straight answer on which version of Oxygen will be available in the lab we'll be teaching in. This is incredibly frustrating.
Completed a handout sheet for DHSI. My Brown login is now working so I can use the SVN.
A bit of work on TEI tickets (removing data.code, which is now obsolete).
Telco with SB; made plans for changes to materials and course outline.
Worked on some TEI tickets.
Following the first few steps on the Google Code wiki for Tesseract to learn how to train it for a new language, I've used the moshpytt box editor on a sample file, and read through the other sample data. For any sufficiently large run of a journal which has consistent page-images, fonts, print quality etc., it looks like we may be able to do something like the following:
- Create an imagemagick script which optimizes the images for OCR.
- OCR some sample poems and use moshpytt to correct the results.
- Go through the rest of the training process, to create a complete training set for "Victorian English".
- Use a standard dictionary, but tweak it to remove distracting modern words, etc.
- Add XSLT for markup to the end of the toolchain.
- Add a step to pull in metadata from the db.
- Run it on the whole set, and get decent TEI-encoded transcriptions out the other end.
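A rough sketch of how the image-cleanup, OCR and dictionary steps above might be scripted. Nothing here is settled: the ImageMagick settings are untested guesses, the language code "vic" is a hypothetical name for the trained "Victorian English" data, and the function names and file paths are made up for illustration. The script only builds the command lines; it does not run them, and it leaves out the XSLT and metadata steps.

```python
from pathlib import Path
from typing import Iterable, List, Set


def preprocess_cmd(src: Path, dst: Path) -> List[str]:
    """Build an ImageMagick command to clean up one page image for OCR.

    The grayscale/deskew/threshold values are assumptions to be tuned
    against real scans.
    """
    return [
        "convert", str(src),
        "-colorspace", "Gray",   # drop colour information
        "-deskew", "40%",        # straighten slightly rotated scans
        "-threshold", "60%",     # binarize to black and white
        str(dst),
    ]


def ocr_cmd(image: Path, out_base: Path, lang: str = "vic") -> List[str]:
    """Build a Tesseract command; 'vic' is a hypothetical language code."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def filter_wordlist(words: Iterable[str], modern: Set[str]) -> List[str]:
    """Remove unwanted modern words from a dictionary word list."""
    return [w for w in words if w.lower() not in modern]


def plan(pages: Iterable[Path], work_dir: Path) -> List[List[str]]:
    """Return the commands for each page, in order, without running them."""
    steps = []
    for page in pages:
        cleaned = work_dir / (page.stem + ".tif")
        steps.append(preprocess_cmd(page, cleaned))
        steps.append(ocr_cmd(cleaned, work_dir / page.stem))
    return steps


if __name__ == "__main__":
    # Print the planned commands for one hypothetical page image.
    for cmd in plan([Path("scans/p001.jpg")], Path("work")):
        print(" ".join(cmd))
```

Each command list could be handed to subprocess.run once the settings have been checked against a few sample pages.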