Academic

March 20, 2013

TEI work

Posted on 20 Mar 2013 in Activity log

Cleaning up the codebase and web materials following the migration to the new Allura system on SourceForge.

Skype meeting with SB for DHSI prep

Posted on 20 Mar 2013 in Activity log

Worked on handout documents as we Skyped; next meeting tomorrow to discuss inclusion of MK book chapter in handout.

March 18, 2013

MVP Board meeting

Posted on 18 Mar 2013 in Activity log

Took minutes.

March 13, 2013

TEI work

Posted on 13 Mar 2013 in Activity log

Following the conversion of the TEI SourceForge project to the new system (with new URLs), reworked the Jenkins setup and build script, and started setting up to test the latter.

March 12, 2013

DHSI work

Posted on 12 Mar 2013 in Activity log

Skype call with SB, and then spent an hour preparing a second version of a handout for Oxygen because we cannot get a straight answer on which version of Oxygen will be available in the lab we'll be teaching in. This is incredibly frustrating.

February 22, 2013

DHSI work

Posted on 22 Feb 2013 in Activity log

Completed a handout sheet for DHSI. My Brown login is now working so I can use the SVN.

February 6, 2013

TEI work

Posted on 06 Feb 2013 in Activity log

A bit of work on TEI tickets (removing data.code, which is now obsolete).

February 5, 2013

DHSI prep

Posted on 05 Feb 2013 in Activity log

Teleconference with SB; made plans for changes to materials and course outline.

February 1, 2013

TEI work

Posted on 01 Feb 2013 in Activity log

Worked on some TEI tickets.

January 25, 2013

Research on Tesseract OCR (for VPN)

Posted on 25 Jan 2013 in Activity log

Following the first few steps on the Google Code wiki for Tesseract to learn how to train it for a new language, I've used the moshpytt box editor on a sample file, and read through the other sample data. It looks like we may be able to do something like the following for any sufficiently large run of a journal which has consistent page-images, fonts, print quality, etc.:

Create an ImageMagick script which optimizes the images for OCR.
OCR some sample poems and use moshpytt to correct the results.
Go through the rest of the training process, to create a complete training set for "Victorian English".
Use a standard dictionary, but tweak it to remove any modernity that's distracting, etc.
Add XSLT for markup to the end of the toolchain.
Add a step to pull in metadata from the database.
Run it on the whole set, and get decent TEI-encoded transcriptions out the other end.
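
The image-cleanup and OCR steps above could be driven by something like the following Python sketch. It only builds the command lines; it does not run them. Everything specific here is an assumption for illustration: the language code "vic" (standing in for a trained "Victorian English" data file), the "-clean.tif" naming, and the particular ImageMagick options, which would need tuning against real page-images. The XSLT and metadata steps are left out.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical code for the custom "Victorian English" training data.
LANG = "vic"


def preprocess_cmd(src: Path, dst: Path) -> List[str]:
    """ImageMagick command: greyscale, deskew and normalize one page image.

    The option values are guesses, not tested settings.
    """
    return ["convert", str(src), "-colorspace", "Gray",
            "-deskew", "40%", "-normalize", str(dst)]


def ocr_cmd(image: Path, out_base: Path, lang: str = LANG) -> List[str]:
    """Tesseract command; Tesseract adds the .txt extension to out_base."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def plan(pages: Iterable[Path], work_dir: Path) -> List[List[str]]:
    """Return the commands for every page, in page-name order."""
    commands = []
    for page in sorted(pages):
        cleaned = work_dir / (page.stem + "-clean.tif")
        commands.append(preprocess_cmd(page, cleaned))
        commands.append(ocr_cmd(cleaned, work_dir / page.stem))
    return commands
```

Each command list can be passed to subprocess.run(cmd, check=True) once ImageMagick and Tesseract are installed; keeping the planning separate from the execution makes it easy to inspect what will be run on a whole journal before starting.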