Confed debates: OCR workflow in progress

Posted by Martin on 03 Dec 2015 in Activity log

DH posted the first set of page-images to GitHub, so I started testing the process and building the toolchain. I've used Tesseract to generate hOCR and PDF files, following the methods piloted for ScanCan and documented there; I've written XSLT that builds a content-editable version of each page, with the page-image displayed, for editors to use; I've built the individual pages into a single uncorrected text-over-image PDF; I've learned how to set up Git Large File Storage (LFS) on my local repository so that I can commit the resulting file (which is over GitHub's 100MB per-file limit); I've tested checking out the repo on a machine without LFS configured, and discovered that you simply get a text file which is a pointer/placeholder for the large file; and I've begun work on an XSLT file to convert the corrected hOCR back to something that hocr2pdf can use, by cleaning out the extras we've added and fixing edited lines. Good progress, and all of it is probably portable straight over to the ScanCan project.
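Since a checkout without LFS silently leaves a placeholder where the big PDF should be, a quick check for that case might be handy before running anything against the file. What follows is only a rough sketch in Python, not part of the toolchain described above: it assumes the standard Git LFS pointer layout (a `version` line naming the v1 spec, then `oid` and `size` lines), and the function names and the `debates.pdf` filename in the usage note are hypothetical.

```python
from typing import Optional

# First line of a Git LFS v1 pointer file.
LFS_SPEC_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(data: bytes) -> Optional[dict]:
    """Return the pointer's key/value fields if data looks like a Git LFS
    pointer, otherwise None."""
    # Pointer files are tiny text files; a real PDF is binary and far larger.
    if len(data) > 1024 or not data.startswith(LFS_SPEC_PREFIX):
        return None
    fields = {}
    for line in data.decode("utf-8", errors="replace").splitlines():
        key, _, value = line.partition(" ")
        if key:
            fields[key] = value
    # A usable pointer needs both the object id and the real file size.
    if "oid" not in fields or "size" not in fields:
        return None
    return fields


def is_lfs_pointer(path: str) -> bool:
    """True if the file at path is an LFS placeholder rather than the real file."""
    with open(path, "rb") as handle:
        head = handle.read(1025)  # one byte over the limit flags large files
    return parse_lfs_pointer(head) is not None
```

Calling `is_lfs_pointer("debates.pdf")` (a made-up filename) on a machine without LFS configured should return `True`, in which case the `size` field from `parse_lfs_pointer` says how big the real file will be once it is fetched.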

This entry was posted by Martin and filed under Activity log.