Finished and tested the ant file (easier than Python in the end), and ran it on volumes 1, 2 and 14; the last is ready for our correction test tomorrow. Everything is being added to svn.
I've written both a Python script and an ant script to automate the OCR work, and I'm testing each. The ant script is not finished yet; the Python script fails its last task, which is generating the single-volume PDF.
The discovery that there's no way to save changes to contenteditable fields without JS and submission to a server (or horrible hacks with local storage) means that we can't go that route with this project, so we're back to editing HOCR in Oxygen (which is perfectly OK). I've reconfigured the project based on that. KS has also rescanned all the volumes into individual JPGs at higher res, and that stuff is all in svn now. I've also written some additional XSLT which can hopefully combine split words in response to the deletion of spaces between them, as well as eliminating line elements, so we have some hope of getting working hocr2pdf output. Progress of sorts.
After much experimentation, we have a workflow, which goes like this:
- Scan to PDFs in sets. At this stage, we might actually do this article by article, but it isn't really important.
- Split out the page-images with
pdfimages -j vol14.pdf vol14_1
- Convert any pbm images in the result to jpg:
for f in *.pbm ; do convert ./"$f" ./"${f%.pbm}.jpg"; done; rm *.pbm
- OCR the images to create HOCR files with Tesseract:
for i in *.jpg ; do tesseract $i $i hocr; done;
Note: I'm experimenting with using multiple dictionaries for this:
for f in *.jpg; do tesseract ./"$f" ./"${f%.jpg}" -l eng+fra+isl+nor+swe hocr; done
This depends on first having installed the additional dictionaries:
sudo apt-get install tesseract-ocr-fra tesseract-ocr-dan tesseract-ocr-fin tesseract-ocr-swe tesseract-ocr-nor tesseract-ocr-isl
- Manual correction of HOCR pages word-by-word using the Author Mode stylesheet I've created.
- Stripping of the line-level spans added by Tesseract, which cause problems for the hocr2pdf tool:
saxon -s:withlines.hocr -o:nolines.hocr -xsl:strip_lines.xsl
- Building single-page text-searchable PDFs using hocr2pdf:
hocr2pdf -i pageX.jpg -o pageX.pdf < nolines.hocr
- Combining the resulting pages into article- and review-length PDFs using GhostScript:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -sOutputFile=article.pdf $(ls page*.pdf)
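Putting the per-page steps above together, a minimal shell sketch of the pipeline looks something like the following. The input filename, output names and the strip_lines.xsl path are placeholders, and the manual HOCR correction in Oxygen of course happens between the tesseract and saxon steps rather than inside a loop; the real automation lives in the ant/Python scripts described elsewhere in these notes.
# sketch only: assumes the Tesseract language data, Saxon, hocr2pdf and Ghostscript are all installed
pdfimages -j vol14.pdf vol14_1
for f in *.pbm ; do convert ./"$f" ./"${f%.pbm}.jpg"; done; rm *.pbm
for f in *.jpg ; do
  tesseract ./"$f" ./"$f" -l eng+fra+isl+nor+swe hocr   # writes $f.hocr
  # (manual word-by-word correction of $f.hocr happens at this point)
  saxon -s:"$f.hocr" -o:"$f.nolines.hocr" -xsl:strip_lines.xsl
  hocr2pdf -i "$f" -o "$f.pdf" < "$f.nolines.hocr"      # writes $f.pdf
done
# combine whichever page PDFs belong to a single article or review
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -sOutputFile=article.pdf $(ls *.jpg.pdf)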
We also now have the option of building simple TEI from the HOCR files, which is quite exciting.
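The details of that transform aren't settled yet; presumably it would be another Saxon pass over the corrected HOCR, along the lines of the following, where hocr2tei.xsl is a hypothetical stylesheet name:
saxon -s:corrected.hocr -o:corrected_tei.xml -xsl:hocr2tei.xsl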
Met with KS and HT to discuss the progress on OCR. All volumes have been done, and combined into PDFs, one or two for each volume depending on size. KS has shared the results with us on Google Drive.
Volume 14 is going to be our pilot test for OCR and correction. I've done the following with volume 14:
- Downloaded the two PDF parts
- Pulled out all the images:
pdfimages -j vol14_part1.pdf vol14_1
pdfimages -j vol14_part2.pdf vol14_2
- This resulted in one JPG and a set of pbms, so I converted the pbms to JPGs:
for f in *.pbm ; do convert ./"$f" ./"${f%.pbm}.jpg"; done; rm *.pbm
- Used Tesseract to generate individual OCRed text files for each page:
for i in *.jpg ; do tesseract $i vol14_text$i; done;
- Used the Tesseract OCR engine to OCR each image and turn it into a text-over-image PDF for that individual page:
for i in *.jpg ; do tesseract $i $i pdf; done;
- Used Ghostscript to combine all those PDFs into a single PDF for the volume:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -sOutputFile=vol14.pdf $(ls *.jpg.pdf)
The result is 127MB in size; that's quite large for a single download, even though the images are b/w rather than grayscale, so I think our plan to create separate PDFs for each article makes sense.
I posted the file on Google and shared it with KS and HT. I also shared a text file which is the actual OCR results from one of the pages with a block of Icelandic in it. (This is the text from page 12 of the document.)
The results of the OCR are not as bad as I feared, but they're not great when it comes to the non-English text. In the text-over-image PDF, you don't see the "text" in the background, of course, unless you select and copy-paste into another document, so the errors are not so apparent. However, if we use the text output for searching, and return fragments with search results highlighted, we'll be showing any errors that are still in there.
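One quick way to see exactly what text a reader would get from the hidden layer is to dump it with pdftotext (from the same Poppler suite as pdfimages); for example, to pull just page 12 of the combined volume (the filename here is illustrative):
pdftotext -f 12 -l 12 vol14.pdf -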
Our plan goes in the following stages:
- Get full-document text-over-image PDFs created and posted to the website for download (my job). Downloads will be fed from NFS, not hosted inside Cocoon. There is an existing index page for these, so we can just add these links.
- KS will correct the OCRed pages for vol 14 and gather them into subfolders with their images for each article and review. PROBLEM: if we work from corrected plain-text OCR files, we lose the positional information that an HOCR file records and that is needed to create text-over-image PDFs from the results. TASK for me:
- See if Tesseract or GhostScript can be used to combine an HOCR file with an image to create a text-over-image PDF (after the HOCR has been manually edited). If so:
- Try to write an author-mode CSS that enables safe correction of the HOCR file. If workable:
- Set up a workflow to use this method of correcting and then re-combining the OCR and images to create article-level indexed texts.
- Based on the results from above, we will either:
- Index and make searchable the HOCR files in eXist/Cocoon; or
- Index and make searchable the OCR output text files in eXist/Cocoon.
...entered. This one's now ready for publication, I think.
Met with HT, KS and GN to plan the scanning of old versions. Ahead of the meeting, I recovered old spine-clipped issues of all the journals except issue 1 from our storage; we checked these and they're good. HT will investigate getting hold of issue 1. GN has started the process of acquiring the scanner, and it will be initially attached to a Mac which will be next to Salander; the issues are on the shelf above that. The project should get under way while I'm away, so GN will assist with startup.
Still waiting on bio and keywords.
Nothing in my queue now for vol 23.
One only remains to be done.