Monument 2025-09-02 to 2025-09-05
to : Martin Holmes
Minutes: 380
Over the long weekend, we received a request (with a deadline of the end of Tuesday) to run the name confirmation checks on the PDF which will be used to carve the monument stones. Unfortunately, it turned out that in the PDF, the names had been rendered into vector graphics instead of being retained as text, which meant that there was no text to extract from the PDF, with the exception of some headings and numbers. Checked with JG about this, and he is going to try to find a pre-conversion version. In the meantime, I experimented with rendering the PDF to images and then OCRing the images, but the first attempt was quite inaccurate because of two things: pdftoppm’s default output is only 150 dpi, and Tesseract by default tends to use an English dictionary, which means that anything ambiguous gets borked because the names are romaji and do not map to English words. On Tuesday, I tried another approach, first rendering to 600 dpi PNGs (to avoid lossy compression), then suppressing dictionaries when running Tesseract. That was remarkably unsuccessful too; the error rate was one in every eight names.
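For reference, the second OCR attempt could be sketched roughly like this. It is only a sketch, assuming the standard pdftoppm and tesseract command-line tools are on the PATH; the helper names, the file names and the "page" output prefix are hypothetical, and the exact options used in the actual run are not recorded in this entry.

```python
import subprocess
from pathlib import Path


def render_cmd(pdf: str, prefix: str, dpi: int = 600) -> list[str]:
    # pdftoppm: -r sets the resolution in dpi (the default is 150),
    # -png writes lossless PNG files named <prefix>-<page>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def ocr_cmd(image: str, out_base: str) -> list[str]:
    # Switch off Tesseract's word lists so that romaji names are not
    # pulled towards English dictionary words.
    return [
        "tesseract", image, out_base,
        "-c", "load_system_dawg=0",
        "-c", "load_freq_dawg=0",
    ]


def ocr_pdf(pdf: str, workdir: str) -> None:
    # Render every page, then OCR each page image to <image name>.txt.
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_cmd(pdf, str(out / "page")), check=True)
    for png in sorted(out.glob("page-*.png")):
        subprocess.run(ocr_cmd(str(png), str(png.with_suffix(""))), check=True)
```

Calling ocr_pdf("monument.pdf", "ocr_out") with a hypothetical file name would leave one text file per page in ocr_out. Keeping the command construction in separate functions makes it easy to try other resolutions or Tesseract settings without touching the loop.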
Later in the day we received a new version of the file with the text intact, so I was able to start rewriting the code to do the check; this needs substantial revision since the structure and formatting have changed considerably. By the end of the day I was about half-way there, generating good clean text output with no extraneous content; the only bit that remains to be done is parsing out the places and namelists correctly.
Continuing work on this into Wednesday, I finally figured out all the data-cleaning mechanisms needed, and got a working tagged version of the PDF text; the comparison threw up a handful of errors, which I debugged and reported to JG.
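The comparison step might look something like the following. This is a minimal sketch, not the project's actual code: it assumes that both the reference data and the tagged PDF text can be reduced to a mapping from place to a list of names, and the function names, the whitespace normalisation and the sample data are all made up for illustration.

```python
from collections import Counter


def normalize(name: str) -> str:
    # Collapse runs of whitespace so spacing differences are not reported.
    return " ".join(name.split())


def compare_names(expected: dict[str, list[str]],
                  found: dict[str, list[str]]) -> dict[str, dict[str, list[str]]]:
    # For each place, list names that are in the reference data but not in
    # the PDF ("missing") and names in the PDF but not in the reference
    # data ("extra"). Counters are used so that repeated names are counted.
    report = {}
    for place in sorted(set(expected) | set(found)):
        exp = Counter(normalize(n) for n in expected.get(place, []))
        got = Counter(normalize(n) for n in found.get(place, []))
        missing = sorted((exp - got).elements())
        extra = sorted((got - exp).elements())
        if missing or extra:
            report[place] = {"missing": missing, "extra": extra}
    return report


# Invented sample data, for illustration only.
reference = {"Place A": ["Sato  Taro", "Ito Ken"]}
from_pdf = {"Place A": ["Sato Taro", "Ito Kan"]}
print(compare_names(reference, from_pdf))
```

An empty report would correspond to a clean check; a misspelled name shows up as one "missing" entry paired with one "extra" entry under the same place.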
On Friday, got a revised PDF and ran it through the check; no problems were found.