HCMC Journal

Monument 2026-02-23 to 2026-02-27

to : Martin Holmes
Minutes: 850

At the weekend and on Monday, spent many hours checking a vector version of the data against the original dataset. The biggest problem is that while plain OCR is full of errors, many of them are mechanically correctable; the AI input also has many errors, but most of them are plausible, so the amount of manual checking required is just as great. Found one error in the original dataset, but by the end of the day I still had a huge amount of checking to do.

I also worked on integrating more translations into the map JS, which is a bit tricky.

On Tuesday, spent a couple of hours finishing the PDF check work, which was very tedious, but found no more errors. Then continued with the work to abstract captions from the map JS.

On Thursday, began receiving the stonecutter’s drawings, and rendering them to images for OCR. They will need to be appropriately cropped. Also spent some time discussing layout and rendering issues with PS. Then reworked person pages to include a heavily-weighted but invisible paragraph containing all variants of the person’s name, to allow for more accurate searching, and fixed a couple of inconsistencies in the place-listing rendering.

On Thursday, got the last set of images, and MM and I cropped them all to the same size so we can use them on the site. Then I ran the OCR, corrected the obvious misreadings, and ran the result through the check process; around 500 mismatches remain to be checked manually, and most look straightforward to resolve.
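
For illustration only, here is a minimal sketch of the kind of check described above. It is not the actual check process used on the project. It assumes the OCR output and the reference dataset are available as parallel lists of strings, one entry per record; the function name `find_mismatches` and the sample data are hypothetical. It reports the pairs that differ after whitespace is normalised, most similar first, so that the likely simple misreadings come up before anything else.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Collapse runs of whitespace so spacing differences are not reported."""
    return " ".join(text.split())


def find_mismatches(ocr_lines, reference_lines):
    """Return (index, ocr_text, reference_text, similarity) for each differing pair.

    Assumes the two lists are aligned record by record (an assumption made
    for this sketch). Results are sorted with the most similar pairs first.
    """
    mismatches = []
    for index, (ocr, ref) in enumerate(zip(ocr_lines, reference_lines)):
        ocr_n, ref_n = normalise(ocr), normalise(ref)
        if ocr_n != ref_n:
            similarity = SequenceMatcher(None, ocr_n, ref_n).ratio()
            mismatches.append((index, ocr_n, ref_n, similarity))
    mismatches.sort(key=lambda m: m[3], reverse=True)
    return mismatches


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the project.
    ocr = ["John Smith", "Mary  Jones", "Wm. Brovvn"]
    reference = ["John Smith", "Mary Jones", "Wm. Brown"]
    for index, ocr_text, ref_text, _ in find_mismatches(ocr, reference):
        print(f"record {index}: OCR {ocr_text!r} vs dataset {ref_text!r}")
```

Sorting by similarity is a design choice for this sketch: pairs that differ by one or two characters are usually quick to resolve by eye, which leaves the harder cases for the end.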