HCMC Journal: PWFC and LOI 2025-07-28 to 2025-08-01

28 July 2025 to 01 August 2025: Martin Holmes
Minutes: 240

On Monday, discussed encoding and rendering strategies for vertical Japanese text with AN and NA.

On Tuesday, had a long meeting with the team and discussed progress and work that is needed from me. My main task is to add a full build of the proofing versions of documents, along with listing pages, so that anyone can easily get an overview of the state of the document collection. This should also validate the HTML generated, so we catch issues early. These renderings are not intended to be the final end-user views, but those may descend from them, so tightening them up nice and early is ideal.

I also met with NA to get the report on her work examining the LOI/NNM missing-record problem, which I summarize here for the record (and this was also sent to JSR). There are three main issues:

A collection of fonds was explicitly excluded from the LOI repo at LU’s request (reasons not given, but presumably for privacy, irrelevance, or sensitivity?). The links to these fonds in the LOI repo were commented out, but the metadata files were left in place, meaning that they show up as bad links in our diagnostic (and they are bad links on the site). This covers over 900 of the 1402 problem links shown in the diagnostic, and I think I can remedy this to some extent programmatically by extracting the list of commented-out fonds and processing the relevant metadata files to flag them for exclusion from the repository.
There are some cases (about two dozen so far from NA’s sampling) where PDFs are missing from LOI but do exist in the NNM DropBox. Those cases can be fixed by copying over those files to LOI, after checking that NNM didn’t want to exclude them for some reason.
There are some cases (again, about two dozen so far) where our metadata has a note to the effect that “No digital copies of the records were acquired by the Landscapes of Injustice Research Collective between 2014 and 2018.” and/or “This record was not digitized.” The former is a variation on the standard “The digital copies of the records were acquired by the Landscapes of Injustice Research Collective between 2014 and 2018”, and the switch to “No” suggests that at some point, it was decided that digital surrogates of these records shouldn’t/couldn’t/wouldn’t be acquired for some reason. There are over 500 instances of this text in the metadata, and it looks like just under 400 of them may be associated with links to PDFs, but because the record structure is rather arcane, it’s going to take human intervention to determine how many of these are cases where a PDF link was erroneously left in place after it was determined that we wouldn’t be acquiring a PDF.
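
The programmatic step for the first issue might look roughly like the following Python sketch. It is only an illustration under stated assumptions: it supposes that the excluded fonds appear as XML comments wrapping elements with a target attribute, and the file names, element names and function names here are all hypothetical. The real LOI layout may well differ.

```python
import re

# Matches XML comments, including ones that span several lines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Assumption: each commented-out link carries a target="..." attribute.
TARGET_RE = re.compile(r'target="([^"]+)"')


def commented_out_targets(xml_text: str) -> set[str]:
    """Return the target values that occur inside XML comments."""
    targets = set()
    for comment in COMMENT_RE.findall(xml_text):
        targets.update(TARGET_RE.findall(comment))
    return targets


def partition_bad_links(bad_links, excluded):
    """Split bad links into (deliberately excluded, still unexplained)."""
    explained = sorted(link for link in bad_links if link in excluded)
    unexplained = sorted(link for link in bad_links if link not in excluded)
    return explained, unexplained


if __name__ == "__main__":
    # Invented sample data, not taken from the LOI repo.
    sample = """
    <list>
      <item><ref target="fonds_a.xml">Fonds A</ref></item>
      <!-- <item><ref target="fonds_b.xml">Fonds B</ref></item> -->
      <!--
        <item><ref target="fonds_c.xml">Fonds C</ref></item>
      -->
    </list>
    """
    excluded = commented_out_targets(sample)
    print(partition_bad_links(["fonds_b.xml", "fonds_d.xml"], excluded))
```

Splitting the diagnostic output this way would separate the links that are explained by the deliberate exclusion from the ones that still need a human to look at them.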

There are probably also cases where the NNM DropBox contains digital surrogates which were intended to be ingested into LOI but never were; I don’t imagine it will make sense to try to seek those out, because they aren’t actually errors in LOI.

On Thursday, made two significant changes to the PWFC setup. First, there are maps and a blueprint that will need geometric zone encoding, so I customized a copy of BreezeMap so that it outputs zone elements containing well-formed GeoJSON objects (following our GIS WG decision on how GeoJSON should be handled) to enable encoders to use <surface> and <zone> to encode these documents. There is a lot more work to do here, especially with regard to rendering. Second, I enhanced the proofing XSLT so that it can output all the documents along with an index to them, and added this to the regular build process, so that we can now look at our progress.
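
To make the zone encoding a little more concrete, here is a minimal Python sketch of a TEI zone element whose text content is a GeoJSON object, together with a check that the content parses as JSON. This is not BreezeMap's actual output: the use of a GeoJSON Feature, the function names, the id and the coordinates are all assumptions made for illustration, and the GIS WG decision may specify a different shape.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)


def make_zone(zone_id: str, geometry: dict) -> ET.Element:
    """Build a TEI <zone> whose text content is a GeoJSON Feature (assumed shape)."""
    zone = ET.Element(f"{{{TEI_NS}}}zone")
    zone.set(f"{{{XML_NS}}}id", zone_id)
    feature = {"type": "Feature", "properties": {}, "geometry": geometry}
    zone.text = json.dumps(feature)
    return zone


def zone_geojson(zone: ET.Element) -> dict:
    """Parse a zone's text as JSON; raises ValueError if it is not usable."""
    data = json.loads(zone.text or "")  # JSONDecodeError is a ValueError
    if not isinstance(data, dict) or data.get("type") != "Feature" or "geometry" not in data:
        raise ValueError("zone content is not a GeoJSON Feature")
    return data


if __name__ == "__main__":
    # Invented rectangle; a GeoJSON polygon ring must end where it starts.
    geometry = {
        "type": "Polygon",
        "coordinates": [[[0, 0], [100, 0], [100, 50], [0, 50], [0, 0]]],
    }
    zone = make_zone("zone1", geometry)
    print(ET.tostring(zone, encoding="unicode"))
    print(zone_geojson(zone)["geometry"]["type"])
```

A check of this kind only confirms that the JSON is well-formed and has the expected top-level keys; it says nothing about whether the geometry itself is sensible.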

On Friday, dealt with some schema fallout from the decisions on GeoJSON encoding, and raised a TEI ticket to complain about text not being allowed in <zone> elements in <facsimile> because of Schematron, although the content model allows it.