
HCMC Journal

Ben Jonson 2025-10-28 to 2025-10-31

to : Martin Holmes
Minutes: 595

On Tuesday, continued work to get well-formed XML versions of the HTML so we can actually process it usefully. The sticking point is inline JavaScript, which contains constructs that are invalid in XML, such as unescaped ampersands, and so needs to be wrapped in CDATA; because it’s slightly different in every file, it’s difficult to process with regex tools, so a certain amount of the work is manual.
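For illustration only, here is a minimal Python sketch of the kind of regex pass that can handle the simple cases. It is not the tooling actually used on this project. The function names `wrap_inline_scripts` and `is_well_formed` are hypothetical, and the sketch assumes script tags whose attributes contain no `>` characters. Anything already containing a CDATA marker or a `]]>` sequence is left untouched for manual attention.

```python
import re
import xml.etree.ElementTree as ET

# Matches an inline <script> element: opening tag, body, closing tag.
# Assumption: no attribute value in the opening tag contains ">".
SCRIPT_RE = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script\s*>)",
    re.IGNORECASE | re.DOTALL,
)


def wrap_inline_scripts(html: str) -> str:
    """Wrap the body of each inline script in a comment-guarded CDATA section."""

    def wrap(match: re.Match) -> str:
        open_tag, body, close_tag = match.groups()
        # Skip empty scripts (e.g. those with only a src attribute) and any
        # body that already has CDATA markers or would end the section early.
        if not body.strip() or "<![CDATA[" in body or "]]>" in body:
            return match.group(0)
        # The // guards keep the markers harmless if the file is ever
        # parsed as HTML rather than XML.
        return f"{open_tag}//<![CDATA[\n{body}\n//]]>{close_tag}"

    return SCRIPT_RE.sub(wrap, html)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML (well-formedness only, not validity)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    sample = "<div><script>if (a && b) { go(); }</script></div>"
    print(is_well_formed(sample))
    print(is_well_formed(wrap_inline_scripts(sample)))
```

Running the pass a second time changes nothing, because wrapped scripts are skipped. A parser-based check like `is_well_formed` is a quick way to find the files that still need hand editing.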

I finished this work on Wednesday, so I was able to make a basic start on the XSLT that will make them not only well-formed but valid and functional; I also started adding lib resources to the GH repo to support build processes, including a fresh build of the validator, built from the repo.

On Thursday, set up a full workflow with XSLT and validation of output, and then began working on the remediation XSLT. Made a lot of progress.

On Friday, finally worked my way towards actual valid XHTML. It took a lot of work, but 99.9% of it is happening in XSLT, and only a tiny number of changes were made to the source from the crawl. TO DO: