Ben Jonson 2025-10-28 to 2025-10-31
To: Martin Holmes
Minutes: 595
On Tuesday, continued work to get well-formed XML versions of the HTML so we can actually usefully process it. The sticking point is inline JavaScript, which contains constructs such as bare ampersands that are invalid in XML, and so needs to be wrapped in CDATA; because it's slightly different in every file, it's difficult to process with regex tools, so a certain amount of the work is manual.
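For illustration only (the real fix was partly manual), a minimal XSLT 3.0 sketch of the automatable part, assuming the crawled files are read as raw text and the script contents are wrapped in the usual commented-out CDATA markers; the input parameter and the regex are placeholder assumptions, not the actual code:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: read a crawled HTML file as raw text and wrap the
     content of inline <script> elements in CDATA markers so that bare
     ampersands no longer break XML parsing. Run with e.g. Saxon and -it. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:output method="text"/>
  <!-- Placeholder input path. -->
  <xsl:param name="inputFile" select="'crawl/page.html'"/>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="raw" select="unparsed-text($inputFile, 'UTF-8')"/>
    <xsl:analyze-string select="$raw" flags="si"
        regex="(&lt;script[^&gt;]*&gt;)(.*?)(&lt;/script&gt;)">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(1)"/>
        <xsl:choose>
          <!-- Only wrap scripts that actually have inline content. -->
          <xsl:when test="normalize-space(regex-group(2))">
            <xsl:text>/*&lt;![CDATA[*/</xsl:text>
            <xsl:value-of select="regex-group(2)"/>
            <xsl:text>/*]]&gt;*/</xsl:text>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="regex-group(2)"/>
          </xsl:otherwise>
        </xsl:choose>
        <xsl:value-of select="regex-group(3)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>

Files where the script already contains CDATA markers, or a literal </script> inside a string, still need hand-fixing, which is where the manual work comes in.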
I finished this work on Wednesday, so I was able to make a basic start on the XSLT that will make them not only well-formed but valid and functional; I also started adding lib resources to the GH repo to support build processes, including a fresh build of the validator, built from the repo.
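For the record, the general shape of that stylesheet is an identity transform that recreates everything in the XHTML namespace and strips whatever the validator objects to. This is a sketch of the approach rather than the actual code, and the dropped attributes are assumed examples:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:output method="xhtml" html-version="5" omit-xml-declaration="yes" indent="no"/>

  <!-- Recreate every element in the XHTML namespace. -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}" namespace="http://www.w3.org/1999/xhtml">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Copy attributes, text, comments and PIs through unchanged by default. -->
  <xsl:template match="@*|text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>

  <!-- Example remediation: drop presentational attributes the validator
       rejects (placeholders, not the real list). -->
  <xsl:template match="@align | @border"/>
</xsl:stylesheet>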
On Thursday, set up a full workflow with XSLT and validation of output, and then began working on the remediation XSLT. Made a lot of progress.
On Friday, after a lot of work, finally got to actual valid XHTML; 99.9% of the fixing happens in the XSLT, and only a tiny number of changes were made to the source files from the crawl. TO DO:
- Images from the crawl have things such as commas and %2F in their folder/path names. These must be fixed in the crawl itself, as well as in the links to them (see the sketch after this list). Most images must retain most of their paths, since the file names tend to be identical (default.jpg).
- All assets need to be gathered up and copied to the appropriate locations in the output structure. This will probably throw up a range of assets we don't yet have and will have to acquire.
- The functionality and value of the paginated versions of texts will have to be looked at. My instinct is that there’s no value in these many fragments, and that we should just remove all pagination and serve full documents. They were probably introduced to spare compute power on the back-end, which is not a concern for us.
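For the link side of the image-path problem above, something along these lines should work once the files themselves have been renamed; the character substitutions here are assumptions and must mirror whatever renaming is applied to the crawl on disk:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xpath-default-namespace="http://www.w3.org/1999/xhtml"
                version="3.0">
  <!-- Copy everything through unchanged... -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- ...except links to crawled assets, whose problem characters are
       normalized. The mapping (decode %2F, turn commas and whitespace into
       underscores) is a placeholder, and assumes the documents are already
       in the XHTML namespace. -->
  <xsl:template match="img/@src | script/@src | link/@href">
    <xsl:attribute name="{name()}"
      select="replace(replace(., '%2F', '/', 'i'), '[,\s]+', '_')"/>
  </xsl:template>
</xsl:stylesheet>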