HCMC Journal

rescueTagSoup 2024-10-01 to 2024-10-04

to : Martin Holmes
Minutes: 450

Over the weekend, got the basic process working and tests passing, although there is a lot more work to do on two fronts: first, identifying and remediating expected issues with HTML5 input, and second, fixing up old HTML4-and-below input, which is not strictly necessary but would be very nice to have. I also raised a ticket for a possible optional step to make output CORS-friendly.

On Tuesday got back into it with some more realistic data testing with GN. Also fixed issues with maintaining nested structure, and added many more templates for minor HTML5 compliance issues.

On Wednesday, worked more on the HTML4 and below issues, and made significant progress; ran a test on an entire WP site scrape and the results were encouraging. However, I also discovered that the VNU validator is not handling modern CSS, including nested rulesets, which will present a problem for us.

On Thursday, started on the process which will retrieve external files, and got a good way into it. There will need to be de-duping and renaming, because many of the file URLs are basically queries with parameters, and they result in awful filenames when retrieved with wget. But the process seems quite feasible.
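
As a rough illustration of the de-duping and renaming problem, here is a minimal Python sketch. It is not the project's actual code: the function names (safe_filename, dedupe), the hash lengths, and the naming scheme are all assumptions made for the example. The idea is to drop the query string from the local filename, replace it with a short hash so that distinct queries stay distinct, and reuse one local name whenever two URLs return byte-identical content.

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def safe_filename(url: str, max_stem: int = 40) -> str:
    """Derive a short, filesystem-safe name from a URL (hypothetical scheme)."""
    parts = urlsplit(url)
    path = PurePosixPath(parts.path)
    # Keep only conservative characters in the stem; fall back to "file".
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", path.stem).strip("_") or "file"
    stem = stem[:max_stem]
    suffix = re.sub(r"[^A-Za-z0-9.]+", "", path.suffix.lower())
    if not parts.query:
        return f"{stem}{suffix}"
    # A short hash of the query keeps different queries apart
    # without copying the parameters into the filename.
    digest = hashlib.sha256(parts.query.encode("utf-8")).hexdigest()[:8]
    return f"{stem}_{digest}{suffix}"


def dedupe(files: dict[str, bytes]) -> dict[str, str]:
    """Map each URL to a local name, sharing one name for identical content."""
    name_for_content: dict[str, str] = {}
    used: set[str] = set()
    result: dict[str, str] = {}
    for url, data in files.items():
        content_key = hashlib.sha256(data).hexdigest()
        if content_key in name_for_content:
            # Same bytes already stored: point this URL at the existing file.
            result[url] = name_for_content[content_key]
            continue
        base = PurePosixPath(safe_filename(url))
        name = str(base)
        counter = 1
        # Different content that maps to the same name gets a numeric suffix.
        while name in used:
            name = f"{base.stem}-{counter}{base.suffix}"
            counter += 1
        used.add(name)
        name_for_content[content_key] = name
        result[url] = name
    return result
```

With this scheme a URL such as https://example.com/css/style.css?ver=6.4.2 would be stored as style_ followed by eight hex characters and .css, and two URLs that differ only in a cache-busting parameter but return the same bytes would share a single local file. Links in the rescued HTML would then need rewriting from the returned mapping.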