Starting the rescueTagSoup project
: Martin Holmes, Greg Newton
Minutes: 165
Following initial experiments with Saxon and VNU, GN and I discovered that we can use the nu.validator project’s htmlparser-1.4.jar directly, with an invocation like this:
java -cp htmlparser-1.4.jar nu.validator.htmlparser.tools.HTML2XML inputsoup.html output.html
to generate well-formed XML in the XHTML namespace. That only gets us part of the way to what we want, but the remaining fixes should be manageable with an ordinary XSLT identity transform. What is left is to build out those templates based on real-life examples, and to write a wrapper framework so we can transform an entire scraped site automatically, leaving only a few manual fix-ups at the end. I’ve created a new repo on GitHub at UVicHCMC/rescueTagSoup, and started adding the things we need.
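As a rough illustration of the identity-transform approach, here is a minimal sketch. It is not code from the repo: the class name IdentityFixup and the single fix-up template (unwrapping font elements while keeping their content) are hypothetical examples of the kind of rule we expect to add. It uses the XSLT 1.0 processor built into the JDK rather than Saxon, purely so that it runs without extra jars.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: identity transform plus one illustrative fix-up rule.
class IdentityFixup {

    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'"
      + " exclude-result-prefixes='h'>"
        // Identity template: copy every node and attribute unchanged.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
        // Example override (an assumption, not a rule from the project):
        // drop the font element itself but keep processing its children.
      + "<xsl:template match='h:font'>"
      + "<xsl:apply-templates select='node()'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet above to a well-formed XHTML string.
    static String transform(String xml) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Stand-in for the kind of XHTML the HTML2XML step produces.
        String input = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
            + "<p><font color='red'>Hello</font> world</p></body></html>";
        System.out.println(transform(input));
    }
}
```

Each further fix-up found in real pages would become another small template alongside the identity rule, which leaves everything it does not match untouched.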