Fixed a huge number of errors in the source XML, and added some pre-processing to fix many of them. Set up three levels of validation for the incoming XML and the generated XML as part of the build process, and then started hacking away at the remaining errors in the HTML output. What's left, as far as I can see, is just the problem of milestone elements showing up as direct children of e.g. list elements. That's on the table to be fixed next week.
Category: "Activity log"
After some investigation, I found that the best option for validating XHTML5 is not the RNC schemas from whattf, which depend on datatype definitions in their own namespace that I can't seem to find a jar implementation of; instead, the VNU project provides a single JAR file which can do it. I've integrated that into the project and I'm now doing validation of XHTML5 pages produced by the build.
Two sets of errors were easily fixed: first, the encoders seem to have been making up language codes as they went along, so there were hundreds of invented ones. I've fixed all those. Second, the files are not in Unicode NFC. I have XSLT to fix that in the XML, but I haven't run it yet; instead, I built normalization into the pre-processing of the XML, which solves the problem for the XHTML output, and also protects against bad data arriving in the future. But I will fix the normalization in the source XML files soon.
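As an illustration of what the normalization step does (the actual fix here lives in the XSLT pre-processing, not in Python), a minimal sketch using the standard library's unicodedata module follows. The helper names to_nfc and is_nfc are hypothetical, and is_normalized needs Python 3.8 or later.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Report whether the text is already in NFC (Python 3.8+)."""
    return unicodedata.is_normalized("NFC", text)


# 'e' followed by a combining acute accent: two code points, not NFC.
decomposed = "e\u0301"
# After normalization the pair collapses to the single precomposed character.
composed = to_nfc(decomposed)
```

Running the check on incoming files before the build would flag any that still need the one-off fix in the source XML.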
What's left is about 1100 errors of the expected type (divs inside spans and the like), each of which will have to be looked at in the hope that generic fixes that cover lots of them can be found. Meanwhile, validation of XHTML5 is now a solved problem, very usefully, and I'll be able to port this fix to other contexts. I don't know if I can find a way to make it validate all the fragment files, though. One option there would be to build them all into a single file, validate that, and then delete it.
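The single-file idea for the fragments could look something like the sketch below. It assumes each fragment is a well-formed, single-rooted element that declares the XHTML namespace itself; the function name wrap_fragments and the wrapper title are made up for the example, and the real build would more likely do this in Ant or XSLT.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

XHTML_NS = "http://www.w3.org/1999/xhtml"


def wrap_fragments(fragments: Iterable[str]) -> str:
    """Build one throwaway XHTML document whose body holds every fragment."""
    # Serialize XHTML elements without a prefix (this setting is process-wide).
    ET.register_namespace("", XHTML_NS)
    html = ET.Element(f"{{{XHTML_NS}}}html")
    head = ET.SubElement(html, f"{{{XHTML_NS}}}head")
    title = ET.SubElement(head, f"{{{XHTML_NS}}}title")
    title.text = "Fragment validation wrapper"
    body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
    for fragment in fragments:
        # Each fragment must parse on its own; a parse error here is
        # already a useful well-formedness check.
        body.append(ET.fromstring(fragment))
    return ET.tostring(html, encoding="unicode")
```

The resulting string would be written to a temporary file, passed to the validator, and deleted afterwards. One caveat: id values that are unique within each fragment may collide once the fragments share a document, which would show up as spurious errors.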
We now have a working job on our HCMC Jenkins box which builds the static Mariage site, thanks to anonymous read-only checkout from SVN.
This is part of the cleanup of XML prior to the new site building.
Over the weekend and today I've got the transformation working so that it now builds a completely updated version of the XML which complies with the current P5 (and some upcoming changes/deprecations), as well as abstracting all the styles into rendition elements. I've built an Ant build file which generates all this stuff automatically from the original XML, and then uses that to start building the web output. That means I can focus on website generation work without having to update the existing site code to take account of changes in XML, which is a relief. I'm building this so that it will run outside of Oxygen, in anticipation of its running on a build server we plan for HCMC.
I now have primary source text documents rendering, although in many cases the results are not pretty. There are some fixes to be made to the original documents, I think. We may also decide on putting menus at the top rather than down the side, which will involve some tweaks (but not much, really, because the new layout is designed to detach the site structure from the document display as much as possible). Tested the build on the laptop and it works great.
I've started the process of re-building the Mariage schema, XML and file structures. I have a "pre-process_xml.xsl" stylesheet which reworks the XML, moving everything from @style to @rendition/rendition elements in the header, and this is partially tested; it has a couple of bugs still to work out, in that it generates duplicate @xml:id values for a handful of fw elements. The ODD file itself has been partly changed, but is not yet generating a fully-working schema (@rendition does not seem to be available), so I'm working on that now. Once the schema is working, it will disallow various old habits such as biblScope/@type in favour of @unit.
Todo:
- Fix the bug in xml:id generation for fw elements.
- Fix the ODD file so a schema allowing @rendition everywhere is created.
- Test the transformation and make sure it's producing fully-valid and unbroken XML files from the source documents.
- Update all XSL that looks at biblScope/@type on the site so it can handle @unit alongside it.
- Test-upload some transformed documents and ensure that they are displaying correctly in the interface. Fix any bugs arising out of that.
- Commit all converted files in one batch, after checking with CC for a good time to do it.
- Warn CC about changes.
- Upload everything into eXist and test extensively.
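For reference, the @style-to-@rendition move and the duplicate xml:id check from the first todo item can be sketched as below. This is an illustration in Python rather than the XSLT in pre-process_xml.xsl; the function names and the rnd_N id pattern are hypothetical, and the sketch assumes the rendition elements end up somewhere in the header (it only builds them, it does not place them).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def abstract_styles(root):
    """Swap every @style for a @rendition pointer; return {css: rendition id}."""
    ids = {}
    for el in root.iter():
        css = el.attrib.pop("style", None)
        if css is None:
            continue
        css = " ".join(css.split())  # collapse whitespace so equal styles match
        # Reuse the id for a style already seen, otherwise mint the next one.
        rid = ids.setdefault(css, f"rnd_{len(ids) + 1}")
        el.set("rendition", f"#{rid}")
    return ids


def rendition_elements(ids):
    """Build one TEI rendition element per distinct style string."""
    out = []
    for css, rid in ids.items():
        rendition = ET.Element(f"{{{TEI_NS}}}rendition",
                               {XML_ID: rid, "scheme": "css"})
        rendition.text = css
        out.append(rendition)
    return out


def duplicate_ids(root):
    """Return the set of xml:id values that occur more than once."""
    seen, dupes = set(), set()
    for el in root.iter():
        value = el.get(XML_ID)
        if value is None:
            continue
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes
```

Running duplicate_ids over each transformed file would be a cheap regression check for the fw bug before the batch commit.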
Mariage is my pilot project for the notion of static site-building, and I've started on that work today. I've reconfigured part of the repo to provide a location for developing the code, and written rendering for all the AJAX fragments; I'm now starting on the main documents, beginning with the simplest, articles.
My approach is to take advantage of the HTML5 data- attributes to preserve as much of the original TEI info as possible (source element name, attribute values, etc.), so that I'm not trying to predict what bits I will and won't need for the purposes of styling and interactivity. It's quite straightforward and elegant to do this, and rather a relief to get away from the heavily contingent style of rendering that I've ended up getting used to on projects that pre-date HTML5. No major gotchas so far.
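A rough sketch of the data- attribute approach, in Python for illustration only (the site rendering itself is XSLT). The attribute names data-el and data-<attribute>, the tei_to_html function, and the short list of block-level elements are all assumptions made for the example, not the project's actual conventions.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of TEI elements to render as block-level divs.
BLOCK_ELEMENTS = {"div", "p", "list", "item"}


def local_name(tag: str) -> str:
    """Strip the '{namespace}' part from an ElementTree tag or attribute name."""
    return tag.rsplit("}", 1)[-1]


def tei_to_html(el):
    """Render a TEI element as a div or span, keeping its name and attributes."""
    name = local_name(el.tag)
    html = ET.Element("div" if name in BLOCK_ELEMENTS else "span")
    html.set("data-el", name)  # original TEI element name
    for key, value in el.attrib.items():
        # data-* attribute names must be lowercase.
        html.set("data-" + local_name(key).lower(), value)
    html.text = el.text
    for child in el:
        rendered = tei_to_html(child)
        rendered.tail = child.tail  # keep the text that follows the child
        html.append(rendered)
    return html
```

With everything carried over like this, styling can be done with CSS attribute selectors such as [data-el="hi"][data-rend="italic"], so no decision about which TEI details matter has to be made at rendering time.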
Per CC, we've changed the default font to Georgia (although on one Mac, in FF, it has trouble with italicized long-s characters), set the default style to white, and removed the background graphic from it. This gives a plainer, more readable site.
This long-outstanding task is still waiting to be done. I worked through all the documents except the Varin, and was able to automate some of it by using regexes. There are some situations in which a <ref> and a correction coincide, where the consequences for rendering would be difficult to figure out, so in those cases I've left a note in place for the moment.
Met with CC, SA and some folks from the library to discuss a possible grant application, details of which will be fleshed out in due course. Did some research first and follow-up afterwards; meeting on Friday by which time I need to have a basic list of bullet-points to discuss.