Continued porting content over; updated links; created accordions.
This wasn't dead trivial to set up, so I'm documenting it. What's needed, first, is a read-only user on the database. That user needs read (SELECT) permissions (I included views here), and also needs the LOCK TABLES privilege (presumably everything is locked during the dump process). Finally, that user needs to have explicit rights on "localhost"; the generic "%" host will not do.
Once the user is set up, create a script on the db server which first deletes the old version of the file you want to dump to (otherwise the dump will fail); then runs the dump command. That command looks like this:
/usr/bin/mysqldump --xml --user=[readonlyuser] --password=[password] landscapes_live > /home1t/[user]/www/[outputfile].xml
In this case, we put it in www so that it can be retrieved easily by a Jenkins build process which makes use of it.
Finally, run crontab -e to edit your crontab, and add the following line (this runs the script at midnight every day):
0 0 * * * /home1t/[user]/[script].sh
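The script itself is a shell script and isn't reproduced here. As a rough illustration of the two steps it performs (delete the old output file, then run the dump), here is a minimal sketch in Python. The function names are hypothetical, and the user, password and output path are placeholders just as in the command above.

```python
import os
import subprocess


def build_dump_command(user, password, database):
    """Build the mysqldump argument list used in the command above."""
    return [
        "/usr/bin/mysqldump",
        "--xml",
        f"--user={user}",
        f"--password={password}",
        database,
    ]


def remove_old_dump(path):
    """Delete the previous dump file if it exists."""
    if os.path.exists(path):
        os.remove(path)


def run_dump(user, password, database, output_path):
    """Remove the old file, then write a fresh XML dump to output_path."""
    remove_old_dump(output_path)
    with open(output_path, "wb") as out:
        # check=True raises CalledProcessError if mysqldump exits non-zero.
        subprocess.run(
            build_dump_command(user, password, database),
            stdout=out,
            check=True,
        )
```

Calling run_dump with the read-only user's credentials, the database name landscapes_live and the output path would then stand in for the shell script in the cron line.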
Forgot to post my extra hour last night.
After much back and forth with the Cumberland folks, we decided to test running a more permissive version of my code which would accept an image as a match if it began with the C-number, but just had some additional stuff on the end. This resulted in 104 candidate matches, which was more promising, but after looking at them (I have the script building a web page with links), they concluded that what they had thought were irrelevant suffixes (S for small, for instance) were actually significant, and none of these additional matches was actually correct.
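The difference between the strict and the more permissive matching can be sketched roughly as follows. This is not the actual script: the function names are hypothetical, and it assumes a C-number looks something like "C123" and that image filenames carry an ordinary extension.

```python
import os
import re


def strict_match(c_number, filename):
    """True only if the filename (minus extension) is exactly the C-number."""
    stem = os.path.splitext(filename)[0]
    return stem.lower() == c_number.lower()


def permissive_match(c_number, filename):
    """True if the filename begins with the C-number.

    Anything may follow, except another digit, so that C123 does not
    pick up C1234 (assumption: C-numbers end in digits).
    """
    stem = os.path.splitext(filename)[0]
    pattern = re.escape(c_number) + r"(?!\d)"
    return re.match(pattern, stem, flags=re.IGNORECASE) is not None


def extra_suffix(c_number, filename):
    """Return whatever follows the C-number in the filename stem."""
    stem = os.path.splitext(filename)[0]
    return stem[len(c_number):]
```

Listing the output of extra_suffix next to each permissive candidate makes suffixes such as S easy to spot when reviewing, which matters given that those suffixes turned out to be significant.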
So the plan now is to wait for the new DVD with full-size unedited images, all with correct filenames, to arrive in the mail, then run my script against those images and see what emerges. At least the script is now fully developed.
Meanwhile, TS added a bunch of stuff to the Guidelines document, and I've added some bits myself, and created a new distribution of the instructions (nice to have that done by ant, even including the upload). Note to self: always script this kind of thing if you can; it'll end up saving you hours.
One of the important things we want Jenkins to do for us is validate all our XML and send an email to the last committer when a file is broken. There are several steps to this, and I'm working through them; this post will record how I set that up. Everything is run as an ant task in a file called utilities/validate.xml.
Regular RNG validation
This is handled by Jing, through its built-in support for ant tasks. You use <taskdef> to define the task, pointing at the classname Jing provides; then you invoke that task and pass it filesets containing the files to be validated. One additional requirement is that ant needs to know where to find Jing, so I'm passing it -lib /usr/share/java on the command line.
Schematron validation
schematron.com provides an open-source library called (currently) ant-schematron-2010-04-14.jar, which can be used in the same way as Jing; you create a <taskdef> giving it the classname and classpath (I point directly at the jar), then invoke the task with <schematron schema="../db/data/rng/london_all.sch" failonerror="false" queryLanguageBinding="xslt2">, again passing filesets.
Validation of embedded Schematron inside the RNG
schematron.com provides XSLT tools to extract the Schematron and convert it to a full Schematron file, so I'm using their ExtractSchFromRNG-2.xsl with Saxon to generate another Schematron file, then validating our tree against that.
Diagnostics
Our regular diagnostics process is now quite sophisticated, and that's also running and producing an archived report in HTML format.
One minor improvement over the TEI setup is that I can store the log parse rules file directly in the repo, meaning that a fix to it is automatically inherited at the next checkout/job run. Right now I'm not doing anything useful in that file, but I'm sure as we continue to enhance our Schematron, especially with regard to bibls, we will need to suppress some specific messages.
Continuing porting content to new site.
Added tertiaries; updated links.
Python + XML = hours of frustration.
It turns out the DVD I have of Cumberland images is incomplete, so I'm waiting for another one by mail, but in the meantime I've decided against doing an automated import; instead, I think it's much safer to provide a mapping for the editors on a simple web page which enables them to vet each image before adding it to a record. So I've been refining my Python script, and in the process learning how awful XML/XPath handling in Python is; it's worse than useless, so I'm pre-processing the XML dump of the database to create a much simpler XML lookup file that the Python can use to retrieve the appropriate slugs for each record, based on identifier. I also installed Wand and I'm using that to interface with ImageMagick to create scaled-down versions of the images where necessary (they don't want to upload full-size ones). This is all now working, and I can run it again when I have the DVD to create a page for them to work from.
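To give an idea of why the simplified lookup file helps, here is a minimal sketch of reading such a file into a dictionary using only the standard library. The element and attribute names (records, record, identifier, slug) and the sample values are assumptions made for illustration; the real lookup file may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup file contents: one flat element per record.
SAMPLE_LOOKUP = """<records>
  <record identifier="C123" slug="example-slug-one"/>
  <record identifier="C456" slug="example-slug-two"/>
</records>"""


def load_slug_lookup(xml_text):
    """Map each record identifier to its slug."""
    root = ET.fromstring(xml_text)
    return {
        record.get("identifier"): record.get("slug")
        for record in root.iter("record")
    }


lookup = load_slug_lookup(SAMPLE_LOOKUP)
```

With the data flattened like this, retrieving a slug is a plain dictionary lookup (lookup.get("C123")) and no XPath is needed at all.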
Spent some time with JT refining the work he's done on pulling in bibls of our own documents, and also discussed more significant changes to the way bibls are rendered and popups work.
Continuing to port content to new site.