I've finished writing and testing a PHP script to import an XML file (validated against http://web.uvic.ca/~lang02/bailey/schema/bailey_trialfile_proofing.rng ) into the MySQL database. The script uses SimpleXMLElement to load and loop through the file. The database handling is done by Zend_Db, since that's already being used in the main Bailey website.
I wrote the importer in such a way that new values for the foreign-key lookup tables (e.g. crime_normalizeds, judges, etc.) don't need to be entered into the MySQL database beforehand. Since the XML is validated against the schema, I can be reasonably certain that all of the values in the data are intentional. So, if the importer comes across a value that's not already in the MySQL database (say, a new judge), it simply inserts that value and continues with the import.
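The insert-if-missing logic above can be sketched roughly as follows. All of the names here (the `judges` table, a `name` column, the `<judge>` child element) are invented for illustration, and the sketch uses PDO with an in-memory SQLite database so it's self-contained; the real importer uses Zend_Db against MySQL.

```php
<?php
// Sketch of the lookup-or-insert approach: if a value (say, a judge's name)
// already exists in its lookup table, reuse its key; otherwise insert it
// and carry on with the import.
function lookupOrInsert(PDO $db, string $table, string $value): int
{
    // Try to find the existing key for this value.
    $stmt = $db->prepare("SELECT id FROM $table WHERE name = ?");
    $stmt->execute([$value]);
    $id = $stmt->fetchColumn();
    if ($id !== false) {
        return (int) $id;
    }
    // Not found: insert it and continue with the import.
    $db->prepare("INSERT INTO $table (name) VALUES (?)")->execute([$value]);
    return (int) $db->lastInsertId();
}

// PDO + SQLite stand in for Zend_Db + MySQL so the sketch runs on its own.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE judges (id INTEGER PRIMARY KEY, name TEXT)');

// A made-up fragment of trial-file XML, loaded via SimpleXMLElement.
$xml = new SimpleXMLElement(
    '<data>' .
    '<trial_file><judge>Begbie</judge></trial_file>' .
    '<trial_file><judge>Begbie</judge></trial_file>' .
    '<trial_file><judge>Crease</judge></trial_file>' .
    '</data>'
);

foreach ($xml->trial_file as $file) {
    $judgeId = lookupOrInsert($db, 'judges', (string) $file->judge);
    // ... insert the trial-file row itself, referencing $judgeId ...
}
```

Since the schema validation has already vetted the values, there's no need for a manual review step before the insert.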
The importer isn't yet online, but once it is I'll publish its URL to this blog.
While working on the import process to bring the XML data into the website database, it occurred to me that, after all of the changes to the schema, the <case> element doesn't serve much of a function anymore. While it used to group together multiple trial files, all of the useful grouping information (namely the trial dates) has been moved to the <trial_file> elements themselves. So, we don't need the extra level in the hierarchy.
For that matter, we also don't need the <regular_cases> element that wraps up the <case> elements, since it's also an extra level of hierarchy. The schema really just needs to be: data -> one or more trial files
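Concretely, the change drops two wrapper levels. A document that used to be shaped like this (trial-file contents elided):

```xml
<data>
  <regular_cases>
    <case>
      <trial_file>...</trial_file>
      <trial_file>...</trial_file>
    </case>
  </regular_cases>
</data>
```

now flattens to:

```xml
<data>
  <trial_file>...</trial_file>
  <trial_file>...</trial_file>
</data>
```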
Thus, I've removed <regular_cases> and <case> from the schema and modified the XSLT transformation scripts accordingly. I also modified the data that's already been processed.
These changes will make the PHP script that converts the XML to MySQL simpler.
I've decided to scrap the eXist rebuild for a few reasons:
The XML schema was not meant as a final destination for the data. SA created it as an ad hoc intermediary step between SD's raw data and the MySQL database. So, the schema doesn't lend itself well to being the backbone of an eXist architecture, particularly when it comes to searching and generating human-readable values. The structure of the data is highly relational and is a good fit for SQL, which, of course, was SA's original intention.
That said, my original motivation for the eXist rebuild still stands. The current method of translating the XML to SQL involves using XSLT transformations to generate SQL statements. However, every time the schema is updated with new structure or values - which is happening a lot - the XSLT stylesheets need to be updated as well. So, to get around these difficulties, I'm going to write a PHP script that will convert the XML to MySQL, likely using SimpleXML. It won't be the fastest script I've ever written, but since it will be a one-use-per-dataset kind of thing, speed isn't a big issue.
After some consultation with Greg I've decided to scrap the PHP/MySQL version of Bailey that I built and write the site in eXist. After building FrancoToile and the Lansdowne Lecture site in eXist I'm comfortable working with it, and this way the Bailey XML data can be used directly without being shoehorned into a MySQL database. The allowed values for crimes, outcomes, judges, etc. are constantly changing in the schema, so maintaining those keys in MySQL would be a long-term pain. Using eXist cuts out the XML-to-MySQL middle man. I don't expect the basic functionality of the site to take long - it'll likely be done next week sometime.