CGWP data
We've decided to use PostgreSQL for this project because of its more sophisticated static views ('materialized views') and its efficient way of updating them ('REFRESH MATERIALIZED VIEW CONCURRENTLY').
Downside? Materialized views arrived in v9.3, but concurrent refresh doesn't show up until v9.4, so I installed 9.4 and phpPgAdmin on my VM.
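For reference, here is a rough sketch of the statements involved, held as Python strings so they can be handed to whatever driver or admin tool is in use. The view name, query, and column names are made up for illustration and are not the project's actual schema. The one detail worth noting is that PostgreSQL only allows a concurrent refresh on a materialized view that has a unique index.

```python
def materialized_view_statements(view, query, key_column):
    """Build the SQL for a materialized view that can be refreshed concurrently.

    All identifiers passed in are hypothetical examples, not the real schema.
    """
    return [
        # Store the result of the query as a materialized view.
        f"CREATE MATERIALIZED VIEW {view} AS {query};",
        # A unique index is required before CONCURRENTLY can be used.
        f"CREATE UNIQUE INDEX {view}_{key_column}_idx ON {view} ({key_column});",
        # Rebuild the view without blocking readers (PostgreSQL 9.4 and later).
        f"REFRESH MATERIALIZED VIEW CONCURRENTLY {view};",
    ]


statements = materialized_view_statements(
    view="soldier_summary",
    query="SELECT id, surname, unit FROM soldiers",
    key_column="id",
)
for statement in statements:
    print(statement)
```

Without CONCURRENTLY the refresh still works, but it locks the view against reads while it runs.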
I now have a reasonable first pass at a schema - for now it only addresses the core data sets (soldiers, war diaries, and ancillary tables that help flesh out those main tables).
To clean up the data as much as possible, Martin is going to write some (streaming) XSLT to smooth out some of the inevitable bumps that exist in data sets of this size. To produce a MySQL dump in XML format, you can use phpMyAdmin or mysqldump on the CLI. Doing it from phpMyAdmin I ended up with a 908MB file, whereas the command 'mysqldump --xml -u root -p cgwp > /tmp/cgwp.xml' produces a 729MB file. I haven't investigated why this is, but the headers of the two documents suggest that phpMyAdmin has its own take on dumping XML.
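Before any XSLT gets written, a quick sanity check on the dump can be useful. The sketch below is not the XSLT clean-up itself. It is a small Python script that streams through the file and counts rows per table without loading it all into memory. It assumes the layout that mysqldump --xml normally emits (table_data elements with a name attribute, containing row elements), so it would not work unchanged on the phpMyAdmin export. The table names in the sample are invented.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_rows(source):
    """Stream a mysqldump --xml file and count <row> elements per table."""
    counts = Counter()
    current_table = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "table_data":
            # Attributes are already available when the start tag is seen.
            current_table = elem.get("name")
        elif event == "end" and elem.tag == "row":
            counts[current_table] += 1
            # Drop the row's fields; the emptied row stays attached to
            # its parent until the enclosing table_data ends.
            elem.clear()
        elif event == "end" and elem.tag == "table_data":
            current_table = None
            elem.clear()  # release the emptied rows of this table
    return counts


# Tiny stand-in for /tmp/cgwp.xml; table names are hypothetical.
sample = b"""<?xml version="1.0"?>
<mysqldump xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<database name="cgwp">
  <table_data name="soldiers">
    <row><field name="id">1</field></row>
    <row><field name="id">2</field></row>
  </table_data>
  <table_data name="war_diaries">
    <row><field name="id">1</field></row>
  </table_data>
</database>
</mysqldump>"""

counts = count_rows(io.BytesIO(sample))
print(dict(counts))
```

To run it against the real dump, pass the file path instead of the in-memory sample, e.g. count_rows('/tmp/cgwp.xml'), and compare the totals with SELECT COUNT(*) on the MySQL side.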