Lots of work recently on getting the Migration build and schema progressing. Currently, I have a process (in ~/production/utilities/migration) that migrates all of the TEI data from the trunk/ directory, re-organizing and normalizing where possible; this includes:
- The personography is now split into alphabetical files (personography_a, et cetera); kept whole, the personography takes forever to validate on even the best of machines, so this will allow better working conditions. Plus, in whatever interface we eventually create, the index of people will need to be split alphabetically for browsing anyway.
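To make the split concrete, here is a minimal sketch of what one alphabetical slice might look like; the file name, the `xml:id`, and the element content are illustrative assumptions, not the actual project data:

```xml
<!-- personography_a.xml: hypothetical shape of one alphabetical slice.
     Each file carries its own header and a listPerson restricted to
     surnames beginning with that letter. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- shared metadata, possibly pulled in by reference -->
  </teiHeader>
  <text>
    <body>
      <listPerson>
        <person xml:id="ABEL1">
          <persName><surname>Abel</surname></persName>
        </person>
        <!-- ... remaining A entries ... -->
      </listPerson>
    </body>
  </text>
</TEI>
```

Keeping each slice a complete TEI document means every file validates on its own, which is the point of the split.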
- The RADish files are now broken into smaller bits, with special processing-instructions in the fonds list that point to the lower-level items nested within a fonds; this will make working in these files simpler, too.
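A hedged sketch of how such a processing instruction might sit in the fonds list; the PI target name, its pseudo-attributes, and the file name are assumptions for illustration, not the actual convention used in the migration:

```xml
<!-- Hypothetical fonds list entry: instead of nesting the lower-level
     items inline, a processing instruction records where they now live. -->
<list type="fonds">
  <item xml:id="fonds_001">
    <title>Example fonds</title>
    <!-- PI stands in for the series/file/item hierarchy broken out
         into its own document -->
    <?loi items="fonds_001_items.xml"?>
  </item>
</list>
```

The payoff is that an `item` extracted this way can be edited and validated in its own small file, which is also why `item` needs to be allowed as a root element in the new schema (see the schema notes below).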
I've also started work on a new schema. After discussion with GL and SA, I think the best approach will be to create a harmonized schema that's written from scratch but embeds all of the previous work done by MH, SA, and GL on the various cluster schemas.
The approach I've taken is to create a new schema (production/data/schema/LOI.odd) that will now serve as the master ODD file. I've included all the modules we need and am slowly working my way through extending the schema where necessary. There are a few departures from the TEI schema: item can be a root element (see above), list can thus be empty (insofar as it is populated just with processing instructions), address can have mixed content, and seg/@subtype can be data.enumerated rather than a single token. The last two were additions in the loiDirectories.odd, which I'll need to talk with MH about. Mixed content in address is everywhere in the directories, so I think that's a good rule to retain, but enumerated subtypes only appear a handful of times, so there might be a better solution here.
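For the record, two of these customizations could be expressed in ODD roughly as follows; this is a sketch, not the actual content of LOI.odd, and the exact content models (and whether the datatype is referenced as data.enumerated or the newer teidata.enumerated) may differ:

```xml
<!-- Sketch: allow seg/@subtype to carry multiple enumerated values
     rather than a single token -->
<elementSpec ident="seg" mode="change">
  <attList>
    <attDef ident="subtype" mode="change">
      <datatype minOccurs="1" maxOccurs="unbounded">
        <dataRef key="teidata.enumerated"/>
      </datatype>
    </attDef>
  </attList>
</elementSpec>

<!-- Sketch: give address a mixed content model, alternating text
     with the usual address-part elements -->
<elementSpec ident="address" mode="change">
  <content>
    <alternate minOccurs="0" maxOccurs="unbounded">
      <textNode/>
      <classRef key="model.addrPart"/>
    </alternate>
  </content>
</elementSpec>
```

Since the enumerated subtypes only turn up a handful of times, an alternative worth weighing against the sketch above is leaving @subtype alone and splitting those few seg elements instead.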
There's lots left to do here and many decisions to be made, but the work is progressing steadily; I've also made a spreadsheet for SA and GL that attempts to account for and organize the various datasets so we can strategize on how to bring everything into coordination.