SF and I have been working on generating and checking lists of records that we believe can be deleted from the db. I've made a landscapes_backup db and cloned the current content into it before we start deleting. It looks like we'll be removing over 2,000 title records, but we're still doing some checking; after that we'll remove the associated unlinked items.
Met with JSR and SF to discuss refining the data in the LTD. First, we create a new duplicate of the existing db. SF will generate a list of known-good titles (ones that have been fully edited using the final protocols). I'll then generate lists of titles that don't match that set, which become candidates for deletion; she will check those, and we delete the confirmed ones. Then we generate lists of now-unlinked people (owners and sellers), other documents, and legal descriptions, which are in turn assessed as candidates for deletion.
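Roughly the kind of queries involved, as a sketch only: the table and column names below are placeholders rather than the real LTD schema, and sqlite3 stands in for whatever engine the db actually runs on.

```python
import sqlite3

conn = sqlite3.connect("landscapes_backup.db")

# Titles that are not in SF's known-good list: candidates for deletion.
candidates = conn.execute("""
    SELECT t.id, t.title_number
    FROM titles AS t
    WHERE t.id NOT IN (SELECT title_id FROM known_good_titles)
""").fetchall()

# Once those titles are gone, people no longer linked to any title
# become deletion candidates in turn.
unlinked_people = conn.execute("""
    SELECT p.id, p.name
    FROM people AS p
    WHERE p.id NOT IN (SELECT owner_id  FROM title_owners)
      AND p.id NOT IN (SELECT seller_id FROM title_sellers)
""").fetchall()
```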
Before next summer, the db should have these features:
- Created and last-modified dates auto-added to each record (presumably visible but not editable; that's not trivial, I think, but it would be a good extension to the Adaptive DB code; see the sketch after this list).
- Other docs should be a "soft" text field, initially populated from the relational table data, to save input time.
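A minimal sketch of the timestamp idea; the record structure here is hypothetical, and the real change would live inside the Adaptive DB save routine rather than in standalone code like this:

```python
from datetime import datetime, timezone

def stamp(record: dict) -> dict:
    """Add 'created' on first save; refresh 'last_modified' on every save."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    record.setdefault("created", now)  # write-once
    record["last_modified"] = now      # updated on every save
    return record
```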
I'm now pulling owner names from the SQL data and adding them as additional info for each title in the titles column.
I'm now pulling in land title data from the XML dump of the SQL db, for each address. Next step: popups with owner names.
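As a sketch of how the per-address grouping might work (the element names title, address and owner are my guesses at the dump's structure, not its actual format):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Group title records, with their owner names, by street address,
# ready to feed the per-address popups.
titles_by_address = defaultdict(list)
for title in ET.parse("land_titles_dump.xml").getroot().iter("title"):
    address = title.findtext("address", default="").strip()
    owners = [o.text.strip() for o in title.findall("owner") if o.text]
    titles_by_address[address].append({
        "number": title.get("number"),
        "owners": owners,
    })
```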
Worked on integrating some of the GIS data into the directories work, using the XML exports from the ArcGIS data to tie the block and lot information to the addresses we have from the directories. This works pretty well, and can be made to work a bit better if we accept that fractional house numbers should be lumped in with their integer components for the purposes of block/lot identification (I'm not sure that's always true).
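The lumping rule itself is trivial; something along these lines (the input formats shown are guesses at what actually occurs in the data):

```python
import re

def base_house_number(num: str):
    """Reduce a house number to its integer component for block/lot lookup:
    '714 1/2' -> 714, '714½' -> 714, '714' -> 714."""
    m = re.match(r"\s*(\d+)", num)
    return int(m.group(1)) if m else None
```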
Following that, I spent some time trying to figure out how to get the binary data in the <Shape> element into a usable format. I can't find any tool outside of Arc that can read it, and I can't get our Arc license to let me launch the product, so the data as it stands can't be used. I think we need to bring in an Arc person for a day just to export everything that was done in a format that's usable outside of it. Most of the data is quite accessible in the XML, and other material can be read in the MDB databases and exported from them in Access, but this crucial <Shape> element is a complete roadblock.
JC is going to investigate this problem and report back. Meanwhile, I've enhanced the Japanese name detection by adding an exception list, which works pretty well for broad-overview purposes.
It's not too difficult to find all the names that match Japanese patterns by modelling the structure of the romanized syllable, so I've done that (sketched below). There are some false positives, but I can exclude them with a list of exceptions. Also fixed a couple of bugs.
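The core of the approach, sketched here rather than shown as the actual project code (the pattern and the exception list below are illustrative only):

```python
import re

# A romanized Japanese name is modelled as a run of (consonant +) vowel
# syllables, each optionally closed by a syllabic n.
SYLLABLE = r"(?:(?:ky|gy|sh|ch|ts|ny|hy|by|py|my|ry|[kgsztdnhbpmyrwfj])?(?:ou|uu|[aiueo])n?)"
JAPANESE_NAME = re.compile(rf"^{SYLLABLE}+$", re.IGNORECASE)

EXCEPTIONS = {"mona", "dana", "nina"}  # sample false positives, not the real list

def looks_japanese(name: str) -> bool:
    word = name.strip().lower()
    return word not in EXCEPTIONS and bool(JAPANESE_NAME.match(word))
```

So "Tanaka" and "Fujita" match, "Smith" doesn't, and names like "Dana" that happen to fit the pattern are caught by the exception list.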
This is now basically working quite well, and looks reasonable. We found lots of minor issues with data and encoding and fixed them (rendering the data like this is a good way to reveal such problems), and we have some outstanding questions to answer, but it's remarkable how revealing just this one set of tables is.
We're now producing a web page from the encoded directories. Lots more to do, including rendering the info attractively, and handling the nested addresses (rooming houses etc.), but we're definitely making progress.
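For the nested entries, the rendering will probably end up as something like nested lists; a very rough sketch, with a made-up entry structure standing in for the encoded directory data:

```python
from html import escape

def render_entry(entry: dict) -> str:
    """entry = {'address': ..., 'name': ..., 'sub': [nested entries]}"""
    html = f"{escape(entry['address'])} {escape(entry.get('name', ''))}"
    subs = entry.get("sub", [])
    if subs:
        html += "<ul>" + "".join(f"<li>{render_entry(s)}</li>" for s in subs) + "</ul>"
    return html
```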
The original texts include things best classified as notes, and we also add editorial notes. Today I set up a system for assigning responsibility for notes, and documented it. I'll run the team through it tomorrow.
Today we implemented a system for methodical encoding of kyuujitai and shinjitai variants of kanji, so that we can transcribe the original text as it appears, but still generate modernized versions where we need them, and annotate the use of traditional forms automatically (sketched below). At the same time, I created some transformation scenarios for generating the schema and documentation from the ODD file, and put them into an .xpr file, which is also in svn. We're slowly moving towards a good set of documentation.
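A sketch of the kanji normalization idea (the mapping here is a tiny illustrative subset, not the project's table):

```python
# kyuujitai (traditional) -> shinjitai (modern) forms
KYUU_TO_SHIN = {
    "國": "国", "學": "学", "會": "会", "體": "体",
    "圖": "図", "舊": "旧", "廣": "広", "團": "団",
}

def modernize(text: str) -> str:
    """Generate the modernized reading; the transcription keeps the original forms."""
    return "".join(KYUU_TO_SHIN.get(ch, ch) for ch in text)

def traditional_forms_used(text: str) -> set:
    """Which known kyuujitai appear in a passage, for automatic annotation."""
    return {ch for ch in text if ch in KYUU_TO_SHIN}
```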