Examined the various data types from our different clusters, and discussed ways of visualizing data geographically. JC has some very nice maps based on fire insurance info, and will generate a couple more; these are thoroughly georeferenced, and the base map can be created as a vector layer, with the digitized and georeferenced rasters of hand-drawn maps layered on top. We talked about building an OpenLayers map with this material. JC now has a copy of the Land Title data in XML, and access to the db; we'll be trading info in an attempt to generate a set of ideas and pilot implementations of data realization.
Based on a clarified set of instructions from JS-R, I've now assigned Japanese_Provisional, EastAsian_Other and UC_Chinese ethnicities, using XQuery code to generate SQL insert statements from an XML dump of the db. I've left institutional owners completely alone, on the basis that the UC Database can tell us nothing about their ethnicity. I haven't deleted Asia_OtherUnknown or Unknown yet, because 26 records still have the former and one has the latter; I think someone should review these manually and assign them to Other if appropriate, or to something else if not.
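A rough sketch of the XML-dump-to-SQL step, written in Python rather than the XQuery actually used. Everything about the data shape here is an assumption for illustration: the `owner` and `surname` elements, the `institution` attribute, the `owner_ethnicities` table and its columns, and the idea that assignment is a simple surname lookup. Only the ethnicity value names come from the entry above.

```python
import xml.etree.ElementTree as ET

# Hypothetical dump format; the real db dump is not shown in this log.
SAMPLE_DUMP = """<owners>
  <owner id="1" institution="0"><surname>Tanaka</surname></owner>
  <owner id="2" institution="1"><surname>Wong</surname></owner>
  <owner id="3" institution="0"><surname>Wong</surname></owner>
  <owner id="4" institution="0"><surname>Smith</surname></owner>
</owners>"""


def ethnicity_inserts(xml_text, name_to_ethnicity):
    """Return SQL INSERT statements for owners whose surname is in the lookup.

    Institutional owners are skipped, and owners with no match get no
    statement. Ethnicity values are assumed to be trusted constants, so
    they are interpolated directly; arbitrary input would need escaping.
    """
    root = ET.fromstring(xml_text)
    statements = []
    for owner in root.iter("owner"):
        if owner.get("institution") == "1":
            continue  # leave institutional owners completely alone
        surname = (owner.findtext("surname") or "").strip()
        ethnicity = name_to_ethnicity.get(surname)
        if ethnicity is None:
            continue
        statements.append(
            "INSERT INTO `owner_ethnicities` (`own_id`, `eth_name`) "
            "VALUES (%d, '%s');" % (int(owner.get("id")), ethnicity)
        )
    return statements


if __name__ == "__main__":
    lookup = {"Tanaka": "Japanese_Provisional", "Wong": "UC_Chinese"}
    for statement in ethnicity_inserts(SAMPLE_DUMP, lookup):
        print(statement)
```

Generating statements as text, rather than writing to the db directly, keeps a reviewable record of exactly what will be run.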
Skype with SF and JS-R, after which SF made the final assignments for gen 1 custodian sales, and I generated the candidates for gen 2. I also added a new field to the titles table in which we can record the generation at which a chain became "polluted" by the admixture of properties that were not part of the original set (in this case, custodian sales), so we can track that.
We also discussed a procedure for assigning tentative ethnicity values based on the UC database; I'm not very clear on how or why we're doing it this way, so I've laid out the steps as far as I understand them to get confirmation before proceeding. I've also pre-generated lists of strong-probability Chinese names and of all Asian names, for use once it's clear what we're doing.
Based on an initial assignment of 46 properties to "generation 1" custodian sales (there are a few more to be decided, so more will probably be added), I've written an XSLT transformation that discovers and lists details of all subsequent titles which are from the "next" generation, in that they include properties which were transferred as part of the previous generation sales.
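The next-generation lookup can be sketched roughly as follows, in Python rather than the XSLT actually used. The `Title` record, its field names, and the rule that the "next" title for a property is the earliest later title containing it are all assumptions made for illustration. The `is_mixed` helper flags a candidate title that also carries properties not inherited from the previous generation.

```python
from typing import Dict, FrozenSet, Iterable, List, NamedTuple, Set


class Title(NamedTuple):
    """Hypothetical, simplified title record."""
    title_id: int
    transfer_date: str            # ISO date, so string comparison orders correctly
    properties: FrozenSet[str]    # identifiers of the properties on the title


def next_generation(titles: List[Title], prev_gen_ids: Iterable[int]) -> Dict[int, Set[str]]:
    """Map each next-generation title id to the properties it inherited.

    For every property on a previous-generation title, the earliest later
    title that includes that property is treated as its successor.
    """
    by_id = {t.title_id: t for t in titles}
    candidates: Dict[int, Set[str]] = {}
    for prev_id in prev_gen_ids:
        prev = by_id[prev_id]
        for prop in prev.properties:
            later = [
                t for t in titles
                if prop in t.properties and t.transfer_date > prev.transfer_date
            ]
            if not later:
                continue  # the property never appears on a later title
            successor = min(later, key=lambda t: (t.transfer_date, t.title_id))
            candidates.setdefault(successor.title_id, set()).add(prop)
    return candidates


def is_mixed(title: Title, inherited: Set[str]) -> bool:
    """True if the title also carries properties that were not inherited."""
    return bool(title.properties - inherited)


# Invented sample data, not taken from the real database.
SAMPLE_TITLES = [
    Title(1, "1943-01-01", frozenset({"A", "B"})),
    Title(2, "1944-01-01", frozenset({"A"})),
    Title(3, "1945-01-01", frozenset({"A"})),
    Title(4, "1944-06-01", frozenset({"B", "C"})),
    Title(5, "1940-01-01", frozenset({"A"})),
]

if __name__ == "__main__":
    found = next_generation(SAMPLE_TITLES, [1])
    for title_id, props in sorted(found.items()):
        title = next(t for t in SAMPLE_TITLES if t.title_id == title_id)
        print(title_id, sorted(props), "mixed" if is_mixed(title, props) else "pure")
```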
There are some complications: some titles include a mix of custodian and non-custodian properties. Still waiting to hear from JS-R and SF about how they would like to handle this; also, there's a strong possibility that even "pure" titles may have lost properties that were on the preceding title.
Getting a list of "first generation" sales by the Custodian is easy; we just find titles on which the Custodian (owner #777) is the seller. There are 46. But I also took a good look through every mention of the Custodian, and found a whole pile of interesting, puzzling and anomalous titles which I've listed out with explanations for JS-R and SF to take a look at. Some look like they're basically Custodian sales mediated through third-party law firms; in other cases, Japanese owners actually acquire property somehow, and this is then very rapidly disposed of by the Custodian or by proxies. Interesting stuff.
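A minimal sketch of the "Custodian is the seller" lookup, using Python with an in-memory SQLite database so it runs standalone. The `title_sellers` link table and its `ttl_id` and `own_id` columns are hypothetical; the real schema may link sellers to titles differently. Only the owner number 777 comes from the entry above.

```python
import sqlite3

CUSTODIAN_ID = 777  # owner number for the Custodian, as given in the entry above


def first_generation_titles(conn, custodian_id=CUSTODIAN_ID):
    """Return ids of titles on which the given owner appears as a seller."""
    rows = conn.execute(
        "SELECT DISTINCT ttl_id FROM title_sellers "
        "WHERE own_id = ? ORDER BY ttl_id",
        (custodian_id,),
    ).fetchall()
    return [row[0] for row in rows]


def demo_connection():
    """Build a tiny in-memory database with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE title_sellers (ttl_id INTEGER, own_id INTEGER)")
    conn.executemany(
        "INSERT INTO title_sellers VALUES (?, ?)",
        [(1, 777), (2, 12), (3, 777), (3, 40)],
    )
    return conn


if __name__ == "__main__":
    print(first_generation_titles(demo_connection()))
```

A query like this only catches direct sales; the mediated and proxy cases described above would still need manual review.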
Following a check from SF, I've now deleted orphaned owners, and checked to ensure that this did not result in any more titles without owners (there are ten very old ones with no details and no owners due to their age).
Wrote some XSLT to generate lists of owners who are candidates for deletion because they're a) old (predating the summer data work) and b) not linked to any title as owner or seller. Sent the list to SF; on confirmation, I'll run the SQL which is also generated by the XSLT to delete those owners.
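As a sketch of the candidate selection and the generated SQL, here is a Python version run against an in-memory SQLite database instead of XSLT over the XML dump. The `owners`, `title_owners` and `title_sellers` tables, the `own_created` column, the cutoff date and the sample rows are all assumptions for illustration.

```python
import sqlite3


def orphan_candidates(conn, cutoff):
    """Return ids of owners created before the cutoff and linked to no title."""
    rows = conn.execute(
        """
        SELECT o.own_id
        FROM owners AS o
        WHERE o.own_created < ?
          AND NOT EXISTS (SELECT 1 FROM title_owners AS t WHERE t.own_id = o.own_id)
          AND NOT EXISTS (SELECT 1 FROM title_sellers AS s WHERE s.own_id = o.own_id)
        ORDER BY o.own_id
        """,
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]


def delete_statements(owner_ids):
    """Generate one DELETE per owner, to be run only after confirmation."""
    return ["DELETE FROM `owners` WHERE `own_id` = %d;" % int(i) for i in owner_ids]


def demo_connection():
    """Build a tiny in-memory database with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE owners (own_id INTEGER PRIMARY KEY, own_created TEXT);
        CREATE TABLE title_owners (ttl_id INTEGER, own_id INTEGER);
        CREATE TABLE title_sellers (ttl_id INTEGER, own_id INTEGER);
        INSERT INTO owners VALUES
            (1, '2012-01-10'), (2, '2012-02-01'),
            (3, '2012-03-01'), (4, '2013-09-01');
        INSERT INTO title_owners VALUES (100, 1);
        INSERT INTO title_sellers VALUES (101, 2);
        """
    )
    return conn


if __name__ == "__main__":
    ids = orphan_candidates(demo_connection(), "2013-06-01")
    for statement in delete_statements(ids):
        print(statement)
```

Keeping selection and deletion as separate steps mirrors the workflow above: the list goes out for review first, and the generated statements are run only once it is confirmed.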
Added new fields for generation tracking:
ALTER TABLE `titles` ADD COLUMN `ttl_gen_cust` int(11) default 0 AFTER `ttl_transferdate`;
ALTER TABLE `titles` ADD COLUMN `ttl_gen_noncust` int(11) default 0 AFTER `ttl_gen_cust`;
Updated local_classes.php accordingly. Tested in dev, implemented in live.
Met with SF and JS-R to discuss three main issues: the generation variables (there will be two, one for cust and one for non-cust); removal of obsolete owners, and deduping of remaining owners; and ethnicity values and assignment. Details are in my notes...
Also tested FineReader running against one of our photo-image PDFs. Took 1.5 hours, but produced a really quite impressive result from shabby typescript.
The server is now too paltry to manage the generation of a spreadsheet from our title data without timing out or running out of memory, so after setting all owners identified as definitely Japanese to Japanese ethnicity, I wrote some XSLT to generate a spreadsheet of title data for titles related to the Custodian, and forwarded this to SF.