Provisionals down to 122. No word on what to do about the OEA provisionals yet.
Down to 134 provisionals now.
I've been pushing ahead with the provisional Japanese names; I've knocked off all the low-hanging fruit so I'm down to 152 names, but each remaining name is now taking a bit longer. One common pattern is transcription errors from the awful scans of the 1941 directory; when that looks likely, I have to go back to the original images and cross-check between the person and street indexes to sort it out. At the current rate, it'll take me about three weeks to get through them. In the absence of any decision about the Jisho site, I've used it in the following way:
- If the site lists a particular name as a surname or forename, it provides kanji for it;
- I check that each of the kanji used in the name appears in O'Neill, in a similar type of name, in a similar position;
- If so, I add it to the list of known Japanese forenames or surnames.
This seems pretty rigorous to me.
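The check in the three steps above could be sketched roughly as follows. This is only an illustration in Python, not code we actually run: the `KanjiUse` record, the position labels, and the toy `attested` set are all hypothetical, and I've assumed "similar position" means the same slot (initial, medial, final, or sole) in the same type of name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KanjiUse:
    """One attested use of a kanji in a reference work (hypothetical record shape)."""
    kanji: str      # a single character
    name_type: str  # "surname" or "forename"
    position: str   # "initial", "medial", "final" or "sole"


def positions(kanji_form: str) -> list[tuple[str, str]]:
    """Pair each character of a kanji form with a coarse position label."""
    n = len(kanji_form)
    result = []
    for i, ch in enumerate(kanji_form):
        if n == 1:
            pos = "sole"
        elif i == 0:
            pos = "initial"
        elif i == n - 1:
            pos = "final"
        else:
            pos = "medial"
        result.append((ch, pos))
    return result


def passes_check(kanji_form: str, name_type: str, attested: set[KanjiUse]) -> bool:
    """True only if every kanji is attested in the same name type and position."""
    if not kanji_form:
        return False
    return all(
        KanjiUse(ch, name_type, pos) in attested
        for ch, pos in positions(kanji_form)
    )


# Toy data for illustration only; not transcribed from any reference book.
attested = {
    KanjiUse("田", "surname", "initial"),
    KanjiUse("中", "surname", "final"),
}
print(passes_check("田中", "surname", attested))   # every kanji attested
print(passes_check("田中", "forename", attested))  # wrong name type
```

A name would only be added to the known-names list when `passes_check` returns true for the kanji form the site supplies; anything else stays provisional.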
Meanwhile, we still have the issue of provisional Other East Asian assignments. I don't remember having any involvement in the decisions on how to deal with those; we don't really have a rigorous protocol for them as we do for the Japanese names, but:
- there are only 73 of them;
- a handful are Japanese or likely Japanese (I've dealt with a couple already);
- some are obviously European;
- most are unambiguously East Asian, in my inexpert view.
Waiting for JSR's response on whether I should go ahead with these.
I've been hacking away at the provisional Japanese assignments and done all the low-hanging fruit; I'm now down below 180 remaining, but each one is taking longer and longer.
I've written a new improved version of our protocol for assigning ethnicity probabilities based on name, using the Chicago db, our list of known exceptions, and a new list of known-good names derived from those we've accepted in the Land Titles project. This is done as a transformation of the Powell St directories files into a temp location, and those output files are then processed by diagnostics. Results:
- 1656 "definitely Japanese" names. These all look good.
- 292 "provisionally Japanese" names. The majority of these look good, but each one will have to be looked at individually if we're following our protocol. At 2 minutes per name to check in the Japanese Names book, that's about ten hours of work.
- 313 "definitely Other East Asian" names. These look pretty good, but there are inevitably some which are a toss-up; "Louie" for instance is in the Chicago db as Chinese, but it's obviously not always so.
- 75 "provisionally Other East Asian" names. Most of these look pretty good too, but I'm no expert on Chinese names.
It looks like it'll be on me to do the follow-up on the provisionals, but I'm waiting for JSR to confirm. In the process of doing all this, I found and fixed numerous encoding and transcription errors, and added some Schematron to avoid some of them in the future.
The Bird Commission materials are now on home1h/loi, in their own folder, which is symlinked from home1t/loi/bird; ACLs on the folder give read access to three specific users (plus HCMC staff). Use getfacl to see the effective permissions.
The addition of the new 1949 content broke the build in interesting ways, due to the presence of some anomalies such as empty num elements. After fixing the errors and adding some Schematron to avoid the problem in the future, we eventually got everything building again, and now we can see the streets files with the additional year. I had not updated the name exclusion list (names which match the Japanese name regex template but which are not Japanese) since the addition of 1943, so I went through both 1943 and 1949, expanding the list considerably. This is important ahead of the auto-tagging we're planning to do.
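To show why the exclusion list matters for auto-tagging, here is a minimal Python sketch of a regex template combined with an exclusion list. Both the `TEMPLATE` pattern (a crude romanised-syllable check) and the entries in `EXCLUSIONS` are made up for illustration; they are not the project's actual regex or list.

```python
import re

# Hypothetical template: one or more romanised syllables, each an optional
# consonant (or ch/sh/ts), a vowel, and an optional trailing "n".
TEMPLATE = re.compile(
    r"(?:(?:ch|sh|ts|[kgsztdnhbpmyrwfj])?[aiueo]n?)+",
    re.IGNORECASE,
)

# Hypothetical exclusion list: names that fit the template but are not Japanese.
EXCLUSIONS = {"marino", "amato"}


def is_candidate(name: str, exclusions: set[str]) -> bool:
    """True if the name fits the template and is not a known exception."""
    if TEMPLATE.fullmatch(name) is None:
        return False
    return name.lower() not in exclusions


for name in ["Tanaka", "Marino", "Smith"]:
    print(name, is_candidate(name, EXCLUSIONS))
```

Without the exclusion entry, "Marino" would be tagged as a candidate because it fits the syllable pattern, which is exactly the kind of false positive the list is there to catch.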
Found one bug triggered when fields are completely missing, and tweaked some other bits. Sent first output to JSR.
Per JSR, added code to extend the comparison process with global statistics by field.
Wrote the XSLT to do the automated comparison between title records. JSR has added a new feature request (global stats at the end), so it'll need some tweaking on Monday, but it's basically there.
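For reference, the general shape of the comparison described in these entries (per-field comparison of paired title records, tolerance of completely missing fields, and global statistics by field at the end) might look something like the sketch below. It is written in Python rather than the XSLT actually used, and the field names, outcome labels, and record layout are all assumptions for illustration.

```python
from collections import Counter
from typing import Optional


def compare_records(a: dict, b: dict, fields: list[str]) -> dict[str, str]:
    """Compare two records field by field, tolerating absent fields."""
    result = {}
    for field in fields:
        va: Optional[str] = a.get(field)  # None when the field is missing
        vb: Optional[str] = b.get(field)
        if va is None and vb is None:
            result[field] = "both_missing"
        elif va is None or vb is None:
            result[field] = "one_missing"
        elif va.strip() == vb.strip():
            result[field] = "match"
        else:
            result[field] = "differ"
    return result


def global_stats(pairs: list[tuple[dict, dict]], fields: list[str]) -> dict[str, Counter]:
    """Tally comparison outcomes per field across all record pairs."""
    stats = {field: Counter() for field in fields}
    for a, b in pairs:
        for field, outcome in compare_records(a, b, fields).items():
            stats[field][outcome] += 1
    return stats


# Hypothetical records; "owner" and "lot" are invented field names.
pairs = [
    ({"owner": "A", "lot": "1"}, {"owner": "A", "lot": "2"}),
    ({"owner": "B"}, {"owner": "B "}),
]
stats = global_stats(pairs, ["owner", "lot"])
for field, counts in stats.items():
    print(field, dict(counts))
```

Treating a missing field as its own outcome, rather than letting the comparison fail, is the kind of handling that would avoid the bug triggered when fields are completely missing.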