Assigning ethnicity by name to directories transcriptions
Posted by mholmes on 11 Jul 2016 in Activity log
I've written a new, improved version of our protocol for assigning ethnicity probabilities based on name, using the Chicago db, our list of known exceptions, and a new list of known-good names derived from those we've accepted in the Land Titles project. It runs as a transformation that writes the Powell St directories files to a temp location; the diagnostics then process those output files. Results:
1656 "definitely Japanese" names. These all look good.
292 "provisionally Japanese" names. The majority of these look good, but each one will have to be looked at individually if we're following our protocol. At 2 minutes per name to check in the Japanese Names book, that's about ten hours of work.
313 "definitely Other East Asian" names. These look pretty good, but there are inevitably some which are a toss-up; "Louie", for instance, is in the Chicago db as Chinese, but it's obviously not always so.
75 "provisionally Other East Asian" names. Most of these look pretty good too, but I'm no expert on Chinese names.
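For anyone who wants a feel for how the four categories above could fall out of the three sources, here is a rough Python sketch. It is not the actual transformation: the order of precedence (exceptions first, then known-good names, then the db lookup), the 0.9 cut-off, the function name `classify`, and all of the sample data are assumptions made up for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# All data below is invented sample data, not the project's real lists.
KNOWN_EXCEPTIONS: Set[str] = {"lee"}          # hypothetical: names we never auto-assign
KNOWN_GOOD_JAPANESE: Set[str] = {"tanaka"}    # hypothetical: names already accepted elsewhere

# Hypothetical stand-in for the names db: surname -> (group, probability)
NAME_DB: Dict[str, Tuple[str, float]] = {
    "tanaka": ("Japanese", 0.97),
    "sato": ("Japanese", 0.70),
    "wong": ("Other East Asian", 0.95),
    "lee": ("Other East Asian", 0.60),
}

DEFINITE_THRESHOLD = 0.9  # assumed cut-off between "definitely" and "provisionally"


def classify(surname: str) -> Optional[str]:
    """Return a category label for a surname, or None if no assignment is made."""
    key = surname.strip().lower()
    # Assumed precedence: a known exception blocks any automatic assignment.
    if key in KNOWN_EXCEPTIONS:
        return None
    # A name on the known-good list is accepted outright.
    if key in KNOWN_GOOD_JAPANESE:
        return "definitely Japanese"
    # Otherwise fall back to the db lookup and grade by probability.
    entry = NAME_DB.get(key)
    if entry is None:
        return None
    group, probability = entry
    certainty = "definitely" if probability >= DEFINITE_THRESHOLD else "provisionally"
    return f"{certainty} {group}"
```

In a sketch like this, only the "provisionally" results would need the manual follow-up, which is where the per-name checking time comes in.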
It looks like it'll be on me to do the follow-up on the provisionals, but I'm waiting for JSR to confirm. In the process of doing all this, I found and fixed numerous encoding and transcription errors, and added some Schematron to avoid some of them in the future.