Reported as follows to JS-R and SF:
I've examined all the Japanese results, and I can confirm:
- All of the 213 identified as definitely Japanese are Japanese.
- Of the 219 identified as possibly Japanese, fewer than half actually are. The Japanese ones are obvious in the list, and can be assigned to Japanese manually. Entries land in this group because their names match the Japanese regular expression template for name forms (as do many non-Japanese names), but they don't happen to appear in the Asian Surnames list. Maikawa is a good example of this. There are blocks of these (15 Maikawas, for instance) where a name with a strong presence in Powell Street is missing from the Asian Surnames list.
- On the some-probability-Chinese list: I'm no expert on Chinese names, but if our assumption is correct that there are few-to-zero Koreans or Vietnamese to worry about at the time and place in question, then I would suggest they're pretty much all Chinese. The very few surnames which could be non-Chinese (e.g. Gee, Lee) are resolved by obviously Chinese forenames.
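For illustration, the split between definite and possible Japanese described above can be sketched roughly as follows. This is not the project code: the pattern is a much-simplified, hypothetical stand-in for the real Japanese name template, the surname set is a made-up excerpt rather than the actual Asian Surnames list, and I am assuming that "definite" means a name both fits the template and appears in the list.

```python
import re

# Hypothetical, simplified stand-in for the Japanese name-form template:
# one or more open syllables (optional consonant, a vowel, optional "n").
JAPANESE_FORM = re.compile(
    r"^(?:(?:[kgsztdnhbpmyrwjf]|sh|ch|ts|ky|gy|ny|hy|by|py|my|ry)?[aeiou]n?)+$"
)

# Hypothetical excerpt standing in for the Japanese entries in the
# Asian Surnames list; the real list is far larger.
LISTED_JAPANESE_SURNAMES = {"tanaka", "suzuki"}


def classify_japanese(surname):
    """Return 'definite', 'possible' or 'none' for a romanised surname."""
    key = surname.strip().lower()
    fits_form = JAPANESE_FORM.match(key) is not None
    if fits_form and key in LISTED_JAPANESE_SURNAMES:
        return "definite"   # fits the template and is in the list
    if fits_form:
        return "possible"   # fits the template only, so needs a manual check
    return "none"
```

Under these assumptions, Maikawa comes back as "possible" because it fits the template but is missing from the list, and so does a non-Japanese name such as Romano, which is why the possible group needs checking by eye.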
So this is what I think we can do, with confidence:
- Run the code to assign all the definite 213 Japanese to Japanese.
- Pull up all the possibly-Japanese, sort by surname, check the Japanese ones and set them to Japanese manually in blocks.
Re the Chinese assignments, I would be less confident, because here we're depending on our assumptions about possible other nationalities, but the results look pretty solid to me. We could certainly assign them all to Chinese_Provisional, but to be honest I think we might as well make them Chinese.
Finally, we should then retrieve all instances with no nationality assignment and go through them manually. I don't know how many that would be, but it'll be mainly people of European origin, so they can probably all be classified as Other in blocks, with a few exceptions. There will be some Chinese whose names don't actually appear in the Asian Surnames list (e.g. Lim and Chong), but I don't think there will be any Japanese, unless they're the result of misspelling in the original documents or the data entry.
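The manual passes suggested above (possibly-Japanese first, then anything still unassigned) both amount to sorting by surname and assigning in blocks. A minimal sketch, assuming each person is a plain dictionary with hypothetical `surname`, `status` and `nationality` keys; the real data structure will differ:

```python
from collections import defaultdict


def blocks_by_surname(records, wanted):
    """Group the records for which wanted(record) is true by surname.

    Returns a list of (surname, records) pairs sorted by surname, so that
    runs of the same name can be reviewed and assigned together.
    """
    blocks = defaultdict(list)
    for rec in records:
        if wanted(rec):
            blocks[rec["surname"].strip().lower()].append(rec)
    return sorted(blocks.items())


def assign_block(block, nationality):
    """Set the same nationality on every record in a reviewed block."""
    for rec in block:
        rec["nationality"] = nationality
```

The same helper would serve both passes: call it with `lambda r: r["status"] == "possible"` for the possibly-Japanese review, and with `lambda r: r["nationality"] is None` for the final sweep of unassigned people.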