New name resource; work on ethnicity
I have started the initial notes for a blog post documenting our ethnicity assignment process for Japanese names. During some background research I came across a resource I had not known about before, the Japanese Multilingual Named Entity Dictionary. This is a fabulous resource: the XML file comes in at 150MB and includes not only personal names but also placenames and similar entity names. I've downloaded it and written a processor which produces a much-reduced version suited to our purposes for personal name identification; the result is just over 30MB, so it should be usable as a lookup in a Saxon process. It will provide one more piece of strong evidence that a form is known to be a Japanese name (last or first; male, female or gender-neutral), and it opens up the possibility of finding kanji representations for romanized forms in an automated fashion.
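To give a rough idea of what such a reduction step involves, here is a minimal Python sketch. It is not the processor described above, which is meant to feed a Saxon process. The element names (`entry`, `keb`, `reb`, `name_type`, `trans_det`), the plain-text type labels, the output element `name` and the helper names are all assumptions made for illustration and should be checked against the real file, as should the idea that `trans_det` holds a romanized form.

```python
import io
import xml.etree.ElementTree as ET

# Assumed type labels for personal names; verify against the real dictionary.
PERSONAL_TYPES = {"surname", "given", "masc", "fem", "person"}


def reduce_names(source, personal_types=PERSONAL_TYPES):
    """Stream an XML name dictionary and keep only personal-name entries."""
    root = ET.Element("names")
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "entry":
            continue
        types = {t.text for t in elem.iter("name_type") if t.text}
        kept = sorted(types & personal_types)
        if kept:
            out = ET.SubElement(root, "name", {"types": " ".join(kept)})
            # Copy only the fields needed for lookup (hypothetical selection).
            for tag in ("keb", "reb", "trans_det"):
                for child in elem.iter(tag):
                    ET.SubElement(out, tag).text = child.text
        elem.clear()  # drop the processed entry's children to limit memory use
    return root


def build_lookup(root):
    """Map a lower-cased romanized form to the kanji forms recorded with it."""
    lookup = {}
    for name in root.iter("name"):
        kanji = [k.text for k in name.findall("keb")]
        for det in name.findall("trans_det"):
            if det.text:
                lookup.setdefault(det.text.lower(), []).extend(kanji)
    return lookup


# Tiny invented sample in the assumed shape: one surname and one placename.
SAMPLE = """<JMnedict>
<entry><k_ele><keb>田中</keb></k_ele><r_ele><reb>たなか</reb></r_ele>
<trans><name_type>surname</name_type><trans_det>Tanaka</trans_det></trans></entry>
<entry><k_ele><keb>京都</keb></k_ele><r_ele><reb>きょうと</reb></r_ele>
<trans><name_type>place</name_type><trans_det>Kyoto</trans_det></trans></entry>
</JMnedict>"""

if __name__ == "__main__":
    reduced = reduce_names(io.BytesIO(SAMPLE.encode("utf-8")))
    print(ET.tostring(reduced, encoding="unicode"))
    print(build_lookup(reduced))
```

Streaming with `iterparse` rather than loading the whole tree is a deliberate choice here, since the source file is large; the reduced tree could then be written out with `ET.ElementTree(reduced).write(...)` for use elsewhere.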
Meanwhile, work on the blog post is stalled while I integrate this resource into the process.