Auto-markup of named entities: methodology
Posted by mholmes on 05 Mar 2015 in Activity log
One major problem we have with adapting the procedure used for MoM to MoEML is that in the Stow 1633, many names, but more frequently parts of names, have been tagged with <hi> (no attributes), to signify that they are blackletter in the original. A tag in the middle of a name would stop our regexes from matching it, so this is what I propose:
- An identity transform replaces each opening attribute-less <hi> tag with "→ " (right-pointing arrow followed by a space), and each matching closing tag with " ←" (a space followed by a left-pointing arrow).
- The named-entity regex construction code includes the two arrow characters alongside spaces as delimiters in a character class for each regex fragment. This means the arrows will not prevent matches (assuming they wrap at word boundaries, which is the norm).
- Text with arrows is tagged by identity transform as planned.
- A perl search-and-replace puts the <hi> elements back.
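To make the round trip concrete, here is a minimal sketch in Python. It stands in for the identity transforms and the perl step; it is not the actual MoEML code. The function names, the `<name>` element used for tagging, and the sample strings are all assumptions made for illustration, and the sketch assumes every `<hi>` in the input is attribute-less and that the text contains no arrow characters of its own.

```python
import re

# Placeholder strings from the proposal: each arrow plus its adjacent space.
OPEN_MARK = "\u2192 "   # right-pointing arrow followed by a space
CLOSE_MARK = " \u2190"  # space followed by a left-pointing arrow

# Delimiter class: whitespace or either arrow, one or more times.
DELIMITER = "[\\s\u2192\u2190]+"


def hide_hi(text):
    """Replace <hi> tags with arrow placeholders.

    Assumption: every <hi> here is attribute-less, so every </hi> belongs
    to one of them. Real data would need a proper XML-aware transform.
    """
    return text.replace("<hi>", OPEN_MARK).replace("</hi>", CLOSE_MARK)


def build_name_regex(name):
    """Join the escaped words of a name with the delimiter class."""
    return re.compile(DELIMITER.join(re.escape(word) for word in name.split()))


def tag_names(text, names):
    """Wrap every match in a <name> element (hypothetical element name)."""
    for name in names:
        text = build_name_regex(name).sub(
            lambda m: "<name>" + m.group(0) + "</name>", text
        )
    return text


def restore_hi(text):
    """Put the <hi> tags back in place of the arrow placeholders."""
    return text.replace(OPEN_MARK, "<hi>").replace(CLOSE_MARK, "</hi>")


def auto_tag(text, names):
    """Run the three phases in order: hide, tag, restore."""
    return restore_hi(tag_names(hide_hi(text), names))


# Whole name in blackletter: the result nests cleanly.
whole = auto_tag("in <hi>Cheape side</hi> neere", ["Cheape side"])
# Only part of the name in blackletter: the name still matches.
partial = auto_tag("by <hi>Cheape</hi> side", ["Cheape side"])
```

With these made-up inputs, `whole` comes back as `in <hi><name>Cheape side</name></hi> neere`. `partial` still matches, because the arrow sits inside the delimiter class, but it comes back as `by <hi><name>Cheape</hi> side</name>`, with the two elements overlapping.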
Potential issues include the last phase, where we might get overlapping tags instead of clean nesting. We'll have to see if that happens; if so, the perl process might be able to fix it, or a subsequent processing step might.
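One way to find out whether overlap actually occurs is to run each output fragment through an XML parser, since overlapping tags are not well-formed. The sketch below is in Python rather than perl, and it only illustrates the idea. The `<name>` element, the helper names and the sample strings are hypothetical. The repair covers only the two simplest cases, where a `<hi>` starts or ends at a name boundary and there is no other markup in between.

```python
import re
import xml.etree.ElementTree as ET


def is_cleanly_nested(fragment):
    """Return True if the fragment parses as well-formed XML."""
    try:
        # Wrap in a dummy root so mixed text and elements can be parsed.
        ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    except ET.ParseError:
        return False
    return True


def repair_overlap(fragment):
    """Move a <hi> boundary inside the <name> it overlaps (simple cases only)."""
    # <hi><name>Cheape</hi> side</name> -> <name><hi>Cheape</hi> side</name>
    fragment = re.sub(r"<hi><name>([^<]*)</hi>", r"<name><hi>\1</hi>", fragment)
    # <name>Cheape <hi>side</name></hi> -> <name>Cheape <hi>side</hi></name>
    fragment = re.sub(r"<hi>([^<]*)</name></hi>", r"<hi>\1</hi></name>", fragment)
    return fragment


overlapping = "by <hi><name>Cheape</hi> side</name>"
repaired = repair_overlap(overlapping)
```

Here `is_cleanly_nested(overlapping)` is false and `is_cleanly_nested(repaired)` is true. Fragments that already nest cleanly pass through `repair_overlap` unchanged, so it could be run over everything and followed by the well-formedness check to catch the cases it does not handle.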