Automating markup of common abbreviations
JMH wants to be sure that we are accurately transcribing the superscripted abbreviations of common words, such as "Govt", which appear all over the place. It has occurred to me that we could enable expansion of these quite easily using a search-and-replace, like this:
[Gg]ov<hi rend="[^"]+super+[^"]+">[^<]*t</hi>
which finds all the various abbreviations for "government" without finding those for "governor". We can do this for a range of common abbreviations. I've already written some XSLT to make them into mouseovers, as we do with the <choice>/<sic>/<corr> sets.
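As a quick sanity check on that claim, here is a minimal Python sketch that runs the pattern over a few sample strings. The rend values and the sample abbreviations are invented for illustration; the real files may use different ones.

```python
import re

# Pattern copied from the post. The sample strings below are assumptions,
# not taken from the actual transcriptions.
GOVT = re.compile(r'[Gg]ov<hi rend="[^"]+super+[^"]+">[^<]*t</hi>')

samples = {
    'Gov<hi rend="vertical-align: super;">t</hi>': True,      # Govt
    'gov<hi rend="vertical-align: super;">ernmt</hi>': True,  # longer form ending in t
    'Gov<hi rend="vertical-align: super;">r</hi>': False,     # Governor
    'Gov<hi rend="font-style: italic;">t</hi>': False,        # not a superscript
}

for text, expected in samples.items():
    found = GOVT.search(text) is not None
    print(f"{str(found):5} (expected {expected}) {text}")
```

One thing worth noting: as written, the pattern needs at least one character on either side of "super" inside the attribute value, so a bare rend="super" would not match.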
The whole match can then be reused as a backreference ($0) with <abbr> wrapped around it, using the following replacement:
<choice><abbr>$0</abbr><expan>Government</expan></choice>
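To see what this replacement does in practice, the following sketch applies it in Python, where the whole-match backreference is spelled \g<0> rather than $0. The sample line is made up. Running the substitution a second time on its own output shows that the unguarded pattern wraps the same abbreviation again.

```python
import re

GOVT = re.compile(r'[Gg]ov<hi rend="[^"]+super+[^"]+">[^<]*t</hi>')
# \g<0> is Python's equivalent of the $0 whole-match backreference.
REPLACEMENT = r'<choice><abbr>\g<0></abbr><expan>Government</expan></choice>'

line = 'the Gov<hi rend="vertical-align: super;">t</hi> has'  # invented sample
once = GOVT.sub(REPLACEMENT, line)
twice = GOVT.sub(REPLACEMENT, once)

print(once)
# Counting <choice> elements after a second pass reveals the nesting.
print(twice.count("<choice>"))
```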
Initially I thought it might be simpler to replace instances beginning with a capital and those beginning with a lower-case letter separately, so that we can provide the accurate expansion in each case, but see below.
EDIT: I've refined the regex so that it won't operate on an instance that has already been processed, since we'll probably have to run it on files multiple times. This seems to be the best way to do it, using a negative lookbehind assertion, and capturing the first letter separately so we can reproduce capital or lower-case:
(?<!abbr>)([Gg])(ov<hi rend="[^"]+super+[^"]+">[^<]*t</hi>)
<choice><abbr>$1$2</abbr><expan>$1overnment</expan></choice>
This seems to be working, but I'll need to do some more careful testing before setting it loose.
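One way to start that testing is a small script that checks the guarded regex is safe to run repeatedly. This is only a sketch: expand_govt is a hypothetical helper name, the sample text is invented, and reusing the captured first letter inside <expan> is my reading of the intent rather than something confirmed on the real files.

```python
import re

# Guarded pattern from the post: skip matches already preceded by "abbr>".
GUARDED = re.compile(
    r'(?<!abbr>)([Gg])(ov<hi rend="[^"]+super+[^"]+">[^<]*t</hi>)'
)

def expand_govt(text: str) -> str:
    """Wrap unprocessed abbreviations, carrying the first letter's case over."""
    return GUARDED.sub(
        r'<choice><abbr>\g<1>\g<2></abbr>'
        r'<expan>\g<1>overnment</expan></choice>',
        text,
    )

sample = ('The Gov<hi rend="vertical-align: super;">t</hi> and '
          'the gov<hi rend="vertical-align: super;">t</hi>')
once = expand_govt(sample)
twice = expand_govt(once)

# A second pass should leave already-processed text untouched.
assert once == twice
print(once)
```

The lookbehind works here because "abbr>" has a fixed length, which Python's re module requires for lookbehind assertions; other regex engines may differ on that point.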