This post details a method used to fix the following problem:
XML files were downloaded through a web browser from revision.hcmc.uvic.ca. They were XML in UTF-8, but Apache didn't know that, so it served them as ISO-8859-1, and the browser happily believed it. All Japanese characters inside the files therefore got borked.
Work proceeded without this being noticed, resulting in a pair of -ography files in the RADish part of the repo which had lots of useful work invested in them, but whose Japanese characters were all completely broken.
Googling around turned up the Python tool FTFY, which works a treat; it can disembork borken Unicode with remarkable effectiveness. The only situations where it failed were cases of a single isolated character which was itself an archaic or obsolete form.
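By way of illustration (the sample string is invented, not one from the repo), here is the round trip that does the damage, and ftfy repairing it, in a few lines of Python:

# Minimal demonstration of the UTF-8-read-as-8859-1 round trip and ftfy's
# repair; the sample string is made up for illustration.
import ftfy

good = "日本語"
borked = good.encode("utf-8").decode("iso-8859-1")   # what the browser effectively did
print(borked)                  # mojibake: 'æ\x97¥æ\x9c¬èª\x9e'
print(ftfy.fix_text(borked))   # ftfy recovers '日本語'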
So the question was how to present the isolated blocks of broken text to FTFY, get them fixed, and then re-integrate them into the file. This is how:
First I cloned the FTFY repo and ran

python3 setup.py install

to get a command-line tool (this works better than running stuff in the Python interpreter). Then I wrote two XSLT files:
RADish/xsl/fix_chars_1_extract_text_nodes.xsl processes the original file and finds all text nodes which have an ancestor::*[@xml:lang='ja'] or a parent::g. It replaces each text node with a temporary <distinct> element carrying a unique xml:id, and it creates a separate text file consisting of a list of those ids, each followed by a colon and the borked text.
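This is not the actual XSLT, but a rough equivalent of that first step sketched in Python with lxml, just to make the mechanics concrete. The file names (other than inputfile.txt, which matches the ftfy command below), the id scheme and the lack of TEI namespace handling are all simplifications:

# Sketch of the extraction step (what fix_chars_1_extract_text_nodes.xsl does),
# written in Python/lxml for illustration; real namespaces, whitespace
# handling and id generation are glossed over.
from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

tree = etree.parse("original.xml")            # hypothetical file name
lines = []

nodes = tree.xpath("//text()[ancestor::*[@xml:lang='ja'] or parent::g]")
for n, node in enumerate(nodes):
    if not node.strip():
        continue                              # leave whitespace-only nodes alone
    distinct = etree.Element("distinct")
    distinct.set(XML_ID, f"fix_{n:04d}")      # unique temporary id
    parent = node.getparent()
    if node.is_text:
        # the node is the element's leading text: placeholder goes first
        parent.text = None
        parent.insert(0, distinct)
    else:
        # the node is an element's tail: placeholder goes right after it
        grandparent = parent.getparent()
        parent.tail = None
        grandparent.insert(grandparent.index(parent) + 1, distinct)
    lines.append(f"fix_{n:04d}: {node}")      # id, colon, borked text

tree.write("placeholders.xml", encoding="utf-8", xml_declaration=True)
with open("inputfile.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

Then: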
ftfy -o outputfile.txt inputfile.txt
fixes almost all of the text inside that text file.
RADish/xsl/fix_chars_2_reinsert_text_nodes.xsl then reads the ftfy output file, builds a map from it, and processes the temporary <distinct> elements back into text nodes containing the un-borked Japanese.
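Again, not the real stylesheet, but the reinsertion step sketched the same way in Python/lxml; outputfile.txt matches the ftfy command above, and the other names are the same invented ones as in the first sketch:

# Sketch of the reinsertion step (what fix_chars_2_reinsert_text_nodes.xsl does),
# in Python/lxml for illustration.
from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Build the id -> fixed-text map from ftfy's output.
fixed = {}
with open("outputfile.txt", encoding="utf-8") as f:
    for line in f:
        ident, _, text = line.rstrip("\n").partition(": ")
        fixed[ident] = text

tree = etree.parse("placeholders.xml")
for distinct in tree.xpath("//distinct"):
    parent = distinct.getparent()
    previous = distinct.getprevious()
    # The repaired text takes the placeholder's position in the mixed content.
    text = fixed[distinct.get(XML_ID)] + (distinct.tail or "")
    if previous is None:
        parent.text = (parent.text or "") + text
    else:
        previous.tail = (previous.tail or "") + text
    parent.remove(distinct)

tree.write("fixed.xml", encoding="utf-8", xml_declaration=True)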
NH is now in the process of manually fixing the few dozen remaining issues, after we devised some XPath to discover them and fixed a hundred or so together to get the hang of it.
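For finding the leftovers, something along these lines does the job. This is an illustrative guess at the kind of check involved, not the actual XPath we devised: any text inside an xml:lang='ja' element that still contains Latin-1 letters (æ, ã, é and friends) is almost certainly unrepaired mojibake.

# Hedged illustration of flagging remaining borked spans; the file name is the
# invented one from the sketches above, and the heuristic is an assumption.
import re
from lxml import etree

tree = etree.parse("fixed.xml")
latin1_letters = re.compile(r"[\u00c0-\u00ff]")   # tell-tale mojibake debris
for node in tree.xpath("//text()[ancestor::*[@xml:lang='ja']]"):
    if latin1_letters.search(node):
        print(node.getparent().tag, str(node))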