Met the rest of the Landscapes team today. Chatted with GL and SA about various plans going forward, which might mean some repository reorganization and the possibility of ODD chaining. Lots of work to do, but I think it will all be worthwhile for future linked data possibilities.
(Posting hours from yesterday). Met with GL to catch up and to discuss future plans. Plotted out some actions and then sent out a few questions regarding the TINA_NNM.
Time: 90 min
This post details a method used to fix the following problem:
XML files were downloaded through a web browser from revision.hcmc.uvic.ca. They were XML in UTF-8, but Apache didn't know that, so it served them as ISO-8859-1, and the browser happily believed it. All the Japanese characters inside the files therefore got borked.
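The breakage is easy to reproduce; a minimal sketch in Python, using an illustrative string rather than anything from the actual files:

# the classic mojibake round trip: UTF-8 bytes misread as Latin-1
# (browsers actually treat ISO-8859-1 as windows-1252, hence cp1252 here)
original = '日本語'
borked = original.encode('utf-8').decode('cp1252')
print(borked)                     # æ—¥æœ¬èªž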
Work proceeded without this being noticed, resulting in a pair of -ography files in the RADish part of the repo which had lots of useful work invested in them, but whose Japanese characters were all completely broken.
Googling around found the Python tool FTFY, which works a treat; it can disembork borken Unicode with remarkable effectiveness. The only situations where it failed were cases of a single isolated character which itself was an archaic or obsolete form.
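For reference, the library API is a one-liner (the string is the illustrative example from above):

import ftfy

# ftfy recognises the UTF-8-read-as-windows-1252 pattern and reverses it
print(ftfy.fix_text('æ—¥æœ¬èªž'))        # 日本語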
So the question was how to present the isolated blocks of broken text to FTFY, get them fixed, and then re-integrate them into the file. This is how:
First I cloned the FTFY repo and ran
python3 setup.py install
to get a command-line tool (this works better than running stuff in the Python interpreter). Then I wrote two XSLT files:
RADish/xsl/fix_chars_1_extract_text_nodes.xsl processes the original file and finds all text nodes which have an ancestor::*[@xml:lang='ja'] or a parent::g. It replaces each text node with a temporary <distinct> element with a unique xml:id. It also creates a separate text file consisting of a list of each of those ids, followed by a colon, followed by the borken texten (see the sketch below). Then:
ftfy -o outputfile.txt inputfile.txt
fixes almost all of the text inside that text file.
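The real transforms are XSLT, but to illustrate what the extraction step does, here's a rough equivalent in Python using lxml (the file names, the id scheme, and the assumption of no default namespace are all mine, not the actual stylesheet):

from lxml import etree

XML_ID = '{http://www.w3.org/XML/1998/namespace}id'
tree = etree.parse('inputfile.xml')              # hypothetical input

# assumes no default namespace on <g>; the real XSLT is namespace-aware
nodes = tree.xpath("//text()[ancestor::*[@xml:lang='ja'] or parent::g]")
with open('inputfile.txt', 'w', encoding='utf-8') as out:
    for i, node in enumerate(nodes):
        uid = f'txt_{i:04d}'                     # placeholder id scheme
        distinct = etree.Element('distinct')     # temporary marker element
        distinct.set(XML_ID, uid)
        parent = node.getparent()
        if node.is_text:                         # text at the start of its parent
            parent.text = None
            parent.insert(0, distinct)
        else:                                    # tail text following an element
            parent.tail = None
            gp = parent.getparent()
            gp.insert(gp.index(parent) + 1, distinct)
        # one id:text line per node; assumes no newlines inside text nodes
        out.write(f'{uid}:{node}\n')

tree.write('intermediate.xml', encoding='utf-8', xml_declaration=True)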
RADish/xsl/fix_chars_2_reinsert_text_nodes.xsl then reads the external file, builds a hashmap from it, and processes the temporary <distinct> elements back into text nodes containing the unborken Japanese.
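And the reverse step, again sketched with lxml rather than the actual XSLT:

from lxml import etree

XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

# build the id -> fixed-text hashmap from ftfy's output
fixed = {}
with open('outputfile.txt', encoding='utf-8') as f:
    for line in f:
        uid, _, text = line.rstrip('\n').partition(':')
        fixed[uid] = text

tree = etree.parse('intermediate.xml')
for distinct in tree.xpath('//distinct'):
    parent = distinct.getparent()
    prev = distinct.getprevious()
    text = fixed[distinct.get(XML_ID)]
    # splice the repaired text back in where the placeholder sat
    if prev is not None:
        prev.tail = (prev.tail or '') + text
    else:
        parent.text = (parent.text or '') + text
    parent.remove(distinct)                      # placeholders carry no tails

tree.write('fixed.xml', encoding='utf-8', xml_declaration=True)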
NH is now in the process of manually fixing the few dozen remaining issues, after we devised some XPath to discover them and fixed a hundred or so together to get the hang of it.
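I haven't reproduced the exact XPath here, but a rough Python equivalent of the check might look like this (the character-range heuristic is my guess at the approach, not the query we actually used):

import re
from lxml import etree

# Latin-1 supplement characters inside Japanese text are a strong
# hint of leftover mojibake
SUSPECT = re.compile(r'[\u0080-\u00ff]')

tree = etree.parse('fixed.xml')
for node in tree.xpath("//text()[ancestor::*[@xml:lang='ja']]"):
    if SUSPECT.search(str(node)):
        print(node.getparent().tag, repr(str(node))[:60])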
Created a new spreadsheet for MO at JSR's request, based on titles sold by the Custodian and their preceding titles. May need to do some additional map work to integrate multiple existing JSON files into a single one for each row. 180 minutes.
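If that goes ahead, the merge itself should be simple; a minimal sketch, assuming the files are GeoJSON FeatureCollections (the format and the paths are guesses on my part):

import json
from pathlib import Path

def merge_collections(paths):
    # concatenate the features of several GeoJSON FeatureCollections
    features = []
    for p in paths:
        data = json.loads(Path(p).read_text(encoding='utf-8'))
        features.extend(data.get('features', []))
    return {'type': 'FeatureCollection', 'features': features}

# one merged file per spreadsheet row (paths hypothetical)
merged = merge_collections(sorted(Path('json').glob('row_001_*.json')))
Path('merged/row_001.json').write_text(
    json.dumps(merged, ensure_ascii=False, indent=2), encoding='utf-8')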
The FODS tables to TEI tables conversion is now working; a quick spot check suggests that the results are correct, but we'll need to run some diagnostics on it both to ensure that the conversion worked correctly and to identify any oddities in the source data. Added a bunch of documentation, too.
Spent some time evaluating the spreadsheets and thinking through the best process for turning them into RADish. Got a basic build set up with a conversion process. So far, the process looks like so:
- Copy the files into a temporary directory and, in doing so, clean up the filenames (no brackets, no spaces)
- Take those files and convert them to FODS using soffice (there's a lot there, so it takes a while); these first two steps are sketched after this list
- Then, take those FODS files and convert them into a TEI table. I'm still working on this bit, but so far I have a process whereby the XSLT gets all the FODS files in a particular directory (using collection()) and combines them into a single TEI document with multiple tables. This might not be the right approach in the long run, but I think it makes the most sense for now, particularly if the various spreadsheets in a collection need to be reconciled
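As a rough sketch of those first two steps in Python (the directory names and the exact filename-cleaning rules are placeholders, not the actual build):

import re
import shutil
import subprocess
from pathlib import Path

SRC = Path('spreadsheets')        # hypothetical source directory
TMP = Path('tmp/cleaned')         # hypothetical staging directory
TMP.mkdir(parents=True, exist_ok=True)

# step 1: copy in, cleaning filenames as we go (no brackets, no spaces)
for f in SRC.glob('*.xlsx'):
    clean = re.sub(r'[()\[\]]', '', f.name).replace(' ', '_')
    shutil.copy(f, TMP / clean)

# step 2: batch-convert to FODS with a headless LibreOffice
subprocess.run(
    ['soffice', '--headless', '--convert-to', 'fods', '--outdir', str(TMP)]
    + sorted(str(p) for p in TMP.glob('*.xlsx')),
    check=True)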
Coded the third new spreadsheet, which wasn't as straightforward as the others. Updated documentation and data dictionary. The final spreadsheet has been dropped from the plan because it doesn't actually make sense.