Annotations and collations

Long day with lots of work done on annotations and collations. We have a fairly solid structure up and running. ise:rdg/@resp is becoming tei:rdg/@wit, which will point to the @xml:id of a tei:witness in the header of the document; the witness in turn uses @corresp to point to a series, etc. Annotations have become a list of tei:notes with spans, glosses, and other notes. Still have to figure out iembeds, which will be worthwhile when converting the XWiki documents. This is all coming together nicely.
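
A minimal sketch of how the rdg/@wit pointers could be sanity-checked, assuming witnesses are declared as tei:witness elements with xml:id and that @wit holds space-separated "#id" pointers. The function name and the sample document are hypothetical, not part of the ISE3 build.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def unresolved_witnesses(tei_xml: str) -> set[str]:
    """Return the @wit pointers that match no tei:witness/@xml:id (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    # Collect every declared witness id.
    known = {w.get(XML_ID) for w in root.iter(TEI + "witness")}
    used = set()
    for rdg in root.iter(TEI + "rdg"):
        # @wit may hold several space-separated pointers such as "#F1 #Q2".
        for pointer in rdg.get("wit", "").split():
            used.add(pointer.lstrip("#"))
    return used - known

# Made-up sample: one declared witness, one dangling pointer.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><listWit>'
    '<witness xml:id="F1"/></listWit></teiHeader><text><app>'
    '<rdg wit="#F1">a</rdg><rdg wit="#Q2">b</rdg></app></text></TEI>'
)
print(unresolved_witnesses(SAMPLE))
```

In practice a check like this would more likely live in Schematron alongside the other project rules; the Python version is only for illustration.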


ISE3 mtg (JJ, MT, DJ)

Had a whole-day meeting with JJ, MT, and DJ to go over ISE3 implementation. Lots of great stuff discussed, most of which is documented through Asana and the GitHub repo. And good headway made with annotations and collations; I think we've discussed it enough to start writing some code that processes annotations and collations.


Work on Annotations

Talking at length with MT about annotations and collations. He gave me a good run-through of how apparatus work and we made some progress thinking about how it will be implemented. We're still not sure about which method of annotation is best for the project, particularly since we're sort of wedded to string-matching. We've been wading through the TEI guidelines trying to find the most appropriate method for attachment.


ISE3 Conversion (420+ 375)

Post for June 14 and 15: working on ISE3 TEI conversion. Long time spent trying to deal with Unicode characters that were being garbled in OSX. Solved by creating a small XSLT for conversion that used xsl:analyze-string to tokenize each character and fn:string-to-codepoints to check whether the character should be escaped. Seems to be working well now. Discussed file structure with JJ and MT and came to this conclusion: each edition gets its own folder (named with the ISE work id without the 'ise' prefix) that has documents, etc. Note: we are not future-proofing the ISE to think about more than 1 edition.
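
For reference, the per-character escaping idea can be sketched outside XSLT. This is a rough Python equivalent, not the stylesheet itself; the function name is hypothetical and the threshold (escape everything above ASCII) is an assumption, since the log does not say which code points were escaped.

```python
def escape_non_ascii(text: str, threshold: int = 127) -> str:
    """Replace each character above `threshold` with a hex numeric character reference.

    The threshold is an assumption for illustration only.
    """
    out = []
    for ch in text:  # visit one character at a time, like tokenizing with analyze-string
        cp = ord(ch)  # code point, the analogue of fn:string-to-codepoints
        out.append(ch if cp <= threshold else f"&#x{cp:X};")
    return "".join(out)

print(escape_non_ascii("\u017ft \u00e6"))  # long s + t, then ae ligature
```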


Added handling for glyphs and respStmts into the first and second passes (respectively) of the build process, which seems to be working well. Chatted with MT about linebreaks and he explained the various ways the ISE has used milestones and linebreaks. Summary of points made by MT:
  • TLNs, QLNs, L, and MS tags are all milestone units and not end-of-line breaks
  • Editors can make up their own types of milestones with their own numbering systems if they choose
  • White space line breaks (\n) are significant and represent the linebreaks for that particular edition
This poses some questions to ask MH and JJ at some point. Should all milestone-like things become milestones, with '\n's becoming <lb edRef="#thisEd"/>? MT also mentioned that since we're now using the ISE tools to manipulate the IML, we don't need the multiple regex conversions that we're currently using.
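
To make the question concrete, here is a minimal sketch of the '\n' to <lb/> half of that idea. It is not a decision about the conversion; it works on a plain string rather than parsed IML, and the function name and default pointer are placeholders.

```python
def mark_line_breaks(text: str, ed_ref: str = "#thisEd") -> str:
    """Insert an <lb/> for every significant newline (hypothetical helper).

    The newline itself is kept after the <lb/> so the output stays readable.
    """
    lb = f'<lb edRef="{ed_ref}"/>'
    return text.replace("\n", lb + "\n")

print(mark_line_breaks("first line\nsecond line"))
```

Milestone units (TLN, QLN, etc.) are left out here because their mapping is exactly what still has to be settled.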


Continued working with taxonomies, and spent some time with JJ formalizing and finishing the responsibility taxonomy. Didn't have time to encode it today, but it's in an easily processable form, so it should be quick to add to the taxonomies document. Also discussed glyphs and chars with MH and MT; we decided that only glyphs that had standard correlations (most ligatures) would be encoded in the taxonomy. Digraphs, accents, and other characters that are untranslatable would not.


Unicode, glyphs, and chars

Working with MH and MT on figuring out special character stuff. MT already had some utilities built to deal with characters encoded like {s} in Unicode, so there's a starting point. I had to wrestle with the SGML to TEI code a bit to get it to work. Started to build a taxonomies doc in ISE3's SVN repo with the chars in it. Note: OSX's documentation seems to be false when it says that by default it encodes in UTF-8. It was producing wonky output unless you appended the argument: -b=UTF-8.


More work done on the SGML to TEI conversion. Prose and verse seem to be handled well and the creation of the listPers in the particDesc seems right as well. I began looking at the code mapping for ligatures and other special characters. Not sure what exactly needs to happen there. Created a bash script that looks through the IML files to investigate how often each character is used; I couldn't use XSLT or anything like that since the ampersands, etc., make the text difficult to search. I now have a script that lists the characters by number of uses. It needs to be finished, however, so that we can see which files contain these characters and how we should work with them.
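
A rough sketch of what the finished tally might look like, written in Python rather than bash. The pattern is an assumption: it treats short brace codes like {s} and ampersand entities as the special characters of interest. All names here are hypothetical, and the sample texts are made up.

```python
import re
from collections import Counter, defaultdict

# Assumed shapes for special-character codes: {s}-style brace codes and &...; entities.
CODE_PATTERN = re.compile(r"\{[^{}\s]{1,8}\}|&[#\w]+;")

def tally_codes(docs):
    """Count each code across all documents and record which files contain it.

    `docs` maps a file name to its text, so the function can be tried without real IML files.
    """
    totals = Counter()
    files_by_code = defaultdict(set)
    for name, text in docs.items():
        for code in CODE_PATTERN.findall(text):
            totals[code] += 1
            files_by_code[code].add(name)
    return totals, files_by_code

# Made-up sample input standing in for IML files.
docs = {"a.txt": "{s}ome {s}uch &amp; {ae}", "b.txt": "{s}"}
totals, files_by_code = tally_codes(docs)
for code, count in totals.most_common():
    print(count, code, sorted(files_by_code[code]))
```

Real files could be loaded into `docs` with pathlib, e.g. by reading each file under the IML directory as UTF-8 text.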


Coding standards meeting

Met with MT and JT. We established coding and documentation standards for XSLT, Schematron, and Ant (to some degree), and figured out how the ISE2, ISE3 and GitHub repos will relate to each other and to the Jenkins build. Documentation is in the GitHub repo.

Setting up ISE3

Worked on H5 and got it validating in TEI, and worked a bit on the personography. Met with MH and MT to discuss code standards, which was very interesting and helpful.

