Lots of work on the apparatus conversion. I've incorporated MH's character code into the XSLT, and it seems to be working: most of the plays are handled correctly. One small issue is that all of the other annotations from the XWiki docs are being pulled in as well, even though those have already been converted to inline annotations on the documents. One solution would be to keep a skip list of all the documents that don't need to be brought over.
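A minimal sketch of that skip list in the XSLT, assuming the converted XWiki docs arrive as elements with an identifying attribute (the `page` element, the `@id` attribute, and the list entries are placeholders, not the real export format):

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Documents whose annotations are already inline; skip them.
       These ids are made up for the example. -->
  <xsl:variable name="skipList" select="('doc-a', 'doc-b')"/>

  <!-- Identity transform: copy everything else through unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop any document that appears on the skip list. -->
  <xsl:template match="page">
    <xsl:if test="not(@id = $skipList)">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```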
Also began refactoring the process for attaching the standoff annotations to the texts. It's a complicated business, since much of the work is finding the right documents to attach each annotation to. Currently, the process runs like so (the tokenizing step is sketched after the list):
- Create a list of documents and their associated annotations
- Iterate through that list
- Tokenize the base text, adding an id to each character
- Attempt to match the apparatus files to the base text using those character ids
- Add anchors in the base text where the apparatus ought to attach
- Finally, untokenize the text, leaving only the anchors
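The heart of the tokenizing step is wrapping every character of the base text in an element carrying a generated id, so that lemmas can address exact offsets. A minimal sketch in XSLT 2.0; the `c` element and the id scheme are illustrative, not the pipeline's actual conventions, and a real run would restrict this to the text body rather than every text node:

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity transform: copy structure through unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Wrap each character in a <c> with a document-unique id:
       generate-id() distinguishes text nodes, position() the characters. -->
  <xsl:template match="text()" priority="1">
    <xsl:variable name="nodeId" select="generate-id(.)"/>
    <xsl:analyze-string select="." regex="." flags="s">
      <xsl:matching-substring>
        <c xml:id="{$nodeId}-{position()}">
          <xsl:value-of select="."/>
        </c>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

Untokenizing is then just the inverse: a template matching `c` that emits only its text content, leaving the inserted anchors in place.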
A better and more flexible process might be to fork on the type of text using the ISE document types: if the document is a primary source, tokenize it; otherwise, leave it alone. Then, for each apparatus document, determine which document it is attempting to match (encoded in the relatedItem in its header) and look up the tokenized version of that document. It will take longer to run, but it is much simpler than nested for-each lists in ANT. Roughly, the fork might look like the sketch below.
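This is only a guess at the shape of it; the document-type test, the header path, and the directory layout are all placeholders for however the ISE encoding actually works:

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Where an earlier build stage drops the tokenized primary sources. -->
  <xsl:param name="tokenizedDir" select="'build/tokenized/'"/>

  <xsl:template match="/">
    <xsl:choose>
      <!-- Placeholder test for "this is a primary source". -->
      <xsl:when test="//tei:catRef[contains(@target, 'primary')]">
        <!-- The tokenizing templates (earlier sketch) would live in this mode. -->
        <xsl:apply-templates select="." mode="tokenize"/>
      </xsl:when>
      <!-- Apparatus documents: follow relatedItem to the tokenized base. -->
      <xsl:when test="//tei:relatedItem">
        <xsl:variable name="target"
            select="(//tei:relatedItem)[1]/@target"/>
        <xsl:variable name="base"
            select="doc(concat($tokenizedDir, $target))"/>
        <xsl:message>
          <xsl:text>Attaching apparatus to </xsl:text>
          <xsl:value-of select="base-uri($base)"/>
        </xsl:message>
        <!-- match lemmas against $base and emit anchors here -->
      </xsl:when>
      <!-- Everything else passes through untouched. -->
      <xsl:otherwise>
        <xsl:copy-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```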
Regardless, the match_lemma module was (as MH rightly noticed) complicated and difficult to debug. I've refactored it into multiple functions and added a "verbose" switch that produces much more detailed debugging output. There's still plenty of fine-grained error checking and documentation to be done, but the module makes far more sense than it did before.
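The switch itself is nothing fancy; the pattern is roughly this (the template name and message wiring are illustrative, not match_lemma's actual code):

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Off by default; pass verbose=true as a stylesheet parameter
       to get the full trace. -->
  <xsl:param name="verbose" select="'false'"/>

  <!-- Single gatekeeper for diagnostics, so the matching functions
       never have to test $verbose themselves. -->
  <xsl:template name="log">
    <xsl:param name="msg"/>
    <xsl:if test="$verbose = 'true'">
      <xsl:message>
        <xsl:value-of select="$msg"/>
      </xsl:message>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```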