Met with CC to discuss the grant application and the TRUTH presentation in September, and also fixed a couple of things in the db (publishing Le Blanc).
Category: "Activity log"
Met with CC to go over plans for the application, and tweak the French translation of the technical description we wrote the other week.
Met with CC to write a preliminary draft of a section of the grant application dealing with the proposed normalization and search functionality. This was a useful exercise, forcing me to make all the details explicit and explain them in clearer terms than I had been using even to myself. The plan still looks good, and I'm looking forward to making more detailed plans based on this (especially plans for the creation of normalization rules, and an automated system for testing them and evaluating the results).
Tested out Franscriptor.com with some sample text from our content, to see what it's doing and to try to deduce how it does it (it's a black box). It offers to "dissimiler" and "détilder" the text (roughly, to dissimilate and to remove tildes), but it's not clear exactly what either operation covers. This is what I've learned:
- It does nothing with long s, so that has to be normalized before submission.
- It expands ligatures such as œ.
- It does quite a good job with u/v normalization, although it failed with "oeuures".
- Many anachronistic spellings survive unchanged ("luy", "bastir", "tousjours"), so it's clearly not trying to do modernization.
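Since long s has to be dealt with before submission anyway, here's a minimal sketch of that pre-normalization pass in Python (the ligature table is illustrative, not exhaustive):

```python
# Minimal sketch of the pre-normalization Franscriptor seems to need:
# long s must be replaced before submission, and ligatures can be
# expanded on our side too (Franscriptor handles œ itself, but doing it
# locally keeps the pipeline self-contained).
LIGATURES = {
    "ſ": "s",    # long s: Franscriptor does nothing with this
    "œ": "oe",
    "Œ": "OE",
    "æ": "ae",
    "Æ": "AE",
}

def pre_normalize(text: str) -> str:
    """Replace long s and expand ligatures before further processing."""
    for old, new in LIGATURES.items():
        text = text.replace(old, new)
    return text

print(pre_normalize("les cauſes & les œuures"))  # → les causes & les oeuures
```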
GM is now linking from the Ville-Thierry to existing references.
Met with CC and examined some of the outcomes from our rulesets. There's obviously a huge amount of tuning still to do, but it's also clear that before each rule is run, the word needs to be checked against the dictionary in case it's already OK; if it is, then we don't need to keep working on it. I've now implemented that by turning the spell-check dictionary into an XML file which is then indexed with xsl:key (I tried other string-finding methods but they were much slower). The transformation now takes substantially longer than it used to, but it's clearer what's happening. One issue might be archaic forms in the spell-check dictionary, of course.
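A procedural sketch of that dictionary-first check (in Python rather than XSLT, with a hypothetical wordlist file; in the real pipeline the dictionary is an XML file indexed with xsl:key):

```python
# Sketch of the "check the dictionary first" step: before any rule is
# applied to a word, see whether it is already a good modern form, and
# stop as soon as a rule produces one. The wordlist file and the sample
# rule are illustrative, not the project's actual resources.
def load_dictionary(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def normalize_word(word: str, dictionary: set[str], rules) -> str:
    if word.lower() in dictionary:       # already OK: skip all rules
        return word
    for apply_rule in rules:
        word = apply_rule(word)
        if word.lower() in dictionary:   # a rule produced a good word: stop
            break
    return word

rules = [lambda w: w.replace("sj", "j")]          # illustrative rule only
print(normalize_word("tousjours", {"toujours"}, rules))  # → toujours
```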
Another issue is u/v variation. When we change one to the other, we often end up changing it back in a later rule. It seems likely that a better approach would be to change all u/v to another unused symbol, and then write rules based on context for changing that symbol to the appropriate output.
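A sketch of that placeholder idea in Python (the context rules below are illustrative guesses at the typographic conventions, not our actual rules):

```python
import re

# Sketch of the proposed u/v strategy: collapse every u and v to an
# otherwise-unused placeholder, then decide once, from context, which
# letter each placeholder should become. The contexts encoded here are
# guesses for illustration only.
PLACEHOLDER = "¤"
VOWELS = "aeiouyàâäéèêëîïôöùûü"

def collapse_uv(word: str) -> str:
    return re.sub(r"[uv]", PLACEHOLDER, word)

def resolve_uv(word: str) -> str:
    chars = list(word)
    for i, ch in enumerate(chars):
        if ch != PLACEHOLDER:
            continue
        prev = chars[i - 1] if i > 0 else ""
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        if i == 0:
            chars[i] = "v" if nxt in VOWELS else "u"   # "vous" stays, "vne" → "une"
        elif prev in VOWELS and nxt in VOWELS:
            chars[i] = "v"                             # intervocalic: "auoir" → "avoir"
        else:
            chars[i] = "u"
    return "".join(chars)

print(resolve_uv(collapse_uv("auoir")))  # → avoir
print(resolve_uv(collapse_uv("vne")))    # → une
```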
I've been doing preliminary work on the text-extraction and normalization problem. I've completed the initial rather difficult task of extracting the text and linking it back to the original locations in the source document, and I've started playing around with normalization rules. I took the long series of substitutions CC sent me in an earlier email and encoded them as search/replace operations; I'm using a spreadsheet to store them, and generating the required XML block automatically from it. I've since tweaked a few of the rules. I'm working with duchesse_de_milan.xml as a test document initially, and I've hooked in a CSS stylesheet which makes it almost readable in original and "normalized".
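A sketch of the spreadsheet-to-XML step in Python (the two-column layout and the <rule> element shape are assumptions; the real spreadsheet columns and generated block may differ):

```python
import csv
import io
from xml.sax.saxutils import escape

# Sketch of generating an XSLT-ready rule block from the spreadsheet of
# search/replace operations. The CSV layout and the <rule> element are
# assumptions for illustration; the rules shown are examples only.
csv_text = """search,replace
sj,j
y,i
"""

rows = csv.DictReader(io.StringIO(csv_text))
rules_xml = "\n".join(
    '<rule search="{}" replace="{}"/>'.format(
        escape(r["search"], {'"': "&quot;"}),
        escape(r["replace"], {'"': "&quot;"}),
    )
    for r in rows
)
print(rules_xml)
```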
More often than not, our current rules take a good word and turn it into something incorrect. That's partly because many of the rules are, as yet, underspecified; for instance, some rules should only act at the beginning of a word, and others only in very specific contexts. Working through the rules to improve them, based on the errors, will help a lot, and I think we'll also be able to improve the output by putting them in a particular order.
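To illustrate, here's what better-specified, ordered rules might look like as anchored regexes (a Python sketch; the two rules are illustrative, not drawn from our actual ruleset):

```python
import re

# Sketch of rules with explicit anchoring and a fixed application
# order: the first acts only at the start of a word, the second only in
# a specific final context. Both rules are illustrative examples.
RULES = [
    (re.compile(r"^es(?=[bpt])"), "é"),   # word-initial only: "estoit" → "étoit"
    (re.compile(r"oi(?=t$)"), "ai"),      # only before a final t: "étoit" → "était"
]

def apply_rules(word: str) -> str:
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word

print(apply_rules("estoit"))  # → était
```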
The other thing that's missing, at the moment, is a check on the word before it's normalized; I should be checking each word against a modern dictionary before anything is done to it, and only making changes if it turns out not to be a good dictionary word.
But I think we can see the scale of the task ahead of us. It'll take some months to refine our ruleset to the point where we're getting consistently good results.
I now have my XSLT module successfully reconstituting a line-broken word on both sides of the break, like this:
<ab corresp="mar:textnode#xpath(/*/*/*/*/*/text())"><seg> </seg><w corresp="mar:offset#xpath(substring(., 22, 3))"><choice><orig>ant</orig><reg type="joined-2">imagiant</reg></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 26, 3))"><choice><orig>que</orig></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 30, 4))"><choice><orig>Vous</orig></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 35, 5))"><choice><orig>pren-</orig><reg type="joined-1">prendrez</reg></choice></w></ab><ab corresp="mar:textnode#xpath(/*/*/*/*/*/text())"><seg> </seg><w corresp="mar:offset#xpath(substring(., 22, 4))"><choice><orig>drez</orig><reg type="joined-2">prendrez</reg></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 27, 7))"><choice><orig>quelque</orig></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 35, 8))"><choice><orig>intereſt</orig></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 44, 1))"><choice><orig>à</orig></choice></w></ab>
It's nasty-ugly but it's only intended for machines to read. Having the full form of the word on both sides of the linebreak means we'll be able to do n-grams properly, and having the two joined forms labelled differently (joined-1 and joined-2) means we'll be able to ignore one of them if we're reconstituting a continuous string.
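For example, a downstream consumer could rebuild the continuous string from this markup by preferring <reg> over <orig> and skipping any joined-2 word, since its full form already appeared at the joined-1 half of the break. A minimal Python sketch against a simplified version of the markup above (the corresp attributes are omitted here for brevity):

```python
import xml.etree.ElementTree as ET

# Simplified version of the two <ab> elements shown above, wrapped in a
# root element so it parses as a single document.
sample = """<root>
<ab><seg> </seg><w><choice><orig>pren-</orig><reg type="joined-1">prendrez</reg></choice></w></ab>
<ab><seg> </seg><w><choice><orig>drez</orig><reg type="joined-2">prendrez</reg></choice></w><seg> </seg><w><choice><orig>quelque</orig></choice></w></ab>
</root>"""

def continuous_text(xml: str) -> str:
    root = ET.fromstring(xml)
    words = []
    for w in root.iter("w"):
        reg = w.find("choice/reg")
        if reg is not None and reg.get("type") == "joined-2":
            continue  # second half of a line-broken word: already emitted
        words.append(reg.text if reg is not None else w.find("choice/orig").text)
    return " ".join(words)

print(continuous_text(sample))  # → prendrez quelque
```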
I've had to resort to a second pass through the data to count offsets, and that's now working reliably. I've also got the reconstitution of hyphenated words at linebreaks working, but only most of the time; for some reason, when the linebreak precedes an <fw> element, the reconstitution fails. I'm still working on that, but it's very mysterious. I'll probably have to create some test data rather than working on real files until I get it sorted out.
All in all, though, very promising progress.
I now have the XSLT breaking down each text node into a series of components: either whitespace (passed through as plain text), punctuation sequences (tagged with <pc>), or word[-fragment]s (tagged with <w>, with much more tagging due in subsequent phases).
My current problem is the requirement to record the offset and length of each word in the original text node, so that a search engine can find its way from the modernized source back to the original text. Length is easy, but offset is proving difficult. I have a question posted on the XSLT list in the hope of some help, but it may be that we have to go in two stages: pre-process to create the <ab> element, which is stored in a variable, and then post-process, where the <ab> element and its contents are re-analyzed and additional tagging is added based on that analysis, before the resulting enriched <ab> is output.
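For comparison, in a procedural language the offset bookkeeping is straightforward, which is roughly what the two-pass XSLT approach has to reproduce. A Python sketch of tokenizing a text node into whitespace / punctuation / word tokens, recording each token's offset and length:

```python
import re

# Sketch of the offset problem in a procedural setting: split a text
# node into whitespace, punctuation, and word tokens, and record each
# token's starting offset and length in the original string. re.finditer
# hands us the offsets directly via the match objects.
TOKEN = re.compile(r"\s+|[^\w\s]+|\w+", re.UNICODE)

def tokens_with_offsets(text: str):
    for m in TOKEN.finditer(text):
        tok = m.group()
        kind = ("space" if tok.isspace()
                else "word" if tok[0].isalnum() or tok[0] == "_"
                else "punct")
        yield kind, m.start(), len(tok), tok

for token in tokens_with_offsets("Vous pren-"):
    print(token)
```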