I've been doing preliminary work on the text-extraction and normalization problem. I've completed the initial, rather difficult task of extracting the text and linking it back to its original locations in the source document, and I've started experimenting with normalization rules. I took the long series of substitutions CC sent me in an earlier email and encoded them as search/replace operations; I'm storing them in a spreadsheet and generating the required XML block from it automatically. I've since tweaked a few of the rules. I'm using duchesse_de_milan.xml as an initial test document, and I've hooked in a CSS stylesheet which renders it almost readably in both its original and "normalized" forms.
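To sketch what I mean by generating the XML block from the spreadsheet: something along these lines, assuming the rules are exported as simple two-column CSV rows (the rule pairs and element names here are illustrative stand-ins, not the actual format).

```python
import csv
import io
from xml.sax.saxutils import escape

# Stand-in for the exported spreadsheet: one search,replace pair per row.
rules_csv = io.StringIO("vne,une\nvn,un\n\u017f,s\n")

def rules_to_xml(fh):
    """Turn spreadsheet rows into a block of <replace> elements."""
    lines = ["<rules>"]
    for search, replace in csv.reader(fh):
        lines.append(
            '  <replace search="%s" with="%s"/>'
            % (escape(search, {'"': "&quot;"}),
               escape(replace, {'"': "&quot;"}))
        )
    lines.append("</rules>")
    return "\n".join(lines)

print(rules_to_xml(rules_csv))
```

Keeping the spreadsheet as the single source of truth and regenerating the XML on every change means the rules stay easy to reorder and annotate without hand-editing markup.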
More often than not, our current rules take a good word and turn it into something incorrect. That's partly because many of the rules are, as yet, underspecified: some should act only at the beginning of a word, and others only in very specific contexts. Working through the rules and refining them based on the errors they produce will help a lot, and I think we'll also be able to improve the output by applying them in a deliberate order.
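The context restrictions and ordering could be expressed as anchored regular expressions applied in sequence, roughly like this (the three rules below are illustrative guesses at the kind of substitution involved, not CC's actual list):

```python
import re

# Ordered (pattern, replacement) pairs. Order matters: an early rule
# can feed, or block, a later one, so the sequence is part of the rule.
RULES = [
    # Word-initial only: "vne" -> "une", but leave "avoir" etc. alone.
    (re.compile(r"^v(?=[^aeiou])"), "u"),
    # Specific context only: intervocalic "u" -> "v", so "auoir" -> "avoir".
    (re.compile(r"(?<=[aeiou])u(?=[aeiou])"), "v"),
    # Anywhere in the word: long s -> s.
    (re.compile("\u017f"), "s"),
]

def normalize(word):
    """Apply every rule, in order, to a single word."""
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word

print(normalize("vne"))    # -> une
print(normalize("auoir"))  # -> avoir
```

The anchors (`^`, lookahead, lookbehind) are what turn a blunt global substitution into one that fires only where it should, which is exactly the underspecification problem above.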
The other thing missing at the moment is a check on each word before it's normalized: every word should be tested against a modern dictionary first, and only changed if it turns out not to be a valid dictionary word already.
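That gate would be a thin wrapper around the rule chain, something like the sketch below, where the tiny word set and the trivial stand-in rule are placeholders for a real modern-French word list and the full ruleset:

```python
# Placeholder for a real modern-language word list loaded from disk.
MODERN_WORDS = {"une", "avoir", "mais", "temps"}

def naive_normalize(word):
    """Stand-in for the full rule chain: one illustrative rule only."""
    return word.replace("\u017f", "s")

def maybe_normalize(word, normalize=naive_normalize):
    """Leave recognized modern words untouched; rewrite only the rest."""
    if word.lower() in MODERN_WORDS:
        return word
    return normalize(word)

print(maybe_normalize("mais"))     # already modern: unchanged
print(maybe_normalize("me\u017fme"))  # not in the dictionary: rules apply
```

The point of the gate is that a word already in the dictionary is evidence the rules have nothing to fix, so skipping it avoids the "good word turned incorrect" failures described above.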
But I think we can see the scale of the task ahead of us. It'll take some months to refine our ruleset to the point where we're getting consistently good results.