Plans and schemes for a critical edition
CC and I have been discussing the possibility of using the Sonnet de Courval material as a testing ground for automated critical-edition building. Our (currently very vague) plan is to start with the Satyre Ménipée, which exists in several editions.

What we need initially is a plausible and usefully sophisticated algorithm for generating a similarity score between lines; I suspect this will need to be based on the kinds of algorithms used in the sciences to measure, for instance, similarities between protein structures. We'd process every line in each text, first normalizing it (stripping punctuation and lower-casing it), then compare it to every other line in all the variants, computing a similarity score. The scores would then be re-processed and weighted according to the similarities of the surrounding lines. At the end of this (very computationally intensive) process, we would have a score for every line against every other line, although it would probably make sense to discard scores below a certain threshold to reduce the quantity of data. From this, we could generate a "critical edition" that would allow the reader to choose any text as a base text and view the others as variants against it.
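The pipeline described above might be sketched roughly as follows. This is only an illustration: the similarity function here is a stand-in (Python's `difflib` ratio) for whatever metric we eventually settle on, and names like `score_witnesses` and `context_weight` are my own inventions, not part of any existing tool.

```python
import difflib
import string

def normalize(line):
    """Lower-case a line and strip punctuation, as proposed above."""
    return line.lower().translate(
        str.maketrans("", "", string.punctuation)).strip()

def raw_score(a, b):
    """Placeholder line-to-line similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def score_witnesses(text_a, text_b, threshold=0.5):
    """Score every line of one witness against every line of another,
    discarding pairs below `threshold` to keep the data manageable."""
    a = [normalize(l) for l in text_a]
    b = [normalize(l) for l in text_b]
    scores = {}
    for i, la in enumerate(a):
        for j, lb in enumerate(b):
            s = raw_score(la, lb)
            if s >= threshold:
                scores[(i, j)] = s
    return scores

def context_weight(scores, i, j, window=1, blend=0.25):
    """Re-weight a pair's score by the scores of surrounding line
    pairs: a match whose neighbours also match is more trustworthy."""
    neighbours = [scores.get((i + d, j + d))
                  for d in range(-window, window + 1) if d != 0]
    neighbours = [s for s in neighbours if s is not None]
    base = scores[(i, j)]
    if not neighbours:
        return base
    return (1 - blend) * base + blend * (sum(neighbours) / len(neighbours))
```

The quadratic line-by-line comparison is what makes the process so expensive; the thresholding in `score_witnesses` is one crude way of keeping the resulting data set within bounds.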
The difficult part for me is the similarity metric itself. I've only just started reading around the subject, and so far I haven't found anything in the humanities literature that fits; it will take a lot of reading in other fields to get up to speed.
UPDATE: It looks as though CompLearn is the answer to this. The arguments in the 2005 paper that describes it are very compelling, and my tests with small text files and single lines suggest that it works extremely well, in that it basically agrees with my own common-sense view of how similar two strings are. It gives a score between 0 (identical) and (presumably) 1 for completely different strings -- although the highest I've managed to score is 0.529412 with short strings in Latin characters; comparing a short English string with a short Japanese string gives 0.789474. Interestingly, I get the same score whatever the English string is -- perhaps the compressor simply finds nothing shared between the two encodings, so the distance saturates -- but the meaning of this is slightly beyond me at the moment, and it's been a long day...
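For my own reference, the measure underlying CompLearn is the Normalized Compression Distance from the 2005 paper (Cilibrasi and Vitányi, "Clustering by Compression"). A minimal sketch of it, using zlib as the compressor rather than the compressors CompLearn actually ships with, so the exact numbers will differ from the scores above:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(s) is the compressed length of s. Scores near 0 mean
    near-identical strings; scores approach 1 as the strings share
    less compressible structure."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Note that with a real compressor even two identical strings rarely score exactly 0, because of header and block overhead on short inputs -- which may also help explain why very short strings never reach the theoretical extremes.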