Worked through Cara's comments on the two collation files she's processed, and sent this back to her (archived here because I need to keep track of it):
Just working through the comments in your file:
I noticed that you've marked the orthographic readings rather than the substantive ones. I think that's the reverse of what we agreed: since orthographic variants are more common, they should be left unmarked, and only the substantive variants should be tagged as type="substantive". That's what my app does (I got it working yesterday). If you think that's wrong, let me know why. It just seems to me to make more sense to have less markup to maintain.
With regard to this:
<!-- I'm not sure what to do with items like this where something has been deleted. -->
<app loc="2">
<lem><del>was</del></lem>
<rdg wit="#LDev145">/</rdg>
</app>
I'm not sure what's actually being encoded here. I'd read the XML like this:
- The base text had "was", but "was" was then deleted in the base text.
- The witness had a diagonal stroke (slash character) instead of the deleted "was".
If that's correct, then you don't need to do anything. If that's not what's represented by this app tag, could you tell me what it should be?
In this case:
<app loc="2">
<lem>taken</lem>
<rdg wit="#LDev145" type="orthographic">takin</rdg>
<rdg wit="#STC13860_12">was added</rdg>
<!-- Can we change this so it reads "was taken" and the "was" is tagged as an addition? -->
</app>
you could change the second <rdg> tag like this:
<rdg wit="#STC13860_12"><add>was</add> taken</rdg>
Remember that what is in the <rdg> tag is taken to be a replacement for what is in the <lem> tag. With the <add> tag, that means that "was" has been ADDED in the witness (it wasn't in the witness originally, but a scribe has added it). If what you mean is that the word "was" appears in the witness, but not in the base text, then this is all you need:
<rdg wit="#STC13860_12">was taken</rdg>
meaning that where the base text has "taken", the witness has "was taken".
On your comment here:
<app loc="5">
<lem>they slave to remain</lem>
<rdg wit="#AAH06">thie servant to remayne</rdg>
<!-- I'd break this one up into word-by-word comparisons. -->
</app>
if it is a good idea to break it up into separate <app> tags, then you'll have to do it manually, unless you can get Collate to do it for you initially. My program can't really deduce whether a long phrase should be broken up or not. In a case like this:
<app loc="9">
<lem>my desyer</lem>
<rdg wit="#LEge20">thy desire</rdg>
<!-- split up: thy is substantive, desire is orthographic -->
</app>
you'll need to split up the code manually into two app tags. I could write a routine that would let you specify an app tag and have it broken up automatically. It would have to assume that every word in the lemma has a matching word in the reading (i.e. they both have exactly the same number of words), and that the split should happen at word boundaries. This would be worth doing if the situation is likely to arise frequently. If it's more often the case that the word counts won't match exactly, or that the app tag should be split but not into a new tag for every word, then I think it's going to be more straightforward for you to do it manually.
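A rough sketch of what such a splitting routine might look like, in Python (not necessarily the language the app is written in). The function name split_app is hypothetical, and the sketch rests on several assumptions: the lemma and readings contain plain text only (no <del> or <add> inside them), every reading has exactly as many words as the lemma, each new app tag keeps the original loc value, and a reading is dropped from a new tag where its word is identical to the lemma word. Any type attribute is deliberately not copied, since each word would need categorizing again after the split.

```python
import xml.etree.ElementTree as ET


def split_app(app_xml):
    """Split one <app> into a list of word-level <app> strings.

    Assumes plain-text <lem> and <rdg> content and equal word counts;
    raises ValueError if either assumption does not hold.
    """
    app = ET.fromstring(app_xml)  # comments inside the tag are discarded
    lem = app.find("lem")
    rdgs = app.findall("rdg")
    if lem is None or not rdgs:
        raise ValueError("expected one <lem> and at least one <rdg>")

    for el in [lem] + rdgs:
        if len(el) > 0:
            raise ValueError("nested markup found; split this one manually")

    lem_words = (lem.text or "").split()
    rdg_words = [(r.get("wit"), (r.text or "").split()) for r in rdgs]
    for wit, words in rdg_words:
        if len(words) != len(lem_words):
            raise ValueError("word counts differ for witness %s" % wit)

    result = []
    for i, lem_word in enumerate(lem_words):
        new_app = ET.Element("app", dict(app.attrib))
        ET.SubElement(new_app, "lem").text = lem_word
        for wit, words in rdg_words:
            if words[i] != lem_word:  # keep only readings that differ
                ET.SubElement(new_app, "rdg", {"wit": wit}).text = words[i]
        if new_app.find("rdg") is not None:  # skip words with no variant
            result.append(ET.tostring(new_app, encoding="unicode"))
    return result


example = ('<app loc="9"><lem>my desyer</lem>'
           '<rdg wit="#LEge20">thy desire</rdg></app>')
for piece in split_app(example):
    print(piece)
```

The routine refuses anything that doesn't fit the assumptions rather than guessing, so those cases would still come back to you for a manual split.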
On this one:
<app loc="12">
<lem>ploues</lem>
<rdg wit="#LDev145" type="orthographic">plowithe</rdg>
<rdg wit="#AAH06" type="orthographic">Plowithe</rdg>
<rdg wit="#STC13860_12 #LEge20" type="orthographic">Ploweth</rdg>
<!-- this is a mistake that Collate keeps making. "Ploweth" is omitted in LEge20 -->
</app>
Looking at the original, I think this might actually be a mistake my app is making, not Collate. However, it's triggered by an inconsistency in the Collate output:
ploues ] plowithe LDev145.txt , Plowithe AAH06.txt , Ploweth STC13860_12.txt; omitted LEge20.txt
Here, the ] delimits the lemma from the readings, and a comma should delimit each of the individual readings. However, Collate seems to have arbitrarily decided it's going to use a semicolon to delimit the last reading, from the LEge20 text. Is this configurable? If so, can we make it consistently stick to the comma? My app is conflating the last two readings because it assumes the delimiter is the comma. In the meantime, I can try to rewrite my code so that it will work with mixed delimiters. I suspect that will be a bit more error-prone, but maybe not.
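For the mixed-delimiter approach, something along these lines might work. This is only a sketch in Python, not the code in my app, and parse_collate_line is a hypothetical name. It assumes every witness name ends in ".txt", as in the line above, and only treats a comma or semicolon as a delimiter when it comes directly after a witness name, so that punctuation inside a reading is less likely to be mistaken for a delimiter. It also assumes the lemma itself never contains a "]".

```python
import re

# A comma or semicolon counts as a delimiter only when it follows ".txt"
# (optionally with whitespace on either side of the punctuation).
READING_SEP = re.compile(r"(?<=\.txt)\s*[,;]\s*")


def parse_collate_line(line):
    """Return (lemma, [(witness, reading_text), ...]) for one Collate line."""
    lemma, _, rest = line.partition("]")
    readings = []
    for chunk in READING_SEP.split(rest.strip()):
        # The witness file name is the last space-separated token.
        text, _, witness = chunk.strip().rpartition(" ")
        if witness.endswith(".txt"):
            witness = witness[:-4]
        readings.append((witness, text.strip()))
    return lemma.strip(), readings


line = ("ploues ] plowithe LDev145.txt , Plowithe AAH06.txt , "
        "Ploweth STC13860_12.txt; omitted LEge20.txt")
lemma, readings = parse_collate_line(line)
print(lemma)
for witness, text in readings:
    print(witness, "->", text)
```

With the sample line this yields four separate readings, with "omitted" attached to LEge20 rather than folded into the STC13860_12 reading.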
Whatever happens, I think the workflow is going to require checking from you at each stage. We can make it go more quickly, but I don't think we'll be able to automate it completely, especially when Collate is less than 100% reliable.
I'll go back to my app and try to rework the problem above. When I've done that, I'll work from your examples and run the same two files through my app, using it to categorize the rdgs as type="substantive" or not. Where you've marked items which need further thought from you or Ray, I'll put your original comment back in.