LW pointed out some minor display issues with the end part of the Amboise part 2, so I've reworked a little of the markup, and refined a little of the XSLT, to make things show up the way they're supposed to. I should go back and do the same for volume 1 too...
The markup in this work had been diverging from validity for a long time, so a whole stack of errors had accumulated, and the overall hierarchy was no longer easily discernible. I've worked through the whole text with the primary aim of fixing the hierarchy, and we now have a valid document. However, there are a number of things that still need to be looked at. I'm recording these here so that PG and I can sit down and work through them:
- Forme works are only occasionally marked up. We need to make sure that all page numbers, catchwords and signatures are caught and tagged as `<fw>`s.
- Stanza breaks are not usually marked; the run of lines continues unbroken, inside one `<lg>` tag, through what are clearly paragraph-style breaks. We need to identify all those breaks, and then close and re-open `<lg>` elements.
- Related to the above, there are some lines at the beginning of verse-paragraphs which are indented. These lines need `rend="text-indent: 2em;"` or whatever the measurement is.
- PG made a valiant attempt to use `<salute>`, `<closer>`, `<signed>` and `<name>` tags in the prefatory material (which is largely dedications, so has a lot of this kind of thing in it). However, the markup structures we're using make this difficult, and none of the existing formulations was valid; because of the need to render the code valid before we go any further, I've pulled out those tags and used plain `<p>` tags, with some `<name>` tags left in place, in those locations. However, the intent was correct, and we must revisit those areas of the text to devise and implement the correct formulations for them (and then handle them in the XSLT output).
The sad old operating system that is OSX has an annoying habit of littering hard drives with .DS_Store files, and it even does this on network drives from other, non-OSX machines; these cause occasional stumbles in rsync operations, and also just get in the way and annoy the heck out of me. Finally I was bugged enough to find a solution. If you log in to the Mac with admin privileges and type this in a terminal:
defaults write com.apple.desktopservices DSDontWriteNetworkStores true
then restart the Mac, the behaviour appears to be prevented. I haven't found a way to stop the Mac from doing it on the local drive, though.
Now if only I could find a way to make the menu and dialog boxes for every application show up on the same monitor that the application happens to be running on...
Worked again on the problem of getting the stdout from a command-line application back into my Delphi app. Found an ancient component called TDosCommand which purports to do this, but it won't compile easily due to issues with the transition to UStrings; after some hacking, I got it to compile, but I still can't make it work. I get garbage back from the command, even though when I run it in a command window, I get the right result. I think I'm going to have to abandon this component, and try one of these strategies:
- Find out how to capture stdin and stdout using pipes (looks messy).
- Get the source code for ncd, specifically the C header files, and convert them for Delphi, so I can use the DLLs directly (even messier).
Both options would involve learning stuff that's potentially useful, but I think the second would give me access to more other open-source code in the long run. JEDI has good tutorials on how to convert C header files.
I've begun writing the GUI for our pilot test, and I have two XML files happily loading and being displayed, and xml:id attributes being added to all the target elements (a requirement for the subsequent processing). Now I've started trying to figure out how to run the NCD process externally and get back its return value. The problem is that the return value is a float, and in Windows it seems to be impossible to get anything back from an external process except an integer (with GetExitCodeProcess). Presumably NCD is writing to StdOut. I did find this site which suggests a possible solution, which I'll look at tomorrow.
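That's exactly the limitation: the OS exit code can only carry an integer, so a float result has to come back through captured stdout. Here's a sketch of the pipe-capture idea in Python rather than Delphi (purely to pin down the concept; the `ncd` invocation in the comment is hypothetical, and in Delphi the equivalent is CreatePipe/CreateProcess with redirected handles):

```python
import subprocess

def run_and_capture(cmd):
    """Run an external command and return (exit_code, stdout_text).

    Illustrates the general technique: the process exit code is an
    integer, so a float result must be read from the stdout stream.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout

# Hypothetical usage with an NCD-style tool that prints a float:
# code, out = run_and_capture(["ncd", "fileA.txt", "fileB.txt"])
# similarity = float(out.strip())
```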
Spent a couple of hours with Stew working on the algorithm and the specific arithmetic we need to make our critical edition generator run. The basics are straightforward, but we obviously need an application that can do the work (which will take days or possibly even weeks), and we need to finalize the formats in which we choose to store the results. I don't think the correspondence values have much relevance for TEI, so a simple concise XML format will probably do.
After this, and some discussion with the project team, I started work on a Windows application to wrap the CompLearn engine, after determining that jclUnicode has the canonical normalization functionality I need. Delphi will be fast and user-friendly, as opposed to the other alternative (Java), and it's another opportunity to put some more work into the open-source units on which Transformer and IMT depend.
CC and I have been discussing the possibility of using the Sonnet de Courval material as a testing ground for automated critical-edition building. Our (currently very vague) plan is to start with the Satyre Ménipée, which exists in several editions. What we need to do initially is to find a plausible and usefully sophisticated algorithm for generating a similarity score between lines; I think this will need to be based on the kind of algorithms used (for instance) in the sciences, to measure similarities between protein structures, etc. We'd need to process every line in each text, first normalizing it, stripping punctuation and lower-casing it; then compare it to every other line in all the variants, computing a similarity score. The scores would then have to be re-processed to weight them, based on the similarities of the surrounding lines. At the end of this (very computationally-intensive) process, we would have scores for every line vs every other line (although it would probably be logical to discard scores below a certain level, to reduce the quantity of data). Based on this, we could generate a "critical edition" which allowed you to choose any text as a base text, and view the others as variants based on it.
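The per-line normalization step described above (canonical Unicode normalization, lower-casing, punctuation stripping) is the easy part; a minimal sketch, in Python purely as illustration (the function name is mine):

```python
import unicodedata

def normalize_line(line: str) -> str:
    """Prepare a line for comparison: canonical normalization,
    lower-casing, punctuation stripping, whitespace collapsing."""
    line = unicodedata.normalize("NFC", line)  # canonical composition
    line = line.lower()
    # Drop anything Unicode classifies as punctuation (categories "P*").
    line = "".join(ch for ch in line
                   if not unicodedata.category(ch).startswith("P"))
    return " ".join(line.split())              # collapse whitespace
```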
The difficult part for me is the similarity metric we need to use. I've only just started my reading, but so far I haven't found anything in the humanities that makes sense. I'll need to do a lot of reading to get up to speed with this.
UPDATE: It looks to me as though CompLearn is the answer to this. The arguments in the 2005 paper that describes it are very compelling, and my tests with small text files and single lines suggest that it works extremely well, in that it basically agrees with my own common-sense view of how similar two strings are. It gives a score between 0 (identical) and (presumably) 1 (completely different) -- although the most I've managed to score is 0.529412 with short strings in Latin characters; comparing a short English string with a short Japanese string gives 0.789474. Interestingly, I get the same score whatever the English string is; the meaning of this is slightly beyond me at the moment, but it's been a long day...
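The metric behind CompLearn is the normalized compression distance, NCD(x,y) = (C(xy) − min(C(x),C(y))) / max(C(x),C(y)), where C is the compressed length under some real-world compressor. A sketch using zlib (my choice here for illustration; CompLearn can use other compressors):

```python
import zlib

def C(data: bytes) -> int:
    """Compressed length, used as a proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for very similar inputs,
    approaching 1 (occasionally slightly above) for unrelated ones."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Note that even identical inputs score a little above 0, because a real compressor carries per-stream overhead; that would be consistent with the non-zero floors in the scores above.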
I've gone through Amboise 1 to add all the remaining `<fw>` elements (catchwords and sigs), fix some missing line-breaks, remove spaces before linebreaks (so that removal of hyphenation works correctly), add the last couple of elements (the "Fin..." line and the library stamp on the last page), and do a bit of other tidying up. Amboise 1 is now "complete" (meaning it needs to be proofed against its web view and PDF).
As we get towards the end of the markup of the first novel, I've worked carefully through the Amboise vol 2 again, and done a number of things:
- Automated the markup of page numbers and running titles, based on the old page numbers in `@n` attributes of `<pb>` tags. This will save LW a lot of time. I've also run the same code on vol 1, to help TG.
- Manually marked up the remainder of the "sig" and catchword strings in vol 2, partly to get it finished, but mainly to shake down and normalize all the relevant markup practices. They're detailed below.
- Fixed a few oddities and typos.
- Added more CSS and XSLT updates, mainly to normalize long s characters to regular s in the "continuous view" (which is becoming the "more modern view"), and to remove trailing hyphens and leading spaces from linebreaks. There are 998 instances of words hyphenated across linebreaks in the text, and a quick scan through showed none that I could see that ought to retain their hyphens, so I've gone for a global solution which may result in a few hyphens disappearing where they ought not to, but which is certainly better than a hyphen+space interrupting a word 998 times.
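The global dehyphenation described above amounts to a single pattern replacement over the serialized text; a sketch in Python (the real transform lives in the XSLT, and `<lb/>` is assumed as the line-break element):

```python
import re

def dehyphenate(xml_text: str) -> str:
    """Join words hyphenated across line breaks: remove the hyphen
    before <lb/> and any whitespace around it, so that
    'obli-<lb/>gea' reads as 'obli<lb/>gea' (i.e. 'obligea' once
    line breaks are suppressed in the continuous view)."""
    return re.sub(r"-\s*<lb/>\s*", "<lb/>", xml_text)
```

This is deliberately global: a quick scan through the 998 hyphenated line-ends found none that needed its hyphen kept.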
These are some standard formats I've used for the forme works, included here for documentation and reference purposes:
- Catchwords in these texts are always bottom-right, so they're marked up as follows, following a regular line break:
<fw type="catchword" place="bot-right" rend="float: right;">bligea</fw><lb/>
The `@place` attribute is only there for tradition, really; the whole rendering instruction is in the `@rend` attribute.
- Signature labels (type 1): There are two types of signature numbers, one of which is simply the number in roman numerals (this is the only kind which appears in volume one). Again, these follow the last line break:
<fw type="sig" place="bot-right" rend="float: right; margin-right: 3em;">A iiij</fw>
As above, the `@place` attribute is really superfluous. The `@rend` attribute captures the fact that the sig identifier appears floated right, but is slightly indented from the right margin.
- Signature labels (type 2): The second type of signature marks the beginning of a signature, and is more complex; it appears only in volume 2:
<fw type="sig" place="bot-center" rend="text-align: center; margin-left: auto; margin-right: auto;"><hi rend="font-style: italic;">II. Part.</hi><space quantity="4" unit="em"/>B</fw>
The whole (treated as one line) is roughly centred, but the two components are separated by a space of about 4 ems (all measurement is done in ems, for simple scalability). The margin settings are there as a conventional way to express the fact that the block is not full-width, and is located in the centre (and this CSS can be passed straight to the browser in the rendering code, to get the effect we want).
- Page numbers: Whether right or left (recto or verso respectively), these come first after the page break. That has two advantages: first, we know reliably and programmatically where they are, and second, they will render correctly floated, alongside the centred running title:
<fw type="pageNum" place="top-left" rend="float: left;">18</fw>
- Running titles: These are always encoded after the page number, even if the page number is on the right, for the reasons stated above:
<fw type="head" place="top-centre" rend="text-align: center; margin-left: auto; margin-right: auto;"><hi rend="font-style: italic;">Le Comte</hi></fw>
As with other `<fw>` tags, the `@rend` is the key attribute, expressing the fact that this is a part-width, centred block that renders on the same line as the floated page number.
Added a new "Petits romans" menu item, resulting in a contents page for the novels. Then started hacking more seriously at the prose display, in both the page-based and continuous modes. I came across a validation problem: TEI `<fw>` tags appearing inside `<head>`s were ending up as `<div>` elements inside `<h2>` tags in the HTML output. I've added some testing for this kind of condition in the XSLT, so that `<span>`s are used in this kind of context, with the `class` attribute invoking CSS which displays them as blocks anyway, so that the rendering is not affected; the result is that the XHTML validates, but the page still looks right.
I also came across a slightly thorny problem worth blogging. Paragraphs in the novels have `text-indent` settings, specified in the XML and passed into the CSS. When block-display elements such as `<fw>` tags (resulting in page numbers) occur within the paragraph, as they almost always do, the block element inherits the `text-indent` setting from the parent, and so indents its text. This is avoided by specifically setting `text-indent` to zero on the classes of these block elements.
Finally, I tweaked the right margin of the continuous-view texts so that there's enough space for the note popup to appear. This makes the lines shorter anyway, which makes reading easier.
I still don't have a definitive solution to the hyphens indicating word-breaks across lines, which should be eliminated in the continuous view. Still thinking about that one.