I've written the bones of an XSLT file to convert an original file to a framework for modernization and regularization. So far the code can create `<ab>` elements with full working XPath references back to the source text nodes. Now I need to start on tokenization, which I think I'll do with a regex initially, but it's going to be quite complicated.
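Roughly the kind of thing I have in mind, as a minimal first sketch (the regex is a naive placeholder; handling of apostrophes, hyphens, and abbreviations is deferred):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.tei-c.org/ns/1.0">

  <!-- Sketch: wrap each run of letters inside an <ab> in a <w> element.
       The braces in the regex are doubled because @regex is an AVT. -->
  <xsl:template match="tei:ab">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:analyze-string select="string(.)" regex="\p{{L}}+">
        <xsl:matching-substring>
          <w><xsl:value-of select="."/></w>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- Punctuation and whitespace pass through untagged for now. -->
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```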
I've been working out my ideas a little more clearly, and beginning to evolve the idea of a working pipeline and a target format for my documents. It would look something like this:
- Original document is processed into a sort of generic structure where each text node is expressed as an `<ab>` element. At this stage:
  - The root text element in the new file points back to the source document using a private URI system based on the source document's `@xml:id`, like this: `xml:base="mar:maladies_des_femmes"`.
  - The `<ab>` element points back to the location of the original text node which gave rise to it, using a TEI pointer structure, something like this: `<ab corresp="xpath1(*[20]/*[4]/*[3]/text()[2])">`.
- The contents of the text node are tokenized. It's not clear to me yet whether we need to tag punctuation, but we definitely need to tag words, so we'll need a good tokenizer that can handle this.
- Words broken across linebreaks are reconstituted in the context of the text node preceding the linebreak, and ignored in the one following it. The reconstituted word is linked (see below) back to the original character strings in both locations, though.
- Each word is marked up with a `<w>` tag, and that tag is linked back to the original source using XPath again: `<w corresp="xpath1(substring(., 36, 10))">`.
- The original form of the word (reconstituted in the case of a broken word) is included as the text content of the `<w>` tag. It is also stored in an attribute (possibly `@n`, or more likely a custom attribute), so that when the text content is normalized and modernized, the original form is still available.
- The resulting file is then processed again, and the text contents of `<w>` tags are run through a series of normalization rules which do things such as replacing long s.
- Further processing attempts to modernize the contents of the `<w>` tags. This is going to require some serious processing, and will include algorithmic spelling modernization, dictionary lookups, etc.
- The now-hopefully-modernized form is lemmatized, and the lemma is stored in an `@lemma` attribute on the `<w>` tag.
- These documents can now be stored in the db and indexed for searching and analysis; search hits will have available to them the original spelling of the form, and will also be able to get back to the exact place in the original document where the form is located. A hand-made sample of the target format is sketched below.
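To make that concrete, here is what a fragment of the target format might look like; the pointer values, the `mar:` URI, and the use of `@n` for the original form are all illustrative guesses rather than settled decisions:

```xml
<!-- Illustrative sketch only: values and attribute choices are placeholders. -->
<text xmlns="http://www.tei-c.org/ns/1.0" xml:base="mar:maladies_des_femmes">
  <body>
    <!-- Each <ab> points back to the source text node it came from. -->
    <ab corresp="xpath1(*[20]/*[4]/*[3]/text()[2])">
      <!-- Text content: modernized form; @n: original form;
           @corresp: location of the original string; @lemma: lemma. -->
      <w corresp="xpath1(substring(., 1, 3))" n="les" lemma="le">les</w>
      <w corresp="xpath1(substring(., 5, 8))" n="maladieſ" lemma="maladie">maladies</w>
    </ab>
  </body>
</text>
```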
For this, we'll need a range of tools, some of which exist and some of which appear not to exist yet (or, as in the case of the lemmatizer, not in an open-source form we can adapt for a Java web application).
I now have a collection of a dozen or so papers I'm reading and annotating, and some ideas are getting clearer. At the moment (although I still have a lot of reading and consulting to do), this kind of approach looks promising:
- Run XSLT on collection to create parallel collection in which each significant block (not clear what a block is yet) is converted to a modernized textual representation with an XPath pointer that points back to the original block in the original doc. In this process, linebreaks would be dealt with.
- Each modernized block includes the original variants as attributes or elements (if the latter, the modern indexer can be instructed to ignore them).
- Modern blocks may also be stemmed.
- Search is done on modern blocks.
- KWIC hits from search can be shown EITHER as modern OR as original sequence (reconstructed from original variants stored in modern block), as sketched after this list.
- Clicking on the hit takes you to the original text, with hits highlighted based on a new search done using the original tokens stored in the modern block as search terms.
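The modern/original toggle could be as simple as choosing which form of each word to print; a minimal sketch, assuming the original form is stored in an attribute such as `@n` as in the pipeline sketch above:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Sketch: print a hit word in either modern or original spelling.
       The mode and parameter names are invented for illustration. -->
  <xsl:template match="tei:w" mode="kwic">
    <xsl:param name="view" select="'modern'"/>
    <xsl:choose>
      <xsl:when test="$view = 'original' and @n">
        <xsl:value-of select="@n"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```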
Started some detailed reading on this topic, with some pointers from friends and people on TEI-L. It looks like a flurry of activity happened around 2005-2007, and there are some working examples such as EEBO with fully implemented systems, as well as lots of surveys of approaches, and some tools. It looks useful and interesting. Haven't found anything resembling a dictionary of variants for Early Modern French, though.
One of the problems we face in building our next-generation search engine is the issue of archaic spellings and modern equivalents. In an effort to understand the scale of the problem before we begin tackling it, I've written some scripts which are in the process of compiling a list of all the distinct word-like tokens in the corpus which do not appear in a modern spelling dictionary. Right now, it's up to the Rs, and at around 35,000 tokens. I'll stay this evening till it completes, because I want to see the final tally.
Once we have the complete list, we'll be able to work out how many of them could be dealt with by means of normalization algorithms (such as switching long s to s, and normalizing other spelling variant patterns known to be common). Following that, we'll have an idea of how many tokens will actually have to be provided with equivalents by a human reader.
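To give a flavour of the algorithmic side, a normalization pass could be a chain of replacements in an XSLT function like the sketch below; the long-s rule is a given, while the tilde expansions are just examples of common early-modern conventions the survey should confirm or reject, and the `me` namespace is a placeholder:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:me="http://example.org/ns/mariage">

  <!-- Sketch: normalize one token through a chain of replacements.
       Long s to s is certain; vowel-plus-tilde standing for
       vowel-plus-n is a guess at the kind of rule we'll need. -->
  <xsl:function name="me:normalize" as="xs:string">
    <xsl:param name="tok" as="xs:string"/>
    <xsl:sequence select="
      replace(replace(replace(replace($tok,
        'ſ', 's'),
        'ã', 'an'),
        'ẽ', 'en'),
        'õ', 'on')"/>
  </xsl:function>

</xsl:stylesheet>
```

With just these rules, `me:normalize('eſtoiẽt')` yields `estoient`; trickier patterns such as u/v and i/j regularization will need context-sensitive rules.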
I've finished the process of converting uses of `<argument>` for marginal labels to the `<label>` tag. I had to regenerate the schema again, because in the documents affected (the Sonnets, Forest, Le Bon Mariage and Ville-Thierry), there are now occurrences of `<label>` where it did not appear before, so I regenerated the ODD file:

```
java -jar /home/mholmes/saxon/saxon9he.jar -it:main -o:/home/mholmes/WorkData/French/Claire_data/mariage_5/mariage_2012-01-30.odd /home/mholmes/WorkData/tei/sf_repo/trunk/Stylesheets/tools/oddbyexample.xsl corpus=`pwd`/
```

then edited the file manually to add `@type` to `<label>` (I'm doing this in the TEI namespace, although strictly speaking I probably shouldn't, but I don't see why `<label>` doesn't have `@type` in the first place).
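For reference, the conversion itself is the sort of thing a small identity transform handles; this is a reconstruction rather than the stylesheet I actually ran, and the `@type` value "marginal" is only a placeholder:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.tei-c.org/ns/1.0">

  <!-- Identity template: copy everything through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rewrite <argument> as <label type="marginal">. In practice the
       match would be restricted to the marginal uses only. -->
  <xsl:template match="tei:argument">
    <label type="marginal">
      <xsl:apply-templates select="@*|node()"/>
    </label>
  </xsl:template>

</xsl:stylesheet>
```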
I've written to EGB and GM to explain the change, and I'm now going to look at the documentation to see what needs changing there.
Fixed a typo in the menu reported by CC.
Met with LSPW to get an outline of the current state of play. This is the summary:
- LSPW has created several files in the /documentation/ directory which provide full editing guides, tag lists, and reports on e.g. references that haven't yet been identified. These will be complete at the end of this week, and provide a very solid grounding on our markup practice and tag usage.
- The Le Blanc text is in the following state:
  - Transcription and textual markup complete.
  - References done up to the entry for Ch VI of Livre IV in the TOC.
- GMM is working on Ville-Thierry, and has almost completed the transcription and basic markup. Detailed markup (CSS, annotation etc.) will need to be done after that, and he'll need some help getting started with it. I'll also have to take over sending his hours to SL in the French department, who's doing his timesheets.
- EGB is working on Le Bon Mariage. The transcription and basic markup are done, and she's now adding CSS and references.
- There are some outstanding issues, decisions and tasks which LSPW will put into a blog post and assign to me as a task.
This task has been outstanding for a while, but I've managed to solve it in a very simple way using CSS columns. The solution is specialized to lists at the moment, but it should work identically for any other element on which we want to implement it. Notes/limitations:
- The implementation is triggered by the list having a `child::cb`, but it uses `descendant::cb` for counting the number of columns, on the assumption that some column breaks may occur within list items. It would fail to work if there were no `<cb>` that was a direct child of the `<list>`, though. If we were implementing a more general solution, we would need to figure out what level the columnar layout should be implemented on, and we'd probably have to require `@type="columnar"` or something like that on a block-level element, to trigger the appropriate CSS (see the sketch after this list).
- The CSS still has to have `-moz-` and `-webkit-` prefixes.
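The core of the approach looks roughly like this; it's a reconstruction rather than the production code, the column arithmetic assumes breaks fall between columns, and the HTML details are invented:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Sketch: a <list> with a <cb> child becomes a multi-column <ul>.
       Columns are counted from descendant::cb, since breaks may fall
       inside list items. -->
  <xsl:template match="tei:list[tei:cb]">
    <xsl:variable name="cols" select="count(descendant::tei:cb) + 1"/>
    <ul style="-moz-column-count: {$cols}; -webkit-column-count: {$cols}; column-count: {$cols};">
      <xsl:apply-templates select="tei:item"/>
    </ul>
  </xsl:template>

</xsl:stylesheet>
```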
Fixed this bug, which was caused by the XSLT not expecting that notes would be placed inside page number `<fw>` tags. In the process of fixing it, I realized that links from TOC page numbers to the pages concerned would not work in the continuous view, because we're not showing page numbers in that view, so I added code to create empty anchors in the text; this allows links from TOCs to work as expected.
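The anchor code amounts to something like this minimal sketch; it assumes page beginnings are marked with `<pb>` elements carrying `@xml:id`, and the mode name is invented:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Sketch: in the continuous view, where no page number is shown,
       emit an empty anchor at each page beginning so that TOC links
       still have a target. -->
  <xsl:template match="tei:pb[@xml:id]" mode="continuous">
    <a id="{@xml:id}"/>
  </xsl:template>

</xsl:stylesheet>
```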