More work on modernization/searching etc.
Posted by mholmes on 13 Mar 2012 in Activity log, Academic
I've been working out my ideas a little more clearly, and beginning to evolve the idea of a working pipeline and a target format for my documents. It would look something like this:
- Original document is processed into a sort of generic structure where each text node is expressed as an `<ab>` element.
- At this stage, the root text element in the new file points back to the source document using a private URI system based on the source document's `@xml:id`, like this: `xml:base="mar:maladies_des_femmes"`.
- The `<ab>` element points back to the location of the original text node which gave rise to it, using a TEI pointer structure, something like this: `<ab corresp="xpath1(*[20]/*[4]/*[3]/text()[2])">`.
- The contents of the text node are tokenized. It's not clear to me yet whether we need to tag punctuation, but we definitely need to tag words, so we'll need a good tokenizer that can handle this.
- Words broken across linebreaks are reconstituted in the context of the text node preceding the linebreak, and ignored in the one following it. The reconstituted word is linked (see below) back to the original character strings in both locations, though.
- Each word is marked up with a `<w>` tag, and that tag is linked back to the original source using XPath again: `<w corresp="xpath1(substring(., 36, 10))">`.
- The original form of the word (reconstituted in the case of a broken word) is included as the text content of the `<w>` tag. It is also stored in an attribute (possibly `@n`, or more likely a custom attribute), so that when the text content is normalized and modernized, the original form is still available.
- The resulting file is then processed again, and the text contents of the `<w>` tags are run through a series of normalization rules which do things such as replacing long s.
- Further processing attempts to modernize the contents of the `<w>` tags. This is going to require some serious processing, and will include algorithmic spelling modernization, dictionary lookups, etc.
- The now-hopefully-modernized form is lemmatized, and the lemma is stored in a `@lemma` attribute on the `<w>` tag.
- These documents can now be stored in the db and indexed for searching and analysis; search hits will have available to them the original spelling of the form, and will also be able to get back to the exact place in the original document where the form is located.
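The normalization step above could be sketched roughly like this (a minimal Python illustration rather than the actual pipeline code; the rule table and the `orig` attribute name are my placeholders, since the attribute hasn't been decided yet):

```python
# Sketch: normalize the text of a <w> element while preserving the
# original spelling in an attribute. The rule list is illustrative,
# not a real normalization scheme.
import xml.etree.ElementTree as ET

NORMALIZATION_RULES = [
    ("\u017f", "s"),  # long s -> s
    ("vv", "w"),      # doubled v -> w (illustrative only)
]

def normalize_w(w: ET.Element) -> ET.Element:
    original = w.text or ""
    normalized = original
    for old, new in NORMALIZATION_RULES:
        normalized = normalized.replace(old, new)
    w.set("orig", original)  # placeholder name for the custom attribute
    w.text = normalized
    return w

w = ET.fromstring('<w corresp="xpath1(substring(., 36, 10))">cho\u017fe</w>')
normalize_w(w)
print(ET.tostring(w, encoding="unicode"))
```

The point of the sketch is the shape of the data: after normalization, both the searchable form and the original spelling travel together on the same `<w>` element.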
For this, we'll need a range of tools, some of which exist and some of which appear not to exist yet (or, as in the case of the lemmatizer, not in an open-source form we can adapt for a Java web application).
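To make the tokenizer requirement a little more concrete, here is a rough sketch of the reconstitution logic for words broken across linebreaks, in plain Python over strings (the real tool would operate on TEI text nodes, distinguish soft hyphens from genuine ones, and record the pointer links back to both original fragments):

```python
import re

def tokenize_lines(lines):
    """Split lines into word tokens, reconstituting words broken across
    linebreaks: a token ending in "-" at the end of a line is joined to
    the first token of the next line, and counted in the context of the
    line preceding the break. Assumes every line-final hyphen marks a
    broken word, which real data would not guarantee."""
    tokens = []
    carry = ""
    for line in lines:
        words = re.findall(r"\S+", line)
        for i, word in enumerate(words):
            if carry:
                word = carry + word  # rejoin the broken word
                carry = ""
            if i == len(words) - 1 and word.endswith("-"):
                carry = word[:-1]  # hold the fragment for the next line
            else:
                tokens.append(word)
    if carry:  # a trailing fragment with no continuation
        tokens.append(carry)
    return tokens

print(tokenize_lines(["les mala-", "dies des femmes"]))
# → ['les', 'maladies', 'des', 'femmes']
```

Even this toy version shows why an off-the-shelf tokenizer may not be enough: the join has to happen before word-tagging, while the link-back machinery still needs the positions of both halves.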