Continued building the annotated biblio, and drafted a few paragraphs from the rescue section.
Laid out the outline for the article and divided up sections for drafting. We start writing tomorrow.
Lots of reading, some annotation and other note-taking, quote-gathering, and much useful discussion of how our features map onto GOLD (not very well: we focus heavily on bound morphemes, and GOLD becomes sparse at that level). I've also built a complete RELISH schema, which gives me access to elements we'll need that were missing from the core, such as <SenseExample>.
Discussion with ECH to prepare for writing the article next week. We have a stack of stuff to read, but the basic outline is becoming clear. There's still a lot of work to do on GOLD/ISOcat integration.
This is very slow and painful work. Half the problem is the very slow response of the CLARIN ISOcat site. I've also built a status page which shows all the mappings so far and links them to the ISOcat definition pages, so that ECH can check them. There are some I haven't been able to map, and others that I'm very unsure about.
A bit slow and painful, but it's clarifying some of our original decisions on features and we've discarded a couple of unused ones so far.
We've discussed the code structure changes outlined below, and agreed that they're desirable, but that they should be put off until later because they increase the quantity of code we'll have to edit. One approach is to write XSLT to make these changes, and make the changed versions of our XML files available through the website, while we edit the unchanged versions behind the scenes; then when the time is right, we can convert everything permanently. Here are the details:
I've been reading through LR's jTEI paper with a view to bringing our encoding more into alignment with the recommendations there (which should also make it more amenable to LMF-ication), and I think we should reorganize the way we're doing citations a bit. At the moment, we have this:
<cit>
<quote>
<phr type="p" subtype="i">s-√cə́s=lqs kˀʷáʔncás</phr>
<bibl corresp="psn:ECH">ECH</bibl>
<phr type="n">s-√cə́s=əlqs kˀʷáʔəncás</phr>
<bibl corresp="psn:JM psn:AM">Y14.219,220</bibl>
<seg>a mosquito bit me</seg>
<bibl corresp="psn:JM psn:AM">Y14.219,220</bibl>
</quote>
</cit>
In this, we rely on contiguity to associate each <bibl> with its preceding element, and we rely on <phr> and <seg> to distinguish original from translation. What we might do instead would look like this:
<cit>
<cit type="example">
<cit>
<quote xml:lang="col" type="p" subtype="i">
s-√cə́s=lqs kˀʷáʔncás
</quote>
<bibl corresp="psn:ECH">ECH</bibl>
</cit>
<cit>
<quote xml:lang="col" type="n">s-√cə́s=əlqs kˀʷáʔəncás</quote>
<bibl corresp="psn:JM psn:AM">Y14.219,220</bibl>
</cit>
</cit>
<cit type="translation">
<quote xml:lang="en">a mosquito bit me</quote>
<bibl corresp="psn:JM psn:AM">Y14.219,220</bibl>
</cit>
</cit>
This is much more detailed, but it makes more things explicit. It uses nested <cit> tags to ensure that each quote is bracketed with its <bibl>, and that each <quote> has the required @xml:lang setting. The second level of <cit> is divided into @type="example" and @type="translation" (following recommendations in the TEI Guidelines), and the @type and @subtype values are realized directly on <quote>, rather than requiring the use of <phr> or <seg>.
The obvious drawback is that there's more code here. Existing <cit>s should be easy to convert to this framework with XSLT, though.
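To check that the conversion really is mechanical, here is a rough sketch of the restructuring. It is written in Python with the standard-library ElementTree rather than as the XSLT we would actually use, and it makes several assumptions that are not settled anywhere above: the elements are in no namespace, each <phr> and <seg> contains only text, every <phr> is in "col" and every <seg> is an English translation, and each <bibl> belongs to the element immediately before it. The function name convert_cit is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Clark-notation name for the xml:lang attribute; ElementTree serializes it as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def convert_cit(old_cit):
    """Build a nested <cit> from the old flat <cit><quote>...</quote></cit>.

    Assumes text-only <phr>/<seg> children and that each <bibl> follows
    the element it documents (the contiguity rule we rely on today).
    """
    new_cit = ET.Element("cit")
    examples = ET.SubElement(new_cit, "cit", {"type": "example"})
    pending = None  # the inner <cit> still waiting for its <bibl>
    for child in old_cit.find("quote"):
        if child.tag == "phr":
            pending = ET.SubElement(examples, "cit")
            quote = ET.SubElement(pending, "quote", {XML_LANG: "col"})
            quote.attrib.update(child.attrib)  # carry @type and @subtype over
            quote.text = (child.text or "").strip()
        elif child.tag == "seg":
            pending = ET.SubElement(new_cit, "cit", {"type": "translation"})
            quote = ET.SubElement(pending, "quote", {XML_LANG: "en"})
            quote.text = (child.text or "").strip()
        elif child.tag == "bibl" and pending is not None:
            bibl = ET.SubElement(pending, "bibl", dict(child.attrib))
            bibl.text = child.text
            pending = None
    if len(examples) == 0:
        # No <phr> was found, so drop the empty example wrapper.
        new_cit.remove(examples)
    return new_cit
```

Calling convert_cit on a parsed old-style <cit> returns a new element that can be serialized with ET.tostring(..., encoding="unicode"). Anything with mixed content inside <phr> would need more care than this sketch takes.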
Similarly, we currently have things that look like this:
<pron>
<seg type="p">hámp</seg>
<bibl corresp="psn:J psn:MS">J3.72-74,78; MS1.53</bibl>
<seg type="n">hə́mp</seg>
<bibl corresp="psn:JM psn:AM">Y24.90; Y29.179; Y6.282</bibl>
</pron>
where the association between <seg> and <bibl> again depends on sequence. I wonder if we might be better off with two <pron>s:
<pron type="p">
<seg>hámp</seg>
<bibl corresp="psn:J psn:MS">J3.72-74,78; MS1.53</bibl>
</pron>
<pron type="n">
<seg>hə́mp</seg>
<bibl corresp="psn:JM psn:AM">Y24.90; Y29.179; Y6.282</bibl>
</pron>
where the @type attribute is applied to the <pron> element, and the <bibl> is unambiguously associated with the appropriate <pron>?
Again, it's a bit more code, but it seems a bit cleaner, and as I try to map our data onto the sorts of structures allowed by Lexus, it looks like this sort of approach will work better.
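The <pron> split looks equally mechanical. The following is only a sketch, in Python with ElementTree rather than XSLT, and it assumes no namespace, text-only <seg> content, and that every <bibl> belongs to the nearest preceding <seg>. The function name split_pron is hypothetical.

```python
import xml.etree.ElementTree as ET


def split_pron(old_pron):
    """Return a list with one <pron> per <seg>, moving @type from <seg> up to <pron>.

    Each <bibl> is attached to the <pron> built from the <seg> before it.
    """
    result = []
    current = None
    for child in old_pron:
        if child.tag == "seg":
            current = ET.Element("pron")
            if "type" in child.attrib:
                current.set("type", child.get("type"))
            seg = ET.SubElement(current, "seg")
            seg.text = child.text
            result.append(current)
        elif child.tag == "bibl" and current is not None:
            bibl = ET.SubElement(current, "bibl", dict(child.attrib))
            bibl.text = child.text
    return result
```

The new <pron> elements would replace the old one at the same position in its parent; that step is left out here because it depends on how the surrounding entry is processed.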
I've finished working through LR and WW's article in jTEI re TEI and LMF, and made a couple more encoding changes as well as tidying up some rendering; I've also proposed a re-working of our <cit> encoding, using nesting and @type to tighten the specificity and make it clearer which <bibl> is attached to what. This is pending approval from ECH and SMK. I've also generated a list of cross-references which aren't actually pointing at anything yet.
<ref>/@target values have also been converted to use the m: prefix, and the rendering code is now capable of handling multiple space-separated values, and handling both m: and non-m: values.
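The logic the rendering code needs for this is small. As an illustration only (the real handling lives in our XQuery and XSLT, not in Python), splitting a @target value and tolerating both forms might look like the sketch below. It assumes that a non-m: value is already a bare identifier, which may not hold for every value in the data; morpheme_ids is a made-up name.

```python
def morpheme_ids(target):
    """Split a @target value into bare ids, accepting m:-prefixed and unprefixed tokens."""
    prefix = "m:"
    ids = []
    for token in target.split():  # split() handles runs of whitespace
        if token.startswith(prefix):
            ids.append(token[len(prefix):])
        else:
            ids.append(token)  # assumed to be a bare id already
    return ids
```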
Changed all @sameAs to @corresp, and tested and bug-fixed all the rendering code, then deployed the changes to the live db. Also updated some documentation (more to be done here).
I've now written some XSLT to convert all @sameAs to @corresp in the <m> element, and also to prefix all such values (along with those already in @corresp by virtue of being multiple) with the m: prefix (our planned private URI scheme for morpheme pointers). I've also written the corresponding updates to the XQuery and XSLT, currently commented out; I'll test and fix everything locally tomorrow before uploading to the live db. Everything has to change at once before it will work. I also have to update the documentation.
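For the record, the attribute conversion amounts to roughly the following. This is a Python sketch of the same idea, not the XSLT itself, and it assumes no namespace, bare identifiers as the existing values, and that an <m> carrying both attributes should have them merged (a case that may never occur in our data). The name normalize_m_pointers is hypothetical.

```python
import xml.etree.ElementTree as ET


def normalize_m_pointers(root):
    """On every <m>, fold @sameAs into @corresp and prefix each value with m:.

    Values that already start with m: are left alone, so running this
    twice gives the same result as running it once.
    """
    for m in root.iter("m"):
        raw = m.attrib.pop("sameAs", "") + " " + m.get("corresp", "")
        tokens = [t if t.startswith("m:") else "m:" + t for t in raw.split()]
        if tokens:
            m.set("corresp", " ".join(tokens))
    return root
```

Being safe to re-run matters here because the data and the rendering code have to switch over together, and a partial run should not leave doubly prefixed values behind.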