Made a number of changes to the XML markup documentation PDF, some on SK's instructions and some to clarify my implementation of the glottalization-related changes we've made in the last couple of days.
I finished my XSLT conversion for fixing the encoding of glottalization, and ran it on the files awaiting work. They're now all sitting in a directory on the server called "ready_to_edit". There are only a few oddities/problems which make some of the files invalid:
- Some entries have no xml:id at all, for some reason. Rather than make one up, I'll leave it to the editor to assign one.
- Some entries have xml:ids that begin with the standalone grave accent (`, U+0060), which is not a valid character at the beginning of an xml:id. This is because that character is at the beginning of the entry itself. We need to look at these, and decide a) if it should be there, or it's some kind of processing error; and b) if it's correct, then how we should handle it when creating xml:ids (possibly by just deleting it?).
- A couple of entries have completely borked content because the original DOS stuff was never converted over, for some reason. Those entries are few and small enough to be dealt with on a case-by-case basis; the English glosses are there, so they can be tracked back to their original data.
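A quick way to flag the first two id problems while triaging the ready_to_edit files is a small validity check. This is only a sketch: the character class below is a simplified approximation of the XML NCName start-character rules, not the full production.

```python
import re

# Simplified approximation of the characters allowed at the start of an
# xml:id (NCName start characters). The bare grave accent (U+0060) is
# not among them, which is why those entries are invalid.
NCNAME_START = re.compile(r'[A-Za-z_\u00C0-\u02FF\u0370-\u1FFF]')

def classify_xml_id(xml_id):
    """Return 'missing', 'bad-start', or 'ok' for an entry's xml:id."""
    if not xml_id:
        return 'missing'
    if not NCNAME_START.match(xml_id[0]):
        return 'bad-start'
    return 'ok'
```

Running this over the entries would give us the list of cases the editor needs to look at, without deciding for them what the fix should be.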
Wrote the beginnings of an identity transform to fix all the glottalization issues discussed two posts below this one. I have the matching basically working, and shelling out to a function; now I just have to construct the function to do all the replacements.
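For reference, the overall shape of that transform, sketched here in Python rather than XSLT, and modifying the tree in place rather than copying it as a true identity transform would. The element names and the replacement body are placeholders, not the project's real list:

```python
import xml.etree.ElementTree as ET

def fix_glottalization(text):
    # Stand-in for the real replacement function; the actual mappings
    # are worked out in the posts below this one.
    return text.replace('\u02C0', '\u02BC')

def identity_transform(elem, targets=('seg', 'pron')):
    """Walk the tree; where an element's tag is in the target list,
    shell out to the replacement function on its text content.
    Everything else passes through untouched."""
    if elem.tag in targets and elem.text:
        elem.text = fix_glottalization(elem.text)
    for child in elem:
        identity_transform(child, targets)
    return elem
```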
Up to now, we have been working on the basis that surface forms can be broken down into discrete segments constituting morphemes. This is not always the case, though; today one form surfaced in which three discrete morphemes combine to form a single-phoneme item (c = nt + sa + s).
I would have liked to use a sequence of lookups in the @sameAs attribute, separated by spaces, but that's not allowed in TEI; @sameAs can only hold one value. The obvious alternative is @corresp, so we would do this:
<hyph> <m corresp="nt sa s">c</m></hyph>
That's what we're going to do, temporarily; but in the long run, I think we need to make two changes:
- Switch all @sameAs attributes to @corresp attributes.
- Think about whether we need to use hashes before the xml:ids we're pointing to. I'm never sure about this: the items aren't necessarily in the same file, although they sometimes are; but in the context of the database they're easily discoverable just by @xml:id.
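The attribute switch itself is mechanical. A minimal sketch (Python stand-in for whatever tool we end up using; TEI-namespaced files would need the namespace handled on the attribute lookups, which I've left out):

```python
import xml.etree.ElementTree as ET

def same_as_to_corresp(root, add_hash=False):
    """Rename every @sameAs to @corresp. If add_hash is True, prefix
    each space-separated value with '#', which is the open question
    above about local pointers."""
    for elem in root.iter():
        if 'sameAs' in elem.attrib:
            value = elem.attrib.pop('sameAs')
            if add_hash:
                value = ' '.join('#' + v.lstrip('#') for v in value.split())
            elem.set('corresp', value)
    return root
```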
This is a summary of the global changes we'll be making to the XML data, based on a re-reading of all the relevant posts from 2007:
- The Glottalized Ejective class has the following members: p’, t’, c’, ƛ’, k’, q’
- The Glottalized Sonorant/Resonant class has these members: mˀ, nˀ, lˀ, ḷˀ, rˀ, wˀ, yˀ, ʕˀ
- The former are currently transcribed in the already-processed files using raised glottals (e.g. tˀ). The raised glottal is U+02C0. These need to be transformed into U+02BC: MODIFIER LETTER APOSTROPHE, "glottal stop, glottalization, ejective". This letter is valid as part of an xml:id attribute, so we could do a global conversion there, using Transformer rather than an XSLT identity transform.
- However, in the partially-transformed files, it appears that all of these items have been transcribed using actual apostrophes. This means we can't use Transformer, because there are valid English sequences containing e.g. t+apostrophe; only in the context of the TEI tags which contain Moses script should the conversions take place. Therefore we will have to use an XSLT identity transform to accomplish this conversion.
The plan, therefore, is this:
- For the completed and in-process files, the only conversion I think we need to make is to convert Ejective + raised glottal to Ejective + U+02BC. This can be done universally, using Transformer.
- For the partially-transformed files, we need to write an XSLT identity transform which targets a specific list of only those TEI tags which contain Moses script. The transform will map Ejective + apostrophe to Ejective + U+02BC (MODIFIER LETTER APOSTROPHE), and Sonorant/Resonant + apostrophe to Sonorant/Resonant + U+02C0 (raised glottal).
Met with EC and SK, to plan the revival of the project. SK will work for six weeks starting next Tuesday, on Kale, and we'll spend some of Tuesday setting up the machine. I've had SK added to the moses group on TAPoR. We'll do all editing on the server, and I'll back up the content methodically. They will start work on the c-rtr file, and meanwhile Greg and I will analyze old blog posts to devise a replacement system to make the last few changes to Unicode representations, as previously discussed. I also need to revise the project/markup description document a little.
Good meeting today, clearing up a lot of stuff we've been confused about. The issue with gloss tags is resolved. We discovered and fixed a problem in the phar-w.xml file where <bibl> tags were children of <entry> tags, when they should have been children of <def> tags.
Before I leave, I need to go through the preceding blog posts and make a detailed plan for the search-and-replace operations we need to do on the data, to get to the transcription system that's actually correct.
Where there is no attested definition, Ewa will supply one in this form:
<def> <note resp="ECH">[The definition/explanation]</note> </def>
I'm still looking in detail at your postings below, and I'm not sure I've grasped the issue fully yet, but I think it would help if I explain how I envisage the English-NX wordlist system working (in fact, the only way I can envisage it working at the moment).
The intention, as I understand it, is to produce a wordlist, not a dictionary. In other words, the output will be a list of English words and phrases in alphabetical order, each with an equivalent NX word or phrase. The way this would be achieved is this:
- Find each <gloss> tag which is intended to be a wordlist entry. (This means that we have to disambiguate <gloss> tags which are intended to be for the wordlist from those which aren't; that can only be done on the basis of their context, or failing that, because they have a particular attribute added to them which distinguishes them.)
- For each such gloss tag, find the nearest appropriate NX word or phrase in the tree which is equivalent to it. (I had understood this to mean going up the tree to the <entry> level, then taking the first <seg> in the first <pron> in the first <form> element in the entry.)
This obviously requires that any <gloss> tag we're going to use for this purpose contain an English word or phrase that IS equivalent to the <seg> element as described above. If it's not going to be equivalent, then the question arises "what is it a gloss for?"
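To make the envisaged process concrete, here is a minimal sketch of the extraction. It ignores the TEI namespace and the gloss-disambiguation test entirely, and pairs every gloss with the first form/pron/seg it finds in the entry, so it is an illustration of the shape of the operation, not working project code:

```python
import xml.etree.ElementTree as ET

def wordlist_pairs(root):
    """For each <gloss> in an <entry>, pair its English text with the
    text of the first <seg> in the first <pron> in the first <form>,
    then sort alphabetically by the English side."""
    pairs = []
    for entry in root.iter('entry'):
        seg = entry.find('./form/pron/seg')
        for gloss in entry.iter('gloss'):
            if seg is not None and gloss.text:
                pairs.append((gloss.text, seg.text))
    return sorted(pairs)
```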
Do you envisage the process in the same way I do? If not, how had you imagined it?
I've downloaded your completed affix.xml file, and pushed it up to the database. We now have lots more items in the database, including some (at the beginning of the list) whose entry headword is their CV pattern. I'm assuming that's what's intended, since these are items whose form is so varied that they can't really be represented by anything else. Is that right?