I've started work on a more sophisticated way of generating the English wordlist, which would show a single headword for each distinct gloss, followed by an embedded list of each of the entries which contain that gloss, and their brief definitions. My first shot at generating the XML for this is taking forever on my local machine -- it's a bit recursive, and there are thousands of entries -- so I think I'll need to reconsider. The first thing to do is to add range indexes for all the key tags; I should have done that originally. Once that's done, I can time the production of the full list, and perhaps break it down by letters of the alphabet so that it's only trying to render one letter at a time. This might take a while.
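Something along these lines is what I have in mind for the per-letter rendering -- a sketch only, with the collection path and output element names as placeholders, and the TEI namespace assumed:

xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: One letter at a time: a headword for each distinct gloss beginning with
   $letter, with a nested list of the entries containing that gloss and
   their brief definitions. :)
declare variable $letter external;

for $g in distinct-values(
            for $gl in collection('/db/moses')//tei:gloss
            where starts-with(lower-case(normalize-space($gl)), $letter)
            return normalize-space($gl))
order by $g
return
  <headword form="{$g}">
    {
      for $entry in collection('/db/moses')//tei:entry[.//tei:gloss[normalize-space(.) = $g]]
      return <ref target="{$entry/@xml:id}">{normalize-space(($entry//tei:def)[1])}</ref>
    }
  </headword>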
Category: "Activity log"
As planned, the views of Moses entries ("finished" and "all") now look like the wordlist view (term + glosses), and the Moses-English wordlist has been removed from the menu.
Productive meeting this morning, giving rise to the following ideas:
- The "entries" and "wordlist" views should in fact be merged, such that the entries are presented initially in the form that the wordlist now takes (with the headword followed by all its glosses), expanding to a full entry when clicked on.
- Duplicate glosses will be removed; there's no reason for the same word or phrase to be wrapped in a
<gloss>
tag more than once in the same entry. I'll see about generating a list of duplicates using XQuery to help in detecting them. - In the "entries" view (Moses), affixes should not display any glosses, because where
<gloss>
tags are contained inside them, they're actually glossing a word containing the affix rather than the affix itself. Instead, the feature structure information explaining the function of the affix should be displayed, in a manner which makes it clearly distinct from regular glosses, and the affix head form itself should be distinguished visually so that it's clear that it's not a regular word. - The English-Moses wordlist is to be replaced by a serious attempt to produce something along the lines of a regular English-Moses dictionary, generated automatically from the DB. This will most possibly be as unsatisfactory as previous projects which used Lexware to generate the same sort of dictionary, but we may be able to do a more sophisticated job than that. Ultimately, when the initial phase of the work on the core code is complete, we may add some markup to
<dicteg>
s which enables us to generate a more complete set of English headwords by harvesting English words and Moses equivalents from<dicteg>
s. (The example here is "flannel", inside the current entry for "cloth"; the Moses equivalent is a phrase containing "cloth", so it doesn't have its own entry in Moses, but in English it probably should.) - DONE: The Moses-to-English basic view should not use words from gloss tags; instead, it should use the content of the
<def>
/<seg>
elements as the "gloss" of the word. - DONE: Unattested glosses should never be displayed at all in the Moses-English view. They're only to be used when constructing the English-Moses view.
There are about 13.5 thousand entries.
I wrote a test unit to figure out what was happening with the RuleBasedCollator, and I've fixed it. These are the issues I came up against:
- The class was generating a ParseException, but Cocoon or Saxon was silently swallowing the exception and defaulting to a normal alpha sort. Until I ran the test unit, I didn't know the exception was being triggered.
- The exception was caused by the fact that I had included both decomposed and composed versions of (for instance) a with a dot below. It turns out that the RuleBasedCollator class actually does normalization itself before doing comparisons, so there was no need for this; in fact it was throwing the ParseException because it looked as though I was asserting that a character was equal to itself (or unequal -- the exception wasn't very clear on this).
- There were a couple of typos in Unicode codepoints for raised W, which were causing the precise sort I was using for testing to fail anyway.
- There was a stray combining-dot-below, virtually invisible and hard to select, in the rule definition, making it unparseable.
The collator now seems to be working fine. The code is below:
package ca.uvic.hcmc.moses;

/**
 *
 * @author mholmes
 */

import java.text.ParseException;
import java.text.RuleBasedCollator;

public class MosesCollation extends RuleBasedCollator {

    public MosesCollation() throws ParseException {
        super(mosesRules);
    }

    /*
     * Commented-out statements below are replaced by simpler ones, because the
     * RuleBasedCollator automatically does Unicode normalization before it does
     * its comparisons; including "parallel" versions of characters was triggering
     * a ParseException.
     */
    private static String glottal = new String("\u0294");
    private static String a = new String("a,a\u0301,a\u0300,\u00e1,\u00e0");
    //private static String aDot = new String("\u1ea1,a\u0323,\u1ea1\u0301,a\u0323\u0300,\u00e1\u0323,\u00e0\u0323");
    private static String aDot = new String("\u1ea1,\u1ea1\u0301,\u1ea1\u0300,\u00e1\u0323,\u00e0\u0323");
    private static String cDot = new String("c\u0323");
    private static String cApos = new String("c\u02bc");
    private static String schwa = new String("\u0259,\u0259\u0301,\u0259\u0300");
    private static String schwaDot = new String("\u0259\u0323,\u0259\u0323\u0301,\u0259\u0323\u0300");
    //private static String hDot = new String("\u1e25,h\u0323");
    private static String hDot = new String("\u1e25");
    //private static String hDotW = new String("\u1e25\u02b7,h\u0323\u02b7");
    private static String hDotW = new String("\u1e25\u02b7");
    private static String i = new String("i,i\u0301,i\u0300,\u00ed,\u00ec");
    private static String iDot = new String("\u1ecb,\u1ecb\u0301,\u1ecb\u0300,i\u0323\u0301,i\u0323\u0300,\u00ed\u0323,\u00ec\u0323");
    private static String kApos = new String("k\u02bc");
    private static String kW = new String("k\u02b7");
    private static String kAposW = new String("k\u02bc\u02b7");
    //private static String lDot = new String("\u1e37,l\u0323");
    private static String lDot = new String("\u1e37");
    private static String lGlot = new String("l\u02c0");
    //private static String lDotGlot = new String("\u1e37\u02c0,l\u0323\u02c0");
    private static String lDotGlot = new String("\u1e37\u02c0");
    private static String lBelt = new String("\u026c");
    private static String barLamApos = new String("\u019b\u02bc");
    private static String mGlot = new String("m\u02c0");
    private static String nGlot = new String("n\u02c0");
    private static String pApos = new String("p\u02bc");
    private static String qApos = new String("q\u02bc");
    private static String qW = new String("q\u02b7");
    private static String qAposW = new String("q\u02bc\u02b7");
    private static String rGlot = new String("r\u02c0");
    //private static String sDot = new String("\u1e63,s\u0323");
    private static String sDot = new String("\u1e63");
    private static String tApos = new String("t\u02bc");
    private static String u = new String("u,u\u0301,u\u0300,\u00fa,\u00f9");
    private static String uDot = new String("\u1ee5,\u1ee5\u0301,\u1ee5\u0300,u\u0323\u0301,u\u0323\u0300,\u00fa\u0323,\u00f9\u0323");
    private static String wGlot = new String("w\u02c0");
    private static String xW = new String("x\u02b7");
    private static String xDot = new String("x\u0323");
    private static String xDotW = new String("x\u0323\u02b7");
    private static String yGlot = new String("y\u02c0");
    private static String phar = new String("\u0295");
    private static String pharGlot = new String("\u0295\u02c0");
    private static String pharW = new String("\u0295\u02b7");
    private static String pharGlotW = new String("\u0295\u02c0\u02b7");

    private static String mosesRules = ("< " + glottal + " < " + a + " < " + aDot + " < c "
            + " < " + cDot + " < " + cApos + " < " + schwa + " < " + schwaDot
            + " < h < " + hDot + " < " + hDotW + " < i " + " < " + iDot
            + " < k < " + kApos + " < " + kW + " < " + kAposW
            + " < l < " + lDot + " < " + lGlot + " < " + lDotGlot + " < " + lBelt + " < " + barLamApos
            + " < m " + " < " + mGlot + " < n < " + nGlot
            + " < p " + " < " + pApos + " < q < " + qApos + " < " + qW + " < " + qAposW
            + " < r < " + rGlot + " < s " + " < " + sDot + " < t < " + tApos
            + " < " + u + " < " + uDot + " < w < " + wGlot
            + " < x " + " < " + xW + " < " + xDot + " < " + xDotW
            + " < y " + " < " + yGlot + " < " + phar + " < " + pharGlot + " < " + pharW + " < " + pharGlotW);
}
Created simple wordlist output formats, so we can get a handle on how best to deal with <gloss> tags. They're drawing from the complete set of entries, edited or not. They show a couple of things that need fixing, most of which we know about, but they also show that there's a tendency to include the same gloss multiple times in an entry. For instance, in an edited entry we have this:
ṣə̣́nṣə̣nt: tame, gentle, quiet, tame, tame, gentle
The entry shows why: there are multiple <gloss> tags containing the same words sprinkled through the <entry>.
It's possible for me to write the code so that it ignores these duplicate glosses, but there are a couple of problems with that: first, SMK and ECH would be doing lots of extra tagging that we're ignoring, making the entries more complicated, and second, the generation of wordlists will take much longer because it'll have to check every <gloss> to see if it's a duplicate. So I think a good policy would be to make sure we only tag a particular word or phrase once as a gloss in any given entry.
I've rewritten the RuleBasedCollator -- source code is below. However, when invoked in the Moses web application, it appears not to be working at all. The next stage will be writing a test package to exercise it within the Java environment, to see whether it's the Java that's broken, or something in the XSLT or Saxon which is failing to use it properly. Source code:
/* * To change this template, choose Tools | Templates * and open the template in the editor. */ package ca.uvic.hcmc.moses; /** * * @author mholmes */ import java.text.ParseException; import java.text.RuleBasedCollator; public class MosesCollation extends RuleBasedCollator{ public MosesCollation() throws ParseException { super(mosesRules); } private static String glottal = new String("\u0294"); private static String a = new String("a,a\u0301,a\u0300,\u00e1,\u00e0"); private static String aDot = new String("\u1ea1,a\u0323,\u1ea1\u0301,a\u0323\u0300,\u00e1\u0323,\u00e0\u0323"); private static String cDot = new String("c\u0323"); private static String cApos = new String("c\u02bc"); private static String schwa = new String("\u0259,\u0259\u0301,\u0259\u0300"); private static String schwaDot = new String("\u0259\u0323,\u0259\u0323\u0301,\u0259\u0323\u0300"); private static String hDot = new String("\u1e25,h\u0323"); private static String hDotW = new String("\u1e25\u02b7,h\u0323\u02b7"); private static String i = new String("i,i\u0301,i\u0300,\u00ed,\u00ec"); private static String iDot = new String("\u1ecb,\u1ecb\u0301,\u1ecb\u0300,i\u0323\u0301,i\u0323\u0300,\u00ed\u0323,\u00ec\u0323"); private static String kApos = new String("k\u02bc"); private static String kW = new String("k\u02b7"); private static String kAposW = new String("k\u02bc\u027b"); private static String lDot = new String("\u1e37,l\u0323"); private static String lGlot = new String("l\u02c0"); private static String lDotGlot = new String("\u1e37\u02c0,l\u0323\u02c0"); private static String lBelt = new String("\026c"); private static String barLamApos = new String("\u019b\u02bc"); private static String mGlot = new String("m\u02c0"); private static String nGlot = new String("n\u02c0"); private static String pApos = new String("p\u02bc"); private static String qApos = new String("q\u02bc"); private static String qW = new String("q\u02b7"); private static String qAposW = new String("q\u02bc\u027b"); private static String rGlot = new String("r\u02c0"); private static String sDot = new String("\u1e63,s\u0323"); private static String tApos = new String("t\u02bc"); private static String u = new String("u,u\u0301,u\u0300,\u00fa,\u00f9"); private static String uDot = new String("\u1ee5,\u1ee5\u0301,\u1ee5\u0300,u\u0323\u0301,u\u0323\u0300,\u00fa\u0323,\u00f9\u0323"); private static String wGlot = new String("w\u02c0"); private static String xW = new String("x\u02b7"); private static String xDot = new String("x\u0323"); private static String xDotW = new String("x\u0323\u02b7"); private static String yGlot = new String("y\u02c0"); private static String phar = new String("\u0295"); private static String pharGlot = new String("\u0295\u02c0"); private static String pharW = new String("\u0295\u02b7"); private static String pharGlotW = new String("\u0295\u02c0\u02b7"); private static String mosesRules = ("< " + glottal + " < " + a + "a < " + aDot + " < c " + " < " + cDot + "̣ < " + cApos + " < " + schwa + " < " + schwaDot + " < h < " + hDot + " < " + hDotW + " < i " + " < " + iDot + " < k < " + kApos + " < " + kW + " < " + kAposW + " < l < " + lDot + " < " + lGlot + " < " + lDotGlot + " < " + lBelt + " < " + barLamApos + " < m " + " < " + mGlot + " < n < " + nGlot + " < p " + " < " + pApos + " < q < " + qApos + " < " + qW + " < " + qAposW + " < r < " + rGlot + " < s " + " < " + sDot + " < t < " + tApos + " < " + u + " < " + uDot + " < w < " + wGlot + " < x " + " < " + xW + " < " + xDot + " < " + xDotW + " < y " + " < " + yGlot + " < " + phar + " < " 
+ pharGlot + " < " + pharW + " < " + pharGlotW); }
Built the collation file using the hard-coded Unicode sequence as shown in the previous post, and tested it, but it seems to fail completely, which is a little puzzling. I'm now thinking that hard-coding the actual Unicode characters into the text, as opposed to defining them ahead of time using escapes, is probably causing the problem. In any case, I have to add extra handlers for the acute and grave variants of all the vowels, so I have to go back into the code and work on it. I've made a start, but run out of time for today.
This is the collation rules sequence, produced by a little app I wrote from the input string:
("< ʔ < a < ạ,ạ < c " + " < c̣ < cʼ < ə < ə̣ " + " < h < ḥ,ḥ < ḥʷ,ḥʷ < i " + " < ị,ị < k < kʼ < kʷ " + " < kʼʷ < l < ḷ,ḷ < lˀ " + " < ḷˀ,ḷˀ < ɬ < ƛʼ < m " + " < mˀ < n < nˀ < p " + " < pʼ < q < qʼ < qʷ " + " < qʼʷ < r < rˀ < s " + " < ṣ,ṣ < t < tʼ < u " + " < ụ,ụ < w < wˀ < x " + " < xʷ < x̣ < x̣ʷ < y " + " < yˀ < ʕ < ʕˀ < ʕʷ < ʕˀʷ ")
EDIT: This is wrong: ignore it!
This is the moses collation info: original sequence, canonical-composed sequence, and canonical-decomposed sequence:
ʔ \u0660 \u0660 a a a ạ \u7841 a\u0803 c c c c̣ c\u0803 c\u0803 cʼ c\u0700 c\u0700 ə \u0601 \u0601 ə̣ \u0601\u0803 \u0601\u0803 h h h ḥ \u7717 h\u0803 ḥʷ \u7717\u0695 h\u0803\u0695 i i i ị \u7883 i\u0803 k k k kʼ k\u0700 k\u0700 kʷ k\u0695 k\u0695 kʼʷ k\u0700\u0695 k\u0700\u0695 l l l ḷ \u7735 l\u0803 lˀ l\u0704 l\u0704 ḷˀ \u7735\u0704 l\u0803\u0704 ɬ \u0620 \u0620 ƛʼ \u0411\u0700 \u0411\u0700 m m m mˀ m\u0704 m\u0704 n n n nˀ n\u0704 n\u0704 p p p pʼ p\u0700 p\u0700 q q q qʼ q\u0700 q\u0700 qʷ q\u0695 q\u0695 qʼʷ q\u0700\u0695 q\u0700\u0695 r r r rˀ r\u0704 r\u0704 s s s ṣ \u7779 s\u0803 t t t tʼ t\u0700 t\u0700 u u u ụ \u7909 u\u0803 w w w wˀ w\u0704 w\u0704 x x x xʷ x\u0695 x\u0695 x̣ x\u0803 x\u0803 x̣ʷ x\u0803\u0695 x\u0803\u0695 y y y yˀ y\u0704 y\u0704 ʕ \u0661 \u0661 ʕˀ \u0661\u0704 \u0661\u0704 ʕʷ \u0661\u0695 \u0661\u0695 ʕˀʷ \u0661\u0704\u0695 \u0661\u0704\u0695
I think I can actually achieve what I want to achieve using a much simpler approach than I'd been contemplating; in fact, I can probably use the method outlined in this post, which I followed for MynDIR. MynDIR sorting doesn't (as far as I can see) have any irrational sequences, though, so it remains to be seen whether it will actually do the job correctly; however, I'm hopeful. The other potential problem is that I can't do canonical decomposition prior to the comparisons. The RuleBasedCollator class doesn't leave room for this; it simply expresses a sequencing rule. However, I think I can build in handling for both decomposed and recomposed variants of the components, since the rules allow parallel sequences which are sorted together. The only problem then would be ill-configured sequences, which match neither decomposed nor recomposed sequences. If we do notice any bad sorting, though, we can fix the problem items, and even search-and-replace through the db to fix them globally.
Working with a little NetBeans application, I've generated the required sequence after canonical decomposition has been performed, with codepoints greater than 127 escaped:
\u0660 a a\u0803 c c\u0803 c\u0700 \u0601 \u0601\u0803 h h\u0803 h\u0803\u0695 i i\u0803 k k\u0700 k\u0695 k\u0700\u0695 l l\u0803 l\u0704 l\u0803\u0704 \u0620 \u0411\u0700 m m\u0704 n n\u0704 p p\u0700 q q\u0700 q\u0695 q\u0700\u0695 r r\u0704 s s\u0803 t t\u0700 u u\u0803 w w\u0704 x x\u0695 x\u0803 x\u0803\u0695 y y\u0704 \u0661 \u0661\u0704 \u0661\u0695 \u0661\u0704\u0695
This is a good start for my comparator...
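The little NetBeans app isn't reproduced here; a sketch of the decompose-and-escape step it performs (class and method names are mine) would look something like this:

import java.text.Normalizer;

/*
 * Sketch only: canonically decompose a string (NFD) and escape every character
 * above U+007F as a backslash-u sequence, leaving plain ASCII untouched.
 */
public class DecomposeAndEscape {
    public static String escape(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        for (char c : decomposed.toCharArray()) {
            if (c > 127) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("ạ c̣ ḥʷ"));
    }
}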
Way back when, I'd started work on a sort comparator for the Moses entries, but since then the actual transcription characters have been altered substantially, and in any case I think my approach was flawed. I've revived the code and moved it from Eclipse into NetBeans to get it finished. The original approach was based on the comparison of individual characters in a single sequence, but the fact is that many of the sequences cannot be sorted this way -- for instance: q qʼ qʷ qʼʷ violates normal logic. What I'll have to do is to convert each group of such items into e.g. q1 q2 q3 q4, and sort based on those sequences rather than based on individual characters. If I can reduce all of the characters to such simplistic ascii representations, using numbers to force the correct sort order, then it should work fine.
ʔ a ạ c c̣ cʼ ə ə̣ h ḥ ḥʷ i ị k kʼ kʷ kʼʷ l ḷ lˀ ḷˀ ɬ ƛʼ m mˀ n nˀ p pʼ q qʼ qʷ qʼʷ r rˀ s ṣ t tʼ u ụ w wˀ x xʷ x̣ x̣ʷ y yˀ ʕ ʕˀ ʕʷ ʕˀʷ
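For the record, the grouping idea would look something like this -- a sketch only, with just the q-group shown and the mapping values invented for illustration:

import java.util.LinkedHashMap;
import java.util.Map;

/*
 * Sketch of the "q1 q2 q3 q4" idea: map each multi-character grapheme onto an
 * ASCII token that forces the desired sort order, matching longest sequences
 * first so that q'ʷ is consumed before q' or q.
 */
public class SortKeyBuilder {
    private static final Map<String, String> KEYS = new LinkedHashMap<String, String>();
    static {
        KEYS.put("qʼʷ", "q4");
        KEYS.put("qʷ", "q3");
        KEYS.put("qʼ", "q2");
        KEYS.put("q", "q1");
    }

    public static String sortKey(String form) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        outer:
        while (i < form.length()) {
            for (Map.Entry<String, String> e : KEYS.entrySet()) {
                if (form.startsWith(e.getKey(), i)) {
                    sb.append(e.getValue());
                    i += e.getKey().length();
                    continue outer;
                }
            }
            sb.append(form.charAt(i));
            i++;
        }
        return sb.toString();
    }
}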
I've created a new standalone Cocoon webapp using the original code and our standard Cocoon build from last year. I had to make some simple changes, but everything ported quite well.
I've posted this web application on the Pear server, which will be its long-term home.
I've also added a couple of refinements, as planned:
- There are now two links in the menu to the entries, one of which shows only those from files with status="completed" (as before); the other link shows all the entries in the database. I think the second view will be useful to ECH and SMK as they work on the files.
- There's now a Status link on the menu, which leads to a page showing information about all of the files currently in the database. It shows the filename, its status, the number of entries in that file, the date of the last change, and a list of all the changes that have been done. The final column in the table shows "To do" entries. These are harvested from any XML comment inside the <revisionDesc> element which contains the characters "TODO:" (a sketch of the harvesting query is below this list). This gives us a single standard location and method of storing a TODO list for each file; when actions are completed, a TODO comment can be turned into a <change> entry.
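The harvesting step amounts to something like this -- a sketch only, assuming an XQuery layer, the TEI namespace, and a hypothetical collection path:

xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Build the "To do" column: any comment inside <revisionDesc> containing "TODO:". :)
for $doc in collection('/db/moses')
let $todos := $doc//tei:revisionDesc//comment()[contains(., 'TODO:')]
where exists($todos)
return
  <file uri="{document-uri($doc)}">
    {
      for $t in $todos
      return <todo>{normalize-space($t)}</todo>
    }
  </file>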
I haven't yet turned off the old site, but I'd like to do that soon, once I get approval from ECH.
Set up the subversion repository, and the SVN client on SK's and ECH's computers in oXygen, so we're all clear on the process, and tested it. My next tasks:
- Add the status values to the ODD and generate a new schema, enforcing one of the six values in SK's post (below this one).
- Tweak the Cocoon app so that it only displays data from files which are completed.
- Write a status page for the Cocoon app showing where each of the files is at, including as much info as possible from the <revisionDesc>.
Here are the values we've decided on for revisionDesc status. Martin will add these to the database schema.
rescued - Only one file has this status. It contains rescued entries that are missing from the main alphabetical files and need to be added back in.
unedited - These files have not yet been worked on.
editing - These files are actively being worked on by SMK.
additions_needed - ECH or SMK needs to add more entries from MDK's file cards. Only very few files should have this status. See comments at the top of each file for more details.
edited - These files just need a final proofread and check of phonemicizations by ECH.
complete - These files are done! Martin will program the database website to only display the entries from files with status="complete".
A file might not move through these statuses in order. Some files may go back and forth from "editing" to "additions_needed" a couple of times before reaching "edited" status.
I've been trawling through the Moses XML files trying to figure out a useful structure for the XML, and I'm realizing that the way we've been working -- moving files between directories to indicate their status -- is going to be an unhealthy way to work in subversion. Every time you move a file, you actually have to do an "svn remove" command to remove the file in its original location from tracking, and an "svn add" to add it into its new location. This will be tiresome and error-prone.
We really ought to be making use of the TEI revisionDesc element to track the changes we make to files.
Each change we make to the file would be documented in a <change> element, and the <revisionDesc> element itself has a "status" attribute which we can use to track what stage each file is at. I can fix it so that the web application looks at this status attribute to decide whether to display the content of the file or not, which means that all the files could always be stored in the database; we wouldn't have to put them in a special folder to "publish" them.
All the files would stay in one folder, and wouldn't have to be moved around; we could open any file to find out its situation and status.
So this is the proposal:
- Go through all the files in these folders, and identify _which_ of the folders each file should actually be in, based on its status:
- rescued
- tei_for_xform
- tei_xml_ECH
- tei_xml_done
- tei_xml_editing
- ready_to_edit_xformed
- For all of the files, create a <revisionDesc> element with its @status attribute showing one of the following values:
  - status="unedited" (files from ready_to_edit...)
- status="editing" (files from tei_xml_editing)
- status="rescued" (file from rescued)
- status="edited" (files from tei_xml_ECH)
- status="complete" (files from tei_xml_done)
- Merge all these directories into one, called "xml", containing all the files.
- Use this to create the Subversion repository (which will also include the cocoon code, documentation, schema etc.).
After that, I'll be able to upload all files into the db without worrying about what status they're at, because the db will only "publish" files which have a status of "complete". ECH will be able to do a find-in-files to discover which files have status="edited", meaning that they're awaiting tweaks and approval from her, and SK will be able to do the same to determine which files she's currently working on. When I run XSLT transformations, I'll be able to target the transformations at files with a specific status value.
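The publication filter is a one-liner once the @status attributes are in place -- a sketch only, assuming an XQuery layer, the TEI namespace, and a hypothetical collection path:

xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Only "publish" entries from files whose revisionDesc status is "complete". :)
for $doc in collection('/db/moses')
where $doc//tei:revisionDesc/@status = 'complete'
return $doc//tei:entry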
Every time a file is edited, we'll add an entry to the top of the <revisionDesc> element explaining the changes we've made (very briefly, unless there's a good reason to go into detail).
Waiting for approval from the team before moving forward with this.
Here's an update on the status of the files in the tei_xml_editing folder:
affix.xml
-To do: proofreading against MDK cards from line 1870, phonemicizing throughout, checking questions in Comment tags.
-Add entries for missing affixes, per blog posts 1-4 Mar 2010. Go through affix paradigm sheets to determine which ones still need entries, and add entries for these affixes. CHECK ALSO pron.xml before creating a new entry!
h-phar-part1.xml
-Done by SMK and ready to be checked by ECH, but we'll wait 'til I have finished h-phar-part2 and combined the files again.
h-phar-part2.xml
-Ready for editing by SMK. Needs to be combined with h-phar-part1 when completed.
lex-suf-new.xml
- This file contains entries SMK created for 4 lexical suffixes which could not be found in the main lex-suff.xml file. We subsequently realized that many lexical suffix cards had never been entered into the Lexware database, so the next step here is to enter the rest of those cards. Then, if further lexical suffixes are still unaccounted-for, we can create new entries for them. CHECK ALSO lex-suf-nom.xml before creating a new entry!
lex-suf.xml
-ECH or SMK need to phonemicize all the examples.
-This file still needs to be checked against the Lexware printout.
-ECH needs to check SMK's work.
-The file needs to be proofed against MDK's cards, and the missing lexical suffixes need to be entered, as noted above.
-CHECK ALSO lex-suf-nom.xml before creating a new entry!
particles.xml (FORMERLY KNOWN AS affix-part.xml)
-This file contains only the first two entries from the particle card file. These two entries need editing, and the rest of the cards need to be entered. We will enter just the <form> and <def> information for the particles, not the dictegs, because the same dictegs appear elsewhere in the data (with their root morphemes). Later on, we will search for the examples of each particle, tag them, and link them to the particle entries programmatically.
qw-glot.xml
-Edited by SMK as far as line 469. The rest of the file is ready to edit.
rescued_final.xml
-This file contains the entries MDH recently rescued, which had been lost in an earlier transformation of the data. So when SMK comes across a missing entry in an alphabetical file, she can search for it in this file and paste it into the correct place in the alphabetical file.
These two files were left in a state of partial conversion after being partly edited before we wrote some universal conversions that automated some of the editing. We've now carried out all the required conversions on these files to bring them to the same state that the rest of the files awaiting editing are at. In the process, I discovered that in the case of gloss phrases which had multiple asterisked words, only the first was being tagged as a gloss, so in the file collapse_forms_etc.xsl, I abstracted the code which tags glosses to make it into a recursive template which would tag all glosses. This now means that there may be some instances of multiple glosses untagged in the bulk of the files awaiting editing; if that's the case, then I can extract that template and make an identity transform with it to apply the same fix to the rest of those files.
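The real template lives in collapse_forms_etc.xsl and isn't reproduced here; the recursive idea is roughly this (a sketch in XSLT 1.0, with namespace handling and the details of word boundaries omitted):

<!-- Sketch: recursively wrap every asterisked word in a text node in <gloss>,
     removing the asterisk, instead of stopping after the first match. -->
<xsl:template name="tagGlosses">
  <xsl:param name="text"/>
  <xsl:choose>
    <xsl:when test="contains($text, '*')">
      <xsl:value-of select="substring-before($text, '*')"/>
      <xsl:variable name="rest" select="substring-after($text, '*')"/>
      <xsl:variable name="word">
        <xsl:choose>
          <xsl:when test="contains($rest, ' ')">
            <xsl:value-of select="substring-before($rest, ' ')"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="$rest"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:variable>
      <gloss><xsl:value-of select="$word"/></gloss>
      <!-- Recurse on whatever follows the word just tagged. -->
      <xsl:call-template name="tagGlosses">
        <xsl:with-param name="text" select="substring($rest, string-length($word) + 1)"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$text"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>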
We've now re-organized the file structure somewhat, so that we have tei_xml_editing, for files which SK is in the process of working on; tei_xml_ECH, for files edited but awaiting ECH's approval; and tei_xml_done, for files ready to go into the database. If this structure remains stable, I'll base the subversion repository on it.
Following up on our questions from March 2010 about how to mark up unattested glosses, we concluded the following:
An unattested gloss should be in its own <seg>, have type="u", and have a following <bibl> tag identifying the editor who supplied the gloss - generally ECH:
So this:
<seg><gloss>knot</gloss>ted up (one knot) <
<gloss>tie</gloss>>
</seg>
<bibl>Y13.2</bibl>
Would change to this:
<seg><gloss>knot</gloss>ted up (one knot)</seg> <bibl>Y13.2</bibl>
<seg><gloss type="u">tie</gloss></seg> <bibl>ECH</bibl>
I will need to process these manually.
When we left off this project in March 2010, affix.xml and lex-suff.xml still needed a subset of the steps from collapse_forms_etc.xsl run on them to clean up their dicteg entries. Martin had attempted this on 16Sep10, but I checked these files today, and the transformations didn't work. We concluded that it will be fastest for me to make these changes to the dictegs manually.
See the file status report blog entry, 29Mar10, and the "Automating some repetitive tasks through XSLT" blog entry, 18Mar10, for details on what needs to be done to each of these files.
Although all ENTRIES missing from qw.xml were successfully rescued into rescued.xml, I noticed various missing LINES which didn't make it into qw.xml, and were NOT rescued either. Some of these losses were due to data entry typos by CDH, but some are systematic: The following bands were lost in the transformation from the entries_separated stage to the partially_transformed stage:
xf
gn
gc
q
cul
g and k bands immediately following these bands were also lost.
However, after discussion with Martin, we decided we have come too far to go back and rescue these bands. I will re-type this data from the Lexware printouts as I go.
MDH wrote:
I've now tweaked the XSLT and run a second operation, which brought back 1906 entries, as opposed to nearly 3,000 from yesterday's run. SK will examine these and see if we've now hit the nail exactly on the head.
SMK adds:
I checked this new version of rescued.xml against qw-glot.xml, l-affric.xml, and the first 10 pages of Lexware printout of qw.xml. All the missing entries from these three files were successfully rescued. No "non-missing" entries were found from l-affric.xml or qw.xml. Two "non-missing" entries from qw-glot.xml were still rescued:
<!--From file: Q'W.xml-->
<ENTRY level="002" id="">
<ds/>
<infl>stative</infl>
<stt>q'ʷə+√q'ʷác'‐t</stt>
<g>they are *full</g>
<k>JM3.7.5</k>
</ENTRY>
<!--From file: Q'W.xml-->
<ENTRY level="002" id="">
<ls.ls.dm/>
<infl>intransitive</infl>
<i>kən q'ʷ+√q'ʷác'=əl=tn</i>
<g>I *fill|ed up my basket</g>
<k>JM2.110.2</k>
<var>kən q'ʷəc'+√q'ʷác'=ɨl=tn̩</var>
<g>I *fill|ed several baskets</g>
<k>JM3.7.7</k>
<i>q'ʷəc'+√q'ʷác'=ɨl=tn̩</i>
<g>he *fill|ed several baskets</g>
<k>JM3.7.6</k>
<i>q'ʷəc'+√q'ʷác'=ɨl=tn̩ lɨx</i>
<g>they *fill|ed several baskets</g>
<k>JM3.7.8</k>
</ENTRY>
We couldn't figure out why these were included in the rescue operation, but since all the truly missing entries have been found, it doesn't matter that a few extra entries have also been found.
Martin then proceeded to run the transformations outlined in the previous post on rescued.xml to create rescued_final.xml.
This is what we've done to get the missing entry items into a state where they can be merged back into the main collection whenever SK finds that one is missing:
- Generated a rescued.xml file starting from the complete set of unmerged_xml documents, collated using xincluded-complete-list-expanded.xml, and running a new transform called retrieve_lost_from_unmerged.xsl.
- Manually fixed nested <ENTRY> tags (there were only six instances).
- Ran rescued.xml through remove_empty_tags_from_rescued.xsl, which removes the initial empty tag that caused all our problems; it also turns any child::infl tag into a comment (SK thinks this will be the simplest way to preserve that information). The output is called rescued_empties_removed.xml.
- Ran rescued_empties_removed.xml through expand_separated_xml.xsl to produce rescued_empties_removed_expanded.xml. In this case, I tweaked the XSLT file to preserve comments from the preceding steps, recording the file of origin of the entry, and any <infl> tag value.
- Ran rescued_empties_removed_expanded.xml through glottal_conversion.xsl to produce rescued_empties_removed_expanded_fixed.xml. This corrects a bunch of Unicode character representations.
- Ran rescued_empties_removed_expanded_fixed.xml through collapse_forms_etc.xsl to produce rescued_empties_removed_expanded_fixed_forms_collapsed.xml. This makes a set of changes SK identified as being uniform and helpful to reduce the amount of repetitive work she has to do.
- Renamed rescued_empties_removed_expanded_fixed_forms_collapsed.xml to rescued_final.xml.
Where entries are missing from the other files, SK will now go to rescued_final.xml and retrieve them from there; they should be in the same condition as the other entries they're being merged with.
Note to self: it was difficult to reconstruct this conversion sequence because half of it took place before the blog was in place. In future, make sure anything like this is blogged in great detail.
My first shot at pulling back missing entries did catch all of the missing ones, but also found some which were not missing. After SK examined the results, we now theorize that where the entry had an <infl> element with an @mode attribute, it was retrieved during the original operation, so these entries can now be excluded from our set. I've now tweaked the XSLT and run a second operation, which brought back 1906 entries, as opposed to nearly 3,000 from yesterday's run. SK will examine these and see if we've now hit the nail exactly on the head.
Met with ECH and SK today, and got everything going. This is what's been done:
- Following my own and ECH's research into the entries which were not converted at the separation stage, I wrote an XSLT file to harvest all of those entries from the unmerged files. SK will check a significant number of these to make sure that we've got all the missing ones. It's possible we may have got others that weren't missing, too; that's not so much of a problem, but I'll probably try to fine-tune the process to keep the numbers of entries down anyway. Then when we're sure we have the correct subset, I'll try to reconstruct the conversion process that the other files went through, and run this file through it. Then the missing entries can be merged back into the existing files as SK works her way through them.
- The schema has been updated. We were working from a 2006 schema with some manual modifications to the RNG file, for two reasons: first, we are using an element (<dicteg>) which has now vanished from P5, and second, at the time when we created the schema, Roma wasn't processing customized ODD files properly (IIRC). Now, though, I've put the whole thing on a more formally-correct footing, and built an ODD file to add <dicteg>, and to modify various other bits of the content model to allow (for instance) <bibl> inside <entry>. At the same time, we benefit from some other changes to P5, such as the availability of @type and @subtype on <gloss>, which enables us to mark editor-supplied (unattested) glosses as <gloss type="u">, making it possible to suppress them or deprecate them in output if necessary.
- SK is set up on Spartan, working directly on the server copies of the files.
- I have asked for a Subversion repository for the XML. This is going to be essential, since there will be three of us working on the content. Once sysadmin have set this up, we'll have a little training session and I'll make a cheat sheet for SK and ECH.
This is based on an email from ECH, describing the context in which content was lost during the transformation from "unmerged" to "entries_separated" files:
What is happening is this. When there is a sequence in the unmerged file of the following:
<ENTRY level=”002” id=””> <xx></xx> <infl>yyyyyy</infl> zzzzz zzzzz zzzz </ENTRY>
The entire entry is missing from the separated file.
These level 002 entries are derived words in the Lexware database; significantly, the entries are dropped from the transformed output when the inflection band follows the derivation band directly.
This has been observed to happen in qw-glot.xml as well as other files. I'm going to look at that file directly, and find specific examples I can work with, then isolate them and work on a fix.
Results of some digging:
- qw-glot_SEPARATED.xml is constructed from the very small Q'W.xml and the much larger Q'W1CDH.xml. Both items from the former are there in qw-glot_SEPARATED.xml, so the dropped items are all from Q'W1CDH.xml.
- Some level 002 entries were definitely carried over OK (e.g. entry xml:id="niʔqʼWacʼlqs", "nosebleed"), so it's not just a question of dropping level 002 entries.
- It's not just a case of items with no @id being dropped (as I initially thought from ECH's description above); some entries with no @id value are carried over (e.g. "I *fill|ed up my basket").
Here is an example which shows the problem. In the following, the outer entry (the root) is carried over, but the inner one is completely dropped:
<ENTRY level="001" id="√q'ʷáq'ʷ‐"> <rt>√q'ʷáq'ʷ‐</rt> <ENTRY level="002" id=""> <ls></ls> <infl>nominalizer</infl> <n>s‐√q'ʷáq'ʷ=əl'qʷ</n> <g>*prairie‐chicken, *sharp‐tailed~grouse</g> <gc>Y2.33 is JM only</gc> <k>A46; Y2.33</k> <var>s‐√q'ʷáq'ʷ=əl'qʷ‐aʔᵃ</var> <g>?</g> <gc>claimed by Agnes Miller to be MC, by Jerome Miller to be Colville</gc> <k>AM, JM</k> <var>s‐√q'ʷáq'ʷ=əl'qʷ‐aʔ</var> <g>*prairie~chicken</g> <k>EP2.31.8</k> </ENTRY> </ENTRY>
This appears to be a situation in which the first element of the embedded item (in this case "lexical suffix") is empty, and is followed by <infl>.
I looked at the XSLT code (separate_xml.xsl) and determined that:
- An entry is only processed if its first element has a string-length of more than zero; so the empty first item causes the problem here.
- However, this is only the otherwise branch of a conditional; the first branch (presumably deemed to be the most common) expects to find an @mode attribute on elements. What it does then is to process all following items which have the same @mode attribute.
It looks as though this process was primarily written targeting a situation in which we needed to separate not simply embedded <ENTRY> elements, but also blocks of tags within <ENTRY> elements, which were defined by their sharing an @mode attribute value. However, the qw-glot file doesn't have ANY @mode attributes, while some files have many of them. It appears that there were two distinct methods of structuring entries in the original data, and these were converted into two slightly differing XML structures.
However, this is something of a red herring; I found another entry, in T4CDH.WRK.xml, which does make use of @mode but still exemplifies the problem (its inner <ENTRY> is lost):
<ENTRY level="001" id="√k'ᵊř"> <rt>√k'ᵊř</rt> <ENTRY level="002" id=""> <lc.ls></lc.ls> <infl>nominalizer</infl> <n mode="1">s‐t‐√k'ᵊř=álᵊqʷ</n> <g mode="1">tree cut with something</g> <k mode="1">Y24.74,77</k> <il.lc.ls.n mode="1">nawə́nt s‐t‐√k'ᵊř=álᵊqʷ</il.lc.ls.n> <df mode="1">groove or deep line cut into a tree</df> <k mode="1">Y24.74</k> </ENTRY> </ENTRY>
So the issue is clearly with the empty tag. It's obvious from the (very simple) XSLT that in such a context, we explicitly stop processing, so nothing is output:
<xsl:if test="(not(preceding-sibling::*) and (name() != 'ENTRY'))"> <xsl:if test="string-length(.) > 0"> <xsl:element name="ENTRY"> <xsl:copy-of select="."></xsl:copy-of> <xsl:for-each select="following-sibling::*[not(@mode)][not(name() = 'ENTRY')]"> <xsl:copy-of select="."></xsl:copy-of> </xsl:for-each> </xsl:element> </xsl:if> </xsl:if>
The question now is why -- why did we decide not to process entries that began with an empty tag? I'll write to ECH and see if she has any memory of this, and also keep digging to see if I can find a reason. Ultimately, it should be possible for me to use the same strategy in reverse to FIND all those entries, and output them specifically in one block, which could then be merged back into the new files (once it's gone through all the other processing).
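The reverse harvest would amount to something like this -- a sketch only, not the eventual rescue transform, just the original condition turned inside out:

<!-- Sketch: instead of skipping entries whose first child is empty, select
     exactly those entries and copy them out into one block for rescue. -->
<xsl:template match="/">
  <rescued>
    <xsl:for-each select="//ENTRY[*[1][string-length(.) = 0 and name() != 'ENTRY']]">
      <xsl:copy-of select="."/>
    </xsl:for-each>
  </rescued>
</xsl:template>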
We are currently using <angle brackets> to denote a gloss supplied by Ewa
or other editors, rather than by the fluent speaker who actually uttered the Nxa'amxcin example. For example:
hə̣́ll
hə̣́ll ECH
hə̣́lə̣l Y39.109
√hə̣́l-C₂
lazy; <tired>
Y39.109
This indicates that speaker Y glossed this word "lazy", but the editors would also like it to appear in the English-Nxa'amxcin word list under "tired".
However, Martin noted that the angle brackets "amount to an alternative markup system, bypassing the XML, so it's definitely not ideal -- it will make it difficult to find those particular items, or style them in a particular way. They should be tagged in the proper way at some point."
So we need to figure out how best to tag them. I suggested we could mark them all with <bibl>ECH</bibl>, but we don't actually want these glosses to appear on the database website. We just want them to be searchable when we're creating the word list.
So this is another issue to be sorted out when we next pick up the project.
As my first contract comes to an end, here is a summary of the status of all the files we have worked on. I have also added explanatory comments at the top of each active file.
In the tei_xml folder:
c-rtr.xml - completed and posted on database website
h-phar-part1.xml - completed by SMK. ECH needs to check phonemicizations and hyphs, but might as well wait 'til part2 is completed too.
h-phar-part2_xformed.xml - ready to edit. MDH has completed the latest XSLT transformation. Future editors, please look out for missing entries as you continue to edit this file! It could have had the same data loss problems that qw-glot and other files had.
An old copy of h-phar.xml is currently posted on the database site, but that was a mistake.
h.xml - completed and posted on database website
lex-suf-new.xml - contains entries I created for 4 lexical suffixes which could not be found in the main lex-suff.xml folder. We subsequently realized that many lexical suffix cards had never been entered into the Lexware database, so the next step here is to enter the rest of those cards. Then, if further lexical suffixes are still unaccounted-for, we can create new entries for them.
phar-w.xml - completed and posted on database website
qw-glot.xml- edited by SMK as far as line 469, whereupon I discovered many missing lines and entries. This problem occurred during the transformation from the "unmerged" version to the "entries_separated" version. ECH is trying to deduce what went wrong, and will post details on the blog.
s-rtr.xml - completed and posted on database website
In the tei_for_xform folder:
affix_test.xml - a small file with a copy of one entry from the main affix.xml file, made for MDH to test the following XSLT transformation on
affix.xml
-MDH needs to adapt the most recent XSLT transformation to format the dictegs in this file as follows:
-add <phr type="p" subtype="u"> </phr> at the top of each <quote>
-surround *'d words with <gloss> tags
-move bibls up from quotes to their daughter <phr type="n">s and <seg>s.
-Then ECH and SMK need to: proofread against MDK cards from line 1870, phonemicize throughout, check questions in Comment tags.
lex-suf.xml
-MDH needs to use his XSLT transformation to reformat the dictegs. (SMK has formatted the form and sense/def sections manually.)
-ECH or SMK need to phonemicize all the examples.
-This file still needs to be checked against the Lexware printout.
-ECH needs to check SMK's work.
-The file needs to be proofed against MDK's cards, and the missing lexical suffixes need to be entered, as noted above.
In the ready-to-edit folder:
All files - MDH has completed the latest XSLT transformation, but more research is needed regarding data that was lost in the earlier transformation from "unmerged" versions to "entries_separated" versions. ECH is trying to deduce what went wrong, and will post details on the blog.
The results for this file were not exactly what we wanted -- some copies of bibl elements were not being made -- so I've revisited the transformation, and I think it's now done correctly. Waiting for SK to confirm.
For future reference, here's how we decided to handle MDK's entries for compound lexical suffixes (e.g. apqən, qnwil, etc.):
-keep compound lexical suffixes as their own entries in the database
-tag them with corresp in hyph - e.g.
<hyph>=<m corresp="ap-1 qin">ápqən</m> </hyph>
-add lexical-suffix-compound to their feature structures:
<fs>
<f name="baseType">
<symbol value="suffix"/>
</f>
<f name="derivational">
<symbol value="lexical-suffix-compound" />
</f>
</fs>
I added <symbol value="lexical-suffix-compound" /> to feature_system.xml after discussion with Martin and Ewa.
Three files which were already edited or in the process of editing when we ran the last series of transformations now need reworking, so we figured out what needs doing to each (each being different). I've rewritten my transformation from the other day so that I can switch on or off various aspects of it, and made some progress with running it on two of the files, but in the process of doing this a new, more serious problem emerged, concerning data which was lost from one or more of the files during a previous transformation in 2006. We think we know what triggered it, and we also think we know which files might be affected; if I can work back through blog entries to confirm exactly what was done, in what order, to the dataset, we should be able to make the same changes with some minor alterations to undo the damage. But this is going to be significant work, so it might have to wait until the fall.
@type attributes on various elements now have abbreviated values ("n" or "p" instead of "narrow" or "phonemic"). Updated and tested the XSLT to take account of this.
With periodic input from SK, I've finished writing the XSLT to transform the existing XML to something much closer to what the editors want to produce: multiple instances of many tags are collapsed into one, and bibl references copied in multiple places. The problem now is that oXygen 11.2 seems to have serious issues running a transformation scenario on the files; it runs out of memory (I've already upped its memory allotment twice), or it simply stops after one file instead of running through all the selected files. In the end I gave up on it, and ran the transformations in oXygen 10.3, which has no problems.
Contrary to previous discussions ...
Yes, we ARE going to keep the root symbol √ in both <hyph>s and <dicteg>s for monomorphemic entries.
This saves me the work of taking the √ out in these contexts in all the rest of the files.
Ewa has graciously volunteered to put the √'s back IN in the appropriate places in the current active files.
Spent an hour working on the rather thorny XSLT to rationalize the entry structures. This is one of those jobs where the real needs only become apparent as you start working through the process and looking at the output from early code. SK and I are gradually figuring out what we need and how to do it.
In the tei_xml folder, I have added a file called qw-glot-test.xml.
It contains three entries copied from qw-glot.xml which should be good test cases for all the transformations we hope to be able to do.
We met this morning to discuss ways to relieve SK of some of her more tedious work during the editing process. We decided that some processes can be automated, and accomplished through XSLT on the files still awaiting editing. These are the details (from which I'll later write the XSLT):
- Where an <entry> has multiple <form> elements (a sketch of this step follows at the end of this post):
  - Keep only the first one.
  - Append the contents of each <pron> in the subsequent <form>s to the <pron> in the first <form>.
- Where an <entry> has multiple <sense> elements:
  - Copy the contents of #2ff to the end of #1.
  - Delete #2ff.
  - Delete any empty <sense> elements.
- In every <quote>:
  - Start by adding this hard-coded content at the beginning: <phr type="p" subtype="u"></phr><bibl>ECH</bibl>
  - Next, wrap the first text node in a <phr type="n"> tag.
  - Append any <bibl> which is a following-sibling of the parent <quote>.
- Any gloss[parent::quote] should be changed to <seg>:
  - Then find any asterisk in its content, and wrap a <gloss> tag around that word, removing the asterisk.
  - Next, append any <bibl> which is a following-sibling of the parent <quote>.
  - Output the rest as-is (should there be anything else?).

Finally, there needs to be a change to the server-side code, because "phonemic" and "narrow" as values of @type will be changed to "p" and "n" respectively, while @subtype will go from "unattested" to "u".
-Use Find-and-Replace to globally remove the following comments:
<!--Form for the core entry-->
<!--Definition for the core entry-->
<!--Not yet edited-->
-For unattested entries for roots in isolation, check MDK's card to determine whether MDK or ECH added the root entry, and add an appropriate <note> to the entry.
-Replace any instances of ḥ (composed of h and COMBINING DOT BELOW) with ḥ (LATIN SMALL LETTER H WITH DOT BELOW).
-Replace any transcribed R's with h, ḥ, or ʕ as appropriate.
-Keep the angle brackets < > around glosses added by ECH - e.g.
frozen <freeze>
This differentiates meanings given by native speakers from glosses added by ECH.
-Leave clitics in main entries; do not make examples with clitics into dictegs.
-Explicitly mark up editorial decisions with <note>s. (See for example the notes on ḥaƛʼ-1 and ḥaƛʼ-2.)
We need a space between the <phr> and the <seg><gloss> in all dictegs.
Right now, dictegs in the affix file are displaying on the database site with the gloss jammed right up against the ] of the <phr>.
I will fix this by adding a space between </phr> and <seg> throughout the affix file, with find & replace.
The affix.xml file which has just been completed needed to be processed using the XSLT I'd previously written to convert the erroneous transcriptions of glottalizations resulting from our original conversion to the current system. This XSLT has been successfully run on all the other files in the system. However, when it was run on the affix.xml file, in oXygen it simply did nothing; the old transcriptions remained untransformed.
I could not figure out what the problem was. Running Saxon at the command line didn't help either. There may be something rather odd about that file, though. There are two symptoms of oddity: 1) when opening the file in oXygen, I got a warning about bidirectional features being turned off due to the file size (which may indicate that there's something oddly bidi about it, although searches for the bidi-change-trigger characters produced nothing); and 2) when copying and pasting the contents of the file from oXygen to Transformer running under Wine, only three characters were pasted: an x, a 5, and a control character, "device control 2", or Unicode U+0012. However, if you search for this character, it can't be found in the document.
In the end, I gave up and used Transformer to do a conversion using search-and-replace. There may be some problem simply with the size of the file (1.1 MB) -- it's larger than any of the others. But I'll keep an eye on that file.
I just tried to find the Unicode characters for the mid-central [a] and the back [ɑ]. The former is LATIN SMALL LETTER A (U+0061), and the latter is LATIN SMALL LETTER ALPHA (U+0251).
BUT when I copied and pasted the "back a" character which is used throughout the xml files, from Oxygen into the character palette search box, it turned out it IS LATIN SMALL LETTER A (U+0061)!
So this is just a font problem in Oxygen, and we don't have to find and replace anything in the data. The character does display correctly as mid-central [a] on the database website.
Rewrote some parts of the markup documentation to include our decisions about attested vs unattested transcriptions, locations of <bibl> elements, and other issues.
We discovered we needed the @subtype attribute on <phr>, and it wasn't available, so I went back to the schema. What I have is a manually-modified RNG file; the original was generated from Roma back before P5 was finally released, and I think Roma was broken at one point, hence the manual modification to the RNG (to allow <bibl> in <entry>). I discover now, though, that there were significant changes to the dictionary markup between the generation of my schema and the release of P5. Specifically, <dicteg> no longer exists (<cit> and/or <q> and/or <quote> should be used), and it looks as though @subtype is now available on <phr>. However, if I generate a new schema, all the old data will be invalid because of <dicteg>, so as a temporary measure I've made another manual modification to the old schema to allow @subtype on <phr>, and we're sticking with the old schema until I get the chance to generate a new one, figure out the issues, re-process all the old markup, and rewrite the XSLT. Not a huge task, but an annoying and unexpected addition to the workload.
To clarify which phonemic transcriptions in the database were recorded by MDK, and which have been derived by ECH from MDK's narrow transcriptions, we are adding subtype="unattested" to the markup.
A phonemic form transcribed by MDK will still be marked up like:
<pron>
<seg type="phonemic">ṣə̣́nṣə̣nt</seg><bibl>JM3.20.11</bibl>
<seg type="narrow">sə́nsə̀nt</seg><bibl>Y24.40</bibl>
</pron>
(An appropriate disclaimer can be programmatically added to all of MDK's phonemic forms in the final output.)
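That programmatic disclaimer might be added at output time with something like this -- a sketch only, assuming the TEI namespace and HTML output, with the wording and class names as placeholders:

<!-- Sketch: when rendering a phonemic <seg> with no @subtype (i.e. one of
     MDK's own phonemicizations), append a stock disclaimer to the output. -->
<xsl:template match="tei:seg[@type = 'phonemic'][not(@subtype)]">
  <span class="phonemic">
    <xsl:apply-templates/>
    <span class="disclaimer"> [phonemicization as transcribed by MDK]</span>
  </span>
</xsl:template>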
A phonemic form derived by ECH will be marked up like:
<pron>
<seg type="phonemic" subtype="unattested">ṣə̣́nṣə̣nt lx</seg><bibl>ECH</bibl>
<seg type="narrow">ṣə̣́nṣə̣nt ləx</seg><bibl>JM3.21.1</bibl>
</pron>
A phonemic form derived by MDK (one with no source noted in the file cards or lexware database) will be marked up like:
<pron>
<seg type="phonemic" subtype="unattested">ṣə̣́nṣə̣nt lx</seg><bibl>MDK</bibl>
<seg type="narrow">ṣə̣́nṣə̣nt ləx</seg><bibl>JM3.21.1</bibl>
</pron>
This applies to <phr>s as well as <seg>s. Martin has rewritten the schema to allow for subtypes of <phr>s.
Every <seg> and <phr> should now have a sister <bibl> showing its source.
I have carried these changes through the s-rtr, c-rtr, and phar-w files, and will implement them in the affix file next.
It is also possible to put <bibl> tags higher up the tree; e.g., if everything in an entry came from the same source, only one <bibl> tag would be needed, a sister to the whole <entry>
Here's a summary of the rest of the markup decisions we made last week:
-placeName tags can be added to highlight place names within glosses, e.g.
<seg><gloss><placeName>Corkscrew Grade </placeName></gloss>(next canyon from <gloss><placeName>Jackass Canyon</placeName></gloss>)</seg>
-Each <entry> should generally only have one <hyph>. It is not necessary to have a <hyph> for every <form>, unless a variant form contains one or more additional morphemes (e.g. cahcimn, cahcimn-c).
-For words which were transcribed with an initial "(s-)", the prefixed and non-prefixed variants should be marked up as different forms within the same entry. They will have different <hyph>s.
-Words with ə transcribed in parentheses should have separate segs for transcriptions with and without ə - e.g.,
<seg type="narrow">cx̣cx̣ᵊnᵊwˀálənˀ</seg>
<seg type="narrow">cəx̣cəx̣ᵊnᵊwˀálənˀ</seg>
-Yes, morpheme breakdown should be included in <dicteg>s, but only for the form being exemplified.
I have just changed the following xml:ids in all the active files (including the affix file):
n --> n-CTL
nt --> n-CTL + t-TR
s2 --> s-SUBJ
kaL-pr --> kaɬ-PR
kɬ-der --> kɬ-DER
kˀɬ-der --> kʼɬ-DER
Ewa will need to ADD the following morphemes to the affix file:
=mix-LS (this is =amx, mix, ExW 'people')
(DONE, SMK 10Oct12. This entry already existed under =amx, but I have changed the xml:id to mix-LS.)
-aɬ- compound linking morpheme. Check fs for morpheme designation
n-LOC (locative prefix)
n-unknown (see sḥaƛʼƛʼnus)
k-LOC (locative prefix)
t-LOC (locative prefix)
kɬ-LOC (locative prefix)
kat-LOC (locative prefix)
niʔ (locative prefix)
t-TR (transitive suffix)
CHAR (characteristic reduplication)
OC (out-of-control reduplication, +C2)
n-SUBJ (1st person singular subject)
xW-SUBJ (2nd person singular subject)
t-SUBJ (1st person plural subject)
Ø-OBJ (3rd person singular object)
m-OBJ (1st/2nd person singular object after -stu-)
m-SUBJ (1pl subject occurring with 3 object in causative paradigm)
al (1st person plural object)
ɬ-DIR (directive)
sa (Note: one cross-reference to a morpheme "sa" already appears in the affix file.)
si
RED (total reduplication - e.g. hames+hames) [n.b.: there is already a morpheme with xml:id RED in the affix file. It's CVʔ reduplication. This needs sorting out, as total reduplication has already been marked up with RED in the h.xml file.]
m-MID (middle)
(Are other "m" morphemes missing too?)
sac (stative)
ʔin- (1st/2nd person singular possessive)
s-POSS (3rd person possessive)
ayˀ-PAST (not to be confused with the entry "ayˀ" in the lexical suffix file!)
s-SUBJ also needs its definition added.
Much debate today on how to mark up entries which have variations in narrow transcription, translation, source, or any combination of the above [<var>s in the Lexware system].
We concluded as follows:
-All the <forms> in an entry should have the same phonemic transcription. Ewa will need to make sure this is the case as she goes through the c-rtr, s-rtr,and phar-w files.
-For entries where the narrow transcriptions are different, but the <def>s are the same, all the narrow transcriptions can be collapsed into one <form>, and all the <bibl>s within the <sense> element can be collapsed into one <def>. SEE FOR EXAMPLE: sʕʷáʔʕʷaʔ "cougar".
-For entries where the senses are different, there should be a separate form for each sense, even if the narrow transcriptions are the same. SEE FOR EXAMPLE: ṣạpḷị́ḷ "flour, bread"
Made some more changes to improve the layout, in consultation with EC and SK. Much more to do...
As blogged elsewhere, we now have some <m> components of <hyph> which break down into multiple morphemes; we're expressing these using @corresp with multiple values, instead of @sameAs with one. I've now written some basic handling for that situation. In the process, I had to migrate the stylesheet from XSLT 1.0 to 2.0. The site is a mix of both. It really needs to be remodelled using a new Cocoon/eXist stack. I'll do that as soon as we have the new Tomcat box up and running.
In the phar-w file, I have put bibl tags within both prons and defs - such that every seg has a sister bibl - e.g.:
<entry>
<form>
<pron><seg type="phonemic">sʕʷáʔʕʷaʔ</seg><bibl>Y1.68; MW; EP</bibl></pron>
<pron><seg type="narrow">swáʔwaʔ</seg><bibl>G48; J3.1; A4</bibl>
</pron>
<pron><seg type="narrow">swˀáʔwˀaʔ</seg><bibl>CS18</bibl></pron>
<hyph><m sameAs="nom">s</m>‐√<m sameAs="ʕWaʔ">ʕʷáʔ</m>+<m sameAs="CHAR">CVC</m></hyph>
<note>onomatopoeic<bibl>Y</bibl></note>
</form>
<sense>
<def><seg>cougar</seg>
<bibl>Y1.68; MW; EP</bibl>
<bibl>G48; J3.1; A4</bibl>
<bibl>CS18</bibl>
</def>
</sense>
</entry>
I am going to do this even in cases where the entry has only one form and one def (rather than just putting the bibl on the whole entry).
This also serves to distinguish among phonemic representations:
-If it was transcribed by MDK, it has a bibl.
-If it was derived by ECH from a narrow transcription, it has no bibl.
I checked with Martin, and he agreed that this way makes sense, so I am going to change the s-rtr file to use this system too.
Martin questioned the stacking of bibl tags in the example above, where multiple sources all give the same definition. These could all actually be in a single bibl tag, but Martin can collapse them into one programmatically later. I will keep my eyes out for any cases where this would NOT work.
As we noted last week:
----------
<bibl> elements need to be applied in many different locations, especially inside <def> and <form>, to make it absolutely clear what the source for each of them is. Right now, a <bibl> tends to appear in a <def>, and that actually means that it applies not only to the <def>'s parent <sense> element, but also to the preceding sibling <form> element. Since this is not reliably the case, though, we need to be explicit about it.
----------
I went through the s-rtr file and made the placement of bibl tags more explicit. For simple entries with one form and one sense it looks like:
<form>
<pron>
<seg type="phonemic">stíks</seg>
</pron>
<bibl>Y37.45</bibl>
<hyph><m sameAs="stiks">stíks</m></hyph>
</form>
<sense>
<def>
<seg>big male mountain <gloss>goat</gloss></seg>
<bibl>Y37.45</bibl>
</def>
</sense>
The most complex combination I've found so far is this kind - one form with two defs.
<form>
<pron>
<seg type="phonemic">ṣạpḷị́ḷ </seg>
<seg type="narrow">sàpᵊlél</seg>
</pron>
<bibl>G7.32; Y6.151, 305; Y16.189; Y21.11</bibl>
<bibl>W9.100</bibl>
</form>
<sense>
<def>
<seg><gloss>flour</gloss></seg>
<bibl>G7.32; Y6.151, 305; Y16.189; Y21.11</bibl>
</def>
<def>
<seg><gloss>bread</gloss></seg>
<bibl>W9.100</bibl>
</def>
</sense>
So the markup shows that the form was given by two speakers (well, the first bibl is actually at least two speakers), but each had a different definition for it.
If we didn't have the bibls in the form here, the formula Martin mentioned would actually still work, I think:
"A <bibl> appears in a <def>, and that actually means that it applies not only to the <def>'s parent <sense> element, but also to the preceding sibling <form> element."
That should actually cover all possible variations:
-different sources are all listed in the same bibl
-different definitions are handled as above
-different forms of the same word get different <form>s and <sense>s anyway.
So can I get away with not putting bibls in the form after all?
Here are some more decisions we made in Friday's meeting:
1) Yes, we still need seg tags within quote tags in example phrases, even if there is no breakdown of the gloss. For example ...
This one has a breakdown of the gloss:
<seg>whole wheat <gloss>flour</gloss></seg>
This one doesn't, but it should still have <seg>s:
<seg><gloss>flour mill</gloss></seg>
I have fixed all these in the s-rtr file.
2) There does not have to be a hyph line for every form in an entry - just the first form in each entry.
3) For zero morphemes, we will use the LATIN CAPITAL LETTER O WITH STROKE character. Ewa will create entries in the affix file for Ø1 and Ø2.
4) When Ewa has added a gloss which was not attested in the original data, we will format it as in this example:
<sense>
<def>
<seg><gloss>stretch</gloss></seg>
<note resp="ECH">[definition by ECH]</note>
</def>
</sense>
Things I have to remember:
- When @corresp is used to bracket multiple morpheme components for a single segment, the site needs to handle this. When you first click on the segment, it should expand into separate morpheme representations; then clicking on each of those should take you into the morpheme.
- <bibl> elements need to be applied in many different locations, especially inside <def> and <form>, to make it absolutely clear what the source for each of them is. Right now, a <bibl> tends to appear in a <def>, and that actually means that it applies not only to the <def>'s parent <sense> element, but also to the preceding sibling <form> element. Since this is not reliably the case, though, we need to be explicit about it.
Ewa wrote yesterday:
Basically there are three crucial things to keep in mind when phonemicizing:
1. The “alphabet” is phonemic, so wherever a transcription deviates from the alphabet it is phonetic.
2. Raised vowels are schwas which are clearly phonetic in nature.
3. Schwas in general tend to be phonetic except for a few cases which are systematic. Only the latter kinds of schwas should be left in phonemic representations.
Here are the changes that Martin added to the XML markup documentation:
1) In section 4.1:
-The contents of hyph should be based on the phonemic transcription of the word.
-If the word contains reduplication, mark it in hyph with CV, etc., rather than the actual segments of the reduplicant, e.g.:
<entry xml:id="ṣə̣nṣə̣nt">
<form>
<pron>
<seg type="phonemic">ṣə̣́nṣə̣nt</seg>
<seg type="narrow">sə́nsə̀nt</seg>
</pron>
<hyph>√<m sameAs="ṣə̣n">ṣə̣̣́n</m>+<m sameAs="CHAR">CVC</m>-<m sameAs="t-STAT">t</m>
</hyph>
</form>
The types of reduplication include:
Characteristic = CVC
Distributive = CəC
Repetitive = Ca
Diminutive = C1
Out of Control = C2
Ewa is making sure all these are in the affix file.
2) For entries in which multiple morphemes combine inseparably to form a single-phoneme item (e.g., c = nt + sa + s), use @corresp instead of @sameAs, with the morphemes separated by spaces - e.g.
<hyph> <m corresp="nt sa s">c</m></hyph>
3) In section 4.2: Another thing I noticed on the blog that would be good to have in the markup documentation:
Where there is no attested definition, Ewa will supply one in this form:
<def>
<note resp="ECH">[The definition/explanation]</note>
</def>
4) In section 4.4: cross references do not need to include an English gloss, so the format for <xr>s should be
<xr>See <ref target="idblah">blah</ref> and <ref target="idblah2">blah2</ref>.</xr>
(NOT: <xr>See <ref target="idblah">blah</ref> (English blah) <ref target="idblah2">blah2</ref>(English blah2).</xr>)
Made a number of changes to the XML markup documentation PDF, some on SK's instructions and some to clarify my implementation of the glottalization-related changes we've made in the last couple of days.
I finished my XSLT conversion for fixing the encoding of glottalization, and ran it on the files awaiting work. They're now all sitting in a directory on the server called "ready_to_edit". There are only a few oddities/problems which make some of the files invalid:
- Some entries have no xml:id at all, for some reason. Rather than make one up, I'll leave it to the editor to assign one.
- Some entries have xml:ids that begin with the standalone grave accent (`, U+0060), which is not a valid character at the beginning of an xml:id. This is because that character is at the beginning of the entry itself. We need to look at these, and decide a) if it should be there, or it's some kind of processing error; and b) if it's correct, then how we should handle it when creating xml:ids (possibly by just deleting it?).
- A couple of entries have completely borked content because the original DOS stuff was never converted over, for some reason. Those entries are few and small enough to be dealt with on a case-by-case basis; the English glosses are there, so they can be tracked back to their original data.
Wrote the beginnings of an identity transform to fix all the glottalization issues discussed two posts below this one. I have the matching basically working, and shelling out to a function; now I just have to construct the function to do all the replacements.
Up to now, we have been working on the basis that surface forms can be broken down into discrete segments constituting morphemes. This is not always the case, though; today one form surfaced in which three discrete morphemes combine to form a single-phoneme item (c = nt + sa + s).
I would have liked to use a sequence of lookups in the @sameAs attribute, separated by spaces, but that's not allowed in TEI; @sameAs can only hold one value. The obvious alternative is @corresp, so we would do this:
<hyph> <m corresp="nt sa s">c</m></hyph>
That's what we're going to do, temporarily; but in the long run, I think we need to make two changes:
- Switch all @sameAs attributes to @corresp attributes.
- Think about whether we need to use hashes before the xml:ids we're pointing to. I'm never sure about this: the items aren't necessarily in the same file, although they sometimes are; but in the context of the database they're easily discoverable just by @xml:id.
This is a summary of the global changes we'll be making to the XML data, based on a re-reading of all the relevant posts from 2007:
- The Glottalized Ejective class has the following members: p’, t’, c’, ƛ’, k’, q’
- The Glottalized Sonorant/Resonant class has these members: mˀ, nˀ, lˀ, ḷˀ, rˀ, wˀ, yˀ, ʕˀ
- The former are currently transcribed in the already-processed files using raised glottals (e.g. tˀ). The raised glottal is U+02c0. These need to be transformed into U+02bc: MODIFIER LETTER APOSTROPHE, "glottal stop, glottalization, ejective". This letter is valid as part of an xml:id attribute, so we could do a global conversion there, using Transformer rather than an XSLT identity transform.
- However, in the partially-transformed files, it appears that all of these items have been transcribed using actual apostrophes. This means we can't use Transformer, because there are valid English sequences containing e.g. t+apostrophe; only in the context of the TEI tags which contain Moses script should the conversions take place. Therefore we will have to use an XSLT identity transform to accomplish this conversion.
The plan, therefore, is this:
- For the completed and in process files, the only conversion I think we need to make is to convert Ejective + raised glottal to Ejective + U+02bc. This can be done universally, using Transformer.
- For the partially-transformed files, we need to write an XSLT identity transform which targets a specific list of only those TEI tags which contain Moses script. The transform will map Ejectives + apostrophe to Ej. + U+02bc (modifier letter apostrophe), and will map Sonorant/Resonant + apostrophe to SR + U+02c0 (raised glottal).
Met with EC and SK, to plan the revival of the project. SK will work for six weeks starting next Tuesday, on Kale, and we'll spend some of Tuesday setting up the machine. I've had SK added to the moses group on TAPoR. We'll do all editing on the server, and I'll back up the content methodically. They will start work on the c-rtr file, and meanwhile Greg and I will analyze old blog posts to devise a replacement system to make the last few changes to Unicode representations, as previously discussed. I also need to revise the project/markup description document a little.
Good meeting today, clearing up a lot of stuff we've been confused about. The issue with gloss tags is resolved. We discovered and fixed a problem in the phar-w.xml file where <bibl> tags were children of <entry> tags, when they should have been children of <def> tags.
Before I leave, I need to go through the preceding blog posts and make a detailed plan for the search-and-replace operations we need to do on the data, to get to the transcription system that's actually correct.
Where there is no attested definition, Ewa will supply one in this form:
<def> <note resp="ECH">[The definition/explanation]</note> </def>
I'm still looking in detail at your postings below, and I'm not sure I've grasped the issue fully yet, but I think it would help if I explain how I envisage the English-NX wordlist system working (in fact, the only way I can envisage it working at the moment).
The intention, as I understand it, is to produce a wordlist, not a dictionary. In other words, the output will be a list of English words and phrases in alphabetical order, each with an equivalent NX word or phrase. The way this would be achieved is this:
- Find each <gloss> tag which is intended to be a wordlist entry. (This means that we have to disambiguate <gloss> tags which are intended to be for the wordlist from those which aren't; that can only be done on the basis of their context, or failing that, because they have a particular attribute added to them which distinguishes them.)
- For each such gloss tag, find the nearest appropriate NX word or phrase in the tree which is equivalent to it. (I had understood this to mean going up the tree to the <entry> level, then taking the first <seg> in the first <pron> in the first <form> element in the entry.)
This obviously requires that any <gloss> tag we're going to use for this purpose contain an English word or phrase that IS equivalent to the <seg> element as described above. If it's not going to be equivalent, then the question arises "what is it a gloss for?"
Do you envisage the process in the same way I do? If not, how had you imagined it?
I've downloaded your completed affix.xml file, and pushed it up to the database. We now have lots more items in the database, including some (at the beginning of the list) whose entry headword is their CV pattern. I'm assuming that's what's intended, since these are items whose form is so varied that they can't really be represented by anything else. Is that right?
In the following example (1) from a sense tag, the meaning of the entry is equivalent to that part of the meaning that would be targeted by a search when making the English-Nx wordlist.
(1) <sense><def><seg>cougar</seg></def></sense>
But in (2) the meaning of the entry and the part that would be targeted by the English-Nx wordlist are not identical. Hence, we have added in the gloss mark-up.
(2) <sense><def><seg><gloss>worn down</gloss> to the end</seg></def></sense>
In case (1) should I add in a gloss mark-up, even though it is redundant? Or is gloss only necessary in those cases where the English-Nx target is not identical to the meaning of an entry?
When I went back to work on the phar file, a number of questions came up. I raise them below:
In late April (26/04/07) we had an exchange about gloss tags. I asked whether we want to be able to access glosses in dictegs to construct the English-Nx wordlist. Martin replied that he had not envisioned that we would do so, and that therefore only the system of segs and glosses that occurs within the sense part of the entry is the one that will be searched for the English-Nx wordlist. Martin also asked if there is any good reason to take material from the illustrations for the English-Nx glossary. I didn’t have an answer to this last question at the time, but I think I have one now.
The following is a set of four connected entries based on the root, ʕʷə́cˀ, which can serve to exemplify my answer. Kinkade did not provide a gloss for this root, so in the entry for the root itself, there is no definition available. Similarly for the entry that consists of this root plus the STAT suffix -t, there is no definition available, but there is an illustration. The other two entries connected to this root do have definitions.
<entry xml:id="ʕWəcˀ">
<form>
<pron><seg>ʕʷə́cˀ</seg></pron>
<hyph><m>ʕʷə́cˀ</m></hyph>
</form>
<sense>
<def><!-- No definition available. --></def>
</sense>
<xr>See <ref target="yəcˀp">yə́cˀp</ref><note>worn down to the end</note></xr>
</entry>
<entry xml:id="ʕWəcˀt">
<form>
<pron><seg>ʕʷə́cˀt</seg></pron>
<hyph>√<m sameAs="ʕWəcˀ">ʕʷə́cˀ</m>‐<m sameAs="STAT">t</m></hyph>
</form>
<sense>
<def><seg><!-- No definition available. --></seg></def>
<dicteg>
<cit><quote>ʕʷə́c't ʔeɬxʷənčút<gloss>out of breath</gloss></quote><bibl>Y40.177</bibl></cit>
</dicteg>
</sense>
</entry>
<entry xml:id="ʕWə́cˀp">
<form>
<pron><seg>ʕʷə́cˀp</seg></pron>
<hyph>√<m sameAs="ʕWə́cˀ">ʕʷə́cˀ</m>‐<m sameAs="ʔ">p</m></hyph>
</form>
<sense>
<def><seg>worn down to the end</seg></def>
</sense>
<bibl>Y34.34</bibl>
</entry>
<entry xml:id="ʕWəcˀpaskˀáyˀt">
<form>
<pron><seg>ʕʷəcˀpaskˀáyˀt</seg></pron>
<hyph>√<m sameAs="ʕWə́cˀ">ʕʷə́cˀ</m>‐<m sameAs="inch">p</m>=<m sameAs="askˀáyˀt">askˀáyˀt</m></hyph>
</form>
<sense>
<def><seg>ran out of breath</seg></def>
</sense>
<bibl>JM3.73.7</bibl>
</entry>
If the English-Nx wordlist only looks at material within the sense tags to determine membership in the English-Nx wordlist, then the stative form, which has a dicteg but no filled-in gloss within its sense tags, will be missed.
Question 1: Is this a problem?
Questions 2-4: For the English-Nx wordlist, how do we handle roots that have no gloss? Should I provide a gloss, based on interpreting the available forms? And if I do that, should the ECH-interpreted gloss be marked as such, as opposed to it being an attested gloss?
Question 5: If we look at the glosses for the last two of the four entries above, we see that one is ‘worn down to the end’ and the other is ‘ran out of breath’. What is the best way to mark-up these glosses?
I envision two possibilities (at least):
(a) I could use the seg/gloss mark-up to do the following:
<def><seg><gloss>worn down</gloss> to the end</seg></def>
<def><seg>ran <gloss>out of</gloss> breath</seg></def>
This mark-up foregrounds what seem to be those parts of the senses that the two definitions have in common.
(b) I could provide a general and identical interpreted definition for the two forms; here I highlight the general nature of the definition by putting it in caps.
<def><seg><gloss>WEAR OUT, RUN OUT</gloss> worn down to the end</seg></def>
<def><seg> <gloss> WEAR OUT, RUN OUT </gloss>ran out of breath</seg></def>
I would be grateful for your thoughts on these questions Martin.
Xml mark-up documentation:
1. The xml mark-up documentation does not include information about how we are using segs and glosses within defs to distinguish those parts of the definition that the English-Nx wordlist needs to search, as opposed to those parts of the definition which it does not need to search.
2. I don’t think the latest version of the xml mark-up documentation is sufficiently clear on what we are doing with cross-references. For example, it is not clear that in the cross-references, because we are referring to the xml:ids of the xr forms, we do not actually need to supply the English meanings of those forms.
Your idea to use Transformer sounds great.
I won't be working on Thursday but will work again on Friday afternoon. (It's Ascension, so a holiday here. Plus Ales has a bull-fighting festival going on until Sunday which is a very big occasion here).
Sorry I've been a bit slow getting back to you on this -- I've been swamped by other projects.
Now I come to think about it, the best option for converting those files that have already been completed will be to use Transformer, which is designed for exactly this sort of job. I can create a replace sequence that I think will get us to where we want to be, then I can run it on the files that have already been completed. Then you can check the results -- we could actually put them into the database for that purpose, so you can read them more easily on the Web. Once we're happy the replacements are doing the job, I can run them on all the remaining files that you haven't yet worked on. In the meantime, you could be using the right markup for the one you're working on next.
Does that make sense?
I have completed the feature structures and merging of entries in the affix.xml file. The file still requires proofreading against MDK's filecards, phonemicizing, and changing all ejectives to stop+raised comma, and ensuring that all glottalized resonants are resonant+superscript glottal.
What is the most efficient way for me to undertake the ejective/glottalized resonant changes?
Next task: the Pharyngeal file, to have one more complete file with lexical rather than suffixal content. And then I will turn to the lexical suffix file because the LSs require feature structures as well.
Since I'm going to be writing Java classes and ultimately applications in the future, and I need to write some Java now for the Moses project, I need to choose an appropriate IDE. I've been working with Eclipse so far, and it looks good, but another alternative is NetBeans 5.5, which is also free, and that's getting good reviews. It also has a GUI-building tool, which may be very handy.
I downloaded and installed it, and read some introductory materials, then I began trying to duplicate my MosesIndexer project from Eclipse in NetBeans. After some faffing around I got it working. The hardest thing to figure out was Unicode support; not only did it default (like Eclipse) to Windows 1252 for the source editor encoding (ridiculous choice), but even when I figured out how to change the default encoding for source files, and change the default encoding for the project and for each individual source file, I still couldn't get a test class with some Unicode text in it to compile. The problem turned out to be the compiler. I had to go into the project properties, click on Build / Compiling, and add "-encoding utf8" to the Additional Compiler Options.
After that, everything worked. The basics seem no different from Eclipse; Eclipse seems to have slightly better content completion when it comes to adding imports automatically, while NB seems to generate more detailed code templates when you create a Java Application project or add a class. If the GUI builder component is useful at all, I think NetBeans will be the way to go.
Response to Martin and Greg regarding glottalization:
Well I am happy with everything here: in the data, ejectives will be transcribed with a raised comma; glottalized resonants will be transcribed with a superscript glottal. In the output, all the segments will be transcribed with raised comma.
The ejective and sonorant/resonant categories are not quite right however. They should be:
Ejectives: p’, t’, c’, ƛ’, k’, kʷ’, q’, qʷ’
Sonorants: mˀ, nˀ, lˀ, ḷˀ, rˀ, wˀ, yˀ and ʕˀ, ʕʷˀ (the voiced pharyngeal fricative, what you are calling epiglottal), which appears as both plain glottalized and rounded glottalized.
Belted l is never glottalized.
The question of which raised comma character to use: I agree that modifier letter apostrophe is the best option. (It’s too bad about the handwritten alphabet having the w and y with raised commas above--when we transcribe by hand we are not always as precise as we should be; raised comma above is often not distinguished from raised comma just to the right when one writes by hand, but clearly this is an important difference when using computer fonts)
Normalizing the data: I definitely think we should do this with simple search and replace operations. These are easy to do and can be done as I go through each file, although clearly I will need to watch out for apostrophes in the English text, as these do appear.
I’m glad we are in agreement about all this. I look forward to hearing about the techniques for searching with XPath in oXygen: so far I have been using the Find and Replace function when I have needed to change things (for instance I did this in dealing with the cross-reference changes that I had to make), and it has been working well.
Greg and I have just spent a little while with the IPA and Unicode specs, and I think we have a good strategy for this.
First, you're absolutely right that the IPA specifies a raised comma to indicate an ejective. Therefore these consonants should be written with a comma: c, k, p, q, t, epiglottal.
Sonorants/resonants should be written with the glottal diacritic in the data, to conform with IPA. We're not sure of the exact list of consonants that fall into this category, though -- we guess: belted l, barred lambda, m, n, r, w and y.
If we've got any consonants in the wrong sets above, let us know. Now, even though we're using the raised glottal diacritic in the data, that doesn't mean we have to show it in the output; it's trivial to replace the raised glottal with a raised comma in the output. This way, we end up with a traditional "orthography" without sacrificing IPA conformance in the data.
The next question is which raised comma character to use. Unicode has these two candidates:
- \u02bc: MODIFIER LETTER APOSTROPHE, "glottal stop, glottalization, ejective"
- \u0313: COMBINING COMMA ABOVE, "Americanist ejective or glottalization"
The first is clearly the best option, and it's the one specified by the IPA. The description of the second shows that it is frequently used in Americanist transcriptions. We want a modifier letter, not a diacritic above, so we should choose the first. Incidentally, on the question of why we used that character in the data, Greg says he based the choice on your handwritten alphabet, which clearly shows the w and y with commas above them, not following them.
I'll summarize by showing the character combos I believe we should have in the data, and those we will generate in the output routines (hoping that they'll show up correctly in the browser font):
In the data:
cʼ, kʼ, pʼ, qʼ, tʼ, ʕʼ
ɬˀ, ƛˀ, mˀ, nˀ, rˀ, wˀ, yˀ
In the output for readers:
cʼ, kʼ, pʼ, qʼ, tʼ, ʕʼ
ɬʼ, ƛʼ, mʼ, nʼ, rʼ, wʼ, yʼ
If you agree, the next question is how best to go back and normalize all the data. Some of this can be done programmatically or with simple search-and-replace -- for example, we can just replace all instances of comma-above with modifier-glottal. Similarly, we can replace sequences of cons+modifier-glottal with cons+modifier-raised-comma. Slightly more problematic will be any instances of REAL apostrophes in the data, since these will always be wrong in Moses text, but may be fine in English text. I can show you some techniques for searching with XPath in oXygen to find all those instances.
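To make the mechanical part of that concrete, here's a minimal Java sketch of the two replacements just described. The class name is illustrative, the ejective list is abbreviated, and it assumes it is only ever fed Moses text (real apostrophes in English text must be left alone):

import java.text.Normalizer;

public class GlottalFixSketch {
    // Ejective base consonants (abbreviated); rounded ejectives are handled by
    // allowing an optional raised w (U+02B7) between the consonant and the glottal.
    private static final String EJECTIVES = "ckpqt\u0295";

    public static String fix(String moses) {
        // 1. Combining comma above (U+0313) -> modifier letter glottal stop (U+02C0).
        String s = moses.replace('\u0313', '\u02C0');
        // 2. Ejective (+ optional raised w) + modifier glottal -> same + modifier letter apostrophe (U+02BC).
        s = s.replaceAll("([" + EJECTIVES + "]\u02B7?)\u02C0", "$1\u02BC");
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }
}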
Let me know what you think.
Worked on the last few entries of the affix.xml file. In the DIST entry there are over 100 crossreferences that need to be corrected manually. I have done about half of them.
In the data files that I've looked at there seem to be a few cases where glottalized w and glottalized y have a raised comma written directly above them. This is not a standard representation in either IPA or in the Americanist forms used traditionally for writing Moses. So I would definitely change any w or y that has a raised comma above it into either w/y followed by raised comma or w/y followed by superscript glottal (depending on which we finally decide to do). I don't know why there are these w/y with raised comma above them since in the lexware database that is not how those sounds were represented.
Here are the considerations from my perspective. I am laying them out for you below because I am not sure from reading your various comments whether I have succeeded in making my thinking on this clear to you. Once you’ve read what I write below, if you still think that we should use superscript glottals throughout, then that is fine with me.
1. In the IPA, a distinction is made in the way that glottalization is marked on ejective stops and the way it is marked on glottalized sonorant/resonant consonants.
a. In the case of ejective stops, the raised comma transcribed after the stop is preferred [p’, t’, k’, etc.] . Although it is possible to transcribe ejective stops with a superscript glottal this is not the preferred way to do so.
b. In the case of glottalized sonorant/resonant consonants, a superscript glottal diacritic is used and can be positioned either before or after the sonorant. So glottalized m or y, for instance, would be transcribed [y?, m?] (the question marks are supposed to be raised glottals).
2. In the Americanist tradition used in the way Moses has always been written up until now, all ejective stops and all glottalized sonorants are marked in the same way: namely, they always have a raised comma after them: [p’, t’, k’, etc., m’, l’ , y’ , etc.].
3. In the xml:ids, we can’t use raised commas so we have been using superscript glottals throughout (I think, Martin, that you wrote a little conversion for this at some point).
4. Neither the IPA nor the Americanist tradition according to which Moses has been transcribed so far use a superscript glottal for ejective stops and affricates. Therefore, I argued that we should use the raised comma representation for these sounds, and not to use the superscript glottal for ejective stops and affricates.
5. Because the Moses/Americanist tradition has up until now used the raised comma for glottalized sonorants/resonants, I also argued that we should use the raised comma for the glottalized sonorants of Moses.
6. If, however, you guys think that the raised comma representation is problematic for the search functions, then I will accept your recommendation to use the superscript glottal for the ejective stops, ejective affricates, and for the glottalized sonorants.
7. My concern for consistency is the same as yours. I believe that at the moment we have several different kinds of representations. For example, I think that there are glottalized sonorants which have the raised comma written directly above them, and others which have the raised comma written just after them. And there are clearly ejectives written both with raised comma and with superscript glottal.
8. Whatever you think is the best representation, I will be happy to check for consistency as I go through the files.
Created a package called MosesIndexer, and wrote and tested a class called MosesConverter, which implements the NFKD and ASCII conversion algorithms described in the preceding post. Wrote a JUnit test for it, and debugged it.
Next is to implement the Indexer class, which will read a series of files from disk, and process each one to build an index XML file, then save it.
This post represents the results of research Greg and I have been doing on the issue of searching complex Unicode data.
The search interface for this project presents a range of interesting challenges. Searching the English components of entries is no problem at all, but when it comes to searching the Moses text fields, we have to deal with the issue of diacritics. For instance, take the word "ṣạpḷị́ḷ". If your browser is using an appropriate font, you'll see this sequence of characters:
- s with dot below
- a with dot below
- p
- l with dot below
- i with acute accent and dot below
- l with dot below
We can envisage two extremes of user. On the one hand, a researcher familiar with the language might be searching for exactly this string of characters, with all the accents in place. In this case, we have only one type of problem. Consider the "i with acute accent and dot below". This could legitimately be created through any of the following sequences:
- i + combining acute accent + combining dot below
- i + combining dot below + combining acute accent
- (composite) i with acute accent + combining dot below
- (composite) i with dot below + combining acute accent
These all look the same, and are equivalent; there's no way to know which one the user typed in, and there's also no way to be sure which form might be in the database. This makes matching a search string against fields in the database rather difficult.
Unicode provides a solution for this, which will work pretty well for Moses. In Unicode, each combining character is assigned a combining class -- a number between 0 and 255. The classes are based on the position of the diacritic with respect to the character it's joined to. (See here for more info.) Now, Unicode provides a set of normalization forms for any string. Among them is NFKD, or "Compatibility Decomposition". What this process does is:
- Decompose all composite characters into character + diacritic.
- Apply canonical ordering to diacritics, based on their combining class.
The result of applying NFKD normalization to any of the sequences above would result in the same output:
- i + combining dot below + combining acute accent
because they would all be decomposed into three components, and then the i would come first (as the base character), then the dot below (combining class 220), and finally the acute accent (because its combining class is 230, which is higher than that of the dot below).
Therefore we have a solution to the problem of our advanced searcher: if we perform NFKD normalization on BOTH the strings in the db against which we're searching, AND the search string entered by the user, then we'll be able to do a useful search.
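As a quick illustration, here's a tiny Java test of that claim, assuming a JDK that provides java.text.Normalizer (the class name and test strings are purely for demonstration):

import java.text.Normalizer;

public class NfkdDemo {
    public static void main(String[] args) {
        // Two different spellings of "i with acute accent and dot below".
        String composedFirst = "\u00ED\u0323";  // composite i-acute + combining dot below
        String decomposed    = "i\u0323\u0301"; // i + combining dot below + combining acute
        String a = Normalizer.normalize(composedFirst, Normalizer.Form.NFKD);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFKD);
        // Both normalize to i + dot below (class 220) + acute (class 230).
        System.out.println(a.equals(b)); // true
    }
}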
The second type of user, a casual surfer, or someone who is not linguistically sophisticated or familiar with the language, presents a different type of problem. They most likely have no idea what diacritics should go where, and even if we provide onscreen buttons or keystroke methods for entering diacritics or modifier letters, they won't be able or willing to use them. Their search for "ṣạpḷị́ḷ" is likely to be entered as "saplil". Nevertheless, they'll still want to get results.
Another application of Unicode normalization form NFKD, followed by some extra processing, can solve this problem. First of all, it will split off the combining diacritics. We can then remove them from the string, turning "ṣạpḷị́ḷ" into "saplil". If we do this for the search string entered by the user, and for the db strings against which we're searching, then we can obtain meaningful matches whatever the user enters.
In addition to splitting out the combining diacritics, compatibility decomposition will also convert some characters into their "compatibility" equivalents. For example, modifier letter w (raised w) will be converted to a regular w. This solves yet another problem: people will tend to type a w rather than finding the keystroke or button for raised w. Some characters, however, do not have compatibility equivalents where we would want them. Modifier raised glottal, for instance, doesn't have a compatibility equivalent, even though there is a regular glottal character. When we process the string to strip out the diacritics, though, we could do that conversion too.
These are the conversions we would need to make in order to create an "ascii representation" of a Moses string:
- Split out combining diacritics (NFKD does this).
- Convert characters to their compatibility equivalents (NFKD does this for some characters).
- Discard combining diacritics.
- Convert raised w to w.
- Convert raised glottal to glottal.
- Convert belted l to l.
- Convert barred lambda to l.
Now we have a decision to make. There are two non-ascii (potentially "confusing-to-the-layman") characters still in the mix: glottal and epiglottal. We could either leave them there, or we could replace them with something more bland. If we replace them, we need appropriate replacements. One option would be to replace both by the apostrophe; another would be to use a convention such as X-SAMPA, which replaces the glottal by a question mark, and the epiglottal by "?\". A decision on this should be guided by our sense of what semi-sophisticated users (such as band members familiar with the basic orthography) might be expected to use.
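Here's a rough Java sketch of such an ascii-izer, again using java.text.Normalizer. The class and method names are illustrative, and the plain apostrophe chosen for glottal and epiglottal is just one of the options mentioned above:

import java.text.Normalizer;

public class MosesAsciiizer {
    public static String asciiize(String moses) {
        // Steps 1-2: split off combining diacritics and apply compatibility
        // mappings (NFKD also turns raised w into plain w).
        String s = Normalizer.normalize(moses, Normalizer.Form.NFKD);
        // Step 3: discard the combining diacritics.
        s = s.replaceAll("\\p{M}+", "");
        // Steps 4-7, plus the open glottal/epiglottal question, which this sketch
        // answers with a plain apostrophe.
        return s.replace('\u02B7', 'w')   // modifier letter small w -> w (safety net)
                .replace('\u02C0', '\'')  // modifier letter glottal stop -> '
                .replace('\u0294', '\'')  // glottal stop -> '
                .replace('\u0295', '\'')  // pharyngeal ("epiglottal") -> '
                .replace('\u026C', 'l')   // belted l -> l
                .replace('\u019B', 'l');  // barred lambda -> l
    }

    public static void main(String[] args) {
        System.out.println(asciiize("ṣạpḷị́ḷ")); // saplil
    }
}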
So we have a situation where we need to map two representations of a search string against two representations of the data. The data itself does not contain any normalized representations at all, and we would prefer to avoid cluttering it up with them; furthermore, because they can be generated programmatically, it makes no sense to have them in the data anyway. However, generating two representations of every Moses string in the db every time a search is done makes no sense either; it would make searching ridiculously slow.
The solution to this is to create an index to the entries, consisting of a series of parallel entries. Each one would have the same xml:id attribute as the entry it's based on, as well as a copy of the representative string for that entry (which is currently the first <seg> in the first <pron> in the first <form> element). It would then have two blocks of tags, one for the NFKD forms of all Moses strings in that entry, and one for the ascii-ized forms. This index can be loaded into a special subcollection on the database. Searching can be done against the index, to retrieve the xml:id and the representative string directly from the index, instead of from the db itself. A list of such hits can be returned to the user, who can then click on any entry to retrieve the real data for that entry from the database.
This index would have to be generated manually from the existing data. The best way to do this would probably be to write a Java application which can be passed a list of markup files. It loads each XML file into a DOM, then generates an index entry for each entry in the source file; it then uses a node iterator to run through each of the tags containing Moses text, and generates one of each type of compatibility form for them. This index file can go into the database, and eXist range indexes can be defined for it, making searching even more effective.
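A skeleton of the kind of indexer application described above might look like this in Java (DOM-based, illustrative names, no output escaping; it reuses the asciiize() sketch from earlier in this post, and a real index entry would also carry the representative string for display):

import java.io.File;
import java.text.Normalizer;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class IndexSketch {
    private static final String TEI = "http://www.tei-c.org/ns/1.0";

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();

        StringBuilder index = new StringBuilder("<index>\n");
        for (String path : args) {
            Document doc = builder.parse(new File(path));
            NodeList entries = doc.getElementsByTagNameNS(TEI, "entry");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                String id = entry.getAttributeNS(XMLConstants.XML_NS_URI, "id");
                index.append("  <item id=\"").append(id).append("\">\n");
                // Every <seg> in the entry gets an NFKD form and an ascii-ized form.
                NodeList segs = entry.getElementsByTagNameNS(TEI, "seg");
                for (int j = 0; j < segs.getLength(); j++) {
                    String moses = segs.item(j).getTextContent();
                    index.append("    <nfkd>")
                         .append(Normalizer.normalize(moses, Normalizer.Form.NFKD))
                         .append("</nfkd>\n");
                    index.append("    <ascii>")
                         .append(MosesAsciiizer.asciiize(moses))
                         .append("</ascii>\n");
                }
                index.append("  </item>\n");
            }
        }
        index.append("</index>\n");
        System.out.println(index);
    }
}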
That solves the problem of creating searchable indexes; now we have to deal with the issue of the search string, which cannot be processed ahead of time, because we don't know what the user will type in. We need to render this string into two forms as well, to search against the two types of item in the index. There are two possible ways to handle this:
- We could try to produce the two forms using XSLT, by predicting all possible variants of characters and diacritics, and searching/replacing on them. This sounds like it would be inefficient and complicated, but since search strings will typically be very short, it might be a perfectly good approach.
- We could write a new Transformer for Cocoon using Java, which does the same job, and which can be called from a Cocoon pipeline. This would most likely be a little faster, and more reliable since we could depend on Java to do the NFKD normalization for us. However, it would involve learning how to write and deploy a customized transformer, which would take some time. On the other hand, knowing how to do this would be very handy in the future.
Whatever method we use to massage the search string, we'll have to integrate it into a pipeline, which will pass the results as two parameters to an XQuery page. The XQuery code will then perform two searches, with the normalized form against the equivalent forms in the index, and the ascii-ized form against its equivalent. This will result in two sets of hits; the first set should be prioritized because they're more likely to be more precise matches. Duplicates would need to be discarded, and then the resulting set of hits (xml:id and representational form for the entry) would be returned to the pipeline to be styled as hits and be inserted into the search page. This would be done through AJAX.
This will obviously need more thought to work out the fine details, but I think we have the basis of an excellent strategy here, enabling us to provide fast searching which combines precision with fuzziness, and which should suit any type of user.
Got the sort system working by rewriting the XSLT file till it worked with Saxon. Saxon 8 is very finicky about stuff like namespaces. The other XSLT will also have to be ported over to work with Saxon 8, so we have a fully XSLT-2.0-based system.
Then I added the extra code to disregard accents in the sort comparator. This simply involves stripping out accents from the strings before doing the comparison. It's an extra stage, so it may add extra processing time, but on the other hand the comparison character string is now shorter (no accents) and the input strings are often going to be shorter once their accents are stripped out, so the net results may not be noticeable. We'll have to see how fast this code runs when we've got more data in the db.
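The accent-stripping step is small; here's roughly what it amounts to (illustrative class and method names, not the project's actual code):

import java.text.Normalizer;

public final class AccentStripper {
    // Decompose the string, then drop combining acute (U+0301) and grave (U+0300)
    // before comparison; other diacritics still count for the sort.
    public static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("[\\u0300\\u0301]", "");
    }

    public static void main(String[] args) {
        // á and a should now compare as equal in the sort.
        System.out.println(stripAccents("\u00E1").equals(stripAccents("a"))); // true
    }
}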
In the process of doing this, we found a bug in the implementation (groups of identical entries were not being sorted together). Fixed that bug.
Some answers to your questions:
1. Question: There are several cases of two or more different affixes having almost identical meanings. This means that they have identical feature structures. Is this going to be a problem? For example, xix and xax are both baseType, suffix, morphoSyntactic, indefinite-object.
I don't see why this would be a problem. We identify them by their xml:id, and display/sort them by their first <pron>, so I don't see any conflict.
2. Question about prons and hyphs of reduplicative morphemes: How should prons and hyphs of reduplications be represented? Reduplicative morphemes have changeable form, depending on what the shape of the base of the reduplication is. For example, if the root is of the shape xit, the reduplicative suffix “characteristic” will be xit (xit-xit), but if the root is of the shape quc, the reduplicative suffix “characteristic” will be quc (quc-quc). The basic shape of the reduplication is thus CVC (consonant-vowel-consonant), but what the exact segmental content of the suffix is depends on the segments found in the root. The simplest thing for a pron would be to specify the CV-shape of each reduplication. For example, the pron for the reduplicative suffix whose meaning is “characteristic” would be CVC, for the distributive it would be CEC (where E=schwa), for repetitive it would be Ca, and for out of control it would be VC, and for diminutive it would be C-. For the hyph forms, it would be the same type of thing. For example, for characteristic the hyph would thus include sameAs=”CHAR”>CVC Is it possible/desirable to do this in an xml markup?
As long as the xml:id attributes are distinct, I don't think it matters. If each has a unique CV-shape, then that would be a good way to characterize them, given that they have no default or normalized representation at all.
3. I have completed to the end of hard copy affix10 of the affix files, except for fixing cross-references in the last entry, which is the DIM form. There is one more of these files left in the affix set.
I'm not sure what this means. On the server, I can only see one affix.xml file. Is "affix10" above a typo, or is there such a file somewhere?
1. combining glottal and combining comma. By combining comma do you mean the combining apostrophe? If so, then combining glottal and combining comma/apostrophe are two different symbols for the same sound: they both represent glottalization. We decided at some point in December I think that we will actually replace all the combining glottals with combining commas/apostrophes in order to be completely consistent throughout. The one constraint on this is that we have to continue to use combining glottals in the xml:ids.
As we've said before, Greg and I both think replacing the glottals with commas is a bad idea, because it amounts to misrepresenting the data. It's also rather pointless, because for any particular context in which we're displaying this data, we can do a translation from glottal to comma on the fly; there's no need to store misleading data just so we can see it on the page. The combining comma I'm talking about, though, is one which appears above the w and y characters in the handwritten alphabetical order I've been working from. That character is "U+0313 : COMBINING COMMA ABOVE", whereas the combining glottal is "U+02C0 : MODIFIER LETTER GLOTTAL STOP". The former shows up above the modified letter, the latter shows up to the right of it. If I understand you correctly, these are intended to represent the same sound -- a glottal -- but you want the modifier to appear above the letter when the letter is w or y, and on the right of it when it's any other letter. This is problematic because if you convert them all to combining comma above, they'll appear above the letter everywhere; so using commas won't even solve the display problem it's intended to solve.
My recommendation is to keep your data correct and pure, and use the right character throughout (the glottal). Then for display purposes we write display code that does a translation in some circumstances (e.g. it substitutes a comma above for w and y, and adds an apostrophe after for other letters, if that's what you want). If the data itself is corrupted by display preferences, then it's going to be less useful for research and display in the future, in other contexts. That's my opinion, anyway.
2. Acute and grave accents are irrelevant for alphabetical order. In other words there is no difference in alphabetical order between [a] with no accent, [a] with acute accent, and [a] with grave accent; and this is similar for all the other vowels. Does this mean that the java sorter can ignore the accents?
I'll have to go away and think about that. We've actually got the java class sorting successfully, but right now, it needs a position for every character in the alphabet (which includes accents); I'll have to add some code to strip out the accents before comparing words. I think it should be fairly straightforward.
3. What I can’t determine from your presentation of the material is whether there is significance to the order that you have given for the diacritics. Why have you placed [dot below] before [combining glottal], etc. ? Can you explain this to me?
This is based on your own handwritten list, in which c-with-dot-below appears before c-with-glottal (and the same for all other combinations of these diacritics with other characters). If c-with-dot-below comes before c-with-glottal, then dot-below comes before glottal.
1. Question: There are several cases of two or more different affixes having almost identical meanings. This means that they have identical feature structures. Is this going to be a problem? For example, xix and xax are both baseType, suffix, morphoSyntactic, indefinite-object.
2. Question about prons and hyphs of reduplicative morphemes: How should prons and hyphs of reduplications be represented?
Reduplicative morphemes have changeable form, depending on what the shape of the base of the reduplication is. For example, if the root is of the shape xit, the reduplicative suffix “characteristic” will be xit (xit-xit), but if the root is of the shape quc, the reduplicative suffix “characteristic” will be quc (quc-quc). The basic shape of the reduplication is thus CVC (consonant-vowel-consonant), but what the exact segmental content of the suffix is depends on the segments found in the root. The simplest thing for a pron would be to specify the CV-shape of each reduplication. For example, the pron for the reduplicative suffix whose meaning is“characteristic” would be CVC, for the distributive it would be CEC (where E=schwa), for repetitive it would be Ca, and for out of control it would be VC, and for diminutive it would be C-.
For the hyph forms, it would be the same type of thing. For example, for characteristic the hyph would thus include
sameAs=”CHAR”>CVC
Is it possible/desirable to do this in an xml markup?
3. I have completed to the end of hard copy affix10 of the affix files, except for fixing cross-references in the last entry, which is the DIM form. There is one more of these files left in the affix set.
May 9, 2007
1. combining glottal and combining comma. By combining comma do you mean the combining apostrophe? If so, then combining glottal and combining comma/apostrophe are two different symbols for the same sound: they both represent glottalization. We decided at some point in December I think that we will actually replace all the combining glottals with combining commas/apostrophes in order to be completely consistent throughout. The one constraint on this is that we have to continue to use combining glottals in the xml:ids.
2. Acute and grave accents are irrelevant for alphabetical order. In other words there is no difference in alphabetical order between [a] with no accent, [a] with acute accent, and [a] with grave accent; and this is similar for all the other vowels. Does this mean that the java sorter can ignore the accents?
3. What I can’t determine from your presentation of the material is whether there is significance to the order that you have given for the diacritics. Why have you placed [dot below] before [combining glottal], etc. ? Can you explain this to me?
Today I learned some Java, which is pretty much new to me. I had to implement and test a Java class which implements the java.util.Comparator interface, and which can then be used as a sort of plug-in to Saxon, invoked from XSLT, to do custom sorting. I downloaded and installed Eclipse, and set myself up with a new package, including a source file for the class and one for a JUnit test class for trying it out.
With lots of help from Stew, I eventually got the class working, based on a list of all the characters in a simple string. The first useful discovery was the Java Normalizer class; this can be used to solve the problem of sorting strings which may contain pre-composed characters or strings of char+combining characters, which are equivalent. The Normalizer can be used to do a canonical decomposition of the strings before comparing them. Very handy -- and it might also be handy for normalizing actual data permanently at some point.
Testing of the results of sorting revealed that my initial assumption -- putting the diacritics etc. at the beginning -- was wrong; to get the desired behaviour, they actually need to be at the end. That was easily fixed.
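For the record, here's a stripped-down sketch of what the comparator looks like. The character table is abbreviated and purely illustrative -- the real class encodes the full Moses sort order (listed in the earlier post below), with the diacritics at the end as just described:

import java.text.Normalizer;
import java.util.Comparator;

public class MosesSortComparator implements Comparator<String> {
    // Base letters first; dot below, glottal, raised w, comma above, acute, grave last.
    private static final String ORDER =
        "ʔacəhiklɬƛmnpqrstuwxyʕ" + "\u0323\u02C0\u02B7\u0313\u0301\u0300";

    private static int rank(char c) {
        int i = ORDER.indexOf(c);
        return i >= 0 ? i : ORDER.length(); // unknown characters sort last
    }

    public int compare(String a, String b) {
        // Canonical decomposition so precomposed and decomposed spellings compare equal.
        String x = Normalizer.normalize(a, Normalizer.Form.NFD);
        String y = Normalizer.normalize(b, Normalizer.Form.NFD);
        int n = Math.min(x.length(), y.length());
        for (int i = 0; i < n; i++) {
            int d = rank(x.charAt(i)) - rank(y.charAt(i));
            if (d != 0) return d;
        }
        return x.length() - y.length();
    }
}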
Once the class was working, we started trying to test it. The main requirement is that it be invoked using a URI, in a manner which is implementation-dependent. Our intention is to use it with Saxon 8, and the instructions for this are here. The code looks like this:
<xsl:sort select="tei:form[1]/tei:pron[1]/tei:seg[1]" collation="http://saxon.sf.net/collation?class=MosesSortComparator" />
Next, you have to put the class somewhere on the Java classpath, so it can be found by Saxon. We presume this means that it should go in with the other Java libraries in Cocoon, so I generated a JAR file (File / Export in Eclipse), and added it to the other JAR files on the server, in /usr/local/apache-tomcat-6.0.2/webapps/cocoon/WEB-INF/lib.
Initial testing failed, and I was puzzled, so I went back to the sitemap and discovered that although the file was XSLT 2.0, it was being run through the default XSLT processor, which is Xalan. When I changed the sitemap to call the Saxon processor, I got no results at all (an empty page). This was the case both with and without the new comparator being used, so the problem isn't the comparator; the stylesheet is not written correctly for Saxon, so we'll need to rewrite it before we can see if the sort actually works. That's for tomorrow.
I have a list of characters showing Moses sort order:
ʔ a ạ c c̣ cˀ ə ə̣ h ḥ ḥʷ i ị k kˀ kʷ kˀʷ l ḷ lˀ ḷˀ ɬ ƛˀ m mˀ n nˀ p pˀ q qˀ qʷ qʷˀ r rˀ s ṣ t tˀ u ụ w w̓ x xʷ x̣ x̣ʷ y ỵ̓ ʕ ʕˀ ʕʷ ʕˀʷ
However, when I look at the entries themselves, I see lots of instances of acute and grave accents. I need to know how those accents fit into the sort order.
Some background to this: in order to sort the entries according to the Moses sort order, I'm having to write a Java class that can be called on the server, which does the sorting. This class has to encapsulate the sort order in sequence. The actual sequence suggested by the list above is this:
[dot below], [combining glottal], [combining w], [combining comma above], ʔ a c ə h i k l ɬ ƛ m n p q r s t u w x y ʕ
In other words, the combining diacritics come first, followed by all the letters. Acute and grave presumably have to fit into the sequence of combining diacritics at the beginning. Can you tell me where they should show up?
Delighted to hear that the affix file is not too large to be worked on.
Worked on the affix.xml file today. I am now at about line 24,000 out of 29,000 lines.
Completed several large entries, including applicatives and reciprocals. Checked Willett (2003) for some definitions and rewrote some definitions and added some notes explaining interpretations. Some merging of forms from the applicative entry into other entries (causative and L-applicative) needed to be done.
Next task: to merge -u and -tu, and to work on -ul.
Hi Ewa,
It turned out the file problem was simply my fault; I was uploading a file which wasn't valid because of some old code being left in it. I was thrown by the bad error message from eXist -- it seems that whenever it has any kind of problem indexing a file, it reports a network error. Once I'd fixed the file, it went into the DB OK, so the latest complete affix entries should now be available on the Entries page.
May 7, 2007
Martin, as you point out the affix file is big. The reason for the size is the following: at the moment, in going through the affix file, I am just focusing on two things: 1) filling in the feature structures, and 2) fixing up big editing problems like the cross-reference problem and the merging of entries.
When you wrote the conversions for the files, you left in unconverted forms with a comment that says: “The following list of elements is copied from the source doc for information and checking purposes.” Because I am not doing most of the detailed editing on the entries at this stage (since that would take a very long time, and we thought that just focusing on feature structures would be sufficient for now), I am not deleting the unconverted forms as I go through the affix file. Taking out the unconverted forms would make the affix file much smaller. So for your purposes maybe we should make a copy of the affix file which has all the unconverted forms deleted?
I will be finished with the current pass through the affix file by the end of this week, I hope. At that point we could make a shorter version of the affix file for you to work with.
After I finish the affix file, I thought I would return to working on the pharyngeal file since it is fairly small, and has already been worked on a bit. If I could finish that fairly soon, you would have a second non-affix file to work with. The editing work is more time-consuming than we had expected it would be because it is turning out that I need to do quite a lot of manual input of phonemic representations and file-merging, cross-references, etc. We had not anticipated this before I began working on the affix file.
PS: Thanks for the Auto P info.
3. Editing Decisions concerning Conflation of Entries in Affix.xml file: As I have been going through the Affix.xml file, I have been making editing decisions about entries. Specifically, throughout this file there are entries which Kinkade listed as separate suffixes on his filecards, but which in fact are either 1) allomorphs of one affix, rather than affixes in their own right, or 2) combinations of two affixes which Kinkade analyzed as a separate affix.
In the first case, an example is the ‘inchoative’ affix, which has two allomorphs, -?- and -p. These were listed as separate affixes in Kinkade’s affix filecards, and were therefore typed into the computer database as separate affixes. However, Kinkade himself did consider these to be allomorphs of one affix. Similarly with ?ac-/c-. With cases of this kind, I have listed both forms of the affix under one xml:id entry, but have indicated that these are allomorphs of the affix. And I have also placed all the illustrations of all the allomorphs together into one entry. Thus, for instance, all -?- and -p cases are in one xml:id entry.
In the second case, an example is words containing the causative morpheme -stu. In addition to the fact that this morpheme has different stressed and unstressed allomorphs, it is also found in combination with other suffixes. For example, -stu-m occurs in the data, as does -stu-n(n). Kinkade listed both -stu-m and -stunn as separate suffixes in the filecards, but his later analyses indicate that he perceived these sequences as combinations of affixes. With cases of this kind, I have moved the illustrations that Kinkade had for -stum and -stunn into the -stu xml:id entry. I have noted that I did this in the database itself, just to keep track, but these notes can be erased eventually.
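In markup terms, I take it that amounts to something like the sketch below -- purely illustrative on my part: apart from xml:id, the element names and attribute values here are guesses, not the actual entry structure:
<entry xml:id="inchoative">
  <!-- both allomorphs listed under the one entry -->
  <form type="allomorph">-?-</form>
  <form type="allomorph">-p</form>
  <!-- all the dictegs for both allomorphs gathered here, with a note recording the merge -->
</entry>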
I don't think this requires any action from me, although we'll need to look closely at what's happening to those entries on the Web page. To that end, I just downloaded the latest affix.xml file in order to push it into the database, but it has some problems. One is this:
<dicteg>
  <cit>əxʷ [TEXT IS NOT ALLOWED HERE OUTSIDE THE quote TAG]
    <quote>(s)‐n̩‐√mèy=ap‐m̩én‐ct<gloss>confess</gloss></quote>
    <bibl><!--[No source]--></bibl>
  </cit>
</dicteg>
It looks like this pretty much shows the point you've reached in your editing, so I created a temporary file including only the entries up to that point, and tried uploading it to the database. I hit a file size limit problem when doing that -- an issue we've occasionally encountered before, and which Greg and I are working on -- so at the moment I can't put the file in the db. It does seem to me that this file is getting quite big, though, so I'm wondering whether it would be more practical to split this file into three or four sections.
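For what it's worth, my guess -- and it is only a guess, since the stray əxʷ may belong to the quoted form or may just be a leftover to delete -- is that the fix is to move that text inside the <quote>, since the schema doesn't allow bare text directly inside <cit>:
<dicteg>
  <cit>
    <quote>əxʷ (s)‐n̩‐√mèy=ap‐m̩én‐ct<gloss>confess</gloss></quote>
    <bibl><!--[No source]--></bibl>
  </cit>
</dicteg>
An XPath along the lines of //cit[text()[normalize-space(.)]] (run in oXygen as described below) should flag any other <cit> elements with stray text of their own.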
One last thing I've been meaning to mention about posting to the blog. If you look to the bottom right of your screen in the blog entry editing page, you'll see a section headed "Text Renderers". In there is a checkbox labelled "Auto P". If you check that checkbox, you'll find that your carriage returns turn into actual paragraph breaks, so your text is not all run together. When I see your entries which are run together without breaks, I've been clicking on Edit, checking that box, and saving again, to introduce the paragraph breaks. If you're wondering why that's off by default, it's because most of us using the blog tend to use HTML tags as we type, and do more complicated formatting, and the Auto P would interfere with that.
1. Cross-references: I went back and checked cross-reference examples. There is no difference between a link introduced by xr ‘See’ and one introduced by ‘cf’. So the second cross-reference type should be re-encoded using an xr tag. Do you do this, Martin, or do I?
I don't think there's any way to do this programmatically, because the <note> elements which use "cf." are discursive; there's no way for any algorithm to parse them and figure out what entry is being referred to. So it's a markup task, and as such I think it falls to you. You can assign it to me if you'd prefer, but it will take me a while to get to it, and I'll probably just have to ask you what to do in each case.
There's a simple way to find all <note> elements containing the text "cf" in oXygen:
- Open the XML markup file in oXygen.
- In the top left of the screen, you'll see "XPath 1.0" or "XPath 2.0". Use the dropdown list to select "2.0".
- Type this into the text box next to it:
//note[contains(., 'cf')]
- Press Return.
- You should see a list of items at the bottom of the oXygen screen. Click on each one to show it in the editor.
Doing this shows 8 items in the affix.xml file, and none at all in the s-rtr.xml, c-rtr.xml or phar-w.xml files.
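For the copies already in eXist, an equivalent XQuery would be something along these lines -- the collection path is just a placeholder for wherever the files actually live in the db:
xquery version "1.0";
(: placeholder path -- adjust to the real collection holding the dictionary files :)
for $n in collection('/db/moses')//note[contains(., 'cf')]
return $n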
Incidentally, this raises a question from me: which files are "finished" to the point where they should be uploaded into the database?
2. Martin’s question about lack of documentation of phr type=”phonemic/narrow” which are found in dictegs: We discussed this in December. See Blog of 18/12/06 Decisions. In Kinkade’s database some of the illustrations are transcribed phonetically and some are transcribed phonemically. So we decided to add phr type=”phonemic/narrow” in order to allow for phonemicization of the dicteg forms when necessary.
Mea culpa, and thank god for the blog. I've now updated the documentation to reflect this. I've also done a quick update to the rendering code so these are rendered between slashes and square brackets, as in the case of <seg> elements in the <pron>.
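The rendering rule itself amounts to something like the XSLT sketch below -- this isn't the actual stylesheet, and it assumes the type attribute carries the separate values "phonemic" and "narrow":
<!-- illustrative sketch only: slashes for phonemic transcriptions, square brackets for narrow ones -->
<xsl:template match="phr[@type='phonemic']">
  <xsl:text>/</xsl:text><xsl:apply-templates/><xsl:text>/</xsl:text>
</xsl:template>
<xsl:template match="phr[@type='narrow']">
  <xsl:text>[</xsl:text><xsl:apply-templates/><xsl:text>]</xsl:text>
</xsl:template>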
May 3, 2007
1. Cross-references: I went back and checked cross-reference examples. There is no difference between a link introduced by xr ‘See’ and one introduced by ‘cf’. So the second cross-reference type should be re-encoded using an xr tag. Do you do this, Martin, or do I?
2. Martin’s question about lack of documentation of phr type=”phonemic/narrow” which are found in dictegs: We discussed this in December. See Blog of 18/12/06 Decisions.
In Kinkade’s database some of the illustrations are transcribed phonetically and some are transcribed phonemically. So we decided to add phr type=”phonemic/narrow” in order to allow for phonemicization of the dicteg forms when necessary.
3. Editing Decisions concerning Conflation of Entries in Affix.xml file:
As I have been going through the Affix.xml file, I have been making editing decisions about entries. Specifically, throughout this file there are entries which Kinkade listed as separate suffixes on his filecards, but which in fact are either 1) allomorphs of one affix, rather than affixes in their own right, or 2) combinations of two affixes which Kinkade analyzed as a separate affix.
In the first case, an example is the ‘inchoative’ affix, which has two allomorphs, -?- and -p. These were listed as separate affixes in Kinkade’s affix filecards, and were therefore typed into the computer database as separate affixes. However, Kinkade himself did consider these to be allomorphs of one affix. Similarly with ?ac-/c-. With cases of this kind, I have listed both forms of the affix under one xml:id entry, but have indicated that they are allomorphs of one affix. And I have also placed all the illustrations of all the allomorphs together into one entry. Thus, for instance, all -?- and -p cases are in one xml:id entry.
In the second case, an example is words containing the causative morpheme -stu. In addition to the fact that this morpheme has different stressed and unstressed allomorphs, it is also found in combination with other suffixes. For example, -stu-m occurs in the data, as does -stu-n(n). Kinkade listed both -stu-m and -stunn as separate suffixes in the filecards, but his later analyses indicate that he perceived these sequences as combinations of affixes. With cases of this kind, I have moved the illustrations that Kinkade had for -stum and -stunn into the -stu xml:id entry. I have noted that I did this in the database itself, just to keep track, but these notes can be erased eventually.