Nxaʔamxcín (Moses) Dictionary Blog

April 26, 2007

Responses to questions

Posted by on 26 Apr 2007 in Activity log

I've finally had a chance to look at your questions in detail:

1. In the inchoative entry one of the allomorphs (the glottal stop) is an infix, while the other allomorph (the -p) is a suffix. In the feature structure it is possible to have two different symbol values for type of morpheme, so I used this possibility to list the inchoative morpheme as being both an infix and a suffix. • Question for Martin: But the question that I have is how can we indicate which allomorph is an infix and which is a suffix in the database? Does this pose any kind of problem for the database?

This is a question I hadn't thought about before, because it hadn't occurred to me that there would be two different morpheme types for one morpheme. However, a relatively simple solution suggests itself:

<form type="allomorph" n="1">
  ...
</form>
<form type="allomorph" n="2">
  ...
</form>

 ...
 
<fs>
  <f name="baseType">
    <symbol value="infix" n="1"/>
    <symbol value="suffix" n="2" />
  </f>
  ...
</fs>

Then I can write code to detect the presence of the n attributes, and link the correct form to the correct symbol value. I've added this to the documentation, and I've also posted a link to the documentation on the site.
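As a rough illustration of that linking step, here is a minimal sketch in Python using only the standard library. It is not the site's actual code: the <entry> wrapper, the <orth> elements and their contents, and the function name link_forms_to_types are all assumptions made so the example can run.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry fragment modelled on the markup above. The <entry>
# wrapper and the <orth> contents are assumptions for illustration only.
ENTRY = """
<entry>
  <form type="allomorph" n="1"><orth>ʔ</orth></form>
  <form type="allomorph" n="2"><orth>p</orth></form>
  <fs>
    <f name="baseType">
      <symbol value="infix" n="1"/>
      <symbol value="suffix" n="2"/>
    </f>
  </fs>
</entry>
"""


def link_forms_to_types(entry):
    """Pair each allomorph with the morpheme type that shares its n attribute."""
    # Collect the symbol values that carry an n attribute, keyed by n.
    types = {
        sym.get("n"): sym.get("value")
        for sym in entry.findall("./fs/f[@name='baseType']/symbol")
        if sym.get("n") is not None
    }
    linked = {}
    for form in entry.findall("./form[@type='allomorph']"):
        n = form.get("n")
        if n is None:
            continue  # forms without n are left unlinked
        text = "".join(form.itertext()).strip()
        linked[n] = (text, types.get(n))
    return linked


links = link_forms_to_types(ET.fromstring(ENTRY))
for n, (text, morpheme_type) in sorted(links.items()):
    print(n, text, morpheme_type)
```

A form whose n has no matching symbol comes back paired with None, which would be one way to spot entries that need attention.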

2. This has to do with markup of glosses in illustration (dictegs). • Question for Martin: Do we want the English-Nx wordlist to be able to access illustration glosses, and if so, how do we mark this up? Can we use the same system of segs and glosses?

My original intention was that the gloss tag would be used inside a <def><seg> tag to signal a word or phrase which could be used to create the English-Nx wordlist, and that the wordlist would be constructed only based on <gloss> tags occurring in that context. In the <dicteg> tags, we're using <gloss> for something else:

If a gloss for the illustration is required, it can be included in the <quote> tag with a <gloss> tag, like this:
  <cit>
    <quote>The quoted illustration<gloss>Translation of the illustration</gloss></quote>
      ...
  </cit>
(from our guidelines)

Therefore it seems to me that using <gloss> in a different way inside the <dicteg> will be confusing. I took a look at s-rtr.xml, and I found some bits that look like this:

<cit><!--check stress on this one-->
  <quote>
    <phr type="phonemic">ni?c'ikus ??p?i??</phr>
    <phr type="narrow">ne?c?ikos ??p?l?</phr>
    <seg>whole wheat <gloss>flour</gloss></seg>
  </quote>
  <bibl>Y41.7</bibl>
</cit>

This doesn't look anything like the guidelines, so I'm wondering what happened here. Was this based on the code already in the file, or did you construct this format with <phr> and <seg> tags?

It seems to me that if there's a word or phrase that can serve as a direct English equivalent to the headword appearing in an illustration, it might as well be in the <def> element, wrapped in a <gloss> tag; is there any good reason to take material from the illustrations for the English-Nx glossary?
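To make the distinction concrete, here is a small Python sketch of a wordlist pass that only picks up <gloss> tags in the <def><seg> context and ignores those inside <quote>. The sample entry is assembled from fragments quoted in these posts; the <entry> wrapper and the function name wordlist_glosses are assumptions, and the real wordlist code may work quite differently.

```python
import xml.etree.ElementTree as ET

# Sample assembled from fragments quoted in these posts; the <entry>
# wrapper is an assumption for illustration.
SAMPLE = """
<entry>
  <def><seg>they are <gloss>tame</gloss> or <gloss>gentle</gloss></seg></def>
  <dicteg>
    <cit>
      <quote>The quoted illustration<gloss>Translation of the illustration</gloss></quote>
    </cit>
  </dicteg>
</entry>
"""


def wordlist_glosses(entry):
    """Return glosses found in <def><seg> only, skipping <gloss> inside <quote>."""
    return [
        "".join(gloss.itertext()).strip()
        for gloss in entry.findall(".//def/seg/gloss")
    ]


glosses = wordlist_glosses(ET.fromstring(SAMPLE))
print(glosses)
```

Because the path is anchored on def/seg, the translation inside the <quote> never reaches the wordlist, so the two uses of <gloss> stay separate at processing time even though they share a tag name.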

3. Cross-references: Here are the two different ways of doing cross-references. They are from the same entry -?-/-p “inchoative”. Note that the format in (a) does not provide a gloss for the cross-reference. Presumably this is because the gloss is meant to be determined by looking at the entry of the word that is referred to in the cross-reference. The effect of xr is to point to the xml:id and thus the entry for the referred to word. The format in (b) does provide a gloss, but does not point to the entry of the referred to word. I assume that (a) is actually the format that we want to be following but your input is needed here Martin.

(a)
<dicteg>
  <cit>
    <quote>s‐√tˀa+ʔ+x‐míx<gloss>it is getting sweet</gloss></quote>
    <bibl></bibl>
  </cit>
  <xr>See<ref target="tˀəx">tˀə́x</ref></xr>
</dicteg>

(b)
<dicteg>
  <cit>
    <quote>√kˀə́tˀ‐p<gloss>rope breaks</gloss>
      <note>cf. kɬk'ə́t'ən ' break a line'</note>
    </quote>
    <bibl></bibl>
  </cit>
</dicteg>

Our documentation shows this example:

<xr>See <ref target="idblah">Blah</ref> (English blah) and <ref target="idblah2">blah2</ref> (English blah2).</xr>

The intention is that the gloss, if needed, be simply in brackets. I think the structure quoted in your question is the result of the automatic conversion code doing the best it could with the source material; in this case, there was no gloss for the cross reference encoded with <xr>, and the second cross-reference was simply not encoded properly in the original source. If there's a difference between a link introduced by "See" and one introduced by "cf.", then we'll need to elaborate the tagging system a bit, but I suspect in this case the second cross-reference should be re-encoded using an <xr> tag.
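If it would help to find the other cases that need re-encoding, something along these lines could list them. This is only a sketch in Python: the sample markup uses placeholder forms, and both the function name notes_needing_xr and the assumption that such notes always begin with "cf." or "See" are mine, not something taken from the source files.

```python
import xml.etree.ElementTree as ET

# Placeholder sample: one cross-reference encoded with <xr>, one left as a
# <note>. The forms and ids are invented for illustration.
SAMPLE = """
<entry>
  <dicteg>
    <cit><quote>form one<gloss>it is getting sweet</gloss></quote></cit>
    <xr>See <ref target="idblah">Blah</ref></xr>
  </dicteg>
  <dicteg>
    <cit><quote>form two<gloss>rope breaks</gloss>
      <note>cf. Blah2 'break a line'</note>
    </quote></cit>
  </dicteg>
</entry>
"""


def notes_needing_xr(entry):
    """List the text of <note> elements that look like cross-references."""
    flagged = []
    for note in entry.iter("note"):
        text = "".join(note.itertext()).strip()
        # Assumed convention: a cross-reference note starts with "cf." or "See".
        if text.lower().startswith(("cf.", "see ")):
            flagged.append(text)
    return flagged


flagged = notes_needing_xr(ET.fromstring(SAMPLE))
print(flagged)
```

The list only says where to look; deciding the right target for each <ref> still has to be done by hand.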

Cross-references

Posted by on 26 Apr 2007 in Activity log

April 26, 2007

Cross-references: Here are the two different ways of doing cross-references. They are from the same entry -?-/-p “inchoative”. Note that the format in (a) does not provide a gloss for the cross-reference. Presumably this is because the gloss is meant to be determined by looking at the entry of the word that is referred to in the cross-reference. The effect of xr is to point to the xml:id and thus the entry for the referred to word. The format in (b) does provide a gloss, but does not point to the entry of the referred to word. I assume that (a) is actually the format that we want to be following but your input is needed here Martin.

(a)
<dicteg>
  <cit>
    <quote>s‐√tˀa+ʔ+x‐míx<gloss>it is getting sweet</gloss></quote>
    <bibl></bibl>
  </cit>
  <xr>See<ref target="tˀəx">tˀə́x</ref></xr>
</dicteg>

(b)
<dicteg>
  <cit>
    <quote>√kˀə́tˀ‐p<gloss>rope breaks</gloss>
      <note>cf. kɬk'ə́t'ən ' break a line'</note>
    </quote>
    <bibl></bibl>
  </cit>
</dicteg>

Other: I worked on merging various entries which were separate in Kinkade's filecards, but which in fact all involve the causative morpheme -stu- and its various allomorphs. More work on this tomorrow.

April 25, 2007

Porting the project to Lettuce

Posted by on 25 Apr 2007 in Activity log

The old location of the Moses project on the Mustard server is going obsolete, so we're moving all projects over to a newer Tomcat/Cocoon/eXist block on the Lettuce server. I got stuck into that process today. The new site is here, and the old site will be pointed at it soon.

I immediately faced a major problem. The AJAX code which retrieves entry information from the server sends the id attribute as a GET request to the server. In previous projects we've had some problems with this, using characters between 127 and 255, but worked around it by using the JavaScript escape() function. However, with these id attributes, there's no hope of that; they have all sorts of characters above 255 in them.

After several hours of hacking around with the JavaScript, I eventually found a solution that could be implemented through a change to the Cocoon web.xml file. I'll document that in detail on the Maintenance blog, since it's really a server configuration issue.
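For anyone who wants to see why these ids are awkward in a GET request, here is a small illustration in Python (not the project's JavaScript, and not the web.xml fix itself). It takes one id from the cross-reference example, percent-encodes it as UTF-8, and decodes it two ways. The assumption that a receiving server might decode the bytes as Latin-1 is mine, for illustration only.

```python
from urllib.parse import quote, unquote

# An id taken from the cross-reference example in Ewa's post; any id with
# characters above 255 behaves the same way.
entry_id = "tˀəx"

# Characters above 255 do not fit in a single-byte encoding such as Latin-1.
try:
    entry_id.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# Percent-encode the id as UTF-8 bytes, as it would travel in a query string.
encoded = quote(entry_id, safe="", encoding="utf-8")

# Decoding with the same encoding round-trips; decoding the same bytes as
# Latin-1 (an assumed server-side mismatch) yields a different string.
decoded_ok = unquote(encoded, encoding="utf-8")
decoded_wrong = unquote(encoded, encoding="latin-1")

print(fits_latin1, encoded, decoded_ok == entry_id, decoded_wrong == entry_id)
```

The point is only that both ends have to agree on the encoding of the percent-escaped bytes; once they do, the ids survive the trip unchanged.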

The main thing is that the new site is now working, and it gives us a lot of new opportunities in terms of performance, indexing and stability that were not there on the old server. Next, I'll be able to look in detail at the questions in Ewa's post below.

April 24, 2007

Affix and s.rtr

Posted by on 24 Apr 2007 in Activity log

I fixed the errors in the s.rtr file and then went back to working on the affix file. In the affix file I accomplished the following:
1. I corrected some feature structures at the beginning of the file which had errors.
2. I worked on restructuring the p entry. This entry was a mess: it has a lot of cross-references, and the conversion program placed them all at the beginning of the entry rather than in the right dictegs. Some of the cross-references also had their transcriptions and their glosses separated.
3. I moved the p entry into the glottal stop ‘Inchoative’ entry since p is an allomorph of the Inchoative.

Three issues arose:
1. In the ‘Inchoative’ entry one of the allomorphs (the glottal stop) is an infix, while the other allomorph (the -p) is a suffix. In the feature structure it is possible to have two different symbol values for type of morpheme, so I used this possibility to list the ‘Inchoative’ morpheme as being both an infix and a suffix.
• Question for Martin: How can we indicate which allomorph is an infix and which is a suffix in the database? Does this pose any kind of problem for the database?
2. In the affix file there seem to be two different ways of doing cross-references: one way involves using the tag xr and the other involves using the note tag with dictegs and quote. I need to figure out what the difference is and whether there is a problem to be fixed here, or whether there is a consistent reason for the difference.
• Task for Ewa: how are cross-references used in dictegs?
3. The last question that arose has to do with markup of glosses in illustration (dictegs).
• Question for Martin: Do we want the English-Nx wordlist to be able to access illustration glosses, and if so, how do we mark this up? Can we use the same system of segs and glosses?

April 23, 2007

s-rtr.xml file

Posted by on 23 Apr 2007 in Activity log, Tasks

I took a look at the file and it looks fine. I've uploaded it into the DB, and the list of words is a bit longer:

There are a couple of things I noticed. There's one asterisk still in there:

<gloss>he is real *gentle</gloss>

Also, there's one instance of a space at the beginning of a gloss element:

<seg>they are 
  <gloss>tame</gloss> or <gloss> gentle</gloss>
</seg>

which could be eliminated -- it'll save on processing time if we don't have to strip leading and trailing spaces from gloss tags when we process them.
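A quick check like the following could catch both kinds of leftovers before a file is uploaded. It is a sketch in Python, assuming the glosses can be found anywhere in the file as <gloss> elements; the <entry> and <def> wrappers in the sample and the function name gloss_problems are invented for the example.

```python
import xml.etree.ElementTree as ET

# Sample built from the two fragments quoted above; the <entry> and <def>
# wrappers are assumptions for illustration.
SAMPLE = """
<entry>
  <def>
    <seg>they are <gloss>tame</gloss> or <gloss> gentle</gloss></seg>
    <seg><gloss>he is real *gentle</gloss></seg>
  </def>
</entry>
"""


def gloss_problems(root):
    """Report glosses with leading/trailing whitespace or a leftover asterisk."""
    problems = []
    for gloss in root.iter("gloss"):
        text = "".join(gloss.itertext())
        if text != text.strip():
            problems.append(("whitespace", text))
        if "*" in text:
            problems.append(("asterisk", text))
    return problems


problems = gloss_problems(ET.fromstring(SAMPLE))
for kind, text in problems:
    print(kind, repr(text))
```

Printing with repr makes the stray space visible, which is otherwise easy to miss.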

There's one <m> element with no sameAs reference:

<m sameAs="">wílˀx</m>

which I presume is because you don't yet know what that reference would be, not having done all the affixes yet. This is a good reason for doing the affixes next, so that we can go back and fill any missing ones in before there are too many of them.
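To keep track of the missing ones in the meantime, a pass like this could list every <m> whose sameAs is empty or absent. It is a Python sketch only: the <hyph> wrapper follows the "hyphs" mentioned elsewhere on this blog, but the exact nesting, the id someAffixId and the function name unresolved_morphemes are assumptions.

```python
import xml.etree.ElementTree as ET

# The first <m> is the one quoted above; the second, with a filled-in
# sameAs, is a hypothetical contrast case.
SAMPLE = """
<entry>
  <hyph><m sameAs="">wílˀx</m><m sameAs="someAffixId">x</m></hyph>
</entry>
"""


def unresolved_morphemes(root):
    """Return the text of every <m> whose sameAs is empty or missing."""
    return [
        "".join(m.itertext()).strip()
        for m in root.iter("m")
        if not (m.get("sameAs") or "").strip()
    ]


unresolved = unresolved_morphemes(ET.fromstring(SAMPLE))
print(unresolved)
```

Run again after the affix file is finished, the same list would show what is still left to fill in.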

Finally, there are some things about the display of the entries which, now I look at them, I don't understand. The primary form (the headword on the web page) seems to be the <seg type="narrow"> form, rather than the phonemic form, while I suspect the sorting is being done based on the phonemic form. In addition, for an entry like "sweater" where the morpheme element can only point back to its own entry, the morpheme should not be a link; however, it is a link, and even worse, it doesn't point back to its own entry, it points to nothing.

So I'm adding this as a task for me to fix the sorting, headword display, and handling of morphemes which shouldn't link because they'd link only to themselves.

April 19, 2007

S-rtr.xml file draft is finished

Posted by on 19 Apr 2007 in Activity log

1. I have completed a version of the s-rtr.xml file.

2. Glosses:
Following Martin’s comments about marking up glosses, I have deleted all the asterisks (I think), and have tried to use seg and gloss tags in such a way as to ensure that the right words from the glosses will be singled out when creating the English-Moses wordlist.

3. Phonemic forms:
a) I have tried to be systematic about how I have indicated phonemic and narrow types. I think there is redundancy in what I have done, but the entries seem to me to be easier to read now. Martin, could you look over the file with redundancy in mind and tell me what you think?
b) In phonemicizing the dictegs, and the entries themselves, I have made the following decisions:
(i) The so-called ‘phonemic’ forms are not identical to underlying forms in all cases. For example, when a morpheme has more than one allomorph, one stressed, and one unstressed, and the unstressed allomorph involves changes in vowels, or some similar change, then I have stayed fairly close to the pronunciation of the unstressed variant in the ‘phonemic’ form. Thus, in the case of a suffix like =áw’s~=u?s, I am writing the unstressed variant as =u?s--writing =aw’s would make it hard for speakers/learners to know how to pronounce the form without learning complicated rules. If, however, in the phonetic form Kinkade has recorded =o?s, then I change the ‘o’ to ‘u’ in the phonemic form. Similarly for a morpheme like =míx~=mx~=ExW (E=schwa, W=raised w), I write =ExW in the phonemic form (rather than =mix, or =oxW).
(ii) I am leaving out all unstressed schwas in phonemic forms, except those that occur in reduplicative morphemes, those that occur in roots, and those that occur in a few suffixes (e.g., =ul’ExW), where they are unpredictable synchronically. This phonemic transcription thus differs from the orthographic representations found in the Nx Language Program dictionary edited by Nancy Mattina.
(iii) In loanwords, I am transcribing vowels as Kinkade transcribed them, even if they are not fully phonemic (e.g., spanyol I have phonemicized as spanyol, and not as spanyul). This decision is made to allow the spelling to reflect the loanword status and to make pronunciation more transparent.
(iv) In the s-rtr file, I have transcribed all the xml-id forms and the phonemic forms with an initial s-rtr, even if Kinkade did not transcribe them all that way. The reason for this is to aid in alphabetization.

4. Next task: I plan to work on the Affix file to try to finish it. The reason for this is that all the affixes will then be available to be referred to in hyphs. Does that make sense to you, Martin?

March 16, 2007

Entries and forms

Posted by on 16 Mar 2007 in Activity log

I've read through what your post on entries and forms says, and it seems to me that the multiple form elements are redundant; if the forms are the same, then there only needs to be one form element. In other words, one form can have multiple senses, and one sense can have multiple forms; however, if it's a situation where there are two forms (a and b) and two senses (1 and 2), and form a only goes with sense 1, while form b only goes with sense 2, then they should be in separate entry elements.

Does that make sense?

Glosses

Posted by on 16 Mar 2007 in Activity log

The idea of the gloss tags was based on the original star idea: single words were starred, on the basis that they would be used to create a simple word-list in English. Phrases couldn't be starred, because there's no way to know when the phrase would end, so the original starring was restricted to single words. Our gloss tags are doing the same sort of thing, but now, if you want to put a phrase in a gloss tag, you're welcome to do that. However, bear in mind that what you're creating with a gloss tag is an entry in a simple English-Moses dictionary; if you wrap a phrase such as "a small cup" in a gloss tag, then that entry will show up under "a" in the E-M dictionary, which is presumably not what you want. In other words, the gloss tags reproduce exactly what was available in the original star system (they identify a single word that will be used to create an E-M wordlist); but they aren't restricted to a single word, as the starring system was.
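Here is a tiny Python sketch of why the phrase ends up under "a". It assumes the simplest possible wordlist, one that files each gloss under its first letter; the real E-M wordlist may be built differently, and the function name build_index is made up for the example.

```python
from collections import defaultdict


def build_index(glosses):
    """File each gloss under its first letter, as a simple wordlist might."""
    index = defaultdict(list)
    for gloss in glosses:
        key = gloss.strip()
        if key:
            index[key[0].lower()].append(key)
    # Sort the letters, and the entries under each letter, alphabetically.
    return {
        letter: sorted(entries, key=str.lower)
        for letter, entries in sorted(index.items())
    }


index = build_index(["cup", "a small cup", "tame", "gentle"])
for letter, entries in index.items():
    print(letter, entries)
```

Wrapping just "cup" in the gloss tag, and leaving "a small" outside it in the surrounding seg, puts the entry where a reader would look for it.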

"Broad" and code in posts

Posted by on 16 Mar 2007 in Activity log

I've changed the markup documentation to use "phonemic" instead of "broad".

I've also edited Ewa's last-but-one post to add <pre></pre> tags around the markup. This makes the linebreaks and indentation show up as expected in the output. You can use the "pre" button on the toolbar in the post editing window to do this. To make code show up as expected, first escape it with the &; button, then use the "pre" button on it.

phonemic versus broad

Posted by on 16 Mar 2007 in Activity log

We did initially propose to use 'broad', and then we changed that to 'phonemic' when we decided to use phonemic representations in the hyph elements. Neither term is exactly right. However, I think we should decide to go with 'phonemic' because this means that one can characterize the orthography as 'roughly phonemic' and that will be generally understood ('roughly broad phonetic' is a messier designation). So this should be changed in the xml markup documentation.

Nxaʔamxcín (Moses) Dictionary Blog

This is an XML dictionary project based primarily on the materials compiled by the late M. Dale Kinkade during fifteen years of work in the 1960’s and 1970’s with more than a dozen native speakers of the language, but it also includes materials compiled by Ewa Czaykowska-Higgins in the early 1990’s.
