Nxaʔamxcín (Moses) Dictionary Blog

May 18, 2007

One more gloss question

Posted by on 18 May 2007 in Activity log

In example (1), taken from a sense tag, the whole meaning of the entry is also the part that a search would target when building the English-Nx wordlist.

(1) <sense><def><seg>cougar</seg></def></sense>

But in (2) the meaning of the entry and the part that the English-Nx wordlist would target are not identical. Hence, we have added the gloss mark-up.

(2) <sense><def><seg><gloss>worn down</gloss> to the end</seg></def></sense>

In case (1) should I add in a gloss mark-up, even though it is redundant? Or is gloss only necessary in those cases where the English-Nx target is not identical to the meaning of an entry?
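
To make the two options concrete, here is a minimal sketch of how a wordlist builder could cope with both (1) and (2) without redundant mark-up. It assumes a fallback rule that is not confirmed anywhere in the project: if a seg contains gloss elements, only those are indexed; otherwise the whole seg text is used. The function name wordlist_keys is hypothetical.

```python
import xml.etree.ElementTree as ET

def wordlist_keys(sense_xml):
    """Return the English strings that would be indexed for one <sense>.

    Assumed rule (not the project's confirmed behaviour): prefer <gloss>
    children of a <seg>; fall back to the full <seg> text when there are none.
    """
    sense = ET.fromstring(sense_xml)
    keys = []
    for seg in sense.findall("./def/seg"):
        glosses = [g.text.strip() for g in seg.findall("gloss")
                   if g.text and g.text.strip()]
        if glosses:
            keys.extend(glosses)
        else:
            whole = "".join(seg.itertext()).strip()
            if whole:
                keys.append(whole)
    return keys

example_1 = "<sense><def><seg>cougar</seg></def></sense>"
example_2 = ("<sense><def><seg><gloss>worn down</gloss> to the end"
             "</seg></def></sense>")
print(wordlist_keys(example_1))
print(wordlist_keys(example_2))
```

Under that rule the gloss in (1) would be optional. If the builder instead looks only for gloss elements, (1) would need the redundant mark-up.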

Small Correction

Posted by on 18 May 2007 in Activity log

I've been working on the phar-w.xml file, not the phar file.

Glosses and Cross-references Again

Posted by on 18 May 2007 in Activity log

When I went back to work on the phar file, a number of questions came up. I raise them below:

In late April (26/04/07) we had an exchange about gloss tags. I asked whether we want to be able to access glosses in dictegs when constructing the English-Nx wordlist. Martin replied that he had not envisioned doing so, and that therefore only the segs and glosses within the sense part of an entry would be searched for the English-Nx wordlist. Martin also asked whether there is any good reason to take material from the illustrations for the English-Nx glossary. I didn’t have an answer to this last question at the time, but I think I have one now.

The following is a set of four connected entries based on the root, ʕʷə́cˀ, which can serve to exemplify my answer. Kinkade did not provide a gloss for this root, so in the entry for the root itself, there is no definition available. Similarly for the entry that consists of this root plus the STAT suffix -t, there is no definition available, but there is an illustration. The other two entries connected to this root do have definitions.

<entry xml:id="ʕWəcˀ">
<form>
<pron><seg>ʕʷə́cˀ</seg></pron>
<hyph><m>ʕʷə́cˀ</m></hyph>
</form>
<sense>
<def></def>
</sense>

<xr>See <ref target="yəcˀp">yə́cˀp</ref><note>worn down to the end</note></xr>
</entry>

<entry xml:id="ʕWəcˀt">
<form>
<pron><seg>ʕʷə́cˀt</seg></pron>
<hyph>√<m sameAs="ʕWəcˀ">ʕʷə́cˀ</m>‐<m sameAs="STAT">t</m></hyph>
</form>
<sense>
<def><seg></seg></def>
<dicteg>
<cit><quote>ʕʷə́c't ʔeɬxʷənčút<gloss>out of breath</gloss></quote><bibl>Y40.177</bibl></cit>
</dicteg>
</sense>
</entry>

<entry xml:id="ʕWə́cˀp">
<form>
<pron><seg>ʕʷə́cˀp</seg></pron>
<hyph>√<m sameAs="ʕWə́cˀ">ʕʷə́cˀ</m>‐<m sameAs="ʔ">p</m></hyph>
</form>
<sense>
<def><seg>worn down to the end</seg></def>
</sense>
<bibl>Y34.34</bibl>
</entry>

<entry xml:id="ʕWəcˀpaskˀáyˀt">
<form>
<pron><seg>ʕʷəcˀpaskˀáyˀt</seg></pron>
<hyph>√<m sameAs="ʕWə́cˀ">ʕʷə́cˀ</m>‐<m sameAs="inch">p</m>=<m sameAs="askˀáyˀt">askˀáyˀt</m></hyph>
</form>
<sense>
<def><seg>ran out of breath</seg></def>
</sense>
<bibl>JM3.73.7</bibl>
</entry>

If the English-Nx wordlist looks only at material within the sense tags to determine membership, then the stative form, which has a dicteg but no filled-in gloss within its def, will be missed.
Question 1: Is this a problem?
Questions 2-4: For the English-Nx wordlist, how do we handle roots that have no gloss? Should I provide a gloss based on my interpretation of the available forms? And if I do that, should the ECH-interpreted gloss be marked as such, to distinguish it from an attested gloss?
Question 5: If we look at the glosses for the last two of the four entries above, we see that one is ‘worn down to the end’ and the other is ‘ran out of breath’. What is the best way to mark up these glosses?
I envision two possibilities (at least):
(a) I could use the seg/gloss mark-up to do the following:
<def><seg><gloss>worn down</gloss> to the end</seg></def>
<def><seg>ran <gloss>out of</gloss> breath</seg></def>
This mark-up foregrounds what seem to be those parts of the senses that the two definitions have in common.
(b) I could provide a general, identical interpreted definition for the two forms. Here I highlight the general nature of the definition by putting it in caps:
<def><seg><gloss>WEAR OUT, RUN OUT</gloss> worn down to the end</seg></def>
<def><seg><gloss>WEAR OUT, RUN OUT</gloss> ran out of breath</seg></def>

I would be grateful for your thoughts on these questions, Martin.
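
As a rough illustration of Question 1, the sketch below lists entries that a sense-only search would skip even though a gloss is available inside a dicteg. The sample is an abridged copy of two of the entries above, wrapped in a hypothetical entries root element; the function name is also an assumption.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

SAMPLE = """<entries>
<entry xml:id="ʕWəcˀt">
<sense>
<def><seg></seg></def>
<dicteg>
<cit><quote>ʕʷə́c't ʔeɬxʷənčút<gloss>out of breath</gloss></quote><bibl>Y40.177</bibl></cit>
</dicteg>
</sense>
</entry>
<entry xml:id="ʕWə́cˀp">
<sense>
<def><seg>worn down to the end</seg></def>
</sense>
</entry>
</entries>"""

def has_text(elem):
    """True if the element or any descendant holds non-whitespace text."""
    return any(t.strip() for t in elem.itertext())

def entries_missed_by_sense_search(root):
    """Entries whose <def> is empty but whose <dicteg> carries a gloss."""
    missed = []
    for entry in root.iter("entry"):
        has_def = any(has_text(d) for d in entry.findall("./sense/def"))
        eg_glosses = ["".join(g.itertext()).strip()
                      for g in entry.findall(".//dicteg//gloss")]
        eg_glosses = [g for g in eg_glosses if g]
        if not has_def and eg_glosses:
            missed.append((entry.get(XML_ID), eg_glosses))
    return missed

missed = entries_missed_by_sense_search(ET.fromstring(SAMPLE))
for entry_id, glosses in missed:
    print(entry_id, glosses)
```

A report like this could at least show how many entries are affected before deciding whether dicteg glosses should feed the wordlist.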

Xml mark-up documentation:
1. The xml mark-up documentation does not explain how we are using segs and glosses within defs to distinguish the parts of a definition that the English-Nx wordlist needs to search from the parts that it does not need to search.

2. I don’t think the latest version of the xml mark-up documentation is sufficiently clear about what we are doing with cross-references. For example, it does not make clear that, because each cross-reference points to the xml:id of the form it refers to, we do not actually need to supply the English meanings of those forms.
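
Since cross-references depend entirely on xml:ids, a small check that every ref target matches an existing entry may be worth documenting too. The sketch below assumes that target holds a bare id, as in the samples above, and that all entries sit in one file; the ids "a", "b" and "c" are placeholders, not real data.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

SAMPLE = """<entries>
<entry xml:id="a"><xr>See <ref target="b">b</ref></xr></entry>
<entry xml:id="b"><xr>See <ref target="c">c</ref></xr></entry>
</entries>"""

def unresolved_refs(root):
    """Return (entry id, target) pairs whose target matches no entry xml:id."""
    known_ids = {entry.get(XML_ID) for entry in root.iter("entry")}
    problems = []
    for entry in root.iter("entry"):
        for ref in entry.iter("ref"):
            target = ref.get("target")
            if target not in known_ids:
                problems.append((entry.get(XML_ID), target))
    return problems

problems = unresolved_refs(ET.fromstring(SAMPLE))
print(problems)
```

A target that lives in a different file would show up here as unresolved, so in practice the set of known ids would have to be collected across all the files.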

May 17, 2007

Transformer

Posted by on 17 May 2007 in Activity log

Your idea to use Transformer sounds great.

I won't be working on Thursday but will work again on Friday afternoon. (It's Ascension, so a holiday here. Plus, Ales has a bull-fighting festival going on until Sunday, which is a very big occasion here.)

May 16, 2007

Doing changes to transcription

Posted by on 16 May 2007 in Activity log

Sorry I've been a bit slow getting back to you on this -- I've been swamped by other projects.

Now I come to think about it, the best option for converting those files that have already been completed will be to use Transformer, which is designed for exactly this sort of job. I can create a replace sequence that I think will get us to where we want to be, then I can run it on the files that have already been completed. Then you can check the results -- we could actually put them into the database for that purpose, so you can read them more easily on the Web. Once we're happy the replacements are doing the job, I can run them on all the remaining files that you haven't yet worked on. In the meantime, you could be using the right markup for the one you're working on next.

Does that make sense?

Affix.xml File FSs completed

Posted by on 16 May 2007 in Activity log

I have completed the feature structures and merging of entries in the affix.xml file. The file still requires proofreading against MDK's filecards, phonemicizing, changing all ejectives to stop + raised comma, and ensuring that all glottalized resonants are resonant + superscript glottal.

What is the most efficient way for me to undertake the ejective/glottalized resonant changes?

Next task: the Pharyngeal file, to have one more complete file with lexical rather than suffixal content. And then I will turn to the lexical suffix file because the LSs require feature structures as well.

May 11, 2007

Testing another Java IDE

Posted by on 11 May 2007 in Activity log

Since I'm going to be writing Java classes and ultimately applications in the future, and I need to write some Java now for the Moses project, I need to choose an appropriate IDE. I've been working with Eclipse so far, and it looks good, but another alternative is NetBeans 5.5, which is also free, and that's getting good reviews. It also has a GUI-building tool, which may be very handy.

I downloaded and installed it, and read some introductory materials, then I began trying to duplicate my MosesIndexer project from Eclipse in NetBeans. After some faffing around I got it working. The hardest thing to figure out was Unicode support; not only did it default (like Eclipse) to Windows 1252 for the source editor encoding (ridiculous choice), but even when I figured out how to change the default encoding for source files, and change the default encoding for the project and for each individual source file, I still couldn't get a test class with some Unicode text in it to compile. The problem turned out to be the compiler. I had to go into the project properties, click on Build / Compiling, and add "-encoding utf8" to the Additional Compiler Options.
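
For anyone wondering why the compiler setting matters, the effect can be reproduced outside Java. This small Python sketch (an illustration only, not part of the MosesIndexer project) shows what happens when UTF-8 bytes for two Moses characters are read as Windows 1252, which is effectively what a compiler does when it assumes the wrong source encoding.

```python
# Two characters from the Moses data: U+0295 and U+02B7 (pharyngeal + raised w).
text = "\u0295\u02b7"

# A source file saved as UTF-8 stores each of them as two bytes.
utf8_bytes = text.encode("utf-8")

# Reading those bytes as Windows 1252 yields four unrelated characters.
misread = utf8_bytes.decode("cp1252")

# Reading them as UTF-8 restores the original text.
restored = utf8_bytes.decode("utf-8")

print(len(text), len(misread), ascii(misread))
```

The string literal silently becomes different text unless the reader is told the real encoding.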

After that, everything worked. The basics seem no different from Eclipse; Eclipse seems to have slightly better content completion when it comes to adding imports automatically, while NB seems to generate more detailed code templates when you create a Java Application project or add a class. If the GUI builder component is useful at all, I think NetBeans will be the way to go.

Further to commas and glottals

Posted by on 11 May 2007 in Activity log

Response to Martin and Greg regarding glottalization:
Well, I am happy with everything here: in the data, ejectives will be transcribed with a raised comma; glottalized resonants will be transcribed with a superscript glottal. In the output, all the segments will be transcribed with a raised comma.

The ejective and sonorant/resonant categories are not quite right, however. They should be:
Ejectives: p’, t’, c’, ƛ’, k’, kʷ’, q’, qʷ’

Sonorants: mˀ, nˀ, lˀ, ḷˀ, rˀ, wˀ, yˀ and ʕˀ, ʕʷˀ (the voiced pharyngeal fricative, what you are calling epiglottal), which appears as both plain glottalized and rounded glottalized.
Belted l is never glottalized.

The question of which raised comma character to use: I agree that modifier letter apostrophe is the best option. (It’s too bad about the handwritten alphabet having the w and y with raised commas above--when we transcribe by hand we are not always as precise as we should be; raised comma above is often not distinguished from raised comma just to the right when one writes by hand, but clearly this is an important difference when using computer fonts)

Normalizing the data: I definitely think we should do this with simple search and replace operations. These are easy to do and can be done as I go through each file, although clearly I will need to watch out for apostrophes in the English text, as these do appear.
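
One way to watch out for apostrophes without touching the English text is to look only inside the elements that hold Moses forms. The sketch below is an assumption-laden illustration: the split into Moses-bearing tags (pron, hyph, quote) and English-bearing tags (def, gloss, note, bibl) is a guess based on the sample entries, and the English strings in the sample are invented for the test. The quote is the one from the stative entry.

```python
import xml.etree.ElementTree as ET

MOSES_TAGS = {"pron", "hyph", "quote"}           # assumed to hold Moses text
ENGLISH_TAGS = {"def", "gloss", "note", "bibl"}  # assumed to hold English text
SUSPECT_CHARS = ("'", "\u2019")                  # typewriter and curly apostrophes

def has_suspect(text):
    return bool(text) and any(ch in text for ch in SUSPECT_CHARS)

def suspect_apostrophes(root):
    """Collect Moses text runs that contain an ordinary apostrophe."""
    hits = []

    def walk(elem, in_moses):
        if elem.tag in MOSES_TAGS:
            in_moses = True
        elif elem.tag in ENGLISH_TAGS:
            in_moses = False
        if in_moses and has_suspect(elem.text):
            hits.append(elem.text.strip())
        for child in elem:
            walk(child, in_moses)
            # Text after a child's end tag belongs to the current element.
            if in_moses and has_suspect(child.tail):
                hits.append(child.tail.strip())

    walk(root, False)
    return hits

QUOTE = "ʕʷə́c't ʔeɬxʷənčút"
SAMPLE = (
    "<entry><sense><def><seg>bird's nest</seg></def>"
    "<dicteg><cit><quote>" + QUOTE + "<gloss>don't know</gloss></quote>"
    "</cit></dicteg></sense></entry>"
)
hits = suspect_apostrophes(ET.fromstring(SAMPLE))
print(hits)
```

Only the Moses quote is reported; the apostrophes in the English def and gloss are left alone.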

I’m glad we are in agreement about all this. I look forward to hearing about the techniques for searching with XPath in oXygen: so far I have been using the Find and Replace function when I have needed to change things (for instance I did this in dealing with the cross-reference changes that I had to make), and it has been working well.

On commas and glottals

Posted by on 11 May 2007 in Activity log

Greg and I have just spent a little while with the IPA and Unicode specs, and I think we have a good strategy for this.

First, you're absolutely right that the IPA specifies a raised comma to indicate an ejective. Therefore these consonants should be written with a comma: c, k, p, q, t, epiglottal.

Sonorants/resonants should be written with the glottal diacritic in the data, to conform with IPA. We're not sure of the exact list of consonants that fall into this category, though -- we guess: belted l, barred lambda, m, n, r, w and y.

If we've got any consonants in the wrong sets above, let us know. Now, even though we're using the raised glottal diacritic in the data, that doesn't mean we have to show it in the output; it's trivial to replace the raised glottal with a raised comma in the output. This way, we end up with a traditional "orthography" without sacrificing IPA conformance in the data.

The next question is which raised comma character to use. Unicode has these two candidates:

\u02bc: MODIFIER LETTER APOSTROPHE, "glottal stop, glottalization, ejective"
\u0313: COMBINING COMMA ABOVE, "Americanist ejective or glottalization"

The first is clearly the best option, and it's the one specified by the IPA. The description of the second shows that it is frequently used in Americanist transcriptions. We want a modifier letter, not a diacritic above, so we should choose the first. Incidentally, on the question of why we used that character in the data, Greg says he based the choice on your handwritten alphabet, which clearly shows the w and y with commas above them, not following them.

I'll summarize by showing the character combos I believe we should have in the data, and those we will generate in the output routines (hoping that they'll show up correctly in the browser font):

In the data:

cʼ, kʼ, pʼ, qʼ, tʼ, ʕʼ
ɬˀ, ƛˀ, mˀ, nˀ, rˀ, wˀ, yˀ

In the output for readers:

cʼ, kʼ, pʼ, qʼ, tʼ, ʕʼ
ɬʼ, ƛʼ, mʼ, nʼ, rʼ, wʼ, yʼ
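
The output step described above amounts to a single character substitution. A minimal sketch, assuming the data uses U+02C0 (modifier letter glottal stop) for the raised glottal and U+02BC (modifier letter apostrophe) for the raised comma; the function name for_display is hypothetical:

```python
RAISED_GLOTTAL = "\u02c0"  # MODIFIER LETTER GLOTTAL STOP, as stored in the data
RAISED_COMMA = "\u02bc"    # MODIFIER LETTER APOSTROPHE, as shown to readers

def for_display(text):
    """Swap every raised glottal for a raised comma; the data is untouched."""
    return text.replace(RAISED_GLOTTAL, RAISED_COMMA)

data_row = "m\u02c0, n\u02c0, r\u02c0, w\u02c0, y\u02c0"
print(for_display(data_row))
```

Because ejectives already carry the raised comma in the data, they pass through unchanged.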

If you agree, the next question is how best to go back and normalize all the data. Some of this can be done programmatically or with simple search-and-replace -- for example, we can just replace all instances of comma-above with modifier-glottal. Similarly, we can replace sequences of cons+modifier-glottal with cons+modifier-raised-comma. Slightly more problematic will be any instances of REAL apostrophes in the data, since these will always be wrong in Moses text, but may be fine in English text. I can show you some techniques for searching with XPath in oXygen to find all those instances.
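
The two replacements just described could look something like this. It is a sketch only: the code points (U+0313, U+02C0, U+02BC) and the list of ejective base consonants are assumptions taken from this post, so the list is passed as a parameter and should be adjusted once the consonant sets are settled. Rounded consonants and the real-apostrophe problem are not handled here.

```python
import re

COMMA_ABOVE = "\u0313"     # COMBINING COMMA ABOVE
RAISED_GLOTTAL = "\u02c0"  # MODIFIER LETTER GLOTTAL STOP
RAISED_COMMA = "\u02bc"    # MODIFIER LETTER APOSTROPHE

# Assumed ejective bases from this post: c, k, p, q, t and U+0295.
EJECTIVE_BASES = "ckpqt\u0295"

def normalize(text, ejective_bases=EJECTIVE_BASES):
    """Apply the two replacements in order.

    1. comma-above -> raised glottal, everywhere
    2. ejective base + raised glottal -> ejective base + raised comma
    """
    text = text.replace(COMMA_ABOVE, RAISED_GLOTTAL)
    pattern = re.compile("([" + re.escape(ejective_bases) + "])" + RAISED_GLOTTAL)
    return pattern.sub(r"\1" + RAISED_COMMA, text)

print(ascii(normalize("w\u0313 c\u0313 m\u02c0 k\u02c0")))
```

Running the same function over a file that has already been normalized changes nothing, which makes it safe to reapply.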

Let me know what you think.

May 11, 2007

Posted by on 11 May 2007 in Activity log

Worked on the last few entries of the affix.xml file. In the DIST entry there are over 100 cross-references that need to be corrected manually. I have done about half of them.

This is an XML dictionary project based primarily on the materials compiled by the late M. Dale Kinkade during fifteen years of work in the 1960s and 1970s with more than a dozen native speakers of the language, but it also includes materials compiled by Ewa Czaykowska-Higgins in the early 1990s.
