Nxaʔamxcín (Moses) Dictionary Blog

October 13, 2010

Collation sequence failing in tests

Posted by on 13 Oct 2010 in Activity log

I built the collation file using the hard-coded Unicode sequence shown in the previous post and tested it, but it seems to fail completely, which is a little puzzling. I now suspect that hard-coding the actual Unicode characters into the text, rather than defining them ahead of time using escapes, is causing the problem. In any case, I need to add extra handlers for the acute and grave variants of all the vowels, so I have to go back into the code and work on it. I've made a start, but have run out of time for today.
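
One way to test the suspicion about hard-coded characters is to print the code points the program actually receives, and compare a literal typed into the source with the same string written using escapes. The following is a minimal sketch in Java; the class and method names (CodePointDump, dump) are hypothetical and are not part of the project code.

```java
final class CodePointDump {
    /** Lists the code points of s as U+XXXX values, separated by spaces. */
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Escaped form: a + COMBINING DOT BELOW, independent of file encoding.
        System.out.println(dump("a\u0323"));   // U+0061 U+0323
        // Precomposed form of the same letter.
        System.out.println(dump("\u1EA1"));    // U+1EA1
    }
}
```

If a literal pasted into the source file dumps differently from its escaped equivalent, the source file encoding (or the normalization form of the pasted text) is a likely suspect.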

Collation sequence: wrote an app to produce it

Posted by on 13 Oct 2010 in Activity log

This is the collation rules sequence, produced from the input string by a little app I wrote:

("< ʔ < a < ạ,ạ < c " + 
 " < c̣ < cʼ < ə < ə̣ " + 
 " < h < ḥ,ḥ < ḥʷ,ḥʷ < i " + 
 " < ị,ị < k < kʼ < kʷ " + 
 " < kʼʷ < l < ḷ,ḷ < lˀ " + 
 " < ḷˀ,ḷˀ < ɬ < ƛʼ < m " + 
 " < mˀ < n < nˀ < p " + 
 " < pʼ < q < qʼ < qʷ " + 
 " < qʼʷ < r < rˀ < s " + 
 " < ṣ,ṣ < t < tʼ < u " + 
 " < ụ,ụ < w < wˀ < x " + 
 " < xʷ < x̣ < x̣ʷ < y " + 
 " < yˀ < ʕ < ʕˀ < ʕʷ < ʕˀʷ ")

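The app itself is not shown in the post. As a rough sketch of how such a rule string could be generated, the Java below takes a space-separated alphabet and, for every letter whose composed (NFC) and decomposed (NFD) forms differ, emits both forms joined by a comma. The names (RuleStringBuilder, buildRules) are hypothetical, and putting the decomposed form first in each pair is an assumption, since the two forms look identical in the listing above.

```java
import java.text.Normalizer;

final class RuleStringBuilder {
    /**
     * Builds a rule string from a space-separated alphabet. Letters whose
     * composed (NFC) and decomposed (NFD) forms differ get both forms,
     * joined by a comma (decomposed first: an assumption of this sketch).
     */
    static String buildRules(String alphabet) {
        StringBuilder rules = new StringBuilder();
        for (String letter : alphabet.trim().split("\\s+")) {
            String nfd = Normalizer.normalize(letter, Normalizer.Form.NFD);
            String nfc = Normalizer.normalize(letter, Normalizer.Form.NFC);
            if (rules.length() > 0) {
                rules.append(' ');
            }
            rules.append("< ").append(nfd);
            if (!nfc.equals(nfd)) {
                rules.append(',').append(nfc);
            }
        }
        return rules.toString();
    }
}
```

Letters with no precomposed form, such as c with a dot below, come out as a single entry, which matches the pattern in the listing.
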
Full collation info

Posted by on 13 Oct 2010 in Activity log

EDIT: This is wrong: ignore it!

This is the Moses collation info. The three columns are the original sequence, the canonical-composed sequence, and the canonical-decomposed sequence:

Original    Composed                Decomposed
ʔ           \u0660                  \u0660
a           a                       a
ạ           \u7841                  a\u0803
c           c                       c
c̣           c\u0803                 c\u0803
cʼ          c\u0700                 c\u0700
ə           \u0601                  \u0601
ə̣           \u0601\u0803            \u0601\u0803
h           h                       h
ḥ           \u7717                  h\u0803
ḥʷ          \u7717\u0695            h\u0803\u0695
i           i                       i
ị           \u7883                  i\u0803
k           k                       k
kʼ          k\u0700                 k\u0700
kʷ          k\u0695                 k\u0695
kʼʷ         k\u0700\u0695           k\u0700\u0695
l           l                       l
ḷ           \u7735                  l\u0803
lˀ          l\u0704                 l\u0704
ḷˀ          \u7735\u0704            l\u0803\u0704
ɬ           \u0620                  \u0620
ƛʼ          \u0411\u0700            \u0411\u0700
m           m                       m
mˀ          m\u0704                 m\u0704
n           n                       n
nˀ          n\u0704                 n\u0704
p           p                       p
pʼ          p\u0700                 p\u0700
q           q                       q
qʼ          q\u0700                 q\u0700
qʷ          q\u0695                 q\u0695
qʼʷ         q\u0700\u0695           q\u0700\u0695
r           r                       r
rˀ          r\u0704                 r\u0704
s           s                       s
ṣ           \u7779                  s\u0803
t           t                       t
tʼ          t\u0700                 t\u0700
u           u                       u
ụ           \u7909                  u\u0803
w           w                       w
wˀ          w\u0704                 w\u0704
x           x                       x
xʷ          x\u0695                 x\u0695
x̣           x\u0803                 x\u0803
x̣ʷ          x\u0803\u0695           x\u0803\u0695
y           y                       y
yˀ          y\u0704                 y\u0704
ʕ           \u0661                  \u0661
ʕˀ          \u0661\u0704            \u0661\u0704
ʕʷ          \u0661\u0695            \u0661\u0695
ʕˀʷ         \u0661\u0704\u0695      \u0661\u0704\u0695

More progress with the Collator

Posted by on 13 Oct 2010 in Activity log

I think I can achieve what I want with a much simpler approach than I'd been contemplating; in fact, I can probably use the method outlined in this post, which I followed for MynDIR. MynDIR sorting doesn't (as far as I can see) have any irrational sequences, though, so it remains to be seen whether the method will do the job correctly here; I'm hopeful. The other potential problem is that I can't do canonical decomposition prior to the comparisons: the RuleBasedCollator class doesn't leave room for this, since it simply expresses a sequencing rule. However, I think I can build in handling for both decomposed and recomposed variants of the components, since the rules allow parallel sequences which are sorted together. The only remaining problem would be ill-formed sequences which match neither the decomposed nor the recomposed variants. If we do notice any bad sorting, though, we can fix the problem items, and even search-and-replace through the db to fix them globally.
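
To make the rule-based idea concrete, here is a minimal sketch using java.text.RuleBasedCollator on a small subset of the alphabet. Each multi-character letter is listed in the rules as one unit, so the collator should treat it as a single sorting element. The class name CollatorSketch, the subset of letters, and the test words are illustrative assumptions; the sketch covers only letters built from modifier characters and leaves out the composed/decomposed pairs.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

final class CollatorSketch {
    // Illustrative subset: q, q + U+02BC, q + U+02B7, q + U+02BC + U+02B7,
    // each given as one entry so that it sorts as a single letter.
    static final String RULES =
        "< a < q < q\u02BC < q\u02B7 < q\u02BC\u02B7 < r";

    /** Returns a copy of words, sorted with the rules above. */
    static List<String> sort(List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }
}
```

Separately, java.text.Collator does declare setDecomposition(int) with a Collator.CANONICAL_DECOMPOSITION mode; whether it behaves as needed for these sequences is untested here, so the parallel-variant rules may still be the safer route.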

Sort comparator: the actual sequence

Posted by on 13 Oct 2010 in Activity log

Working with a little NetBeans application, I've generated the required sequence after canonical decomposition has been performed, with codepoints greater than 127 escaped:

\u0660
a
a\u0803
c
c\u0803
c\u0700
\u0601
\u0601\u0803
h
h\u0803
h\u0803\u0695
i
i\u0803
k
k\u0700
k\u0695
k\u0700\u0695
l
l\u0803
l\u0704
l\u0803\u0704
\u0620
\u0411\u0700
m
m\u0704
n
n\u0704
p
p\u0700
q
q\u0700
q\u0695
q\u0700\u0695
r
r\u0704
s
s\u0803
t
t\u0700
u
u\u0803
w
w\u0704
x
x\u0695
x\u0803
x\u0803\u0695
y
y\u0704
\u0661
\u0661\u0704
\u0661\u0695
\u0661\u0704\u0695

This is a good start for my comparator...
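
For reference, a list like this can be regenerated with java.text.Normalizer. The sketch below is not the NetBeans application mentioned above; the names (DecomposedEscaper, decomposeAndEscape) are hypothetical. Java's `\u` escapes take four hexadecimal digits, whereas the values in the list above (0660, 0803, 0700) match the decimal values of U+0294, U+0323 and U+02BC. That reading is an inference, not something the post states, but for that reason the sketch emits hexadecimal.

```java
import java.text.Normalizer;

final class DecomposedEscaper {
    /** Decomposes s canonically (NFD) and escapes every char above 127. */
    static String decomposeAndEscape(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (char ch : nfd.toCharArray()) {
            if (ch > 127) {
                // A Java escape is a backslash, 'u', and four hexadecimal digits.
                out.append(String.format("\\u%04X", (int) ch));
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }
}
```

With this sketch, the glottal stop would come out as `\u0294` and a with dot below as `a\u0323`.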

October 12, 2010

Moses Sort Comparator code revived

Posted by on 12 Oct 2010 in Activity log

Way back when, I'd started work on a sort comparator for the Moses entries, but since then the actual transcription characters have been altered substantially, and in any case I think my approach was flawed. I've revived the code and moved it from Eclipse into NetBeans to get it finished. The original approach was based on comparing individual characters against a single sequence, but many of the letters cannot be sorted this way. For instance, the required order q qʼ qʷ qʼʷ cannot be produced by character-by-character comparison, because qʼʷ has to sort after qʷ even though qʼ sorts before it. What I'll have to do is convert each group of such items into e.g. q1 q2 q3 q4, and sort on those sequences rather than on individual characters. If I can reduce all of the characters to such simple ASCII representations, using numbers to force the correct sort order, then it should work fine.
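
The key-sequence idea can be sketched as follows. This is an illustration under stated assumptions, not the revived comparator code: the class name KeySequenceComparator, the two-digit rank encoding (instead of q1, q2 and so on), the small subset of letters, and the choice to sort unknown characters last are all hypothetical. Each word is decomposed, split into letters by longest match, and each letter is replaced by its rank in the alphabet.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class KeySequenceComparator implements Comparator<String> {
    // Illustrative subset of the alphabet, in sort order, in decomposed form.
    private static final List<String> LETTERS = Arrays.asList(
        "a", "a\u0323", "q", "q\u02BC", "q\u02B7", "q\u02BC\u02B7", "r");

    /** Turns a word into a string of two-digit ranks, one per letter. */
    static String sortKey(String word) {
        String text = Normalizer.normalize(word, Normalizer.Form.NFD);
        StringBuilder key = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            int best = -1;
            for (int i = 0; i < LETTERS.size(); i++) {
                String letter = LETTERS.get(i);
                // Prefer the longest letter that matches at this position.
                if (text.startsWith(letter, pos)
                        && (best < 0 || letter.length() > LETTERS.get(best).length())) {
                    best = i;
                }
            }
            if (best < 0) {
                key.append("99");   // assumption: unknown characters sort last
                pos++;
            } else {
                key.append(String.format("%02d", best));
                pos += LETTERS.get(best).length();
            }
        }
        return key.toString();
    }

    @Override
    public int compare(String a, String b) {
        return sortKey(a).compareTo(sortKey(b));
    }
}
```

Because the keys are plain ASCII digits, ordinary string comparison on the keys gives the intended order, and the full alphabet would only need a longer LETTERS list.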

Alphabetical order

Posted by on 12 Oct 2010 in Activity log

ʔ a ạ c c̣ cʼ ə ə̣ h ḥ ḥʷ i ị k kʼ kʷ kʼʷ l ḷ lˀ ḷˀ ɬ ƛʼ m mˀ n nˀ p pʼ q qʼ qʷ qʼʷ r rˀ s ṣ t tʼ u ụ w wˀ x xʷ x̣ x̣ʷ y yˀ ʕ ʕˀ ʕʷ ʕˀʷ

October 8, 2010

Ported web app to new Cocoon, and added new pages

Posted by on 08 Oct 2010 in Activity log

I've created a new standalone Cocoon webapp using the original code and our standard Cocoon build from last year. I had to make some simple changes, but everything ported quite well.

I've posted this web application on the Pear server, which will be its long-term home.

I've also added a couple of refinements, as planned:

There are now two links in the menu to the entries: one shows only the entries from files with status="complete" (as before); the other shows all the entries in the database. I think the second view will be useful to ECH and SMK as they work on the files.
There's now a Status link on the menu, which leads to a page showing information about all of the files currently in the database. It shows the filename, its status, the number of entries in that file, the date of the last change, and a list of all the changes that have been made. The final column in the table shows "To do" entries. These are harvested from any XML comment inside the <revisionDesc> element that contains the characters "TODO:". This gives us a single standard location and method for storing a TODO list for each file; when an action is completed, its TODO comment can be turned into a <change> entry.
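
The TODO harvesting described above can be sketched in plain Java with the standard DOM parser. This is only an illustration: the site itself runs on Cocoon, and the post does not show how the harvesting is implemented there. The class name TodoHarvester and the sample markup in the usage note are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TodoHarvester {
    /** Returns the text of every comment inside revisionDesc that contains "TODO:". */
    static List<String> harvest(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> todos = new ArrayList<>();
            NodeList descs = doc.getElementsByTagName("revisionDesc");
            for (int i = 0; i < descs.getLength(); i++) {
                collect(descs.item(i), todos);
            }
            return todos;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Walks the subtree and keeps comments that contain the TODO marker.
    private static void collect(Node node, List<String> todos) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.COMMENT_NODE) {
                String text = child.getNodeValue().trim();
                if (text.contains("TODO:")) {
                    todos.add(text);
                }
            } else {
                collect(child, todos);
            }
        }
    }
}
```

Given a file whose <revisionDesc> holds the comment `<!-- TODO: check glosses -->`, harvest would return a one-item list containing "TODO: check glosses"; comments elsewhere in the file, or without the marker, are skipped.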

I haven't yet turned off the old site, but I'd like to do that soon, once I get approval from ECH.

October 7, 2010

Subversion now working, everyone knows how to use it

Posted by on 07 Oct 2010 in Activity log

Set up the Subversion repository and configured the SVN client in oXygen on SK's and ECH's computers, then tested it, so we're all clear on the process. My next tasks:

Add the status values to the ODD and generate a new schema, enforcing one of the six values in SK's post (below this one).
Tweak the Cocoon app so that it only displays data from files which are completed.
Write a status page for the Cocoon app showing where each of the files is at, including as much info as possible from the <revisionDesc>.

RevisionDesc status values

Posted by on 07 Oct 2010 in Activity log

Here are the values we've decided on for revisionDesc status. Martin will add these to the database schema.

rescued - Only one file has this status. It contains rescued entries that are missing from the main alphabetical files and need to be added back in.

unedited - These files have not yet been worked on.

editing - These files are actively being worked on by SMK.

additions_needed - ECH or SMK needs to add more entries from MDK's file cards. Only very few files should have this status. See comments at the top of each file for more details.

edited - These files just need a final proofread and check of phonemicizations by ECH.

complete - These files are done! Martin will program the database website to only display the entries from files with status="complete".

A file might not move through these statuses in order. Some files may go back and forth from "editing" to "additions_needed" a couple of times before reaching "edited" status.

Nxaʔamxcín (Moses) Dictionary Blog

This is an XML dictionary project based primarily on the materials compiled by the late M. Dale Kinkade during fifteen years of work in the 1960s and 1970s with more than a dozen native speakers of the language, but it also includes materials compiled by Ewa Czaykowska-Higgins in the early 1990s.
