I built the collation file using the hard-coded Unicode sequence shown in the previous post and tested it, but it fails completely, which is a little puzzling. I now suspect that hard-coding the actual Unicode characters into the text, rather than defining them ahead of time using escapes, is causing the problem. In any case, I have to add extra handlers for the acute and grave variants of all the vowels, so I have to go back into the code and work on it. I've made a start, but I've run out of time for today.
This is the collation rules sequence, produced by a little app I wrote from the input string:
("< ʔ < a < ạ,ạ < c " + " < c̣ < cʼ < ə < ə̣ " + " < h < ḥ,ḥ < ḥʷ,ḥʷ < i " + " < ị,ị < k < kʼ < kʷ " + " < kʼʷ < l < ḷ,ḷ < lˀ " + " < ḷˀ,ḷˀ < ɬ < ƛʼ < m " + " < mˀ < n < nˀ < p " + " < pʼ < q < qʼ < qʷ " + " < qʼʷ < r < rˀ < s " + " < ṣ,ṣ < t < tʼ < u " + " < ụ,ụ < w < wˀ < x " + " < xʷ < x̣ < x̣ʷ < y " + " < yˀ < ʕ < ʕˀ < ʕʷ < ʕˀʷ ")
EDIT: This is wrong: ignore it!
This is the Moses collation info: original sequence, canonical-composed sequence, and canonical-decomposed sequence:
ʔ \u0660 \u0660 a a a ạ \u7841 a\u0803 c c c c̣ c\u0803 c\u0803 cʼ c\u0700 c\u0700 ə \u0601 \u0601 ə̣ \u0601\u0803 \u0601\u0803 h h h ḥ \u7717 h\u0803 ḥʷ \u7717\u0695 h\u0803\u0695 i i i ị \u7883 i\u0803 k k k kʼ k\u0700 k\u0700 kʷ k\u0695 k\u0695 kʼʷ k\u0700\u0695 k\u0700\u0695 l l l ḷ \u7735 l\u0803 lˀ l\u0704 l\u0704 ḷˀ \u7735\u0704 l\u0803\u0704 ɬ \u0620 \u0620 ƛʼ \u0411\u0700 \u0411\u0700 m m m mˀ m\u0704 m\u0704 n n n nˀ n\u0704 n\u0704 p p p pʼ p\u0700 p\u0700 q q q qʼ q\u0700 q\u0700 qʷ q\u0695 q\u0695 qʼʷ q\u0700\u0695 q\u0700\u0695 r r r rˀ r\u0704 r\u0704 s s s ṣ \u7779 s\u0803 t t t tʼ t\u0700 t\u0700 u u u ụ \u7909 u\u0803 w w w wˀ w\u0704 w\u0704 x x x xʷ x\u0695 x\u0695 x̣ x\u0803 x\u0803 x̣ʷ x\u0803\u0695 x\u0803\u0695 y y y yˀ y\u0704 y\u0704 ʕ \u0661 \u0661 ʕˀ \u0661\u0704 \u0661\u0704 ʕʷ \u0661\u0695 \u0661\u0695 ʕˀʷ \u0661\u0704\u0695 \u0661\u0704\u0695
I think I can actually achieve what I want to achieve using a much simpler approach than I'd been contemplating; in fact, I can probably use the method outlined in this post, which I followed for MynDIR. MynDIR sorting doesn't (as far as I can see) have any irrational sequences, though, so it remains to be seen whether it will actually do the job correctly; however, I'm hopeful. The other potential problem is that I can't do canonical decomposition prior to the comparisons. The RuleBasedCollator class doesn't leave room for this; it simply expresses a sequencing rule. However, I think I can build in handling for both decomposed and recomposed variants of the components, since the rules allow parallel sequences which are sorted together. The only problem then would be ill-configured sequences, which match neither decomposed nor recomposed sequences. If we do notice any bad sorting, though, we can fix the problem items, and even search-and-replace through the db to fix them globally.
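As a rough illustration of the idea, here is a minimal sketch of a `RuleBasedCollator` built from a small fragment of the sequence. It is not the project's code: the class name `MosesCollatorSketch`, the shortened rule string, and the wrapping comparator are all assumptions made for the example. It also assumes the comparison can be wrapped, in which case `java.text.Normalizer` can decompose both strings before they reach the collator, so the rules only need the decomposed forms.

```java
import java.text.Normalizer;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MosesCollatorSketch {
    // Hypothetical fragment of the sequence, in decomposed form with hex escapes:
    // 0323 = combining dot below, 02BC = modifier apostrophe, 02B7 = modifier small w.
    static final String RULES =
        "< a < a\u0323 < c < q < q\u02BC < q\u02B7 < q\u02BC\u02B7 < r";

    static Comparator<String> comparator() {
        final RuleBasedCollator collator;
        try {
            collator = new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("Bad collation rules", e);
        }
        // Decompose both sides first, so precomposed and decomposed input compare alike.
        return (a, b) -> collator.compare(
            Normalizer.normalize(a, Normalizer.Form.NFD),
            Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    // Convenience helper for trying the comparator out on a handful of strings.
    static List<String> sorted(String... words) {
        List<String> list = new ArrayList<>(Arrays.asList(words));
        list.sort(comparator());
        return list;
    }
}
```

If wrapping the comparison turns out not to be possible in the real setup, `java.text.Collator` also has a `setDecomposition` method that accepts `Collator.CANONICAL_DECOMPOSITION`, which may be worth testing before listing both variants of every item in the rules.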
Working with a little NetBeans application, I've generated the required sequence after canonical decomposition has been performed, with codepoints greater than 127 escaped:
\u0660 a a\u0803 c c\u0803 c\u0700 \u0601 \u0601\u0803 h h\u0803 h\u0803\u0695 i i\u0803 k k\u0700 k\u0695 k\u0700\u0695 l l\u0803 l\u0704 l\u0803\u0704 \u0620 \u0411\u0700 m m\u0704 n n\u0704 p p\u0700 q q\u0700 q\u0695 q\u0700\u0695 r r\u0704 s s\u0803 t t\u0700 u u\u0803 w w\u0704 x x\u0695 x\u0803 x\u0803\u0695 y y\u0704 \u0661 \u0661\u0704 \u0661\u0695 \u0661\u0704\u0695
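One thing to check: the numbers in the escapes above look like decimal code points (ʔ is U+0294, which is 660 in decimal, and the combining dot below is U+0323, which is 803), whereas a Java backslash-u escape takes exactly four hexadecimal digits. The following is a minimal sketch of how the decompose-and-escape step could be written so that it emits hex; the class and method names are made up for the example and are not taken from the actual NetBeans app.

```java
import java.text.Normalizer;

class EscapeSketch {
    // Canonically decompose the input (NFD), then write every code unit above 127
    // as a backslash-u escape with four lowercase hex digits. ASCII passes through.
    static String decomposeAndEscape(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (char ch : nfd.toCharArray()) {
            if (ch > 127) {
                out.append(String.format("\\u%04x", (int) ch));
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }
}
```

Run over each item in the sequence, this would give the dotted a as an "a" followed by the escape for 0323, and ʔ as the escape for 0294.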
This is a good start for my comparator...
Way back when, I'd started work on a sort comparator for the Moses entries, but since then the actual transcription characters have been altered substantially, and in any case I think my approach was flawed. I've revived the code and moved it from Eclipse into NetBeans to get it finished. The original approach was based on comparing individual characters against a single sequence, but many of the items cannot be sorted this way. For instance, q qʼ qʷ qʼʷ cannot be ordered character by character, because each item after the first is a multi-character unit. What I'll have to do is convert each group of such items into e.g. q1 q2 q3 q4, and sort based on those sequences rather than on individual characters. If I can reduce all of the characters to such simple ASCII representations, using numbers to force the correct sort order, then it should work fine.
ʔ a ạ c c̣ cʼ ə ə̣ h ḥ ḥʷ i ị k kʼ kʷ kʼʷ l ḷ lˀ ḷˀ ɬ ƛʼ m mˀ n nˀ p pʼ q qʼ qʷ qʼʷ r rˀ s ṣ t tʼ u ụ w wˀ x xʷ x̣ x̣ʷ y yˀ ʕ ʕˀ ʕʷ ʕˀʷ
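The unit-by-unit idea can be sketched roughly as follows. This is only an illustration under some assumptions: the class name `UnitOrderSketch` is invented, the list holds just a fragment of the sequence, and each unit is replaced by its position in the list (rather than by a literal string such as q1 or q2), always taking the longest match. Sending unknown characters to the end is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class UnitOrderSketch {
    // Hypothetical fragment of the alphabet, in sort order.
    // Multi-character units (02BC = modifier apostrophe, 02B7 = modifier w) are single entries.
    static final List<String> UNITS = Arrays.asList(
        "a", "k", "k\u02BC", "k\u02B7", "k\u02BC\u02B7",
        "q", "q\u02BC", "q\u02B7", "q\u02BC\u02B7", "r");

    // Replace each unit in the word with its index in UNITS, preferring the longest match.
    static List<Integer> toKey(String word) {
        List<Integer> key = new ArrayList<>();
        int pos = 0;
        while (pos < word.length()) {
            int best = -1;
            int bestLen = 0;
            for (int i = 0; i < UNITS.size(); i++) {
                String unit = UNITS.get(i);
                if (unit.length() > bestLen && word.startsWith(unit, pos)) {
                    best = i;
                    bestLen = unit.length();
                }
            }
            if (best < 0) {
                // Unknown character: sort it after every known unit (an assumption).
                key.add(UNITS.size() + word.charAt(pos));
                pos++;
            } else {
                key.add(best);
                pos += bestLen;
            }
        }
        return key;
    }

    // Compare two words by their unit keys; a shorter word sorts before its extensions.
    static final Comparator<String> ORDER = (a, b) -> {
        List<Integer> ka = toKey(a);
        List<Integer> kb = toKey(b);
        for (int i = 0; i < Math.min(ka.size(), kb.size()); i++) {
            int c = Integer.compare(ka.get(i), kb.get(i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(ka.size(), kb.size());
    };
}
```

Because the comparison works on whole units, a word beginning with q followed by a vowel sorts before any word beginning with qʼ, which a plain character-by-character comparison would not guarantee.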
I've created a new standalone Cocoon webapp using the original code and our standard Cocoon build from last year. I had to make some simple changes, but everything ported quite well.
I've posted this web application on the Pear server, which will be its long-term home.
I've also added a couple of refinements, as planned:
- There are now two links in the menu to the entries, one of which shows only those from files with status="completed" (as before); the other link shows all the entries in the database. I think the second view will be useful to ECH and SMK as they work on the files.
- There's now a Status link on the menu, which leads to a page showing information about all of the files currently in the database. It shows the filename, its status, the number of entries in that file, the date of the last change, and a list of all the changes that have been done. The final column in the table shows "To do" entries. These are harvested from any XML comment inside the <revisionDesc> element that contains the characters "TODO:". This gives us a single standard location and method of storing a TODO list for each file; when actions are completed, a TODO comment can be turned into a <change> entry.
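For illustration, the TODO harvesting could look something like the sketch below. This is not the Cocoon pipeline itself: it is a standalone Java example using the standard DOM and XPath APIs, the class name `TodoHarvestSketch` is invented, and matching on `local-name()` is an assumption made so that the query works whether or not the files declare a namespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TodoHarvestSketch {
    // Return the text following "TODO:" in each comment found inside <revisionDesc>.
    static List<String> todos(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Select only comments that sit inside revisionDesc and contain "TODO:".
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                "//*[local-name()='revisionDesc']//comment()[contains(., 'TODO:')]",
                doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                String text = hits.item(i).getNodeValue();
                out.add(text.substring(text.indexOf("TODO:") + 5).trim());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the file", e);
        }
    }
}
```

Comments elsewhere in the file, and comments in `<revisionDesc>` without the "TODO:" marker, are ignored, which matches the convention described above.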
I haven't yet turned off the old site, but I'd like to do that soon, once I get approval from ECH.
Set up the Subversion repository, and the SVN client in oXygen on SK's and ECH's computers, and tested it, so we're all clear on the process. My next tasks:
- Add the status values to the ODD and generate a new schema, enforcing one of the six values in SK's post (below this one).
- Tweak the Cocoon app so that it only displays data from files which are completed.
- Write a status page for the Cocoon app showing where each of the files is at, including as much info as possible from the <revisionDesc>.
Here are the values we've decided on for revisionDesc status. Martin will add these to the database schema.
rescued - Only one file has this status. It contains rescued entries that are missing from the main alphabetical files and need to be added back in.
unedited - These files have not yet been worked on.
editing - These files are actively being worked on by SMK.
additions_needed - ECH or SMK needs to add more entries from MDK's file cards. Only very few files should have this status. See comments at the top of each file for more details.
edited - These files just need a final proofread and check of phonemicizations by ECH.
complete - These files are done! Martin will program the database website to only display the entries from files with status="complete".
A file might not move through these statuses in order. Some files may go back and forth from "editing" to "additions_needed" a couple of times before reaching "edited" status.