Added new temporary diagnostics report to the build

Per SK's request, added a new report listing entries that end with a specific sequence of characters.


zero morpheme character

Greg pointed out that we are using Ø (Latin Capital Letter O with Stroke, U+00D8) for our zero morpheme marker, rather than ∅ (Empty Set, U+2205). The latter is noted in the Unicode character map as the one used in linguistics to indicate a null morpheme or phonological zero.

We have at least been consistent in our use of the former! We may not have known the Empty Set character existed when we chose the other one in 2010, or it may have been a font-based choice. (I'm using Aboriginal Sans in Oxygen right now, and the Empty Set character doesn't display properly.)

Martin will add this change to his list of global changes to make when improving our current encoding, if we can be assured of fonts that include Empty Set along with all the other special characters we need.
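As a rough illustration of the two characters involved, here is a minimal Python sketch that prints their code points and Unicode names and swaps one for the other in a string. The function names are hypothetical, and this is not how the global change will actually be made; a blanket replacement like this would also touch any legitimate Ø elsewhere in the data (e.g. in a personal name), so a real change would need to be scoped to the morpheme markers.

```python
import unicodedata

O_WITH_STROKE = "\u00D8"  # Ø, the marker we currently use
EMPTY_SET = "\u2205"      # ∅, the character suggested instead


def describe(ch: str) -> str:
    """Return a 'U+XXXX NAME' description of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def to_empty_set(text: str) -> str:
    """Replace every O-with-stroke in text with the Empty Set character."""
    return text.replace(O_WITH_STROKE, EMPTY_SET)


if __name__ == "__main__":
    print(describe(O_WITH_STROKE))
    print(describe(EMPTY_SET))
    print(to_empty_set("n-CTL t-TR \u00D8-OBJ n-SUBJ"))
```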


Bug fixes in XML encoding; set up for XEP; fix in XSLT...

Diagnosed the borkedness of a borked XML file; fixed some XSLT; tried building the dictionary only to discover that of course XEP wasn't set up in Oxygen; reconfigured all the old hard-coded paths in build tasks; built the PDF; and more tweaks. Reminder to self: the diagnostics page erroneously pulls in an extra include for the personography, minus its file extension; needs fixing.


Feature structure validation

Beefed up the diagnostic processing of feature structures to add stats tables, revealing that many vals are never used. Food for thought. But no new errors revealed, which is good.


content of <label> elements and "translated hyphs"

In our discussion today about segments with more than one possible hyph, we also revisited how hyphs should look in print dictionary entries. We currently show the hyph followed by the "translated hyph" - e.g.:

[[x̣mánk-n-c • love-CTR-TR.1SgObj.3TrSbj]]

The final segment -c is shown to be composed of 3 morphemes, separated by periods: TR.1SgObj.3TrSbj
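To make the structure of that display string explicit, here is a small Python sketch that pairs each hyphen-separated segment with its label and splits the label on periods. It assumes the print form is always "hyph • translated hyph" with the same number of hyphen-separated parts on each side, and that periods only ever separate morpheme labels; the function name is hypothetical.

```python
def align_hyph(display: str) -> list[tuple[str, list[str]]]:
    """Pair each segment of the hyph with the labels of its translated hyph.

    Assumes the form 'seg-seg-seg • label-label-label', optionally wrapped
    in square brackets, with periods separating the labels within a segment.
    """
    body = display.strip().strip("[]")
    hyph, translated = (part.strip() for part in body.split("•"))
    segments = hyph.split("-")
    labels = translated.split("-")
    if len(segments) != len(labels):
        raise ValueError("hyph and translated hyph have different lengths")
    return [(seg, lab.split(".")) for seg, lab in zip(segments, labels)]


if __name__ == "__main__":
    for segment, parts in align_hyph("[[x̣mánk-n-c • love-CTR-TR.1SgObj.3TrSbj]]"):
        print(segment, parts)
```

For the example above, the last pair is the segment c with the three labels TR, 1SgObj and 3TrSbj.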

We are concerned that this will not be transparent to learner users of the dictionary. So we decided to update the <label> elements to include the first <pron> of the morpheme entry, where that would be helpful - e.g.


We need to think further about exactly how this should be implemented.

We should also check again how Montler 2012 represented syncretic morphemes in these "translated hyphs" in his root-based index. See photocopies in Print Dictionary Working Notes folder.

And we need to add an index of labels somewhere in the dictionary front matter!


Another diagnostic

Added a check for morphemes in completed files not pointing at existing entries. There are 222.
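The check itself lives in the diagnostics XSLT, but the idea can be sketched in Python. Everything about the markup here is an assumption for illustration only: that morphemes are `<m>` elements whose `corresp` attribute holds a pointer with an `m:` prefix, that entries are `<entry>` elements carrying an `xml:id`, and that the elements are in no namespace.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def dangling_morpheme_pointers(xml_text: str, prefix: str = "m:") -> list[str]:
    """Return pointers on <m> elements that match no entry xml:id.

    Assumed (hypothetical) markup: <m corresp="m:ID"> and <entry xml:id="ID">.
    """
    root = ET.fromstring(xml_text)
    entry_ids = {el.get(XML_ID) for el in root.iter("entry") if el.get(XML_ID)}
    missing = []
    for m in root.iter("m"):
        target = m.get("corresp", "")
        if target.startswith(prefix) and target[len(prefix):] not in entry_ids:
            missing.append(target)
    return missing


SAMPLE = """<body>
  <entry xml:id="love">
    <hyph><m corresp="m:love">a</m><m corresp="m:ghost">b</m></hyph>
  </entry>
</body>"""

if __name__ == "__main__":
    print(dangling_morpheme_pointers(SAMPLE))
```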


Another tweak to diagnostics

Per SMK, improved one of the diagnostics dealing with placeName entries and matching non-placeName entries.


Work on diagnostics

Picked up and extended the work JT has been doing on the diagnostics, adding six more features to complete the requirements as set out by SK. Waiting for the next build to complete so we can check that they're all working as expected.

Rebuilt the PDF with minor changes

Per ECH: faded out the draft watermark, commented out the Acknowledgements, and changed the date to 2016, then rebuilt the PDF.


Diagnostics Wishlist

17Aug16: JT has created a new diagnostics page (http://jenkins.hcmc.uvic.ca/job/moses/lastSuccessfulBuild/artifact/trunk/utilities/diagnostics.html) which looks for the following errors in complete, edited, and light-edited files:

-entries with glosses in cits

-entries with no gloss at all (and no name, persName, placeName, orgName or label)

-entries with no def

-entries containing more than one xr (These need to be concatenated into multiple refs within one xr.)

-pron:segs with parentheses in them (These need to be analyzed by a human, and made into either two seg type=n's, or two separate form elements, as necessary.)

-use of n-CTL t-TR Ø-OBJ n-SUBJ or n-CTL t-TR Ø-OBJ s-SUBJ on a word-final -n or -s if preceded by other transitive morphology (n-CTL, t-TR, stu, xit, ɬ-DIR, ɬ-EP, min, nun, tuɬ).

26Sep16: MDH has added the following additional diagnostics:

-entries with the same string gloss-tagged more than once

-placeName entries that "duplicate" regular entries in <form> (e.g. entry for "deadfall"), so SMK and ECH can review them and make sure they are handled consistently.

-refs in xrs containing non-phonemic characters, that is, anything BUT these characters:

ʔ a á à ạ c ʼ ə h ḥ ʷ i í ì ị k l ḷ ˀ ɬ ƛ m n p q r s ṣ t u ú ù ụ w x y ʕ combining acute accent, combining grave accent, combining dot below, whitespace.
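A minimal Python sketch of this kind of whitelist check follows. The actual diagnostic is in the XSLT; the names here are hypothetical. One assumption to flag: both the whitelist and the text are decomposed (NFD) before comparison, so a precomposed accented letter and its base-plus-combining-mark form are treated alike.

```python
import unicodedata

# The phonemic characters listed above, copied as a space-separated string.
PHONEMIC = "ʔ a á à ạ c ʼ ə h ḥ ʷ i í ì ị k l ḷ ˀ ɬ ƛ m n p q r s ṣ t u ú ù ụ w x y ʕ"

# Combining acute accent, combining grave accent, combining dot below.
COMBINING = {"\u0301", "\u0300", "\u0323"}

# Decompose the whitelist so accented letters contribute base + mark.
ALLOWED = set(unicodedata.normalize("NFD", PHONEMIC.replace(" ", ""))) | COMBINING


def non_phonemic_chars(ref_text: str) -> list[str]:
    """Return the sorted distinct characters not on the whitelist.

    Whitespace is always allowed; text is compared in NFD form (assumption).
    """
    decomposed = unicodedata.normalize("NFD", ref_text)
    return sorted({ch for ch in decomposed
                   if ch not in ALLOWED and not ch.isspace()})


if __name__ == "__main__":
    print(non_phonemic_chars("kan"))
    print(non_phonemic_chars("k-e"))
```

An empty list means the ref contains only phonemic characters; anything else is the set of offending characters to review.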

We also need to remember to deal with glosses which end in the English inflections -ing, -ed, -en or -s, as well as ablaut forms like "blow/blew". We can check these with the Find function once all the files are edited, and as we are proofing the print dictionary.
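If a scripted first pass turns out to be useful before the manual Find, something like the following Python sketch could flag candidate glosses. It is only a suffix match, so it will produce false positives (e.g. "glass", "seven") and cannot detect ablaut forms like "blew" at all; a human still has to review the hits. The function name is hypothetical.

```python
import re

# Matches a gloss ending in -ing, -ed, -en or -s after at least one word character.
INFLECTED = re.compile(r"\w(?:ing|ed|en|s)$", re.IGNORECASE)


def looks_inflected(gloss: str) -> bool:
    """Return True if the gloss ends in one of the listed English suffixes."""
    return bool(INFLECTED.search(gloss.strip()))


if __name__ == "__main__":
    for gloss in ["running", "walked", "eaten", "dogs", "blow", "blew", "glass"]:
        print(gloss, looks_inflected(gloss))
```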

MDH had also asked about diagnostics for what makes an entry "complete", to improve the statistics report. The answer is simply:

-no Not Yet Edited Comment, AND either
-a completed hyph with no m:UNASSIGNED, OR
-a root or stem <fs>
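Spelled out as a boolean rule, the definition reads as in the Python sketch below. The record type and its field names are hypothetical stand-ins for whatever the stylesheet actually tests; only the logic is taken from the definition above.

```python
from dataclasses import dataclass


@dataclass
class EntryFacts:
    """Hypothetical summary of one entry; field names are illustrative."""
    has_not_yet_edited_comment: bool
    has_completed_hyph: bool
    hyph_has_unassigned: bool  # hyph contains m:UNASSIGNED
    has_root_or_stem_fs: bool


def is_complete(entry: EntryFacts) -> bool:
    """Complete = no Not Yet Edited Comment AND
    (completed hyph with no m:UNASSIGNED OR a root or stem <fs>)."""
    if entry.has_not_yet_edited_comment:
        return False
    hyph_ok = entry.has_completed_hyph and not entry.hyph_has_unassigned
    return hyph_ok or entry.has_root_or_stem_fs


if __name__ == "__main__":
    print(is_complete(EntryFacts(False, True, False, False)))
    print(is_complete(EntryFacts(False, True, True, False)))
```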

For the time being (26Sep16), MDH has just moved the counts of total entries and entries with no Not Yet Edited Comment into the Statistics section of the new diagnostics page.

If EJD needs to edit the diagnostics further in the future, the file is: trunk/utilities/diagnostics.xsl

Once SMK and EJD have completed a first pass of editing all the files, and resolved all the problems identified by the diagnostics, MDH can add these checks to the Schematron so that we don't introduce new mistakes subsequently.

Nxaʔamxcín (Moses) Dictionary Blog

This is an XML dictionary project based primarily on the materials compiled by the late M. Dale Kinkade during fifteen years of work in the 1960s and 1970s with more than a dozen native speakers of the language, but it also includes materials compiled by Ewa Czaykowska-Higgins in the early 1990s.



