Nxaʔamxcín (Moses) Dictionary Blog

July 19, 2021

fixed xrs which pointed to persNames and non-persNames

Posted on 19 Jul 2021 in Activity log

We are currently suppressing all persNames when we generate the PDF print dictionary, as well as all roots/stems that are only roots/stems of persNames, and all xrs that point to persNames.

The XEP PDF build generates WARNING messages if there are any xrs which contain a ref to one or more persName entries AND one or more non-persName entries.

I have dealt with this issue today by:

- removing the Schematron rule that forbade more than one xr in an entry (this did not affect the PDF build), and

- reviewing the 9 entries that triggered warning messages and, as appropriate, splitting each xr into two: one with the persName ref(s) and one with the non-persName ref(s).

So now, when the PDF build runs, we just get INFO messages regarding the persNames, roots/stems of persNames, and xrs to persNames that are being suppressed.
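For illustration only, here is a minimal Python sketch of the kind of check that flags a "mixed" xr, i.e. one whose refs point to at least one persName entry and at least one non-persName entry. It is not the project's build code. The element and attribute names (`entry`, `type="persName"`, `id`, `ref target="#..."`) and the function name `find_mixed_xrs` are assumptions made for the example; the real markup may well differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the project's real element and attribute names may differ.
SAMPLE = """
<dictionary>
  <entry id="e1" type="persName"/>
  <entry id="e2"/>
  <entry id="e3">
    <xr><ref target="#e1"/><ref target="#e2"/></xr>
  </entry>
  <entry id="e4">
    <xr><ref target="#e1"/></xr>
    <xr><ref target="#e2"/></xr>
  </entry>
</dictionary>
"""


def find_mixed_xrs(root):
    """Return ids of entries holding an xr that mixes persName and other targets."""
    # Collect the ids of all entries marked (by assumption) as persNames.
    pers_ids = {e.get("id") for e in root.iter("entry") if e.get("type") == "persName"}
    mixed = []
    for entry in root.iter("entry"):
        for xr in entry.iter("xr"):
            targets = [r.get("target", "").lstrip("#") for r in xr.iter("ref")]
            # One boolean per ref: does it point at a persName entry?
            kinds = {t in pers_ids for t in targets}
            if kinds == {True, False}:
                mixed.append(entry.get("id"))
    return mixed


print(find_mixed_xrs(ET.fromstring(SAMPLE)))
```

In this sample, entry e3 is flagged because its single xr mixes both kinds of target, while e4 is not, since its refs have already been split into two separate xrs.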

Two of the INFO messages are a bit odd: for the entries "xWullqs" and "cʼəris_2", the message says that just a ref within an xr has been suppressed. But when I check the PDF, all three refs in the non-persName xr in both these entries have been preserved. This is correct, as they are not refs to persNames; I just don't know why the INFO message is coming up.

June 4, 2020

Work required on AboriginalSerif font collection

Posted on 04 Jun 2020 in Activity log

The PDF dictionary uses the Aboriginal Serif font, which lacks the characters subscript one (U+2081, ₁) and subscript two (U+2082, ₂). These are needed for some morpheme identifiers. I used FontForge to construct those characters based on the superscript versions, U+00B9 and U+00B2 respectively, in each of the four font variants (regular, bold, italic, bolditalic), and the results seem to be OK.

September 13, 2019

<m> tags around particles in <phr>s

Posted on 13 Sep 2019 in Activity log

An outstanding question from ECH, MDH & SMK's 2019-09-09 meeting was: What are the <m> tags for particles (and clitics) within <phr>s doing?

They are not currently processed in generating the website or the print dictionary. However, they ARE used in our statistical reports on clitic distribution (on the Diagnostics page), so don’t remove them!

For the website, the plan is for Martin to use the <m> tags in <phr>s in a new function, "Retrieve all example sentences containing this particle", in the future when the website is public. This will parallel the “other entries containing this morpheme” function, which retrieves all entries containing a given root or affix.

For the alphabetical listing in the print dictionary, we plan to display the "top 4" cits for all affix, particle and pronominal entries. These cits are (or will be made) part of the entries (i.e. not commented out), so they don't need to be found programmatically.

To Do items for this sub-project are noted in our Overall To Do List document, and detailed in docs/To Do Lists/cits_for_affixes_and_particles.odt.

For the affix indices in the print dictionary, we had thought about displaying ALL the cits for each affix, particle and pronominal, found programmatically. We have not implemented this yet. Do we want to? It would make these indices very long!

September 9, 2019

Plan for abstracting cits

Posted on 09 Sep 2019 in Activity log

The cits are scattered throughout the entries, and are therefore duplicated all over the place. In preparation for converting them to orthography, we need to centralize them. This is the basic plan:

1. Process all cits in all files so that each gets a unique id based on its bibl(s) plus a unique generated suffix, and is moved into a separate file called cits.xml. Replace each cit in place with a pointer of the form ptr target="c:ID".
2. Process the cits.xml file to order the cits by id, so that all cits with the same bibls are grouped together.
3. Check identity between the cits. In an XSLT transform of cits.xml, generate a new XSLT file containing a stack of very precise templates, each matching a specific pointer such as ptr target="c:ABCD". For each cit which is a duplicate of a preceding one, a) remove it from the cits file, and b) create a template to replace any pointers to it so that they point to the earliest preceding one.
4. Run that transformation over the collection. That should leave us with duplicate cits removed and all pointers normalized.
5. Add a diagnostic that checks for ptrs inside sense elements that don't point to a cit, and fix anything found.
6. Run a similarity metric over the cits to find any remaining close duplicates, and refer these to SK and ECH to diagnose.
7. Fix the website processing to handle the ptrs instead of in-place cits.
8. Fix the PDF processing to handle the ptrs.
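The de-duplication and pointer-normalization steps in the plan above can be sketched roughly as follows. This is a Python illustration only; the plan itself calls for XSLT, and nothing here is the project's code. The function names `dedupe_cits` and `normalize_pointers`, the sample ids, and the choice of whitespace-normalized text as the identity test are all assumptions.

```python
def dedupe_cits(cits):
    """cits: list of (cit_id, content) pairs in file order.

    Returns (kept, redirect): the cits that survive, and a map from each
    removed duplicate's id to the id of the earliest identical cit.
    """
    first_seen = {}
    kept = []
    redirect = {}
    for cit_id, content in cits:
        # Assumed identity test: equal after collapsing whitespace.
        key = " ".join(content.split())
        if key in first_seen:
            redirect[cit_id] = first_seen[key]
        else:
            first_seen[key] = cit_id
            kept.append((cit_id, content))
    return kept, redirect


def normalize_pointers(targets, redirect):
    """Rewrite 'c:ID' pointer targets so duplicates point at the surviving cit."""
    normalized = []
    for target in targets:
        cit_id = target.split(":", 1)[1]
        normalized.append("c:" + redirect.get(cit_id, cit_id))
    return normalized


# Hypothetical ids and placeholder content, for illustration only.
cits = [
    ("src1_001", "example sentence A"),
    ("src1_002", "example sentence B"),
    ("src1_003", "example  sentence A"),
]
kept, redirect = dedupe_cits(cits)
print(kept)
print(normalize_pointers(["c:src1_003", "c:src1_002"], redirect))
```

Here src1_003 is dropped as a duplicate of src1_001, and any pointer to it is redirected to c:src1_001. Close-but-not-identical cits would pass through untouched, which is why the plan has a separate similarity step.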

May 24, 2019

Consultation on Python diagnostics

Posted on 24 May 2019 in Activity log

SK and I determined that most of the Python diagnostics are no longer running, and one of the two that were running is obsolete, so it's now disabled. The remaining one is very flawed, but it's better than nothing, pending re-implementation in XSLT. 60 minutes.

February 6, 2019

Root-based index sort order

Posted on 06 Feb 2019 in Activity log

SK reported an issue with the sort order of entries in the root-based index. I dug into it, and discovered: The main Moses-to-English entries appear to be sorted in the correct order. First a sort key is created like this:

<xsl:variable name="sortKey" select="if (descendant::orth) then normalize-space(descendant::orth[1]) else normalize-space(string-join(for $s in descendant::pron[seg[@type='p']]/descendant::seg[@type='p'] return hcmc:createOrth($s), ''))"/>

In other words, if there's an orth it uses the orth, and if not, it creates an orth from all the descendant phonemic prons. Then it sorts the entries using the orthographic collation:

<xsl:sort select="@sortKey" collation="http://saxon.sf.net/collation?class=ca.uvic.hcmc.moses.MosesOrthographyCollation"/>

When it comes to processing the root-based index, we were doing something slightly different:

<xsl:sort select="if (descendant::orth) then descendant::orth[1] else hcmc:createOrth(descendant::pron[seg[@type='p']][1]/descendant::seg[@type='p'][1])" collation="http://saxon.sf.net/collation?class=ca.uvic.hcmc.moses.MosesPhonemicCollation"/>

In other words, we were using the Phonemic collation. I can't remember when/where/why we have both phonemic and orthographic collations -- there must have been a reason -- but I've now switched the root-based index sort so that it uses the orthographic one. That appears to fix the problem, but SK will check for any unwanted fallout.
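To make the two-stage logic above concrete (build a sort key from the orth or, failing that, from the phonemic segs, then sort with an orthography-specific collation), here is a rough Python analogue. It is only a sketch. The `ALPHABET` order is invented for the example and is not the order implemented by MosesOrthographyCollation, `create_orth` is a placeholder standing in for hcmc:createOrth, and the dict-based entries are an assumption.

```python
# Invented alphabet order for illustration; NOT the project's real collation.
ALPHABET = ["a", "c", "c\u02bc", "k", "k\u02b7", "n", "s", "t", "x"]
RANK = {g: i for i, g in enumerate(ALPHABET)}
# Try longer graphemes first so that "c\u02bc" is not read as "c" plus a stray mark.
GRAPHEMES = sorted(ALPHABET, key=len, reverse=True)


def collation_key(word):
    """Turn a word into a list of grapheme ranks, usable as a sort key."""
    key = []
    i = 0
    while i < len(word):
        for g in GRAPHEMES:
            if word.startswith(g, i):
                key.append(RANK[g])
                i += len(g)
                break
        else:
            # Unknown characters sort after every known grapheme.
            key.append(len(ALPHABET))
            i += 1
    return key


def create_orth(seg):
    """Placeholder for hcmc:createOrth; the real conversion is not shown here."""
    return seg


def sort_key(entry):
    """Use the first orth if present, otherwise build one from the phonemic segs."""
    if entry.get("orth"):
        return " ".join(entry["orth"][0].split())
    joined = "".join(create_orth(s) for s in entry.get("phonemic", []))
    return " ".join(joined.split())


# Made-up entries, for illustration only.
entries = [
    {"id": "e1", "orth": ["c\u02bcat"]},
    {"id": "e2", "phonemic": ["ka", "n"]},
    {"id": "e3", "orth": ["cat"]},
]
ordered = sorted(entries, key=lambda e: collation_key(sort_key(e)))
print([e["id"] for e in ordered])
```

The point of the sketch is that every index built from the same entries has to sort with the same collation; sorting one index with a different alphabet order is the kind of mismatch described above.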

January 11, 2019

Meeting and rewrite of presentation slides

Posted on 11 Jan 2019 in Activity log

Discussed our first draft at length, and I then rewrote my slides.

November 16, 2018

Tweak to order of root-based index component sorting

Posted on 16 Nov 2018 in Activity log

Per SK, switched the order of two morphemes and rebuilt the PDF. Took a while to figure out where to make the change, though.

August 24, 2018

Worked with SK on diagnostics code to replace old Python stuff

Posted on 24 Aug 2018 in Activity log

ED's convoluted Python/NLTK stuff for diagnostics just doesn't work on the new Jenkins server, and in any case it seems, as we look at it, that it could perfectly well have been done in XSLT. SK and I have made a start on figuring out how it works and converting it. It'll take a while, but lesson learned: don't let people use stuff just because they like it; keep the range of tech limited for any given project.

December 12, 2017

Added diagnostics build scenario to XPR file

Posted on 12 Dec 2017 in Activity log

So that ECH can work remotely without needing a network connection, we've added a build scenario for the diagnostics to the Oxygen project file, so that running the default scenario on any XML document actually runs the diagnostic process. It takes nearly ten minutes, but it's still a bit quicker than waiting for Jenkins and it can be done without a network connection.

Nxaʔamxcín (Moses) Dictionary Blog

This is an XML dictionary project based primarily on the materials compiled by the late M. Dale Kinkade during fifteen years of work in the 1960’s and 1970’s with more than a dozen native speakers of the language, but it also includes materials compiled by Ewa Czaykowska-Higgins in the early 1990’s.
