The PDF dictionary uses the Aboriginal Serif font, which lacks the characters subscript one (U+2081, ₁) and subscript two (U+2082, ₂), which are needed for some morpheme identifiers. I used FontForge to construct those glyphs from the superscript versions (U+00B9 and U+00B2 respectively) in each of the four font variants (regular, bold, italic, bolditalic), and the results look OK.
An outstanding question from ECH, MDH & SMK's 2019-09-09 meeting was: What are the <m> tags for particles (and clitics) within <phr>s doing?
They are not currently processed in generating the website or the print dictionary. However, they ARE used in our statistical reports on clitic distribution (on the Diagnostics page), so don’t remove them!
For the website, the plan is for Martin to use the <m> tags in <phr>s for a new function, "Retrieve all example sentences containing this particle", once the website is public. This will parallel the "other entries containing this morpheme" function, which retrieves all entries containing a given root or affix.
For the alphabetical listing in the print dictionary, we plan to display the "top 4" cits for all affix, particle and pronominal entries. These cits are (or will be made) part of the entries (i.e. not commented out), so they don't need to be found programmatically.
To Do items for this sub-project are noted in our Overall To Do List document, and detailed in docs/To Do Lists/cits_for_affixes_and_particles.odt.
For the affix indices in the print dictionary, we had thought about displaying ALL the cits for each affix, particle and pronominal, found programmatically. We have not implemented this yet. Do we want to? It would make these indices very long!
The cits are scattered throughout the entries, and are therefore duplicated all over the place. In preparation for converting them to orthography, we need to centralize them. This is the basic plan:
- Process all cits in all files so that each gets a unique id based on its bibl(s) plus a generated suffix, and is moved into a separate file called cits.xml. Replace each cit in place with a pointer to it (ptr target="c:ID").
- Process the cits.xml file to order by id, so that all cits with the same bibls are grouped together.
- Check for identity between the cits. In an XSLT transform of cits.xml, generate a new XSLT file which will contain a stack of very precise templates matching ptr target="c:ABCD". For each cit which duplicates a preceding one, a) nuke it from the cits file, and b) create a template that rewrites any pointers to it so that they point to the earliest preceding copy.
- Run that transformation over the collection. That should give us a situation where duplicate cits have been removed, and all pointers normalized.
- Add a diagnostic that checks for ptrs inside sense elements that don't point to a cit, and fix anything found.
- Run a similarity metric over the cits to find any remaining close duplicates, and refer these to SK and ECH to diagnose.
- Fix the website processing to handle the ptrs instead of in-place cits.
- Fix the PDF processing to handle the ptrs.
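The core of the plan above — assign ids, collapse exact duplicates onto the earliest copy, build a pointer remap, then flag near-duplicates for human review — can be sketched in Python. This is an illustration only: the id scheme and function names here are invented, and the real pipeline will be XSLT over the TEI files, not Python over strings.

```python
import difflib
import hashlib
from collections import OrderedDict

def assign_id(bibl, serial):
    """Build a cit id from its bibl plus a generated serial suffix
    (a hypothetical scheme; the real ids may be formed differently)."""
    stem = hashlib.md5(bibl.encode("utf-8")).hexdigest()[:8]
    return "c:%s_%04d" % (stem, serial)

def deduplicate(cits):
    """cits: ordered list of (id, text) pairs, as pulled from cits.xml.
    Returns (kept, remap): kept maps surviving ids to their text; remap
    sends each duplicate id to the earliest id with identical text."""
    kept = OrderedDict()
    first_for_text = {}
    remap = {}
    for cid, text in cits:
        key = " ".join(text.split())          # normalize whitespace before comparing
        if key in first_for_text:
            remap[cid] = first_for_text[key]  # duplicate: redirect pointers to the first copy
        else:
            first_for_text[key] = cid
            kept[cid] = text
    return kept, remap

def near_duplicates(kept, threshold=0.9):
    """Flag pairs of surviving cits whose texts are suspiciously similar,
    for human review (SK and ECH) rather than automatic merging."""
    ids = list(kept)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            ratio = difflib.SequenceMatcher(None, kept[a], kept[b]).ratio()
            if ratio >= threshold:
                pairs.append((a, b, round(ratio, 2)))
    return pairs
```

The remap table corresponds to the generated stack of ptr-rewriting templates: applying it across the collection normalizes every pointer to the earliest surviving cit.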
SK and I determined that most of the Python diagnostics are no longer running; of the two that were still running, one is obsolete, so it's now disabled. The remaining one is very flawed, but it's better than nothing pending re-implementation in XSLT. 60 minutes.
SK reported an issue with the sort order of entries in the root-based index. I dug into it and discovered that the main Moses-to-English entries are sorted in the correct order. First a sort key is created like this:
<xsl:variable name="sortKey" select="if (descendant::orth) then normalize-space(descendant::orth) else normalize-space(string-join(for $s in descendant::pron[seg[@type='p']]/descendant::seg[@type='p'] return hcmc:createOrth($s), ''))"/>
In other words, if there's an orth it uses the orth, and if not, it creates an orth from all the descendant phonemic prons. Then it sorts the entries using the orthographic collation:
<xsl:sort select="@sortKey" collation="http://saxon.sf.net/collation?class=ca.uvic.hcmc.moses.MosesOrthographyCollation"/>
When it comes to processing the root-based index, we were doing something slightly different:
<xsl:sort select="if (descendant::orth) then descendant::orth else hcmc:createOrth(descendant::pron[seg[@type='p']]/descendant::seg[@type='p'])" collation="http://saxon.sf.net/collation?class=ca.uvic.hcmc.moses.MosesPhonemicCollation"/>
In other words, we were using the Phonemic collation. I can't remember when/where/why we have both phonemic and orthographic collations -- there must have been a reason -- but I've now switched the root-based index sort so that it uses the orthographic one. That appears to fix the problem, but SK will check for any unwanted fallout.
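The practical difference between the two collations is just which alphabet ordering drives the sort key. A toy Python sketch of collation-keyed sorting (the alphabet string below is invented for illustration; the real orderings live in the MosesOrthographyCollation and MosesPhonemicCollation Java classes):

```python
# Invented alphabet ordering for illustration only.
ORTH_ALPHABET = "achklmnpqstuw"
RANK = {ch: i for i, ch in enumerate(ORTH_ALPHABET)}

def collation_key(word):
    # Characters outside the alphabet sort after everything else.
    return [RANK.get(ch, len(ORTH_ALPHABET)) for ch in word]

entries = ["qa", "ha", "ck"]
print(sorted(entries, key=collation_key))  # → ['ck', 'ha', 'qa']
```

Swapping in a different alphabet string (i.e. the phonemic ordering) yields a different sort, which is exactly why mixing the two collations across indexes produced the inconsistency SK spotted.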
Discussed our first draft at length, and I then rewrote my slides.
Per SK, switched the order of two morphemes and rebuilt the PDF. Took a while to figure out where to make the change, though.
ED's convoluted Python/NLTK diagnostics code just doesn't work on the new Jenkins server, and in any case, as we look at it, it could perfectly well have been done in XSLT, so SK and I have made a start on figuring out how it works and converting it. It'll take a while, but lesson learned: don't let people use a technology just because they like it; keep the range of tech limited for any given project.
So that ECH can work remotely without a network connection, we've added a build scenario for the diagnostics to the Oxygen project file: running the default scenario on any XML document now runs the diagnostic process locally. It takes nearly ten minutes, but that's still a bit quicker than waiting for Jenkins.
Met with AP, linguist and app developer, and shared ideas on dictionary interfaces, data-entry, and outputs.