The PDF dictionary uses the Aboriginal Serif font, which lacks the characters subscript one (U+2081, ₁) and subscript two (U+2082, ₂). These are needed for some morpheme identifiers. I used FontForge to construct those characters based on the superscript versions (U+00B9 and U+00B2 respectively) in each of the four font variants (regular, bold, italic, bolditalic), and the results look OK.
Category: "Activity log"
An outstanding question from ECH, MDH & SMK's 2019-09-09 meeting was: What are the <m> tags for particles (and clitics) within <phr>s doing?
They are not currently processed in generating the website or the print dictionary. However, they ARE used in our statistical reports on clitic distribution (on the Diagnostics page), so don’t remove them!
For the website, the plan is for Martin to use the <m> tags in <phr>s in a new function, "Retrieve all example sentences containing this particle", in the future when the website is public. This will parallel the “other entries containing this morpheme” function, which retrieves all entries containing a given root or affix.
For the alphabetical listing in the print dictionary, we plan to display the "top 4" cits for all affix, particle and pronominal entries. These cits are (or will be made) part of the entries (i.e. not commented out), so don't need to be found programmatically.
To Do items for this sub-project are noted in our Overall To Do List document, and detailed in docs/To Do Lists/cits_for_affixes_and_particles.odt.
For the affix indices in the print dictionary, we had thought about displaying ALL the cits for each affix, particle and pronominal, found programmatically. We have not implemented this yet. Do we want to? It would make these indices very long!
The cits are scattered throughout the entries, and are therefore duplicated all over the place. In preparation for converting them to orthography, we need to centralize them. This is the basic plan:
- Process all cits in all files so that each gets a unique id based on its bibl[s] plus a generated suffix, and is moved into a separate file called cits.xml. Replace each cit in situ with a <ptr target="c:ID"/> element.
- Process the cits.xml file to order by id, so that all cits with the same bibls are grouped together.
- Check identity between the cits. In an XSLT transform of cits.xml, generate a new XSLT file which will contain a stack of very precise templates matching ptr target="c:ABCD". For each cit which is a duplicate of a preceding one, a) nuke it from the cit file, and b) create a template to replace any pointers to it such that they point to the earliest preceding one.
- Run that transformation over the collection. That should give us a situation where duplicate cits have been removed, and all pointers normalized.
- Add a diagnostic that checks for ptrs inside sense elements that don't point to a cit, and fix anything found.
- Do a similarity metric on cits to find any more close duplicates, and refer these to SK and ECH to diagnose.
- Fix the website processing to handle the ptrs instead of in-place cits.
- Fix the PDF processing to handle the ptrs.
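The dedup step above could be sketched roughly like this: a transform over cits.xml that emits a second, generated stylesheet full of precise ptr-redirect templates. This is only a sketch under assumptions -- it groups cits crudely by string value, ignores the TEI namespace, and the id scheme is hypothetical:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xso="http://www.w3.org/1999/XSL/TransformAlias">
  <xsl:namespace-alias stylesheet-prefix="xso" result-prefix="xsl"/>
  <xsl:template match="/">
    <xso:stylesheet version="2.0">
      <!-- identity template: copy everything else unchanged -->
      <xso:template match="@*|node()">
        <xso:copy><xso:apply-templates select="@*|node()"/></xso:copy>
      </xso:template>
      <!-- one precise template per duplicate cit, redirecting its
           pointers to the earliest copy in the group -->
      <xsl:for-each-group select="//cit" group-by="normalize-space(.)">
        <xsl:variable name="first" select="current-group()[1]"/>
        <xsl:for-each select="current-group()[position() gt 1]">
          <xso:template match="ptr[@target = 'c:{@xml:id}']">
            <ptr target="c:{$first/@xml:id}"/>
          </xso:template>
        </xsl:for-each>
      </xsl:for-each-group>
    </xso:stylesheet>
  </xsl:template>
</xsl:stylesheet>
```

Running the generated stylesheet over the collection would then normalize all the pointers in one pass.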
SK and I determined that most of the Python diagnostics are no longer running, and one of the two that were is obsolete, so it's now disabled. The remaining one is very flawed, but it's better than nothing, pending re-implementation in XSLT. 60 minutes.
SK reported an issue with the sort order of entries in the root-based index. I dug into it, and discovered: The main Moses-to-English entries appear to be sorted in the correct order. First a sort key is created like this:
<xsl:variable name="sortKey" select="if (descendant::orth) then normalize-space(descendant::orth[1]) else normalize-space(string-join(for $s in descendant::pron[seg[@type='p']]/descendant::seg[@type='p'] return hcmc:createOrth($s), ''))"/>
In other words, if there's an orth it uses the orth, and if not, it creates an orth from all the descendant phonemic prons. Then it sorts the entries using the orthographic collation:
<xsl:sort select="@sortKey" collation="http://saxon.sf.net/collation?class=ca.uvic.hcmc.moses.MosesOrthographyCollation"/>
When it comes to processing the root-based index, we were doing something slightly different:
<xsl:sort select="if (descendant::orth) then descendant::orth[1] else hcmc:createOrth(descendant::pron[seg[@type='p']][1]/descendant::seg[@type='p'][1])" collation="http://saxon.sf.net/collation?class=ca.uvic.hcmc.moses.MosesPhonemicCollation"/>
In other words, we were using the Phonemic collation. I can't remember when/where/why we have both phonemic and orthographic collations -- there must have been a reason -- but I've now switched the root-based index sort so that it uses the orthographic one. That appears to fix the problem, but SK will check for any unwanted fallout.
Discussed our first draft at length, and I then rewrote my slides.
Per SK, switched the order of two morphemes and rebuilt the PDF. Took a while to figure out where to make the change, though.
ED's convoluted Python/NLTK stuff for diagnostics just doesn't work on the new Jenkins server, and in any case it seems, as we look at it, that it could perfectly well have been done in XSLT, so SK and I have made a start on figuring out how it works and converting it. It'll take a while, but lesson learned -- don't let people use a technology just because they like it; keep the range of tech limited for any given project.
So that ECH can work remotely without needing a network connection, we've added a build scenario for the diagnostics to the Oxygen project file, so that running the default scenario on any XML document actually runs the diagnostic process. It takes nearly ten minutes, but that's still a bit quicker than waiting for Jenkins.
Met with AP, linguist and app developer, and shared ideas on dictionary interfaces, data-entry, and outputs.
Our PDF build is dependent on XEP, which is installed on my desktop, so up to now I've needed to be here to build it (since it was a scenario run from inside Oxygen). I've now converted this into an ant task which can be run remotely at the command line. Lesson learned at some cost of time: you can't use this:
<arg value="-fo ${foFile}"/>
Instead, you have to use this:
<arg line="-fo ${foFile}"/>
Otherwise XEP doesn't find the FO file, and assumes the fo is coming from stdin; it then complains that the root element is not fo:root.
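A sketch of what the ant task might look like, assuming XEP's command-line launcher; the xep.home property and file paths here are hypothetical:

```xml
<target name="pdf">
  <exec executable="${xep.home}/xep" failonerror="true">
    <!-- arg line tokenizes on spaces, yielding two arguments ("-fo" and
         the filename); arg value would pass "-fo file.fo" as a single
         argument, which XEP doesn't recognize, so it falls back to
         reading the FO from stdin and then complains about fo:root. -->
    <arg line="-fo ${foFile}"/>
    <arg line="-pdf ${pdfFile}"/>
  </exec>
</target>
```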
Met with SK and ECH and discussed a number of remaining issues that might be amenable to algorithmic approaches; one was decided on (removing stress marks from phonemic segs in inferred roots), and I wrote and tested the required transformation, then ran it on the data at the end of the day.
Ran it on these files:
affix_aspectual, affix_glot-ix, affix_k-m, affix_n-t, affix_u-CAPs, lex-pref, lex-suf, particles, pron
and committed the results. SMK now checking.
Finished and tested the XSLT from yesterday; SMK will check results before we hard-run it and change the data.
Further to our discussions on numbers, I have added the following to feature_system.xml:
1) wordType numberStem. ECH will add this <fs> to the number stems 1-10:
<fs>
<f name="numberStem">
<binary value="true"/>
</f>
</fs>
2) countingType "ten"
I have also added the following <fs> to lexical suffix "akst-2", so ECH can use this morpheme for marking up the numbers 30, 40 ... 90.
<fs>
<f name="baseType">
<symbol value="affix"/>
</f>
<f name="positionType">
<symbol value="suffix"/>
</f>
<f name="affixType">
<symbol value="derivational"/>
</f>
<f name="derivationalType">
<symbol value="lexical"/>
<symbol value="counting"/>
</f>
<f name="countingType">
<symbol value="ten"/>
</f>
</fs>
MDH will then search for entries with this <fs> to build a test column for the table of numerical expressions. We can subsequently add more countingType values to the feature system, and to the entries for the appropriate lexical suffixes with classifier functions, and generate more columns for the table.
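The harvesting step might look something like this in XSLT; the collection URI and entry element names are assumptions based on the feature structures quoted above:

```xml
<!-- Sketch: gather the number stems and the "ten" suffixes for the
     test column of the table of numerical expressions. -->
<xsl:variable name="numberStems"
  select="collection('../xml')//entry[.//f[@name='numberStem']/binary[@value='true']]"/>
<xsl:variable name="tensSuffixes"
  select="collection('../xml')//entry[.//f[@name='countingType']/symbol[@value='ten']]"/>
```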
Discussions and decisions on how to handle numbers and counters: new wordType of cardinalNumeral, new lexicalSuffix type of numeralClassifier. These will be applied, and then harvesting will be done to generate a table of numerical expressions which will form the basis of decisions on how/whether to create a special section in the print dictionary.
SK pointed out that the English-Moses index was sorting Js to the end, and indeed when I looked at the collation that we're using for all sorting (MosesPhonemicCollation, which is designed to handle both English and Moses), J was omitted from the sequence. I added it to the source, installed NetBeans and recompiled the jar, and all seems to be well. I was happy to see that NetBeans was its usual trouble-free self; installed quickly, worked out of the box, and although it complained that a dependency ("hamcrest") was missing from the project, it added it for me, resolving the issue painlessly.
Also added a new Schematron rule to the set, to catch entries with no pron/seg[@type='p'], at SK's request; that caught 19 additional errors, which she's fixing.
Per SK's request, new report on entries ending with a specific sequence of chars.
Diagnosed the borkedness of a borked XML file; fixed some XSLT; tried building the dictionary only to discover that of course XEP wasn't set up in Oxygen; reconfigured all the old hard-coded paths in build tasks; built the PDF; and more tweaks. Reminder to self: the diagnostics page is erroneously including an extra include for the personography, minus its file extension; needs fixing.
Beefed up the diagnostic processing of feature structures to add stats tables, revealing that many vals are never used. Food for thought. But no new errors revealed, which is good.
In our discussion today about segments with more than one possible hyph, we also revisited how hyphs should look in print dictionary entries. We currently show the hyph followed by the "translated hyph" - e.g.:
[[x̣mánk-n-c • love-CTR-TR.1SgObj.3TrSbj]]
The final segment -c is shown to be composed of 3 morphemes, separated by periods: TR.1SgObj.3TrSbj
We are concerned that this will not be transparent to learner users of the dictionary. So we decided to update the <label> elements to include the first <pron> of the morpheme entry, where that would be helpful - e.g.
t(TR).sa(1SgObj).s(3TrSbj)
We need to think further about exactly how this should be implemented.
We should also check again how Montler 2012 represented syncretic morphemes in these "translated hyphs" in his root-based index. See photocopies in Print Dictionary Working Notes folder.
And we need to add an index of labels somewhere in the dictionary front matter!
Added a check for morphemes in completed files not pointing at existing entries. There are 222.
Per SMK, improved one of the diagnostics dealing with placeName entries and matching non-placeName entries.
Picked up and extended the work JT has been doing on the diagnostics, adding six more features to complete the requirements as set out by SK. Waiting for the next build to complete so we can check that they're all working as expected.
Per ECH: faded out the draft watermark, commented out the Acknowledgements, and changed the date to 2016, then rebuilt the PDF.
A little table of contents for our Diagnostics page at:
http://jenkins.hcmc.uvic.ca/job/moses/lastSuccessfulBuild/artifact/trunk/utilities/
1) Broken Cross References - to be checked every time someone finishes editing an alphabetical file. The broken xrs in the affix files will be dealt with later by a separate process. (Commenting out affix cits that duplicate completed entries elsewhere.)
2) Duplicate IDs - to be checked every time someone finishes editing an alphabetical file.
3) Feature Structure Problems - to be checked every time someone finishes editing an alphabetical file.
4) Glosses_All: List of all glosses in the entire database, with the entries they appear in. May be useful for final-final proofread, but doesn't need urgent action.
5) Glosses_Complete: These were all reviewed and gloss-tagging standardized by Marianne Huijsmans in August 2015. No further action needed.
6) Glosses_Edited: List of all glosses in edited and light-edited files, with the entries they appear in. This list could be reviewed and any non-standard glosses fixed, per Notes on Definitions and Gloss-Tagging, but this is not a priority item.
MDH: Found over 3,000. Seems like a lot...
SMK: We revised the process later. The correct total is only 735! And dropping ...
The three compound test cases listed here have been found correctly in the latest build. So it appears the new rules for finding compounds are working.
I spent a long time this afternoon puzzling over the root-based index in the latest build, trying to answer Martin's question from Feb. 2: Is the revised compound-finding rule working correctly?
I think it is, although a lot of our test cases can't be tested fully until other files are completed and added to the build.
The current compound-finding rule looks for:
"all entries with the root or stem in question whose hyphs ALSO contain another <m corresp="m:..."> pointing to an entry with a root or stem feature structure"
The new rule is actually finding fewer compounds right now, because it requires the entries for both roots (and/or stems) to be in files included in the build.
Our test case "wəswisxnascʼəlcʼəl", with stem "wisxn" and root "cʼəl", is currently NOT getting found as a compound, because c-glot.xml is unedited, so the root entry for "cʼəl" is not in the build. "wəswisxnascʼəlcʼəl" doesn't get caught by the compound-finder, so it passes through to the next rule, gets caught as a reduplication, and appears under Reduplications under the stem "wisxn". This is wrong, but it's temporary. When c-glot is edited and added to the build, "wəswisxnascʼəlcʼəl" will sort correctly as a compound under both "wisxn" and "cʼəl".
Meanwhile, many compounds with root √x̣əƛʼ and other roots which are in the build are being found correctly and organized under both roots.
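The rule quoted above could be expressed as a function along these lines; hcmc:isCompound and hcmc:isRootOrStem are hypothetical helpers, and the element names follow the hyphs quoted elsewhere in this log:

```xml
<!-- Sketch: an entry counts as a compound when its hyph contains at
     least two distinct morpheme references that resolve to root or
     stem entries actually present in the build. -->
<xsl:function name="hcmc:isCompound" as="xs:boolean">
  <xsl:param name="entry" as="element(entry)"/>
  <xsl:variable name="rootRefs"
    select="$entry//hyph/m/@corresp[hcmc:isRootOrStem(.)]"/>
  <xsl:sequence select="count(distinct-values($rootRefs)) ge 2"/>
</xsl:function>
```

This also explains why the count drops when a root's own entry is missing from the build: its @corresp no longer resolves, so the entry fails the two-root test.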
Another good test case to look at when EJD finishes s.xml and we add it to the build is "siʔsiʔtax̣x̣ƛʼcinʔ". This is not currently being found as a compound, but it should be once s.xml is added to the build.
A couple of the other compounds with √x̣əƛʼ were not found correctly due to the underdot under the x̣ floating under the schwa - likely copy-paste errors made when completing hyphs. I have now fixed these, so they should also be found correctly in the next build.
For the record, they are:
"swahamaɬx̣x̣ƛʼcintn" Ribbon Cliff
"sqəlˀtmx̣Wax̣ƛʼcin" gelding
For further test cases, see SMK's notebook, 22Mar16.
After a couple of false starts due to inadvertently nested morphemes, we've now been able to build a version of the dictionary with over 5,000 entries and 600 pages. SK says we're actually more than half-way through the editing, when other incomplete files are taken into consideration.
SK pointed out that we were harvesting compounds which were not compounds, because we were conflating multiple forms in an entry when doing the test for more than one root or stem. I've now fixed that, and it looks like the current crop of compounds is right. I also worked around a mysterious bug that was eliminating spaces in the rendering of hyphs in the root-based index; I don't know why lone spaces between components were getting lost, but using an en space instead of a #x20 solved it for some reason.
SK suggested a change to the generation of the root-based index, which we implemented, but in order to test it I finally had to bite the bullet and go through the process of getting the dictionary build set up (config was lost during the hard-drive failure last year, and the subsequent transition to a new machine). I discovered a number of things:
- My XEP config had been lost because it was in ~/apps, which was not backed up. I've now reconstructed it; the config file xep.xml was partially reconstructed already based on the ScanCan server setup, but was missing the Aboriginal Sans fonts. I've now added those fonts to the font folder in the xep app folder and to xep.xml, and this stuff is in my backup script.
- I have three Java collations used in the form of jar files for sorting purposes. The source code for two of them is in svn, but the third, MosesOrthographyCollation, was missing both its jar file and source code, although the project framework was there. Not sure how that happened, but probably because it was developed in a different location on the dead drive and didn't get svn'ed or backed up for some reason. I had a built copy of the jar file in the eXist project, so I'm able to use that, but I'll actually need to reconstruct the source code (not complicated) so it can be tweaked if necessary. There's a hard-coded list of the orth symbols in order showing in the built dictionary PDF, so that won't be hard.
- The dictionary now builds OK. Per SK, the change to compound discovery should not have retrieved any new compounds for the moment (the particular cases in point are in a file that's not yet being used in the build), but it does seem to have found some; I'm waiting for SK to look at those and determine whether they are actually correct, or whether our new criteria are too broad somehow.
The feature structure checker was wrongly flagging lex-suf entries with a single <m> element which mapped to two other morphemes through @corresp as requiring a feature structure. Reported by SMK, fixed today.
Added a couple more tests for problem feature structures based on request from SMK.
Per request from SK, there are now three output files being created:
- glosses_all.txt
- glosses_complete.txt
- glosses_edited.txt
The first one works slightly differently from the others. In the first one, each distinct gloss is listed, along with a list of all the prons from any entry that includes that gloss (in other words, multiple entries). In the other two, each entry is processed individually, and the glosses for that entry are listed following its pron.
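The glosses_all grouping could be sketched like this; the element names are assumptions based on the entries quoted elsewhere in this log:

```xml
<!-- Sketch: one line per distinct gloss, followed by the first pron of
     every entry containing that gloss. -->
<xsl:for-each-group select="$entries//gloss" group-by="normalize-space(.)">
  <xsl:value-of select="current-grouping-key()"/>
  <xsl:text>: </xsl:text>
  <xsl:value-of select="for $g in current-group()
                        return $g/ancestor::entry[1]/descendant::pron[1]"
                separator=", "/>
  <xsl:text>&#x0a;</xsl:text>
</xsl:for-each-group>
```

The other two reports just iterate entry by entry, so no grouping is needed there.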
Partly for encoders to do quicker and easier proofing (hopefully), but also partly to re-familiarize myself with some of the quirks of author-mode CSS in Oxygen.
SK asked for some stats on our progress, so that's now in the build process. More varieties of gloss lists are also going to be generated eventually.
This currently:
- Validates all the dictionary files (rng and both Schematrons, TEI and local)
- Runs diagnostics to check all cross-references
- Runs diagnostics to check all feature structures
- Generates a list of glosses (the team is using this right now)
More will come, including hopefully a build of the PDF, but that will require some decisions about where our copy of XEP is to live.
SMK is entering new words from our August 2014 work session with EB and PCS. This prompted a discussion of how to make entry of new data more efficient.
MDH and SMK set up keyboard shortcuts for inserting new entry templates, using a new feature in Oxygen v. 17. See Dictionary Editing Manual for details.
MDH and SMK also looked briefly at Author View in Oxygen, which allows setup of a data entry "front end". This is not worth doing when SMK is the primary data-enterer and already comfortable with XML. But it might be worth considering for a future workstudy person.
For both "front-end" data entry, and importing data from a spreadsheet, the difficulty is that we don't know how many of each field each entry needs. An entry might have one or more narrow transcriptions, one or more cits, etc. In cases like this, MDH thinks it's just as quick to enter the data directly into XML.
EB has shared a list of new words not in previous NLP dictionaries. SMK will use it as a test case for working out the steps to import it and check it against the existing database.
Rewrote bits of the code so you can just plug in a morpheme id and get a fresh page and JSON for that item without overwriting anything else; tested and bugfixed with a bunch of other morphemes, committing the results to the repo in case ECH can use them.
After much pain, managed to get Tomcat 7 running on ECH's laptop with the Moses app inside it. Tomcat 8 will not run it; need to investigate that.
Meanwhile, we discussed visualizations, and as a result I worked on generating data for a force-directed graph of the shared morphemes between entries all containing a specific morpheme. The example morpheme had 66 entries in addition to itself, and there were 545 edges, so the result was painfully slow in the browser and not very revealing. However, it might be useful for less common morphemes.
Tweaked the autophonemicizer so that it applies only to phrs in cits, and then ran it on eight files ready to be done. After this was working and checked, wrote a new transformation which detects when a cit is actually a duplicate of a full entry which already exists elsewhere; comments out absolutely identical ones, and flags close ones for further checking. In this way over 1400 cits were commented out; 800+ await checking.
Quick last-minute review of IRG proposal.
Prompted by a discussion on TEI-L, added a Schematron rule to check NFC in text nodes, and ran my normalization code on a bunch of files which had non-NFC content.
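The rule might look something like this; this is a sketch assuming the XSLT 2.0 query binding, so that normalize-unicode() is available:

```xml
<sch:rule context="text()[normalize-space(.)]">
  <sch:assert test=". eq normalize-unicode(., 'NFC')">
    This text node contains non-NFC content; run the normalizer on the file.
  </sch:assert>
</sch:rule>
```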
Following discussions yesterday, I've made a minor change to the retrieval code that runs when you click on "Other entries containing this morpheme" on the website. It now retrieves only entries with the morpheme in their <hyph>, not (as before) anywhere at all.
I've made a similar change to a number of XPaths in the PDF generation code, and a quick check of the resulting PDF shows that it doesn't seem to have had any adverse effects.
That means that if the team wishes to tag <m>s in <cit>/<phr> contexts (e.g. for particles), that can be done without any side-effects on the website or on the PDF generation. I could add a feature to the website which triggers only for entries with this feature:
<f name="baseType"> <symbol value="particle"/> </f>
which would "Retrieve all example sentences containing this particle", but I think that should only be done if we're consistently tagging the particles in cits. Of course, if that can be done through search-and-replace (in the cit/phr XPath context) for all particles, then there's no reason not to do it since it won't cost us any time.
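The retrieval itself would be a simple XPath along these lines; $morphemeId is a hypothetical parameter, and the element names follow this post:

```xml
<!-- Sketch: all example cits whose phr contains an <m> pointing at
     the particle in question. -->
<xsl:variable name="particleExamples"
  select="//entry//cit[phr//m[@corresp = concat('m:', $morphemeId)]]"/>
```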
SMK has come up with a nice algorithmic approach to identifying reduplicants in hyphs, and converting the contents to CVC-style representations. I've coded up some XSLT to do the first two steps, and we've tested on a couple of files. If no problems emerge, we'll run it on everything. Then there are a series of replacement steps that she's going to do manually, because it's not quite clear whether they can be automated without risk of unexpected side-effects.
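The conversion step could be done with nested translate() calls; this is a sketch only, and the consonant and vowel inventories here are illustrative, not the project's full character sets:

```xml
<xsl:function name="hcmc:toCVC" as="xs:string">
  <xsl:param name="seg" as="xs:string"/>
  <!-- map vowels to V, then consonants to C, leaving anything
       unrecognized untouched for manual review -->
  <xsl:sequence select="translate(translate($seg, 'aeiouə', 'VVVVVV'),
                                  'ptkqcsxmnlwyʔ', 'CCCCCCCCCCCCC')"/>
</xsl:function>
```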
Met with ECH and SMK to plan the presentation for January. Also the process generated a new idea: in the print dictionary, the main entry list should be in a separate volume from all the indexes.
With SMK, made a range of different fixes to the orthography generation code. What we have now is the same in the fo output code and in the auto_orthography.xsl library; the latter is an identity transform that we plan to use to generate hard-coded orths which can then be manually tweaked. I think we're as close as we can be; what remains is a lot of oddities and special cases.
Suppressed display of labels in lexical affix index; special-cased the OC (out-of-control) morpheme so that it gets a reduplicant delimiter (+) instead of -, owing to its reduplicant nature (not currently reflected completely in its feature structure).
That's the complete set of changes arising out of our meeting on Wednesday.
From discussion yesterday: changed "Meaning unclear" to "Meaning not determined"; suppressed display of labels in web app; checked out some potential issues with display of various note types and resolved to stick with the status quo.
ECH has written most of the front matter, and today we worked on layout, pagination, styles etc. It all looks pretty good now, and is detailed enough to go to the Tribes for feedback.
ECH and I spent some time today looking at the complexities of orthing, with the assistance of the document SMK has been working on. We added a few simple rules to the existing algorithm (removing secondary stresses), and there's one more I can add (removing unstressed schwas at the beginning of the process).
But beyond that, the basic conclusion we came to is this:
Most of the hard decisions for the awkward cases have to be made on the basis of hyph or feature structure information. Right now, I'm not inputting that information into the orthing algorithm; it just operates on a plain phonemic transcription. In the case of <pron> orthing, it would be possible to pass in the relevant hyph and fs data. However, the orthing algorithm also has to be applied to <phr>s in <cit>s, and in that case, there is no hyph or feature structure info; any phr might contain a word from a completely different entry, and we wouldn't have any way of knowing about it.
So we think the best approach is this:
- We do the best we can to improve the simple algorithm.
- We run it against the existing files and create hard-coded <orth> elements (or <phr type="orth">) in the actual entry files.
- We also write code which is designed to detect (but not handle) circumstances under which a manual fix is likely to be required (where, for instance, the sequence əxʷ appears, which might indicate m:mix or m:ulˀəxW). When we detect those cases, we add a comment to the effect that someone needs to fix the results.
Our quick tests suggest that the problem entries are likely to be in the low hundreds rather than the thousands, and a lot of the checking and fixing could be done by a good workstudy student.
Met with ID and TM, visiting from Texas, and had a good long discussion about dictionary production, layout, publication and related topics. Very helpful.
Completed all the tasks in the previous post. Surprised to discover that background watermarking was not very difficult. Created an SVG file for the watermark, and a temporary xsl:attribute-set which sets it as a background-image, then assigned the temporary attribute-set to all the fo:region-body elements in the output. Done.
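The watermark setup amounts to a few lines; the SVG path here is hypothetical. The attribute-set is then referenced from each fo:region-body literal result element via xsl:use-attribute-sets:

```xml
<!-- Temporary attribute-set that paints the DRAFT watermark behind
     every page body; remove (or comment out) for final builds. -->
<xsl:attribute-set name="draftWatermark">
  <xsl:attribute name="background-image">url('images/draft.svg')</xsl:attribute>
  <xsl:attribute name="background-repeat">no-repeat</xsl:attribute>
  <xsl:attribute name="background-position">center</xsl:attribute>
</xsl:attribute-set>
```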
Arising out of today's meeting:
- When rendering appendix indexes, place the label after the full list of allomorphs, not after the first one. This currently applies only to the Grammatical Morphemes, but will apply later to Lexical Affixes (see below).
- All entries in lex-suf and lex-pref should have a <label> (last element in the first <def>) containing "la".
- When rendering a list of allomorphs in the indexes, use only a tilde, not a comma + tilde, to separate them.
- Create another appendix index of all the placeName entries (where the <form>/<pron>/<seg> contains a <placeName> element). There are currently only two in the completed files, but 264 across all the files.
- Investigate whether a watermark for DRAFT ONLY can be added in XEP, so that we don't have to worry so much about the PDF being shared before it's finished.
Per decisions in the previous post:
- Appendix index entries now include all allomorphs.
- Where allomorphs have different feature structures (through <vAlt> elements in the <fs>), the prefixes and suffixes for affixation are now sensitive to those differences and supply the correct versions for each allomorph (this needs rigorous confirmation by ECH).
I've also done a considerable amount of cleanup of the rendering of all indexes, especially the root-based index, which had headwords hanging over into the page margin.
One outstanding question: the root-based index headwords do not include allomorphs right now. I think they probably shouldn't (it's cleaner and clearer without, and in any case you would most likely get to them from the main entries), but if we decide otherwise, all that needs to happen is that code from the fo_extra_indexes.xsl outputExtraIndexEntry template would need to be imported into the fo_root_based_index.xsl outputEntry template.
Currently, allomorphs are not included in the appendix indexes, and they should be (listed immediately after the first). This will require changes to the outputEntry template. In addition, the hcmc:getAffixPrefix(), hcmc:getAffixSuffix() and hcmc:getAffixDelimiter() functions will need to be updated so that they are not as crude as they are currently: right now they read the first descendant::symbol element, but they'll have to be aware of which allomorph is the subject of the operation, and choose the correct symbol where there is a <vAlt> element.
Based on feedback from ECH, I've made a number of changes to the rendering of the indexes in the appendix. In the process of discussing this yesterday, we noticed there are many oddities in the placement of gloss tags and related spaces, so I did some regex work to pull up a few hundred candidate issues and fixed the ones that needed doing.
Lots of work this morning on clarifying what we should do with hyphs. Here's the breakdown:
- We should fix the delimiters in the source data, not in the output process. That means:
  -ʔ- becomes <ʔ>
  +a+ becomes <a>
  +C₂+ becomes <C₂>
  +CVC+ becomes <CVC>
BUT ONLY (in the last two cases) where the same root morpheme appears before and after the sequence.
- The string-replacement code I wrote yesterday to crudely accomplish this in the output should be removed, since the standard hyph output will now be correct anyway.
- The deletions mentioned in SMK's post should be carried out by pre-processing the whole hyph before the "translated hyph" is created:
- Delete the second/rightmost instance of the root after these morphemes: inchoative (xml:id="ʔ"), characteristic (xml:id="CHAR"), out of control (xml:id="OC"), but only when they are infixes; you can now tell this context by the angle-bracket text nodes surrounding them. Note that there may be more than one infix separating the two roots (there are no instances of this right now, but there will be as more data is processed).
- Delete the first/leftmost instance of the root before the repetitive morpheme (xml:id="REP"), and put the root symbol before the second part of the root (again, only where it is an infix, determined by surrounding text nodes).
- The other part of SMK's post, relating to the situation where a root morpheme has no gloss, is now changed: we do not keep the other instance of the root, but instead we replace the unglossed root with a smallcapped label "Unk", signifying "unknown".
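The root-deletion step for the translated hyph could be sketched as a suppressing template; this handles the single-infix case only, the mode name is hypothetical, and a more general test would be needed once multiple intervening infixes appear in the data:

```xml
<!-- Sketch of rule (a): suppress the rightmost half of a root split by
     an infix, detecting infix context via the angle-bracket text node
     immediately preceding it. -->
<xsl:template mode="translatedHyph"
    match="m[@corresp = preceding-sibling::m[2]/@corresp]
            [preceding-sibling::node()[1][self::text()][contains(., '&gt;')]]"/>
```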
ECH also sends these instructions re changes to the indexes in the appendix, having changed the feature structures of clitics. I've implemented these:
Put the List of Root Morphemes first, and maybe change the headings as I've indicated here:
Four Appendices
1. List of Root Morphemes -- all roots (but not stems) (i.e. anything with <f name="baseType"><symbol value="root"/></f>)
2. List of Lexical Affixes (from lex-pref.xml and lex-suf.xml)
3. List of Grammatical Morphemes -- all grammatical affixes (those in the five affix xml files) plus inflectional clitics. The inflectional clitics are defined as <f name="baseType"><symbol value="clitic"/></f> AND <f name="cliticType"><symbol value="inflectional"/></f>
4. List of Particles -- all particles (particles.xml)
It makes sense to have the List of Root Morphemes because it provides different information from the Root-based Index. The Root-based Index is a listing of all the words in the dictionary organized by root, with morphological breakdowns; the List of Root Morphemes in the Appendix is simply a list of all the root morphemes. It is therefore a subset of the information in the Root-based Index, but in listing only morphemes it parallels the other three appendices, which are lists of different categories of morphemes. So the Appendices will list all the individual morphemes in the dictionary.
I'm working on SMK's instructions for hyphs here. I've implemented the first part, which is easy: it's just a search-and-replace on strings. But I'm struggling with the second part, mainly because I don't understand the examples properly. My questions are below; waiting for clarification from ECH.
[INSTRUCTIONS] -- when generating the translated hyph,
a) Delete the second/rightmost instance of the root after these morphemes: inchoative (xml:id="ʔ"), characteristic (xml:id="CHAR"), out of control (xml:id="OC"). For example: [[√ʔiɬ<CVC>n-úl • √eat<char>-attrib]]
BUT, if the root has no gloss, DO keep the second part of the root. For example: [[k-√cúwˀ<CVC>x=ánaʔ • loc-√cúwˀ<char>x=ear]]
b) Delete the first/leftmost instance of the root before the repetitive morpheme (xml:id="REP"), and put the root symbol before the second part of the root. For example: [[√p<a>tix̣ʷ • <rep>√test]]
Again, if the root has no gloss, keep the first part of the root. For example: [[√p<a>tix̣ʷ • √p<rep>tix̣ʷ]]
[/INSTRUCTIONS]
The first example comes from this (I'll pretty-print the hyph for clarity):
<hyph> √ <m corresp="m:ʔiɬn">ʔiɬ</m> + <m corresp="m:CHAR">CVC</m> + <m corresp="m:ʔiɬn">n</m> - <m corresp="m:ul">úl</m> </hyph>
Question 1: Can I ignore the intervening characters between the <m> elements for the purposes of detecting infixes? For instance, can I search for a sequence of:
<m>rootX</m> <m>CHAR</m> <m>rootX</m>
and be sure it's OK to delete the second root, regardless of what text nodes happen to intervene? Or might there be instances of, for instance,
<m>rootX</m>-<m>CHAR</m>-<m>rootX</m> where, instead of + characters, there are hyphens, and the relationship is entirely different, so the deletion should not be triggered?
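For concreteness, this is roughly the match I have in mind for Question 1, assuming (pending ECH's answer) that intervening text nodes really can be ignored. It's a Python regex over the serialized hyph, and handles only a single infix, though the data may eventually have more than one:

```python
# Sketch: detect two <m> elements with the same @corresp separated by
# one of the three suffix-like infix morphemes, ignoring whatever text
# nodes intervene. Single-infix only; a production version would need
# to allow a run of infixes.
import re

INFIX_PATTERN = re.compile(
    r'<m corresp="(m:[^"]+)">[^<]*</m>'         # first root instance
    r'[^<]*'                                    # intervening text node(s)
    r'<m corresp="m:(?:ʔ|CHAR|OC)">[^<]*</m>'   # the infix
    r'[^<]*'
    r'<m corresp="\1">[^<]*</m>'                # repeated root instance
)

hyph = ('<m corresp="m:ʔiɬn">ʔiɬ</m> + <m corresp="m:CHAR">CVC</m>'
        ' + <m corresp="m:ʔiɬn">n</m> - <m corresp="m:ul">úl</m>')
assert INFIX_PATTERN.search(hyph) is not None
```

If the answer to Question 1 is no, the `[^<]*` runs would need to be tightened to match only the + delimiter.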
Question 2: I'm a bit confused about the idea of retaining the second root if it has no gloss. Why? The example comes from this hyph:
<hyph> <m corresp="m:k-LOC">k</m> -√ <m corresp="m:cuwx">cúwˀ</m> + <m corresp="m:CHAR">CVC</m> + <m corresp="m:cuwx">x</m> = <m corresp="m:anaʔ">ánaʔ</m> </hyph>
and the entry xml:id="cuwx" is indeed lacking a gloss (it's an inferred entry). But if we delete reduplicated roots in most cases but not in this one, aren't people going to assume that the second instance of the morpheme, which shows up as "x", is something else entirely, because they will assume a second instance has already been deleted, as it would be in most normal cases? Are we expecting people to distinguish a case where a root disappears because it has a gloss from one where it stays because it doesn't? That seems extremely confusing to me. I would naturally assume that if reduplicated roots are normally deleted, that's the case here too, and that the "x" is a subsequent, completely different morpheme (especially since it bears no resemblance to the first instance, "cúwˀ").
Today:
- Fixed naming of particle index.
- Split out lexical affix index, particle index and root index into separate page-sequences so they can have appropriate running headers.
- Fixed some display spacing issues with translated hyphs (compensating for superfluous spaces in data).
- Fixed a bug in xsl:key to look up glosses, so glosses are now appearing for lexical items that have them in translated hyphs.
- Fixed a problem with page-masters for front matter and appendices (page-masters were not properly configured for recto and verso).
- Fixed a blank-page bug (referenced master was not there, so page was unselectable in PDF output).
- Began work on handling of various infixes (this will be very complicated).
Also, GN hacked our fonts to add subscript 1 and 2, since we need these, and the font author has not responded to our requests.
... the clitic index is actually a particle index. Relabelled and renamed variables accordingly. More to come on this...
I've written an XSLT thing which generates a report on broken cross-references, and run it. Then I fixed a few obvious easy ones. There are quite a few left.
I've just worked through the handful of remaining instances of bibls containing parentheses. These were problematic because it wasn't exactly clear how to assign responsibility (@corresp) values to them.
I've left the contents of the bibls alone, so we can easily find them all again with this regex:
<bibl[^<]*\([^<]+</bibl>
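The same regex works unchanged in Python, for anyone checking files outside Oxygen:

```python
# The bibl-with-parentheses regex, exercised in Python. Bibls are
# single-line in our data, so no DOTALL flag is needed.
import re

bibl_paren = re.compile(r'<bibl[^<]*\([^<]+</bibl>')

assert bibl_paren.search('<bibl corresp="psn:K psn:AM psn:JM">K(Y40.29)</bibl>')
assert not bibl_paren.search('<bibl>W4.56</bibl>')   # no parentheses, no match
```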
My basic approach has been to credit everyone whose initial-key appears in the entry, without trying to figure out the relationship between them, on the basis that this is safer than ignoring some people based on question marks or parentheses. So, for instance:
<bibl>K(Y40.29)</bibl>
becomes
<bibl corresp="psn:K psn:AM psn:JM">K(Y40.29)</bibl>
(K signifies K; Y signifies AM and JM together.)
The only remaining problems are two instances in the rescued.xml file of this:
<bibl>(q.v.)</bibl>
I have no idea what to do with this, and a search through the original lexware files doesn't really help. I think someone will have to go back to the filecards for those.
There are 10,124 bibls remaining that have no @corresp, but 8,029 of them are <bibl><!--[No source]--></bibl>. That leaves over 2,000 which are a problem, though, and many of these are in already-completed files; affix_aspectual.xml has 186 instances of this, for instance:
<bibl>4.56</bibl>
which I would guess is supposed to be W4.56 or G4.56, both of which appear elsewhere.
I have a working table of contents for the book, based on divs with xml:ids. I'm sure this will get more complicated in time, but it works for the moment. I think we're going to need to organize some individual section title pages which have only large text in the middle of them. This might be done by deciding that any div with a head but no other content is a big fat title page.
As requested in SMK's post of March 4, I've added the four indexes in an appendix. They add only 17 pages to the length of the dictionary.
Yesterday ECH requested the following three fixes, which I've now done:
- Allomorphs are now separated by tildes in main entries.
- The "capitals" in small-caps are now explicitly 9pt, alongside 7pt for the "small" letters, instead of inheriting the default 10pt from the context.
- In the root-based index, we're using the first def instead of the gloss, since glosses often don't constitute short definitions; instead they're partial definitions whose purpose is to generate the English-Moses lookup index.
ECH and I met and went through all the existing tasks and documentation. We've put together a Gantt chart (using GanttProject -- very straightforward to use), and we'll be able to move forward in a more organized way through the next phase of the project.
Made a few bugfixes, and added title page info and some other placeholder content to the XML intro file for the book. Then added rendering for the title pages, and fixed the blank page issue. Also added some XSL messages to track rendering progress, and rationalized the build directory by deleting some old files with hard-coded orths (not needed now that we're creating them on the fly). Did another proof of the LD & C article too.
Today:
- Spacing is fixed (all explicitly inserted and controlled now).
- Leading/trailing spaces in text nodes are clipped.
- Affix prefixes and suffixes are working.
- The small-caps implementation is working, and is used in the root-based-index too.
- The RBI is now formatted very nicely, with no confusing indents, and with some decent settings for keeping headers and headwords with following blocks, to avoid widows.
- Labels are now included in entries for affixes.
- Sections now all begin on rectos.
New tasks arising:
- Because of auto-blank pages, we need a page template for blank pages and it needs to be included with sequences that have headers and footers, so that blank is really blank (the technique is adaptable from ScanCan corpus-to-pdf code).
- Sarah's remaining instructions below need to be implemented (some are straightforward, others less so).
A certain amount of one-step-forward-two-steps-back today. We've determined that there's no way with XEP to do small-caps, because the font does not have a small-caps variant, so I had to write some code to make small-caps programmatically with text analysis and styles. This caused horrible problems because it introduced lots of whitespace, due to the indenting between FO nodes. The only way to get rid of this was to set indent="no" on the XSL output, which leaves us with the need to manually insert spaces all over the place; that will take me a while, but ultimately it'll probably be the right thing to do. So the results are currently ugly as sin, but will eventually look a bit better.
Lots of other decisions as detailed in SMK's posts today.
Documenting some more decisions which will affect the final form of the print dictionary.
1) Cits for affixes
For the time being, we are suppressing cits for affixes - that is, cits which have not been autophonemicized. It's on our To Do list to go through and check that all these cits exist elsewhere as full entries. We will decide how or whether to display cits for affixes later. We can harvest examples from entries containing each affix, and either print them as full cits, or print page references to their full entry. If there is not enough space to print all cits for affixes and keep the dictionary under 1000 pages, we could just print a few cits for each affix, or leave them out entirely. To be decided.
2) Names
-placeNames, orgNames, flora, fauna, and StoryPeople are to be included in the main dictionary as regular entries
-personal names are to be excluded from main dictionary and placed in a separate appendix (if we have enough space in the final print version). Include hyph. SMK has emailed EB to confirm these instructions, 4Mar14. Awaiting response.
If we go with the Personal Names appendix, would it be possible to continue to print cross-references to personal names, and add a page number pointer to the appendix?
If not, we should search for all <xr>s which contain a pointer to a persName xml:lang="col", and go through and make sure the refs are in their own separate <xr> tags. There shouldn't be a lot of these.
Worked with the team to help complete the IRG application, which went in at the end of the day. In the process we discovered a lot of useful material that will contribute to the article.
Copyedit on the article, and some discussion of the IRG application (more work to do on that on Monday).
Lots of discussion and testing of various suggestions today, as well as the following major changes:
- Added secondary entries for variant forms, pointing to the root entry.
- Added the English-Moses gloss index.
- Rejigged the margins and added some lines to clean up the layout and save pages.
- Suppressed cits which have no orthable phonemic representation.
Lots of small tweaks and improvements too. It's looking good.
So we're now working with PDF instead of plain text on the index formatting. Lots of other updates to the dictionary entry display etc. too.
This took a bit longer than I expected because of some problems encountered with missing data, but I think I've completed all the instructions SMK provided on Tuesday night for the new dictionary entry format. Generally I think it looks pretty good. Some notes:
- The test PDF, which runs to 185 pages right now, is built using the same test set that SMK and I have been using for developing the root-based index -- the full list of included files is below. It's basically all the completed entry files along with any incomplete files containing morphemes required by items in the completed files.
- "Name" entries have been excluded from the dictionary. At present, this means all entries which have the "name" feature set to true. This is too crude, because it also excludes entries for flora and fauna; it looks as though the feature structures will need to be made a bit more sophisticated to allow us to exclude people's names more reliably and keep the other ones.
- I've set it up so that it automatically generates orthographical forms where required, based on the phonemic pron, and it also sorts based on these forms. This saves having to preprocess all the files to add orths before generating the dictionary. If an orth already exists, it will use it (so when we get around to adding orths, they'll be used in place of the generated ones).
- Where "orth?" appears in the middle of an entry, it's from a quotation which has no phonemic <phr>, so there's nothing to generate an orth from. There are 2,204 of these in lex-suf.xml alone. Perhaps auto-phonemicization can help here.
- There are problems with cross-references which contain refs pointing at entries which are excluded from the dictionary (see #2), so any such cross-references are ignored. This means that some legitimate cross-references are excluded because they share an <xr> tag with unusable ones.
- I think that enclosing the two versions of the hyph (the regular hyph and the "translated" hyph) in the same set of paired square brackets is a bit confusing; I think there needs to be a more obvious delimiter between them, or perhaps they should be bracketed separately:
[[c-ka-√ƛʼaʔá-s-n c-kas-√ƛʼaʔa-stu-(3Obj, 1SgSubj)]] should be something like: [[c-ka-√ƛʼaʔá-s-n ◆ c-kas-√ƛʼaʔa-stu-(3Obj, 1SgSubj)]] or [[c-ka-√ƛʼaʔá-s-n]] [[c-kas-√ƛʼaʔa-stu-(3Obj, 1SgSubj)]]
(I do think the translated hyph is a great idea though.)
Included files:
- personography.xml
- c-rtr.xml
- kw.xml
- h-phar-part1.xml
- affix_aspectual.xml
- affix_k-m.xml
- affix_u-CAPS.xml
- affix_glot-ix.xml
- l-affric.xml
- h-phar-part2.xml
- phar-w.xml
- h.xml
- glottal.xml
- s-rtr.xml
- affix_n-t.xml
- kw-glot.xml
- qw-glot.xml
- lex-suf.xml
- lex-pref.xml
- pron.xml
- particles.xml
Did the last tweak to the root-based-index sorting system per SMK, then started the rewrite of the rendering of entries in PDF, per SMK's guide. I've built orth-generation into the template, rather than having to work with pre-generated special versions of the data. It's coming along; another couple of hours' work and I think it'll be done.
I have a working setup which translates each morpheme in a hyph into either its label element (if the original entry has one), or failing that the first gloss element in its entry, or failing that its id (which may not be ideal). I'm using xsl:key for this so it's quick. I should set up keys instead of some of the other variables I'm using, probably.
Following Montler 2012, we want to be able to include a "translation" or "gloss" of words' hyphs, in both the main alphabetical dictionary entries, and the root based index.
For a hyph like: √ḥac=mín
The translation would look like: √tie=nominalizer
Here is an initial attempt at explaining how to programmatically generate these "translations":
-Replace a root, lexical prefix, or lexical suffix with the first <gloss> in its entry. (Question for ECH: what if the root has no <gloss>, as in the many "Meaning unclear" inferred roots?)
-Replace an affix (including those in the 5 affix files, and pron.xml) with a <label> to be added by ECH and SMK, based on MLW 2003.
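Sketched in Python (the entry dicts are invented stand-ins for the XML), the replacement logic would look like this, with a fallback to the morpheme id itself when an entry has neither label nor gloss:

```python
# Sketch of the "translated hyph" lookup: prefer the label (affixes),
# else the first gloss (roots, lexical prefixes/suffixes), else fall
# back to the morpheme id. Entries modelled as dicts keyed by id.

def translate_morpheme(mid, entries):
    entry = entries.get(mid, {})
    if entry.get("label"):
        return entry["label"]          # affix label (ECH/SMK, per MLW 2003)
    if entry.get("glosses"):
        return entry["glosses"][0]     # first <gloss> in the entry
    return mid                         # last resort: the xml:id itself

entries = {
    "ḥac": {"glosses": ["tie"]},
    "min": {"label": "nominalizer"},
}
# √ḥac=mín -> √tie=nominalizer
assert translate_morpheme("ḥac", entries) == "tie"
assert translate_morpheme("min", entries) == "nominalizer"
```

The id fallback is where ECH's question about gloss-less inferred roots bites: those would surface as raw ids unless we decide on something better.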
Here are some requests for formatting changes to the morpheme index, per yesterday's discussion with ECH.
-if possible, nothing should be indented more than 3 levels.
-make the indents larger, so it's easier to see the different levels
-remove the group headings, and the spaces between groups
-bold the plain root/stem
-include the following on each line:
-Nxa'amxcin word (orth), size 12 font
-first English def (not gloss), size 10 font
-hyph and "translated hyph" in size 8 font
-if possible, page reference to this word's full entry in the main alphabetical dictionary
More to follow re: generating the "translated hyph"!
From suggestions from SMK.
The new combined group-then-sort approach is actually working. There was one last tweak we had to implement: the sort force of clitics is problematic because, while they have a specific weight in the numerical sort sequence (used to determine the order in which morphemes are listed in the sort key), once that order has been determined they need to be "downgraded" during the actual sort. We've achieved this by massaging the generated sort key to prefix any clitic values with "0000_", which ensures they sort before any otherwise-identical sequences with non-clitics in the same position, and then stripping the added prefix again before we calculate the indent levels.
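The downgrade-and-strip trick could be sketched like this (Python stand-in; the weighted key components are invented for illustration):

```python
# Sketch of the clitic "downgrade": clitic components keep their
# numerical weight for determining morpheme order, but get a "0000_"
# prefix in the generated key so they sort before non-clitics in the
# same position; the prefix is stripped again before indent levels
# are computed. Requires Python 3.9+ for str.removeprefix.

def downgrade_clitics(components, clitic_ids):
    return ["0000_" + c if c in clitic_ids else c for c in components]

def strip_downgrade(components):
    return [c.removeprefix("0000_") for c in components]

key = downgrade_clitics(["0050_kn", "0120_mix"], clitic_ids={"0050_kn"})
assert key == ["0000_0050_kn", "0120_mix"]
assert strip_downgrade(key) == ["0050_kn", "0120_mix"]
```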
So this is what we're now doing:
- Creating groups of subforms of the root by extracting them in a specific order from all the related forms, excluding all previously-extracted ones so that each form appears only once under each root;
- Rendering the subgroups in a different order from the discovery order;
- Sorting the items within each subgroup based on a generated sort key which gives a numerical weight to each morpheme discovered working outwards from the root (after stripping duplicated roots in a careful manner which differs depending on the infix separating them);
- Generating an indent level for each item in the subgroup based on comparing the lengths of their sort keys after stripping the common component from the left side of each.
Can this be it? Looks like it might be.
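The indent-level step, combined with the no-more-than-one-level-deeper-than-its-predecessor correction mentioned elsewhere in this log, could be sketched as follows (Python stand-in; sort keys modelled as tuples of components):

```python
# Sketch of indent-level generation: strip the component shared by the
# whole subgroup from the left of each key, use the remaining length
# as a relative indent, then clamp so no item sits more than one level
# deeper than its predecessor.
import os

def indent_levels(keys):
    common = len(os.path.commonprefix(keys))  # shared leading components
    raw = [len(k) - common for k in keys]
    levels, prev = [], 0
    for r in raw:
        level = min(r, prev + 1)              # per-predecessor clamp
        levels.append(level)
        prev = level
    return levels
```

(`os.path.commonprefix` works element-wise on any sequences, not just path strings, which is why it's usable on tuples here.)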
The new numerical sort is doing what it's supposed to do, and I've now added handling for the infixes with repeated roots, but it turns out that we still need to do the initial grouping of forms first, and then use this approach as a secondary sort within the groups. I've also coded an additional pass through an entry sequence which calculates first relative and then absolute values for indents, based on eliminating the longest common initial subsequence and comparing what's left, but whether that will prove to be useful or not remains to be seen. Tomorrow I'll start integrating the two approaches (old grouping and new sorting within groups, except for compounds which already have their own subsort).
If there are two instances of the same root's xml:id in a word's hyph, it's because the root morpheme is split up by an infix. These infixes need to be handled as follows:
-inchoative (xml:id="ʔ"), characteristic (xml:id="CHAR"), out of control (xml:id="OC"): delete the second/rightmost instance of the root. (That is, treat these infixes like suffixes.)
-repetitive (xml:id="REP"): delete the first/leftmost instance of the root. (That is, treat this infix like a prefix.)
A quick "layman's language" recap of the algorithm MDH has already encoded ...
-Group words first by their root or stem, as in the previous approach; list the plain root first.
-Group compounds and sort them as in the previous approach; print these last, as they are the most complex words with a given root or stem.
-Assign each morpheme a numerical value (list to be finalized).
-Generate sort keys for words as follows:
-First, look at the numbers of the prefix and suffix nearest the root. Add the lower of these numbers to the sort key.
-Then ignore the affix whose number has been added to the sort key, and repeat the process of comparing and adding numbers, building the sort key until no morphemes are left.
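A Python sketch of that key-building loop (the weights are illustrative placeholders for the to-be-finalized list, and the real implementation is a recursive XSLT function):

```python
# Sketch of the root-outwards sort key: at each step compare the
# prefix and suffix nearest the root, append the lower weight to the
# key, drop that affix, and repeat until no morphemes are left.

def sort_key(prefix_weights, suffix_weights):
    """prefix_weights: weights of prefixes, nearest-the-root LAST;
    suffix_weights: weights of suffixes, nearest-the-root FIRST."""
    prefixes = list(prefix_weights)
    suffixes = list(suffix_weights)
    key = []
    while prefixes or suffixes:
        p = prefixes[-1] if prefixes else None   # prefix nearest the root
        s = suffixes[0] if suffixes else None    # suffix nearest the root
        if s is None or (p is not None and p <= s):
            key.append(prefixes.pop())
        else:
            key.append(suffixes.pop(0))
    return key
```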
I've implemented the basic features of the new algorithm, and sent some results to SMK to see if they make sense. Remaining to be done: sorting of compounds (which can be pulled in from the previous approach), the handling of repeated instances of the root, and the implementation of indent level (which is currently based purely on the length of the key).
SMK has come up with a rather brilliant plan to prioritize the morphemes closer to the root. She'll write up a document to articulate the decisions we made today, and I'll code it in the form of a recursive XSLT function which will generate a sort key based on the numerical values assigned to the morphemes. The length of the key can, we think, also be used to create an indent level, with the output correction that nothing is indented more than one level more than its predecessor.
This never ends. We are now arriving at a canonical sort order for actual morphemes within morpheme groups, so that the secondary sort key can use zero-padded numbers instead of morpheme ids (which obviously don't sort alphabetically in any ideal way). We have also added a final item to the end of the secondary sort key, which consists of the text of the hyph; this will be used to differentiate two items with identical morphemic structure. Finally, I've added many more latin glyphs to the MosesPhonemicSort collation (which is now our default "sort for all occasions" library) so that it can handle e.g. upper-case letters in morpheme ids, and also so that it ignores morpheme delimiter characters, thereby improving the final-stage part of the key mentioned above.
The XSLT function for generating a secondary sort key had been created in an accretive fashion, and ended up being very difficult to change because altering the sequence of the fifteen (so far) steps required manual adjustment of letter codes for all of them. I've now rewritten it in an optimized manner (optimized in the sense that it's probably slightly slower, but trivial to reorder or expand with new steps, as may happen next week).
I should do this sort of thing more often.
Implemented everything laid out in yesterday's post, and some additional tweaks; we're now able to use document order of affixes as an additional component of the sorting mechanism. We also met and discussed the dictionary organization and layout.
SMK and I worked hard on the secondary sorting today, and we have a working system for the lexical suffix group, which is our core test. The basic idea works like this:
- Each particular group has the first part of its sort key created by its own key function, which is customized depending on how the group should be sorted (so in the case of Lexical Suffixes, the first part of the key is a concatenation of the lexical suffix ids in the order they appear in the item, with underscores between them).
- The custom function then calls a generic getSecondarySortKey function, which creates the remainder of the sort key. This works by generating a series of sort key components, working through the print order of the groups. For each major group, it creates a key based on an underscore followed by a letter prefix (a, b, c etc.), then another underscore, then the xml:id of the morpheme which is in that group. The secondary key function has a parameter enabling it to ignore the primary key component, which has already been handled by the custom function.
- The result is a key on which we can sort each major group. This has been implemented so far for everything up to lexical suffixes -- in other words, it generates keys for each of the major groups which are harvested AFTER lexical suffixes (anything harvested before will not appear in the lexical suffix group anyway, because it will be in a preceding group), and each of those groups is handled in their print order.
- The underscore has been added to the MosesPhonemicCollation class at the beginning of the sort order, so it can be used to massage the sort order when necessary.
- TODO #1: complete the secondary sort key function so that it handles everything still missing (Compounds, Primary Affixes and Lexical Prefixes).
- TODO #2: Add the final three sort categories (Gi, Gii, Giii) at the end of the sequence.
- TODO #3: Write or fix up the custom sort functions for all of the other major groups.
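A Python sketch of the two-part key scheme (group names, letter assignments, and morpheme ids here are all invented for illustration; the real functions are XSLT, and the underscore sorts first in the real collation):

```python
# Sketch of the two-part sort key: a custom primary component per
# major group (here, lexical-suffix ids joined with underscores), then
# a generic secondary component built from the remaining groups in
# print order, each component shaped "_<letter>_<morpheme id>".

GROUP_ORDER = ["compounds", "primary_affixes", "lexical_prefixes"]  # illustrative

def primary_key(lex_suffix_ids):
    return "_".join(lex_suffix_ids)

def secondary_key(item, skip_group=None):
    parts = []
    for letter, group in zip("abcdefghij", GROUP_ORDER):
        if group == skip_group:
            continue                 # already handled by the custom function
        for mid in item.get(group, []):
            parts.append(f"_{letter}_{mid}")
    return "".join(parts)

item = {"primary_affixes": ["ka"], "lexical_prefixes": ["naʔ"]}
key = primary_key(["ánaʔ", "qín"]) + secondary_key(item)
assert key == "ánaʔ_qín_b_ka_c_naʔ"
```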
Today I ended up writing some XQuery that can write very thorny XSLT templates for sorting subgroups of subentries in the hierarchical index. I now have a system which can do the initial harvesting, sort most of the subgroups on their identifying morphemes (three groups are not yet sorted because it's not clear how to sort them), and spit out results in a human-readable form for checking. It's not really complicated, just incredibly detailed and picky work that takes ages to do, and even longer to debug and check.
Grouping of subentries is now complete, and SMK is checking the results. Meanwhile, I've written my first subgroup sorting template, heavily recursive, and she'll check the results of that next time she's in. It looks like I have a working approach to the subgroup sorting, although it's pretty complicated.
I've started the process of generating the root-based hierarchy of entries, a phenomenally complicated process, but with SMK's excellent documentation I'm making good progress. I'm still completely at a loss as to how I'll sort entries within the subgroups, but we'll come to that when we've managed to generate all the subgroups accurately.
Met to work through the spec which SMK has created for the morpheme index sorting and nesting; it's extremely complicated, so it will have to wait till I get back before I start implementing it, but looks rigorous and doable.
SMK adds:
The latest spec is in moses/trunk/docs/Action/PrintDictionary/SpecsForRootBasedIndex.odt. A sample sorting of entries with the root ḥawˀiy is in SampleOrderOfSubentries.odt.
MDH can test the process on the files with status values of complete, edited, and light-edited.
Reminder to self on how to generate dictionaries:
- In Oxygen, select all the XML files in the dictionary directory and run the transformation moses_auto_orthography. This generates all the files with automatic orthographical forms in dictionary_test.
- Open the file utilities/generate_names-only_dictionary.xsl, and run it on itself. This uses the files from dictionary_test to create the smaller XML files in the names subfolder.
- Select the XML files in the names subfolder and run the transformations moses_xml_to_pdf_LEARNER_include_names and moses_xml_to_pdf_LINGUIST_include_names. This generates the complete set of pdfs in the names/pdfs subfolder.