Nxaʔamxcín (Moses) Dictionary Blog

August 11, 2010

Missing content lost during transformation: ECH's diagnosis

Posted on 11 Aug 2010 in Activity log

This is based on an email from ECH, describing the context in which content was lost during the transformation from "unmerged" to "entries_separated" files:

What is happening is this. When there is a sequence in the unmerged file of the following:

<ENTRY level="002" id="">
<xx></xx>
<infl>yyyyyy</infl>
zzzzz
zzzzz
zzzz
</ENTRY>

The entire entry is missing from the separated file.

These level 002 entries are derived words in the Lexware database; significantly, the entries are dropped from the transformed output when the inflection band follows the derivation band directly.

This has been observed to happen in qw-glot.xml as well as other files. I'm going to look at that file directly, and find specific examples I can work with, then isolate them and work on a fix.

Results of some digging:

-qw-glot_SEPARATED.xml is constructed from the very small Q'W.xml and the much larger Q'W1CDH.xml. Both items from the former are there in qw-glot_SEPARATED.xml, so the dropped items are all from Q'W1CDH.xml.
-Some level 002 entries were definitely carried over OK (e.g. entry xml:id="niʔqʼWacʼlqs", "nosebleed"), so it's not just a question of dropping level 002 entries.
-It's not just a case of items with no @id being dropped (as I initially thought from ECH's description above); some entries with no @id value are carried over (e.g. I *fill|ed up my basket).

Here is an example which shows the problem. In the following, the outer entry (the root) is carried over, but the inner one is completely dropped:

<ENTRY level="001"  id="√q'ʷáq'ʷ‐">
<rt>√q'ʷáq'ʷ‐</rt>
<ENTRY level="002"  id="">
<ls></ls>
<infl>nominalizer</infl>
<n>s‐√q'ʷáq'ʷ=əl'qʷ</n>
<g>*prairie‐chicken, *sharp‐tailed~grouse</g>
<gc>Y2.33 is JM only</gc>
<k>A46; Y2.33</k>
<var>s‐√q'ʷáq'ʷ=əl'qʷ‐aʔᵃ</var>
<g>?</g>
<gc>claimed by Agnes Miller to be MC, by Jerome Miller to be Colville</gc>
<k>AM, JM</k>
<var>s‐√q'ʷáq'ʷ=əl'qʷ‐aʔ</var>
<g>*prairie~chicken</g>
<k>EP2.31.8</k>
</ENTRY>
</ENTRY>

This appears to be a situation in which the first element of the embedded item (in this case <ls>, "lexical suffix") is empty, and is followed by <infl>.

I looked at the XSLT code (separate_xml.xsl) and determined that:

-An entry is only processed if its first element has a string-length of more than zero; so the empty first item causes the problem here.
-However, this is only the otherwise branch of a conditional; the first branch (presumably deemed to be the most common) expects to find an @mode attribute on elements. It then processes all following items which have the same @mode attribute.

It looks as though this process was primarily written targeting a situation in which we needed to separate not simply embedded <ENTRY> elements, but also blocks of tags within <ENTRY> elements, which were defined by their sharing an @mode attribute value. However, the qw-glot file doesn't have ANY @mode attributes, while some files have many of them. It appears that there were two distinct methods of structuring entries in the original data, and these were converted into two slightly differing XML structures.
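To make the idea of "blocks of tags sharing an @mode value" concrete, here is a rough Python sketch of that grouping. It is not the logic of separate_xml.xsl (the first branch of the conditional is not reproduced in this post); the sample entry, its contents, and the function name group_by_mode are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified entry; tag names follow the examples in this post.
SAMPLE = """<ENTRY level="002" id="">
  <infl>nominalizer</infl>
  <n mode="1">form-one</n>
  <g mode="1">gloss-one</g>
  <n mode="2">form-two</n>
  <g mode="2">gloss-two</g>
</ENTRY>"""


def group_by_mode(entry):
    """Map each @mode value (None when absent) to the tags of the children carrying it."""
    groups = {}
    for child in entry:
        if child.tag == "ENTRY":
            continue  # nested entries are a separate case
        groups.setdefault(child.get("mode"), []).append(child.tag)
    return groups


blocks = group_by_mode(ET.fromstring(SAMPLE))
print(blocks)
```

In a file like qw-glot, which has no @mode attributes at all, every child would land in the None group.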

However, this is something of a red herring; I found another entry, in T4CDH.WRK.xml, which does make use of @mode but still exemplifies the problem (its inner <ENTRY> is lost):

  <ENTRY level="001"  id="√k'ᵊř">
    <rt>√k'ᵊř</rt>
    <ENTRY level="002"  id="">
      <lc.ls></lc.ls>
      <infl>nominalizer</infl>
      <n mode="1">s‐t‐√k'ᵊř=álᵊqʷ</n>
      <g mode="1">tree cut with something</g>
      <k mode="1">Y24.74,77</k>
      <il.lc.ls.n mode="1">nawə́nt s‐t‐√k'ᵊř=álᵊqʷ</il.lc.ls.n>
      <df mode="1">groove or deep line cut into a tree</df>
      <k mode="1">Y24.74</k>
    </ENTRY>
  </ENTRY>

So the issue is clearly with the empty tag. It's obvious from the (very simple) XSLT that in such a context, we explicitly stop processing, so nothing is output:

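<!-- Applies to the first child element of an entry, unless it is a nested ENTRY. -->
<xsl:if test="(not(preceding-sibling::*) and (name() != 'ENTRY'))">
    <!-- An empty first element fails this test, so nothing is output for the entry. -->
    <xsl:if test="string-length(.) > 0">
        <xsl:element name="ENTRY">
            <xsl:copy-of select="."></xsl:copy-of>
            <xsl:for-each select="following-sibling::*[not(@mode)][not(name() = 'ENTRY')]">
                <xsl:copy-of select="."></xsl:copy-of>
            </xsl:for-each>
        </xsl:element>
    </xsl:if>
</xsl:if>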

The question now is why -- why did we decide not to process entries that began with an empty tag? I'll write to ECH and see if she has any memory of this, and also keep digging to see if I can find a reason. Ultimately, it should be possible for me to use the same strategy in reverse to FIND all those entries, and output them specifically in one block, which could then be merged back into the new files (once it's gone through all the other processing).
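As a rough illustration of the "same strategy in reverse", here is a minimal Python sketch (not the XSLT that will actually be used) that lists every <ENTRY> whose first child is a non-ENTRY element with zero string length. The sample data and the function name find_dropped_entries are made up for illustration; a real run would parse one of the unmerged files instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the unmerged data shown above.
SAMPLE = """<ENTRY level="001" id="root-1">
  <rt>root-1</rt>
  <ENTRY level="002" id="">
    <ls></ls>
    <infl>nominalizer</infl>
    <n>derived-form</n>
  </ENTRY>
  <ENTRY level="002" id="">
    <n>kept-form</n>
    <g>gloss</g>
  </ENTRY>
</ENTRY>"""


def find_dropped_entries(root):
    """Return ENTRY elements whose first child is an empty non-ENTRY element."""
    dropped = []
    for entry in root.iter("ENTRY"):
        children = list(entry)
        if not children:
            continue
        first = children[0]
        # Reverse of the stylesheet test: string-length(.) = 0
        if first.tag != "ENTRY" and "".join(first.itertext()) == "":
            dropped.append(entry)
    return dropped


hits = find_dropped_entries(ET.fromstring(SAMPLE))
print(len(hits), [child.tag for child in hits[0]])
```

Like string-length(.) in the stylesheet, this treats a first element containing only whitespace as non-empty, so it should pick out the same entries that the transformation skipped.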

March 29, 2010

angle brackets around unattested glosses

Posted on 29 Mar 2010 in Activity log

We are currently using <angle brackets> to denote a gloss supplied by Ewa or other editors, rather than by the fluent speaker who actually uttered the Nxa'amxcin example. For example:

hə̣́ll
hə̣́ll ECH
hə̣́lə̣l Y39.109
√hə̣́l-C₂
lazy; <tired>
Y39.109

This indicates that speaker Y glossed this word "lazy", but the editors would also like it to appear in the English-Nxa'amxcin word list under "tired".

However, Martin noted that the angle brackets "amount to an alternative markup system, bypassing the XML, so it's definitely not ideal -- it will make it difficult to find those particular items, or style them in a particular way. They should be tagged in the proper way at some point."

So we need to figure out how best to tag them. I suggested we could mark them all with <bibl>ECH</bibl>, but we don't actually want these glosses to appear on the database website. We just want them to be searchable when we're creating the word list.

So this is another issue to be sorted out when we next pick up the project.
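Until proper tagging is decided, the bracketed glosses can at least be located mechanically. Here is a minimal Python sketch, assuming that glosses within a field are separated by semicolons (as in the "lazy; <tired>" example above) and that angle brackets are used only for editor-supplied glosses; the function name split_glosses is made up for illustration.

```python
import re

# Matches one editor-supplied gloss written as <...>
EDITOR_GLOSS = re.compile(r"<([^<>]+)>")


def split_glosses(gloss_field):
    """Split a gloss field into (speaker glosses, editor-supplied glosses)."""
    editor = [g.strip() for g in EDITOR_GLOSS.findall(gloss_field)]
    remainder = EDITOR_GLOSS.sub("", gloss_field)
    speaker = [part.strip() for part in remainder.split(";") if part.strip()]
    return speaker, editor


print(split_glosses("lazy; <tired>"))
```

The second list is what would feed the English-Nxa'amxcin word list without appearing on the database website.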

file status report

Posted on 29 Mar 2010 in Activity log

As my first contract comes to an end, here is a summary of the status of all the files we have worked on. I have also added explanatory comments at the top of each active file.

In the tei_xml folder:

c-rtr.xml - completed and posted on database website

h-phar-part1.xml - completed by SMK. ECH needs to check phonemicizations and hyphs, but might as well wait 'til part2 is completed too.

h-phar-part2_xformed.xml - ready to edit. MDH has completed the latest XSLT transformation. Future editors, please look out for missing entries as you continue to edit this file! It could have had the same data loss problems that qw-glot and other files had.

An old copy of h-phar.xml is currently posted on the database site, but that was a mistake.

h.xml - completed and posted on database website

lex-suf-new.xml - contains entries I created for 4 lexical suffixes which could not be found in the main lex-suf.xml file. We subsequently realized that many lexical suffix cards had never been entered into the Lexware database, so the next step here is to enter the rest of those cards. Then, if further lexical suffixes are still unaccounted for, we can create new entries for them.

phar-w.xml - completed and posted on database website

qw-glot.xml - edited by SMK as far as line 469, whereupon I discovered many missing lines and entries. This problem occurred during the transformation from the "unmerged" version to the "entries_separated" version. ECH is trying to deduce what went wrong, and will post details on the blog.

s-rtr.xml - completed and posted on database website

In the tei_for_xform folder:

affix_test.xml - a small file with a copy of one entry from the main affix.xml file, made for MDH to test the following XSLT transformation on

affix.xml
-MDH needs to adapt the most recent XSLT transformation to format the dictegs in this file as follows:
-add <phr type="p" subtype="u"> </phr> at the top of each <quote>
-surround *'d words with <gloss> tags
-move bibls up from quotes to their daughter <phr type="n">s and <seg>s.

-Then ECH and SMK need to: proofread against MDK cards from line 1870, phonemicize throughout, check questions in Comment tags.

lex-suf.xml
-MDH needs to use his XSLT transformation to reformat the dictegs. (SMK has formatted the form and sense/def sections manually.)
-ECH or SMK need to phonemicize all the examples.
-This file still needs to be checked against the Lexware printout.
-ECH needs to check SMK's work.
-The file needs to be proofed against MDK's cards, and the missing lexical suffixes need to be entered, as noted above.

In the ready-to-edit folder:

All files - MDH has completed the latest XSLT transformation, but more research is needed regarding data that was lost in the earlier transformation from "unmerged" versions to "entries_separated" versions. ECH is trying to deduce what went wrong, and will post details on the blog.

New transform to fix errors in h-phar-part2

Posted on 29 Mar 2010 in Activity log

The results for this file were not exactly what we wanted -- some copies of bibl elements were not being made -- so I've revisited the transformation, and I think it's now done correctly. Waiting for SK to confirm.

March 26, 2010

compound lexical suffixes

Posted on 26 Mar 2010 in Activity log

For future reference, here's how we decided to handle MDK's entries for compound lexical suffixes (e.g. apqən, qnwil, etc.):

-keep compound lexical suffixes as their own entries in the database

-tag them with corresp in hyph - e.g.

<hyph>=<m corresp="ap-1 qin">ápqən</m> </hyph>

-add lexical-suffix-compound to their feature structures:

I added <symbol value="lexical-suffix-compound" /> to feature_system.xml after discussion with Martin and Ewa.

March 25, 2010

Working on XSL processing

Posted on 25 Mar 2010 in Activity log

Three files which were already edited or in the process of editing when we ran the last series of transformations now need reworking, so we figured out what needs doing to each (each being different). I've rewritten my transformation from the other day so that I can switch on or off various aspects of it, and made some progress with running it on two of the files, but in the process of doing this a new, more serious problem emerged, concerning data which was lost from one or more of the files during a previous transformation in 2006. We think we know what triggered it, and we also think we know which files might be affected; if I can work back through blog entries to confirm exactly what was done, in what order, to the dataset, we should be able to make the same changes with some minor alterations to undo the damage. But this is going to be significant work, so it might have to wait until the fall.

March 24, 2010

Fixed XSL for new @type attributes

Posted on 24 Mar 2010 in Activity log

@type attributes on various elements now have abbreviated values ("n" or "p" instead of "narrow" or "phonemic"). Updated and tested the XSLT to take account of this.
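For illustration only, here is a small Python sketch of the mapping between the long and abbreviated @type values. The actual XSLT change is not shown in this post; the sample markup and the function name abbreviate_types are assumptions, and only the two value pairs named above are covered.

```python
import xml.etree.ElementTree as ET

# Long values and their abbreviations, as named in this post.
TYPE_ABBREVIATIONS = {"narrow": "n", "phonemic": "p"}

# Hypothetical sample; the element name follows the <phr> examples elsewhere in this log.
SAMPLE = """<quote>
  <phr type="narrow">a</phr>
  <phr type="phonemic">b</phr>
  <phr type="n">c</phr>
</quote>"""


def abbreviate_types(root):
    """Rewrite long @type values in place and return how many were changed."""
    changed = 0
    for el in root.iter():
        value = el.get("type")
        if value in TYPE_ABBREVIATIONS:
            el.set("type", TYPE_ABBREVIATIONS[value])
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
changed = abbreviate_types(root)
print(changed, [el.get("type") for el in root])
```

Values that are already abbreviated are left untouched, so running it twice does no harm.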

bibls in notes

Posted on 24 Mar 2010 in Activity log

If a bibl belongs to the contents of a note, it needs to be inside the note element. Notes are block elements, so any subsequent content starts on the next line.

March 22, 2010

Finished the XSLT to fix our files

Posted on 22 Mar 2010 in Activity log

With periodic input from SK, I've finished writing the XSLT to transform the existing XML to something much closer to what the editors want to produce: multiple instances of many tags are collapsed into one, and bibl references are copied to multiple places. The problem now is that oXygen 11.2 seems to have serious issues running a transformation scenario on the files; it runs out of memory (I've already upped its memory allotment twice), or it simply stops after one file instead of running through all the selected files. In the end I gave up on it, and ran the transformations in oXygen 10.3, which has no problems.

March 19, 2010

root symbols in monomorphemic entries

Posted on 19 Mar 2010 in Activity log

Contrary to previous discussions ...

Yes, we ARE going to keep the root symbol √ in both <hyph>s and <dicteg>s for monomorphemic entries.

This saves me the work of taking the √ out in these contexts in all the rest of the files.

Ewa has graciously volunteered to put the √'s back IN in the appropriate places in the current active files.

Nxaʔamxcín (Moses) Dictionary Blog

This is an XML dictionary project based primarily on the materials compiled by the late M. Dale Kinkade during fifteen years of work in the 1960’s and 1970’s with more than a dozen native speakers of the language, but it also includes materials compiled by Ewa Czaykowska-Higgins in the early 1990’s.
