This is based on an email from ECH, describing the context in which content was lost during the transformation from "unmerged" to "entries_separated" files:
What is happening is this. When there is a sequence in the unmerged file of the following:
<ENTRY level="002" id=""> <xx></xx> <infl>yyyyyy</infl> zzzzz zzzzz zzzz </ENTRY>
The entire entry is missing from the separated file.
These level 002 entries are derived words in the Lexware database; significantly, the entries are dropped from the transformed output when the inflection band follows the derivation band directly.
This has been observed to happen in qw-glot.xml as well as other files. I'm going to look at that file directly, find specific examples I can work with, then isolate them and work on a fix.
Results of some digging:
- qw-glot_SEPARATED.xml is constructed from the very small Q'W.xml and the much larger Q'W1CDH.xml. Both items from the former are there in qw-glot_SEPARATED.xml, so the dropped items are all from Q'W1CDH.xml.
- Some level 002 entries were definitely carried over OK (e.g. entry xml:id="niʔqʼWacʼlqs", "nosebleed"), so it's not just a question of dropping level 002 entries.
- It's not just a case of items with no @id being dropped (as I initially thought from ECH's description above); some entries with no @id value are carried over (e.g. I *fill|ed up my basket).
Here is an example which shows the problem. In the following, the outer entry (the root) is carried over, but the inner one is completely dropped:
<ENTRY level="001" id="√q'ʷáq'ʷ‐">
  <rt>√q'ʷáq'ʷ‐</rt>
  <ENTRY level="002" id="">
    <ls></ls>
    <infl>nominalizer</infl>
    <n>s‐√q'ʷáq'ʷ=əl'qʷ</n>
    <g>*prairie‐chicken, *sharp‐tailed~grouse</g>
    <gc>Y2.33 is JM only</gc>
    <k>A46; Y2.33</k>
    <var>s‐√q'ʷáq'ʷ=əl'qʷ‐aʔᵃ</var>
    <g>?</g>
    <gc>claimed by Agnes Miller to be MC, by Jerome Miller to be Colville</gc>
    <k>AM, JM</k>
    <var>s‐√q'ʷáq'ʷ=əl'qʷ‐aʔ</var>
    <g>*prairie~chicken</g>
    <k>EP2.31.8</k>
  </ENTRY>
</ENTRY>
This appears to be a situation in which the first element of the embedded item (in this case "lexical suffix") is empty, and is followed by <infl>.
I looked at the XSLT code (separate_xml.xsl) and determined that:
- An entry is only processed if its first element has a string-length of more than zero; so the empty first item causes the problem here.
- However, this is only the otherwise branch of a conditional; the first branch (presumably deemed to be the most common) expects to find an @mode attribute on elements. What it does then is to process all following items which have the same @mode attribute.
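As a rough model of the otherwise branch, here is a minimal Python sketch. It is not the project's XSLT: the function name separate_entry and the sample data are hypothetical, simplified from the example above, and the copying rule is my reading of the template in separate_xml.xsl. It shows the outer entry being output while the inner one, whose first element is empty, produces nothing.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical, simplified data modelled on the dropped entry above.
SAMPLE = """
<ENTRY level="001" id="root">
  <rt>root</rt>
  <ENTRY level="002" id="">
    <ls></ls>
    <infl>nominalizer</infl>
    <n>derived-form</n>
    <g>gloss</g>
  </ENTRY>
</ENTRY>
"""


def separate_entry(entry):
    """Model the 'otherwise' branch; return a new ENTRY, or None if nothing is output."""
    children = list(entry)
    if not children or children[0].tag == "ENTRY":
        return None
    first = children[0]
    # Like XPath string-length(.): all descendant text, with no trimming.
    if len("".join(first.itertext())) == 0:
        return None  # empty first element: the whole entry is skipped
    out = ET.Element("ENTRY")
    out.append(copy.deepcopy(first))
    # Copy following siblings that have no @mode and are not nested ENTRY elements.
    for sibling in children[1:]:
        if sibling.tag != "ENTRY" and "mode" not in sibling.attrib:
            out.append(copy.deepcopy(sibling))
    return out


root = ET.fromstring(SAMPLE)
inner = root.find("ENTRY")
print(separate_entry(root) is not None)   # True: the root entry is carried over
print(separate_entry(inner) is not None)  # False: the inner entry is dropped
```

The sketch deliberately ignores the first (@mode) branch of the conditional, since only the otherwise branch is involved in the dropped entries here.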
It looks as though this process was primarily written targeting a situation in which we needed to separate not simply embedded <ENTRY> elements, but also blocks of tags within <ENTRY> elements, which were defined by their sharing an @mode attribute value. However, the qw-glot file doesn't have ANY @mode attributes, while some files have many of them. It appears that there were two distinct methods of structuring entries in the original data, and these were converted into two slightly differing XML structures.
However, this is something of a red herring; I found another entry, in T4CDH.WRK.xml, which does make use of @mode but still exemplifies the problem (its inner <ENTRY> is lost):
<ENTRY level="001" id="√k'ᵊř">
  <rt>√k'ᵊř</rt>
  <ENTRY level="002" id="">
    <lc.ls></lc.ls>
    <infl>nominalizer</infl>
    <n mode="1">s‐t‐√k'ᵊř=álᵊqʷ</n>
    <g mode="1">tree cut with something</g>
    <k mode="1">Y24.74,77</k>
    <il.lc.ls.n mode="1">nawə́nt s‐t‐√k'ᵊř=álᵊqʷ</il.lc.ls.n>
    <df mode="1">groove or deep line cut into a tree</df>
    <k mode="1">Y24.74</k>
  </ENTRY>
</ENTRY>
So the issue is clearly with the empty tag. It's obvious from the (very simple) XSLT that in such a context, we explicitly stop processing, so nothing is output:
<xsl:if test="(not(preceding-sibling::*) and (name() != 'ENTRY'))">
  <xsl:if test="string-length(.) > 0">
    <xsl:element name="ENTRY">
      <xsl:copy-of select="."></xsl:copy-of>
      <xsl:for-each select="following-sibling::*[not(@mode)][not(name() = 'ENTRY')]">
        <xsl:copy-of select="."></xsl:copy-of>
      </xsl:for-each>
    </xsl:element>
  </xsl:if>
</xsl:if>
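One way to list the entries that this test skips is to invert it: look for every <ENTRY> whose first child is a non-ENTRY element with an empty string value. The following is a minimal Python sketch of that idea, not part of the project's code. The function name find_dropped_entries and the sample data are hypothetical, and it assumes that an empty first element is the only reason an entry is lost.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified data: one entry with an empty first element
# (modelled on the examples above) and one with a non-empty first element.
SAMPLE = """
<ENTRY level="001" id="root-a">
  <rt>root-a</rt>
  <ENTRY level="002" id="">
    <lc.ls></lc.ls>
    <infl>nominalizer</infl>
    <n mode="1">form</n>
    <g mode="1">gloss</g>
  </ENTRY>
  <ENTRY level="002" id="">
    <ls>suffix</ls>
    <n>other-form</n>
  </ENTRY>
</ENTRY>
"""


def find_dropped_entries(root):
    """Return every ENTRY (at any depth) whose first child is an empty non-ENTRY element."""
    dropped = []
    for entry in root.iter("ENTRY"):
        children = list(entry)
        if not children:
            continue
        first = children[0]
        # Inverse of the XSLT test string-length(.) > 0 on the first element.
        if first.tag != "ENTRY" and len("".join(first.itertext())) == 0:
            dropped.append(entry)
    return dropped


dropped = find_dropped_entries(ET.fromstring(SAMPLE))
for entry in dropped:
    # Report the level and the tag that follows the empty first element.
    print(entry.get("level"), entry[0].tag, "->", entry[1].tag)
```

For a real file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE). The returned elements could then be serialized with ET.tostring for separate handling.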
The question now is why -- why did we decide not to process entries that began with an empty tag? I'll write to ECH and see if she has any memory of this, and also keep digging to see if I can find a reason. Ultimately, it should be possible for me to use the same strategy in reverse to FIND all those entries, and output them specifically in one block, which could then be merged back into the new files (once it's gone through all the other processing