I had previously suppressed the rendering of empty <seg> elements and their following <bibl>s, since these are placeholders added to the file for ECH to complete when she gets to the entries. I noticed this morning that there's a parallel situation in the case of quotations, where an empty <phr> is followed by a <bibl>, so I've added suppression of those in the output.
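The suppression rule can be sketched outside the rendering pipeline. This is a minimal Python illustration, not the project's actual rendering code: it assumes simplified, un-namespaced markup, and the function names and the second <bibl> value are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder elements whose empty instances should be hidden (assumption:
# un-namespaced tags, unlike the real TEI files).
PLACEHOLDERS = {"seg", "phr"}

def is_empty(el):
    """True if the element has no children and no non-whitespace text."""
    return len(el) == 0 and not (el.text or "").strip()

def visible_children(parent):
    """Return parent's children, skipping each empty <seg>/<phr>
    and the <bibl> that immediately follows it."""
    kept = []
    skip_bibl = False
    for child in parent:
        if child.tag in PLACEHOLDERS and is_empty(child):
            skip_bibl = True          # hide the placeholder itself
            continue
        if skip_bibl and child.tag == "bibl":
            skip_bibl = False         # hide the placeholder's <bibl>
            continue
        skip_bibl = False
        kept.append(child)
    return kept

# "EXAMPLE" is a made-up reference used only for illustration.
sample = ET.fromstring(
    "<quote><phr/><bibl>JM2.194.7</bibl>"
    "<phr>text</phr><bibl>EXAMPLE</bibl></quote>"
)
print([(c.tag, c.text) for c in visible_children(sample)])
```

Only the filled <phr> and its own <bibl> survive; the empty placeholder pair is dropped.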
The idea of retrieving the complete entry for a component morpheme in the context of its container entry was a good one, but the execution up to now has been quite confusing; the component morpheme entry was just dumped into the middle of the container entry without much differentiation. I've now re-worked that whole system, so that the component morphemes are listed after the hyphenated morpheme breakdown in the form of a tab control, with one tab for each morpheme; clicking on a tab retrieves the entry for that morpheme and shows it in the tab box. It retrieves in a similar way to the previous system, detecting if a copy of this morpheme data already exists on the page and cloning it if it does, but it makes the display more obviously separate from the main container entry. As usual, most of the time spent was on the appearance and functionality of the tab control; in addition to normal tab features, it needs to be able to collapse itself again (when you click on the tab for an entry which is already displayed), and I wanted to get the borders working correctly. They still don't quite work in Opera, but they're good in Gecko and Webkit. It works in IE8, although the rounded corners and box-shadow aren't there.
On my local machine, I've also ported the environment to our new build of Cocoon+eXist+FOP, with no problems at all. Tomorrow, I'll carry that over to the Pear location.
Just re-running the conversion seemed to work (although I did turn on the options to collapse forms and collapse senses, which probably made the difference). I've also re-done some of the changes SMK had made to the file which were documented in the <revisionDesc>. About 11 entries that had been removed from the old version of the file are now back in the new version; these will be removed again in due course as we track down duplicates (or they'll be left at the end of the whole process, and be visibly superfluous).
Running some stats on the <dicteg>s, ECH discovered that a bunch of data is missing from these tags in the rescued.xml file. Tracing back through the original process I followed (and thankfully documented carefully here), it seems that the data was discarded unintentionally during the final phase:
- Ran rescued_empties_removed_expanded_fixed.xml through collapse_forms_etc.xsl to produce rescued_empties_removed_expanded_fixed_forms_collapsed.xml.
This process changed data that looks like this:
<dicteg> <cit> <quote>√yə́ʕˀʷ+yəʕˀʷ‐t sn̩cʼələx̣ʷqén<gloss>*strong whirlwind</gloss> </quote> <bibl>JM2.194.7</bibl> </cit> </dicteg>
to this:
<dicteg> <cit> <quote> <seg><gloss>strong</gloss> whirlwind</seg> <bibl>JM2.194.7</bibl> </quote> </cit> </dicteg> <dicteg>
This can be remedied by re-running that final step once I've figured out the problem. Since rescued.xml was created, it has been edited, but only to the extent of deleting about a dozen entries which have been confirmed as dupes or migrated into the main files; it should be easy to discover which these are and remove them.
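Working out which entries were deleted after rescued.xml was created amounts to a set difference on xml:id values. The following is a rough Python sketch under stated assumptions: the function names and the sample ids are hypothetical, and the two documents are passed in as strings rather than read from the real files.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def entry_ids(xml_text):
    """Collect the xml:id of every <entry>, whatever namespace it is in."""
    root = ET.fromstring(xml_text)
    return {
        el.get(XML_ID)
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "entry" and el.get(XML_ID)
    }

def deleted_since(regenerated_xml, edited_xml):
    """Ids present in the regenerated file but absent from the edited one."""
    return sorted(entry_ids(regenerated_xml) - entry_ids(edited_xml))

# Hypothetical miniature documents standing in for the real files.
regenerated = '<body><entry xml:id="a"/><entry xml:id="b"/><entry xml:id="c"/></body>'
edited = '<body><entry xml:id="a"/><entry xml:id="c"/></body>'
print(deleted_since(regenerated, edited))  # → ['b']
```

The ids returned are the candidates to remove again from the re-run output.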
The problem seems to lie with these two bits of XSLT, although I can't actually see what's wrong with them:
<xsl:variable name="tagFirstTextInQuoteAsPhr" select="true()"/>
<xsl:for-each select="./node()">
  <xsl:choose>
    <xsl:when test="self::text() and not(preceding-sibling::node())">
      <!-- Wrap the first text node in a phr tag. -->
      <xsl:if test="$tagFirstTextInQuoteAsPhr = true()">
        <phr type="n"><xsl:value-of select="."/></phr><xsl:text> </xsl:text>
        <!--Append any <bibl> which is a following-sibling of the parent <quote>.-->
        <xsl:if test="(parent::node()/following-sibling::bibl) and ($moveBiblIntoQuote = true())"><xsl:copy-of select="parent::node()/following-sibling::bibl"/><xsl:text> </xsl:text></xsl:if>
      </xsl:if>
    </xsl:when>
    <xsl:otherwise>
      <!-- Apply templates to everything else. -->
      <xsl:apply-templates select="." />
    </xsl:otherwise>
  </xsl:choose>
</xsl:for-each>
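For comparison, here is one reading of what the comments in that stylesheet fragment describe, written as a small Python sketch. It is not a diagnosis of the XSLT problem and not the project's code: the function name is hypothetical, the markup is un-namespaced, and it moves the <bibl> rather than copying it, which is an assumption about the intended end result.

```python
import xml.etree.ElementTree as ET

def wrap_first_text(cit, move_bibl_into_quote=True):
    """Wrap the leading text of <quote> in <phr type="n"> and, optionally,
    move the <bibl> that follows <quote> to just after the new <phr>."""
    quote = cit.find("quote")
    lead = (quote.text or "").strip()
    if not lead:
        return cit                      # nothing to wrap
    phr = ET.Element("phr", {"type": "n"})
    phr.text = lead                     # the original text is preserved here
    phr.tail = " "
    quote.text = None
    quote.insert(0, phr)
    if move_bibl_into_quote:
        bibl = cit.find("bibl")         # direct child of <cit> only
        if bibl is not None:
            cit.remove(bibl)
            bibl.tail = " "
            quote.insert(1, bibl)
    return cit

# Input shaped like the "before" example above.
cit = ET.fromstring(
    "<cit><quote>√yə́ʕˀʷ+yəʕˀʷ‐t sn̩cʼələx̣ʷqén"
    "<gloss>*strong whirlwind</gloss> </quote> "
    "<bibl>JM2.194.7</bibl></cit>"
)
print(ET.tostring(wrap_first_text(cit), encoding="unicode"))
```

The point of the comparison is that the leading text node should end up inside the <phr>, rather than disappearing as it did in rescued.xml.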
We have been discussing removing the duplicate entries from the database, and agreed that this is the best thing to do.
In case we change our minds in the future, I hereby record that the SVN version number before I started removing duplicates was: 136.
As our funding for this phase comes to an end, here is a list of what I'll need to do in the next phase.
-Go through the rest of the list of duplicate xml:ids (duplicate_ids.ods), and merge or remove duplicate entries. (It's probably better to do this file by file, as we finish editing each alphabetical file.)
[Update, 15Nov11: What I meant by the above is that this task will be ongoing as I edit each file! It does not all need to be done before going back to editing the alphabetical files.]
-Finish editing qw-glot.xml
-Enter the Particle file cards to particles.xml (being careful not to create more duplicates!) See the blog entry from 17Mar10:
http://hcmc.uvic.ca/blogs/index.php?blog=10&p=6350&more=1&c=1&tb=1&pb=1
-Lexical Suffix files:
--replace R with ʕ or ḥ as appropriate
--check lex-suf.xml against the Lexware printout
--edit lex-suf-nom.xml
--enter missing Lexical Suffixes (check back through the file cards, as it appears not all of them have been entered); enter just the suffixes, not the example words
--have the example words that are names of people been entered as words? If not, enter them.
--phonemicize the dictegs
-Phonemicize the dictegs in the affix files
-Continue editing the rest of the alphabetical files which currently have status="unedited". Do k, n, s, and t last, as these files contain many duplicates!
Meanwhile, Ewa is working on:
-additions to the affix files per blog posts 1-4Mar10
-go through affix paradigms to determine which ones still need entries; create entries for these affixes
-edit pron.xml and add feature structures
-proofread h-phar parts 1 and 2
I've now ordered and divided the affix.xml file. First, I sequenced the entries in the affix.xml file by @xml:id attribute, using our Moses collation. (You can use a collation through the normal Saxon mechanism in oXygen if you add the jar file as an extension in the oXygen transformation scenario.) Then I split the file into four roughly equal-length files, using initial-letter boundaries in the @xml:ids.
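The splitting step can be sketched as follows. This is an illustrative Python version only, not the procedure actually used: it assumes the ids are already in collation order, treats the first code point of each id as its initial letter (a simplification for ids that begin with a multi-character letter), and uses made-up ids.

```python
from itertools import groupby

def split_on_initials(sorted_ids, parts=4):
    """Split an already-sorted id list into up to `parts` chunks,
    cutting only where the initial letter changes."""
    groups = [list(g) for _, g in groupby(sorted_ids, key=lambda s: s[0])]
    target = len(sorted_ids) / parts
    chunks, current = [], []
    for group in groups:
        # Close the current chunk once it has reached the target size,
        # keeping at least one chunk in reserve for the remainder.
        if current and len(current) >= target and len(chunks) < parts - 1:
            chunks.append(current)
            current = []
        current.extend(group)
    if current:
        chunks.append(current)
    return chunks

# Hypothetical ids, already sorted.
ids = ["a1", "a2", "b1", "c1", "c2", "c3", "d1", "e1"]
for chunk in split_on_initials(ids):
    print(chunk)
```

Because cuts fall only on initial-letter boundaries, chunk sizes are uneven and there may be fewer than four chunks when a few letters dominate.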
I've documented my changes in the revisionDesc, but ECH will have to update the TODO comment; part of it now only relates to one file, and the other half is problematic because it refers to a particular line number. There's also the issue of knowing which entries have been checked against the original cards and which haven't, now they've been re-ordered.
I was intending to generate and supply a complete list of duplicate ids in the db, but it turns out that there are 1355 of them, so it's not really helpful to list them. I've sent the full list to SML and ECH. However, here are the first hundred:
- mix: affix.xml, m.xml
- kaʔ: affix.xml, k.xml
- kiyˀ: affix.xml, n.xml
- ɬəm: affix.xml, t.xml
- maʔ: affix.xml, m.xml
- nas: affix.xml, k.xml
- sal: affix.xml, n.xml
- t: affix.xml, pron.xml
- taʔ: affix.xml, n.xml
- wap: affix.xml, ww-glot.xml
- xit: affix.xml, x.xml
- cʼalˀ: c-glot.xml, k.xml
- cʼalˀən: c-glot.xml, k.xml
- cʼəl: c-glot.xml, k.xml
- sn̩cʼaʔqatkW: c-glot.xml, rescued.xml
- cʼax: c-glot.xml, k.xml
- cʼaʔx: c-glot.xml, k.xml
- kcʼaʔxmenən: c-glot.xml, k.xml
- cʼalˀ_2: c-glot.xml, n.xml
- kən_nacʼalˀsən: c-glot.xml, n.xml
- skacʼcʼalˀ: c-glot.xml, k.xml
- n̩cʼəlˀcʼalˀsən: c-glot.xml, n.xml
- cʼaɬ: c-glot.xml, k.xml
- nacʼaɬən: c-glot.xml, n.xml
- n̩cʼcʼaɬn̩: c-glot.xml, n.xml
- ncʼcʼaɬənˀtxW: c-glot.xml, rescued.xml
- kcʼaʔɬn̩čut: c-glot.xml, k.xml
- sqʼəlˀnaskint: c-glot.xml, k.xml
- cʼař: c-glot.xml, n.xml
- cʼařt: c-glot.xml, k.xml
- cʼawˀ: c-glot.xml, k.xml
- sascʼawˀoxW_sqəlawʔ: c-glot.xml, q.xml
- snacʼawʔsən: c-glot.xml, n.xml
- nacʼawˀəlqWpm̩: c-glot.xml, n.xml
- nacʼawˀɬcʼaʔ: c-glot.xml, n.xml
- nacʼəwmən: c-glot.xml, n.xml
- neʔcʼawpqən: c-glot.xml, n.xml
- kacʼawˀəloptn̩: c-glot.xml, k.xml
- ʔawˀt: c-glot.xml, glottal.xml
- cʼow: c-glot.xml, n.xml
- cʼək: c-glot.xml, k.xml
- ncʼkʼcʼkʼax̣ən: c-glot.xml, n.xml
- cʼəkW: c-glot.xml, n.xml
- n̩cʼkWopsən: c-glot.xml, rescued.xml
- kcʼkWicʼaʔən: c-glot.xml, k.xml
- cʼəkʼW: c-glot.xml, k.xml
- kčˀəkʼWxən: c-glot.xml, k.xml
- cʼəl_2: c-glot.xml, k.xml
- necʼəlot: c-glot.xml, n.xml
- cʼəlˀ: c-glot.xml, n.xml
- cʼəlˀxW: c-glot.xml, n.xml
- cʼəɬ: c-glot.xml, k.xml
- cʼəɬt: c-glot.xml, k.xml
- n̩cʼaʔɬstonən: c-glot.xml, rescued.xml
- kɬcʼmˀosənc: c-glot.xml, rescued.xml
- kɬcʼəmcʼəmtwaxW: c-glot.xml, rescued.xml
- cʼən: c-glot.xml, n.xml
- nacʼə̣np: c-glot.xml, n.xml
- cʼəpq: c-glot.xml, n.xml
- neʔcʼəpq: c-glot.xml, n.xml
- cʼəpʼqʼ: c-glot.xml, n.xml
- kcʼəpʼqʼən: c-glot.xml, k.xml
- n̩cʼəpʼqʼsalos: c-glot.xml, n.xml
- cʼəqʼ: c-glot.xml, k.xml
- ncʼqʼaɬcʼaʔən: c-glot.xml, rescued.xml
- n̩cʼqʼosən: c-glot.xml, rescued.xml
- cʼəsən: c-glot.xml, k.xml
- katcʼsatkWən: c-glot.xml, k.xml
- cʼəxW: c-glot.xml, k.xml
- cʼəxWən: c-glot.xml, k.xml
- ka·cʼəxW: c-glot.xml, k.xml
- na·cʼəxWoxW: c-glot.xml, n.xml
- kcʼxWanaʔn: c-glot.xml, k.xml
- kcʼxWus: c-glot.xml, k.xml
- kcʼxWosč: c-glot.xml, k.xml
- snecʼxWawˀsoxW: c-glot.xml, rescued.xml
- sn̩cʼəxWoxWwel: c-glot.xml, rescued.xml
- skacʼacʼəxW: c-glot.xml, k.xml
- kɬn̩cʼxWapəntaʔ_t_šawɬkW: c-glot.xml, k.xml
- sxWskcʼx̣Wapl̥aʔəm: c-glot.xml, k.xml
- neʔcʼikos: c-glot.xml, n.xml
- cʼikʼ: c-glot.xml, k.xml
- nacʼepʼsmən: c-glot.xml, n.xml
- nacʼipʼcʼipʼšəm: c-glot.xml, n.xml
- cʼex̣WoxW: c-glot.xml, k.xml
- cʼqʼ: c-glot.xml, n.xml
- kʼɬcʼaqʼWonˀən: c-glot.xml, rescued.xml
- cʼaʔqʼWonˀəm: c-glot.xml, rescued.xml
- n̩cʼoʔqa·pasən: c-glot.xml, rescued.xml
- n̩cʼowˀqulˀoxWən: c-glot.xml, rescued.xml
- nacʼopkW: c-glot.xml, n.xml
- niʔcʼuqʼWuʔšən: c-glot.xml, rescued.xml
- leyən_t_ʔencʼoqʼWmaʔ: c-glot.xml, l.xml
- cʼos: c-glot.xml, k.xml
- scʼosəm: c-glot.xml, k.xml
- kcʼosəmtən: c-glot.xml, k.xml
- cʼxW: c-glot.xml, k.xml
- cəs: c-rtr.xml, k.xml
- ʔa: glottal.xml, particles.xml
- ʔacʼx̣: glottal.xml, n.xml
This XQuery will generate them (although it's off-the-cuff and doubtless much slower than it could be):
declare default element namespace "http://www.tei-c.org/ns/1.0";
declare namespace f="http://exist-db.org/f-functions";
declare namespace util="http://exist-db.org/xquery/util";

declare function f:getIds() as xs:string* {
  let $e := collection('/db/moses')//entry
  for $id in distinct-values($e/@xml:id[string-length(.) gt 0])
  where count($e[@xml:id = $id]) gt 1
  return xs:string($id)
};

declare variable $ids := f:getIds();

for $i in $ids
return concat($i, ': ',
  util:document-name(collection('/db/moses')//entry[@xml:id=$i][1]), ', ',
  util:document-name(collection('/db/moses')//entry[@xml:id=$i][2]))
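The same report can be produced in a single pass over the files, which avoids re-querying the collection for every id. The sketch below is a Python illustration rather than a replacement for the XQuery: the function name is hypothetical, and the documents are supplied as an in-memory mapping instead of being read from the eXist collection.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def duplicate_ids(docs):
    """Map each xml:id found on more than one <entry> to the files holding it.

    `docs` maps a file name to that file's XML text.
    """
    seen = defaultdict(list)
    for name, text in docs.items():
        for entry in ET.fromstring(text).iter(TEI + "entry"):
            xml_id = entry.get(XML_ID)
            if xml_id:                      # ignore missing or empty ids
                seen[xml_id].append(name)
    return {i: files for i, files in seen.items() if len(files) > 1}

# Miniature stand-ins for two of the real files.
docs = {
    "affix.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
                 '<entry xml:id="mix"/><entry xml:id="kaʔ"/></TEI>',
    "m.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
             '<entry xml:id="mix"/></TEI>',
}
for xml_id, files in duplicate_ids(docs).items():
    print(f"{xml_id}: {', '.join(files)}")  # → mix: affix.xml, m.xml
```

Unlike the query above, this lists every file containing the id, not just the first two.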
As a partial response to SMK's concerns about the search engine, I've defined the <gloss> tag as "inline" in the context of <entry>, <def>, <sense> and <seg>, in the index configuration file. This means that it does not constitute a word-break for the purpose of indexing, so that if you search for "dropped", you'll now find instances of <gloss>drop</gloss>ped. A natural by-product, though, is that you'll no longer find this item if you search for "drop". That's the trade-off. Without a working English stemming analyzer for eXist (and there doesn't seem to be one at the moment), there's no way to have your cake and eat it, unfortunately.
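The trade-off is easy to see with a toy tokenizer. The Python sketch below is only an illustration of the word-boundary effect: it is not Lucene's StandardAnalyzer or eXist's index configuration, and the function name and sample sentence are invented.

```python
import re
import xml.etree.ElementTree as ET

def tokens(xml_text, inline=frozenset()):
    """Lower-cased word tokens. Tags named in `inline` do not break words;
    every other tag is treated as a word boundary."""
    def flatten(el):
        sep = "" if el.tag in inline else " "
        parts = [el.text or ""]
        for child in el:
            parts.append(flatten(child))
            parts.append(child.tail or "")
        return sep + "".join(parts) + sep
    return re.findall(r"\w+", flatten(ET.fromstring(xml_text)).lower())

sample = "<def>He <gloss>drop</gloss>ped it.</def>"
print(tokens(sample))                    # → ['he', 'drop', 'ped', 'it']
print(tokens(sample, inline={"gloss"}))  # → ['he', 'dropped', 'it']
```

With the tag as a boundary, a search for "drop" matches and "dropped" does not; with the tag inline, it is the other way round.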
Martin and I have been discussing what the search engine looks for. I'm posting our discussion here for reference, because I know I will be confused about it again in the future!
SMK:
The search engine doesn't always find a string within a longer string. For example, some, but not all, instances of "dropped" come up if I search for "drop". I'm not sure why this would be, if it's searching in all text fields. Or does it only find "dropped" if there is a <gloss type="u"> included with "drop"?
MDH:
It searches for words, so if it's looking for "drop", it won't find "dropped". You don't want to search for "bed" and find "bedazzled". Currently, the search uses Lucene's StandardAnalyzer to tokenize the text, which means that it tokenizes on word-boundaries, and does no stemming.
SMK:
I see. There are a couple of cases in the result set for "drop" where "drop" is highlighted within "dropped". I'm guessing this is because it's gloss-tagged? If this is the case, won't we have effectively stemmed everything on the English side once we have edited and gloss-tagged the rest of the entries?
MDH:
If you have this:
<gloss>drop</gloss>ped
then it would see the tag as a word-boundary, and find "drop". That's not a bad thing, I suppose.
SMK:
Yeah, so when we have gloss-tagged all the entries, most cases of a given stem, like "drop" will be split off from their affixes. So someone might search for "dropped" and not get any hits - or perhaps only the entries where "dropped" is in a dicteg. And then the user would have to be clever enough to work backwards and realize he should be searching for "drop" too.