Nxaʔamxcín (Moses) Dictionary Blog

December 14, 2010

Fixed a rendering bug

Posted by on 14 Dec 2010 in Activity log

I had previously suppressed the rendering of empty <seg> elements and their following <bibl>s, since these are placeholders added to the file for ECH to complete when she gets to the entries. I noticed this morning that there's a parallel situation in the case of quotations, where an empty <phr> is followed by a <bibl>, so I've added suppression of those to the output as well.
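
As a rough illustration of that rule (the real suppression lives in the rendering stylesheet, which isn't reproduced here), the following Python sketch drops every empty <seg> or <phr> placeholder together with the <bibl> that immediately follows it. The function names are hypothetical, and the sketch assumes un-namespaced elements and discards any text that trails a removed element.

```python
import xml.etree.ElementTree as ET

# Element names treated as placeholders (assumed un-namespaced for this sketch).
PLACEHOLDERS = {"seg", "phr"}


def is_empty(el):
    """True when the element has no child elements and no non-whitespace text."""
    return len(el) == 0 and not (el.text or "").strip()


def suppress_placeholders(root):
    """Drop each empty <seg>/<phr> and the <bibl> sibling directly after it."""
    # Snapshot the elements first so the tree can be edited safely while looping.
    for parent in list(root.iter()):
        kept = []
        skip_bibl = False
        for child in list(parent):
            if child.tag in PLACEHOLDERS and is_empty(child):
                skip_bibl = True      # the next sibling, if a <bibl>, goes too
                continue
            if skip_bibl and child.tag == "bibl":
                skip_bibl = False
                continue
            skip_bibl = False
            kept.append(child)
        parent[:] = kept              # tail text of removed elements is lost
    return root


sample = ET.fromstring(
    "<cit><quote><phr/><bibl>JM2.194.7</bibl>"
    "<phr>word</phr><bibl>X1</bibl></quote></cit>"
)
suppress_placeholders(sample)
print(ET.tostring(sample, encoding="unicode"))
```

Only the second, completed <phr> and its <bibl> survive in the sample.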

December 13, 2010

New method of retrieving and displaying component morpheme entries

Posted by on 13 Dec 2010 in Activity log

The idea of retrieving the complete entry for a component morpheme in the context of its container entry was a good one, but the execution up to now has been quite confusing: the component morpheme entry was just dumped into the middle of the container entry without much differentiation. I've now reworked that whole system so that the component morphemes are listed after the hyphenated morpheme breakdown in the form of a tab control, with one tab for each morpheme; clicking on a tab retrieves the entry for that morpheme and shows it in the tab box. Retrieval works much as it did in the previous system (it detects whether a copy of the morpheme data already exists on the page and clones it if so), but the display is now more obviously separate from the main container entry. As usual, most of the time went on the appearance and functionality of the tab control: in addition to the normal tab features, it needs to be able to collapse itself again (when you click on the tab for an entry which is already displayed), and I wanted to get the borders working correctly. They still don't quite work in Opera, but they're good in Gecko and WebKit. It also works in IE8, although the rounded corners and box-shadow aren't there.

On my local machine, I've also ported the environment to our new build of Cocoon+eXist+FOP, with no problems at all. Tomorrow, I'll carry that over to the Pear location.

November 29, 2010

Missing data retrieved

Posted by on 29 Nov 2010 in Activity log

Just re-running the conversion seemed to work (although I did turn on the options to collapse forms and collapse senses, which probably made the difference). I've also re-done some of the changes SMK had made to the file which were documented in the <revisionDesc>. Some 11 or so entries that had previously been removed from the old file will now be back in the new version, but these will be removed again in due course as we track down duplicates (or they'll be left at the end of the whole process, and be visibly superfluous).

Data missing from rescued.xml

Posted by on 29 Nov 2010 in Activity log

Running some stats on the <dicteg>s, ECH discovered that a bunch of data is missing from these tags in the rescued.xml file. Tracing back through the original process I followed (and thankfully documented carefully here), it seems that the data was discarded unintentionally during the final phase:

Ran rescued_empties_removed_expanded_fixed.xml through collapse_forms_etc.xsl to produce rescued_empties_removed_expanded_fixed_forms_collapsed.xml.

This process changed data that looks like this:

<dicteg>
  <cit>
    <quote>√yə́ʕˀʷ+yəʕˀʷ‐t sn̩cʼələx̣ʷqén<gloss>*strong whirlwind</gloss>
    </quote>
    <bibl>JM2.194.7</bibl>
  </cit>
</dicteg>

to this:

<dicteg>
  <cit>
    <quote>
      <seg><gloss>strong</gloss> whirlwind</seg> <bibl>JM2.194.7</bibl>
    </quote>
  </cit>
</dicteg>

This can be remedied by re-running that final step once I've figured out the problem. Since rescued.xml was created, it has been edited, but only to the extent of deleting about a dozen entries which have been confirmed as dupes or migrated into the main files; it should be easy to discover which these are and remove them.
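
A check along the following lines could confirm which citations are affected, both before and after re-running the step. It is only a sketch in Python: the function name is made up, it assumes un-namespaced elements, and it treats a <quote> as damaged when it has neither leading text nor a non-empty <phr>.

```python
import xml.etree.ElementTree as ET


def quotes_missing_phrase(root):
    """Return the <bibl> text of every <cit> whose <quote> has lost its phrase."""
    missing = []
    for cit in root.iter("cit"):
        quote = cit.find("quote")
        if quote is None:
            continue
        # Original-language text sits either directly in <quote> or in a <phr>.
        has_text = bool((quote.text or "").strip())
        phr = quote.find("phr")
        has_phr = phr is not None and bool("".join(phr.itertext()).strip())
        if not (has_text or has_phr):
            bibl = cit.find(".//bibl")
            missing.append(bibl.text if bibl is not None else None)
    return missing


BEFORE = (
    "<dicteg><cit><quote>√yə́ʕˀʷ+yəʕˀʷ‐t sn̩cʼələx̣ʷqén"
    "<gloss>*strong whirlwind</gloss></quote>"
    "<bibl>JM2.194.7</bibl></cit></dicteg>"
)
AFTER = (
    "<dicteg><cit><quote><seg><gloss>strong</gloss> whirlwind</seg> "
    "<bibl>JM2.194.7</bibl></quote></cit></dicteg>"
)

print(quotes_missing_phrase(ET.fromstring(BEFORE)))  # []
print(quotes_missing_phrase(ET.fromstring(AFTER)))   # ['JM2.194.7']
```

Run over rescued.xml, the returned references would give a list to compare against once the final step has been repeated.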

The problem seems to lie with these two bits of XSLT, although I can't actually see what's wrong with them:

<xsl:variable name="tagFirstTextInQuoteAsPhr" select="true()"/>

<xsl:for-each select="./node()">
      <xsl:choose>
        <xsl:when test="self::text() and not(preceding-sibling::node())">
          <!-- Wrap the first text node in a phr tag.     -->
            <xsl:if test="$tagFirstTextInQuoteAsPhr = true()">
            <phr type="n"><xsl:value-of select="."/></phr><xsl:text>
            </xsl:text>
              <!--Append any <bibl> which is a following-sibling of the parent <quote>.-->
              <xsl:if test="(parent::node()/following-sibling::bibl) and ($moveBiblIntoQuote = true())"><xsl:copy-of select="parent::node()/following-sibling::bibl"/><xsl:text>
              </xsl:text></xsl:if>             
            </xsl:if>

        </xsl:when>
        <xsl:otherwise>  
          <!-- Apply templates to everything else. -->
          <xsl:apply-templates select="." />
        </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>

Removing duplicates after SVN revision 136

Posted by on 29 Nov 2010 in Activity log

We have been discussing removing the duplicate entries from the database, and agreed that this is the best thing to do.

In case we change our minds in the future, I hereby record that the SVN version number before I started removing duplicates was: 136.

Next Steps

Posted by on 29 Nov 2010 in Activity log

As our funding for this phase comes to an end, here is a list of what I'll need to do in the next phase.

-Go through the rest of the list of duplicate xml:ids (duplicate_ids.ods), and merge or remove duplicate entries. (It's probably better to do this file by file, as we finish editing each alphabetical file.)
[Update, 15Nov11: What I meant by the above is that this task will be ongoing as I edit each file! It does not all need to be done before going back to editing the alphabetical files.]

-Finish editing qw-glot.xml

-Enter the Particle file cards into particles.xml (being careful not to create more duplicates!). See the blog entry from 17Mar10:

http://hcmc.uvic.ca/blogs/index.php?blog=10&p=6350&more=1&c=1&tb=1&pb=1

-Lexical Suffix files:
--replace R with ʕ or ḥ as appropriate
--check lex-suf.xml against the Lexware printout
--edit lex-suf-nom.xml
--enter missing Lexical Suffixes (check back through the file cards, as it appears not all of them have been entered); enter just the suffixes, not the example words
--have the example words that are names of people been entered as words? If not, enter them.
--phonemicize the dictegs

-Phonemicize the dictegs in the affix files

-Continue editing the rest of the alphabetical files which currently have status="unedited". Do k, n, s, and t last, as these files contain many duplicates!

Meanwhile, Ewa is working on:

-additions to the affix files per blog posts 1-4Mar10
-go through affix paradigms to determine which ones still need entries; create entries for these affixes
-edit pron.xml and add feature structures
-proofread h-phar parts 1 and 2

November 24, 2010

Affix file now split into four

Posted by on 24 Nov 2010 in Activity log

I've now ordered and divided the affix.xml file. First, I sequenced the entries in the affix.xml file by @xml:id attribute, using our Moses collation. (You can use a collation through the normal Saxon mechanism in oXygen if you add the jar file as an extension in the oXygen transformation scenario.) Then I split the file into four roughly equal-length files, using initial-letter boundaries in the @xml:ids.
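
The ordering and splitting step can be pictured with the Python sketch below. It is not the Saxon collation or the transformation actually used: the alphabet string is a made-up stand-in for the Moses collation, the function names are hypothetical, and each character is ranked on its own, which ignores multi-character letters.

```python
def collation_key(word, alphabet):
    """Rank each character by its position in a custom alphabet (unknowns sort last)."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return [rank.get(ch, len(alphabet)) for ch in word]


def split_on_initials(ids, parts, alphabet):
    """Sort ids by the custom collation, then cut into roughly equal chunks.

    A cut is only made where the initial letter changes, so all ids sharing
    an initial stay in the same chunk.
    """
    ordered = sorted(ids, key=lambda w: collation_key(w, alphabet))
    target = len(ordered) / parts
    chunks, current = [], []
    for item in ordered:
        at_boundary = bool(current) and item[0] != current[-1][0]
        if at_boundary and len(current) >= target and len(chunks) < parts - 1:
            chunks.append(current)
            current = []
        current.append(item)
    chunks.append(current)
    return chunks


# Hypothetical alphabet order and ids, for illustration only.
ALPHABET = "kmnstwai"
ids = ["ta", "ka", "ma", "ki", "na", "sa", "mi", "wa"]
for chunk in split_on_initials(ids, 2, ALPHABET):
    print(chunk)
```

Because cuts are restricted to initial-letter boundaries, the chunks are only roughly equal in length, which matches the outcome described above.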

I've documented my changes in the revisionDesc, but ECH will have to update the TODO comment: part of it now relates to only one file, and the rest is problematic because it refers to a particular line number. There's also the issue of knowing which entries have been checked against the original cards and which haven't, now that they've been re-ordered.

November 19, 2010

List of duplicate xml:ids in the db

Posted by on 19 Nov 2010 in Activity log

I was intending to generate and supply a complete list of duplicate ids in the db, but it turns out that there are 1355 of them, so it's not really helpful to list them all here. I've sent the full list to SMK and ECH. However, here are the first hundred:

mix: affix.xml, m.xml
kaʔ: affix.xml, k.xml
kiyˀ: affix.xml, n.xml
ɬəm: affix.xml, t.xml
maʔ: affix.xml, m.xml
nas: affix.xml, k.xml
sal: affix.xml, n.xml
t: affix.xml, pron.xml
taʔ: affix.xml, n.xml
wap: affix.xml, ww-glot.xml
xit: affix.xml, x.xml
cʼalˀ: c-glot.xml, k.xml
cʼalˀən: c-glot.xml, k.xml
cʼəl: c-glot.xml, k.xml
sn̩cʼaʔqatkW: c-glot.xml, rescued.xml
cʼax: c-glot.xml, k.xml
cʼaʔx: c-glot.xml, k.xml
kcʼaʔxmenən: c-glot.xml, k.xml
cʼalˀ_2: c-glot.xml, n.xml
kən_nacʼalˀsən: c-glot.xml, n.xml
skacʼcʼalˀ: c-glot.xml, k.xml
n̩cʼəlˀcʼalˀsən: c-glot.xml, n.xml
cʼaɬ: c-glot.xml, k.xml
nacʼaɬən: c-glot.xml, n.xml
n̩cʼcʼaɬn̩: c-glot.xml, n.xml
ncʼcʼaɬənˀtxW: c-glot.xml, rescued.xml
kcʼaʔɬn̩čut: c-glot.xml, k.xml
sqʼəlˀnaskint: c-glot.xml, k.xml
cʼař: c-glot.xml, n.xml
cʼařt: c-glot.xml, k.xml
cʼawˀ: c-glot.xml, k.xml
sascʼawˀoxW_sqəlawʔ: c-glot.xml, q.xml
snacʼawʔsən: c-glot.xml, n.xml
nacʼawˀəlqWpm̩: c-glot.xml, n.xml
nacʼawˀɬcʼaʔ: c-glot.xml, n.xml
nacʼəwmən: c-glot.xml, n.xml
neʔcʼawpqən: c-glot.xml, n.xml
kacʼawˀəloptn̩: c-glot.xml, k.xml
ʔawˀt: c-glot.xml, glottal.xml
cʼow: c-glot.xml, n.xml
cʼək: c-glot.xml, k.xml
ncʼkʼcʼkʼax̣ən: c-glot.xml, n.xml
cʼəkW: c-glot.xml, n.xml
n̩cʼkWopsən: c-glot.xml, rescued.xml
kcʼkWicʼaʔən: c-glot.xml, k.xml
cʼəkʼW: c-glot.xml, k.xml
kčˀəkʼWxən: c-glot.xml, k.xml
cʼəl_2: c-glot.xml, k.xml
necʼəlot: c-glot.xml, n.xml
cʼəlˀ: c-glot.xml, n.xml
cʼəlˀxW: c-glot.xml, n.xml
cʼəɬ: c-glot.xml, k.xml
cʼəɬt: c-glot.xml, k.xml
n̩cʼaʔɬstonən: c-glot.xml, rescued.xml
kɬcʼmˀosənc: c-glot.xml, rescued.xml
kɬcʼəmcʼəmtwaxW: c-glot.xml, rescued.xml
cʼən: c-glot.xml, n.xml
nacʼə̣np: c-glot.xml, n.xml
cʼəpq: c-glot.xml, n.xml
neʔcʼəpq: c-glot.xml, n.xml
cʼəpʼqʼ: c-glot.xml, n.xml
kcʼəpʼqʼən: c-glot.xml, k.xml
n̩cʼəpʼqʼsalos: c-glot.xml, n.xml
cʼəqʼ: c-glot.xml, k.xml
ncʼqʼaɬcʼaʔən: c-glot.xml, rescued.xml
n̩cʼqʼosən: c-glot.xml, rescued.xml
cʼəsən: c-glot.xml, k.xml
katcʼsatkWən: c-glot.xml, k.xml
cʼəxW: c-glot.xml, k.xml
cʼəxWən: c-glot.xml, k.xml
ka·cʼəxW: c-glot.xml, k.xml
na·cʼəxWoxW: c-glot.xml, n.xml
kcʼxWanaʔn: c-glot.xml, k.xml
kcʼxWus: c-glot.xml, k.xml
kcʼxWosč: c-glot.xml, k.xml
snecʼxWawˀsoxW: c-glot.xml, rescued.xml
sn̩cʼəxWoxWwel: c-glot.xml, rescued.xml
skacʼacʼəxW: c-glot.xml, k.xml
kɬn̩cʼxWapəntaʔ_t_šawɬkW: c-glot.xml, k.xml
sxWskcʼx̣Wapl̥aʔəm: c-glot.xml, k.xml
neʔcʼikos: c-glot.xml, n.xml
cʼikʼ: c-glot.xml, k.xml
nacʼepʼsmən: c-glot.xml, n.xml
nacʼipʼcʼipʼšəm: c-glot.xml, n.xml
cʼex̣WoxW: c-glot.xml, k.xml
cʼqʼ: c-glot.xml, n.xml
kʼɬcʼaqʼWonˀən: c-glot.xml, rescued.xml
cʼaʔqʼWonˀəm: c-glot.xml, rescued.xml
n̩cʼoʔqa·pasən: c-glot.xml, rescued.xml
n̩cʼowˀqulˀoxWən: c-glot.xml, rescued.xml
nacʼopkW: c-glot.xml, n.xml
niʔcʼuqʼWuʔšən: c-glot.xml, rescued.xml
leyən_t_ʔencʼoqʼWmaʔ: c-glot.xml, l.xml
cʼos: c-glot.xml, k.xml
scʼosəm: c-glot.xml, k.xml
kcʼosəmtən: c-glot.xml, k.xml
cʼxW: c-glot.xml, k.xml
cəs: c-rtr.xml, k.xml
ʔa: glottal.xml, particles.xml
ʔacʼx̣: glottal.xml, n.xml

This XQuery will generate them (although it's off-the-cuff and doubtless much slower than it could be):

declare default element namespace "http://www.tei-c.org/ns/1.0";
declare namespace f="http://exist-db.org/f-functions";
declare namespace util="http://exist-db.org/xquery/util";

(: Return every non-empty @xml:id that occurs on more than one <entry>. :)
declare function f:getIds() as xs:string*
{
    let $e := collection('/db/moses')//entry
    for $id in distinct-values($e/@xml:id[string-length(.) gt 0])
    where count($e[@xml:id = $id]) gt 1
    return xs:string($id)
};

declare variable $ids := f:getIds();

(: List each duplicated id with the names of the first two documents that contain it. :)
for $i in $ids
let $hits := collection('/db/moses')//entry[@xml:id = $i]
return concat($i, ': ', util:document-name($hits[1]), ', ', util:document-name($hits[2]))
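
For comparison, the same report can be built in one pass by grouping ids as they are read, which avoids re-scanning every entry for each id. The Python sketch below shows the idea outside eXist; the function name and the mapping of document names to parsed roots are assumptions for illustration, not part of the project.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_ids(docs):
    """Map each duplicated xml:id to the documents it occurs in.

    `docs` maps a document name to its parsed root element.
    """
    seen = defaultdict(list)
    for name, root in docs.items():
        for entry in root.iter(TEI + "entry"):
            entry_id = entry.get(XML_ID, "")
            if entry_id:                      # skip missing or empty ids
                seen[entry_id].append(name)
    return {i: names for i, names in seen.items() if len(names) > 1}


# Two tiny stand-in documents (invented content, real ids from the list above).
affix = ET.fromstring(
    '<body xmlns="http://www.tei-c.org/ns/1.0">'
    '<entry xml:id="mix"/><entry xml:id="kaʔ"/></body>'
)
m = ET.fromstring(
    '<body xmlns="http://www.tei-c.org/ns/1.0">'
    '<entry xml:id="mix"/><entry xml:id=""/></body>'
)
for entry_id, names in duplicate_ids({"affix.xml": affix, "m.xml": m}).items():
    print(entry_id + ": " + ", ".join(names))
```

Unlike the two-document report above, this lists every document in which an id occurs.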

Search engine: inlining the gloss tag

Posted by on 19 Nov 2010 in Activity log

As a partial response to SMK's concerns about the search engine, I've defined the <gloss> tag as "inline" in the context of <entry>, <def>, <sense> and <seg>, in the index configuration file. This means that it does not constitute a word-break for the purpose of indexing, so that if you search for "dropped", you'll now find instances of <gloss>drop</gloss>ped. A natural by-product, though, is that you'll no longer find this item if you search for "drop". That's the trade-off. Without a working English stemming analyzer for eXist (and there doesn't seem to be one at the moment), there's no way to have your cake and eat it, unfortunately.
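
The trade-off can be reproduced with a toy tokenizer. This Python sketch is not eXist's or Lucene's code; it simply treats each tag as a word break unless the tag is listed as inline, which is my reading of the behaviour described above. The function name and the sample sentence are invented.

```python
import re
import xml.etree.ElementTree as ET


def index_tokens(xml_text, inline_tags=()):
    """Tokenize the text of an XML fragment into lower-cased words.

    Element boundaries count as word breaks unless the element's name
    is listed in `inline_tags`.
    """
    root = ET.fromstring(xml_text)
    parts = []

    def walk(el):
        parts.append(el.text or "")
        for child in el:
            sep = "" if child.tag in inline_tags else " "
            parts.append(sep)          # break (or not) before the child
            walk(child)
            parts.append(sep)          # break (or not) after the child
            parts.append(child.tail or "")

    walk(root)
    return re.findall(r"\w+", "".join(parts).lower())


sample = "<def>he <gloss>drop</gloss>ped it</def>"
print(index_tokens(sample))                          # ['he', 'drop', 'ped', 'it']
print(index_tokens(sample, inline_tags=("gloss",)))  # ['he', 'dropped', 'it']
```

With <gloss> as a word break, "drop" is indexed but "dropped" is not; with <gloss> inline, the reverse holds.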

November 18, 2010

Search Engine

Posted by on 18 Nov 2010 in Activity log

Martin and I have been discussing what the search engine looks for. I'm posting our discussion here for reference, because I know I will be confused about it again in the future!

SMK:

The search engine doesn't always find a string within a longer string. For example, some, but not all, instances of "dropped" come up if I search for "drop". I'm not sure why this would be, if it's searching in all text fields. Or does it only find "dropped" if there is a <gloss type="u"> included with "drop"?

MDH:

It searches for words, so if it's looking for "drop", it won't find "dropped". You don't want to search for "bed" and find "bedazzled".

Currently, the search uses Lucene's StandardAnalyzer to tokenize the text, which means that it tokenizes on word-boundaries, and does no stemming.

SMK:

I see. There are a couple of cases in the result set for "drop" where "drop" is highlighted within "dropped". I'm guessing this is because it's gloss-tagged? If this is the case, won't we have effectively stemmed everything on the English side once we have edited and gloss-tagged the rest of the entries?

MDH:

If you have this:

<gloss>drop</gloss>ped

then it would see the tag as a word-boundary, and find "drop". That's not a bad thing, I suppose.

SMK:

Yeah, so when we have gloss-tagged all the entries, most cases of a given stem, like "drop" will be split off from their affixes. So someone might search for "dropped" and not get any hits - or perhaps only the entries where "dropped" is in a dicteg. And then the user would have to be clever enough to work backwards and realize he should be searching for "drop" too.

Nxaʔamxcín (Moses) Dictionary Blog

This is an XML dictionary project based primarily on the materials compiled by the late M. Dale Kinkade during fifteen years of work in the 1960s and 1970s with more than a dozen native speakers of the language, but it also includes materials compiled by Ewa Czaykowska-Higgins in the early 1990s.
