- Generated oggs for two new audio files.
- Fixed the normalization issues with the pron page, except for one ("text run starts with a composing character"), which as far as I can tell is an artifact of the XSLT processor. The only remaining invalidities are caused by the non-self-closing <source> tag, which is an eXist bug that I've diagnosed and reported.
- Rewrote the table-rendering code so that the pron guide table doesn't end up being sortable by clicking on the header (sorting can't work there because the table is full of colspans and rowspans).
- Fixed a bug whereby <note>s were not being rendered when they were children of <sense> (I wasn't expecting them there).
- Fixed another bug, also caused by note elements, which resulted in divs inside spans in the output. Normal pages now validate.
- Cut the site menu down to six items -- it was getting too long for a non-scrolling menu -- by moving all the dictionary pages to a sub-page called "Dictionary". Also removed Fonts from the menu, substituting a link on the About page.
- Rewrote the PDF generation so that links to morphemes are only included in linguist output mode, so the learner dictionaries don't have links at all. Linguist dictionaries generated without related morphemes have bad links, of course, but I can probably get rid of that build pipeline.
- Rationalized the file structure a bit to segregate XML files from the PDFs generated from them.
Today I've made the following changes to the PDF dictionary generation:
- choice/sic/corr is now handled correctly.
- The linguist dictionary now sorts by orthographical headword.
- Auto-orthographization is now done with phonemic segs instead of hyphs.
- Name entries are only included if the name shows up in the pron.
I've also added a morpheme-linking feature, which enumerates the morphemes in the hyph and provides a page-reference to the entry for each of those single morphemes. Obviously this only works in the dictionaries which include all the related morphemes, but you can see it at work if you look at the names-only_linguist_with_related.pdf file.
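The core of the morpheme-linking feature is straightforward in XSL-FO, roughly along these lines (the id convention and element structure here are assumptions for illustration, not the live code):

```xml
<!-- Sketch: for each morpheme in the hyph, print its form followed
     by the page on which the corresponding entry starts. Assumes
     each entry's xml:id matches the part of @corresp after 'm:'. -->
<xsl:template match="hyph/m[@corresp]">
  <fo:inline>
    <xsl:value-of select="."/>
    <xsl:text> (p. </xsl:text>
    <fo:page-number-citation ref-id="{substring-after(@corresp, 'm:')}"/>
    <xsl:text>)</xsl:text>
  </fo:inline>
</xsl:template>
```

fo:page-number-citation resolves at render time to the page number of the object carrying the matching id, which is why this only works when the related morpheme entries are actually present in the same PDF.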
- The output we want for the names dictionaries is the one without component morphemes, learner version.
- Linguist dictionaries should have headwords in orth form, and should sort by them.
- Auto-orths should be done from the phonemic pron, not from the hyph, to avoid the reduplication problem.
- In names dictionaries, include an entry only if its name tag appears in a pron.
- sic/corr are still not working properly in the PDF output. That's because the template for <def> does not process tags inside <seg> (so a <persName> containing a <choice> just gets output as text):

    <xsl:template match="def">
      <xsl:for-each select="seg">
        <xsl:if test="preceding-sibling::seg">
          <fo:inline>; </fo:inline>
        </xsl:if>
        <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
    </xsl:template>

Obviously more subtlety is required here.
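One possible fix, sketched here on the assumption that the stylesheet already has (or will get) templates for the markup that can occur inside <seg>, is to replace xsl:value-of with xsl:apply-templates, and to add a template that outputs <corr> rather than <sic> from <choice>:

```xml
<!-- Hypothetical revision: apply-templates instead of value-of, so
     markup nested inside seg (persName, choice, etc.) is handled by
     its own templates rather than flattened to text. -->
<xsl:template match="def">
  <xsl:for-each select="seg">
    <xsl:if test="preceding-sibling::seg">
      <fo:inline>; </fo:inline>
    </xsl:if>
    <xsl:apply-templates/>
  </xsl:for-each>
</xsl:template>

<!-- Show the correction, suppress the original reading. -->
<xsl:template match="choice[sic and corr]">
  <xsl:apply-templates select="corr"/>
</xsl:template>
```

Note that this version loses the normalize-space() call, so stray whitespace inside <seg> would have to be handled by other means (xsl:strip-space, for instance).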
The audio is stored in home1t/moses/www/audio, all together in one folder. This includes aup files and their associated folders (Audacity projects), mp3s generated by CB, and the OGG files I auto-generated from the mp3s.
We're linking to them like this:
<ref type="audio" target="mosaud:filename.mp3">[word]</ref>
From that, I'm generating an HTML5 <audio> element with two <source> element children, one for mp3 and one for ogg (my Linux Firefox still won't play mp3s). I've substituted a single play button for the standard controls, which are overkill for these tiny extracts.
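A minimal sketch of the kind of template involved (the audio path, the id convention, and the button markup are assumptions here, not the live code):

```xml
<!-- Sketch: turn a mosaud: audio ref into an HTML5 audio element
     with mp3 and ogg sources, plus a single play button. -->
<xsl:template match="ref[@type='audio']">
  <xsl:variable name="base"
    select="substring-before(substring-after(@target, 'mosaud:'), '.mp3')"/>
  <xsl:apply-templates/>
  <audio id="aud_{$base}">
    <source src="audio/{$base}.mp3" type="audio/mpeg"/>
    <source src="audio/{$base}.ogg" type="audio/ogg"/>
  </audio>
  <button type="button"
    onclick="document.getElementById('aud_{$base}').play()">&#9658;</button>
</xsl:template>
```

Since browsers skip any <source> whose type they can't play, listing mp3 first and ogg second covers both the common case and my Linux Firefox.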
One problem is that eXist currently outputs the <source> elements with closing tags instead of as self-closers, and that generates an error in the HTML5 validator. I've filed an eXist bug for that.
Another problem we have is that the audio is extremely noisy. It could easily be cleaned up; the question is whether it would be better to clean up the full stretches and re-capture the individual words, or just clean up the segments we're using.
Fixed a display issue (we should show corrs, not sics, from choice elements). Other bugs exist, though, and three more dictionary types are required.
I now have an XSLT file called generate_names_only_dictionary which runs on itself and generates a whole stack of eight different XML files in dictionary_test/names. A subsequent pair of transformation scenarios (moses_xml_to_pdf_*_names_only) run on these XML files to generate 16 PDFs. These break down as follows:
- all the names
- only the fauna
- only the flora
- only the storyPeople
For each of those four, there are two different content sets:
- one with only the name entries
- one with the name entries plus all the morphemes that constitute them
For each of those two, there are two PDFs:
- one in learner format
- one in linguist format
So there are 4 × 2 × 2 = 16 different PDFs.
None of these dictionaries have any introductory material at all; someone will have to write that stuff, once we know what they're going to be used for and who will be using them.
I think this should cover all the options we can imagine for names-only dictionaries, and I have a two-stage pipeline for creating them so it's easy to regenerate them whenever we want.
We've had to redo two of the global transforms we did to automate some markup tasks, because they turned out to have made changes we neither expected nor wanted. This is mainly because the original Lexware encoding (and thus the XML generated from it) varies so greatly from file to file that what works for one file doesn't necessarily work for another. Lessons learned. Luckily SVN helps us figure out what went wrong, and allows us to recover from it.
As part of the preparation for work on the auto-hyphenator, I've generated a list of all the distinct forms of morphemes and what they link to, initially using this XQuery (which takes a long while to run):
xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
for $mText in distinct-values(//m[@corresp])
return concat('morpheme form: ', $mText, ' links to ', string-join(distinct-values(//m[text() = $mText]/@corresp), ', '))
and then trimming the results to remove everything that links only to m:UNASSIGNED, as well as removing links to m:UNASSIGNED from the lists of other links (should have built that into the XQuery). We can now use this list to spot candidates for auto-assignment (starting with those forms which only ever link to a single morpheme entry).
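For the record, that filtering could have been folded into the query itself, roughly like this (an untested sketch of the revision):

```xquery
xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
(: As above, but dropping m:UNASSIGNED from the link lists inline,
   and skipping forms whose only link is m:UNASSIGNED. :)
for $mText in distinct-values(//m[@corresp])
let $links := distinct-values(//m[text() = $mText]/@corresp[. ne 'm:UNASSIGNED'])
where exists($links)
return concat('morpheme form: ', $mText, ' links to ', string-join($links, ', '))
```

This would be just as slow, of course; the cost is in the repeated //m scans, not the filtering.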
Faced with the task of creating a print dictionary consisting only of the name entries, I was initially stumped by the incidence of duplicate xml:ids across the collection. Previously, my print dictionary processing has depended on manually selecting only those files whose status is complete, across which there are no duplicate ids. However, the names are sprinkled throughout the whole collection, including lots of files which have not yet been edited and therefore still contain duplicate ids.
After some thought, I set up this process, running in dictionary_test:
- First, the auto-orthography transformation is run against all the files in the dictionary directory (the live files) to create auto-orthographized versions in dictionary_test.
- A new file called master_all.xml XIncludes all the entry files in dictionary_test. This file is obviously invalid because of the duplicate ids, but it can be processed with XSLT.
- Next, a transformation called generate_names-only_dictionary.xsl pulls out all the name entries, along with all the completed root, stem and affix entries to which the name entries link in their morpheme elements, and creates from them a file called master_names-only.xml.
- Finally, the moses_master_to_pdf_LINGUIST transformation scenario is run on the master_names-only.xml file to generate the PDF dictionary.
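The master_all.xml shell in step 2 might look something like this (the root element and filenames here are placeholders, not the actual file):

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:xi="http://www.w3.org/2001/XInclude">
  <!-- One xi:include per auto-orthographized entry file. -->
  <xi:include href="some_entry_file.xml"/>
  <xi:include href="another_entry_file.xml"/>
  <!-- ... -->
</TEI>
```

The point is that XSLT doesn't care about the duplicate-id invalidity, so this shell is a perfectly usable processing input even though it would never validate.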
In the process, I found and fixed a couple of errors, including a duplicate id between two name entries. I also noticed a new problem we'll have to work on: the TEI Schematron embedded in the RelaxNG schema disallows @subtype on the <gloss> element when there is no @type (quite reasonably, perhaps), but we're using @subtype by analogy with what we do on <seg>, having removed @type from the schema. We should probably handle this by creating a new datatype and using @type on <gloss> instead of @subtype.
Per SMK's blog post task #4: all the duplicate defs have now been collapsed. This was done based on identity as determined by deep-equal(), so there may be some cases where near-duplicates still survive; I don't see a way to automate the decision on whether a near-duplicate should be deemed a duplicate.
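The collapsing itself can be expressed very compactly in XSLT 2.0; a sketch of the approach (the identity transform copies everything else through):

```xml
<!-- Identity transform: copy everything not matched elsewhere. -->
<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

<!-- Drop any def that is deep-equal to an earlier sibling def. -->
<xsl:template match="def[some $prev in preceding-sibling::def
                         satisfies deep-equal($prev, .)]"/>
```

deep-equal() compares the full subtrees, attributes and all, which is exactly why near-duplicates (e.g. differing only in punctuation) escape it.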