The problem of duplicate @xml:id attributes on entries has now become a serious issue for building the print dictionary, because I'm unable to process the entire collection to produce the book. To build the dictionary I have to use XInclude to create a single XML source file, and when I do that there are over 1,600 duplicate ids, which prevent some of the processing steps from succeeding.
I've taken a quick look at where the duplicates tend to be concentrated, by adding the files in alphabetical order and counting how many duplicates each addition introduces. These files create no problems (i.e. they have no duplicates among themselves):
affix_glot-ix.xml affix_k-m.xml affix_n-t.xml affix_u-CAPS.xml c.xml c-glot.xml c-rtr.xml glottal.xml h.xml h-phar-part1.xml h-phar-part2.xml l-affric.xml lex-suff.xml new-data-2013.xml p-glot.xml phar-w.xml qw-glot.xml s-rtr.xml t-glot.xml xw.xml
When I add the remaining files, one by one (and only one at a time), these are the results:
k.xml: 100 duplicates; k-glot.xml: 18; kw.xml: 2; kw-glot.xml: 2; l.xml: 3; l-fric.xml: 6; m.xml: 3; n.xml: 97; p.xml: 7; particles.xml: 4; pron.xml: 2; q.xml: 4; q-glot.xml: 3; qw.xml: 1; rescued.xml: 54; s.xml: 2; t.xml: 20; ww-glot.xml: 4; x.xml: 3; x-uvul.xml: 4; yy-glot.xml: 4
What I'm going to do is develop the dictionary output using only the valid files, and then add the others in as they get fixed. In the meantime, it might be worth having a go at some of the low-hanging fruit (the ones with only two or three duplicates). More will show up as we add those in, of course -- there will be duplicates across the currently-excluded files as well as those that they share with the "good" files. So the dictionary PDFs will shrink in size, but I'll be able to start doing things like generating page-references that depend on xml:ids.
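The duplicate check itself doesn't need a full XInclude build each time. Here's a minimal sketch in Python of the counting exercise above (hypothetical — the real build uses XInclude and the XML toolchain, and a regex scan like this is only a rough first pass that ignores commented-out markup):

```python
import re
from collections import defaultdict

def find_duplicate_ids(paths):
    """Report each xml:id value that occurs more than once across the files,
    with every file it appears in (a file is listed twice if the id is
    duplicated within it)."""
    locations = defaultdict(list)
    id_pattern = re.compile(r'xml:id="([^"]+)"')
    for path in paths:
        with open(path, encoding='utf-8') as f:
            for xid in id_pattern.findall(f.read()):
                locations[xid].append(path)
    return {xid: files for xid, files in locations.items() if len(files) > 1}
```

Running this over the "good" files first, then adding the excluded files one at a time, reproduces the per-file duplicate counts without repeated dictionary builds.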
Lucene-based fuzzy matching seems to be very broken in the build of eXist I'm using, and in any case it's based on Levenshtein distance, so I've implemented a crude version of the USM/NCD (Universal Similarity Metric / Normalized Compression Distance) algorithm in XQuery. It's a long way from ideal, though, because it works on base64 versions of strings rather than compressing the actual strings (that's all I can do with eXist's exposed gzip access). Using zip instead seems punitive, because it would require creating a file on the filesystem or in the db and compressing that. A simpler approach might be to take my Java class, strip out all the command-line handling it contains, and call it directly from XQuery (see the xqSearchUtils Java project and the way it's called from the Despatches XQuery for an example). A jar file with a simple XQuery module interface might be very handy indeed.
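For reference, the NCD itself is easy to express in a language with direct access to a compressor. This is a minimal Python sketch of the metric only (not the XQuery implementation described above, which has to work through eXist's base64-encoded gzip interface):

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: roughly 0.0 for identical strings,
    approaching 1.0 for unrelated ones. Uses compressed byte lengths as an
    approximation of Kolmogorov complexity."""
    cx = len(gzip.compress(x.encode('utf-8')))
    cy = len(gzip.compress(y.encode('utf-8')))
    cxy = len(gzip.compress((x + y).encode('utf-8')))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Note that gzip's fixed header overhead and 32 KB window make the measure unreliable for very short or very long strings, which is part of why calling a dedicated Java class from XQuery looks attractive.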
Media queries...
1. SA found a solution for cutting the soundtrack at the millisecond level: use Audacity! The program was installed on POMME.
2. ES entered & committed the transcripts for cltq3, fraq11, fraq12, fraq13
The call is out, and mine are done.
Working with PS on the MoEML redesign.
...for Rees and Urberg, and reconfigured the Rees structure to allow for abstracts (not available yet).
I've been using the opportunity of the redesign (which gives me a complete new incarnation of the web application working alongside the current one) to fix a whole raft of problems and annoyances going back a long time. Among those completed so far:
- When you ask for a page which doesn't exist, you now see a customized "missing" page (db/data/info/missing.xml), but I also set the HTTP status code, like this (for future reference):
declare variable $dataDoc :=
    if (collection('/db/data')//TEI[@xml:id=$fileId])
    then collection('/db/data')//TEI[@xml:id=$fileId]
    else
        let $dummy := response:set-status-code(404)
        return collection('/db/data')//TEI[@xml:id='missing'];

- Menu item <li> elements now have a class="active" attribute where their target URL matches the current URL.
- Schemas (ODD, RNG and SCH) are available through their filenames.
- When the XML view of a document is presented, the teiHeader is automatically expanded to include links to the schemas and a bit more information, to mitigate the current (temporary, I hope) paucity of header information.
- Page contents menus are now generated, not by parsing the XML source document, but by parsing the XHTML rendering of it after expansion and transformation. This is because the content menu has to be generated in a separate process from the original document expansion and conversion, and since @ids on <div>s are often auto-generated with generate-id() during the XSLT transformation, they cannot be matched for linking in any other way.
- I've begun writing a new module for retrieving information about placenames programmatically. This is largely to support the planned processing of ISE source code through named entity recognition. We will need to be able to do a sort of fuzzy lookup of placenames found in the ISE texts, to identify exact and candidate matches. Right now, the module produces a gazetteer in the text-file format used by e.g. NLTK, as well as a simple lookup text file mapping ids to matching names. Eventually it will also take a candidate name as input and produce one or more matches in the form of MoEML ids, along with all distinct values of names in MoEML for those ids, with a confidence measure. However, my early tests suggest that the Lucene fuzzy matching (using ft:query with a tilde operator) is actually broken in the build we're using; that's going to be a bit of a problem for us. I might write an XQuery implementation of the USM in order to have something better than Levenshtein distance, but I don't know how that could be implemented as part of a search. More work to do here.
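As a stop-gap while Lucene's fuzzy matching is broken, the candidate-lookup step could look something like this hypothetical Python sketch. Here difflib's similarity ratio stands in for whatever measure we finally settle on, and the gazetteer structure (ids mapped to their distinct name forms) is assumed, not taken from the actual module:

```python
from difflib import SequenceMatcher

def lookup(candidate, gazetteer, threshold=0.8):
    """Return (id, name, score) tuples for gazetteer names similar to the
    candidate, best matches first. gazetteer maps an id to a set of the
    distinct name forms recorded for that place."""
    matches = []
    for place_id, names in gazetteer.items():
        for name in names:
            score = SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
            if score >= threshold:
                matches.append((place_id, name, score))
    return sorted(matches, key=lambda m: -m[2])
```

A brute-force scan like this is fine for a gazetteer of a few thousand names; a proper search-index integration would only matter at much larger scale.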
- We now have the following stylesheets (instead of a single global one):
- global.css (currently empty: may be removed).
- highlights.css (contains rules for search matching and highlighting).
- popups.css (styles for popup boxes).
- primary_source.css (styles specific to the rendering of primary source documents, as opposed to born-digital articles).
- site_page.css (the site chrome, and the main focus of PS's work right now).
- xml_code.css (styling exclusively for sample code in XML format, which we use in our born-digital documentation files, through <egXML> elements).