The problem of duplicate @xml:id attributes on entries has now become a serious issue for building the print dictionary, because I'm unable to process the entire collection to produce the book; to build the dictionary I have to use XInclude to create a single XML source file, and when I do that there are over 1600 duplicate ids, which prevent some of the processing steps from succeeding.
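For reference, duplicates like these can be found before the XInclude build by scanning the source files directly. This is a minimal sketch using Python's standard library; the function names are mine, and it assumes nothing about the project's actual build setup beyond the xml:id attributes themselves:

```python
# Count how often each xml:id value occurs across a set of XML files,
# then report the values that occur more than once.
import xml.etree.ElementTree as ET
from collections import Counter

# xml:id in Clark notation, as ElementTree sees it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def collect_ids(paths):
    """Return a Counter of every xml:id value found in the given files."""
    ids = Counter()
    for path in paths:
        for elem in ET.parse(path).iter():
            xid = elem.get(XML_ID)
            if xid is not None:
                ids[xid] += 1
    return ids

def duplicates(paths):
    """xml:id values that occur more than once, with their counts."""
    return {xid: n for xid, n in collect_ids(paths).items() if n > 1}
```

Running `duplicates()` over the whole file set gives the same information the failed build reports, but per-id, which makes it easier to see which entries need renaming.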
I've taken a quick look at where the duplicates tend to be concentrated, by adding the files in alphabetical order and checking how many duplicates appear with each addition. These files create no problems (i.e. they have no duplicates among themselves):
affix_glot-ix.xml affix_k-m.xml affix_n-t.xml affix_u-CAPS.xml c.xml c-glot.xml c-rtr.xml glottal.xml h.xml h-phar-part1.xml h-phar-part2.xml l-affric.xml lex-suff.xml new-data-2013.xml p-glot.xml phar-w.xml qw-glot.xml s-rtr.xml t-glot.xml xw.xml
When I add each of the remaining files on its own (only one at a time), these are the results:
k.xml: 100 duplicates, k-glot.xml: 18, kw.xml: 2, kw-glot.xml: 2, l.xml: 3, l-fric.xml: 6, m.xml: 3, n.xml: 97, p.xml: 7, particles.xml: 4, pron.xml: 2, q.xml: 4, q-glot.xml: 3, qw.xml: 1, rescued.xml: 54, s.xml: 2, t.xml: 20, ww-glot.xml: 4, x.xml: 3, x-uvul.xml: 4, yy-glot.xml: 4
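That incremental check can be sketched as a small script: start from the known-good set, then test each remaining file on its own against it. This is a generic sketch, not the project's actual tooling, and the file names passed in would be the ones listed above:

```python
# For a candidate file, list the xml:id values that collide with a
# known-good set of files, or that repeat within the candidate itself.
import xml.etree.ElementTree as ET
from collections import Counter

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def ids_in(path):
    """All xml:id values in one file, in document order."""
    return [e.get(XML_ID) for e in ET.parse(path).iter()
            if e.get(XML_ID) is not None]

def new_duplicates(good_paths, candidate):
    """xml:id values in `candidate` that are already in the good set,
    or that occur more than once inside `candidate`."""
    good = set()
    for p in good_paths:
        good.update(ids_in(p))
    cand = Counter(ids_in(candidate))
    return sorted(xid for xid, n in cand.items() if xid in good or n > 1)
```

Looping `new_duplicates()` over the excluded files reproduces the per-file counts above, and the returned id values show exactly which entries to fix first.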
What I'm going to do is develop the dictionary output using only the valid files, and then add the others in as they get fixed. In the meantime, it might be worth having a go at some of the low-hanging fruit (the files with only two or three duplicates). More will show up as we add those in, of course -- there will be duplicates among the currently-excluded files themselves, as well as ids they share with the "good" files. So the dictionary PDFs will shrink in size for now, but I'll be able to start doing things like generating page-references that depend on xml:ids.