Locating problems arising from Lexware encoding
Posted by mholmes on 22 Oct 2013 in Activity log
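In some cases, the person entering data in the Lexware days used the same band name for glosses of example sentences as for defs, so we have some situations in which a def:seg duplicates data in a quote:seg. I've used this XQuery to identify candidate duplicates for SK to look at. Within each entry, it compares the whitespace-normalized text of every non-empty quote/seg against that of every non-empty def, and reports any strings that appear in both: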
xquery version "3.0";

declare default element namespace "http://www.tei-c.org/ns/1.0";
declare namespace util="http://exist-db.org/xquery/util";

for $t in //TEI
for $e in $t//entry
let $quoteSegs :=
        for $s in $e//quote/seg
        where string-length(normalize-space($s)) gt 0
        return normalize-space($s),
    $defs :=
        for $d in $e//def
        where string-length(normalize-space($d)) gt 0
        return normalize-space($d),
    $dupes := distinct-values($quoteSegs[. = $defs])
where count($dupes) gt 0
return concat(util:document-name($t/@xml:id), ' / ', $e/@xml:id, ' : ', string-join($dupes, ' | '))
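For anyone wanting to try the logic without the database, here is a minimal sketch of the same comparison run against a small inline sample. The sample entries, their xml:id values and their contents are all invented for illustration and are not taken from the project data. This version also assumes that comparing at the def/seg level (rather than the whole def) is what is wanted, and it drops the eXist-specific util:document-name() call so that it should run in any XQuery 3.0 processor.

```xquery
xquery version "3.0";

declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Hypothetical sample data, invented for illustration only. :)
let $sample :=
  <TEI xml:id="sample">
    <text>
      <body>
        <entry xml:id="e1">
          <sense>
            <def><seg>to run</seg></def>
            <def><seg>he is running</seg></def>
            <cit><quote><seg>he is running</seg></quote></cit>
          </sense>
        </entry>
        <entry xml:id="e2">
          <sense>
            <def><seg>to walk</seg></def>
            <cit><quote><seg>she walked home</seg></quote></cit>
          </sense>
        </entry>
      </body>
    </text>
  </TEI>
for $e in $sample//entry
(: Non-empty, whitespace-normalized text of each quote/seg in the entry. :)
let $quoteSegs :=
        for $s in $e//quote/seg
        where string-length(normalize-space($s)) gt 0
        return normalize-space($s),
    (: Assumption: compare def/seg, not the whole def as in the query above. :)
    $defSegs :=
        for $d in $e//def/seg
        where string-length(normalize-space($d)) gt 0
        return normalize-space($d),
    (: Strings that occur both as a quote/seg and as a def/seg. :)
    $dupes := distinct-values($quoteSegs[. = $defSegs])
where count($dupes) gt 0
return concat($e/@xml:id, ' : ', string-join($dupes, ' | '))
```

With this sample, only the first entry is reported, as "e1 : he is running". Comparing def/seg instead of def would matter for any def that contains more than one seg, or text outside its seg, since the whole-def string would then never equal a single quote/seg.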