Sencoten dictionary: More complexity in the input to deal with
Author: Martin Holmes
Minutes: 100
SK raised two new issues for the PDF build: first, there are duplicate entries
in the root-based index where the same entry appears multiple times in the source
spreadsheet. For this, I had to adapt the code which combines and normalizes
such cases in the Sen-Eng section of the dictionary, and that seems to have worked.
Second, we’re now facing a handful of cases where there are multiple roots for
an entry, and our root column and its handling are designed only
to cope with a single root. This is more complicated, and it sent me down a new
path, thinking that really what we need to do is generate proper TEI from the
spreadsheet, and then build output from that. I’ve made a start on that, but
in itself it’s horribly complicated because of the multiple slightly-differing
instances of the same entry in the input data. And if we do succeed in
reducing the duplication into a well-organized set of source data, we’re then
going to have to reproduce the duplication in the output anyway, to retain the
repetitive organization by English headword that’s required by the project.
So that’s going to take some time to figure out.
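
As a rough illustration of the shape of the problem (not the project's actual code), here is a minimal Python sketch. Everything in it is an assumption made for the example: the column names (headword, root, english), the semicolon used to separate multiple roots in one cell, the placeholder sample rows, and the function names. It collapses slightly-differing duplicate rows into one entry, lets an entry carry a list of roots instead of a single one, and then re-expands the merged entries into root-based and English-headword indexes, which is where the duplication comes back in the output.

```python
from collections import defaultdict

# Hypothetical spreadsheet rows; column names and values are placeholders,
# not real project data.
ROWS = [
    {"headword": "word1", "root": "rootA", "english": "gloss one"},
    # Same entry again, differing only in whitespace.
    {"headword": "word1", "root": "rootA", "english": "gloss  one "},
    # An entry with two roots in one cell (the ";" separator is assumed).
    {"headword": "word2", "root": "rootA; rootB", "english": "gloss two"},
    # Same entry under a different English headword.
    {"headword": "word1", "root": "rootA", "english": "gloss three"},
]


def normalize(text):
    """Trim the ends and collapse internal runs of whitespace."""
    return " ".join(text.split())


def split_roots(cell, sep=";"):
    """Turn a root cell into a list of roots, dropping empty pieces."""
    return [normalize(part) for part in cell.split(sep) if part.strip()]


def merge_entries(rows):
    """Collapse rows sharing a headword into one entry.

    Distinct roots and English glosses are kept in first-seen order.
    """
    merged = {}
    for row in rows:
        key = normalize(row["headword"])
        entry = merged.setdefault(
            key, {"headword": key, "roots": [], "english": []}
        )
        for root in split_roots(row["root"]):
            if root not in entry["roots"]:
                entry["roots"].append(root)
        gloss = normalize(row["english"])
        if gloss not in entry["english"]:
            entry["english"].append(gloss)
    return list(merged.values())


def index_by(entries, field):
    """Re-expand merged entries: one index line per value of `field`."""
    index = defaultdict(list)
    for entry in entries:
        for value in entry[field]:
            index[value].append(entry["headword"])
    return dict(index)


entries = merge_entries(ROWS)
root_index = index_by(entries, "roots")
english_index = index_by(entries, "english")
```

In this sketch, word2 is listed under both rootA and rootB in root_index, and word1 appears under two separate English headwords in english_index, even though each is stored only once in entries. A real version would presumably write the merged entries out as TEI and build the indexes from that.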