HCMC Journal: Sencoten dictionary: More complexity in the input to deal with

30 May 2024: Martin Holmes
Minutes: 100

SK raised two new issues for the PDF build. First, there are duplicate entries in the root-based index wherever the same entry appears multiple times in the source spreadsheet. For this, I had to adapt the code which combines and normalizes such cases in the Sen-Eng section of the dictionary, and that seems to have worked. Second, we’re now facing a handful of cases where an entry has multiple roots, while our root column and the code that handles it are designed to cope with only a single root. This is more complicated, and it sent me down a new path, thinking that what we really need to do is generate proper TEI from the spreadsheet, and then build the output from that. I’ve made a start on that, but in itself it’s horribly complicated because of the multiple slightly differing instances of the same entry in the input data. And if we do succeed in reducing the duplication into a well-organized set of source data, we’re then going to have to reproduce the duplication in the output anyway, to retain the repetitive organization by English headword that’s required by the project. So that’s going to take some time to figure out.
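To make the two problems concrete, here is a minimal Python sketch of the general idea: collapse repeated spreadsheet rows within each root, and let one entry be filed under several roots. It is not the project's build code. The column names (`headword`, `gloss`, `root`), the semicolon separator for multiple roots, and both function names are assumptions made up for illustration, and it treats rows as duplicates only when headword and gloss match exactly after trimming, which is simpler than the slightly differing instances described above.

```python
from collections import defaultdict


def split_roots(root_cell, sep=";"):
    """Split a root cell into individual roots (the separator is an assumption)."""
    return [r.strip() for r in root_cell.split(sep) if r.strip()]


def build_root_index(rows):
    """Group rows by root, keeping one row per (headword, gloss) under each root.

    `rows` is a list of dicts with hypothetical keys 'headword', 'gloss'
    and 'root'. A row with several roots is filed under each of them.
    """
    index = defaultdict(dict)
    for row in rows:
        # Rows that agree on headword and gloss are treated as duplicates.
        key = (row["headword"].strip(), row["gloss"].strip())
        for root in split_roots(row["root"]):
            # setdefault keeps the first row seen and ignores later repeats.
            index[root].setdefault(key, row)
    # Return plain lists, with roots in sorted order.
    return {root: list(entries.values()) for root, entries in sorted(index.items())}
```

Keying each root's bucket on the normalized headword and gloss is what removes the repeats; a real version would need a looser comparison to catch near-duplicates.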