The cits are scattered throughout the entries, and are therefore duplicated all over the place. In preparation for converting them to orthography, we need to centralize them. This is the basic plan:
- Process all cits in all files so that each gets a unique id based on its bibl(s) plus a generated suffix, and is moved into a separate file called cits.xml. Replace each cit in place with a `<ptr target="c:ID"/>` element.
- Process the cits.xml file to order by id, so that all cits with the same bibls are grouped together.
- Check identity between the cits. In an XSLT transform of cits.xml, generate a new XSLT file containing a stack of very precise templates matching `<ptr target="c:ABCD"/>`. For each cit which is a duplicate of a preceding one, a) nuke it from the cit file, and b) create a template that rewrites any pointers to it so that they point to the earliest preceding copy.
- Run that transformation over the collection. That should give us a situation where duplicate cits have been removed, and all pointers normalized.
- Add a diagnostic that checks for ptrs inside sense elements that don't point to a cit, and fix anything found.
- Run a similarity metric over the cits to find any remaining near-duplicates, and refer these to SK and ECH to diagnose.
- Fix the website processing to handle the ptrs instead of the in-place cits.
- Fix the PDF processing to handle the ptrs.
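The centralization step above can be sketched in miniature. This is a Python stand-in for the real processing, with simplified non-namespaced element names (`cit`, `bibl`, `ptr`, `sense`) rather than full TEI; the id scheme (alphanumeric bibl key plus a short content hash) is an assumption, not the project's actual convention.

```python
# Sketch of cit centralization: assign each cit an id derived from its
# bibl(s) plus a content hash, move it to a central cits root, and leave
# a <ptr target="c:ID"/> behind. Element names are simplified, not TEI.
import hashlib
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def cit_id(cit):
    """Build an id from the cit's bibl text plus a short content hash."""
    bibl = "".join(b.text or "" for b in cit.findall("bibl"))
    digest = hashlib.sha1(ET.tostring(cit)).hexdigest()[:8]
    key = "".join(ch for ch in bibl if ch.isalnum())  # id-safe bibl key
    return f"{key}_{digest}"

def centralize(entry_root, cits_root):
    """Move every cit under cits_root, leaving a ptr in its place."""
    for parent in list(entry_root.iter()):
        for i, child in enumerate(list(parent)):
            if child.tag == "cit":
                cid = cit_id(child)
                child.set(XML_ID, cid)
                cits_root.append(child)
                parent.remove(child)
                parent.insert(i, ET.Element("ptr", target=f"c:{cid}"))
```

Because the hash is computed over the serialized cit, two cits with identical content get identical ids, which is what makes the later grouping and deduplication steps possible.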
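The deduplication and pointer-normalization steps are planned as a generated XSLT stylesheet; purely to illustrate the logic, here is the same idea in Python, building an old-id-to-canonical-id map directly instead of emitting templates. Again the element names are simplified assumptions.

```python
# Sketch of dedup: treat serialized content (minus the id) as identity,
# drop each duplicate cit, and remember which earlier cit replaces it.
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def dedupe(cits_root):
    """Remove duplicate cits; return removed id -> earliest identical id."""
    seen = {}   # content key -> canonical id
    remap = {}  # removed id -> canonical id
    for cit in list(cits_root):
        cid = cit.attrib.pop(XML_ID)      # exclude the id from the key
        key = ET.tostring(cit)
        cit.set(XML_ID, cid)
        if key in seen:
            remap[cid] = seen[key]
            cits_root.remove(cit)
        else:
            seen[key] = cid
    return remap

def normalize_ptrs(root, remap):
    """Repoint any ptr whose target cit was deduplicated away."""
    for ptr in root.iter("ptr"):
        target = ptr.get("target", "")
        if target.startswith("c:") and target[2:] in remap:
            ptr.set("target", "c:" + remap[target[2:]])
```

Running `normalize_ptrs` over the whole collection corresponds to the "run that transformation over the collection" step: duplicates are gone from cits.xml and every pointer resolves to the surviving copy.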
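The dangling-pointer diagnostic is simple to state: every `ptr` inside a `sense` whose target starts with `c:` must resolve to a cit id. A minimal sketch, again with simplified element names:

```python
# Sketch of the diagnostic: collect all cit ids, then report any
# sense-internal ptr whose c: target is not among them.
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def dangling_ptrs(entry_root, cits_root):
    """Return targets of ptrs inside sense elements matching no cit id."""
    known = {c.get(XML_ID) for c in cits_root.iter("cit")}
    bad = []
    for sense in entry_root.iter("sense"):
        for ptr in sense.iter("ptr"):
            target = ptr.get("target", "")
            if target.startswith("c:") and target[2:] not in known:
                bad.append(target)
    return bad
```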
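For the near-duplicate pass, one plausible metric (an assumption, not a decided choice) is `difflib.SequenceMatcher` over the cits' text: pairs that score high but below 1.0 are exactly the "close but not identical" cases worth referring to SK and ECH.

```python
# Sketch of the similarity pass: flag pairs of cit texts that are very
# similar but not byte-identical, using stdlib difflib. The 0.9
# threshold is an arbitrary starting point, to be tuned.
import difflib
import itertools

def near_duplicates(texts, threshold=0.9):
    """Return (i, j, ratio) for similar-but-not-identical text pairs."""
    flagged = []
    for (i, a), (j, b) in itertools.combinations(enumerate(texts), 2):
        ratio = difflib.SequenceMatcher(None, a, b).ratio()
        if threshold <= ratio < 1.0:
            flagged.append((i, j, ratio))
    return flagged
```

This is quadratic in the number of cits, which should be tolerable after deduplication; if not, a cheaper pre-filter (e.g. grouping by bibl first, as the sort step already does) would cut the pair count.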
This entry is filed under Activity log.