HCMC Journal: Plan of Action for Extracting Cits

07 July 2022 to 08 July 2022: Martin Holmes
Minutes: 375

Beforehand: Add ptr as a child of sense in the schema, so that cits can be replaced by ptrs.

The proposed process goes as follows:

PHASE 1: Extracting cits

1. Give every cit a unique xml:id in-place in the entries. The id will be of the form: c_0000_Y29.48, where c_ is common to all, 0000 is a counter through the whole collection, and Y29.48 is a normalized id-friendly version of the content of the final bibl in the cit.

   Note for MDH: Create an outer framework function which loads all the entry files, then calls fn:transform to transform each one using a separate stylesheet which has a start-count-from parameter, so that we can get the numbering right.
2. Reprocess the whole collection to copy all of the unique cits (cits which don’t share a final bibl with any other cit) into one file, sorted by cit; copy all of the other cits into a second file, where they should be grouped by final bibl, and sorted within the group by the normalized plain-text of their entire content.
3. Reprocess the whole collection to replace cits which are in the entries with pointers of the form: <ptr corresp="cit:c_0000_Y29.48"/>.
4. Add to the preprocessing of the two dictionary builds an initial step which replaces the cits in their original location based on the ptr elements. Then test to make sure the resulting builds still function as expected.
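
The extraction steps above can be sketched roughly as follows. This is an illustration in Python, not the XSLT that the plan calls for, and it rests on several assumptions: the elements are un-namespaced (real TEI files would need the TEI namespace), cits are not nested inside other cits, and the normalization rule in normalize_bibl is a guess at what "id-friendly" might mean. All function names are hypothetical. The counter threaded through process_collection stands in for the start-count-from parameter.

```python
import re
from collections import defaultdict
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def final_bibl_text(cit):
    """Text of the last bibl inside a cit, or "" if the cit has none."""
    bibls = list(cit.iter("bibl"))
    return "".join(bibls[-1].itertext()) if bibls else ""

def normalize_bibl(text):
    """Assumed rule: drop whitespace, replace other id-unsafe characters with _."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", re.sub(r"\s+", "", text))

def plain_text(cit):
    """Whitespace-normalized text content of a cit, used as a sort key."""
    return " ".join("".join(cit.itertext()).split())

def assign_cit_ids(entry, start_count):
    """Give each cit in one entry an xml:id; return the next unused counter."""
    count = start_count
    for cit in entry.iter("cit"):
        cit.set(XML_ID, f"c_{count:04d}_{normalize_bibl(final_bibl_text(cit))}")
        count += 1
    return count

def replace_cits_with_ptrs(entry):
    """Swap each cit for a ptr in place; return the extracted cits in order."""
    extracted = []
    for parent in list(entry.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "cit":
                ptr = ET.Element("ptr", {"corresp": "cit:" + child.get(XML_ID)})
                ptr.tail, child.tail = child.tail, None
                parent[index] = ptr
                extracted.append(child)
    return extracted

def process_collection(entries):
    """Number and extract the cits of every entry, carrying the counter along."""
    counter, all_cits = 0, []
    for entry in entries:
        counter = assign_cit_ids(entry, counter)
        all_cits.extend(replace_cits_with_ptrs(entry))
    return all_cits

def split_by_final_bibl(cits):
    """Return (unique cits, groups of cits that share a final bibl)."""
    groups = defaultdict(list)
    for cit in cits:
        groups[normalize_bibl(final_bibl_text(cit))].append(cit)
    unique = sorted((g[0] for g in groups.values() if len(g) == 1), key=plain_text)
    shared = {k: sorted(g, key=plain_text) for k, g in groups.items() if len(g) > 1}
    return unique, shared
```

Sorting the unique cits by their plain text is one reading of "sorted by cit"; another key could be substituted if a different order is wanted. A cit with an empty final bibl would get an id ending in a bare underscore, which makes such cases easy to spot.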

PHASE 2: Remediating cits

1. ECH and SK look at each group of cits in the non-unique file to determine which can be merged. Whenever they can merge, they delete each of the duplicates, but replace it with a pointer to the remaining unique cit. ptr elements have the same form as above.
2. As soon as one or two of these second-level pointers have been created in the cit file, MDH needs to add an extra layer of processing at build time to resolve the two-step pointer system so the builds keep working. At the same time, we should have an automated process that will update the original pointers in the entries to point to the remaining unique cit, and remove the now-unnecessary secondary pointers.
3. Steps 1 and 2 are cycled until there are no more duplicate cits. At that point, the cits can be reorganized into a single file, or into any convenient collection of files.
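
The two-step pointer resolution could work along these lines. This is a Python sketch rather than the project's build code, and it assumes something the plan does not spell out: that each secondary ptr in the cit file keeps the xml:id of the cit it replaced, so that the entry pointers still have something to land on. Elements are again assumed to be un-namespaced, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def build_redirects(cit_file_root):
    """Map every id in the cit file to the id it points at (a cit maps to itself)."""
    direct = {}
    for el in cit_file_root.iter():
        el_id = el.get(XML_ID)
        if el_id is None:
            continue
        if el.tag == "cit":
            direct[el_id] = el_id
        elif el.tag == "ptr":
            # Assumption: the secondary ptr carries the deleted cit's xml:id.
            direct[el_id] = el.get("corresp", "")[len("cit:"):]
    return direct

def resolve(cit_id, direct):
    """Follow pointers until a real cit is reached; raise on a pointer loop."""
    seen = set()
    while direct[cit_id] != cit_id:
        if cit_id in seen:
            raise ValueError("pointer loop at " + cit_id)
        seen.add(cit_id)
        cit_id = direct[cit_id]
    return cit_id

def retarget_entry_ptrs(entry, direct):
    """Point each entry ptr straight at the surviving cit; return how many changed."""
    changed = 0
    for ptr in entry.iter("ptr"):
        corresp = ptr.get("corresp", "")
        if not corresp.startswith("cit:"):
            continue
        old_id = corresp[len("cit:"):]
        new_id = resolve(old_id, direct)
        if new_id != old_id:
            ptr.set("corresp", "cit:" + new_id)
            changed += 1
    return changed
```

Once retarget_entry_ptrs has been run over every entry, no entry refers to a secondary pointer any more, so those pointers can be deleted from the cit file. An id that is missing from the cit file raises a KeyError in resolve, which would flag a dangling pointer early.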

PHASE 3: Value added

Now that cits are unique, we can add links from any cit in any entry to any other entry that shares the same cit. This could be achieved in a number of ways, but since it’s not a required feature, we can determine the best approach at our leisure.
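
One possible approach, offered only as a sketch since the method is still undecided, is to build a reverse index from each cit id to the entries that point at it. The Python below assumes entries are held in a dictionary keyed by a hypothetical entry id, and that elements are un-namespaced.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def entries_by_cit(entries):
    """Map each cit id to the sorted ids of the entries that share it.

    entries is a dict of entry id -> entry element (a hypothetical layout).
    Cits used by only one entry are left out, since they yield no links.
    """
    index = defaultdict(set)
    for entry_id, entry in entries.items():
        for ptr in entry.iter("ptr"):
            corresp = ptr.get("corresp", "")
            if corresp.startswith("cit:"):
                index[corresp[len("cit:"):]].add(entry_id)
    return {cit_id: sorted(ids) for cit_id, ids in index.items() if len(ids) > 1}
```

At build time, each restored cit could then look itself up in this index and link to every listed entry other than its own.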

Action taken 2022-07-08:

Wrote and tested the code to give all the cits ids, extract them, and replace them with pointers. This is working, except that sixteen bibls are still empty and need to be filled in by ECH/SK; once that’s done, the initial code should work (we think). After that, the downstream processing to put the cits back in the entries at build time needs to be done.
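
The outstanding build-time step, putting each cit back where its ptr sits, might look something like the following. This is a Python sketch under stated assumptions, not the build code itself: elements are un-namespaced, every ptr whose corresp starts with cit: targets a real cit in the cit file, and the function name is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def restore_cits(entry, cit_file_root):
    """Replace each cit: ptr in an entry with a copy of the cit it names.

    Returns the number of pointers replaced. A pointer to an id that is not
    in the cit file raises KeyError rather than being silently skipped.
    """
    cits = {cit.get(XML_ID): cit for cit in cit_file_root.iter("cit")}
    restored = 0
    for parent in list(entry.iter()):
        for index, child in enumerate(list(parent)):
            corresp = child.get("corresp", "")
            if child.tag == "ptr" and corresp.startswith("cit:"):
                # Copy, so one cit can be restored into several entries.
                cit = copy.deepcopy(cits[corresp[len("cit:"):]])
                cit.tail = child.tail
                parent[index] = cit
                restored += 1
    return restored
```

Running this as the first preprocessing step would hand the rest of the build the same structure it had before extraction, apart from the added xml:id on each cit.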