We are really getting there!
The new combined group-then-sort approach is actually working. There was one last tweak we had to implement, because the sort force of clitics is problematic. They have a specific weight in the numerical sort sequence, which is used to determine the order in which morphemes are listed in the sort key, but once that order has been determined, they need to be "downgraded" during the actual sort. We've achieved this by massaging the generated sort key to prefix any clitic values with "0000_", which ensures that they sort before any other sequences with non-clitics in the same position where the preceding part of the sequence is identical. We then strip the added prefix again before we calculate the indent levels.
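To make the clitic tweak concrete, here is a minimal sketch in Python. It assumes sort keys are strings of four-digit weights joined by underscores, that clitics can be recognised by their weight, and that no real morpheme has the weight "0000"; the weights and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical clitic weights; the real values are not given in this post.
CLITIC_WEIGHTS = {"0500", "0510"}
DOWNGRADE = "0000"


def downgrade_clitics(key: str) -> str:
    """Insert a "0000" segment before every clitic weight in the key."""
    out = []
    for part in key.split("_"):
        if part in CLITIC_WEIGHTS:
            out.append(DOWNGRADE)
        out.append(part)
    return "_".join(out)


def strip_downgrade(key: str) -> str:
    """Remove the inserted "0000" segments again.

    Assumes no genuine morpheme weight is "0000".
    """
    return "_".join(p for p in key.split("_") if p != DOWNGRADE)


keys = ["0100_0500", "0100_0300", "0100"]

# Sort on the massaged key; the stored keys themselves stay unchanged.
ordered = sorted(keys, key=downgrade_clitics)
```

With these made-up weights, the form ending in the clitic (0500) sorts ahead of the form ending in 0300, even though a plain string sort would put it last.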
So this is what we're now doing:
- Creating groups of subforms of the root by extracting them in a specific order from all the related forms, excluding all previously-extracted ones so that each form appears only once under each root;
- Rendering the subgroups in a different order from the discovery order;
- Sorting the items within each subgroup based on a generated sort key which gives a numerical weight to each morpheme discovered working outwards from the root (after stripping duplicated roots in a careful manner which differs depending on the infix separating them);
- Generating an indent level for each item in the subgroup based on comparing the lengths of their sort keys after stripping the common component from the left side of each.
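The indent step in the last bullet could be sketched along these lines. This is one possible reading rather than our actual code: it assumes underscore-separated keys, treats "the common component" as the leading segments shared by every key in the subgroup, and measures each item's level relative to the shortest remainder. The function name `indent_levels` is hypothetical.

```python
from itertools import takewhile


def indent_levels(keys):
    """Return one indent level per key, in the same order as the input."""
    split = [k.split("_") for k in keys]

    # Count the leading positions where every key has the same segment.
    # zip() stops at the shortest key; takewhile() stops at the first
    # position where the segments differ.
    common = len(list(takewhile(lambda col: len(set(col)) == 1, zip(*split))))

    # Length of what remains of each key once the shared prefix is removed.
    rests = [len(parts) - common for parts in split]

    # Assumption: the item with the shortest remainder sits at level 0.
    base = min(rests, default=0)
    return [r - base for r in rests]


levels = indent_levels(["0100", "0100_0300", "0100_0300_0200", "0100_0500"])
# levels == [0, 1, 2, 1]
```

The keys passed in here are the ones with the "0000_" clitic prefix already stripped, so the extra segment does not inflate the indent.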
Can this be it? Looks like it might be.