Just for the record:
There is a script, properties/xml/process_properties_201-07.sh, which takes the XML dump of the db as its command-line parameter and runs two XSLT transformations (so far). The result is "complete, enhanced" output in the form of another XML file, which includes transaction chains and lots of other stuff.
These are some of the results coming out of the generation of transaction-chains through XSLT:
This is an example of what I'm pulling out so far, and the sorts of oddities that are being revealed:
<transaction-chain>
  <title key="206" property-id="101" property-name="B:103 L:003"/>
  <transaction-chain>
    <title key="249" property-id="101" property-name="B:103 L:003"/>
    <title key="204" property-id="101" property-name="B:103 L:003"/>
    <title key="157" property-id="101" property-name="B:103 L:003"/>
    <title key="25" property-id="71" property-name="B:011 L:026"/>
  </transaction-chain>
  <transaction-chain>
    <title key="157" property-id="101" property-name="B:103 L:003"/>
    <title key="25" property-id="71" property-name="B:011 L:026"/>
  </transaction-chain>
</transaction-chain>
This shows nested chains. Title 206 is the start of the initial chain; 249 is then split from it (while presumably 206 continues?). 249 becomes 204, and then the split is re-joined: 157 has both 206 and 204 as preceding-titles.
I don't know if this makes sense -- can a title be split into itself and another title, as seems to be the case here with 206? There do seem to be lots of examples of this in the database.
My system currently captures splits like this well, but it doesn't yet unify chains which come back together again (so the two interior chains in the above example both have 157 -> 25). A subsequent transformation could easily detect such merges, but it's not clear how they should be represented. If we don't do that, then you would end up with two distinct chains:
206 -> 249 -> 204 -> 157 -> 25
206 -> 157 -> 25
This would be problematic if you were doing stats which depend on the number of transactions. We could, alternatively, collapse all chains of which one is a reduced subset of the other, so you would end up with just one here:
206 -> 249 -> 204 -> 157 -> 25
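The collapsing rule could be sketched roughly as follows. This is Python rather than XSLT, purely to pin down the logic; the function names are hypothetical, and "reduced subset" is assumed here to mean that one chain's titles all appear, in the same order, within the other chain.

```python
def is_subsequence(short, long):
    """True if every key in `short` occurs in `long` in the same order."""
    remaining = iter(long)
    # `key in remaining` consumes the iterator up to the match, so order is respected.
    return all(key in remaining for key in short)


def collapse(chains):
    """Drop every chain that is a reduced subset of another chain."""
    unique = list(dict.fromkeys(chains))  # remove exact duplicates, keep order
    kept = []
    for chain in unique:
        others = (c for c in unique if c != chain)
        if not any(is_subsequence(chain, other) for other in others):
            kept.append(chain)
    return kept


# The two chains from the first example, as tuples of title keys.
chains = [
    ("206", "249", "204", "157", "25"),
    ("206", "157", "25"),
]
collapsed = collapse(chains)
# collapsed == [("206", "249", "204", "157", "25")]
```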
However, this would ignore the fact that 157 has 206 as a preceding title. It's also not clear what should happen with chains which diverge but never re-unite, such as this:
<transaction-chain>
  <title key="606" property-id="211" property-name="B:039 L:005"/>
  <transaction-chain>
    <title key="507" property-id="211" property-name="B:039 L:005"/>
    <title key="421" property-id="211" property-name="B:039 L:005"/>
  </transaction-chain>
  <transaction-chain>
    <title key="510" property-id="214" property-name="B:039 L:008"/>
    <title key="422" property-id="214" property-name="B:039 L:008"/>
  </transaction-chain>
</transaction-chain>
Here you would conceivably have two distinct chains:
606 -> 507 -> 421
606 -> 510 -> 422
and any stats based on these would end up counting the sale of 606 twice (which might well be legitimate, because it is split, so there are arguably two transactions).
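To make the double-counting concrete, here is a minimal sketch of flattening a nested chain into its distinct root-to-leaf chains and tallying how often each title appears. It is Python with the standard-library ElementTree, not the XSLT used in the actual pipeline; the `flatten` function is hypothetical, and it assumes that a chain's own title elements always come before any nested chains, as they do in the samples above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """
<transaction-chain>
  <title key="606" property-id="211" property-name="B:039 L:005"/>
  <transaction-chain>
    <title key="507" property-id="211" property-name="B:039 L:005"/>
    <title key="421" property-id="211" property-name="B:039 L:005"/>
  </transaction-chain>
  <transaction-chain>
    <title key="510" property-id="214" property-name="B:039 L:008"/>
    <title key="422" property-id="214" property-name="B:039 L:008"/>
  </transaction-chain>
</transaction-chain>
"""


def flatten(chain, prefix=()):
    """Return every root-to-leaf path of title keys through a nested chain."""
    # findall with a bare tag name matches direct children only.
    own = prefix + tuple(t.get("key") for t in chain.findall("title"))
    branches = chain.findall("transaction-chain")
    if not branches:
        return [own]
    paths = []
    for branch in branches:
        paths.extend(flatten(branch, own))
    return paths


paths = flatten(ET.fromstring(SAMPLE.strip()))
counts = Counter(key for path in paths for key in path)
# paths  == [("606", "507", "421"), ("606", "510", "422")]
# counts["606"] == 2, i.e. the sale of 606 is counted once per branch
```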
It's worth noting that in most of the complex chains I'm seeing, an initial split into two or more titles is then followed by their being re-united very quickly.
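Re-joins of this kind could be spotted by looking for titles that have more than one distinct immediate predecessor across the flattened chains. The sketch below is Python rather than XSLT, and `find_merges` is a hypothetical name; it only reports where chains come back together and does not decide how a merge should be represented.

```python
from collections import defaultdict


def find_merges(chains):
    """Map each title key to its predecessors, keeping only titles with several."""
    predecessors = defaultdict(set)
    for chain in chains:
        # Pair each key with the key that follows it in the same chain.
        for before, after in zip(chain, chain[1:]):
            predecessors[after].add(before)
    return {key: sorted(p) for key, p in predecessors.items() if len(p) > 1}


# The two flattened chains from the 206 example.
chains = [
    ("206", "249", "204", "157", "25"),
    ("206", "157", "25"),
]
merges = find_merges(chains)
# merges == {"157": ["204", "206"]}: 157 is where the split is re-joined
```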
Some quick stats:
Today I have:
Moving forward. Tomorrow I should be able to finish transaction chains, and presumably get some idea from JS-R of what kind of output format he would like.
I'm now working on a second transform to be applied to the result of the first. This one already detects sale-to-self situations (although it doesn't find any -- waiting for some known data from JS-R to see why) and possible family sales. I'm now working on building the transaction chains, but I'm not sure whether this can actually be done with XSLT or not, because you need to keep a tally of which items have already been processed, and I can't yet figure out a way to do that.
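One way to keep such a tally without mutable state is to pass the set of already-processed titles into each recursive call and hand the updated set back with the result. The sketch below shows the idea in Python, not XSLT; whether it translates cleanly into recursive templates with parameters is not something it settles. The function names and the `PRECEDING` mapping are hypothetical, with the keys taken from the 206 example.

```python
# title key -> keys of its preceding titles (illustrative data only)
PRECEDING = {
    "249": ["206"],
    "204": ["249"],
    "157": ["206", "204"],
    "25": ["157"],
}


def successors_of(preceding):
    """Invert the preceding-title mapping into key -> following titles."""
    succ = {}
    for key, befores in preceding.items():
        for before in befores:
            succ.setdefault(before, []).append(key)
    return succ


def build(key, succ, seen=frozenset()):
    """Return (node, seen); node is (key, children), or (key, None) if already processed."""
    if key in seen:
        return (key, None), seen
    seen = seen | {key}  # a new set each time, nothing is mutated
    children = []
    for nxt in succ.get(key, []):
        child, seen = build(nxt, succ, seen)  # thread the tally through
        children.append(child)
    return (key, children), seen


tree, processed = build("206", successors_of(PRECEDING))
# tree == ("206", [("249", [("204", [("157", [("25", [])])])]), ("157", None)])
```

The second occurrence of 157 comes back as a bare reference because the tally already contains it, so the 157 -> 25 tail is built only once.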