Just for the record:
There is a file called properties/xml/process_properties_201-07.sh, which when invoked at the command line with a parameter which is the XML dump of the db will run two XSLT transformations (so far) to produce "complete, enhanced" output in the form of another XML file, which includes transaction chains and lots of other stuff.
These are some of the results coming out of the generation of transaction-chains through XSLT:
This is an example of what I'm pulling out so far, and the sorts of oddities that are being revealed:
<transaction-chain>
<title key="206" property-id="101" property-name="B:103 L:003"/>
<transaction-chain>
<title key="249" property-id="101" property-name="B:103 L:003"/>
<title key="204" property-id="101" property-name="B:103 L:003"/>
<title key="157" property-id="101" property-name="B:103 L:003"/>
<title key="25" property-id="71" property-name="B:011 L:026"/>
</transaction-chain>
<transaction-chain>
<title key="157" property-id="101" property-name="B:103 L:003"/>
<title key="25" property-id="71" property-name="B:011 L:026"/>
</transaction-chain>
</transaction-chain>
This shows nesting chains. Title 206 is the start of the initial chain; 249 is then split from it (while presumably 206 continues?). 249 becomes 204, then the split is re-joined: 157 has both 206 and 204 as preceding-titles.
I don't know if this makes sense -- can a title be split into itself and another title, as seems to be the case here with 206? There do seem to be lots of examples of this in the database.
My system currently captures splits like this well, but it doesn't yet unify chains which come back together again (so the two interior chains in the above example both have 157 -> 25). A subsequent transformation could easily detect such merges and represent them somehow, but it's not clear how. If we don't do that, then you would end up with two distinct chains:
This would be problematic if you were doing stats which depend on the number of transactions. We could, alternatively, collapse all chains of which one is a reduced subset of the other, so you would end up with just one here:
However, this would ignore the fact that 157 has 206 as a preceding title. It's also not clear what should happen with chains which diverge but never re-unite, such as this:
<transaction-chain>
<title key="606" property-id="211" property-name="B:039 L:005"/>
<transaction-chain>
<title key="507" property-id="211" property-name="B:039 L:005"/>
<title key="421" property-id="211" property-name="B:039 L:005"/>
</transaction-chain>
<transaction-chain>
<title key="510" property-id="214" property-name="B:039 L:008"/>
<title key="422" property-id="214" property-name="B:039 L:008"/>
</transaction-chain>
</transaction-chain>
Here you would conceivably have two distinct chains:
and any stats based on these would end up counting the sale of 606 twice (which might well be legitimate, because it is split, so there are arguably two transactions).
It's worth noting that in most of the complex chains I'm seeing, an initial split into two or more titles is then followed by their being re-united very quickly.
Some quick stats:
Today I have:
Moving forward. Tomorrow I should be able to finish transaction chains, and presumably get some idea from JS-R of what kind of output format he would like.
I'm now working on a second transform to be applied to the result of the first. This one already detects sale-to-self situations (although it doesn't find any -- waiting for some known data from JS-R to see why) and possible family sales. I'm now working on building the transaction chains, but I'm not sure whether this can actually be done with XSLT or not, because you need to keep a tally of which items have already been processed, and I can't yet figure out a way to do that.
In between other things, I've made a good start on the next stage (based on the notes here). I've coded a transform which does the following (so far):
This is already a long way towards creating the sort of big flat file that will make it easy to trace transaction-chains back in time.
JS-R is able to work with the current spreadsheet for the May presentation, but would like some more elaborate output for the next phase. These are the details we've discussed:
At JS-R's request, added a new "Description" field to the Ethnicities table.
Generated output for JS-R, who's finished the current batch of edits. Also He's asked for a large Notes field on the Owners table or the Ethnicity table (not clear yet which).
Completed the tasks set yesterday, as follows:
xquery version "1.0";
(: The purpose of this query is to find all examples where the same
owner surname has been associated with different ethnicities. :)
declare namespace saxon="http://saxon.sf.net/";
declare option saxon:output "method=text";
let $surnames := distinct-values(//own_surname),
(:return count($surnames):)
$names :=
for $name in $surnames
order by $name
return <name>
<surname>{$name}</surname>
<ids>
{let $ids := //owners[own_surname = $name]/own_owner_id
for $id in $ids return xs:string($id)
}
</ids>
<ethnicities>
{let $ids := //owners[own_surname = $name]/own_owner_id, $eths := //owners_to_ethnicities[ote_owner_id_fk = $ids]
for $eth in $eths
return <eth>{$eth/ote_ethnicity_id_fk/text()}</eth>
}
</ethnicities>
</name>
return
<names>
{for $n in $names
where count(distinct-values($n/ethnicities/eth)) gt 1
return
concat(string-join(($n/surname, ': ',
for $e in distinct-values($n//eth)
return //ethnicities[eth_ethnicity_id = $e]/eth_name/text()), ' '), '
')
}
return
</names>
xquery version "1.0";
(: The purpose of this query is to pull out instances of
owners who have the same surname and forename,
and who are associated with titles that have the same
property. :)
(:declare namespace map="http://www.w3.org/2005/xpath-functions/map";:)
declare namespace saxon="http://saxon.sf.net/";
declare option saxon:output "method=text";
let $owners := //owners,
$dupes := for $curr in $owners where $curr/following-sibling::owners[own_surname = $curr/own_surname and own_forenames = $curr/own_forenames and $curr/own_surname != '' and $curr/own_forenames != ''] return $curr,
$dupeIds := for $d in $dupes
let $owner_set := //owners[own_surname = $d/own_surname and own_forenames = $d/own_forenames]
return <group>
<surname>{$d/own_surname/text()}</surname>
<forenames>{$d/own_forenames/text()}</forenames>
{$owner_set//own_owner_id}
<properties>
{let $owner_ids := $owner_set//own_owner_id/text(),
$titles_for_group := (//owners_to_titles[ott_owner_id_fk = $owner_ids], //sellers_to_titles[stt_owner_id_fk = $owner_ids])
for $t in $titles_for_group
return
if ($t/ott_owner_id_fk) then
<title><title_id>{$t/ott_title_id_fk/text()}</title_id> <property_id>{//titles[ttl_title_id = $t/ott_title_id_fk/text()]/ttl_property_id_fk/text()}</property_id></title>
else
<title><title_id>{$t/stt_title_id_fk/text()}</title_id> <property_id>{//titles[ttl_title_id = $t/stt_title_id_fk/text()]/ttl_property_id_fk/text()}</property_id></title>
</properties>
</group>
return
<result>
{for $d in
$dupeIds
where count($d//property_id) gt count(distinct-values($d//property_id))
order by $d/surname, $d/forenames
return
('
',
$d/surname/text(),
', ',
$d/forenames/text(),
'
owner ids: ',
for $o in $d//own_owner_id
return ($o/text(), ' '),
for $t in $d//title return
(
'
	title: ', $t/title_id/text(),
'		property: ', $t/property_id/text()
),
'

'
)
}
</result>
An unwanted side-effect from the way I've deduped owners was that I've ended up with duplicate records in the owners_to_ethnicities table (different primary keys, but the same owner and ethnicity). This is a quick way to fix that, which I've now implemented:
DELETE ote2 FROM owners_to_ethnicities AS ote1, owners_to_ethnicities AS ote2 WHERE ote1.ote_owner_id_fk = ote2.ote_owner_id_fk AND ote1.ote_ethnicity_id_fk = ote2.ote_ethnicity_id_fk AND ote2.ote_ote_id > ote1.ote_ote_id
Similar processes may have to be run on other linking tables. It's probably safer to allow the duplicate records to be created by the merge, then examine and de-dupe them, than it would be to make the de-dupe process itself, which is already complicated enough, more messy.
:: Next Page >>