Category: Activity log

05/07/12

Permalink 12:27:51 pm, by mholmes, 61 words, 172 views   English (CA)
Categories: Activity log; Mins. worked: 5

Transaction-chain processing script

Just for the record:

There is a file called properties/xml/process_properties_201-07.sh, which when invoked at the command line with a parameter which is the XML dump of the db will run two XSLT transformations (so far) to produce "complete, enhanced" output in the form of another XML file, which includes transaction chains and lots of other stuff.

Permalink 11:52:02 am, by mholmes, 612 words, 179 views   English (CA)
Categories: Activity log; Mins. worked: 180

Early results from transaction-chain processing

These are some of the results coming out of the generation of transaction-chains through XSLT:

This is an example of what I'm pulling out so far, and the sorts of oddities that are being revealed:

<transaction-chain>
<title key="206" property-id="101" property-name="B:103 L:003"/>
  <transaction-chain>
     <title key="249" property-id="101" property-name="B:103 L:003"/>
     <title key="204" property-id="101" property-name="B:103 L:003"/>
     <title key="157" property-id="101" property-name="B:103 L:003"/>
     <title key="25" property-id="71" property-name="B:011 L:026"/>
  </transaction-chain>
  <transaction-chain>
     <title key="157" property-id="101" property-name="B:103 L:003"/>
     <title key="25" property-id="71" property-name="B:011 L:026"/>
  </transaction-chain>
</transaction-chain>

This shows nesting chains. Title 206 is the start of the initial chain; 249 is then split from it (while presumably 206 continues?). 249 becomes 204, then the split is re-joined: 157 has both 206 and 204 as preceding-titles.

I don't know if this makes sense -- can a title be split into itself and another title, as seems to be the case here with 206? There do seem to be lots of examples of this in the database.

My system currently captures splits like this well, but it doesn't yet unify chains which come back together again (so the two interior chains in the above example both have 157 -> 25). A subsequent transformation could easily detect such merges and represent them somehow, but it's not clear how. If we don't do that, then you would end up with two distinct chains:

  • 206 -> 249 -> 204 -> 157 -> 25
  • 206 -> 157 -> 25

This would be problematic if you were doing stats which depend on the number of transactions. We could, alternatively, collapse all chains of which one is a reduced subset of the other, so you would end up with just one here:

  • 206 -> 249 -> 204 -> 157 -> 25

However, this would ignore the fact that 157 has 206 as a preceding title. It's also not clear what should happen with chains which diverge but never re-unite, such as this:

<transaction-chain>
  <title key="606" property-id="211" property-name="B:039 L:005"/>
  <transaction-chain>
     <title key="507" property-id="211" property-name="B:039 L:005"/>
     <title key="421" property-id="211" property-name="B:039 L:005"/>
  </transaction-chain>
  <transaction-chain>
     <title key="510" property-id="214" property-name="B:039 L:008"/>
     <title key="422" property-id="214" property-name="B:039 L:008"/>
  </transaction-chain>
</transaction-chain>

Here you would conceivably have two distinct chains:

  • 606 -> 507 -> 421
  • 606 -> 510 -> 422

and any stats based on these would end up counting the sale of 606 twice (which might well be legitimate, because it is split, so there are arguably two transactions).

It's worth noting that in most of the complex chains I'm seeing, an initial split into two or more titles is then followed by their being re-united very quickly.

Some quick stats:

  • 713 primary chains exist (meaning that there are 713 chains which start from a title which has no preceding-title).
  • 411 of these primary chains go nowhere (in other words, there is no subsequent title, so no transactions take place other than the primary title purchase).
  • Therefore there are 302 instances of actual usable chains involving one or more sale.
  • 253 of those chains are simple, in that there are no splits. (There could be unions, though, because I'm not detecting those yet).
  • 53 of the chains split into sub-chains.
  • 28 of the chains involve more than one property.
  • 186 titles appear in more than one root transaction chain (suggesting there may be up to 100 merges between root chains, something distinct from the examples above where a root chain splits and then merges again).
  • 40 root chains feature the same title more than once (meaning that the chain splits, then merges again at some point).

04/07/12

Permalink 12:59:21 pm, by mholmes, 160 words, 168 views   English (CA)
Categories: Activity log; Mins. worked: 180

Processing chain and transaction-chain-building progress

Today I have:

  • Written a basic script to run the two saxon transforms on my original source data. This script will have more transformations added to it eventually, forming a full process from db output XML to CSV file for JS-R.
  • Added detection of liquidated properties and liquidated property controls (although I'm still working on data that doesn't have the required identification of purchasers to allow detection of actual liquidations -- still waiting on JS-R to add that to the db).
  • Implemented basic transaction-chain detection. This is remarkably slow, but does appear to be working. So far it's listing all titles in a single-title transaction chain. Next I need to do something when I reach a fork in the chain (perhaps generate a nested chain, which could be un-nested in the next transformation).

Moving forward. Tomorrow I should be able to finish transaction chains, and presumably get some idea from JS-R of what kind of output format he would like.

03/07/12

Permalink 02:29:44 pm, by mholmes, 89 words, 169 views   English (CA)
Categories: Activity log; Mins. worked: 120

Enhancing the complete data XML

I'm now working on a second transform to be applied to the result of the first. This one already detects sale-to-self situations (although it doesn't find any -- waiting for some known data from JS-R to see why) and possible family sales. I'm now working on building the transaction chains, but I'm not sure whether this can actually be done with XSLT or not, because you need to keep a tally of which items have already been processed, and I can't yet figure out a way to do that.

29/06/12

Permalink 03:13:05 pm, by mholmes, 90 words, 190 views   English (CA)
Categories: Activity log; Mins. worked: 180

Next stage: first day's programming

In between other things, I've made a good start on the next stage (based on the notes here). I've coded a transform which does the following (so far):

  • Expands each owner to include full ethnicity information inside the owner element.
  • Expands each title to include a full property element inside it, as well as a list of expanded owners, and a list of preceding titles.

This is already a long way towards creating the sort of big flat file that will make it easy to trace transaction-chains back in time.

15/05/12

Permalink 02:41:11 pm, by mholmes, 504 words, 201 views   English (CA)
Categories: Activity log; Mins. worked: 90

Long discussion: next stage

JS-R is able to work with the current spreadsheet for the May presentation, but would like some more elaborate output for the next phase. These are the details we've discussed:

  • The initial db XML output should go through a transformation which basically takes all the information encoded in relations and makes it explicit on individual records. So, for instance, all owners should have explicit ethnicities realized on the owner record; each title should have complete copies of all its owners; and so on. This will make it much easier, and faster, to generate other views of the data.
  • This output also needs to include some new boolean flags on titles:
    • Sale to self (as currently created during the spreadsheet transaction transform).
    • Possible family sale (ditto).
    • liquidated property: properties sold by a Japanese owner to custodians or the state between beginning 1943 and end 1946. The custodian category would be set as an institution type by JS-R.
    • Control for liquidated property: any title which is not flagged as above, nor is it a family transaction or a sale to self, which takes place from 1943-01-01 through 1946-12-12.
  • We need a view which constitutes a chain of transactions, constructed by preceding title. The way to construct the chains is:
    • Order titles by date ascending.
    • Start from the first.
    • Look for another title which as this one as its preceding title. Add that to the chain, and continue.
    • Every time you add a title to a chain, flag the title as having been used.
    • If you find two titles with the current title as preceding, then you have a fork. Annotate the end of the current chain to point to those two titles, and start new chains from each of those titles. Annotate the first link in the new chains to point back to the fork title.
    • If your current chain hits a title which has already been used, then you have a merge. In that case, split the previously-constructed chain into two, and annotate the break points, and stop your current chain, annotating the end of it, so that you end up with two chains which end, pointing to another single chain which continues.
    This view will have to be realized in XML (although it's not clear whether XSLT can be used to create it -- probably) because it's not a 2-dimensional matrix. Since it's in XML, each title can bring along a full copy of all its data, including flags such as LIQ and LIQ_CONTROL, and we can then generate matrix views of the chains which are flattened in various ways. Titles in this view have an inherent generation number, which can be output to spreadsheets.
  • In the current spreadsheet output, the LIQ and LIQ_CONTROL flags would be output, along with a generation number for any title which has one of these flags, constituting the count of transactions subsequent to the custodian transaction (in the case of LIQ titles), or the first transaction following 1943-01-01 (in the case of LIQ_CONTROL properties).

14/05/12

Permalink 08:25:07 am, by mholmes, 12 words, 190 views   English (CA)
Categories: Activity log; Mins. worked: 15

Added a new field

At JS-R's request, added a new "Description" field to the Ethnicities table.

11/05/12

Permalink 02:20:45 pm, by mholmes, 31 words, 269 views   English (CA)
Categories: Activity log, Tasks; Mins. worked: 30

Generated output spreadsheet

Generated output for JS-R, who's finished the current batch of edits. Also He's asked for a large Notes field on the Owners table or the Ethnicity table (not clear yet which).

09/05/12

Permalink 11:16:51 am, by mholmes, 655 words, 200 views   English (CA)
Categories: Activity log; Mins. worked: 210

Two tasks completed

Completed the tasks set yesterday, as follows:

  • Generated a list of all surnames for which multiple ethnicities are associated with owners bearing those surnames (so, for instance, if there are several owners with the surname Lee, and some have ethnicity Chinese while others have Unknown, that surname is added to the list). I used XQuery running against the XML output of the db to do this:
    xquery version "1.0";
    
    (: The purpose of this query is to find all examples where the same 
       owner surname has been associated with different ethnicities. :)
    
    declare namespace saxon="http://saxon.sf.net/";
    declare option saxon:output "method=text";
    
    let $surnames := distinct-values(//own_surname),
    (:return count($surnames):)
    $names := 
    for $name in $surnames
    order by $name
    return <name>
    <surname>{$name}</surname>
    <ids>
    {let $ids := //owners[own_surname = $name]/own_owner_id
    for $id in $ids return xs:string($id)
    }
    </ids>
    <ethnicities>
    {let $ids := //owners[own_surname = $name]/own_owner_id, $eths := //owners_to_ethnicities[ote_owner_id_fk = $ids]
    for $eth in $eths
    return <eth>{$eth/ote_ethnicity_id_fk/text()}</eth>
    }
    </ethnicities>
    </name>
    
    return 
    <names>
    {for $n in $names
    where count(distinct-values($n/ethnicities/eth)) gt 1
    return 
    concat(string-join(($n/surname, ': ',
    for $e in distinct-values($n//eth)
    return //ethnicities[eth_ethnicity_id = $e]/eth_name/text()), ' '), '&#xa;')
    }
    return 
    </names>
    
  • Generated a list of owners who have the same surname and forename, and are associated with titles on the same property. These are likely to be either duplicate owners or cross-generation family transactions. Again, this was done with XQuery:
    xquery version "1.0";
    
    (: The purpose of this query is to pull out instances of 
       owners who have the same surname and forename, 
       and who are associated with titles that have the same
       property. :)
       
       (:declare namespace map="http://www.w3.org/2005/xpath-functions/map";:)
       
    declare namespace saxon="http://saxon.sf.net/";
    declare option saxon:output "method=text";
       
       let $owners := //owners,
       $dupes := for $curr in $owners where $curr/following-sibling::owners[own_surname = $curr/own_surname and own_forenames = $curr/own_forenames and $curr/own_surname != '' and $curr/own_forenames != ''] return $curr,
       $dupeIds := for $d in $dupes 
          let $owner_set := //owners[own_surname = $d/own_surname and own_forenames = $d/own_forenames]
          return <group>
          <surname>{$d/own_surname/text()}</surname>
          <forenames>{$d/own_forenames/text()}</forenames>
          {$owner_set//own_owner_id}
          <properties>
          {let $owner_ids := $owner_set//own_owner_id/text(),
                $titles_for_group := (//owners_to_titles[ott_owner_id_fk = $owner_ids], //sellers_to_titles[stt_owner_id_fk = $owner_ids])
            for $t in $titles_for_group
            return
            if ($t/ott_owner_id_fk) then
            <title><title_id>{$t/ott_title_id_fk/text()}</title_id> <property_id>{//titles[ttl_title_id = $t/ott_title_id_fk/text()]/ttl_property_id_fk/text()}</property_id></title>
            else
            <title><title_id>{$t/stt_title_id_fk/text()}</title_id> <property_id>{//titles[ttl_title_id = $t/stt_title_id_fk/text()]/ttl_property_id_fk/text()}</property_id></title>
          </properties>
          </group>
          
          
       
       return 
       <result>
       {for $d in
       $dupeIds
       where count($d//property_id) gt count(distinct-values($d//property_id))
       order by $d/surname, $d/forenames
       return 
       ('&#x0a;',
       $d/surname/text(),
       ', ',
       $d/forenames/text(),
       '&#x0a;owner ids: ',
       for $o in $d//own_owner_id
       return ($o/text(), ' '),
       for $t in $d//title return
        (
        '&#x0a;&#09;title: ', $t/title_id/text(),
        '&#09;&#09;property: ', $t/property_id/text()
        ),
       '&#x0a;&#x0a;'
       )
       }
    </result>
    

08/05/12

Permalink 03:32:52 pm, by mholmes, 135 words, 187 views   English (CA)
Categories: Activity log; Mins. worked: 60

Deleting duplicate owners_to_ethnicities records

An unwanted side-effect from the way I've deduped owners was that I've ended up with duplicate records in the owners_to_ethnicities table (different primary keys, but the same owner and ethnicity). This is a quick way to fix that, which I've now implemented:

DELETE ote2
FROM owners_to_ethnicities AS ote1, owners_to_ethnicities AS ote2
WHERE ote1.ote_owner_id_fk = ote2.ote_owner_id_fk
AND ote1.ote_ethnicity_id_fk = ote2.ote_ethnicity_id_fk
AND ote2.ote_ote_id > ote1.ote_ote_id

Similar processes may have to be run on other linking tables. It's probably safer to allow the duplicate records to be created by the merge, then examine and de-dupe them, than it would be to make the de-dupe process itself, which is already complicated enough, more messy.

:: Next Page >>

Properties

A database project to collect historical data on properties and titles.

Reports

Categories

April 2014
Sun Mon Tue Wed Thu Fri Sat
 << <   > >>
    1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30      

XML Feeds