Following our meeting last week, I've now drafted a preliminary shot at a project proposal for phase two, and forwarded it to JS-R for additional data, edits and gap-filling.
There are 117. Here's how I found them, for future reference:
SELECT a.`ttl_title_id`, b.`ttl_title_id`, a.`ttl_title_code` FROM `titles` a, `titles` b WHERE a.`ttl_title_code` = b.`ttl_title_code` AND a.`ttl_title_id` <> b.`ttl_title_id` AND a.`ttl_title_id` < b.`ttl_title_id` ORDER BY a.`ttl_title_id` LIMIT 0, 1000
Negative duration values, in the case of 11 records in the VW_trans_composite_eth_prop view, have revealed some data entry errors in the "preceding titles" and "related titles (newer)" fields. This suggests that there may be other such errors which are not apparent, because their dates happen to be in the correct sequence. Therefore we propose creating an automated check on the data, to locate such inconsistencies. This is an outline of how the check would be done:
However, I see a problem with this. If a title is created as a result of a merger of two or more preceding titles, each of which has a different date, it will be impossible to tell whether there is an error or not. Ditto in the case of subsequent titles, in the case of a subdivision. Waiting for word from JS-R on how to handle this.
I have now carried out the insertion, and constructed a detailed audit trail to pass on to JS-R and VG. Here's the opening section of the audit document (the rest is very long, and consists of precise records of every operation):
____AUDIT TRAIL FOR THE AUTOMATED INSERTION OF SPREADSHEET DATA____
Use this document to check the validity of inserted records and trace the source of any errors that may be found.
WHAT HAS BEEN DONE:
- A new title record has been inserted for each record in the spreadsheet.
- The title has been linked to its property. All properties referred to in the spreadsheet appear to have already existed in the database. so no new property records were created.
- All owners recorded in the spreadsheet have either been:
- discovered as already existing in the database, or
- inserted as new records into the database
and each owner has been linked to their title(s).
- A new mortgage record has been inserted for each mortgage in the spreadsheet.
- All lenders recorded in the spreadsheet have either been:
- discovered as already existing in the database, or
- inserted as new records into the database
and each lender has been linked to their mortgage(s).
- A simple algorithm has been used to determine whether owners and lenders are individual or institutional, and their records have been configured accordingly. I have strong confidence in this algorithm, but there may be occasional errors.
- All "other documents" recorded in the spreadsheet have either been:
- discovered as already existing in the database, or
- inserted as new records into the database
and each document has been linked to its title(s).
WHAT HAS NOT BEEN DONE:
- No attempt has been made to set values for the ethnicity of lenders or owners. This cannot be done in an automated fashion.
- No attempt has been made to discover seller information for the new records. This is also best done by humans, but it's possible we might be able to automate it if time permits.
BACKUPS
I have backups of the database taken immediately before and immediately after the process.
AUTO_INCREMENT VALUES PRIOR TO INSERTION OF NEW RECORDS FROM SPREADSHEET
These values represent the FIRST id and LAST id of a record inserted as part of this process. Within this range, data in the table has been inserted by the automated process.
TABLE FIRST ID LAST ID
titles 2300 2738
owners 1863 2292
lenders 553 691
mortgages 929 1306
other_docs 388 518
owners_to_titles 9678 10363
mortgages_to_titles 3246 3623
docs_to_titles 1086 1328
lenders_to_mortgages 1130 1564
No new records were inserted in the props table; presumably all the new titles relate to properties already in the database.
FULL INSERTION RECORD
This is the complete record of record creation through the process. Use this to figure out exactly what happened in the case of any given record.
Operation # 1
Processing record with id # d1e1150
Found 1 property records.
Using property id #109
Inserted title record # 2300
Processing owners
Inserted owner id # 1863
Inserted owner-to-title link id # 9678
Inserted mortgage id # 929
Inserted mortgage-to-title link id # 3246
Processing lenders for this mortgage.
Inserted lender id # 553
Inserted lender_to_mortgage link id # 1130
Processing other documents.
Inserted other_doc id # 388
Inserted doc-to-title link id # 1086
Finished processing record d1e1150.
[...]
I've now finished and tested the code for importing records into the db. The end result (which creates 438 primary title inserts, but thousands of ancillary records) was too large to run in a single operation, so I'm using <xsl:result-document> to split it into five sub-operations. Each one produces a detailed report on what it's done, so that, combined with the source XML file, should be all we need in terms of an audit trail.
It took a little longer than expected because of two additional aspects I hadn't planned for initially. Both the lenders and the owners tables have mechanisms for distinguishing between individuals and institutions, and I ended up adding some code to determine whether new records being added fell into one or the other category, by examining the name of the new owner or lender. In addition, I decided to parse out the forenames and surnames of individuals.
I would post the script here, but it's very long, very messy, based on invented XML in the source document, and it's a one-shot task that we won't have to do again, so it's not of any general use. However, the techniques I've figured out for this probably will be used again.
I haven't yet run it against the live database, only against the dev db; I don't want to make major changes right at the end of the day, so I'm going to do it tomorrow morning, after taking a fresh backup, and recording the upper bounds of the current db tables, so that I can report to J-SR and VG, and they can look at the results.
Working with a single record, I've got the following stages done:
I'm now working on lenders. Once I have one record working, I'll start trying blocks of 5, 10, 20, 50, etc. till the whole thing works.
I've decided to go the PHP route for the moment. These two blocks of code would be run before and after running the PHP process, and I'm putting them here so I can get them when I need them. They only need to be run once, and I'll certainly be hacking away with my scripts for many hours in between.
# Add the extra columns we need.
ALTER TABLE `props` ADD `temp_tracking` VARCHAR(1024);
ALTER TABLE `titles` ADD `temp_tracking` VARCHAR(1024);
ALTER TABLE `owners` ADD `temp_tracking` VARCHAR(1024);
ALTER TABLE `lenders` ADD `temp_tracking` VARCHAR(1024);
ALTER TABLE `mortgages` ADD `temp_tracking` VARCHAR(1024);
ALTER TABLE `other_docs` ADD `temp_tracking` VARCHAR(1024);
# Delete the extra columns we added.
ALTER TABLE `props` DROP COLUMN `temp_tracking`;
ALTER TABLE `titles` DROP COLUMN `temp_tracking`;
ALTER TABLE `owners` DROP COLUMN `temp_tracking`;
ALTER TABLE `lenders` DROP COLUMN `temp_tracking`;
ALTER TABLE `mortgages` DROP COLUMN `temp_tracking`;
ALTER TABLE `other_docs` DROP COLUMN `temp_tracking`;
I've written the basic code to generate the SQL to add the core title table records, built in the PHP framework for running queries, and tested with a single record to demonstrate that it's practical. Now I can start building the more complicated queries.
The insertion code I'm writing needs to use stored procedures in order to do slightly complex things such as IF THEN ELSE structures. However, I've discovered that it's impossible to call a stored procedure which has more than one statement in it from inside phpMyAdmin because the CLIENT_MULTI_STATEMENTS flag is not set by phpMyAdmin in its connection string. This is not configurable in phpMyAdmin; you'd need to change the actual source to make this possible.
I confirmed this by running a test at the mysql command line on mysqldev.hcmc.uvic.ca (like this: mysql -u hcmc -p properties_dev). From the CLI, I could successfully define a procedure and call it, but from inside phpMyAdmin I could only define the procedure; attempts to call it result in Error 1312 (0A000): PROCEDURE proc_name can't return a result set in the given context. The same applies to SQL scripts uploaded into phpMyAdmin via the Import function.
So it seems to me that one useful working method would be one whereby I have phpMyAdmin open (for easy checking of the results of my script), and I copy my script from the local machine (where it's generated by XSLT which runs against the XML data from the spreadsheet) over to mysqldev; then I run the script from the command line, and check the results. Not ideal at all. But possible.
Another approach would be to generate not SQL but PHP, with embedded SQL, and run the PHP script; the PHP connection method enables you to set the requisite flag when making the connection. That might be easier, but again I'd need to save it and copy it over to home1t in order to be able to run it; and the extra embedding, given the scale of the operation I want to do, might be a problem. So the first option is probably the best.
It's taken me half the day to figure this out. Curses.
Now that I have the data in a usable XML format (an example record is below), I need to plan the process by which SQL statements are generated and sequenced in order to add it to the database. There are a number of key issues that need to be carefully handled:
This is the proposed sequence:
<record gen-id="d1e30029" orig-id="5701075"> <property> <district>196</district> <plan>196</plan> <block>70</block> <lot>2</lot> <census-tract>57.01</census-tract> </property> <title> <title-code>J68726L</title-code> <traces/> <t1-doc-lots>1, 2 and 3</t1-doc-lots> <traces>H9471L,H9472L, na, na, na, na, na, na, na, na, na, na</traces> <date>1981-09-02</date> <ownership-split>Shek Holdings Co. Ltd. (Incorporation no. 241563): 7/10 A. Youngson Investments Ltd (Incorporation no. 240974): 3/10</ownership-split> <notes/> </title> <owners> <owner> <name>Shek Holdings Co. Ltd. (Incorporation no. 241563)</name> <occupation/> <street-num>539</street-num> <street>Main St</street> <city/> <province/> <post-code/> <undivided-interest>7/10</undivided-interest> </owner> <owner> <name>A. Youngson Investments Ltd (Incorporation no. 240974)</name> <occupation/> <street-num>940</street-num> <street>West King Edward Ave</street> <city>n</city> <province/> <post-code/> <undivided-interest>3/10</undivided-interest> </owner> </owners> <mortgages> <mortgage gen-id="mort_d1e30029_d1e30236"> <doc-num>M H6566</doc-num> <year/> <cancelled>n</cancelled> <cancelled-date/> <value/> <modified/> <modified-date/> <modified-by/> <term-date/> <interest/> <remarks>Lots 1 and 2</remarks> <lenders> <lender>The Imperial Life Assurance Company of Canada</lender> </lenders> </mortgage> <mortgage gen-id="mort_d1e30029_d1e30263"> <doc-num>M H7796</doc-num> <year>1980</year> <cancelled>y</cancelled> <cancelled-date>1981</cancelled-date> <value/> <modified/> <modified-date/> <modified-by/> <term-date/> <interest/> <remarks>Lots 1 and 2</remarks> <lenders> <lender>Canadian Imperial Bank of Commerce</lender> </lenders> </mortgage> <mortgage gen-id="mort_d1e30029_d1e30295"> <doc-num>M H45708</doc-num> <year>1980</year> <cancelled>y</cancelled> <cancelled-date>1981</cancelled-date> <value/> <modified/> <modified-date/> <modified-by/> <term-date/> <interest/> <remarks>Lots 1 and 2</remarks> <lenders> <lender>Canadian Imperial Bank of Commerce</lender> </lenders> </mortgage> </mortgages> <other-docs> <doc> <doc-num>AR H6567</doc-num> <year>1980</year> <parties>The Imperial Life Assurance Company of Canada</parties> <remarks>Lots 1 and 2</remarks> </doc> </other-docs> </record>
I've now finished the second stage of this work, with nest_data.xsl complete and working; all mortgages, lenders and associated documents are now organized into a structured XML document which can be used to generate SQL. In the process, I had to manually fix some of the column headers in unwrapped_fods.xml, because they were inconsistent in terms of spelling and case; that means the original unwrap_fods.xsl can't be run again to re-run the whole process without doing the same manual intervention. However, I don't think that will be necessary, since the data in unwrapped_fods.xml is complete. So now we have unwrapped_fods_nested.xml, and I can start generating SQL from that.
I've spent most of today figuring out an approach for importing the old spreadsheet data discussed in previous posts. This is what I've got to so far:
<owners> element containing only one <owner> element for each owner who actually exists. (The XML format is invented just for this purpose, because it's purely transitional.) I now have to do the same for the 22 mortgage blocks.Once I have this XML, the plan is to create four XSL stylesheets which produce SQL inserts for the property, mortgage, owner and title tables. Each of these tables will have a new field added to it which can be used to store a unique value for each source record (row in the original table); then these fields can be used to create records in the linking tables. Finally, a process will have to be worked out which can generate lists of potential duplicate records, which can then be checked by a human, and merged if appropriate. This problem is not directly related to the current one, but the issue of duplicates will be much worsened by the automated adding of data through this process.
The following remaining tasks from yesterday are now completed:
This remains to be done:
This is the SQL for the db side of the changes:
/** ** SECOND PHASE: June 30 changes. */ /* Set the institution type field to 1 (the default) for all records. */ UPDATE `owners` SET `own_insttype_id_fk` = "1"; /* ** I wanted to delete records in the owners_to_ethnicities table where there's both an 11 ** and a 10; in this case, we don't need the 10 because the 11 supercedes it. */ /* ** First check what we have. There were 97 records before deletion. */ SELECT * FROM `owners_to_ethnicities` AS ote1 WHERE ote1.ote_ethnicity_id_fk = "10" AND EXISTS (SELECT * FROM `owners_to_ethnicities` AS ote2 WHERE ote1.ote_owner_id_fk = ote2.ote_owner_id_fk AND ote2.ote_ethnicity_id_fk = "11"); /* ** We can't actually delete records because you can't query the same table in a subquery as you're deleting from in the main clause; this is a bit of a showstopper. So we'll have to take a different approach. We first set the value for the ethnicity 10 (Private institution), then override it where there's an 11. Finally we set the value for 9. */ SELECT * FROM `owners` WHERE EXISTS (SELECT * FROM `owners_to_ethnicities` AS ote1 WHERE ote1.ote_ethnicity_id_fk = "10" AND owners.own_owner_id = ote1.ote_owner_id_fk); /* 266 results. */ UPDATE `owners` SET `own_insttype_id_fk` = "3" WHERE EXISTS (SELECT * FROM `owners_to_ethnicities` AS ote1 WHERE ote1.ote_ethnicity_id_fk = "10" AND owners.own_owner_id = ote1.ote_owner_id_fk); SELECT * FROM `owners` WHERE EXISTS (SELECT * FROM `owners_to_ethnicities` AS ote1 WHERE ote1.ote_ethnicity_id_fk = "11" AND owners.own_owner_id = ote1.ote_owner_id_fk); /* 97 results */ UPDATE `owners` SET `own_insttype_id_fk` = "2" WHERE EXISTS (SELECT * FROM `owners_to_ethnicities` AS ote1 WHERE ote1.ote_ethnicity_id_fk = "11" AND owners.own_owner_id = ote1.ote_owner_id_fk); SELECT * FROM `owners` WHERE EXISTS (SELECT * FROM `owners_to_ethnicities` AS ote1 WHERE ote1.ote_ethnicity_id_fk = "9" AND owners.own_owner_id = ote1.ote_owner_id_fk); /* 10 results */ UPDATE `owners` SET `own_insttype_id_fk` = "4" WHERE EXISTS (SELECT * FROM `owners_to_ethnicities` AS ote1 WHERE ote1.ote_ethnicity_id_fk = "9" AND owners.own_owner_id = ote1.ote_owner_id_fk); /* ** Now we can delete the three values from the ethnicities table, which will result in the deletion of the linked records. */ SELECT * FROM ethnicities where eth_ethnicity_id in (9,10,11); /* 3 results */ SELECT * FROM owners_to_ethnicities where ote_ethnicity_id_fk in (9,10,11); /* 373 results: 266 + 97 + 10 */ DELETE FROM ethnicities where eth_ethnicity_id in (9,10,11); /* Check that linking records were deleted. */ SELECT * FROM owners_to_ethnicities where ote_ethnicity_id_fk in (9,10,11); /* 0 results */
From the task list in the previous post, this is what I've accomplished this afternoon (tested in dev, and migrated to live):
The following remain to be completed:
This is the SQL for the db side of the changes:
/* ** This is a list of changes in db structure and massages to data ** which were done starting June 29 2011. See the blog posting of ** the same date for more details and explanation. */ ALTER TABLE titles CHANGE COLUMN ttl_price ttl_consideration int(11) default NULL; ALTER TABLE titles ADD ttl_declaredvalue int(11) default NULL; ALTER TABLE titles ADD ttl_marketvalue int(11) default NULL; ALTER TABLE titles ADD ttl_transferdate date not NULL AFTER ttl_date; ALTER TABLE owners ADD own_head_office varchar(128); CREATE TABLE IF NOT EXISTS `sellers_to_titles` ( `stt_stt_id` int(11) NOT NULL auto_increment, `stt_owner_id_fk` int(11) NOT NULL, `stt_title_id_fk` int(11) NOT NULL, PRIMARY KEY (`stt_stt_id`), KEY `stt_owner_id_fk` (`stt_owner_id_fk`), KEY `stt_title_id_fk` (`stt_title_id_fk`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; ALTER TABLE `sellers_to_titles` ADD CONSTRAINT `sellers_to_titles_ibfk_1` FOREIGN KEY (`stt_owner_id_fk`) REFERENCES `owners` (`own_owner_id`) ON DELETE CASCADE ON UPDATE CASCADE, ADD CONSTRAINT `sellers_to_titles_ibfk_2` FOREIGN KEY (`stt_title_id_fk`) REFERENCES `titles` (`ttl_title_id`) ON DELETE CASCADE ON UPDATE CASCADE; CREATE TABLE IF NOT EXISTS `lawyers_to_titles` ( `ltt_ltt_id` int(11) NOT NULL auto_increment, `ltt_owner_id_fk` int(11) NOT NULL, `ltt_title_id_fk` int(11) NOT NULL, PRIMARY KEY (`ltt_ltt_id`), KEY `ltt_owner_id_fk` (`ltt_owner_id_fk`), KEY `ltt_title_id_fk` (`ltt_title_id_fk`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; ALTER TABLE `lawyers_to_titles` ADD CONSTRAINT `lawyers_to_titles_ibfk_1` FOREIGN KEY (`ltt_owner_id_fk`) REFERENCES `owners` (`own_owner_id`) ON DELETE CASCADE ON UPDATE CASCADE, ADD CONSTRAINT `lawyers_to_titles_ibfk_2` FOREIGN KEY (`ltt_title_id_fk`) REFERENCES `titles` (`ttl_title_id`) ON DELETE CASCADE ON UPDATE CASCADE; CREATE TABLE IF NOT EXISTS `institution_types` ( `ins_ins_id` int(11) NOT NULL auto_increment, `ins_name` varchar(45) collate utf8_unicode_ci NOT NULL, `ins_desc` varchar(256) collate utf8_unicode_ci NOT NULL, PRIMARY KEY (`ins_ins_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; INSERT INTO `institution_types` (ins_ins_id, ins_name, ins_desc) VALUES (1, "", "Not an institution."), (2, "Ethnic institution", "Insert description of what this means."), (3, "Private institution", "Insert description of what this means."), (4, "Public institution", "Insert description of what this means.") ; ALTER TABLE `owners` ADD `own_insttype_id_fk` int(11) default 1; ALTER TABLE `owners` ADD CONSTRAINT `owners_ibfk_1` FOREIGN KEY (`own_insttype_id_fk`) REFERENCES `institution_types` (`ins_ins_id`) ON DELETE CASCADE ON UPDATE CASCADE;
Met with JS-R and VG to discuss the next phase of data entry. These are the changes we need to make:
Coming down the pipe, there's also a requirement to parse a complex spreadsheet containing many hundreds of title records into the db somehow. I'm thinking this:
Currently we're using ethnicity to store info which is not really suited to it (Private_institution and Public_institution). This info belongs in the institution field, but that's currently a boolean, so it needs to be turned into a lookup to another table. Then the data from the ethnicities field needs to be migrated to it, and deleted from the ethnicities tables, simplifying ethnicity and munging.
Right now we have a system which munges ethnicities based on this algorithm:
(SELECT CASE WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Japanese%" AND group_concat(ethnicities.eth_name separator ', ') LIKE "%Chinese%" THEN "c" WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Japanese%" AND (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "b" WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Chinese%" AND (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "a" WHEN (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "d" ELSE SUBSTRING_INDEX(group_concat(ethnicities.eth_name separator ', '), ',', 1) END) AS munged_ethnicity
However, there are now more ethnicities, including Private_institution, which need their own specific handling during the munge process. That algorithm needs to be rewritten, and in the process we should replace the anonymous variables with something more useful.
On JS-R's instructions, simplified the view which creates the abbreviated pseudo-field in the titles table which contains property information:
RENAME TABLE `props_abbr` TO `props_abbr_old`; CREATE VIEW `props_abbr` AS select `props`.`prp_property_id` AS `prpabbr_property_id`,concat(_utf8' B:',LPAD(`props`.`prp_block`, 3, '0'),_utf8' L:',`props`.`prp_lot`) AS `prpabbr_property_info` from `props`;
This also pads the block data with leading zeroes. We can't do that with the lot info because lot identifiers are not numeric.
It appears that we should not have based the titles_to_prectitles table on the traces data; this did include some links to previous titles, but also some title codes which were simply written on the record and were not necessarily intended to point to previous titles. Therefore this is what I've done:
Following my previous post, this is the SQL to complete the titles_to_prectitles table based on the content of the titles_to_titles table:
INSERT INTO titles_to_prectitles (ttp_title_id_fk, ttp_prectitle_id_fk) SELECT ttt_newer_title_id_fk, ttt_older_title_id_fk FROM titles_to_titles as ttt WHERE NOT EXISTS ( SELECT * FROM titles_to_prectitles AS ttp WHERE ( ttp.ttp_title_id_fk = ttt.ttt_newer_title_id_fk AND ttp.ttp_prectitle_id_fk = ttt.ttt_older_title_id_fk ) )
This results in the insertion of 647 new records into titles_to_prectitles.
The same operation can be done in reverse, to add 214 more records to titles_to_titles:
INSERT INTO titles_to_titles (ttt_newer_title_id_fk, ttt_older_title_id_fk) SELECT ttp_title_id_fk, ttp_prectitle_id_fk FROM titles_to_prectitles as ttp WHERE NOT EXISTS ( SELECT * FROM titles_to_titles AS ttt WHERE ( ttp.ttp_title_id_fk = ttt.ttt_newer_title_id_fk AND ttp.ttp_prectitle_id_fk = ttt.ttt_older_title_id_fk ) )
This results in both tables having 2193 records.
It appears that many forward links were created (titles_to_titles) from titles to subsequent titles which were not mirrored by corresponding backward links (titles_to_prectitles). Since we only used titles_to_prectitles in our views, many title progressions are not yet reflected accurately.
The solution is to create titles_to_prectitles links directly from the titles_to_titles table -- they should essentially be mirrors of each other. This should be possible fairly simply using SQL, and changes would immediately be reflected in the views. We should also do the reverse, ensuring that both tables are in sync.
This SQL gives us (if I'm correct in my assumptions) an overview of the situation we're in with regard to records missing from titles_to_prectitles:
SELECT ttt.*, old_titles.ttl_title_code AS older_title_code, new_titles.ttl_title_code AS newer_title_code FROM `titles_to_titles` AS ttt LEFT JOIN titles AS old_titles on ttt.ttt_older_title_id_fk = old_titles.ttl_title_id LEFT JOIN titles AS new_titles on ttt.ttt_newer_title_id_fk = new_titles.ttl_title_id WHERE NOT EXISTS ( SELECT * FROM titles_to_prectitles AS ttp WHERE ( ttp.ttp_title_id_fk = ttt.ttt_newer_title_id_fk AND ttp.ttp_prectitle_id_fk = ttt.ttt_older_title_id_fk ) ) LIMIT 0 , 700
This can be extended to create tehe missing records in titles_to_prectitles, and then re-engineered to do the reverse and create any missing records in titles_to_titles.
In breaks between workshop sessions, I tried to address the problem of the length of time queries are taking by using temporary tables constructed directly in the query instead of joining to views. The approach I took (joining to a temporary table created in the join clause) had no effect -- the query took exactly the same time. But there may be a way to create a temporary table once earlier in the query, and then join to it later. Still working on this.
Reworked VW_owners_full so that it includes more records and data (a regular join changed to a left join):
DROP VIEW IF EXISTS owners_full; CREATE VIEW VW_owners_full AS ( SELECT owners.*, owners_to_ethnicities.ote_ethnicity_id_fk, ethnicities.eth_name, owners_to_titles.ott_title_id_fk AS title_id, titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date FROM owners_to_titles LEFT JOIN owners ON owners.own_owner_id = owners_to_titles.ott_owner_id_fk LEFT JOIN titles ON owners_to_titles.ott_title_id_fk = titles.ttl_title_id LEFT JOIN owners_to_ethnicities ON owners.own_owner_id = owners_to_ethnicities.ote_owner_id_fk LEFT JOIN ethnicities ON owners_to_ethnicities.ote_ethnicity_id_fk = ethnicities.eth_ethnicity_id order by owners.own_surname )
DROP VIEW IF EXISTS VW_trans_composite_eth_prop; CREATE VIEW VW_trans_composite_eth_prop AS ( SELECT seller_titles.ttl_title_id AS seller_title_id, seller_titles.ttl_date AS seller_title_date, seller_titles.ttl_title_code AS seller_title_code, buyer_titles.ttl_title_id AS buyer_title_id, buyer_titles.ttl_date AS buyer_title_date, buyer_titles.ttl_title_code AS buyer_title_code, DATEDIFF(buyer_titles.ttl_date, seller_titles.ttl_date) as seller_duration_days, census_tracts.census_tract_code, props.*, sellers.concat_owners AS concat_sellers, sellers.concat_ethnicities AS seller_ethnicities, sellers.total_owners AS total_sellers, sellers.munged_ethnicity AS seller_munged_eth, sellers.total_institutional AS institutional_sellers, buyers.concat_owners AS concat_buyers, buyers.concat_ethnicities AS buyer_ethnicities, buyers.total_owners AS total_buyers, buyers.munged_ethnicity AS buyer_munged_eth, buyers.total_institutional AS institutional_buyers FROM titles AS buyer_titles LEFT JOIN titles_to_prectitles ON buyer_titles.ttl_title_id = titles_to_prectitles.ttp_title_id_fk LEFT JOIN titles AS seller_titles ON seller_titles.ttl_title_id = titles_to_prectitles.ttp_prectitle_id_fk LEFT JOIN VW_titles_composite_eth AS sellers ON seller_titles.ttl_title_id = sellers.ttl_title_id LEFT JOIN VW_titles_composite_eth AS buyers ON buyer_titles.ttl_title_id = buyers.ttl_title_id LEFT JOIN props AS props ON seller_titles.ttl_property_id_fk = props.prp_property_id LEFT JOIN census_tracts ON props.prp_census_tract_id_fk = census_tracts.census_tract_id ORDER BY seller_titles.ttl_title_id )
Revised the VW_titles_composite_eth view so that it includes even owners who have no ethnicity (are institutional):
DROP VIEW IF EXISTS VW_titles_composite_eth; CREATE VIEW VW_titles_composite_eth AS ( SELECT titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date, CAST(group_concat(owners.own_owner_id separator ', ') AS CHAR) AS concat_owners, SUM(owners.own_institutional) AS total_institutional, CAST(group_concat(owners_to_ethnicities.ote_ethnicity_id_fk separator ', ') AS CHAR) AS concat_eth_ids, group_concat(ethnicities.eth_name separator ', ') AS concat_ethnicities, (SELECT IF (COUNT(owners.own_owner_id) > 1, 'multiple', 'single')) AS multi_owner, (COUNT(owners.own_owner_id)) AS total_owners, (SELECT CASE WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Japanese%" AND group_concat(ethnicities.eth_name separator ', ') LIKE "%Chinese%" THEN "c" WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Japanese%" AND (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "b" WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Chinese%" AND (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "a" WHEN (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "d" ELSE SUBSTRING_INDEX(group_concat(ethnicities.eth_name separator ', '), ',', 1) END) AS munged_ethnicity FROM titles LEFT JOIN owners_to_titles ON owners_to_titles.ott_title_id_fk = titles.ttl_title_id LEFT JOIN owners ON owners_to_titles.ott_owner_id_fk = owners.own_owner_id LEFT JOIN owners_to_ethnicities ON owners.own_owner_id = owners_to_ethnicities.ote_owner_id_fk LEFT JOIN ethnicities ON owners_to_ethnicities.ote_ethnicity_id_fk = ethnicities.eth_ethnicity_id GROUP BY titles.ttl_title_id )
The latest requirement is a field containing a comma-separated list of the owner names where the owners are institutional.
JS-R needed to add the full property information to the transactions view, so I've added a new view:
DROP VIEW IF EXISTS VW_trans_composite_eth_prop; CREATE VIEW VW_trans_composite_eth_prop AS ( SELECT seller_titles.ttl_title_id AS seller_title_id, seller_titles.ttl_date AS seller_title_date, seller_titles.ttl_title_code AS seller_title_code, buyer_titles.ttl_title_id AS buyer_title_id, buyer_titles.ttl_date AS buyer_title_date, buyer_titles.ttl_title_code AS buyer_title_code, DATEDIFF(buyer_titles.ttl_date, seller_titles.ttl_date) as seller_duration_days, census_tracts.census_tract_code, props.*, sellers.concat_owners AS concat_sellers, sellers.concat_ethnicities AS seller_ethnicities, sellers.total_owners AS total_sellers, sellers.munged_ethnicity AS seller_munged_eth, buyers.concat_owners AS concat_buyers, buyers.concat_ethnicities AS buyer_ethnicities, buyers.total_owners AS total_buyers, buyers.munged_ethnicity AS buyer_munged_eth FROM titles AS buyer_titles LEFT JOIN titles_to_prectitles ON buyer_titles.ttl_title_id = titles_to_prectitles.ttp_title_id_fk LEFT JOIN titles AS seller_titles ON seller_titles.ttl_title_id = titles_to_prectitles.ttp_prectitle_id_fk LEFT JOIN VW_titles_composite_eth AS sellers ON seller_titles.ttl_title_id = sellers.ttl_title_id LEFT JOIN VW_titles_composite_eth AS buyers ON buyer_titles.ttl_title_id = buyers.ttl_title_id LEFT JOIN props AS props ON seller_titles.ttl_property_id_fk = props.prp_property_id LEFT JOIN census_tracts ON props.prp_census_tract_id_fk = census_tracts.census_tract_id ORDER BY seller_titles.ttl_title_id )
This takes even longer to generate than the original, unfortunately.
Had to use a separate view to aggregate owners against titles. It's working, but it's mighty slow!
DROP VIEW IF EXISTS VW_trans_composite_eth; CREATE VIEW VW_trans_composite_eth AS ( SELECT seller_titles.ttl_title_id AS seller_title_id, seller_titles.ttl_date AS seller_title_date, seller_titles.ttl_title_code AS seller_title_code, buyer_titles.ttl_title_id AS buyer_title_id, buyer_titles.ttl_date AS buyer_title_date, buyer_titles.ttl_title_code AS buyer_title_code, DATEDIFF(buyer_titles.ttl_date, seller_titles.ttl_date) as seller_duration_days, census_tracts.census_tract_code, sellers.concat_owners AS concat_sellers, sellers.concat_ethnicities AS seller_ethnicities, sellers.total_owners AS total_sellers, sellers.munged_ethnicity AS seller_munged_eth, buyers.concat_owners AS concat_buyers, buyers.concat_ethnicities AS buyer_ethnicities, buyers.total_owners AS total_buyers, buyers.munged_ethnicity AS buyer_munged_eth FROM titles AS buyer_titles LEFT JOIN titles_to_prectitles ON buyer_titles.ttl_title_id = titles_to_prectitles.ttp_title_id_fk LEFT JOIN titles AS seller_titles ON seller_titles.ttl_title_id = titles_to_prectitles.ttp_prectitle_id_fk LEFT JOIN VW_titles_composite_eth AS sellers ON seller_titles.ttl_title_id = sellers.ttl_title_id LEFT JOIN VW_titles_composite_eth AS buyers ON buyer_titles.ttl_title_id = buyers.ttl_title_id LEFT JOIN props ON seller_titles.ttl_property_id_fk = props.prp_property_id LEFT JOIN census_tracts ON props.prp_census_tract_id_fk = census_tracts.census_tract_id ORDER BY seller_titles.ttl_title_id )
I've completed two of the three tasks from this morning: the VW_sellers_duration view has been modified to rename its fields, and the VW_mortgate_ethnicity view has been created.
DROP VIEW IF EXISTS VW_mortgage_ethnicity; CREATE VIEW VW_mortgage_ethnicity AS (SELECT mortgages.*, titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date, census_tracts.census_tract_code, CAST(group_concat(owners.own_owner_id separator ', ') AS CHAR) AS concat_owners, CAST(group_concat(owners_to_ethnicities.ote_ethnicity_id_fk separator ', ') AS CHAR) AS concat_eth_ids, group_concat(ethnicities.eth_name separator ', ') AS concat_ethnicities, (SELECT IF (COUNT(owners.own_owner_id) > 1, 'multiple', 'single')) AS multi_owner, (COUNT(owners.own_owner_id)) AS total_owners, (SELECT CASE WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Japanese%" AND group_concat(ethnicities.eth_name separator ', ') LIKE "%Chinese%" THEN "c" WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Japanese%" AND (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "b" WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Chinese%" AND (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "a" WHEN (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "d" ELSE SUBSTRING_INDEX(group_concat(ethnicities.eth_name separator ', '), ',', 1) END) AS munged_ethnicity FROM mortgages LEFT JOIN mortgages_to_titles ON mortgages.mgg_mortgage_id = mortgages_to_titles.mtt_mortgage_id_fk LEFT JOIN titles on mortgages_to_titles.mtt_title_id_fk = titles.ttl_title_id LEFT JOIN owners_to_titles ON owners_to_titles.ott_title_id_fk = titles.ttl_title_id LEFT JOIN owners ON owners_to_titles.ott_owner_id_fk = owners.own_owner_id LEFT JOIN owners_to_ethnicities ON owners.own_owner_id = owners_to_ethnicities.ote_owner_id_fk LEFT JOIN ethnicities ON owners_to_ethnicities.ote_ethnicity_id_fk = ethnicities.eth_ethnicity_id LEFT JOIN props ON titles.ttl_property_id_fk = props.prp_property_id LEFT JOIN census_tracts ON props.prp_census_tract_id_fk = census_tracts.census_tract_id GROUP BY mortgages.mgg_mortgage_id)
I'm now working on the transaction view, but I've run into an issue variously known as the "Cartesian product" or "Cascading aggregates" problem: in simple terms, when I aggregate both the buyers and the sellers in a single row, I end up with multiple instances. For example, if there are two buyers and three sellers, then each buyer appears three times and each seller twice, for spurious totals of six buyers and six sellers.
What this means is that my strategy for munging buyers OR sellers works only when we're only looking at one of them; when we need to munge both of them in the same row, a different strategy has to be adopted. Right now, I don't know what that strategy might be, but I'm doing some research.
Met with JS-R and worked through requirements for new views of the data:
Started experimenting with building a view of the titles table which will include a sort of composite ethnicity column, as detailed in the preceding post. I now have the key owner and ethnicity data munged together in a view created like this:
CREATE VIEW VW_titles_composite_eth AS (SELECT titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date, CAST(group_concat(owners.own_owner_id separator ', ') AS CHAR) AS concat_owners, CAST(group_concat(owners_to_ethnicities.ote_ethnicity_id_fk separator ', ') AS CHAR) AS concat_eth_ids, group_concat(ethnicities.eth_name separator ', ') AS concat_ethnicities, (SELECT IF (COUNT(owners.own_owner_id) > 1, 'multiple', 'single')) AS multi_owner, (COUNT(owners.own_owner_id)) AS total_owners, (SELECT CASE WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Japanese%" AND group_concat(ethnicities.eth_name separator ', ') LIKE "%Chinese%" THEN "c" WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Japanese%" AND (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "b" WHEN group_concat(ethnicities.eth_name separator ', ') LIKE "%Chinese%" AND (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "a" WHEN (SELECT COUNT(DISTINCT(ethnicities.eth_name))) > 1 THEN "d" ELSE SUBSTRING_INDEX(group_concat(ethnicities.eth_name separator ', '), ',', 1) END) AS munged_ethnicity FROM titles JOIN owners_to_titles ON owners_to_titles.ott_title_id_fk = titles.ttl_title_id JOIN owners ON owners_to_titles.ott_owner_id_fk = owners.own_owner_id JOIN owners_to_ethnicities ON owners.own_owner_id = owners_to_ethnicities.ote_owner_id_fk JOIN ethnicities ON owners_to_ethnicities.ote_ethnicity_id_fk = ethnicities.eth_ethnicity_id GROUP BY titles.ttl_title_id)
This has the following values, as previously requested:
JS-R has made the following request for a new column in the titles table, which would handle a kind of munged-together value for all the owners' ethnicities. It would be a string field, presumably (unless we want to give these numbers to keep them short).
Maybe we need two new variables as characteristics of titles:
1. Multiple or single owner:
a. multiple
b. single
2. Ownership ethnicity (don't worry here whether they are individuals or groups):
a. Mixed ethnicity with Chinese
b. Mixed Ethnicity with Japanese (include Japanese provisional)
c. Mixed ethnicity with Chinese and Japanese
d. Mixed ethnicity--ALL Others (all not included above, including mixes that include the Asia: other category but neither Japanese or Chinese, if they include both Asia other and Japanese or Chinese, categorize as indicated above)
e. ELSE--> copy the single ethnicity (or shared ethnicity) here
It turns out that this one should be buyers instead of vendors: Vendor ethnicity by mortgage institutional or not (we will fill in this data as we go) (with analysis possible by date and by sample). So here we go with:
Buyer ethnicity by mortgage institutional or not (we will fill in this data as we go) (with analysis possible by date and by sample)
Here's the SQL:
CREATE VIEW VW_buyers_and_lenders AS (SELECT titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date, owners.own_owner_id, owners.own_surname, owners.own_forenames, buyer_ethnicities.eth_name AS buyer_ethnicity, mortgages.mgg_mortgage_id, mortgages.mgg_doc_num, mortgages.mgg_value, mortgages.mgg_interest, lenders.lnd_display_name, lenders.lnd_individual, lender_ethnicities.eth_name AS lender_ethnicity FROM titles JOIN owners_to_titles ON titles.ttl_title_id = owners_to_titles.ott_title_id_fk LEFT JOIN owners ON owners_to_titles.ott_owner_id_fk = owners.own_owner_id LEFT JOIN owners_to_ethnicities ON owners.own_owner_id = owners_to_ethnicities.ote_owner_id_fk LEFT JOIN ethnicities AS buyer_ethnicities ON owners_to_ethnicities.ote_ethnicity_id_fk = buyer_ethnicities.eth_ethnicity_id JOIN mortgages_to_titles ON titles.ttl_title_id = mortgages_to_titles.mtt_title_id_fk LEFT JOIN mortgages ON mortgages_to_titles.mtt_mortgage_id_fk = mortgages.mgg_mortgage_id LEFT JOIN lenders_to_mortgages ON mortgages.mgg_mortgage_id = lenders_to_mortgages.ltm_mortgage_id_fk LEFT JOIN lenders ON lenders_to_mortgages.ltm_lender_id_fk = lenders.lnd_lender_id LEFT JOIN lenders_to_ethnicities ON lenders.lnd_lender_id = lenders_to_ethnicities.lte_lender_id_fk LEFT JOIN ethnicities AS lender_ethnicities ON lenders_to_ethnicities.lte_ethnicity_id_fk = lender_ethnicities.eth_ethnicity_id)
I've included lender ethnicities and lender "individual", even though no data has been entered for those yet.
We now have six views, which I've renamed to make them more obviously views in the phpMyAdmin interface:
I think these cover all the current needs. VW_owners_full is just a view of all the core data about owners; I was intending to use it as part of other views, but didn't need to. I'll leave it there in case it's useful.
Created a new view for sellers with full details of the transaction, but also including ownership duration:
CREATE VIEW VW_sellers_duration AS (SELECT old_titles.ttl_title_id AS old_title_id, old_titles.ttl_date AS old_title_date, old_titles.ttl_title_code AS old_title_code, new_titles.ttl_title_id AS new_title_id, new_titles.ttl_date AS new_title_date, new_titles.ttl_title_code AS new_title_code, DATEDIFF(new_titles.ttl_date, old_titles.ttl_date) as duration_days, census_tracts.census_tract_code, sellers.own_owner_id AS seller_id, sellers.own_surname AS seller_surname, sellers.own_forenames AS seller_forenames, seller_eth.eth_name as seller_ethnicity FROM titles AS new_titles JOIN titles_to_prectitles ON new_titles.ttl_title_id = titles_to_prectitles.ttp_title_id_fk JOIN titles AS old_titles ON old_titles.ttl_title_id = titles_to_prectitles.ttp_prectitle_id_fk JOIN owners_to_titles AS sellers_to_titles on old_titles.ttl_title_id = sellers_to_titles.ott_title_id_fk JOIN owners AS sellers ON sellers_to_titles.ott_owner_id_fk = sellers.own_owner_id LEFT JOIN owners_to_ethnicities AS sellers_to_ethnicities ON sellers.own_owner_id = sellers_to_ethnicities.ote_owner_id_fk LEFT JOIN ethnicities AS seller_eth ON sellers_to_ethnicities.ote_ethnicity_id_fk = seller_eth.eth_ethnicity_id LEFT JOIN props ON old_titles.ttl_property_id_fk = props.prp_property_id LEFT JOIN census_tracts ON props.prp_census_tract_id_fk = census_tracts.census_tract_id)
This slightly amended version gets all buyers linked to a title even where other info is missing:
SELECT buyers.own_owner_id AS buyer_id, buyers.own_surname AS buyer_surname, buyers.own_forenames AS buyer_forenames, buyer_eth.eth_name AS buyer_ethnicity, census_tracts.census_tract_code, titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date, titles_to_prectitles.ttp_prectitle_id_fk, sellers.own_owner_id AS seller_id, sellers.own_display_name AS seller_name, seller_eth.eth_name AS seller_ethnicity FROM owners AS buyers LEFT JOIN owners_to_titles on buyers.own_owner_id = owners_to_titles.ott_owner_id_fk LEFT JOIN titles on owners_to_titles.ott_title_id_fk = titles.ttl_title_id LEFT JOIN titles_to_prectitles ON titles.ttl_title_id = titles_to_prectitles.ttp_title_id_fk LEFT JOIN owners_to_ethnicities ON buyers.own_owner_id = owners_to_ethnicities.ote_owner_id_fk LEFT JOIN ethnicities AS buyer_eth ON owners_to_ethnicities.ote_ethnicity_id_fk = buyer_eth.eth_ethnicity_id JOIN owners_to_titles AS sellers_to_titles ON titles_to_prectitles.ttp_prectitle_id_fk = sellers_to_titles.ott_title_id_fk LEFT JOIN owners AS sellers ON sellers_to_titles.ott_owner_id_fk = sellers.own_owner_id LEFT JOIN owners_to_ethnicities AS sellers_to_ethnicities ON sellers.own_owner_id = sellers_to_ethnicities.ote_owner_id_fk LEFT JOIN ethnicities AS seller_eth ON sellers_to_ethnicities.ote_ethnicity_id_fk = seller_eth.eth_ethnicity_id LEFT JOIN props ON titles.ttl_property_id_fk = props.prp_property_id LEFT JOIN census_tracts ON props.prp_census_tract_id_fk = census_tracts.census_tract_id ORDER BY buyers.own_surname
Here are the latest requests:
I have code elsewhere on the blog which is close to the second of these. The first is more convoluted.
The next view is this one:
Vendor ethnicity by buyer ethnicity (with analysis possible by date and by sample)
For this we need an individual "transfer" record which includes the following columns:
I could start by creating a comprehensive view of owners, though; I could then join to this view in the creation of other views, making the process a bit more modular. Here's the sql:
DROP VIEW IF EXISTS owners_detail CREATE VIEW owners_full AS (SELECT owners.*, owners_to_ethnicities.ote_ethnicity_id_fk, ethnicities.eth_name, owners_to_titles.ott_title_id_fk AS title_id, titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date FROM (((owners RIGHT JOIN owners_to_titles ON owners.own_owner_id = owners_to_titles.ott_owner_id_fk) JOIN titles on owners_to_titles.ott_title_id_fk = titles.ttl_title_id) JOIN owners_to_ethnicities ON owners.own_owner_id = owners_to_ethnicities.ote_owner_id_fk) JOIN ethnicities ON owners_to_ethnicities.ote_ethnicity_id_fk = ethnicities.eth_ethnicity_id ORDER BY owners.own_surname)
However, I ended up building the entire thing as a single view, like this:
DROP VIEW IF EXISTS buyers_and_sellers CREATE VIEW buyers_and_sellers AS (SELECT buyers.own_owner_id AS buyer_id, buyers.own_surname AS buyer_surname, buyers.own_forenames AS buyer_forenames, buyer_eth.eth_name AS buyer_ethnicity, census_tracts.census_tract_code, titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date, titles_to_prectitles.ttp_prectitle_id_fk, sellers.own_owner_id AS seller_id, sellers.own_display_name AS seller_name, seller_eth.eth_name AS seller_ethnicity FROM owners AS buyers JOIN owners_to_titles on buyers.own_owner_id = owners_to_titles.ott_owner_id_fk JOIN titles on owners_to_titles.ott_title_id_fk = titles.ttl_title_id JOIN titles_to_prectitles ON titles.ttl_title_id = titles_to_prectitles.ttp_title_id_fk JOIN owners_to_ethnicities ON buyers.own_owner_id = owners_to_ethnicities.ote_owner_id_fk JOIN ethnicities AS buyer_eth ON owners_to_ethnicities.ote_ethnicity_id_fk = buyer_eth.eth_ethnicity_id JOIN owners_to_titles AS sellers_to_titles ON titles_to_prectitles.ttp_prectitle_id_fk = sellers_to_titles.ott_title_id_fk JOIN owners AS sellers ON sellers_to_titles.ott_owner_id_fk = sellers.own_owner_id JOIN owners_to_ethnicities AS sellers_to_ethnicities ON sellers.own_owner_id = sellers_to_ethnicities.ote_owner_id_fk JOIN ethnicities AS seller_eth ON sellers_to_ethnicities.ote_ethnicity_id_fk = seller_eth.eth_ethnicity_id JOIN props ON titles.ttl_property_id_fk = props.prp_property_id JOIN census_tracts ON props.prp_census_tract_id_fk = census_tracts.census_tract_id ORDER BY buyers.own_surname)
Right now this brings back 500 records, so I assume there are only 500 instances where we have a transfer from one person to another in which we know both the buyer and the seller. However, I need to check this carefully; I'll also probably have to loosen up the query to get us transfers for which less info is available.
The next view I've completed is the following:
Total transfers by property (sortable by decade and by sample)
because we have all the data for it.
SELECT props.*, titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date, census_tracts.census_tract_code FROM (`props` RIGHT JOIN titles on titles.ttl_property_id_fk = props.prp_property_id) JOIN census_tracts on props.prp_census_tract_id_fk = census_tracts.census_tract_id ORDER BY titles.ttl_date
SELECT props.*, titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date, census_tracts.census_tract_code, titles_to_prectitles.ttp_prectitle_id_fk AS `prectitle_id` FROM ((`props` RIGHT JOIN titles on titles.ttl_property_id_fk = props.prp_property_id) JOIN census_tracts on props.prp_census_tract_id_fk = census_tracts.census_tract_id) RIGHT JOIN titles_to_prectitles on titles_to_prectitles.ttp_title_id_fk = titles.ttl_title_id ORDER BY titles.ttl_date
DROP VIEW IF EXISTS trans_by_property CREATE VIEW trans_by_property AS (SELECT props.prp_property_id, props.prp_property_code, props.prp_district, props.prp_plan, props.prp_block, props.prp_lot, titles.ttl_title_id, titles.ttl_title_code, titles.ttl_date, census_tracts.census_tract_code, titles_to_prectitles.ttp_prectitle_id_fk AS `prectitle_id` FROM ((`props` RIGHT JOIN titles on titles.ttl_property_id_fk = props.prp_property_id) JOIN census_tracts on props.prp_census_tract_id_fk = census_tracts.census_tract_id) RIGHT JOIN titles_to_prectitles on titles_to_prectitles.ttp_title_id_fk = titles.ttl_title_id ORDER BY props.prp_property_id)
Sent an XLS dump of this view to JS-R, and also created a user for him so he can read and export views himself.
We're beginning the process of creating specific views which can be used for statistical analysis of the data. This is the set of views requested by JS-R:
Now we have to formalize those in terms of the various tables and fields that need to be involved. This is my first attempt at the first view:
CREATE VIEW trans_by_date AS SELECT titles_to_prectitles.ttp_ttp_id, titles.ttl_title_id, titles_to_prectitles.ttp_prectitle_id_fk, titles.ttl_title_code, titles.ttl_date, props.prp_census_tract_id_fk, census_tracts.census_tract_code FROM ((titles_to_prectitles JOIN titles on titles_to_prectitles.ttp_title_id_fk = titles.ttl_title_id) JOIN props on titles.ttl_property_id_fk = props.prp_property_id) JOIN census_tracts on props.prp_census_tract_id_fk = census_tracts.census_tract_id ORDER BY titles.ttl_date
With help from JN, this is the SQL that creates links where there is a match:
INSERT INTO `titles_to_prectitles` (`ttp_title_id_fk`, `ttp_prectitle_id_fk`) SELECT `tt2`.`ttl_title_id` AS `id_1` , `tt1`.`ttl_title_id` AS `id2` FROM `titles` AS `tt1` INNER JOIN `titles` AS `tt2` ON tt1.ttl_title_code = tt2.ttl_traces
This checks that there are no duplicate records in the linking table (in case the manual data entry had already created one of the ones we've just created):
SELECT ttp1.ttp_ttp_id, ttp2.ttp_ttp_id FROM titles_to_prectitles ttp1, titles_to_prectitles ttp2 WHERE ttp1.ttp_title_id_fk = ttp2.ttp_title_id_fk and ttp1.ttp_prectitle_id_fk = ttp2.ttp_prectitle_id_fk and ttp1.ttp_ttp_id < ttp2.ttp_ttp_id
This clears out the Traces field wherever these links have been created:
UPDATE titles ttl1 SET ttl_traces = "" WHERE ttl_traces IN (SELECT * FROM (SELECT ttl_title_code FROM titles ttl2) as codes)
Note that the nested subquery is necessary because MySQL won't allow you to use the table you're updating in a WHERE clause.
These are the results:
From a total of 2274 titles:
1115 titles still have data in their Traces field. Some of these items might be linkable by a human. For instance:
Obviously these sorts of decisions can only be made by a human.
This simpler query shows up more potential hits:
SELECT * FROM `titles` AS `tt1` WHERE EXISTS ( SELECT * FROM `titles` WHERE tt1.ttl_title_code = titles.ttl_traces )
This query brings back the candidate ids:
SELECT `tt1`.`ttl_title_id` AS `id_1`, `tt2`.`ttl_title_id` AS `id2` FROM `titles` AS `tt1` LEFT JOIN `titles` AS `tt2` ON tt1.ttl_title_code = tt2.ttl_traces
We'll put more work into this tomorrow.
Came to automate the linking of titles to preceding titles as specified in the previous post, but found that with this data, manual intervention was essential. I initially isolated 38 candidate records like this:
SELECT `ttl_title_id` FROM `titles` AS `ts1` WHERE NOT EXISTS (SELECT * FROM `titles_to_prectitles` WHERE `ttp_title_id_fk` = `ts1`.`ttl_title_id`) AND EXISTS (SELECT `ttl_title_id` FROM `titles` AS `ts2` WHERE TRIM(`ts1`.`ttl_traces`) = `ts2`.`ttl_title_id`) ORDER BY `ts1`.`ttl_title_id`
But then I looked at each candidate individually, because in many cases, there was only a partial entry in the Traces field. For instance, looking at a record with "2144K" in the Traces field, we find there are two possible candidates for the precedint title, 2144K-19 and 2144K-21. We can select the correct one by checking the Property field, so (for instance) we can tell that if the Property field for the candidate record has 254 B:39 L:19, and 2144K-19 also has 254 B:39 L:19, then that's the correct preceding title.
With this method I was able to match nearly all the 38 candidates to the correct preceding record(s).
We end up with the following:
Don't know how useful this will be.
The manual work on preceding titles has been done, so the automated component can now be undertaken. This is the task:
Caveat: there are some items where there are brackets around one or more characters; we'll probably have to ignore those (and lookups will fail anyway).
Meeting with JS-R and his RA today, preparing for the next phase of data entry. Mostly this will be manual fixes to the existing records, but there are a couple of things I need to do.
I had to programmatically split out the forenames and surnames into separate fields from the display name field, in both the owners and lenders table. We agreed to split on the last space, and then fix the results manually. Here's how to get the surname:
SELECT RIGHT( `own_display_name` , LOCATE( ' ', REVERSE( `own_display_name` ) ) - 1 ) FROM `owners`
and this is the bit preceding the surname:
SELECT LEFT( `own_display_name` , CHAR_LENGTH( `own_display_name`) - LOCATE( ' ', REVERSE( `own_display_name` ) ) ) FROM `owners`
This is the update command for surnames:
UPDATE `owners` SET `own_surname` = (SELECT RIGHT( `own_display_name` , LOCATE( ' ', REVERSE( `own_display_name` ) ) - 1 )) WHERE CHAR_LENGTH(`own_surname`) = 0
and this is for forenames:
UPDATE `owners` SET `own_forenames` = (SELECT LEFT( `own_display_name` , CHAR_LENGTH( `own_display_name`) - LOCATE( ' ', REVERSE( `own_display_name` ) ) )) WHERE CHAR_LENGTH(`own_forenames`) = 0
These are the commands for lenders:
UPDATE `lenders` SET `lnd_surname` = (SELECT RIGHT( `lnd_display_name` , LOCATE( ' ', REVERSE( `lnd_display_name` ) ) - 1 )) WHERE CHAR_LENGTH(`lnd_surname`) = 0 UPDATE `lenders` SET `lnd_forenames` = (SELECT LEFT( `lnd_display_name` , CHAR_LENGTH( `lnd_display_name`) - LOCATE( ' ', REVERSE( `lnd_display_name` ) ) )) WHERE CHAR_LENGTH(`lnd_forenames`) = 0
Tested on dev, and implemented on live.
Updates mentioned in the preceding post:
ALTER TABLE `props` ADD `prp_street_num` varchar(45) collate utf8_unicode_ci default NULL; ALTER TABLE `props` ADD `prp_street` varchar(45) collate utf8_unicode_ci default NULL;
In class PropProperty:
$this->addField(new MdhStrField('prp_street_num', 'Street number', '', array(), 45, true));
$this->addField(new MdhStrField('prp_street', 'Street name', '', array(), 45, true));
New table for ethnicities for lenders:
-- -- Table structure for table `lenders_to_ethnicities` -- CREATE TABLE IF NOT EXISTS `lenders_to_ethnicities` ( `lte_lte_id` int(11) NOT NULL auto_increment, `lte_lender_id_fk` int(11) NOT NULL, `lte_ethnicity_id_fk` int(11) NOT NULL, PRIMARY KEY (`lte_lte_id`), KEY `lte_ethnicity_id_fk_idx` (`lte_ethnicity_id_fk`), KEY `lte_lender_id_fk_idx` (`lte_lender_id_fk`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1 ; -- -- Constraints for table `lenders_to_ethnicities` -- ALTER TABLE `lenders_to_ethnicities` ADD CONSTRAINT `lenders_to_ethnicities_ibfk_1` FOREIGN KEY (`lte_lender_id_fk`) REFERENCES `lenders` (`lnd_lender_id`) ON DELETE CASCADE ON UPDATE CASCADE, ADD CONSTRAINT `lenders_to_ethnicities_ibfk_2` FOREIGN KEY (`lte_ethnicity_id_fk`) REFERENCES `ethnicities` (`eth_ethnicity_id`) ON DELETE CASCADE ON UPDATE CASCADE;
In class PropLender:
$this->addField(new MdhOneToManyField('lnd_ethnicities', 'Ethnicities',
'lnd_lender_id',
'ethnicities',
'eth_ethnicity_id',
'eth_name',
'lenders_to_ethnicities',
'lte_lte_id',
'lte_lender_id_fk',
'lte_ethnicity_id_fk',
true,
'eth_name',
true));
Linking titles to preceding titles:
-- -- Table structure for table `titles_to_prectitles` -- CREATE TABLE IF NOT EXISTS `titles_to_prectitles` ( `ttp_ttp_id` int(11) NOT NULL auto_increment, `ttp_title_id_fk` int(11) NOT NULL, `ttp_prectitle_id_fk` int(11) NOT NULL, PRIMARY KEY (`ttp_ttp_id`), KEY `ttp_title_id` (`ttp_title_id_fk`), KEY `ttp_prectitle_id` (`ttp_prectitle_id_fk`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=3157 ; -- -- Constraints for table `titles_to_prectitles` -- ALTER TABLE `titles_to_prectitles` ADD CONSTRAINT `titles_to_prectitles_ibfk_1` FOREIGN KEY (`ttp_title_id_fk`) REFERENCES `titles` (`ttl_title_id`) ON DELETE CASCADE ON UPDATE CASCADE, ADD CONSTRAINT `titles_to_prectitles_ibfk_2` FOREIGN KEY (`ttp_prectitle_id_fk`) REFERENCES `titles` (`ttl_title_id`) ON DELETE CASCADE ON UPDATE CASCADE;
In class PropTitles:
$this->addField(new MdhOneToManyField('ttl_prec_titles', 'Preceding titles',
'ttl_title_id',
'titles',
'ttl_title_id',
'ttl_title_code',
'titles_to_prectitles',
'ttp_ttp_id',
'ttp_title_id_fk',
'ttp_prectitle_id_fk',
true, 'ttl_title_code', true));
My original intention was to populate this last "Preceding titles" field programmatically by parsing the data in the current Traces field, which should also be pointing back at previous title data. However, after looking at some of that data, I think it's going to have to be done by a human. Here's an example of why:
If you go to the Titles table, then choose the third property in the drop-down Property list (005 B:103 L:B), and filter by it, you'll see five results. These should, in theory, represent a succession of titles relating to the same property, by date; the Traces field for each one should point back to the preceding one. However:
In short, writing code that does a reasonable job of gleaning the preceding title data from what exists in the Traces field is not going to be practical or reliable; a human should go through the data and work this out, I think.
Points arising out of a meeting with JS-R this morning:
The main realization that emerged for me was that the core of the research focus relates to transactions (events whereby one property is transferred from one owner to another); but transactions do not constitute datapoints in the records themselves. Transactions are primarily represented by titles. Each title has a date (which is the date ownership changed), and information about the new owners; but it doesn't have any information about the previous title or owners. Titles are related only in that they relate to the same property.
What we need to do is to generate a Transactions view which brings together all the data about each transaction: property, date, old title, new title, old owner(s), new owner(s), and ethnicities of both sets of owners. This is more complicated than it seems. I think the first stage is probably to create a field in the existing Titles table which hard-links to the previous title on the same property. This can be computed based on Property and date, but there is also the Traces field which should already contain a list of older titles on the same property. The first thing to do is to calculate the previous title based on property and date, and then compare it against the data in the text field to see if we have an acceptable degree of consistency. There are also issues regarding the unification of titles; one title may have multiple precursors rather than just one.
The following is a set of questions that we are going to attempt to answer through queries against the current database, using specially-constructed views, to test the difficulty of addressing research questions through a programmer writing SQL; these will then act as a testbed for an attempt to imagine ways in which an interface could be coded which would allow the researcher to frame such questions without the intervention of a programmer, and get tabular results.
Immediate tasks for me:
Meanwhile, JS-R and his RA will be adding ethnicity data to Owners and Lenders, and splitting out their names, as well as adding the new address info to Titles.