I've spent most of the day working on building the new transaction view in the form of a spreadsheet, and I think it's about 2/3 done. I have ethnicity-munging working in the XSLT, as well as munging of institution types for multiple institutions. In the process, I've found and fixed some data-entry issues (as always).
I now have a two-stage XSLT process. The first stage, working on the XML output from the db, takes several minutes; this generates a view of the database which is centred specifically on transactions, importing all relevant data for each transaction into the transaction itself. Once this is done, the second-stage process then processes that output to create a spreadsheet; this is much faster (a few seconds), and it's easier to work on and test the second process without having to run the first process over and over again.
I have a script currently generating a list of candidate duplicate owners. This is how it was done:
#!/bin/bash
#This script is designed to run a series of comparison tests of xml-encoded owner
#records in an attempt to discover possible duplicates, which are then to be investigated
#by the PI manually.
#Threshold below which to consider a possible dupe
MINSIM=0.1
#First, paths to files.
USM_JAR=/home/mholmes/WorkData/netbeans/uniSimMetric/dist/uniSimMetric.jar
NCD_COMMAND="ncd -l "
INPUTFILE=/home/mholmes/WorkData/history/stanger-ross/properties/xml/owners_12_04_27_flattened.txt
OUTFILE="/home/mholmes/WorkData/history/stanger-ross/properties/xml/owner_dupe_candidates_`date +%Y%m%d`.txt"
#Echo the start out to the output file.
echo "Possible duplicate owners found by string comparison using USM">$OUTFILE
echo "">$OUTFILE
#Initialize a counter
C=0
#Read in the inputs line by line
cat $INPUTFILE | while read line;
do
#Ignore empty lines. This ensures we can read five lines forward (there are five empty lines at the end of the file).
let "C=$C+1"
LEN=${#line}
if [ $LEN -gt "3" ];
then
for ((N=$C+1; N<$C+6; N++))
do
STR2=`awk NR==${N} $INPUTFILE`;
#Call the USM to compare them.
USM=`java -jar $USM_JAR -compare -str1="$line" -str2="$STR2"`
#Call NCD to compare them
# NCD=`$NCD_COMMAND "$line" "$STR2"`
#NCD outputs the second string on the command line before the score; we need to remove it.
# NCD=${NCD/$STR2}
#If the threshold similarity is greater than the specified value, output info to the output file.
if [[ "$USM" < "$MINSIM" ]];
then
echo "Found similarity"
echo $line | sed -n 's/.*<owners><own_owner_id>\(.*\)<\/own_owner_id>.*/\1/p'>>$OUTFILE
echo $STR2 | sed -n 's/.*<owners><own_owner_id>\(.*\)<\/own_owner_id>.*/\1/p'>>$OUTFILE
echo "">>$OUTFILE
fi
done
fi
done
#Display the output file.
`gedit $OUTFILE`
echo "Done!"
exit
This is successfully producing a list of candidate matches right now, outputting the ids of the two candidates followed by a blank line, for each candidate match.
Isolated seven more problematic transactions, and fixed two of them (bad dates); started creating xsl:keys to speed up processing, and mapped tract and property information into transactions.
Suggestion from JS-R:
Maybe three variables. 1. A "true"/"false" variable--is at least one of the buyers an institution? 2. If true, what is the institution type: (1) Ethnic (2) Private (3) Public (4) Multiple 3. If there is only one institution among the buyers, what is its name?
At JS-R's request, dealt with the problem of owner names which were prefixed with "Transfer: " in the following way:
Created a new column for this info:
ALTER TABLE `owners` ADD COLUMN `own_diff_trans_name` BOOLEAN DEFAULT False NOT NULL AFTER `own_display_name`;
Checked how many rows would be affected (139):
SELECT * FROM `owners` WHERE LEFT(`own_display_name` , 10 ) = "Transfer: ";
Moved the info from the display name field to the new field:
UPDATE `owners` SET `own_diff_trans_name` = True WHERE LEFT( `own_display_name` , 10 ) = "Transfer: "; UPDATE `owners` SET `own_display_name` = REPLACE(`own_display_name`, "Transfer: ", "");
Checked the results:
SELECT * FROM `owners` WHERE `own_diff_trans_name` = True;
One task from the meeting the other day was to create a new view of transactions which adds a number of financial fields. I've now done that, using the following SQL:
DROP VIEW IF EXISTS VW_trans_composite_eth_prop_2; CREATE VIEW VW_trans_composite_eth_prop_2 AS ( SELECT seller_titles.ttl_title_id AS seller_title_id, seller_titles.ttl_date AS seller_title_date, seller_titles.ttl_title_code AS seller_title_code, seller_titles.ttl_consideration AS seller_title_consideration, seller_titles.ttl_declaredvalue AS seller_title_declaredvalue, seller_titles.ttl_marketvalue AS seller_title_marketvalue, buyer_titles.ttl_title_id AS buyer_title_id, buyer_titles.ttl_date AS buyer_title_date, buyer_titles.ttl_title_code AS buyer_title_code, DATEDIFF(buyer_titles.ttl_date, seller_titles.ttl_date) as seller_duration_days, buyer_titles.ttl_consideration AS buyer_title_consideration, buyer_titles.ttl_declaredvalue AS buyer_title_declaredvalue, buyer_titles.ttl_marketvalue AS buyer_title_marketvalue, census_tracts.census_tract_code, props.*, sellers.concat_owners AS concat_sellers, sellers.concat_ethnicities AS seller_ethnicities, sellers.total_owners AS total_sellers, sellers.munged_ethnicity AS seller_munged_eth, sellers.total_institutional AS institutional_sellers, buyers.concat_owners AS concat_buyers, buyers.concat_ethnicities AS buyer_ethnicities, buyers.total_owners AS total_buyers, buyers.munged_ethnicity AS buyer_munged_eth, buyers.total_institutional AS institutional_buyers FROM titles AS buyer_titles LEFT JOIN titles_to_prectitles ON buyer_titles.ttl_title_id = titles_to_prectitles.ttp_title_id_fk LEFT JOIN titles AS seller_titles ON seller_titles.ttl_title_id = titles_to_prectitles.ttp_prectitle_id_fk LEFT JOIN VW_titles_composite_eth AS sellers ON seller_titles.ttl_title_id = sellers.ttl_title_id LEFT JOIN VW_titles_composite_eth AS buyers ON buyer_titles.ttl_title_id = buyers.ttl_title_id LEFT JOIN props AS props ON seller_titles.ttl_property_id_fk = props.prp_property_id LEFT JOIN census_tracts ON props.prp_census_tract_id_fk = census_tracts.census_tract_id ORDER BY seller_titles.ttl_title_id )
Met with JS-R and had long discussions about the next phase of the project, and immediate needs. Arising out of this, two short-term tasks for me: