My P4-to-P5 conversion is now working and producing valid output for abstracts, entries, and the project metadata file. I may do more work on this, but I'll be moving on to the -ography stuff next.
More work ahead of the release next month: introducing a new recommendation to use xml-model.
I have a script that is currently generating a list of candidate duplicate owners. This is the script:
#!/bin/bash
#This script is designed to run a series of comparison tests of xml-encoded owner
#records in an attempt to discover possible duplicates, which are then to be investigated
#by the PI manually.
#Threshold below which to consider a possible dupe
MINSIM=0.1
#First, paths to files.
USM_JAR=/home/mholmes/WorkData/netbeans/uniSimMetric/dist/uniSimMetric.jar
NCD_COMMAND="ncd -l "
INPUTFILE=/home/mholmes/WorkData/history/stanger-ross/properties/xml/owners_12_04_27_flattened.txt
OUTFILE="/home/mholmes/WorkData/history/stanger-ross/properties/xml/owner_dupe_candidates_`date +%Y%m%d`.txt"
#Echo the start out to the output file.
echo "Possible duplicate owners found by string comparison using USM">$OUTFILE
echo "">>$OUTFILE
#Initialize a counter
C=0
#Read in the inputs line by line
cat $INPUTFILE | while read line;
do
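#Count every line, including empty ones, so that C stays in step with the line numbers awk uses below.
let "C=$C+1"
#Ignore empty lines. This ensures we can read five lines forward (there are five empty lines at the end of the file).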
LEN=${#line}
if [ $LEN -gt "3" ];
then
for ((N=$C+1; N<$C+6; N++))
do
STR2=`awk NR==${N} $INPUTFILE`;
#Call the USM to compare them.
USM=`java -jar $USM_JAR -compare -str1="$line" -str2="$STR2"`
#Call NCD to compare them
# NCD=`$NCD_COMMAND "$line" "$STR2"`
#NCD outputs the second string on the command line before the score; we need to remove it.
# NCD=${NCD/$STR2}
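#If the USM score is below the MINSIM threshold, output info to the output file.
#The comparison is done numerically in awk, because [[ ... < ... ]] compares strings, not numbers.
if awk -v a="$USM" -v b="$MINSIM" 'BEGIN { exit !(a + 0 < b + 0) }';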
then
echo "Found similarity"
echo $line | sed -n 's/.*<owners><own_owner_id>\(.*\)<\/own_owner_id>.*/\1/p'>>$OUTFILE
echo $STR2 | sed -n 's/.*<owners><own_owner_id>\(.*\)<\/own_owner_id>.*/\1/p'>>$OUTFILE
echo "">>$OUTFILE
fi
done
fi
done
#Display the output file.
gedit "$OUTFILE"
echo "Done!"
exit
This is successfully producing a list of candidate matches right now, outputting the ids of the two candidates followed by a blank line, for each candidate match.
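Since each candidate match is written as two ids followed by a blank line, the list can be collapsed into one pair per line for easier review. The following is only a minimal sketch, assuming the output layout described above; `pairs_from_candidates` is a hypothetical helper name, not part of the script.

```shell
#!/bin/bash
# Sketch: turn each blank-line-separated block of two ids into one tab-separated line.
# pairs_from_candidates is a hypothetical helper, not part of the original script.
pairs_from_candidates() {
    # RS="" puts awk into paragraph mode, so each blank-line-separated block is one record.
    # FS="\n" makes each line of the block a field.
    # Blocks with fewer than two lines (such as the one-line header) are skipped.
    awk 'BEGIN { RS = ""; FS = "\n" } NF >= 2 { print $1 "\t" $2 }' "$1"
}
```

Calling `pairs_from_candidates "$OUTFILE"` after the main loop would print one candidate pair per line, which may be easier to sort or de-duplicate than the block layout.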
Arranged and confirmed meeting next week with DF, DR, SA and myself to discuss next steps with their Cascade website.
Received payment for HB's 2012 1st quarter.
Deposited payment; receipt filed in HCMC records.