I had to find all instances of criminals in which both the given-name and surname fields are identical and the number of such duplicates
Here's the SQL I used for the example:
SELECT `surname`, `given_name`, COUNT(`surname`) AS Instances FROM criminals GROUP BY `surname`, `given_name` HAVING ( COUNT(`surname`) > 1 ) AND ( COUNT(`given_name`) > 1)
and here's the same SQL with abstracted table and field names:
SELECT `field1Name`, `field2Name`, COUNT(`field1Name`) AS Instances FROM tableName GROUP BY `field1Name`, `field2Name` HAVING ( COUNT(`field1Name`) > 1 ) AND ( COUNT(`field2Name`) > 1)
When I put a value into the given-names textfield in the search box and then tried to execute a search I got back an error message which began with the following:
Fatal error: Uncaught exception 'PDOException' with message 'SQLSTATE[42S22]: Column not found: 1054 Unknown column 'c.given' in 'where clause''
It took a couple of hours wading through the form, and the json and zend code working with the form and the data model until I found this line of code in includes/classes/Search.php
protected function _prepareCriminals() {
$this->_select->joinLeft(
array('c' => 'criminals'),
'c.criminal_id = tf.criminal_id_fk',
array('c.criminal_id', 'c.surname', 'c.given_name', 'c.age')
);
...
if ($this->inData('given-name')) {
$this->_select->where('LOWER(c.given-name) LIKE ?', '%' . strtolower($this->_data['given-name']) . '%');
}
I changed one "given-name" to a "given_name" (see below) and the queries now seem to work:
protected function _prepareCriminals() {
$this->_select->joinLeft(
array('c' => 'criminals'),
'c.criminal_id = tf.criminal_id_fk',
array('c.criminal_id', 'c.surname', 'c.given_name', 'c.age')
);
if ($this->inData('given-name')) {
$this->_select->where('LOWER(c.given_name) LIKE ?', '%' . strtolower($this->_data['given-name']) . '%');
}
so, apparently because given-name was not a member of the c array, the instance of LOWER(c.given-name) was interpreted as LOWER(c.given)-name. name of course was null so presumably had no effect, resulting in c.given being the name of the field in the table to search, which of course doesn't exist generating the error I saw.
Previously, if there were more than 20 years in the range of returned data, the years appearing along the x-axis in the flot graph overwrote each other.
I changed the init-flot.js file by replacing this:
xaxis: {
tickSize: 1,
tickFormatter: function(val, axis) {
if (!yearTotals[val]) {
return val;
} else {
return val + '<br />' + yearTotals[val];
}
}
}
with this:
xaxis: {
ticks: 20,
tickDecimals: 0,
tickFormatter: function(val, axis) {
return ' ' + val;
}
}
The ticks limits the number of years explicitly labelled to a maximum of 20, thus taking care of the overlapping year display on the x-axis when there are lots of years in the chart.
The tickDecimals controls whether fractions of years get ticks on the x-axis. For small year ranges, if this is not set to 0, you get values like 1744.5 appearing, which is nonsensical for this dataset.
Since not all years get a tick on the x-axis, it didn't seem to make sense to display the total hits, so I've removed that from the tickFormatter.
I prefer the year label to be left justified to the left side of the column in the graph, rather than centered on the left side of the column, so I added those spaces to push the date to the left.
Original website (i.e. Stew's version): http://web.uvic.ca/~lang02/bailey/
New development website (note: requires Netlink login): http://web.uvic.ca/~lang02/bailey_v2/
XML schema: http://web.uvic.ca/~lang02/bailey/schema/bailey_trialfile_proofing.rng
The Data
The data has gone through a few schema changes, all documented on the blog. See, in chronological order (i.e. earliest first):
For each set of data that SD sends, I do the following:
Once I get the file back from SD, I do the following:
Validation
As noted at the top of this document, the schema file is at http://web.uvic.ca/~lang02/bailey/schema/bailey_trialfile_proofing.rng . SD validates against this file so it always needs to be current.
The Website
The development website ( http://web.uvic.ca/~lang02/bailey_v2/ ) is a feature-complete, and mostly design-complete (I went for a muted look) HTML5-validated website. The site employs Jquery-driven Javascript to handle AJAX requests.
On the search form, the Jquery plugin bsmSelect is used to make multiple select fields more user-friendly. The plugin homepage is here: http://plugins.jquery.com/project/bsmSelect , and the blog post is here: http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7864&more=1&c=1&tb=1&pb=1
The site search – both the form and the results – is handled by the Search suite of classes located at includes/classes (Search.php and the Search subdirectory). The Search_Form class, located at Search/Form.php, is responsible for displaying all of the search fields and populating them with the proper values. All of the classes are fully documented.
The site uses the Zend Framework's DB library for database interaction. The classes are located at includes/classes/Zend . All of the classes used in the site follow the PEAR class naming convention for easy autoloading. The PEAR convention dictates the following:
Each subdirectory that a class is in should be represented in the name, with single underscores separating each level of the hierarchy. For example, if a class lives in Search/Form.php, its class name would be Search_Form . Similarly, if a class lives in My/Name/Is/Bob.php, its class name would be My_Name_Is_Bob . This convention has two primary benefits:
The search results pages – both the results table and a single record – use the Model_ series of classes to display information, as well as either Search_Result (for the results summary) or Search_Single (for a detailed record view). Each row of a table (trials, trial_files, etc.) is represented by a class, which allows for easy placement of formatting functions and the like.
The Chart / Data Visualization
The chart to visualize results uses Flot, a Jquery-based plotting tool:
http://code.google.com/p/flot/
To accomplish the 'stacking' of the results, I used the stacking plugin, which is included wtih Flot:
http://people.iola.dk/olau/flot/examples/stacking.html
The Flot chart gets its information from a dynamic JSON dataset, which is based on a submitted search form. When a search is submitted, it's stored in the user's session for easy re-use on either the search page (to modify results) or the chart page. This is also how the “passing” of data from the chart to the results table (and vice versa) is accomplished – both just read and parse any search information in the session. The results themselves are not saved in the session, just the search query parameters.
The Flot initialization code is in js/init-flot.js – it's almost all 'stock' and taken from the examples, with the minor exception of a modification I had to make to get the totals for each year display under the year name in on the X axis of the chart.
I'm overall very happy with Flot, as is SD. It's fairly simple, decently fast, and, unlike some alternatives, still under active development.
[SA 120116 made changes affecting what's talked about in this post, see http://hcmc.uvic.ca/blogs/index.php?blog=36&p=9010&more=1&c=1&tb=1&pb=1]
Met with SD yesterday primarily to discuss two things: how to link together criminals from different trials that are the same person, and how to link trials in different trial_files to each other (i.e. for a bunch of criminals tried for the same crime). We determined that the simplest, quickest, and therefore most prudent way to do this is to use an "id" attribute to link together the elements as required. So, if two <criminal> elements both describe the same person, then each <criminal> element in question should look like this:
<criminal id="myuniqueidentifier">
(stuff)
</criminal>
where "myuniqueidentifir" can be any string as long as it's the same for each criminal that should be treated as the same person. <criminal> elements that don't correspond to any other <criminal> element do <strong>not</strong> need an id attribute.
Similarly, for <trial> elements that refer to the same trial (but a different criminal):
<trial rel="5">
(stuff)
</trial>
In the case of trials, 'rel' does not have to be absolutely unique, just unique within one court session (i.e the same trial_file_start_date and trial_file_end_date values). SD requested this to make it easier to keep track of the numbers.
I will be able to alter the import script to account for these new attributes, but I'm not sure if SD will have a chance to go over all of the data before my contract is up.
In my previous post about making francotoile.uvic.ca work, I outlined writing a VirtualHost configuration that could successfully rewrite a pear.hcmc.uvic.ca Tomcat URL to francotoile.uvic.ca. It seems, however, that with this configuration, HTTP session information isn't transmitted between Tomcat and Apache. In other words, when using the URL rewritten with the VirtualHost, a new session is created every page load and does not persist. This does not happen when using the original URL (the pear.hcmc.uvic.ca one).
I had hoped to finish this post off with "to fix this, just add...", but, sadly, I've yet to find a solution. Martin has pointed out that both http://mariage.uvic.ca/ and http://bcgenesis.uvic.ca/ , which also use mod_rewrite and mod_jk, handle sessions correctly.
It turns out that both of those sites use cookies, not sessions. Since a cookie is a natural choice in this case anyway - a language preference for the site - I've changed the code in my i18n eXist plugin to use cookies instead of sessions, which does the trick. Martin has suggested to use both in tandem, so that the system tries to use sessions but falls back on a cookie if session are unavailable. I will try to implement this enhancement before my time is up.
Because unix7.uvic.ca can't handle dates earlier than Dec 14, 1901, the proprietary date format in the XML files - currently 1759Dec31 - will need to be changed to a standard YYYY-MM-DD format, i.e. 1759-12-31. The XML -> MySQL import script can't do any date processing, so the date needs to be in the proper format in the original data.
The unix.uvic.ca machines - at least, the ones I've used - have a 32-bit architecture. Because of this, PHP's timestamp handling is limited from Fri, 13 Dec 1901 20:45:54 GMT to Tue, 19 Jan 2038 03:14:07 GMT (see http://uk3.php.net/manual/en/function.date.php ). 64-bit architecture allows for basically limitless timestamps.
This limitation poses a problem because the Bailey data importer deals exclusively with dates before 1901.
UPDATE table SET field = !field
I've finished writing and testing a PHP script to import an XML file (validated against http://web.uvic.ca/~lang02/bailey/schema/bailey_trialfile_proofing.rng ) into the MySQL database. The script using SimpleXMLElement to load and loop through the file. The database handling is done by Zend_Db, since that's already being used in the main Bailey website.
I wrote the importer in such a way that new foreign keys (e.g. crime_normalizeds, judges, etc.) don't need to be entered into the MySQL database beforehand. Since the XML is validated against the schema, I can be reasonably certain that all of the values in the data are intentional. So, if the importer comes across a value that's not already in the MySQL database (say, a new judge), it simply inserts that value and continues with the import.
The importer isn't yet online, but once it is I'll publish its URL to this blog.
While working on the import process to bring the XML data into the website database, it occurred to me that, after all of the changes to the schema, the <case> element doesn't serve much of a function anymore. While it used to group together multiple trial files, all of the useful grouping information (namely the trial dates) has been moved to the <trial_file> elements themselves. So, we don't need the extra level in the hierarchy.
For that matter, we also don't need the <regular_cases> element that wraps up the <case> elements, since it's also an extra level of hierarchy. The schema really just needs to be: data -> one or more trial files
Thus, I've removed <regular_cases> <case> from the schema and modified the XSLT transformation scripts accordingly. I also modified the data that's already been processed.
These changes will make the PHP script that converts the XML to MySQL simpler.
I've decided to scrap the eXist rebuild for a few reasons:
The XML schema was not meant as a final destination for the data. SA created it as an ad hoc intermediary step between SD's raw data and the MySQL database. So, the schema doesn't lend itself well to being the backbone of an eXist architecture, particularly when it comes to searching and generating human-readable values. The structure of the data is highly relational and is a good fit for SQL, which, of course, was SA's original intention.
That said, my original motivation for the eXist rebuild still stands. The current method of translating the XML to SQL involves using XSLT transformations to generate SQL statements. However, every time the schema is updated with new structure or values - which is happening a lot - the XSLT stylesheets need to be updated as well. So, to get around these difficulties, I'm going to write a PHP script that will convert the XML to MySQL, likely use SimpleXML. It won't be the fastest script I've ever written, but since it will be a one-use-per-dataset kind of thing, speed isn't a big issue.
After some consultation with Greg I've decided to scrap the PHP/MySQL version of Bailey that I built and write the site in eXist. After building FrancoToile and the Lansdowne Lecture site in eXist I'm comfortable working with it, and this way the Bailey XML data can be used directly without being shoehorned into a MySQL database. The allowed values for crimes, outcomes, judges, etc. is constantly changing in the schema, so maintaining those keys in MySQL would be a long-term pain. Using eXist cuts out the XML-to-MySQL middle man. I don't expect the basic functionality of the site to take long - it'll likely be done next week sometime.
Received the 1770-79 XLS spreadsheet from SD. Minor change to the format as he's added "Outcome Durn - Yrs" and "Outcome Durn - Other" to handle outcome duration values (see previous blog post: http://hcmc.uvic.ca/blogs/index.php?blog=36&p=8333&more=1&c=1&tb=1&pb=1 ).
Received the proofed 1780-84 data file from SD. After adding "HighTreason" as a crime_normalized, the file has 100% validation.
As of today, these are the data sets that are complete (i.e. use the latest schema and are 100% valid):
Another schema change for the data: we're going to utilize the hitherto ignored outcome_duration element for outcomes. Until this point, normalized outcomes (i.e. the outcome_normalized element) with years attached, such as "transported for 1 year", "hulks for five years", etc. were simply listed as T1, Hulks5, etc. But, because the list of T# and Hulks# normalized outcomes is growing, SD and I have decided to move the numbers from the outcome_normalized element into the outcome_duration element where they belong.
Furthermore, to allow for flexibility when entering the duration, the outcome_duration element has been expanded with a number of child elements. These could also be attributes, but I've chosen to use elements since everything else in the schema uses elements rather than attributes. The modified element looks like this:
<outcome_duration>
<years>(number of years, or blank for none)</years>
<months>(number of months, or blank for none)</months>
<weeks>(number of weeks, or blank for none)</weeks>
<days>(number of days, or blank for none)</days>
<other>(Life, Remainder, or blank)</other>
</outcome_duration>
I also changed the acceptable values of outcome_normalized so that, instead of accepting T#, Hulks#, and GB#, "Transport", "Hulks", and "GoodBehavior" are the new accepted values. Also, "SelfTL" and "SelfTR" have become "SelfTransport" with either "Life" or "Remainder" in the outcome_duration/other.
The values of years, months, weeks, and days must be an integer (or blank), while the value of other must correspond to a value in the setofNormalizedDurationOthers list in the RNG schema.
Spent quite a bit of time figuring out how to add x-axis totals to each axis on the Flot chart - Flot calls these "ticks" - to display the total number of items for each year (along with the year number, which was already being displayed). Since I'm using the Flot stacks plugin to display multiple sets of data on each axis, it was a bit of a complex process.
The most time consuming part of the process was finding a "Flot-y" way to do it. The API documentation is OK, but a bit too informal for my tastes and glosses over a lot of things. After a lot of testing, debugging, and stepping through functions via the Chrome developer suite, I couldn't find anything suitable that could be achieved by native Flot functions dealing with the ticks. There are plenty of ways to format the data already in the tick (i.e. the year number in this case), but not to add data to the tick. So, I declared a global variable var yearTotals = [], and, in the success function that's triggered after the AJAX call that grabs the JSON data for the chart, I added a for loop to get the totals for each year and add them to the array:
function onDataReceived(series) {
$.each(series, function(index, value) {
var entry = new Array();
entry['label'] = index;
entry['data'] = value.data;
chartData.push(entry);
for (i = 0; i < value.data.length; i++) {
row = value.data[i];
if (!yearTotals[row[0]]) {
yearTotals[row[0]] = row[1];
} else {
yearTotals[row[0]] += row[1];
}
}
});
plotWithOptions(chartData);
}
Now that SD is putting his data into Excel spreadsheets, the import procedure is (thankfully!) a lot simpler than the first process:
Step 1:
Step 2:
Step 3:
<?oxygen RNGSchema="http://web.uvic.ca/~lang02/bailey/schema/bailey_trialfile_proofing.rng"type="xml"?>Judges:
Adair-J
Gould-H
Hotham-N
Nares-G
Willes-J
Willes-E
Normalized outcome:
T3
Respite delays:
ThreeDays
TenDays
If you want to loop through a series of XML elements in an XSLT stylesheet and group by more than one element (for example, the combination of <first_name> and <last_name> elements), use the concat function to create a 'virtual key':
<xsl:for-each-group select="row" group-by="concat(First_Name, ' ', Last_Name)">
<xsl:variable name="theKey" select="current-grouping-key()"/>
<xsl:for-each-group>
Met with SD today to discuss XML structure, namely the possible elimination of the rec_rep element. We came to the logical conclusion that, since the rec_rep element itself no longer has any useful information (it's all been moved to child elements), the element has become redundant and can be removed. So, I updated the proofing file to reflect the structural change and wrote an XSLT procedure to transform the files.
See: http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7853&more=1&c=1&tb=1&pb=1
Finished writing the procedure, started by SA, for how convert a raw Word data file from SD into a valid XML file. Note that, since SD will be using an Excel spreadsheet for all data beginning from the year 1810, this procedure is only valid for pre-1810 data (of which there will be plenty). I'll have to write a second procedure for converting SD's Excel files to XML once he's finished the first set of data.
The procedure is saved in a text file which will be passed back to SA upon his return. Here's the procedure verbatim:
=============================================
1) Clean text data
=============================================
save raw text as 1_cleanText.txt
replace all multiple tab with single tab
GREP search for
\t+
replace with
\t
normalize indicator of multiple trial_files in one case
GREP search for
^\s*?&\s*?&.*$
replace with
ADDITIONAL_TRIAL_FILE_IN_CASE
turn remaining ampersand characters into entities
search for
&
replace with
&
put four tabs on each trial line
put space in front of all lines with four tabs
GREP search for
^(.*?\t.*?\t.*?\t.*?\t.*?)$
replace with
\1 [space character preceding the slash 1]
add tab to all lines with only three tabs
GREP search for
^(\w.*?\t.*?\t.*?\t.*?)$
replace with
\1\t [space character preceding the slash 1]
add two tabs to all lines with only two tabs
GREP search for
^(\w.*?\t.*?\t.*?)$
replace with
\1\t\t [space character preceding the slash 1]
manually add three tabs to all lines with only one tab
GREP manually search for
^(\w.*?\t.*?)$
replace with
\1\t\t\t [space character preceding the slash 1]
put tab before instances of TRIAL that don't have one
GREP search for
^TRIAL
replace with
\tTRIAL
put TRIAL onto previous line
GREP search for
(.*?\t.*?\t.*?\t.*?)\r(\tTRIAL.*?)
replace with
\1\2
GREP search for:
Name\s*?Crime\s*?Respite\s*?Pardoned\s*?Executed\s*?\r
replace with:
[nothing]
GREP search for:
^\t*?(.*?Recorder's Report.*?)$
replace with
RECORDER_REPORT_HEAD \2
normalize smart quotes
search for
’
replace with
'
search for
‘
replace with
'
search for
“
replace with
"
search for
”
replace with
"
eliminate empty lines
GREP search for
\r\s*?\r
replace with
\r
repeat until no instances
put extraneous lines as comments on end of previous line
GREP search for
(.*?\t.*?\t.*?\t.*?\t.*?)\r\t(.*?)
replace with
\1\t\t\2
repeat until no hits
manually edit any lines still remaining
GREP search for
^\t
ensure there are 6 tabs in preceding line and copy extra line after sixth tab
ensure there are no more than 6 tabs in any line
GREP search for
^(.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?)\t
replace with
\1
repeat until no hits
removing leading spaces at start of line
GREP search for
^ [space character]
replace with
[nothing]
remove trailing spaces after tab
GREP search for
\t +[space character before +]
replace with
\t
remove leading spaces before tab
GREP search for
+\t [space character before +]
replace with
\t
Manually check in spreadsheet for any lines that should obviously be included as comments in the preceding line
File should now have no empty lines
Each line should be one of
- start with RECORDER_REPORT_HEAD
- consist of ADDITIONAL_TRIAL_FILE_IN_CASE
- contain a trial_file record with 7 tab-delimited fields:
CriminalName TAB Crime TAB Respite TAB outcome TAB ExecutionDate TAB Trial Ref TAB extra
sample extract
RECORDER_REPORT_HEAD 4-7, 9-10 DEC 1799 – Recorder's Report -> F, 31 Jan 1800
Thomas Scott (M1) Robbery Pleasure (RR, 16 July) Free (18 July 1800) [HO 13/13, pp.6-7] TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]
Bartholomew Foley (M1) St in Dwelling Pleasure (RR) TL (15 Feb 1800) [HO 13/12, pp.394-5] TRIAL -- OBSP 1799-1800, pp.9-10 [0.4]
Peter Chapman als Harry Read Burglary To Die (RR) ----- W, 26 Feb 1800 als Harry Kirk (L) (19)TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]
ADDITIONAL_TRIAL_FILE_IN_CASE
John Hall (L) (33) Burglary To Die (RR) ----- W, 26 Feb 1800 TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]
ADDITIONAL_TRIAL_FILE_IN_CASE
Joseph Jones (L) Burglary To Die (RR) ----- W, 26 Feb 1800 TRIAL -- OBSP 1799-1800, pp.88-92 [4.6 but PG]
=============================================
2) add rec_rep, rec_rep_head, case, trial_file elements
=============================================
save file as file 2_mainElements.txt
create rec_rep containers
search for
^RECORDER_REPORT_HEAD(.*?)$
replace with
</rec_rep>\r<rec_rep>\r<rec_rep_head>\1</rec_rep_head>
delete </rec_rep> line at top of document
add <regular_cases> line at top of document
add </rec_rep> line at end of document
add </regular_cases> line at bottom of document
create trial_files
GREP search for
<trial_file>(.*?)</trial_file>
replace with
<case>\r\1\r</case>
remove case elements for multiple trial_files in one case
GREP search for
</case>\r\tADDITIONAL_TRIAL_FILE_IN_CASE\r<case>\r
replace with
[nothing]
Each line of file should now be one of:
<rec_rep>
<rec_rep_head> . . . </rec_rep_head>
<case>
<trial_file> . . . </trial_file>
</case>
</rec_rep>
sample extracts:
<rec_rep>
<rec_rep_head> 4-7, 9-10 DEC 1799 – Recorder's Report -> F, 31 Jan 1800</rec_rep_head>
<case>
<trial_file>Thomas Scott (M1) Robbery Pleasure (RR, 16 July) Free (18 July 1800) [HO 13/13, pp.6-7] TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_file>
</case>
<case>
<trial_file>Bartholomew Foley (M1) St in Dwelling Pleasure (RR) TL (15 Feb 1800) [HO 13/12, pp.394-5] TRIAL -- OBSP 1799-1800, pp.9-10 [0.4]</trial_file>
</case>
<case>
<trial_file>Peter Chapman als Harry Read Burglary To Die (RR) ----- W, 26 Feb 1800 als Harry Kirk (L) (19)TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]</trial_file>
<trial_file>John Hall (L) (33) Burglary To Die (RR) ----- W, 26 Feb 1800 TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]</trial_file>
<trial_file>Joseph Jones (L) Burglary To Die (RR) ----- W, 26 Feb 1800 TRIAL -- OBSP 1799-1800, pp.88-92 [4.6 but PG]</trial_file>
</case>
</rec_rep>
=============================================
3) create criminal elements
=============================================
save file as 3_criminal
create criminal element
GREP search for
<trial_file>(.*?)\t
replace with
<trial_file><criminal>\1</criminal>
move jury element outside of criminal element
GREP search for
(<criminal>.*?)(\([A-Z]\d*?\))(.*?</criminal>)
replace with
\1\3\2
create and populate surname and given_name elements
GREP search for
(<criminal>)(.*?) (.*?) [trailing space character]
replace with
\1<surname>\2</surname><given_names>\3</given_names>
create aliases, age and gender elements
search for
</given_names>
replace with
</given_names><aliases></aliases><age></age><gender>Male</gender>
populate age element (gender are all assigned 1 above)
GREP search for
(</age><gender>Male</gender>).*?\((\d+?)\).*?(</criminal>)
replace with
\2\1\3
sample:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal>(M1)Robbery Pleasure (RR, 16 July) Free (18 July 1800) [HO 13/13, pp.6-7] TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_file>
=============================================
4) create trials elements
=============================================
save file as 4_trials
add trials, trial, judge, jury elements to trials element
search for
</criminal>
replace with
</criminal><trials><trial><judge></judge><jury><jury_type></jury_type><jury_subtype></jury_subtype></jury></trial></trials>
populate jury element
GREP search for
(</jury_type><jury_subtype>)(</jury_subtype></jury></trial></trials>)\(([A-Z])(\d*?)\)
replace with
\3\1\4\2
add crime element inside trial element
search for
</trial>
replace with
<crime><crime_text></crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime></trial>
populate crime element
GREP search for
(</crime_text>.*?</trial>)(.*?)\t
replace with
\2\1
add empty mercy_appeals element
search for
<trial_ref>
replace with
<mercy_appeals></mercy_appeals><trial_ref>
sample
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals</mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials>Pleasure (RR, 16 July) Free (18 July 1800) [HO 13/13, pp.6-7] </trial_file>
=============================================
5) create respites elements
=============================================
save file as 5_respites
add respites elements
search for
</trials>
replace with
</trials><respites><respite><respite_text></respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites>
populate respite_text element
GREP search for
(</respite_text>.*?</respites>)(.*?)\t
replace with
\2\1
sample
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals</mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites>Free (18 July 1800) [HO 13/13, pp.6-7] </trial_file>
=============================================
6) create outcomes elements
=============================================
save file as 6_outcomes
add outcome, outcome_text, outcome_ref elements
search for
</respites>
replace with
</respites><outcomes><outcome><outcome_text></outcome_text><outcome_ref></outcome_ref></outcome></outcomes>
populate outcome_text element
GREP search for
(</outcome_text><outcome_ref></outcome_ref></outcome></outcomes>)(.*?)\t
replace with
\2\1
populate outcome_ref element
GREP search for
(<outcome_text>.*?)\[(.*?)\](</outcome_text><outcome_ref>)
replace with
\1\3\2
populate outcome_ref for those records not caught by regexp above
(.*?)\[(HO.*?)\](.*?)(</outcome_ref>)
replace with
\1\3\2\4
add rest of outcome elements
search for
<outcome_ref>
replace with
<outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>
populate outcome_date element
i.e. execution date, taken from 6th tab field of original clean text
not taken from the date provided in the outcome_text
GREP search for
(</outcome_date>.*?</outcomes>)(.*?)\t
replace with
\2\1
sample:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals</mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites><outcomes><outcome><outcome_text>Free (18 July 1800) </outcome_text><outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>HO 13/13, pp.6-7</outcome_ref></outcome></outcomes></trial_file>
=============================================
7) create remaining elements
=============================================
save file as 7_allELements
add remaining elements in trial_file element:
search for
</outcomes>
replace with
</outcomes><trial_file_printed_sources></trial_file_printed_sources><trial_file_other_documents> </trial_file_other_documents><trial_file_notes></trial_file_notes>
put extraneous text into trial_file_notes field
GREP search for
(</trial_file_notes>)\t(.*?)(</trial_file>)
replace with
\2\1\3
find tab delimited snippets of text and put them into the trial_file_notes field
GREP search for
(<trial_file>.*?)\t(.*?)(<.*?)(</trial_file_notes>)(.*?)$
replace with
\1\3\2\4\5
manually search for instances of tab in trial_file lines and correct them
search for
<trial_file>.*?\t
fix manually
sample of trial_file:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals</mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites><outcomes><outcome><outcome_text>Free (18 July 1800) </outcome_text><outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>HO 13/13, pp.6-7</outcome_ref></outcome></outcomes><trial_file_printed_sources></trial_file_printed_sources><trial_file_other_documents> </trial_file_other_documents><trial_file_notes>blah blah</trial_file_notes></trial_file>
add elements to rec_rep_head
search for
<rec_rep_head>
replace with
<rec_rep_head><rec_rep_start_date></rec_rep_start_date><rec_rep_end_date></rec_rep_end_date><rec_rep_pub_date></rec_rep_pub_date><rec_rep_notes></rec_rep_notes>
populate start and end date fields
GREP search for
(</rec_rep_start_date><rec_rep_end_date>)(.*?</rec_rep_notes>)\t(.*?)\t
replace with
\3\1\3\2
populate rec_rep_pub_date element
GREP search for
(</rec_rep_pub_date><rec_rep_notes>)(</rec_rep_notes>).*?(\d+? .*? \d{1,4})(.*?)(</rec_rep_head>)
replace with
\3\1\4\2\5
normalize rep_rec_start_date
GREP search for
(<rec_rep_start_date>)(\d+).*?([A-Z]{3}).*?(\d{4})
replace with
\1\4\3\2
add zero to day of month as needed (have to use ZERO placeholder because GREP engine treats \10 as the tenth selected element, rather than the first followed by a literal zero)
GREP search for
([A-Za-z])(\d</rec_rep_start_date>)
replace with
\1!ZERO!\2
search for
!ZERO!
replace with
0
normalize rep_rec_end_date
GREP search for
(<rec_rep_end_date>).*?(\d{2}) ([A-Z]{3}[A-Z]*?) (\d{4})
replace with
\1\4\3\2
get the "20-2" instances
search for
(<rec_rep_end_date>).*?(\d)\d-(\d) ([A-Z]{3})[A-Z]*? (\d{4})(</rec_rep_end_date>)
replace with
\1\5\4\2\3\6
get the "1-3" instances
search for
(<rec_rep_end_date>).*? (\d)-(\d) ([A-Z]{3})[A-Z]*? (\d{4})(</rec_rep_end_date>)
replace with
\1\5\4!ZERO!\3\6
search for
!ZERO!
replace with
0
get the remaining instances (typically those containing commas)
GREP search for
(<rec_rep_end_date>.*?\d), (\d)replace with
replace manually with YYYYMMMdd (e.g. 1562JAN07)
normalize rec_rep_pub_date
GREP search for
(<rec_rep_pub_date>)(\d+).*?([A-Z]{3}).*?(\d{4})
replace with
\1\4\3\2
add zero to day of month as needed
GREP search for
([A-Za-z])(\d</rec_rep_pub_date>)
replace with
\1!ZERO!\2
search for
!ZERO!
replace with
0
=============================================
8) transform outcome_text, crime_text, and respite_text
=============================================
Use 7_to_8.xsl to transform 7_all_elements.xml into 8_ready_to_proof.xml.
Depending on the quirks of the particular data set, you may need to do any of the following:
- Adjust the regular expressions used in transform_functions.xsl and 7_to_8.xsl
- Add or remove irregular outcome_text and crime_text values in irregulars.xml. This file
provides a way to transform text that falls outside of the regular expression patterns.
- Adjust value_lists.xml to reflect the most up-to-date list of 'regular' crimes, outcomes,
and respites.
=============================================
9) Apply XML structural changes
=============================================
Use 8_to_9.xsl to transform the result of #8 into 9_schema_changes.xml.
This brings in structural changes that were introduced in March 2011. See blog:
http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7824&more=1&c=1&tb=1&pb=1
=============================================
10) Apply further structural changes
=============================================
Use 9_to_10.xsl to transform the result of #9 into 10.xml. This incorporates further
structural changes, namely removing the rec_rep_head element and moving the data into
individual trial files. See blog:
http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7853&more=1&c=1&tb=1&pb=1
=============================================
11) Add start_year element
=============================================
Add a <start_year> element as a direct child of <data> to the top of the result of #10. The value of this
element should be the starting year of the data set. For example, for the 1800s data set:
<data>
<start_year>1800</start_year>
(trial info and rest of file)
...
</data>
I've now got two datasets in the MySQL database, 1790-1799 and 1800-1809. Here's the relatively simple procedure for importing an XML dataset into the MySQL database:
#1 is the most important step: if the file validates, then all of the other pieces will fall into place.
SD and I met to discuss two issues: some data oddities that I wasn't sure how to handle, and the re-shuffling of the recorder report dates.
#1: Data Oddities
We've made the following changes to the value lists (respites, crimes, outcomes):
#2: Recorder's Report Dates
We also discussed a better way to handle rec_rep_start_date, rec_rep_end_date, and rec_rep_pub_date. These currently cover multiple trial_files within the same case, but, given that the dates may change for trials within one case, we've decided to move the dates to within the trial_file element. This will result in more redundant data but will also give SD the flexibility to enter date oddities. Specifically, these changes will be made:
SD's latest request for the search interface is that, when a trial file has more than one associated trial, respite, and/or outcome, only the first record in each of those subtables should be searchable. For example, if a trial file has two outcomes - 'Transported' and 'Military Service' - only the 'Transported' outcome should be searchable. I had planned to accomplish this in the search query by imposing a LIMIT, for example:
SELECT .... FROM trial_files WHERE outcomes.outcome_id IN (SELECT outcome_id FROM outcomes WHERE outcomes.trial_file_id_fk = trial_files.trial_file_id LIMIT 1)
However, it turns out that, as of version 5.5, MySQL doesn't yet support LIMIT in subqueries:
#1235 - This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
So that approach won't work. It's for the best, though, since that relies on the order of the records in the database. What I will do now is add a "rank" column to the respites, outcomes, and trials tables. Only records with a rank of 1 will be searchable. In the detail view for a record, respites, outcomes and crimes will be displayed in order of rank. The SQL:
ALTER TABLE `trials` ADD `rank` INT NULL DEFAULT NULL;
ALTER TABLE `outcomes` ADD `rank` INT NULL DEFAULT NULL;
ALTER TABLE `respites` ADD `rank` INT NULL DEFAULT NULL;
I've manually added ranks to the 1790s data in the database; future rank will be taken from the order of the elements in the XML files.
I've linked up the search form with the chart plotting tool, so now it's possible to:
Contrary to my last blog post, the search data is not passed between the views via AJAX. Although that would be 'sexier', it doesn't lend itself well to storing results for the future. Speaking of which, my next task is to devise some sort of clever method for bookmarking search results, which'll be tricky since search data is stored in the session after the initial search (to avoid POST problems). I think some sort of DB-stored session ID may be the answer.
Note that development thus far has been done on my local machine but will be pushed to the Bailey website soon.
After having good success experimenting with Flot, an open Javascript chart-plotting script based on JQuery, I've decided that it's officially "the one" for the Bailey project. The Flot homepage is here: http://code.google.com/p/flot/
I chose Flot for three main reasons over Simile's Timeplot and Google's Visualization API:I've already been working to integrate Flot for the past week or so, and am progressing at a good rate. I've written the bare-bones search form and the bare-bones visualization page; the next step is to allow the results for one to be exported to the other. I don't foresee any problems with the integration and am looking forward to seeing the tools working together soon. I'd like to design a single page that incorporates both views (search form + visualization) and uses AJAX to pass the data around, perhaps using tabs to separate the two. That way I can avoid having to constantly POST data.
Met with SD yesterday to discuss his questions and concerns about the structure of the XML file. He's been proofing the data for the last couple of weeks. As a result of his reading, he's realized that the current XML structure is lacking in a few areas. So, we're going to make the following changes:
I spent a large chunk of today exploring possible Javascript-based visualization tools for the histogram feature, which is intended to show trends over time based on any number of criteria. There seem to be two clear frontrunners:
Made some good progress on the search interface today. Instead of having one Search class to handle both searching and returning one record, I wrote a separate Search_Single class that handles fetching a trial file based on a passed ID. Since a trial file can have multiple trials, respites, and outcomes (but only one criminal), the Search_Single class grabs those associated records via separate queries. Meanwhile, the main Search class simply does a LEFT JOIN on the trials, outcomes, and respites table to find at least one record that matches the search criteria.
I chose the trial_file as the 'main' table for the search because it's the parent for trials, respites, outcomes, and criminals. Since there's only one criminal per trial file but potentially multiple trials, respites, and outcomes, the trial file makes the most sense because we don't want the same criminal to show up multiple times in the search results, which would happen if we were to search with the trials table as primary.
I've sent the 1800 decade XML data file to SD for an initial proofing. Stewart did the bulk of the raw-to-XML transformation using a series of GREP statements. For this final transformation, I used XSLT to convert raw text to 'normalized' values that match against the RNG validation file. Here's how the validation currently breaks down:
The remaining errors will need a 'trained' eye to be fixed on a case-by-case basis. Once SD is happy with the data then he'll return it and I'll perform a further transformation that will import the data into the MySQL database (for which the XSLT spreadsheets have, luckily, already been written).
I've also been spending some time this morning ensuring that the XSL files used in the transformation are well-documented.
While writing the XSLT stylesheet for the Bailey project XML data, I stumbled across a misleading 'gotcha' when using the XPath function empty(). Basically, empty() will ALWAYS return false as long as the element being tested exists. So, if your XML chunk looks like this:
<alias />
And your XPath looks like this (in this case, within XSLT):
<xsl:when test="empty(./alias)">
That test will always return false since <alias> exists. To test whether an element is empty, use the string() function. For example, if we want to determine whether <alias> is empty (which it is):
<xsl:when test="not(string(./alias))">
string() will return the string value of the element, which, in this case, is empty (but not empty() - hah).
P.S.: Martin reminded me that empty() is actually for sequences, which explains the weirdness when used with an element. But that, of course, leads one to wonder why empty() doesn't throw an error when used with a non-sequence...
In preparation for the completion of the XSLT stylesheet that will transform the XML data into a valid data set, SD has made some revisions to the list of outcomes and outcome groups. These revisions are mostly just involve refining current outcomes and making them more explicit (e.g. expanding "Executed" to "Hanged", "HangedOnSite", "Gibbeted", "Dissected", "Burned", and "Drawn").
I've got some work ahead of me to integrate the new outcomes - mostly just accounting for them when encountering irregular data in the XML - but for the most part it should be fairly straightforward. The new outcomes are in the XSLT, so now it's just a matter of adjusting the stylesheet.
Met with SD today so that I could get more familiar with the project. SD went over how he'd like the data displayed, and we talked about the current state of the project in general. He is pleased with the progress and is excited to see what lies ahead. He'll double-check the list of crimes, respites and outcomes before I send him the XML file for proofing. After we hammer down the conversion and validation process for the current data set (1800s), then the rest of the data (i.e. the rest of the 19th century) will fall into place relatively quickly.
So, the project is progressing well and both SD and myself are happy with the results thus far.
I've completed my bare-bones version of the web-based record search tool. All of the logic for the search - both displaying the available options and parsing out a form submission - is contained within two classes, Search and Search_Form. The classes build the search query in an organized manner based on passed parameters. This format makes the search easy to change or extend, which I can foresee happening based on the needs of the data and the researcher SD. All of the code is documented using comments and standard docblock syntax.
The database-side of things is handled with Zend_Db, the Zend Framework's RDBMS adapter. Since the search form can result in some very complex queries, Zend_Db makes the process of building the select statement relatively easy. Plus, it's a more robust database adapter in terms of security and integrity than I could build in any decent amount of time.
The search form doesn't yet have any styling and is bare, but now that the functionality is done everything else will fall into place eventually. SD is coming in for a meeting Thursday Jan 27th so I hope to hammer out some more details then.
As my first project at HCMC, I've been tasked with finishing Stewart's procedure for converting the raw data (as supplied by SD) to well-formed XML. Stewart did an admirable job changing a tab-delimited document into an almost-finished XML file using GREP and an exhaustive series of regular expressions. Since, however, the final steps involve populating elements based on the value of other elements (checked against a list of allowable values), I've decided - after talking with Martin and Greg - to use XSLT to finish off the procedure. I don't know XSLT, but since it will be an important part of my work here for the next eight months, this is a good opportunity to learn the language.
So the first step is to take the value of crime_text, figure out to which value it corresponds in the list of allowable crimes in the proofing RNG, and then use that value to populate crime_normalized. Then, crime_group will be populated based on the value of crime_normalized. respite_text and outcome_text will be massaged in a similar way, though respite doesn't have a respite_group (only _normalized).
Spent a good chunk of time learning my way around what's been done for the Bailey project, the XSLT language, and the specific data and code needs of this project. The consensus: writing a catch-all function in the XSL stylesheet to convert the *_text values to *_normalized values will be tough, so there will likely be the odd value that will need to be edited manually.
Simon Devereaux has approximately 10,000 records of people convicted in potentially capital cases between 1710 and 1840 in London heard at the Old Bailey court. This project will create a web-based database which will allow interested researchers and members of the public to compose queries on that data (e.g. women charged with robbery 1710-1720). It must be able to support a range of queries and produce output allowing researchers to identify trends in judicial practice over that time.
| << | Current | >> | |
| Jan | Feb | Mar | Apr |
| May | Jun | Jul | Aug |
| Sep | Oct | Nov | Dec |