Archives for: 2011

04/11/11

Permalink 12:12:18 pm, by sarneil, 91 words, 372 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Find SQL records which have identical values in two fields

I had to find all criminal records that have identical values in both the given-name and surname fields, along with the number of such duplicates.

Here's the SQL I used for the example:

SELECT `surname`, `given_name`, COUNT(`surname`) AS Instances FROM criminals GROUP BY `surname`, `given_name` HAVING ( COUNT(`surname`) > 1 ) AND ( COUNT(`given_name`) > 1)

and here's the same SQL with abstracted table and field names:

SELECT `field1Name`, `field2Name`, COUNT(`field1Name`) AS Instances FROM tableName GROUP BY `field1Name`, `field2Name` HAVING ( COUNT(`field1Name`) > 1 ) AND ( COUNT(`field2Name`) > 1)
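As a sanity check outside the database, the same grouping can be sketched in JavaScript (a hypothetical findDuplicates helper working on an array of records, not code from the site):

```javascript
// Find (surname, given_name) pairs that occur more than once,
// mirroring the GROUP BY ... HAVING COUNT(...) > 1 query above.
function findDuplicates(rows) {
    var counts = {};
    rows.forEach(function (row) {
        // \u0000 is an unlikely character, so it makes a safe key separator
        var key = row.surname + '\u0000' + row.given_name;
        counts[key] = (counts[key] || 0) + 1;
    });
    return Object.keys(counts)
        .filter(function (key) { return counts[key] > 1; })
        .map(function (key) {
            var parts = key.split('\u0000');
            return { surname: parts[0], given_name: parts[1], instances: counts[key] };
        });
}
```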

Permalink 10:50:50 am, by sarneil, 258 words, 320 views   English (CA)
Categories: Activity Log; Mins. worked: 120

Given names form field bug

When I put a value into the given-names text field in the search box and then tried to execute a search, I got back an error message which began with the following:

Fatal error: Uncaught exception 'PDOException' with message 'SQLSTATE[42S22]: Column not found: 1054 Unknown column 'c.given' in 'where clause''

It took a couple of hours of wading through the form, and the JSON and Zend code working with the form and the data model, until I found this line of code in includes/classes/Search.php:

protected function _prepareCriminals() {
    $this->_select->joinLeft(
        array('c' => 'criminals'),
        'c.criminal_id = tf.criminal_id_fk',
        array('c.criminal_id', 'c.surname', 'c.given_name', 'c.age')
    );
    ...
    if ($this->inData('given-name')) {
        $this->_select->where('LOWER(c.given-name) LIKE ?', '%' . strtolower($this->_data['given-name']) . '%');
    }

I changed one "given-name" to a "given_name" (see below) and the queries now seem to work:

protected function _prepareCriminals() {
    $this->_select->joinLeft(
        array('c' => 'criminals'),
        'c.criminal_id = tf.criminal_id_fk',
        array('c.criminal_id', 'c.surname', 'c.given_name', 'c.age')
    );

    if ($this->inData('given-name')) {
        $this->_select->where('LOWER(c.given_name) LIKE ?', '%' . strtolower($this->_data['given-name']) . '%');
    }

So, apparently, because given-name is not a valid column name, the instance of LOWER(c.given-name) was interpreted as LOWER(c.given) - name: the hyphen was parsed as a subtraction operator. name was of course null, so it presumably had no effect, leaving c.given as the name of the field to search, which of course doesn't exist, generating the error I saw.
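The same parsing trap can be sketched in JavaScript (an analogy, not the original PHP/Zend code; the names here are made up):

```javascript
// A hyphen inside an identifier-like expression is read as subtraction,
// just as it was in the SQL above.
function demo() {
    var record = { given: 10, 'given-name': 'Anne' };
    var name = 4;
    // record.given-name parses as (record.given) - (name), not a property lookup
    var parsedAsSubtraction = record.given-name;
    // the hyphenated property must be accessed with bracket notation
    var properLookup = record['given-name'];
    return { parsedAsSubtraction: parsedAsSubtraction, properLookup: properLookup };
}
```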

20/10/11

Permalink 10:52:56 am, by sarneil, 216 words, 400 views   English (CA)
Categories: Activity Log; Mins. worked: 120

clean up year display in x-axis of flot graph

Previously, if there were more than 20 years in the range of returned data, the years appearing along the x-axis in the flot graph overwrote each other.
I changed the init-flot.js file by replacing this:

  xaxis: { 
    tickSize: 1,
    tickFormatter: function(val, axis) {
      if (!yearTotals[val]) {
        return val;
      } else {
        return val + '<br />' + yearTotals[val];
      }
    }
  }

with this:

  xaxis: { 
    ticks: 20,
    tickDecimals: 0,
    tickFormatter: function(val, axis) {
      return '&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;' + val;
    }
  }

The ticks option limits the number of explicitly labelled years to a maximum of 20, which takes care of the overlapping year display on the x-axis when there are lots of years in the chart.
The tickDecimals option controls whether fractions of years get ticks on the x-axis. For small year ranges, if this is not set to 0, you get values like 1744.5 appearing, which is nonsensical for this dataset.
Since not all years get a tick on the x-axis, it no longer made sense to display the total hits, so I've removed that from the tickFormatter.
I prefer the year label to be left-justified against the left side of the column in the graph, rather than centred on it, so I added those non-breaking spaces to push the date to the left.

23/09/11

Permalink 02:25:24 pm, by sarneil, 76 words, 304 views   English (CA)
Categories: Activity Log; Mins. worked: 240

getting familiar with code

Spent a few hours getting familiar with Jamie's code - in particular the jQuery stuff. Made some trivial changes to the pagination output. Still need to look more closely at the data model and at how the chart view / table view switching is done. Also will need to see if there's some way to make the labels on the x-axis easier to read, as they are currently overlapping for data sets with lots of years in them.

31/08/11

Permalink 01:00:58 pm, by jamie, 992 words, 365 views   English (CA)
Categories: Notes; Mins. worked: 0

Notes upon my departure

Original website (i.e. Stew's version): http://web.uvic.ca/~lang02/bailey/

New development website (note: requires Netlink login): http://web.uvic.ca/~lang02/bailey_v2/

XML schema: http://web.uvic.ca/~lang02/bailey/schema/bailey_trialfile_proofing.rng

The Data

The data has gone through a few schema changes, all documented on the blog. See, in chronological order (i.e. earliest first):

For each set of data that SD sends, I do the following:

  1. Make a new folder in BaileyCapitalTrials/data/ named after the year range for the data, e.g.: 1730-39

  2. Put SD's XLS file in that folder

  3. Follow the import steps to convert the XLS to XML, saving all of the step_*.xml files in the newly created folder: http://hcmc.uvic.ca/blogs/index.php?blog=36&p=8283&more=1&c=1&tb=1&pb=1

  4. Make a copy of step_3.xml and save it as (using 1730-39 as an example) 1730_1739_for_proofing.xml

  5. Send that file to SD for proofing

  6. Post on the blog that the data was received, processed, and sent for proofing

Once I get the file back from SD, I do the following:

  1. Save as 1730_1739_proofed_by_simon.xml in the proper folder (see above)

  2. Validate the file in Oxygen and fix any errors. There shouldn't be any at this point, but SD lets the odd typo slip through the cracks.

  3. Make a copy, name it 1730-1739.xml and put it in
    BaileyCapitalTrials/data/finished_data/

  4. Once the file is 100% valid, import into the website: http://web.uvic.ca/~lang02/bailey_v2/import/index.php

  5. Post on the blog that the data was validated

Validation

As noted at the top of this document, the schema file is at http://web.uvic.ca/~lang02/bailey/schema/bailey_trialfile_proofing.rng . SD validates against this file so it always needs to be current.

The Website

The development website ( http://web.uvic.ca/~lang02/bailey_v2/ ) is a feature-complete, and mostly design-complete (I went for a muted look), HTML5-validated website. The site employs jQuery-driven JavaScript to handle AJAX requests.

On the search form, the jQuery plugin bsmSelect is used to make multiple select fields more user-friendly. The plugin homepage is here: http://plugins.jquery.com/project/bsmSelect , and the blog post is here: http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7864&more=1&c=1&tb=1&pb=1

The site search – both the form and the results – is handled by the Search suite of classes located at includes/classes (Search.php and the Search subdirectory). The Search_Form class, located at Search/Form.php, is responsible for displaying all of the search fields and populating them with the proper values. All of the classes are fully documented.

The site uses the Zend Framework's DB library for database interaction. The classes are located at includes/classes/Zend . All of the classes used in the site follow the PEAR class naming convention for easy autoloading. The PEAR convention dictates the following:

  • Class names should begin with an uppercase letter
  • Each file should only contain one class
  • Each subdirectory that a class is in should be represented in the name, with single underscores separating each level of the hierarchy. For example, if a class lives in Search/Form.php, its class name would be Search_Form . Similarly, if a class lives in My/Name/Is/Bob.php, its class name would be My_Name_Is_Bob . This convention has two primary benefits:

    • Locating a class file is easy (just need to know the name of the class and the base class include directory)
    • Autoloading classes is simple (the autoloader simply needs to replace underscores with directory separators)
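The mapping described above can be sketched in one line (a JavaScript illustration; the site itself does this in PHP, and classNameToPath is a hypothetical name):

```javascript
// PEAR-style class-name-to-file mapping: each underscore in the class
// name becomes a directory separator, and '.php' is appended.
function classNameToPath(className, baseDir) {
    return (baseDir || '') + className.split('_').join('/') + '.php';
}
```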

The search results pages – both the results table and a single record – use the Model_ series of classes to display information, as well as either Search_Result (for the results summary) or Search_Single (for a detailed record view). Each row of a table (trials, trial_files, etc.) is represented by a class, which allows for easy placement of formatting functions and the like.

The Chart / Data Visualization

The chart to visualize results uses Flot, a jQuery-based plotting tool:

http://code.google.com/p/flot/

To accomplish the 'stacking' of the results, I used the stacking plugin, which is included with Flot:

http://people.iola.dk/olau/flot/examples/stacking.html

The Flot chart gets its information from a dynamic JSON dataset, which is based on a submitted search form. When a search is submitted, it's stored in the user's session for easy re-use on either the search page (to modify results) or the chart page. This is also how the “passing” of data from the chart to the results table (and vice versa) is accomplished – both just read and parse any search information in the session. The results themselves are not saved in the session, just the search query parameters.
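The pattern can be sketched like this (a JavaScript illustration with a plain object standing in for the PHP session; makeSession is a hypothetical name):

```javascript
// Store only the search query parameters in the session, never the
// results; the chart page and results page each re-run the search
// from the same saved parameters.
function makeSession() {
    var store = {};  // stands in for the server-side session
    return {
        saveSearch: function (params) { store.search = JSON.stringify(params); },
        loadSearch: function () { return store.search ? JSON.parse(store.search) : null; }
    };
}
```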

The Flot initialization code is in js/init-flot.js - it's almost all 'stock' and taken from the examples, with the minor exception of a modification I had to make to display the totals for each year under the year label on the x-axis of the chart.

I'm overall very happy with Flot, as is SD. It's fairly simple, decently fast, and, unlike some alternatives, still under active development.

25/08/11

Permalink 10:23:56 am, by jamie, 291 words, 390 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Meeting with SD; schema modifications

[SA 120116 made changes affecting what's talked about in this post, see http://hcmc.uvic.ca/blogs/index.php?blog=36&p=9010&more=1&c=1&tb=1&pb=1]

Met with SD yesterday primarily to discuss two things: how to link together criminals from different trials that are the same person, and how to link trials in different trial_files to each other (i.e. for a bunch of criminals tried for the same crime). We determined that the simplest, quickest, and therefore most prudent way to do this is to use an "id" attribute to link together the elements as required. So, if two <criminal> elements both describe the same person, then each <criminal> element in question should look like this:


<criminal id="myuniqueidentifier">
(stuff)
</criminal>

where "myuniqueidentifier" can be any string as long as it's the same for each criminal that should be treated as the same person. <criminal> elements that don't correspond to any other <criminal> element do not need an id attribute.

Similarly, for <trial> elements that refer to the same trial (but a different criminal):


<trial rel="5">
(stuff)
</trial>

In the case of trials, 'rel' does not have to be absolutely unique, just unique within one court session (i.e. the same trial_file_start_date and trial_file_end_date values). SD requested this to make it easier to keep track of the numbers.

I will be able to alter the import script to account for these new attributes, but I'm not sure if SD will have a chance to go over all of the data before my contract is up.
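As a sketch of how an import script might use these attributes (assumed data shape; groupTrials is hypothetical, not the real importer):

```javascript
// Group trial records into shared trials using the session dates plus
// 'rel', since 'rel' is only meaningful within one court session.
function groupTrials(trials) {
    var groups = {};
    trials.forEach(function (t) {
        // trials without a rel stand alone, so each gets its own key
        var key = t.rel == null
            ? 'solo:' + t.id
            : t.start + '|' + t.end + '|' + t.rel;
        (groups[key] = groups[key] || []).push(t);
    });
    return groups;
}
```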

23/08/11

Permalink 09:15:02 am, by jamie, 16 words, 137 views   English (CA)
Categories: Activity Log; Mins. worked: 20

1930-39 data received, processed, sent for proofing

Received and processed the 1930-39 data sent by SD yesterday, and sent back for final proofing.

18/08/11

Permalink 01:51:17 pm, by jamie, 222 words, 219 views   English (CA)
Categories: Activity Log; Mins. worked: 180

Losing HTTP sessions when using Apache, Tomcat, mod_rewrite, and mod_jk

In my previous post about making francotoile.uvic.ca work, I outlined writing a VirtualHost configuration that could successfully rewrite a pear.hcmc.uvic.ca Tomcat URL to francotoile.uvic.ca. It seems, however, that with this configuration, HTTP session information isn't transmitted between Tomcat and Apache. In other words, when using the URL rewritten with the VirtualHost, a new session is created every page load and does not persist. This does not happen when using the original URL (the pear.hcmc.uvic.ca one).

I had hoped to finish this post off with "to fix this, just add...", but, sadly, I've yet to find a solution. Martin has pointed out that both http://mariage.uvic.ca/ and http://bcgenesis.uvic.ca/ , which also use mod_rewrite and mod_jk, handle sessions correctly.

It turns out that both of those sites use cookies, not sessions. Since a cookie is a natural choice in this case anyway - a language preference for the site - I've changed the code in my i18n eXist plugin to use cookies instead of sessions, which does the trick. Martin has suggested using both in tandem, so that the system tries to use sessions but falls back on a cookie if sessions are unavailable. I will try to implement this enhancement before my time is up.

15/08/11

Permalink 04:25:27 pm, by jamie, 64 words, 127 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Updated schema; wrote XSLT to transform old dates

Updated the schema to use the new standard date format (YYYY-MM-DD) in all date fields, namely trial_file_start_date, trial_file_end_date, respite_rr_date, and outcome_date. I wrote an XSLT transformation to convert the dates in the otherwise-valid data files to the new format. I also modified the PHP import script so that it no longer does any date conversion.
Permalink 12:48:22 pm, by jamie, 61 words, 157 views   English (CA)
Categories: Notes; Mins. worked: 0

Change to schema - date format

Because unix7.uvic.ca can't handle dates earlier than Dec 14, 1901, the proprietary date format in the XML files - currently 1759Dec31 - will need to be changed to a standard YYYY-MM-DD format, i.e. 1759-12-31. The XML -> MySQL import script can't do any date processing, so the date needs to be in the proper format in the original data.
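A sketch of the conversion (in JavaScript rather than the XSLT actually used; it assumes the proprietary format always uses three-letter English month abbreviations like Dec):

```javascript
// Convert a proprietary date like '1759Dec31' to ISO 'YYYY-MM-DD'.
function convertDate(oldDate) {
    var months = { Jan: '01', Feb: '02', Mar: '03', Apr: '04', May: '05', Jun: '06',
                   Jul: '07', Aug: '08', Sep: '09', Oct: '10', Nov: '11', Dec: '12' };
    var m = /^(\d{4})([A-Za-z]{3})(\d{1,2})$/.exec(oldDate);
    if (!m || !months[m[2]]) {
        return null;  // not in the expected proprietary format
    }
    var day = m[3].length === 1 ? '0' + m[3] : m[3];
    return m[1] + '-' + months[m[2]] + '-' + day;
}
```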

Permalink 12:04:28 pm, by jamie, 65 words, 176 views   English (CA)
Categories: Notes; Mins. worked: 0

unix.uvic.ca can't generate timestamps before Dec 14, 1901

The unix.uvic.ca machines - at least, the ones I've used - have a 32-bit architecture. Because of this, PHP's timestamp handling is limited to the range Fri, 13 Dec 1901 20:45:54 GMT to Tue, 19 Jan 2038 03:14:07 GMT (see http://uk3.php.net/manual/en/function.date.php ). A 64-bit architecture allows for an effectively unlimited timestamp range.

This limitation poses a problem because the Bailey data importer deals exclusively with dates before 1901.
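Those bounds can be checked from JavaScript, whose Date type stores 64-bit millisecond values and so has no such limit (note that -2^31 seconds works out to 20:45:52 GMT; 20:45:54 is the figure the PHP manual quotes):

```javascript
// The signed 32-bit limits on a Unix timestamp, in seconds.
var MIN_32BIT = -Math.pow(2, 31);     // -2147483648
var MAX_32BIT = Math.pow(2, 31) - 1;  //  2147483647
// JavaScript dates count milliseconds, so multiply by 1000.
var earliest = new Date(MIN_32BIT * 1000).toISOString();  // 1901-12-13T20:45:52.000Z
var latest = new Date(MAX_32BIT * 1000).toISOString();    // 2038-01-19T03:14:07.000Z
```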

12/08/11

Permalink 09:49:15 am, by jamie, 33 words, 146 views   English (CA)
Categories: Activity Log; Mins. worked: 5

Added LDAP Netlink auth to the Bailey dev site

At the request of SD, added a limited-access Netlink login to the Bailey dev site ( http://web.uvic.ca/~lang02/bailey_v2/ ). The allowed users are the usual suspects at HCMC and SD.
Permalink 09:41:28 am, by jamie, 27 words, 142 views   English (CA)
Categories: Activity Log; Mins. worked: 20

1740-49 data received, converted, sent for proofing

Received the 1740-49 XLS file from SD last night. Converted it to XML this morning, fixed some validation errors, and sent back to SD for final proofing.

05/08/11

Permalink 04:09:44 pm, by jamie, 31 words, 392 views   English (CA)
Categories: Notes; Mins. worked: 0

How to toggle a boolean or tinyint field in MySQL

If you have a boolean or tinyint(1) field in MySQL and you need to toggle the value without knowing in advance what the value currently is:

UPDATE table SET field = !field

04/08/11

Permalink 10:05:38 am, by jamie, 21 words, 94 views   English (CA)
Categories: Activity Log; Mins. worked: 5

1750-59 data received and imported into MySQL

SD returned the 1750-59 XML file with 100% validation, which I promptly imported into the MySQL database using the convenient import script.

03/08/11

Permalink 12:11:10 pm, by jamie, 20 words, 87 views   English (CA)
Categories: Activity Log; Mins. worked: 30

1750-59 data received, processed, sent for proofing

Received the 1750-59 data from SD in Excel format. Converted it to XML and returned it to SD for proofing.
Permalink 11:51:39 am, by jamie, 155 words, 96 views   English (CA)
Categories: Activity Log; Mins. worked: 180

XML -> MySQL import script complete

I've finished writing and testing a PHP script to import an XML file (validated against http://web.uvic.ca/~lang02/bailey/schema/bailey_trialfile_proofing.rng ) into the MySQL database. The script uses SimpleXMLElement to load and loop through the file. The database handling is done by Zend_Db, since that's already being used in the main Bailey website.

I wrote the importer in such a way that new foreign keys (e.g. crime_normalizeds, judges, etc.) don't need to be entered into the MySQL database beforehand. Since the XML is validated against the schema, I can be reasonably certain that all of the values in the data are intentional. So, if the importer comes across a value that's not already in the MySQL database (say, a new judge), it simply inserts that value and continues with the import.

The importer isn't yet online, but once it is I'll publish its URL to this blog.
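The get-or-create behaviour described above can be sketched like this (a JavaScript illustration, not the actual Zend_Db code; makeLookup is a hypothetical name):

```javascript
// Insert an unknown foreign-key value (e.g. a new judge) on first
// sight and reuse its id afterwards, so lookup tables never need to
// be pre-populated before an import.
function makeLookup() {
    var table = {};  // stands in for the MySQL lookup table
    var nextId = 1;
    return function getOrCreate(value) {
        if (!(value in table)) {
            table[value] = nextId++;  // "insert that value and continue"
        }
        return table[value];
    };
}
```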

29/07/11

Permalink 09:59:50 am, by jamie, 166 words, 234 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Removing the case element from the schema

While working on the import process to bring the XML data into the website database, it occurred to me that, after all of the changes to the schema, the <case> element doesn't serve much of a function anymore. While it used to group together multiple trial files, all of the useful grouping information (namely the trial dates) has been moved to the <trial_file> elements themselves. So, we don't need the extra level in the hierarchy.

For that matter, we also don't need the <regular_cases> element that wraps up the <case> elements, since it's also an extra level of hierarchy. The schema really just needs to be: data -> one or more trial files

Thus, I've removed <regular_cases> and <case> from the schema and modified the XSLT transformation scripts accordingly. I also modified the data that's already been processed.

These changes will make the PHP script that converts the XML to MySQL simpler.

26/07/11

Permalink 09:12:44 am, by jamie, 8 words, 195 views   English (CA)
Categories: Notes; Mins. worked: 0

1760-69 proofed

Received the proofed 1760-69 data from SD, 100% valid.

25/07/11

Permalink 03:48:18 pm, by jamie, 20 words, 82 views   English (CA)
Categories: Activity Log; Mins. worked: 25

1760-69 data received, processed, sent for proofing

Received the 1760-69 data from SD in Excel format. Converted it to XML and returned it to SD for proofing.
Permalink 03:19:16 pm, by jamie, 191 words, 287 views   English (CA)
Categories: Notes; Mins. worked: 0

eXist rebuild cancelled: PHP version reinstated with planned XML import

I've decided to scrap the eXist rebuild for a few reasons:

The XML schema was not meant as a final destination for the data. SA created it as an ad hoc intermediary step between SD's raw data and the MySQL database. So, the schema doesn't lend itself well to being the backbone of an eXist architecture, particularly when it comes to searching and generating human-readable values. The structure of the data is highly relational and is a good fit for SQL, which, of course, was SA's original intention.

That said, my original motivation for the eXist rebuild still stands. The current method of translating the XML to SQL involves using XSLT transformations to generate SQL statements. However, every time the schema is updated with new structure or values - which is happening a lot - the XSLT stylesheets need to be updated as well. So, to get around these difficulties, I'm going to write a PHP script that will convert the XML to MySQL, likely using SimpleXML. It won't be the fastest script I've ever written, but since it will be a one-use-per-dataset kind of thing, speed isn't a big issue.

20/07/11

Permalink 12:40:55 pm, by jamie, 110 words, 180 views   English (CA)
Categories: Notes; Mins. worked: 0

eXist rebuild planned

After some consultation with Greg I've decided to scrap the PHP/MySQL version of Bailey that I built and write the site in eXist. After building FrancoToile and the Lansdowne Lecture site in eXist I'm comfortable working with it, and this way the Bailey XML data can be used directly without being shoehorned into a MySQL database. The allowed values for crimes, outcomes, judges, etc. are constantly changing in the schema, so maintaining those keys in MySQL would be a long-term pain. Using eXist cuts out the XML-to-MySQL middle man. I don't expect the basic functionality of the site to take long - it'll likely be done next week sometime.

Permalink 09:59:40 am, by jamie, 14 words, 65 views   English (CA)
Categories: Activity Log; Mins. worked: 10

New crime_normalized

Added a new crime_normalized "StealinaChurch" to the schema at the request of SD.

13/07/11

Permalink 11:10:47 am, by jamie, 28 words, 1600 views   English (CA)
Categories: Notes; Mins. worked: 0

Beginning restructuring of SQL schema

Because of the laundry list of changes to the XML structure, I've begun a wholesale restructuring of the MySQL database schema. There are quite a few core changes.
Permalink 09:56:32 am, by jamie, 12 words, 1596 views   English (CA)
Categories: Notes; Mins. worked: 0

1770-79 data proofed and received

SD sent back the 1770-79 XML file this morning, which is 100% valid.

12/07/11

Permalink 02:52:50 pm, by jamie, 30 words, 1511 views   English (CA)
Categories: Activity Log; Mins. worked: 60

1770-79 data sent for proofing

Sent the latest data set, 1770-79, to SD for proofing. Quick import this time and only 25 errors, 23 of which are mercy appeal errors which SD usually saves for post-import anyway.
Permalink 01:19:35 pm, by jamie, 51 words, 1602 views   English (CA)
Categories: Notes; Mins. worked: 0

1770-79 data received; change in spreadsheet format

Received the 1770-79 XLS spreadsheet from SD. Minor change to the format as he's added "Outcome Durn - Yrs" and "Outcome Durn - Other" to handle outcome duration values (see previous blog post: http://hcmc.uvic.ca/blogs/index.php?blog=36&p=8333&more=1&c=1&tb=1&pb=1 ).

24/06/11

Permalink 09:21:09 am, by jamie, 43 words, 111 views   English (CA)
Categories: Notes; Mins. worked: 0

1780-84 data proofed and completed

Received the proofed 1780-84 data file from SD. After adding "HighTreason" as a crime_normalized, the file has 100% validation.

As of today, these are the data sets that are complete (i.e. use the latest schema and are 100% valid):

  • 1780-84
  • 1785-89
  • 1790-99
  • 1800-09

22/06/11

Permalink 01:52:15 pm, by jamie, 26 words, 80 views   English (CA)
Categories: Activity Log; Mins. worked: 90

1780-84 data sent for proofing

Sent the XML file for 1780-84 to SD for proofing. There are a bunch of validation errors stemming from data ambiguity that he'll have to fix.

20/06/11

Permalink 02:16:26 pm, by jamie, 279 words, 1625 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Another schema change

Another schema change for the data: we're going to utilize the hitherto ignored outcome_duration element for outcomes. Until this point, normalized outcomes (i.e. the outcome_normalized element) with years attached, such as "transported for 1 year", "hulks for five years", etc. were simply listed as T1, Hulks5, etc. But, because the list of T# and Hulks# normalized outcomes is growing, SD and I have decided to move the numbers from the outcome_normalized element into the outcome_duration element where they belong.

Furthermore, to allow for flexibility when entering the duration, the outcome_duration element has been expanded with a number of child elements. These could also be attributes, but I've chosen to use elements since everything else in the schema uses elements rather than attributes. The modified element looks like this:


<outcome_duration>
<years>(number of years, or blank for none)</years>
<months>(number of months, or blank for none)</months>
<weeks>(number of weeks, or blank for none)</weeks>
<days>(number of days, or blank for none)</days>
<other>(Life, Remainder, or blank)</other>
</outcome_duration>

I also changed the acceptable values of outcome_normalized so that, instead of accepting T#, Hulks#, and GB#, "Transport", "Hulks", and "GoodBehavior" are the new accepted values. Also, "SelfTL" and "SelfTR" have become "SelfTransport" with either "Life" or "Remainder" in the outcome_duration/other.

The values of years, months, weeks, and days must be an integer (or blank), while the value of other must correspond to a value in the setofNormalizedDurationOthers list in the RNG schema.
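A hypothetical validator for the new element might look like this (a JavaScript sketch; it assumes Life and Remainder are the allowed setofNormalizedDurationOthers values, per the example above):

```javascript
// Check an outcome_duration record: years/months/weeks/days must be
// blank or an integer, and 'other' must be blank, Life, or Remainder.
function validOutcomeDuration(d) {
    var numericFields = ['years', 'months', 'weeks', 'days'];
    var numbersOk = numericFields.every(function (field) {
        var v = d[field] || '';
        return v === '' || /^\d+$/.test(v);
    });
    var allowedOthers = ['', 'Life', 'Remainder'];
    return numbersOk && allowedOthers.indexOf(d.other || '') !== -1;
}
```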

15/06/11

Permalink 11:36:36 am, by jamie, 258 words, 108 views   English (CA)
Categories: Activity Log; Mins. worked: 180

Flot chart - axis totals added to labels

Spent quite a bit of time figuring out how to add totals to each x-axis label on the Flot chart - Flot calls these labels "ticks" - to display the total number of items for each year (along with the year number, which was already being displayed). Since I'm using the Flot stacks plugin to display multiple sets of data in the chart, it was a bit of a complex process.

The most time-consuming part of the process was finding a "Flot-y" way to do it. The API documentation is OK, but a bit too informal for my tastes and glosses over a lot of things. After a lot of testing, debugging, and stepping through functions via the Chrome developer suite, I couldn't find anything suitable that could be achieved by native Flot functions dealing with the ticks. There are plenty of ways to format the data already in the tick (i.e. the year number in this case), but not to add data to the tick. So, I declared a global variable var yearTotals = [], and, in the success function that's triggered after the AJAX call that grabs the JSON data for the chart, I added a for loop to get the totals for each year and add them to the array:


    function onDataReceived(series) {
        $.each(series, function(index, value) {
            var entry = new Array();
            entry['label'] = index;
            entry['data'] = value.data;
            chartData.push(entry);

            for (var i = 0; i < value.data.length; i++) {
                var row = value.data[i];
                if (!yearTotals[row[0]]) {
                    yearTotals[row[0]] = row[1];
                } else {
                    yearTotals[row[0]] += row[1];
                }
            }
        });

        plotWithOptions(chartData);
    }

07/06/11

Permalink 04:30:10 pm, by jamie, 35 words, 86 views   English (CA)
Categories: Notes; Mins. worked: 0

1785-89 data sent for proofing

Sent SD the 1785-89 XML file that was converted from the Excel spreadsheet he sent me the other day. He will check all the data and address the few validation errors that I didn't fix.
Permalink 10:29:18 am, by jamie, 131 words, 259 views   English (CA)
Categories: Notes; Mins. worked: 0

Raw -> XML procedure for new data spreadsheet format

Now that SD is putting his data into Excel spreadsheets, the import procedure is (thankfully!) a lot simpler than the first process:

Step 1:

  1. Use Oxygen to convert Excel spreadsheet to XML (File -> Import -> MS Excel File); ensure that "First row contains field names" is checked

  2. Save as step_1.xml

Step 2:

  1. Transform step_1.xml with transformations/1_to_2.xsl

  2. Save as step_2.xml

Step 3:

  1. Transform step_2.xml with transformations/2_to_3.xsl

  2. Add RNG schema line to the top of the file:
    <?oxygen RNGSchema="http://web.uvic.ca/~lang02/bailey/schema/bailey_trialfile_proofing.rng" type="xml"?>

  3. Validate and correct any errors. Given the variable nature of the data, 100% validation probably won't be possible at this point

  4. Save as step_3.xml

  5. Send the file to SD for proofing/final validating
Permalink 10:09:02 am, by jamie, 14 words, 93 views   English (CA)
Categories: Notes; Mins. worked: 0

Added new values to proofing file

Judges:
Adair-J
Gould-H
Hotham-N
Nares-G
Willes-J
Willes-E

Normalized outcome:
T3

Respite delays:
ThreeDays
TenDays

06/06/11

Permalink 01:26:26 pm, by jamie, 72 words, 287 views   English (CA)
Categories: Notes; Mins. worked: 0

xsl:group-by with multiple elements on the same level

If you want to loop through a series of XML elements in an XSLT stylesheet and group by more than one element (for example, the combination of <first_name> and <last_name> elements), use the concat function to create a 'virtual key':


<xsl:for-each-group select="row" group-by="concat(First_Name, ' ', Last_Name)">
    <xsl:variable name="theKey" select="current-grouping-key()"/>
</xsl:for-each-group>

Permalink 09:44:25 am, by jamie, 61 words, 90 views   English (CA)
Categories: Notes; Mins. worked: 0

New data format: Excel spreadsheet

Received 1785-89 data from SD, which he put into an Excel spreadsheet. This will be the standard way for him to give us data from now on. I will be writing a new raw-to-XML procedure to handle the import process shortly. Hopefully this method will be less painful and time-consuming than the previous one, which began with an MS Word doc.

30/05/11

Permalink 10:36:45 am, by jamie, 91 words, 303 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Meeting with SD

Met with SD today to discuss XML structure, namely the possible elimination of the rec_rep element. We came to the logical conclusion that, since the rec_rep element itself no longer has any useful information (it's all been moved to child elements), the element has become redundant and can be removed. So, I updated the proofing file to reflect the structural change and wrote an XSLT procedure to transform the files.

See: http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7853&more=1&c=1&tb=1&pb=1

25/05/11

Permalink 08:57:29 am, by jamie, 19 words, 82 views   English (CA)
Categories: Notes; Mins. worked: 0

1790-9 data re-proofed

Received the 1790-9 data XML file from SD yesterday, freshly re-proofed and ready for re-import into the development website.

19/05/11

Permalink 01:44:40 pm, by jamie, 60 words, 93 views   English (CA)
Categories: Notes; Mins. worked: 0

SD returns; 1800s data being re-proofed

Heard from SD the other day, who is back from Britain. He's going to re-proof the 1800s data. His hopeful plan is to have all data from 1714-1810 in order by the end of the summer. As long as the new data looks roughly the same as the data we've already received, this shouldn't be a problem on our end.

08/04/11

Permalink 08:38:21 am, by jamie, 37 words, 100 views   English (CA)
Categories: Notes; Mins. worked: 0

SD away for six weeks

SD is in Britain from April 2 - May 16. Since the ball is currently in SD's court - he's double checking the 1790s and 1800s data - progress will be on hold until the latter half of May.

21/03/11

Permalink 03:03:37 pm, by jamie, 3403 words, 143 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Raw to XML procedure

Finished writing the procedure, started by SA, for how to convert a raw Word data file from SD into a valid XML file. Note that, since SD will be using an Excel spreadsheet for all data beginning from the year 1810, this procedure is only valid for pre-1810 data (of which there will be plenty). I'll have to write a second procedure for converting SD's Excel files to XML once he's finished the first set of data.

The procedure is saved in a text file which will be passed back to SA upon his return. Here's the procedure verbatim:


=============================================
1) Clean text data
=============================================
save raw text as 1_cleanText.txt

replace all multiple tab with single tab
GREP search for
\t+
replace with
\t

normalize indicator of multiple trial_files in one case
GREP search for
^\s*?&\s*?&.*$
replace with
ADDITIONAL_TRIAL_FILE_IN_CASE

turn remaining ampersand characters into entities
search for
&
replace with
&amp;

put four tabs on each trial line
put space in front of all lines with four tabs
GREP search for
^(.*?\t.*?\t.*?\t.*?\t.*?)$
replace with
 \1 [space character preceding the slash 1]

add tab to all lines with only three tabs
GREP search for
^(\w.*?\t.*?\t.*?\t.*?)$
replace with
 \1\t [space character preceding the slash 1]

add two tabs to all lines with only two tabs
GREP search for
^(\w.*?\t.*?\t.*?)$
replace with
 \1\t\t [space character preceding the slash 1]
 
manually add three tabs to all lines with only one tab
GREP manually search for
^(\w.*?\t.*?)$
replace with
 \1\t\t\t [space character preceding the slash 1]

put tab before instances of TRIAL that don't have one
GREP search for
^TRIAL
replace with
\tTRIAL

put TRIAL onto previous line
GREP search for
(.*?\t.*?\t.*?\t.*?)\r(\tTRIAL.*?)
replace with
\1\2

remove column header lines
GREP search for:
Name\s*?Crime\s*?Respite\s*?Pardoned\s*?Executed\s*?\r
replace with:
[nothing]

GREP search for:
^\t*?(.*?Recorder's Report.*?)$
replace with
RECORDER_REPORT_HEAD \1

normalize smart quotes
search for
’
replace with
'
search for
‘
replace with
'
search for
“
replace with
"
search for
”
replace with
"

eliminate empty lines
GREP search for
\r\s*?\r
replace with
\r
repeat until no instances

put extraneous lines as comments on end of previous line
GREP search for
(.*?\t.*?\t.*?\t.*?\t.*?)\r\t(.*?)
replace with
\1\t\t\2
repeat until no hits

manually edit any lines still remaining
GREP search for
^\t
ensure there are 6 tabs in preceding line and copy extra line after sixth tab

ensure there are no more than 6 tabs in any line
GREP search for
^(.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?)\t
replace with
\1
repeat until no hits

remove leading spaces at start of line
GREP search for
^ [space character]
replace with
[nothing]

remove trailing spaces after tab
GREP search for 
\t +[space character before +]
replace with
\t

remove leading spaces before tab
GREP search for 
 +\t [space character before +]
replace with
\t

Manually check in spreadsheet for any lines that should obviously be included as comments in the preceding line

File should now have no empty lines
Each line should be one of
- start with RECORDER_REPORT_HEAD
- consist of ADDITIONAL_TRIAL_FILE_IN_CASE
- contain a trial_file record with 7 tab-delimited fields:
CriminalName TAB Crime TAB Respite TAB Outcome TAB ExecutionDate TAB TrialRef TAB Extra

sample extract
RECORDER_REPORT_HEAD 	4-7, 9-10 DEC 1799	– Recorder's Report -> F, 31 Jan 1800
Thomas Scott (M1)	Robbery	Pleasure (RR, 16 July)	Free (18 July 1800) [HO 13/13, pp.6-7]		TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]
Bartholomew Foley (M1)	St in Dwelling	Pleasure (RR)	TL (15 Feb 1800) [HO 13/12, pp.394-5]		TRIAL -- OBSP 1799-1800, pp.9-10 [0.4]
Peter Chapman als Harry Read	Burglary	To Die (RR)	-----	W, 26 Feb 1800		als Harry Kirk (L) (19)TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]
ADDITIONAL_TRIAL_FILE_IN_CASE
John Hall (L) (33)	Burglary	To Die (RR)	-----	W, 26 Feb 1800	TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]
ADDITIONAL_TRIAL_FILE_IN_CASE
Joseph Jones (L)	Burglary	To Die (RR)	-----	W, 26 Feb 1800	TRIAL -- OBSP 1799-1800, pp.88-92 [4.6 but PG]
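For anyone scripting rather than hand-running these passes, here's a minimal Python sketch of a few of the step-1 cleanups (collapsing tab runs, the ADDITIONAL_TRIAL_FILE_IN_CASE marker, ampersand entities, smart quotes, empty lines). It assumes Unix newlines rather than the \r used in the GREP editor, and the function name is mine; the full pass list above remains authoritative.

```python
import re

def clean_step1(text: str) -> str:
    """Apply a few of the step-1 cleanup passes to raw tab-delimited text."""
    # replace all multiple tabs with a single tab
    text = re.sub(r'\t+', '\t', text)
    # normalize the indicator of multiple trial_files in one case
    text = re.sub(r'^\s*?&\s*?&.*$', 'ADDITIONAL_TRIAL_FILE_IN_CASE',
                  text, flags=re.MULTILINE)
    # turn remaining ampersand characters into entities
    text = text.replace('&', '&amp;')
    # normalize smart quotes
    for smart, plain in (('\u2019', "'"), ('\u2018', "'"),
                         ('\u201c', '"'), ('\u201d', '"')):
        text = text.replace(smart, plain)
    # eliminate empty lines, repeating until no instances remain
    while re.search(r'\n\s*?\n', text):
        text = re.sub(r'\n\s*?\n', '\n', text)
    return text
```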

=============================================
2) add rec_rep, rec_rep_head, case, trial_file elements
=============================================

save file as file 2_mainElements.txt

create rec_rep containers
search for
^RECORDER_REPORT_HEAD(.*?)$
replace with
</rec_rep>\r<rec_rep>\r<rec_rep_head>\1</rec_rep_head>

delete </rec_rep> line at top of document
add <regular_cases> line at top of document
add </rec_rep> line at end of document
add </regular_cases> line at bottom of document

create case containers
(each data line, apart from the ADDITIONAL_TRIAL_FILE_IN_CASE markers, should first be wrapped in <trial_file> ... </trial_file> tags)
GREP search for
<trial_file>(.*?)</trial_file>
replace with
<case>\r<trial_file>\1</trial_file>\r</case>

remove case elements for multiple trial_files in one case
GREP search for
</case>\rADDITIONAL_TRIAL_FILE_IN_CASE\r<case>\r
replace with
[nothing]

Each line of file should now be one of:
<rec_rep>
<rec_rep_head> . . . </rec_rep_head>
<case>
<trial_file> . . . </trial_file>
</case>
</rec_rep>

sample extracts:
<rec_rep>
<rec_rep_head>	4-7, 9-10 DEC 1799	– Recorder's Report -> F, 31 Jan 1800</rec_rep_head>
<case>
<trial_file>Thomas Scott (M1)	Robbery	Pleasure (RR, 16 July)	Free (18 July 1800) [HO 13/13, pp.6-7]		TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_file>
</case>
<case>
<trial_file>Bartholomew Foley (M1)	St in Dwelling	Pleasure (RR)	TL (15 Feb 1800) [HO 13/12, pp.394-5]		TRIAL -- OBSP 1799-1800, pp.9-10 [0.4]</trial_file>
</case>
<case>
<trial_file>Peter Chapman als Harry Read	Burglary	To Die (RR)	-----	W, 26 Feb 1800		als Harry Kirk (L) (19)TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]</trial_file>
<trial_file>John Hall (L) (33)	Burglary	To Die (RR)	-----	W, 26 Feb 1800	TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]</trial_file>
<trial_file>Joseph Jones (L)	Burglary	To Die (RR)	-----	W, 26 Feb 1800	TRIAL -- OBSP 1799-1800, pp.88-92 [4.6 but PG]</trial_file>
</case>
</rec_rep>
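The rec_rep containering above can be sketched in Python like so (again assuming \n newlines; the helper name is mine):

```python
import re

def wrap_rec_reps(text: str) -> str:
    """Turn RECORDER_REPORT_HEAD lines into rec_rep containers (step 2)."""
    # close the previous rec_rep and open a new one at each report head
    text = re.sub(r'^RECORDER_REPORT_HEAD(.*)$',
                  r'</rec_rep>\n<rec_rep>\n<rec_rep_head>\1</rec_rep_head>',
                  text, flags=re.MULTILINE)
    # delete the stray </rec_rep> now sitting at the top of the document,
    # then add the opening and closing wrapper lines
    text = text.replace('</rec_rep>\n', '', 1)
    return '<regular_cases>\n' + text + '\n</rec_rep>\n</regular_cases>'
```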

=============================================
3) create criminal elements
=============================================
save file as 3_criminal

create criminal element
GREP search for
<trial_file>(.*?)\t
replace with
<trial_file><criminal>\1</criminal>

move jury element outside of criminal element
GREP search for
(<criminal>.*?)(\([A-Z]\d*?\))(.*?</criminal>)
replace with
\1\3\2

create and populate surname and given_name elements
GREP search for
(<criminal>)(.*?) (.*?) [trailing space character]
replace with
\1<surname>\2</surname><given_names>\3</given_names>

create aliases, age and gender elements
search for
</given_names>
replace with
</given_names><aliases></aliases><age></age><gender>1</gender>

populate age element (genders were all assigned 1 in the previous step)
GREP search for
(</age><gender>1</gender>).*?\((\d+?)\).*?(</criminal>)
replace with
\2\1\3

sample:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal>(M1)Robbery	Pleasure (RR, 16 July)	Free (18 July 1800) [HO 13/13, pp.6-7]		TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_file>
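A Python sketch of the step-3 passes applied to one flattened line (the function name is mine; the gender placeholder value follows the samples):

```python
import re

def make_criminal(line: str) -> str:
    """Apply the step-3 passes to one flattened trial_file line."""
    # wrap everything before the first tab in a criminal element
    line = re.sub(r'<trial_file>(.*?)\t',
                  r'<trial_file><criminal>\1</criminal>', line)
    # move the jury marker, e.g. (M1), outside the criminal element
    line = re.sub(r'(<criminal>.*?)(\([A-Z]\d*?\))(.*?</criminal>)',
                  r'\1\3\2', line)
    # split the first two space-separated words into surname / given_names
    # (the first word goes into surname, exactly as in the GREP pass above)
    line = re.sub(r'(<criminal>)(.*?) (.*?) ',
                  r'\1<surname>\2</surname><given_names>\3</given_names>',
                  line)
    # add the empty aliases, age, and gender elements
    line = line.replace(
        '</given_names>',
        '</given_names><aliases></aliases><age></age><gender>1</gender>')
    return line
```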

=============================================
4) create trials elements
=============================================
save file as 4_trials

add trials, trial, judge, jury elements to trials element
search for
</criminal>
replace with
</criminal><trials><trial><judge></judge><jury><jury_type></jury_type><jury_subtype></jury_subtype></jury></trial></trials>

populate jury element
GREP search for
(</jury_type><jury_subtype>)(</jury_subtype></jury></trial></trials>)\(([A-Z])(\d*?)\)
replace with
\3\1\4\2

add crime element inside trial element
search for
</trial>
replace with
<crime><crime_text></crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime></trial>

populate crime element
GREP search for
(</crime_text>.*?</trial>)(.*?)\t
replace with
\2\1

add empty mercy_appeals element
search for
<trial_ref>
replace with
<mercy_appeals></mercy_appeals><trial_ref>

sample
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials>Pleasure (RR, 16 July)	Free (18 July 1800) [HO 13/13, pp.6-7]		</trial_file>

=============================================
5) create respites elements
=============================================
save file as 5_respites

add respites elements
search for
</trials>
replace with
</trials><respites><respite><respite_text></respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites>

populate respite_text element
GREP search for
(</respite_text>.*?</respites>)(.*?)\t
replace with
\2\1

sample
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites>Free (18 July 1800) [HO 13/13, pp.6-7]		</trial_file>

=============================================
6) create outcomes elements
=============================================
save file as 6_outcomes

add outcome, outcome_text, outcome_ref elements
search for
</respites>
replace with
</respites><outcomes><outcome><outcome_text></outcome_text><outcome_ref></outcome_ref></outcome></outcomes>

populate outcome_text element
GREP search for
(</outcome_text><outcome_ref></outcome_ref></outcome></outcomes>)(.*?)\t
replace with
\2\1

populate outcome_ref element
GREP search for
(<outcome_text>.*?)\[(.*?)\](</outcome_text><outcome_ref>)
replace with
\1\3\2

populate outcome_ref for those records not caught by the regexp above
GREP search for
(.*?)\[(HO.*?)\](.*?)(</outcome_ref>)
replace with
\1\3\2\4

add rest of outcome elements
search for
<outcome_ref>
replace with
<outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>

populate outcome_date element
i.e. execution date, taken from 6th tab field of original clean text
not taken from the date provided in the outcome_text

GREP search for
(</outcome_date>.*?</outcomes>)(.*?)\t
replace with
\2\1

sample:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites><outcomes><outcome><outcome_text>Free (18 July 1800) </outcome_text><outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>HO 13/13, pp.6-7</outcome_ref></outcome></outcomes></trial_file>
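The population passes in steps 4-6 all rely on the same trick: capture the empty element's boundaries and the stray text, then reorder the groups in the replacement. A Python sketch of the outcome_ref pass above (the helper name is mine):

```python
import re

def populate_outcome_ref(line: str) -> str:
    """Move a [bracketed] archive reference out of outcome_text and into
    the adjacent empty outcome_ref element by reordering capture groups,
    mirroring the step-6 GREP pass."""
    return re.sub(
        r'(<outcome_text>.*?)\[(.*?)\](</outcome_text><outcome_ref>)',
        r'\1\3\2', line)
```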

=============================================
7) create remaining elements
=============================================
save file as 7_allElements

add remaining elements in trial_file element:
search for
</outcomes>
replace with
</outcomes><trial_file_printed_sources></trial_file_printed_sources><trial_file_other_documents> </trial_file_other_documents><trial_file_notes></trial_file_notes>

put extraneous text into trial_file_notes field
GREP search for
(</trial_file_notes>)\t(.*?)(</trial_file>)
replace with
\2\1\3

find tab delimited snippets of text and put them into the trial_file_notes field
GREP search for
(<trial_file>.*?)\t(.*?)(<.*?)(</trial_file_notes>)(.*?)$
replace with
\1\3\2\4\5

manually search for instances of tab in trial_file lines and correct them
search for
<trial_file>.*?\t
fix manually

sample of trial_file:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites><outcomes><outcome><outcome_text>Free (18 July 1800) </outcome_text><outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>HO 13/13, pp.6-7</outcome_ref></outcome></outcomes><trial_file_printed_sources></trial_file_printed_sources><trial_file_other_documents> </trial_file_other_documents><trial_file_notes>blah blah</trial_file_notes></trial_file>

add elements to rec_rep_head
search for
<rec_rep_head>
replace with
<rec_rep_head><rec_rep_start_date></rec_rep_start_date><rec_rep_end_date></rec_rep_end_date><rec_rep_pub_date></rec_rep_pub_date><rec_rep_notes></rec_rep_notes>

populate start and end date fields
GREP search for
(</rec_rep_start_date><rec_rep_end_date>)(.*?</rec_rep_notes>)\t(.*?)\t
replace with
\3\1\3\2

populate rec_rep_pub_date element
GREP search for
(</rec_rep_pub_date><rec_rep_notes>)(</rec_rep_notes>).*?(\d+? .*? \d{1,4})(.*?)(</rec_rep_head>)
replace with
\3\1\4\2\5

normalize rec_rep_start_date
GREP search for
(<rec_rep_start_date>)(\d+).*?([A-Z]{3}).*?(\d{4})
replace with
\1\4\3\2
add zero to day of month as needed (have to use ZERO placeholder because GREP engine treats \10 as the tenth selected element, rather than the first followed by a literal zero)
GREP search for
([A-Za-z])(\d</rec_rep_start_date>)
replace with
\1!ZERO!\2
search for
!ZERO!
replace with
0
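As a cross-check, here's a Python sketch of the start-date normalization. Python's \g&lt;N&gt; replacement syntax sidesteps the !ZERO! placeholder trick, since \g&lt;1&gt;0 cannot be misread as group 10 (the pub_date pass works the same way; the function name is mine):

```python
import re

def normalize_start_date(s: str) -> str:
    """Normalize rec_rep_start_date content to YYYYMMMdd."""
    # reorder day/month/year into YYYYMMMd
    s = re.sub(r'(<rec_rep_start_date>)(\d+).*?([A-Z]{3}).*?(\d{4})',
               r'\1\4\3\2', s)
    # zero-pad a single-digit day; \g<1>0 is unambiguous where \10 is not
    s = re.sub(r'([A-Za-z])(\d</rec_rep_start_date>)', r'\g<1>0\g<2>', s)
    return s
```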

normalize rec_rep_end_date
GREP search for
(<rec_rep_end_date>).*?(\d{2}) ([A-Z]{3}[A-Z]*?) (\d{4})
replace with
\1\4\3\2

get the "20-2" instances
search for
(<rec_rep_end_date>).*?(\d)\d-(\d) ([A-Z]{3})[A-Z]*? (\d{4})(</rec_rep_end_date>)
replace with
\1\5\4\2\3\6

get the "1-3" instances
search for
(<rec_rep_end_date>).*? (\d)-(\d) ([A-Z]{3})[A-Z]*? (\d{4})(</rec_rep_end_date>)
replace with
\1\5\4!ZERO!\3\6
search for
!ZERO!
replace with
0

get the remaining instances (typically those containing commas)
GREP search for
(<rec_rep_end_date>.*?\d), (\d)
replace manually with YYYYMMMdd (e.g. 1562JAN07)


normalize rec_rep_pub_date
GREP search for
(<rec_rep_pub_date>)(\d+).*?([A-Z]{3}).*?(\d{4})
replace with
\1\4\3\2
add zero to day of month as needed
GREP search for
([A-Za-z])(\d</rec_rep_pub_date>)
replace with
\1!ZERO!\2
search for
!ZERO!
replace with
0

=============================================
8) transform outcome_text, crime_text, and respite_text
=============================================

Use 7_to_8.xsl to transform 7_all_elements.xml into 8_ready_to_proof.xml.

Depending on the quirks of the particular data set, you may need to do any of the following:
    
    - Adjust the regular expressions used in transform_functions.xsl and 7_to_8.xsl
    
    - Add or remove irregular outcome_text and crime_text values in irregulars.xml. This file
      provides a way to transform text that falls outside of the regular expression patterns.
      
    - Adjust value_lists.xml to reflect the most up-to-date list of 'regular' crimes, outcomes,
      and respites. 

=============================================
9) Apply XML structural changes
=============================================

Use 8_to_9.xsl to transform the result of #8 into 9_schema_changes.xml.
This brings in structural changes that were introduced in March 2011. See blog:

http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7824&more=1&c=1&tb=1&pb=1

=============================================
10) Apply further structural changes
=============================================
Use 9_to_10.xsl to transform the result of #9 into 10.xml. This incorporates further
structural changes, namely removing the rec_rep_head element and moving the data into
individual trial files. See blog:

http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7853&more=1&c=1&tb=1&pb=1

=============================================
11) Add start_year element
=============================================
Add a <start_year> element as a direct child of <data> to the top of the result of #10. The value of this
element should be the starting year of the data set. For example, for the 1800s data set:
<data>
    <start_year>1800</start_year>
    (trial info and rest of file)
    ...
</data>

04/03/11

Permalink 01:44:06 pm, by jamie, 38 words, 252 views   English (CA)
Categories: Notes; Mins. worked: 0

JQuery plugin for multiple select form fields

Regular HTML multiple select fields are ugly and not very user-friendly, so for the Bailey search form I'm using a JQuery plugin called bsmSelect to turn the select fields into something more usable: https://github.com/vicb/bsmSelect
Permalink 11:01:23 am, by jamie, 129 words, 72 views   English (CA)
Categories: Notes; Mins. worked: 0

How to import a valid data set into MySQL

I've now got two datasets in the MySQL database, 1790-1799 and 1800-1809. Here's the relatively simple procedure for importing an XML dataset into the MySQL database:

  1. Ensure that your XML file validates against bailey_trialfile_proofing.rng
  2. Transform the XML file with Bailey_text_to_fk.xsl - this generates a new XML file with text replaced by foreign key values
  3. Transform the result of the previous step with Bailey_add_id.xsl - this generates a file with a bunch of SQL INSERT statements
  4. Import the SQL file into the database
  5. Run fix_ranks.php on the web server to adjust the rank values for respites, outcomes, and trials

#1 is the most important step: if the file validates, then all of the other pieces will fall into place.

03/03/11

Permalink 03:19:04 pm, by jamie, 25 words, 85 views   English (CA)
Categories: Activity Log; Mins. worked: 30

Recorder report taken out of search

Since trial start/end dates and publication dates are no longer associated directly with recorder reports, the recorder report no longer has any searchable fields.
Permalink 02:09:40 pm, by jamie, 39 words, 64 views   English (CA)
Categories: Activity Log; Mins. worked: 30

Data sent to SD for final proofing

Since there have been so many structural changes and changes to allowed values recently, I've sent the 1790s and 1800s data to SD for what will likely be the final proofing before the data gets imported to the website.
Permalink 12:16:00 pm, by jamie, 319 words, 301 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Meeting with SD: allowed values and more date shuffling

SD and I met to discuss two issues: some data oddities that I wasn't sure how to handle, and the re-shuffling of the recorder report dates.

#1: Data Oddities

We've made the following changes to the value lists (respites, crimes, outcomes):

  • Added "Stealing" as a crime in the "Miscellaneous" group
  • Removed "PledBelly" from the list of allowable respites (it's actually a judge's respite)
  • Decided that the list of judge's respites (the judge_respite family of elements) would be "Respited-Judge", "Belly-Q", and "Belly-NQ"
  • Added "NoRR" as a respite_normalized
  • Added "SelfTL" and "SelfTR" as outcomes in the "Transport" group
  • Added "PrisonRemission" and "Whipped" as outcomes in the "EarlyRelease" group

#2: Recorder's Report Dates

We also discussed a better way to handle rec_rep_start_date, rec_rep_end_date, and rec_rep_pub_date. These currently cover multiple trial_files within the same case, but, given that the dates may change for trials within one case, we've decided to move the dates to within the trial_file element. This will result in more redundant data but will also give SD the flexibility to enter date oddities. Specifically, these changes will be made:

  • Removing the rec_rep_head element
  • Renaming rec_rep_start_date and rec_rep_end_date to trial_file_start_date and trial_file_end_date, and moving them to within trial_file
  • Renaming rec_rep_pub_date to respite_rr_date and moving it to within the first respite element. This is being done because, as SD explained, sometimes a respite resulting from a trial may not be published in the 'original' recorder's report, but be delayed and be published in a later report, even though the trial dates may match those of the original recorder's report. So, blanketing each trial_file within the recorder's report with the same publication date doesn't allow for that flexibility.
  • Removal of rec_rep_notes, which, according to SD, is no longer needed.

01/03/11

Permalink 11:28:48 am, by jamie, 39 words, 126 views   English (CA)
Categories: Notes; Mins. worked: 0

Outcome executions

After creating an outcome_executions DB table to link outcomes with multiple execution modes, I discovered that SA had already set up a slyly-named table called "execution_specials" which accomplishes the same task and already has 1790s data. Oops.
Permalink 09:37:08 am, by jamie, 73 words, 67 views   English (CA)
Categories: Activity Log; Mins. worked: 30

Changes to 1790s XML file; SD to re-proof

SD wants to double check the 1790s XML data. So, I made a new version of the 1790s XML file, 1790-1799_new_structure.xml, to take into account the changes SD and I made to the XML structure. I used the 8_to_9_SD_changes.xsl stylesheet for the transformation (the same one I used for the 1800s data). The data may have to be re-imported into the MySQL database; not too sure yet.

28/02/11

Permalink 02:17:51 pm, by jamie, 17 words, 71 views   English (CA)
Categories: Notes; Mins. worked: 0

1800s data proofed

SD just sent me his proofed version of the 1800s data XML file. We're making good progress.
Permalink 12:18:10 pm, by jamie, 250 words, 114 views   English (CA)
Categories: Activity Log; Mins. worked: 45

Changes to SQL structure for respites, outcomes, trials

SD's latest request for the search interface is that, when a trial file has more than one associated trial, respite, and/or outcome, only the first record in each of those subtables should be searchable. For example, if a trial file has two outcomes - 'Transported' and 'Military Service' - only the 'Transported' outcome should be searchable. I had planned to accomplish this in the search query by imposing a LIMIT, for example:


SELECT .... FROM trial_files WHERE outcomes.outcome_id IN (SELECT outcome_id FROM outcomes WHERE outcomes.trial_file_id_fk = trial_files.trial_file_id LIMIT 1)

However, it turns out that, as of version 5.5, MySQL doesn't yet support LIMIT in subqueries:


#1235 - This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

So that approach won't work. It's for the best, though, since that approach relies on the implicit order of the records in the database. What I will do now is add a "rank" column to the respites, outcomes, and trials tables. Only records with a rank of 1 will be searchable. In the detail view for a record, respites, outcomes, and trials will be displayed in order of rank. The SQL:


ALTER TABLE  `trials` ADD  `rank` INT NULL DEFAULT NULL;
ALTER TABLE  `outcomes` ADD  `rank` INT NULL DEFAULT NULL;
ALTER TABLE  `respites` ADD  `rank` INT NULL DEFAULT NULL;

I've manually added ranks to the 1790s data in the database; future rank will be taken from the order of the elements in the XML files.
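Here's a runnable illustration of the rank approach, using Python's sqlite3 in place of MySQL (the table and column names follow the post; the data values are made up for the demo):

```python
import sqlite3

# In-memory stand-in for the project's MySQL schema.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE trial_files (trial_file_id INTEGER PRIMARY KEY)')
con.execute('CREATE TABLE outcomes (outcome_id INTEGER PRIMARY KEY, '
            'trial_file_id_fk INTEGER, outcome_text TEXT, rank INTEGER)')
con.execute('INSERT INTO trial_files VALUES (1)')
# one trial file with two outcomes, ranked by XML element order
con.executemany('INSERT INTO outcomes VALUES (?, ?, ?, ?)',
                [(1, 1, 'Transported', 1),
                 (2, 1, 'Military Service', 2)])
# only the first-ranked outcome of each trial file is searchable
rows = con.execute(
    'SELECT tf.trial_file_id, o.outcome_text '
    'FROM trial_files tf '
    'JOIN outcomes o ON o.trial_file_id_fk = tf.trial_file_id '
    'WHERE o.rank = 1').fetchall()
```

With the same WHERE o.rank = 1 filter, a search for 'Military Service' would correctly return nothing, which is the behaviour SD asked for.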

25/02/11

Permalink 11:42:24 am, by jamie, 148 words, 82 views   English (CA)
Categories: Activity Log; Mins. worked: 75

Search saving

Wrote a simple function to save the current search parameters so that the same results can be seen at a later date. This is the flow:
  1. User clicks on "Save Search" link at bottom of results
  2. Javascript function intercepts the link and sends an AJAX request to includes/helpers/save_search.php
  3. The save_search script loads the Search class
  4. The Search class in turn grabs the search params from the session, and saves a new record to the saved_searches table, and then returns the ID of the saved search record
  5. The save_search script returns a URL that can be used to access the saved search, which is just the search script (search.php) with a 'saved_search' GET parameter equal to the ID of the newly saved search
  6. The Javascript function informs the user of the successful save and provides them with the URL
  7. Success!
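The actual implementation is PHP/Zend; purely as an illustration, steps 3-5 of the flow can be sketched in Python, with sqlite3 standing in for the saved_searches table (the parameter serialization and URL shape here are assumptions, not the project's actual code):

```python
import sqlite3
from urllib.parse import urlencode

# stand-in for the saved_searches table
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE saved_searches '
            '(saved_search_id INTEGER PRIMARY KEY, params TEXT)')

def save_search(params: dict) -> str:
    """Persist the search params and return a URL that replays them."""
    cur = con.execute('INSERT INTO saved_searches (params) VALUES (?)',
                      (urlencode(params),))
    # the saved_search GET parameter is the new record's ID
    return 'search.php?' + urlencode({'saved_search': cur.lastrowid})
```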

24/02/11

Permalink 04:25:52 pm, by jamie, 153 words, 62 views   English (CA)
Categories: Activity Log; Mins. worked: 120

Visualization/chart plotting progress

I've linked up the search form with the chart plotting tool, so now it's possible to:

  • Execute a search and immediately see the results in either a paginated list or the Flot-based chart tool
  • Pass the search results between both views (paginated list/chart)
  • Refine the results within either view

Contrary to my last blog post, the search data is not passed between the views via AJAX. Although that would be 'sexier', it doesn't lend itself well to storing results for the future. Speaking of which, my next task is to devise some sort of clever method for bookmarking search results, which'll be tricky since search data is stored in the session after the initial search (to avoid POST problems). I think some sort of DB-stored session ID may be the answer.

Note that development thus far has been done on my local machine but will be pushed to the Bailey website soon.

23/02/11

Permalink 01:43:52 pm, by jamie, 198 words, 85 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Flot visualization tool chosen

After having good success experimenting with Flot, an open Javascript chart-plotting script based on JQuery, I've decided that it's officially "the one" for the Bailey project. The Flot homepage is here: http://code.google.com/p/flot/

I chose Flot for three main reasons over Simile's Timeplot and Google's Visualization API:
  • It does what this project needs more or less out of the box
  • JSON data formatting is a snap
  • It's by far the easiest to use of the three tools, especially for anyone with JQuery experience (but that's by no means necessary)

I've already been working to integrate Flot for the past week or so, and am progressing at a good rate. I've written the bare-bones search form and the bare-bones visualization page; the next step is to allow the results for one to be exported to the other. I don't foresee any problems with the integration and am looking forward to seeing the tools working together soon. I'd like to design a single page that incorporates both views (search form + visualization) and uses AJAX to pass the data around, perhaps using tabs to separate the two. That way I can avoid having to constantly POST data.

18/02/11

Permalink 11:32:46 am, by jamie, 37 words, 87 views   English (CA)
Categories: Activity Log; Mins. worked: 75

New XML file sent to SD for proofing

I've sent the 1800s data XML file to SD for proofing again, after writing an XSLT stylesheet to incorporate the structural changes that came out of our last meeting on Feb. 17/11 (documented in an earlier blog post).
Permalink 08:58:32 am, by jamie, 341 words, 268 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Meeting with SD about XML structure and changes to data gathering

Met with SD yesterday to discuss his questions and concerns about the structure of the XML file. He's been proofing the data for the last couple of weeks. As a result of his reading, he's realized that the current XML structure is lacking in a few areas. So, we're going to make the following changes:

  • recorder_report (rec_rep in the XML) currently "has many" cases (case in the XML). rec_rep also "has one" rec_rep_head, which has rec_rep_start_date, rec_rep_end_date, rec_rep_pub_date, and rec_rep_notes elements. According to SD, in some cases the recorder report dates can change for each case within the same recorder report. So rather than rec_rep_head being a direct child of rec_rep, it will be moved to being a child of each case WITHIN the rec_rep. So the hierarchy will be: rec_rep -> HAS MANY case -> HAS ONE rec_rep_head
  • respite elements will have a child respite_ref field so that SD can add in any possible references for a respite.
  • A new element, "judge_respite", is being added as a trial_file child. Unlike the trial file -> HAS MANY respites relationship, a trial file only HAS ONE judge_respite.
  • execution_mode, which is a child of outcome, is going to be expanded to allow for multiple elements. So the relationship will become: trial file -> HAS MANY outcomes -> HAS MANY execution_modes. This is an optional field.
I also approached SD about using an Excel spreadsheet to gather his raw data, rather than a word processor (which is his current method). Since the current method to convert his raw data to XML is quite cumbersome and error-prone (not by any fault of SA, who did an admirable job with GREP), moving SD to a spreadsheet will make the conversion process faster and more streamlined. I think it'll also make his own work simpler since he'll be able to find and manipulate his data much more easily.

17/02/11

Permalink 03:33:18 pm, by jamie, 56 words, 51 views   English (CA)
Categories: Activity Log; Mins. worked: 120

Another Javascript visualization tool enters the ring

To add to my list of SIMILE and Google Visualization, I'm also investigating Flot, a plotting library built on the JQuery framework. It looks to do almost exactly what I need it to do, and its learning curve is not nearly as steep as the other tools. After some fiddling and exploring Flot is the front-runner.

16/02/11

Permalink 04:28:41 pm, by jamie, 88 words, 56 views   English (CA)
Categories: Activity Log; Mins. worked: 180

Investigating Javascript visualization tools

I spent a large chunk of today exploring possible Javascript-based visualization tools for the histogram feature, which is intended to show trends over time based on any number of criteria. There seem to be two clear frontrunners: SIMILE's Timeplot and Google's Visualization API.

Both are good options but quite complex and, as always seems to be the case, fall just short of the functionality needed. Writing some custom code might be unavoidable.

15/02/11

Permalink 04:28:21 pm, by jamie, 169 words, 52 views   English (CA)
Categories: Activity Log; Mins. worked: 90

Web-based search progressing

Made some good progress on the search interface today. Instead of having one Search class to handle both searching and returning one record, I wrote a separate Search_Single class that handles fetching a trial file based on a passed ID. Since a trial file can have multiple trials, respites, and outcomes (but only one criminal), the Search_Single class grabs those associated records via separate queries. Meanwhile, the main Search class simply does a LEFT JOIN on the trials, outcomes, and respites table to find at least one record that matches the search criteria.

I chose the trial_file as the 'main' table for the search because it's the parent for trials, respites, outcomes, and criminals. Since there's only one criminal per trial file but potentially multiple trials, respites, and outcomes, the trial file makes the most sense because we don't want the same criminal to show up multiple times in the search results, which would happen if we were to search with the trials table as primary.
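The reasoning above can be seen in a toy example: with trial_files as the main table and a LEFT JOIN to trials, collapsing to one row per trial file keeps each criminal to a single result even when several of their trials match. This sqlite3 sketch is a simplified stand-in (the DISTINCT shown here is one way to collapse the duplicates, not necessarily the class's exact query):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript('''
    CREATE TABLE trial_files (trial_file_id INTEGER PRIMARY KEY,
                              criminal TEXT);
    CREATE TABLE trials (trial_id INTEGER PRIMARY KEY,
                         trial_file_id_fk INTEGER, crime TEXT);
    INSERT INTO trial_files VALUES (1, 'John Hall');
    -- one trial file with two trials matching the same search term
    INSERT INTO trials VALUES (1, 1, 'Burglary'), (2, 1, 'Burglary');
''')
# trial_files as the main table, LEFT JOINed to trials
rows = con.execute('''
    SELECT DISTINCT tf.trial_file_id, tf.criminal
    FROM trial_files tf
    LEFT JOIN trials t ON t.trial_file_id_fk = tf.trial_file_id
    WHERE t.crime = 'Burglary'
''').fetchall()
```

If the query were instead driven from the trials table, John Hall would show up twice in the results.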

07/02/11

Permalink 10:40:31 am, by jamie, 178 words, 69 views   English (CA)
Categories: Activity Log; Mins. worked: 90

XML file sent to SD for proofing

I've sent the 1800 decade XML data file to SD for an initial proofing. Stewart did the bulk of the raw-to-XML transformation using a series of GREP statements. For this final transformation, I used XSLT to convert raw text to 'normalized' values that match against the RNG validation file. Here's how the validation currently breaks down:

  • The XML file has 41,535 lines
  • 116 of these lines (about 0.28%) do NOT validate
  • Of these errors, nearly all (100 of 116) are either empty jury_type elements or empty crime_text elements
  • The rest of the errors are mainly due to crimes or outcomes that didn't fit into any category (at least, to my untrained eye)

The remaining errors will need a 'trained' eye to be fixed on a case-by-case basis. Once SD is happy with the data, he'll return it and I'll perform a further transformation to import the data into the MySQL database (for which the XSLT stylesheets have, luckily, already been written).

I've also been spending some time this morning ensuring that the XSL files used in the transformation are well-documented.

04/02/11

Permalink 02:05:13 pm, by jamie, 27 words, 51 views   English (CA)
Categories: Activity Log; Mins. worked: 120

XML validation

The XML data file for the 1800s decade of data is down to about 0.28% invalid lines, almost all of which are empty jury_type or crime_text elements.
Permalink 11:41:29 am, by jamie, 171 words, 77 views   English (CA)
Categories: Notes; Mins. worked: 0

XPath gotcha with empty()

While writing the XSLT stylesheet for the Bailey project XML data, I stumbled across a misleading 'gotcha' when using the XPath function empty(). Basically, empty() will ALWAYS return false as long as the element being tested exists. So, if your XML chunk looks like this:


<alias />

And your XPath looks like this (in this case, within XSLT):


<xsl:when test="empty(./alias)">

That test will always return false since <alias> exists. To test whether an element is empty, use the string() function. For example, if we want to determine whether <alias> is empty (which it is):


<xsl:when test="not(string(./alias))">

string() will return the string value of the element, which, in this case, is empty (but not empty() - hah).

P.S.: Martin reminded me that empty() actually tests sequences, which explains the behaviour when it's used with an element. And that also answers why it doesn't throw an error: in XPath, every value is a sequence, so a single element is just a one-item sequence - a perfectly valid (and non-empty) argument.

Permalink 10:12:25 am, by jamie, 112 words, 61 views   English (CA)
Categories: Activity Log; Mins. worked: 20

Refining outcomes and outcome groups

In preparation for the completion of the XSLT stylesheet that will transform the XML data into a valid data set, SD has made some revisions to the list of outcomes and outcome groups. These revisions mostly involve refining the current outcomes and making them more explicit (e.g. expanding "Executed" to "Hanged", "HangedOnSite", "Gibbeted", "Dissected", "Burned", and "Drawn").

I've got some work ahead of me to integrate the new outcomes - mostly just accounting for them when encountering irregular data in the XML - but for the most part it should be fairly straightforward. The new outcomes are in the XSLT, so now it's just a matter of adjusting the stylesheet.

27/01/11

Permalink 11:57:01 am, by jamie, 120 words, 59 views   English (CA)
Categories: Activity Log; Mins. worked: 30

Meeting with SD

Met with SD today so that I could get more familiar with the project. SD went over how he'd like the data displayed, and we talked about the current state of the project in general. He is pleased with the progress and is excited to see what lies ahead. He'll double-check the list of crimes, respites and outcomes before I send him the XML file for proofing. After we hammer down the conversion and validation process for the current data set (1800s), then the rest of the data (i.e. the rest of the 19th century) will fall into place relatively quickly.

So, the project is progressing well and both SD and I are happy with the results thus far.

Permalink 10:42:25 am, by jamie, 87 words, 66 views   English (CA)
Categories: Activity Log; Mins. worked: 75

XML file almost validating

After much massaging with XSLT, the 1800s data XML file, which began as a raw RTF document and was transformed to an almost-well-formed XML file by SA, is almost error-free and ready to go to the proofer. In approximately 41,200 lines of data, there are 116 validation errors, which will need to be corrected manually by SD or his proofer. I tried to cover all of the irregularities when writing the XSLT stylesheet, but these remaining errors need to be fixed by someone with the raw data in hand.

26/01/11

Permalink 02:20:00 pm, by jamie, 197 words, 61 views   English (CA)
Categories: Activity Log; Mins. worked: 120

New search form completed

I've completed my bare-bones version of the web-based record search tool. All of the logic for the search - both displaying the available options and parsing out a form submission - is contained within two classes, Search and Search_Form. The classes build the search query in an organized manner based on passed parameters. This format makes the search easy to change or extend, which I can foresee happening based on the needs of the data and the researcher SD. All of the code is documented using comments and standard docblock syntax.

The database side of things is handled with Zend_Db, the Zend Framework's RDBMS adapter. Since the search form can result in some very complex queries, Zend_Db makes building the select statement relatively easy. Plus, it's a more robust database adapter in terms of security and integrity than I could build in any decent amount of time.

The search form doesn't yet have any styling and is bare, but now that the functionality is done everything else will fall into place eventually. SD is coming in for a meeting Thursday Jan 27th so I hope to hammer out some more details then.

24/01/11

Permalink 04:27:06 pm, by jamie, 78 words, 94 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Web-based search rebuild

Started the rebuild of the web-based 'query builder' search tool. Stew's SQL structure was for the most part complete, except for a missing jury_id_fk field in the trials table to point to the juries table. Other than that, the SQL structure is as Stew left it, though that may change at some point during the rebuild. For the search itself, I'm starting from scratch and building some PHP classes to handle building the complex search queries.
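Restoring the missing link would be a one-line schema change along these lines (a sketch only: the column type and the juries table's key name are assumptions, not taken from Stew's schema):

```sql
-- Hypothetical fix for the missing foreign key from trials to juries.
ALTER TABLE trials
  ADD COLUMN jury_id_fk INT NULL,
  ADD CONSTRAINT fk_trials_jury
    FOREIGN KEY (jury_id_fk) REFERENCES juries (jury_id);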

20/01/11

Permalink 09:08:29 am, by jamie, 256 words, 67 views   English (CA)
Categories: Activity Log; Mins. worked: 420

Finishing the raw-to-XML conversion procedure

As my first project at HCMC, I've been tasked with finishing Stewart's procedure for converting the raw data (as supplied by SD) to well-formed XML. Stewart did an admirable job changing a tab-delimited document into an almost-finished XML file using GREP and an exhaustive series of regular expressions. Since, however, the final steps involve populating elements based on the value of other elements (checked against a list of allowable values), I've decided - after talking with Martin and Greg - to use XSLT to finish off the procedure. I don't know XSLT, but since it will be an important part of my work here for the next eight months, this is a good opportunity to learn the language.

So the first step is to take the value of crime_text, figure out to which value it corresponds in the list of allowable crimes in the proofing RNG, and then use that value to populate crime_normalized. Then, crime_group will be populated based on the value of crime_normalized. respite_text and outcome_text will be massaged in a similar way, though respite doesn't have a respite_group (only _normalized).

Spent a good chunk of time learning my way around what's been done for the Bailey project, the XSLT language, and the specific data and code needs of this project. The consensus: writing a catch-all function in the XSL stylesheet to convert the *_text values to *_normalized values will be tough, so there will likely be the odd value that will need to be edited manually.
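The crime_text-to-crime_normalized step might be sketched like this in XSLT (a hypothetical illustration only: the element names come from the post, but the matching rules and normalized values are invented, and the real stylesheet will need a far more exhaustive lookup):

```xml
<xsl:template match="crime_text">
  <xsl:copy-of select="."/>
  <crime_normalized>
    <xsl:choose>
      <xsl:when test="contains(lower-case(.), 'burglar')">Burglary</xsl:when>
      <xsl:when test="contains(lower-case(.), 'murder')">Murder</xsl:when>
      <!-- unmatched values are left empty, to be fixed manually -->
      <xsl:otherwise/>
    </xsl:choose>
  </crime_normalized>
</xsl:template>
```

The empty xsl:otherwise is deliberate: values that don't match any rule surface as empty elements at validation time, flagging them for manual proofing.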

Capital Trials at the Old Bailey

Simon Devereaux has approximately 10,000 records of people convicted in potentially capital cases between 1710 and 1840 in London heard at the Old Bailey court. This project will create a web-based database which will allow interested researchers and members of the public to compose queries on that data (e.g. women charged with robbery 1710-1720). It must be able to support a range of queries and produce output allowing researchers to identify trends in judicial practice over that time.
