Archives for: March 2011

21/03/11

Permalink 03:03:37 pm, by jamie, 3403 words, 135 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Raw to XML procedure

Finished writing the procedure, started by SA, for how to convert a raw Word data file from SD into a valid XML file. Note that, since SD will be using an Excel spreadsheet for all data beginning from the year 1810, this procedure is only valid for pre-1810 data (of which there will be plenty). I'll have to write a second procedure for converting SD's Excel files to XML once he's finished the first set of data.

The procedure is saved in a text file which will be passed back to SA upon his return. Here's the procedure verbatim:


=============================================
1) Clean text data
=============================================
save raw text as 1_cleanText.txt

replace all multiple tabs with a single tab
GREP search for
\t+
replace with
\t

normalize indicator of multiple trial_files in one case
GREP search for
^\s*?&\s*?&.*$
replace with
ADDITIONAL_TRIAL_FILE_IN_CASE

turn remaining ampersand characters into entities
search for
&
replace with
&amp;
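
For anyone scripting the two ampersand steps instead of using an editor's GREP dialog, here is a minimal Python sketch. The function name is hypothetical. Two things differ from the procedure as written and are assumptions: the pattern uses [ \t] rather than \s so it cannot run across line breaks, and the escape step skips ampersands that are already part of &amp; so the function can be re-run safely.

```python
import re

MARKER = "ADDITIONAL_TRIAL_FILE_IN_CASE"


def normalize_ampersands(text: str) -> str:
    """Mark '& &' lines, then escape the ampersands that remain."""
    # A line beginning with two ampersands flags an extra trial_file in the
    # same case; the whole line is replaced by the marker.
    text = re.sub(r"(?m)^[ \t]*&[ \t]*&.*$", MARKER, text)
    # Escape remaining ampersands. The lookahead (an assumption, not in the
    # procedure) avoids turning an existing &amp; into &amp;amp;.
    return re.sub(r"&(?!amp;)", "&amp;", text)
```

The order matters: escaping first would hide the "& &" lines from the marker pattern.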

put four tabs on each trial line
put space in front of all lines with four tabs
GREP search for
^(.*?\t.*?\t.*?\t.*?\t.*?)$
replace with
 \1 [space character preceding the slash 1]

add tab to all lines with only three tabs
GREP search for
^(\w.*?\t.*?\t.*?\t.*?)$
replace with
 \1\t [space character preceding the slash 1]

add two tabs to all lines with only two tabs
GREP search for
^(\w.*?\t.*?\t.*?)$
replace with
 \1\t\t [space character preceding the slash 1]
 
manually add three tabs to all lines with only one tab
GREP manually search for
^(\w.*?\t.*?)$
replace with
 \1\t\t\t [space character preceding the slash 1]
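
The four passes above can be collapsed into one function when scripted. This is a sketch under assumptions: the function name is hypothetical, and it omits the leading-space marker, which the procedure only needs so that later GREP passes skip lines already handled.

```python
import re


def pad_to_four_tabs(line: str) -> str:
    """Pad a trial line that has one to three tabs up to four tabs."""
    tabs = line.count("\t")
    # Like the ^(\w...) patterns, only touch lines that start with a word
    # character; continuation lines starting with a tab are left alone.
    if re.match(r"\w", line) and 1 <= tabs < 4:
        return line + "\t" * (4 - tabs)
    return line
```

Lines with no tabs at all (headings, markers) and lines that already have four or more tabs pass through unchanged.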

put tab before instances of TRIAL that don't have one
GREP search for
^TRIAL
replace with
\tTRIAL

put TRIAL onto previous line
GREP search for
(.*?\t.*?\t.*?\t.*?)\r(\tTRIAL.*?)
replace with
\1\2

GREP search for:
Name\s*?Crime\s*?Respite\s*?Pardoned\s*?Executed\s*?\r
replace with:
[nothing]

GREP search for:
^\t*?(.*?Recorder's Report.*?)$
replace with
RECORDER_REPORT_HEAD \1

normalize smart quotes
search for
’
replace with
'
search for
‘
replace with
'
search for
“
replace with
"
search for
”
replace with
"

eliminate empty lines
GREP search for
\r\s*?\r
replace with
\r
repeat until no instances

put extraneous lines as comments on end of previous line
GREP search for
(.*?\t.*?\t.*?\t.*?\t.*?)\r\t(.*?)
replace with
\1\t\t\2
repeat until no hits

manually edit any lines still remaining
GREP search for
^\t
ensure there are 6 tabs in preceding line and copy extra line after sixth tab

ensure there are no more than 6 tabs in any line
GREP search for
^(.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?)\t
replace with
\1
repeat until no hits
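
The same "no more than 6 tabs" rule can be sketched in Python without the repeat loop. The function name is hypothetical; the behaviour mirrors the GREP step in that text after the sixth tab is joined with no separator.

```python
def cap_at_six_tabs(line: str) -> str:
    """Fold everything after the sixth tab into the seventh field."""
    parts = line.split("\t")
    if len(parts) <= 7:  # 7 fields means 6 tabs, which is already fine
        return line
    # Keep the first six fields, then concatenate the rest into one field,
    # which is what repeatedly deleting the seventh tab does.
    return "\t".join(parts[:6] + ["".join(parts[6:])])
```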

remove leading spaces at start of line
GREP search for
^ [space character]
replace with
[nothing]

remove trailing spaces after tab
GREP search for 
\t +[space character before +]
replace with
\t

remove leading spaces before tab
GREP search for 
 +\t [space character before +]
replace with
\t

Manually check in spreadsheet for any lines that should obviously be included as comments in the preceding line

File should now have no empty lines
Each line should be one of
- start with RECORDER_REPORT_HEAD
- consist of ADDITIONAL_TRIAL_FILE_IN_CASE
- contain a trial_file record with 7 tab-delimited fields:
CriminalName TAB Crime TAB Respite TAB outcome TAB ExecutionDate TAB Trial Ref TAB extra

sample extract
RECORDER_REPORT_HEAD 	4-7, 9-10 DEC 1799	– Recorder's Report -> F, 31 Jan 1800
Thomas Scott (M1)	Robbery	Pleasure (RR, 16 July)	Free (18 July 1800) [HO 13/13, pp.6-7]		TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]
Bartholomew Foley (M1)	St in Dwelling	Pleasure (RR)	TL (15 Feb 1800) [HO 13/12, pp.394-5]		TRIAL -- OBSP 1799-1800, pp.9-10 [0.4]
Peter Chapman als Harry Read	Burglary	To Die (RR)	-----	W, 26 Feb 1800		als Harry Kirk (L) (19)TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]
ADDITIONAL_TRIAL_FILE_IN_CASE
John Hall (L) (33)	Burglary	To Die (RR)	-----	W, 26 Feb 1800	TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]
ADDITIONAL_TRIAL_FILE_IN_CASE
Joseph Jones (L)	Burglary	To Die (RR)	-----	W, 26 Feb 1800	TRIAL -- OBSP 1799-1800, pp.88-92 [4.6 but PG]
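
A quick way to check a cleaned file against the three allowed line types is a small classifier. This is a sketch only: the function names and the snake_case field names are hypothetical, chosen to match the seven fields listed above.

```python
HEAD = "RECORDER_REPORT_HEAD"
MARKER = "ADDITIONAL_TRIAL_FILE_IN_CASE"
# Hypothetical names for the seven tab-delimited fields of a trial_file line.
FIELDS = ("criminal_name", "crime", "respite", "outcome",
          "execution_date", "trial_ref", "extra")


def classify_line(line: str) -> str:
    """Return 'head', 'additional', 'trial_file' or 'invalid'."""
    if line.startswith(HEAD):
        return "head"
    if line == MARKER:
        return "additional"
    if line.count("\t") == len(FIELDS) - 1:  # 6 tabs separate 7 fields
        return "trial_file"
    return "invalid"


def parse_trial_line(line: str) -> dict:
    """Split a trial_file line into a dict keyed by FIELDS."""
    if classify_line(line) != "trial_file":
        raise ValueError("not a 7-field trial_file line: %r" % line)
    return dict(zip(FIELDS, line.split("\t")))
```

Any line reported as 'invalid' still needs the manual tab fixes described in section 1.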

=============================================
2) add rec_rep, rec_rep_head, case, trial_file elements
=============================================

save file as 2_mainElements.txt

create rec_rep containers
search for
^RECORDER_REPORT_HEAD(.*?)$
replace with
</rec_rep>\r<rec_rep>\r<rec_rep_head>\1</rec_rep_head>

delete </rec_rep> line at top of document
add <regular_cases> line at top of document
add </rec_rep> line at end of document
add </regular_cases> line at bottom of document

create trial_files
GREP search for
<trial_file>(.*?)</trial_file>
replace with
<case>\r<trial_file>\1</trial_file>\r</case>

remove case elements for multiple trial_files in one case
GREP search for
</case>\r\tADDITIONAL_TRIAL_FILE_IN_CASE\r<case>\r
replace with
[nothing]

Each line of file should now be one of:
<rec_rep>
<rec_rep_head> . . . </rec_rep_head>
<case>
<trial_file> . . . </trial_file>
</case>
</rec_rep>

sample extracts:
<rec_rep>
<rec_rep_head>	4-7, 9-10 DEC 1799	– Recorder's Report -> F, 31 Jan 1800</rec_rep_head>
<case>
<trial_file>Thomas Scott (M1)	Robbery	Pleasure (RR, 16 July)	Free (18 July 1800) [HO 13/13, pp.6-7]		TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_file>
</case>
<case>
<trial_file>Bartholomew Foley (M1)	St in Dwelling	Pleasure (RR)	TL (15 Feb 1800) [HO 13/12, pp.394-5]		TRIAL -- OBSP 1799-1800, pp.9-10 [0.4]</trial_file>
</case>
<case>
<trial_file>Peter Chapman als Harry Read	Burglary	To Die (RR)	-----	W, 26 Feb 1800		als Harry Kirk (L) (19)TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]</trial_file>
<trial_file>John Hall (L) (33)	Burglary	To Die (RR)	-----	W, 26 Feb 1800	TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]</trial_file>
<trial_file>Joseph Jones (L)	Burglary	To Die (RR)	-----	W, 26 Feb 1800	TRIAL -- OBSP 1799-1800, pp.88-92 [4.6 but PG]</trial_file>
</case>
</rec_rep>
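
The nesting shown in the sample can also be produced in one pass over the cleaned lines rather than by inserting and then deleting closing tags. A minimal Python sketch, assuming the input is the output of section 1; the function name is hypothetical.

```python
HEAD = "RECORDER_REPORT_HEAD"
MARKER = "ADDITIONAL_TRIAL_FILE_IN_CASE"


def wrap_main_elements(lines):
    """Wrap cleaned lines in regular_cases / rec_rep / case / trial_file."""
    out = ["<regular_cases>"]
    in_report = in_case = join_next = False
    for line in lines:
        if line.startswith(HEAD):
            if in_case:
                out.append("</case>")
                in_case = False
            if in_report:
                out.append("</rec_rep>")
            out.append("<rec_rep>")
            out.append("<rec_rep_head>" + line[len(HEAD):] + "</rec_rep_head>")
            in_report = True
        elif line.strip() == MARKER:
            join_next = True  # next trial_file stays in the current case
        else:
            if in_case and not join_next:
                out.append("</case>")
                in_case = False
            if not in_case:
                out.append("<case>")
                in_case = True
            out.append("<trial_file>" + line + "</trial_file>")
            join_next = False
    if in_case:
        out.append("</case>")
    if in_report:
        out.append("</rec_rep>")
    out.append("</regular_cases>")
    return out
```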

=============================================
3) create criminal elements
=============================================
save file as 3_criminal

create criminal element
GREP search for
<trial_file>(.*?)\t
replace with
<trial_file><criminal>\1</criminal>

move jury element outside of criminal element
GREP search for
(<criminal>.*?)(\([A-Z]\d*?\))(.*?</criminal>)
replace with
\1\3\2

create and populate surname and given_name elements
GREP search for
(<criminal>)(.*?) (.*?) [trailing space character]
replace with
\1<surname>\2</surname><given_names>\3</given_names>

create aliases, age and gender elements
search for
</given_names>
replace with
</given_names><aliases></aliases><age></age><gender>Male</gender>

populate age element (gender is already assigned for all records in the step above)
GREP search for
(</age><gender>Male</gender>).*?\((\d+?)\).*?(</criminal>)
replace with
\2\1\3

sample:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal>(M1)Robbery	Pleasure (RR, 16 July)	Free (18 July 1800) [HO 13/13, pp.6-7]		TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_file>
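
The first field packs a name, an optional jury code such as (M1) and an optional age such as (33) into one string. A minimal Python sketch of pulling these apart follows. The function name is hypothetical. It follows the procedure in putting the first word into surname, but, unlike the GREP step, it puts everything after the first word into given_names, so aliases ("als ...") still need manual attention.

```python
import re

JURY = re.compile(r"\(([A-Z])(\d*)\)")  # e.g. (M1) or (L)
AGE = re.compile(r"\((\d+)\)")          # e.g. (33)


def parse_criminal(field: str) -> dict:
    """Split 'John Hall (L) (33)' into name, jury and age parts."""
    jury = JURY.search(field)
    age = AGE.search(field)
    # Remove both parenthesised codes, leaving only the name text.
    name = AGE.sub("", JURY.sub("", field)).strip()
    first, _, rest = name.partition(" ")
    return {
        "surname": first,
        "given_names": rest,
        "jury_type": jury.group(1) if jury else "",
        "jury_subtype": jury.group(2) if jury else "",
        "age": age.group(1) if age else "",
    }
```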

=============================================
4) create trials elements
=============================================
save file as 4_trials

add trials, trial, judge, jury elements to trials element
search for
</criminal>
replace with
</criminal><trials><trial><judge></judge><jury><jury_type></jury_type><jury_subtype></jury_subtype></jury></trial></trials>

populate jury element
GREP search for
(</jury_type><jury_subtype>)(</jury_subtype></jury></trial></trials>)\(([A-Z])(\d*?)\)
replace with
\3\1\4\2

add crime element inside trial element
search for
</trial>
replace with
<crime><crime_text></crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime></trial>

populate crime element
GREP search for
(</crime_text>.*?</trial>)(.*?)\t
replace with
\2\1

add empty mercy_appeals element
search for
<trial_ref>
replace with
<mercy_appeals></mercy_appeals><trial_ref>

sample
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials>Pleasure (RR, 16 July)	Free (18 July 1800) [HO 13/13, pp.6-7]		</trial_file>

=============================================
5) create respites elements
=============================================
save file as 5_respites

add respites elements
search for
</trials>
replace with
</trials><respites><respite><respite_text></respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites>

populate respite_text element
GREP search for
(</respite_text>.*?</respites>)(.*?)\t
replace with
\2\1

sample
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites>Free (18 July 1800) [HO 13/13, pp.6-7]		</trial_file>

=============================================
6) create outcomes elements
=============================================
save file as 6_outcomes

add outcome, outcome_text, outcome_ref elements
search for
</respites>
replace with
</respites><outcomes><outcome><outcome_text></outcome_text><outcome_ref></outcome_ref></outcome></outcomes>

populate outcome_text element
GREP search for
(</outcome_text><outcome_ref></outcome_ref></outcome></outcomes>)(.*?)\t
replace with
\2\1

populate outcome_ref element
GREP search for
(<outcome_text>.*?)\[(.*?)\](</outcome_text><outcome_ref>)
replace with
\1\3\2

populate outcome_ref for those records not caught by regexp above
GREP search for
(.*?)\[(HO.*?)\](.*?)(</outcome_ref>)
replace with
\1\3\2\4

add rest of outcome elements
search for
<outcome_ref>
replace with
<outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>

populate outcome_date element
i.e. execution date, taken from 5th tab-delimited field (ExecutionDate) of original clean text
not taken from the date provided in the outcome_text

GREP search for
(</outcome_date>.*?</outcomes>)(.*?)\t
replace with
\2\1

sample:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites><outcomes><outcome><outcome_text>Free (18 July 1800) </outcome_text><outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>HO 13/13, pp.6-7</outcome_ref></outcome></outcomes></trial_file>
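
The outcome field mixes free text with a bracketed archive reference. A minimal Python sketch of the split between outcome_text and outcome_ref follows; the function name is hypothetical, and, unlike the sample above, it trims the trailing space left in outcome_text.

```python
import re

REF = re.compile(r"\[(.*?)\]")  # first bracketed reference, e.g. [HO 13/13, pp.6-7]


def split_outcome(field: str):
    """Return (outcome_text, outcome_ref); outcome_ref is '' if absent."""
    match = REF.search(field)
    if match is None:
        return field.strip(), ""
    # Drop the bracketed part from the text and keep its contents as the ref.
    text = (field[:match.start()] + field[match.end():]).strip()
    return text, match.group(1)
```

Only the first bracketed reference is extracted; a field with more than one would still need the manual check.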

=============================================
7) create remaining elements
=============================================
save file as 7_allElements

add remaining elements in trial_file element:
search for
</outcomes>
replace with
</outcomes><trial_file_printed_sources></trial_file_printed_sources><trial_file_other_documents> </trial_file_other_documents><trial_file_notes></trial_file_notes>

put extraneous text into trial_file_notes field
GREP search for
(</trial_file_notes>)\t(.*?)(</trial_file>)
replace with
\2\1\3

find tab delimited snippets of text and put them into the trial_file_notes field
GREP search for
(<trial_file>.*?)\t(.*?)(<.*?)(</trial_file_notes>)(.*?)$
replace with
\1\3\2\4\5

manually search for instances of tab in trial_file lines and correct them
search for
<trial_file>.*?\t
fix manually

sample of trial_file:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites><outcomes><outcome><outcome_text>Free (18 July 1800) </outcome_text><outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>HO 13/13, pp.6-7</outcome_ref></outcome></outcomes><trial_file_printed_sources></trial_file_printed_sources><trial_file_other_documents> </trial_file_other_documents><trial_file_notes>blah blah</trial_file_notes></trial_file>

add elements to rec_rep_head
search for
<rec_rep_head>
replace with
<rec_rep_head><rec_rep_start_date></rec_rep_start_date><rec_rep_end_date></rec_rep_end_date><rec_rep_pub_date></rec_rep_pub_date><rec_rep_notes></rec_rep_notes>

populate start and end date fields
GREP search for
(</rec_rep_start_date><rec_rep_end_date>)(.*?</rec_rep_notes>)\t(.*?)\t
replace with
\3\1\3\2

populate rec_rep_pub_date element
GREP search for
(</rec_rep_pub_date><rec_rep_notes>)(</rec_rep_notes>).*?(\d+? .*? \d{1,4})(.*?)(</rec_rep_head>)
replace with
\3\1\4\2\5

normalize rec_rep_start_date
GREP search for
(<rec_rep_start_date>)(\d+).*?([A-Z]{3}).*?(\d{4})
replace with
\1\4\3\2
add zero to day of month as needed (have to use ZERO placeholder because GREP engine treats \10 as the tenth selected element, rather than the first followed by a literal zero)
GREP search for
([A-Za-z])(\d</rec_rep_start_date>)
replace with
\1!ZERO!\2
search for
!ZERO!
replace with
0
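
If this step is scripted in Python, the !ZERO! placeholder is not needed, because Python's re module lets a replacement string write \g<1>0 for "group 1 followed by a literal zero". A minimal sketch, assuming the input is the raw date text from the header line; the function name is hypothetical.

```python
import re


def normalize_start_date(raw: str) -> str:
    """Turn '4-7, 9-10 DEC 1799' into '1799DEC04' (YYYYMMMdd)."""
    # First run of digits is the day, first three capitals the month,
    # first four-digit run after that the year; emit year, month, day.
    reordered = re.sub(r"^.*?(\d+).*?([A-Z]{3}).*?(\d{4}).*$", r"\3\2\1", raw)
    # Zero-pad a single-digit day. \g<1>0 is group 1 plus a literal '0';
    # writing \10 would be read as a reference to group 10.
    return re.sub(r"([A-Za-z])(\d)$", r"\g<1>0\2", reordered)
```

Text that does not match the first pattern is returned unchanged, so odd dates still surface for manual correction.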

normalize rec_rep_end_date
GREP search for
(<rec_rep_end_date>).*?(\d{2}) ([A-Z]{3}[A-Z]*?) (\d{4})
replace with
\1\4\3\2

get the "20-2" instances
search for
(<rec_rep_end_date>).*?(\d)\d-(\d) ([A-Z]{3})[A-Z]*? (\d{4})(</rec_rep_end_date>)
replace with
\1\5\4\2\3\6

get the "1-3" instances
search for
(<rec_rep_end_date>).*? (\d)-(\d) ([A-Z]{3})[A-Z]*? (\d{4})(</rec_rep_end_date>)
replace with
\1\5\4!ZERO!\3\6
search for
!ZERO!
replace with
0

get the remaining instances (typically those containing commas)
GREP search for
(<rec_rep_end_date>.*?\d), (\d)
replace manually with YYYYMMMdd (e.g. 1562JAN07)


normalize rec_rep_pub_date
GREP search for
(<rec_rep_pub_date>)(\d+).*?([A-Z]{3}).*?(\d{4})
replace with
\1\4\3\2
add zero to day of month as needed
GREP search for
([A-Za-z])(\d</rec_rep_pub_date>)
replace with
\1!ZERO!\2
search for
!ZERO!
replace with
0

=============================================
8) transform outcome_text, crime_text, and respite_text
=============================================

Use 7_to_8.xsl to transform 7_all_elements.xml into 8_ready_to_proof.xml.

Depending on the quirks of the particular data set, you may need to do any of the following:
    
    - Adjust the regular expressions used in transform_functions.xsl and 7_to_8.xsl
    
    - Add or remove irregular outcome_text and crime_text values in irregulars.xml. This file
      provides a way to transform text that falls outside of the regular expression patterns.
      
    - Adjust value_lists.xml to reflect the most up-to-date list of 'regular' crimes, outcomes,
      and respites. 

=============================================
9) Apply XML structural changes
=============================================

Use 8_to_9.xsl to transform the result of #8 into 9_schema_changes.xml.
This brings in structural changes that were introduced in March 2011. See blog:

http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7824&more=1&c=1&tb=1&pb=1

=============================================
10) Apply further structural changes
=============================================
Use 9_to_10.xsl to transform the result of #9 into 10.xml. This incorporates further
structural changes, namely removing the rec_rep_head element and moving the data into
individual trial files. See blog:

http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7853&more=1&c=1&tb=1&pb=1

=============================================
11) Add start_year element
=============================================
Add a <start_year> element as a direct child of <data> to the top of the result of #10. The value of this
element should be the starting year of the data set. For example, for the 1800s data set:
<data>
    <start_year>1800</start_year>
    (trial info and rest of file)
    ...
</data>
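
This step can also be scripted. A minimal sketch using Python's standard xml.etree.ElementTree module follows; the function name is hypothetical, and it assumes the root element of the result of #10 is <data>, as in the example above.

```python
import xml.etree.ElementTree as ET


def add_start_year(xml_text: str, year: int) -> str:
    """Insert <start_year> as the first child of <data>, if not present."""
    root = ET.fromstring(xml_text)
    if root.tag != "data":
        raise ValueError("expected <data> root, found <%s>" % root.tag)
    if root.find("start_year") is None:
        element = ET.Element("start_year")
        element.text = str(year)
        root.insert(0, element)  # position 0 puts it at the top of <data>
    return ET.tostring(root, encoding="unicode")
```

Running it twice on the same file leaves a single start_year element.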

04/03/11

Permalink 01:44:06 pm, by jamie, 38 words, 233 views   English (CA)
Categories: Notes; Mins. worked: 0

JQuery plugin for multiple select form fields

Regular HTML multiple select fields are ugly and not very user-friendly, so for the Bailey search form I'm using a JQuery plugin called bsmSelect to turn the select fields into something more usable: https://github.com/vicb/bsmSelect
Permalink 11:01:23 am, by jamie, 129 words, 67 views   English (CA)
Categories: Notes; Mins. worked: 0

How to import a valid data set into MySQL

I've now got two datasets in the MySQL database, 1790-1799 and 1800-1809. Here's the relatively simple procedure for importing an XML dataset into the MySQL database:

  1. Ensure that your XML file validates against bailey_trialfile_proofing.rng
  2. Transform the XML file with Bailey_text_to_fk.xsl - this generates a new XML file with text replaced by foreign key values
  3. Transform the result of the previous step with Bailey_add_id.xsl - this generates a file with a bunch of SQL INSERT statements
  4. Import the SQL file into the database
  5. Run fix_ranks.php on the web server to adjust the rank values for respites, outcomes, and trials

#1 is the most important step: if the file validates, then all of the other pieces will fall into place.

03/03/11

Permalink 03:19:04 pm, by jamie, 25 words, 84 views   English (CA)
Categories: Activity Log; Mins. worked: 30

Recorder report taken out of search

Since trial start/end dates and publication dates are no longer associated directly with recorder reports, the recorder report no longer has any searchable fields.
Permalink 02:09:40 pm, by jamie, 39 words, 63 views   English (CA)
Categories: Activity Log; Mins. worked: 30

Data sent to SD for final proofing

Since there have been so many structural changes and changes to allowed values recently, I've sent the 1790s and 1800s data to SD for what will likely be the final proofing before the data gets imported to the website.
Permalink 12:16:00 pm, by jamie, 319 words, 279 views   English (CA)
Categories: Activity Log; Mins. worked: 60

Meeting with SD: allowed values and more date shuffling

SD and I met to discuss two issues: some data oddities that I wasn't sure how to handle, and the re-shuffling of the recorder report dates.

#1: Data Oddities

We've made the following changes to the value lists (respites, crimes, outcomes):

  • Added "Stealing" as a crime in the "Miscellaneous" group
  • Removed "PledBelly" from the list of allowable respites (it's actually a judge's respite)
  • Decided that the list of judge's respites (the judge_respite family of elements) would be "Respited-Judge", "Belly-Q", and "Belly-NQ"
  • Added "NoRR" as a respite_normalized
  • Added "SelfTL" and "SelfTR" as outcomes in the "Transport" group
  • Added "PrisonRemission" and "Whipped" as outcomes in the "EarlyRelease" group

#2: Recorder's Report Dates

We also discussed a better way to handle rec_rep_start_date, rec_rep_end_date, and rec_rep_pub_date. These currently cover multiple trial_files within the same case, but, given that the dates may change for trials within one case, we've decided to move the dates to within the trial_file element. This will result in more redundant data but will also give SD the flexibility to enter date oddities. Specifically, these changes will be made:

  • Removing the rec_rep_head element
  • Renaming rec_rep_start_date and rec_rep_end_date to trial_file_start_date and trial_file_end_date, and moving them to within trial_file
  • Renaming rec_rep_pub_date to respite_rr_date and moving it to within the first respite element. This is being done because, as SD explained, sometimes a respite resulting from a trial may not be published in the 'original' recorder's report, but be delayed and be published in a later report, even though the trial dates may match those of the original recorder's report. So, blanketing each trial_file within the recorder's report with the same publication date doesn't allow for that flexibility.
  • Removal of rec_rep_notes, which, according to SD, is no longer needed.

01/03/11

Permalink 11:28:48 am, by jamie, 39 words, 100 views   English (CA)
Categories: Notes; Mins. worked: 0

Outcome executions

After creating an outcome_executions DB table to link outcomes with multiple execution modes, I discovered that SA had already set up a slyly-named table called "execution_specials" which accomplishes the same task and already has 1790s data. Oops.
Permalink 09:37:08 am, by jamie, 73 words, 66 views   English (CA)
Categories: Activity Log; Mins. worked: 30

Changes to 1790s XML file; SD to re-proof

SD wants to double check the 1790s XML data. So, I made a new version of the 1790s XML file, 1790-1799_new_structure.xml, to take into account the changes SD and I made to the XML structure. I used the 8_to_9_SD_changes.xsl stylesheet for the transformation (the same one I used for the 1800s data). The data may have to be re-imported into the MySQL database; not too sure yet.

Capital Trials at the Old Bailey

Simon Devereaux has approximately 10,000 records of people convicted in potentially capital cases between 1710 and 1840 in London heard at the Old Bailey court. This project will create a web-based database which will allow interested researchers and members of the public to compose queries on that data (e.g. women charged with robbery 1710-1720). It must be able to support a range of queries and produce output allowing researchers to identify trends in judicial practice over that time.
