Finished writing the procedure, started by SA, for how to convert a raw Word data file from SD into a valid XML file. Note that, since SD will be using an Excel spreadsheet for all data from 1810 onward, this procedure is only valid for pre-1810 data (of which there will be plenty). I'll have to write a second procedure for converting SD's Excel files to XML once he's finished the first set of data.
The procedure is saved in a text file which will be passed back to SA upon his return. Here's the procedure verbatim:
=============================================
1) Clean text data
=============================================
save raw text as 1_cleanText.txt
replace all multiple tabs with a single tab
GREP search for
\t+
replace with
\t
normalize indicator of multiple trial_files in one case
GREP search for
^\s*?&\s*?&.*$
replace with
ADDITIONAL_TRIAL_FILE_IN_CASE
turn remaining ampersand characters into entities
search for
&
replace with
&amp;
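For anyone who would rather script these first three replacements, here is a minimal Python sketch. It assumes the raw text has already been read into a string with \n line endings; the function name clean_basic is hypothetical and not part of the procedure. The whitespace class is narrowed to spaces and tabs so a match cannot run across a line break.

```python
import re

def clean_basic(text: str) -> str:
    """Collapse tab runs, mark double-ampersand lines, escape ampersands."""
    # replace all multiple tabs with a single tab
    text = re.sub(r"\t+", "\t", text)
    # a line that begins with two ampersands marks an additional trial_file
    # ([ \t]* is used instead of \s*? so the match stays on one line)
    text = re.sub(r"^[ \t]*&[ \t]*&.*$", "ADDITIONAL_TRIAL_FILE_IN_CASE",
                  text, flags=re.MULTILINE)
    # every ampersand that is left becomes an XML entity
    return text.replace("&", "&amp;")
```

The order matters: the double-ampersand marker has to be normalized before the remaining ampersands are turned into entities, otherwise the marker pattern no longer matches.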
put four tabs on each trial line
put space in front of all lines with four tabs
GREP search for
^(.*?\t.*?\t.*?\t.*?\t.*?)$
replace with
\1 [space character preceding the slash 1]
add tab to all lines with only three tabs
GREP search for
^(\w.*?\t.*?\t.*?\t.*?)$
replace with
\1\t [space character preceding the slash 1]
add two tabs to all lines with only two tabs
GREP search for
^(\w.*?\t.*?\t.*?)$
replace with
\1\t\t [space character preceding the slash 1]
manually add three tabs to all lines with only one tab
GREP manually search for
^(\w.*?\t.*?)$
replace with
\1\t\t\t [space character preceding the slash 1]
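The leading space in these replacements acts as a "done" marker: because .*? can itself match tabs, a line that already has four tabs would otherwise be matched again by the three-tab pattern. A scripted version does not need the marker. The following Python sketch is one possible equivalent, assuming one record per string; pad_to_four_tabs is a hypothetical name.

```python
import re

def pad_to_four_tabs(line: str) -> str:
    """Append tabs so a record with one to three tabs ends up with four."""
    tabs = line.count("\t")
    # Lines without tabs, lines that already have four or more, and lines
    # that do not start with a word character are left alone, mirroring
    # the ^\w anchor in the patterns above.
    if tabs == 0 or tabs >= 4 or not re.match(r"\w", line):
        return line
    return line + "\t" * (4 - tabs)
```

Lines with no tabs at all are returned unchanged, so they still need the manual check described later in this step.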
put tab before instances of TRIAL that don't have one
GREP search for
^TRIAL
replace with
\tTRIAL
put TRIAL onto previous line
GREP search for
(.*?\t.*?\t.*?\t.*?)\r(\tTRIAL.*?)
replace with
\1\2
remove column header lines
GREP search for:
Name\s*?Crime\s*?Respite\s*?Pardoned\s*?Executed\s*?\r
replace with:
[nothing]
GREP search for:
^\t*?(.*?Recorder's Report.*?)$
replace with
RECORDER_REPORT_HEAD \1
normalize smart quotes
search for
’
replace with
'
search for
‘
replace with
'
search for
“
replace with
"
search for
”
replace with
"
eliminate empty lines
GREP search for
\r\s*?\r
replace with
\r
repeat until no instances
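The "repeat until no instances" part is needed because a single pass only removes every other line break in a run of empty lines. A small Python sketch of the same loop, assuming \n line endings rather than \r (drop_empty_lines is a hypothetical name):

```python
import re

def drop_empty_lines(text: str) -> str:
    """Apply the empty-line replacement until the text stops changing."""
    pattern = re.compile(r"\n\s*?\n")
    while True:
        new_text = pattern.sub("\n", text)
        if new_text == text:
            return text
        text = new_text
```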
put extraneous lines as comments on end of previous line
GREP search for
(.*?\t.*?\t.*?\t.*?\t.*?)\r\t(.*?)
replace with
\1\t\t\2
repeat until no hits
manually edit any lines still remaining
GREP search for
^\t
ensure there are 6 tabs in preceding line and copy extra line after sixth tab
ensure there are no more than 6 tabs in any line
GREP search for
^(.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?)\t
replace with
\1
repeat until no hits
remove leading spaces at start of line
GREP search for
^ [space character]
replace with
[nothing]
remove trailing spaces after tab
GREP search for
\t +[space character before +]
replace with
\t
remove leading spaces before tab
GREP search for
+\t [space character before +]
replace with
\t
Manually check in spreadsheet for any lines that should obviously be included as comments in the preceding line
File should now have no empty lines
Each line should be one of
- start with RECORDER_REPORT_HEAD
- consist of ADDITIONAL_TRIAL_FILE_IN_CASE
- contain a trial_file record with 7 tab-delimited fields:
CriminalName TAB Crime TAB Respite TAB outcome TAB ExecutionDate TAB Trial Ref TAB extra
sample extract
RECORDER_REPORT_HEAD 4-7, 9-10 DEC 1799 – Recorder's Report -> F, 31 Jan 1800
Thomas Scott (M1) Robbery Pleasure (RR, 16 July) Free (18 July 1800) [HO 13/13, pp.6-7] TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]
Bartholomew Foley (M1) St in Dwelling Pleasure (RR) TL (15 Feb 1800) [HO 13/12, pp.394-5] TRIAL -- OBSP 1799-1800, pp.9-10 [0.4]
Peter Chapman als Harry Read Burglary To Die (RR) ----- W, 26 Feb 1800 als Harry Kirk (L) (19)TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]
ADDITIONAL_TRIAL_FILE_IN_CASE
John Hall (L) (33) Burglary To Die (RR) ----- W, 26 Feb 1800 TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]
ADDITIONAL_TRIAL_FILE_IN_CASE
Joseph Jones (L) Burglary To Die (RR) ----- W, 26 Feb 1800 TRIAL -- OBSP 1799-1800, pp.88-92 [4.6 but PG]
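The three allowed line types can be checked mechanically before moving on to step 2. This is only a sketch under the assumption that a trial_file record is any line with exactly six tabs; classify_line and find_invalid_lines are hypothetical helper names.

```python
def classify_line(line: str) -> str:
    """Return which of the three allowed line types a cleaned line is."""
    if line.startswith("RECORDER_REPORT_HEAD"):
        return "head"
    if line == "ADDITIONAL_TRIAL_FILE_IN_CASE":
        return "additional"
    if line.count("\t") == 6:
        return "trial_file"  # 7 tab-delimited fields
    return "invalid"

def find_invalid_lines(text: str) -> list:
    """List the 1-based numbers of lines that match none of the types."""
    return [number
            for number, line in enumerate(text.splitlines(), start=1)
            if classify_line(line) == "invalid"]
```

Empty lines are reported as invalid too, which matches the requirement that the file should have none left at this point.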
=============================================
2) add rec_rep, rec_rep_head, case, trial_file elements
=============================================
save file as 2_mainElements.txt
create rec_rep containers
GREP search for
^RECORDER_REPORT_HEAD(.*?)$
replace with
</rec_rep>\r<rec_rep>\r<rec_rep_head>\1</rec_rep_head>
delete </rec_rep> line at top of document
add <regular_cases> line at top of document
add </rec_rep> line at end of document
add </regular_cases> line at bottom of document
create trial_files
GREP search for
(<trial_file>.*?</trial_file>)
replace with
<case>\r\1\r</case>
remove case elements for multiple trial_files in one case
GREP search for
</case>\r\tADDITIONAL_TRIAL_FILE_IN_CASE\r<case>\r
replace with
[nothing]
Each line of file should now be one of:
<rec_rep>
<rec_rep_head> . . . </rec_rep_head>
<case>
<trial_file> . . . </trial_file>
</case>
</rec_rep>
sample extracts:
<rec_rep>
<rec_rep_head> 4-7, 9-10 DEC 1799 – Recorder's Report -> F, 31 Jan 1800</rec_rep_head>
<case>
<trial_file>Thomas Scott (M1) Robbery Pleasure (RR, 16 July) Free (18 July 1800) [HO 13/13, pp.6-7] TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_file>
</case>
<case>
<trial_file>Bartholomew Foley (M1) St in Dwelling Pleasure (RR) TL (15 Feb 1800) [HO 13/12, pp.394-5] TRIAL -- OBSP 1799-1800, pp.9-10 [0.4]</trial_file>
</case>
<case>
<trial_file>Peter Chapman als Harry Read Burglary To Die (RR) ----- W, 26 Feb 1800 als Harry Kirk (L) (19)TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]</trial_file>
<trial_file>John Hall (L) (33) Burglary To Die (RR) ----- W, 26 Feb 1800 TRIAL -- OBSP 1799-1800, pp.88-92 [4.6]</trial_file>
<trial_file>Joseph Jones (L) Burglary To Die (RR) ----- W, 26 Feb 1800 TRIAL -- OBSP 1799-1800, pp.88-92 [4.6 but PG]</trial_file>
</case>
</rec_rep>
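The same structure can be built in one pass over the cleaned lines instead of by successive replacements. The Python sketch below is an illustration only, assuming the input is a list of cleaned lines from step 1; add_main_elements is a hypothetical name.

```python
def add_main_elements(lines):
    """Wrap cleaned lines in rec_rep, rec_rep_head, case and trial_file."""
    out = ["<regular_cases>"]
    in_rec_rep = False
    in_case = False
    merge_next = False  # True right after ADDITIONAL_TRIAL_FILE_IN_CASE
    for line in lines:
        if line.startswith("RECORDER_REPORT_HEAD"):
            if in_case:
                out.append("</case>")
                in_case = False
            if in_rec_rep:
                out.append("</rec_rep>")
            head = line[len("RECORDER_REPORT_HEAD"):]
            out += ["<rec_rep>", f"<rec_rep_head>{head}</rec_rep_head>"]
            in_rec_rep = True
            merge_next = False
        elif line == "ADDITIONAL_TRIAL_FILE_IN_CASE":
            merge_next = True  # next trial_file stays in the open case
        else:
            if in_case and not merge_next:
                out.append("</case>")
                in_case = False
            if not in_case:
                out.append("<case>")
                in_case = True
            out.append(f"<trial_file>{line}</trial_file>")
            merge_next = False
    if in_case:
        out.append("</case>")
    if in_rec_rep:
        out.append("</rec_rep>")
    out.append("</regular_cases>")
    return out
```

Like the \1 in the replacement above, the sketch keeps whatever follows RECORDER_REPORT_HEAD unchanged, including any leading whitespace.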
=============================================
3) create criminal elements
=============================================
save file as 3_criminal
create criminal element
GREP search for
<trial_file>(.*?)\t
replace with
<trial_file><criminal>\1</criminal>
move jury element outside of criminal element
GREP search for
(<criminal>.*?)(\([A-Z]\d*?\))(.*?</criminal>)
replace with
\1\3\2
create and populate surname and given_names elements
GREP search for
(<criminal>)(.*?) (.*?) [trailing space character]
replace with
\1<surname>\2</surname><given_names>\3</given_names>
create aliases, age and gender elements
search for
</given_names>
replace with
</given_names><aliases></aliases><age></age><gender>Male</gender>
populate age element (gender has already been set for every record above)
GREP search for
(</age><gender>Male</gender>).*?\((\d+?)\).*?(</criminal>)
replace with
\2\1\3
sample:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal>(M1)Robbery Pleasure (RR, 16 July) Free (18 July 1800) [HO 13/13, pp.6-7] TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_file>
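The first field mixes the name with an optional jury code such as (M1) or (L) and an optional age such as (33). A rough Python sketch of pulling those apart is given here; it assumes the codes always appear in parentheses as in the sample extract, and parse_name_field is a hypothetical name. It deliberately leaves the name as a single string rather than guessing how to split it.

```python
import re

JURY = re.compile(r"\(([A-Z])(\d*)\)")  # e.g. (M1) or (L)
AGE = re.compile(r"\((\d+)\)")          # e.g. (33)

def parse_name_field(field: str) -> dict:
    """Separate the jury code and age from the name text."""
    jury = JURY.search(field)
    age = AGE.search(field)
    # remove both parenthesised codes, then tidy the leftover whitespace
    name = AGE.sub("", JURY.sub("", field))
    return {
        "name": " ".join(name.split()),
        "jury_type": jury.group(1) if jury else "",
        "jury_subtype": jury.group(2) if jury else "",
        "age": age.group(1) if age else "",
    }
```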
=============================================
4) create trials elements
=============================================
save file as 4_trials
add trials, trial, judge, jury elements to trials element
search for
</criminal>
replace with
</criminal><trials><trial><judge></judge><jury><jury_type></jury_type><jury_subtype></jury_subtype></jury></trial></trials>
populate jury element
GREP search for
(</jury_type><jury_subtype>)(</jury_subtype></jury></trial></trials>)\(([A-Z])(\d*?)\)
replace with
\3\1\4\2
add crime element inside trial element
search for
</trial>
replace with
<crime><crime_text></crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime></trial>
populate crime element
GREP search for
(</crime_text>.*?</trials>)(.*?)\t
replace with
\2\1
add empty mercy_appeals element
search for
<trial_ref>
replace with
<mercy_appeals></mercy_appeals><trial_ref>
sample
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials>Pleasure (RR, 16 July) Free (18 July 1800) [HO 13/13, pp.6-7] </trial_file>
=============================================
5) create respites elements
=============================================
save file as 5_respites
add respites elements
search for
</trials>
replace with
</trials><respites><respite><respite_text></respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites>
populate respite_text element
GREP search for
(</respite_text>.*?</respites>)(.*?)\t
replace with
\2\1
sample
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites>Free (18 July 1800) [HO 13/13, pp.6-7] </trial_file>
=============================================
6) create outcomes elements
=============================================
save file as 6_outcomes
add outcome, outcome_text, outcome_ref elements
search for
</respites>
replace with
</respites><outcomes><outcome><outcome_text></outcome_text><outcome_ref></outcome_ref></outcome></outcomes>
populate outcome_text element
GREP search for
(</outcome_text><outcome_ref></outcome_ref></outcome></outcomes>)(.*?)\t
replace with
\2\1
populate outcome_ref element
GREP search for
(<outcome_text>.*?)\[(.*?)\](</outcome_text><outcome_ref>)
replace with
\1\3\2
populate outcome_ref for those records not caught by regexp above
GREP search for
(.*?)\[(HO.*?)\](.*?)(</outcome_ref>)
replace with
\1\3\2\4
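In scripted form, splitting the bracketed reference out of the outcome text might look like the following Python sketch. It assumes the reference is the first square-bracketed run in the field; split_outcome is a hypothetical name, and unlike the replacements above it trims surrounding whitespace from the outcome text.

```python
import re

def split_outcome(field: str):
    """Return (outcome_text, outcome_ref) for one outcome field."""
    match = re.search(r"\[(.*?)\]", field)
    if match is None:
        return field.strip(), ""  # no reference present
    # keep everything except the bracketed reference as the outcome text
    text = field[:match.start()] + field[match.end():]
    return text.strip(), match.group(1)
```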
add rest of outcome elements
search for
<outcome_ref>
replace with
<outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>
populate outcome_date element
i.e. execution date, taken from the 5th tab-delimited field (ExecutionDate) of the original clean text
not taken from the date provided in the outcome_text
GREP search for
(</outcome_date>.*?</outcomes>)(.*?)\t
replace with
\2\1
sample:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites><outcomes><outcome><outcome_text>Free (18 July 1800) </outcome_text><outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>HO 13/13, pp.6-7</outcome_ref></outcome></outcomes></trial_file>
=============================================
7) create remaining elements
=============================================
save file as 7_allElements
add remaining elements in trial_file element:
search for
</outcomes>
replace with
</outcomes><trial_file_printed_sources></trial_file_printed_sources><trial_file_other_documents> </trial_file_other_documents><trial_file_notes></trial_file_notes>
put extraneous text into trial_file_notes field
GREP search for
(</trial_file_notes>)\t(.*?)(</trial_file>)
replace with
\2\1\3
find tab delimited snippets of text and put them into the trial_file_notes field
GREP search for
(<trial_file>.*?)\t(.*?)(<.*?)(</trial_file_notes>)(.*?)$
replace with
\1\3\2\4\5
manually search for instances of tab in trial_file lines and correct them
GREP search for
<trial_file>.*?\t
fix manually
sample of trial_file:
<trial_file><criminal><surname>Thomas</surname><given_names>Scott</given_names><aliases></aliases><age></age><gender>1</gender></criminal><trials><trial><judge></judge><jury><jury_type>M</jury_type><jury_subtype>1</jury_subtype></jury><crime><crime_text>Robbery</crime_text><crime_normalized></crime_normalized><crime_group></crime_group></crime><mercy_appeals></mercy_appeals><trial_ref>TRIAL -- OBSP 1799-1800, pp.8-9 [0.8]</trial_ref></trial></trials><respites><respite><respite_text>Pleasure (RR, 16 July)</respite_text><respite_normalized></respite_normalized><respite_delay></respite_delay><respite_punishment></respite_punishment></respite></respites><outcomes><outcome><outcome_text>Free (18 July 1800) </outcome_text><outcome_normalized></outcome_normalized><outcome_group></outcome_group><outcome_duration></outcome_duration><outcome_date></outcome_date><outcome_location></outcome_location><outcome_exceptional></outcome_exceptional><outcome_ref>HO 13/13, pp.6-7</outcome_ref></outcome></outcomes><trial_file_printed_sources></trial_file_printed_sources><trial_file_other_documents> </trial_file_other_documents><trial_file_notes>blah blah</trial_file_notes></trial_file>
add elements to rec_rep_head
search for
<rec_rep_head>
replace with
<rec_rep_head><rec_rep_start_date></rec_rep_start_date><rec_rep_end_date></rec_rep_end_date><rec_rep_pub_date></rec_rep_pub_date><rec_rep_notes></rec_rep_notes>
populate start and end date fields
GREP search for
(</rec_rep_start_date><rec_rep_end_date>)(.*?</rec_rep_notes>)\t(.*?)\t
replace with
\3\1\3\2
populate rec_rep_pub_date element
GREP search for
(</rec_rep_pub_date><rec_rep_notes>)(</rec_rep_notes>).*?(\d+? .*? \d{1,4})(.*?)(</rec_rep_head>)
replace with
\3\1\4\2\5
normalize rec_rep_start_date
GREP search for
(<rec_rep_start_date>)(\d+).*?([A-Z]{3}).*?(\d{4})
replace with
\1\4\3\2
add zero to day of month as needed (have to use ZERO placeholder because GREP engine treats \10 as the tenth selected element, rather than the first followed by a literal zero)
GREP search for
([A-Za-z])(\d</rec_rep_start_date>)
replace with
\1!ZERO!\2
search for
!ZERO!
replace with
0
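In a language with named or delimited group references the placeholder trick is not needed. As a sketch only, here is the same start-date normalization in Python, where \g<1>0 means "group 1 followed by a literal zero"; normalize_start_date is a hypothetical name.

```python
import re

def normalize_start_date(text: str) -> str:
    """Rewrite the start date as YYYYMMMdd, zero-padding the day."""
    # reorder day, month and year into year, month, day
    text = re.sub(
        r"(<rec_rep_start_date>)(\d+).*?([A-Z]{3}).*?(\d{4})",
        r"\1\4\3\2", text)
    # \g<1>0 is group 1 plus a literal zero; a bare \10 would be read
    # as a reference to group 10
    return re.sub(r"([A-Za-z])(\d</rec_rep_start_date>)",
                  r"\g<1>0\2", text)
```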
normalize rec_rep_end_date
GREP search for
(<rec_rep_end_date>).*?(\d{2}) ([A-Z]{3})[A-Z]*? (\d{4})
replace with
\1\4\3\2
get the "20-2" instances
GREP search for
(<rec_rep_end_date>).*?(\d)\d-(\d) ([A-Z]{3})[A-Z]*? (\d{4})(</rec_rep_end_date>)
replace with
\1\5\4\2\3\6
get the "1-3" instances
GREP search for
(<rec_rep_end_date>).*? (\d)-(\d) ([A-Z]{3})[A-Z]*? (\d{4})(</rec_rep_end_date>)
replace with
\1\5\4!ZERO!\3\6
search for
!ZERO!
replace with
0
get the remaining instances (typically those containing commas)
GREP search for
(<rec_rep_end_date>.*?\d), (\d)
replace manually with YYYYMMMdd (e.g. 1562JAN07)
normalize rec_rep_pub_date
GREP search for
(<rec_rep_pub_date>)(\d+).*?([A-Z]{3}).*?(\d{4})
replace with
\1\4\3\2
add zero to day of month as needed
GREP search for
([A-Za-z])(\d</rec_rep_pub_date>)
replace with
\1!ZERO!\2
search for
!ZERO!
replace with
0
=============================================
8) transform outcome_text, crime_text, and respite_text
=============================================
Use 7_to_8.xsl to transform 7_all_elements.xml into 8_ready_to_proof.xml.
Depending on the quirks of the particular data set, you may need to do any of the following:
- Adjust the regular expressions used in transform_functions.xsl and 7_to_8.xsl
- Add or remove irregular outcome_text and crime_text values in irregulars.xml. This file
provides a way to transform text that falls outside of the regular expression patterns.
- Adjust value_lists.xml to reflect the most up-to-date list of 'regular' crimes, outcomes,
and respites.
=============================================
9) Apply XML structural changes
=============================================
Use 8_to_9.xsl to transform the result of #8 into 9_schema_changes.xml.
This brings in structural changes that were introduced in March 2011. See blog:
http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7824&more=1&c=1&tb=1&pb=1
=============================================
10) Apply further structural changes
=============================================
Use 9_to_10.xsl to transform the result of #9 into 10.xml. This incorporates further
structural changes, namely removing the rec_rep_head element and moving the data into
individual trial files. See blog:
http://hcmc.uvic.ca/blogs/index.php?blog=36&p=7853&more=1&c=1&tb=1&pb=1
=============================================
11) Add start_year element
=============================================
Add a <start_year> element as a direct child of <data> to the top of the result of #10. The value of this
element should be the starting year of the data set. For example, for the 1800s data set:
<data>
<start_year>1800</start_year>
(trial info and rest of file)
...
</data>
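If this last step is ever scripted rather than done by hand, a minimal Python sketch using the standard library could look like this. It assumes the result of #10 is well-formed XML with <data> as its root element; add_start_year is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def add_start_year(xml_text: str, year: int) -> str:
    """Insert <start_year> as the first child of the <data> root."""
    root = ET.fromstring(xml_text)
    if root.tag != "data":
        raise ValueError("expected <data> as the root element")
    start_year = ET.Element("start_year")
    start_year.text = str(year)
    root.insert(0, start_year)  # position 0 puts it at the top
    return ET.tostring(root, encoding="unicode")
```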