Updated the way the stats page counts non-complete place and vessel entries.
cover the BC 1861-1867 Entry Books of Correspondence: Letters from Secretary of State and Despatches. These will now be linked into the transcription documents.
Refreshed all the OAI metadata records. I'm going to document this process since I don't seem to have documented it in the past.
The generation process used to take about four hours on my previous workstation; this time it took only half an hour.
While doing the stats (see previous post), I found some oddities in the name encoding. These should be constrained by the schema, so I'm going to look at the possibility of rebuilding the schema accordingly.
Generated these stats for CP's report on this round's grant funding:
Images processed so far this round for 1861:
CO 60:10, 60:11, 60:12
CO 305:17, 305:18
RG7 G8C:21
for a total of 4369 images, at 3 sizes = 13107. 1317 links to page-images have been added to the 404 documents for 1861. According to my calculations, so far in 1861, 7150 names of people, places, and vessels have been linked:
5252 people
65 vessels
1833 places
KSW will do some calculations for the next application, for 1862.
We discovered that one of the old scripts we used to convert the documents ran amok a little and added a false "documentType" value of "Secret", most likely because the script assumed that "Secretary" counted as "Secret"!
We removed <idno type="documentType">Secret</idno> from 1,862 files. Revision number prior to this mass-fix: 990. First revision number with ONLY this fix: 991.
Important: there are actually 6 "Secret" files. These documents have <head> elements containing "Secret" but not containing "Secretary":
We have added <idno type="documentType">Secret</idno> to these files, and this is revision number 992.
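The false positives are what a plain substring test would produce. As a minimal sketch in Python (this is not the original conversion script, which isn't shown here; the function names and sample headings are hypothetical), the difference between substring matching and whole-word matching looks like this:

```python
import re

# Whole-word pattern: matches "Secret" but not "Secretary".
SECRET_WORD = re.compile(r"\bSecret\b")

def naive_is_secret(head_text):
    """Substring test: the kind of check that misfires on "Secretary"."""
    return "Secret" in head_text

def is_secret(head_text):
    """Whole-word test: True only when "Secret" stands alone as a word."""
    return SECRET_WORD.search(head_text) is not None

# Hypothetical heading texts, for illustration only.
heads = ["Secret", "Letter from the Secretary of State", "Secret. No. 4"]
for head in heads:
    print(head, naive_is_secret(head), is_secret(head))
```

Only the whole-word test separates the genuinely "Secret" headings from those that merely mention the Secretary of State.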
After some discussion and a request from a user, we've decided to make our encoding guidelines document available on the site. It is, of course, in a state of continuous evolution, so we'll refresh the PDF periodically. A link has been added to the Development page.
Gathered stats up to end of September.
The Colonial Despatches project would like to welcome our two new team members, Alison Malis (doing a Directed Reading in the History Department) and Brigitte Dreger-Smylie (Directed Reading/Professional Writing Program). We also welcome back Theo Biggs, previously doing Directed Reading but now as a workstudy student, entering his third year with us. There should be lots of activity over the coming semester!
Stats for complete, incomplete and unavailable bios were being incorrectly calculated following our change to the use of persName/@type recently. Reported by KSW, and now fixed.
The links from schedules not reliably connected with a document id were failing with an inscrutable error, as was any URL which didn't actually point at a document (where a sort of 404 would be expected). I've now fixed that, so that a cleaner "not found" document appears, and schedules with plausible target documents (based on despatch numbers and dates) actually jump to the first plausible document.
KS-W and I clarified and extended the existing system for classifying bios, and place and vessel definitions, and I updated the XSLT accordingly:
A number of existing bio entries will be reclassified from incomplete to unavailable by KS-W.
Met with JL, CP, IO'C, and DB-M re the georeferencing of maps from the Coldesp collection and the Library's collection. There will be another meeting in September to thrash out more details, and in February to look at some results from student work in a GIS class on some of the existing maps.
We will need to produce duplicates for some of the files in the 1861 collection, specifically, for documents that appear as letter-book copies in 398/1 and as originals in the RG7 G8C 9 collection.
We will handle this process as we have for previous collections.
A reader pointed out that we have two competing abbreviations for what is now Library and Archives Canada: LAC and the older NAC. We have now replaced all instances of NAC with LAC, and updated the search engine to take account of the change.
282 page images for RG7 G8C vol 21 (in three different sizes) have been added to the collection. These cover the Despatches to London July 1859 to April 1861 (letterbook copies). These will now be linked into the transcription documents where appropriate.
466 page images for CO 60 Vol 11 (in three different sizes) have been added to the collection. These cover the 1861 Despatches to London Sept-Dec. These will now be linked into the transcription documents.
588 page images for CO 60 Vol 12 (in three different sizes) have been added to the collection. These cover the 1861 Public Offices and Miscellaneous correspondence. These will now be linked into the transcription documents.
767 page images for CO 60 Vol 10 (in three different sizes) have been added to the collection. These cover the 1861 Despatches from London, January to August. These will now be linked into the transcription documents.
Today's progress:
Still to do: rework the processMapBibl template so that it really uses all of the info that's now there (author, publisher, etc. etc.). This should probably be done with regular templates.
This is the complete mapping for copying metadata over from the ContentDM records to our TEI files:
I'm now halfway through the XSLT which will integrate the metadata into the TEI files. Should be done tomorrow.
This is my preliminary mapping:
Spent most of the day manually aligning records between ContentDM and ColDesp, so this is where we're at:
Also wrote to CP with a list of 7 maps that we have, but which are apparently missing from ContentDM.
More progress on matching with ContentDM. I've now generated an XHTML file with two tables, one of candidate matches (186 maps) with links to both ColDesp and ContentDM, for human checking, and one of failed matches (33 maps from ColDesp), with ColDesp links and enough metadata for a manual search. I've manually verified the 186 candidate matches and found that most match; I reported one map apparently missing from ContentDM to CP, and found a dupe in ColDesp.
Next steps:
910 page images for CO 305 Vol 18 (in three different sizes) have been added to the collection. These cover the 1861 Vancouver Island Public Offices and Miscellaneous Correspondence. These will now be linked into the transcription documents.
The complete collection of 1208 page images for CO 305 Vol 17 (in three different sizes) has been added to the collection. These cover the 1861 Vancouver Island Despatches to London. These will now be linked into the transcription documents.
...described in this post.
Based on our meeting last week, I've drafted a proposal for the HCMC committee for the port of the project to a pure eXist implementation with enhanced searching, NLP topic discovery, etc. Sent to JL for comments.
For many projects it will be useful to have a way of calling a Java library which can compute a universal similarity metric for two strings. I've started working from this documentation to create a class and the necessary wrappers to make this work. I'm still trying to resolve some dependencies, but I think this will be practical, and we'll be able to use the USM module in the context of oXygen (where we're allowed to use Saxon EE). The testbed for this will be the matching of ContentDM records with our TEI metadata for maps.
Put together an immediate and a longer term plan for the project; I'll detail these when I have a chance.
I've done some preliminary alignment with XSLT to find out which maps we have which can be matched with entries from ContentDM:
It seems likely that many of these items actually do match, but because they have no Penfold numbers or matching ids, I'll have to match them with some sort of fuzzy matching approach.
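To give a sense of what a fuzzy match on titles might look like, here is a rough sketch in Python. It uses the standard library's difflib.SequenceMatcher as a stand-in, which is not the universal similarity metric discussed elsewhere in these posts; the function names, the 0.8 threshold, and the sample titles are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1] between two titles, ignoring case and extra spaces."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()

def best_match(title, candidates, threshold=0.8):
    """Return (candidate, score) for the closest title, or None if nothing
    reaches the (arbitrary, assumed) threshold."""
    if not candidates:
        return None
    scored = [(c, similarity(title, c)) for c in candidates]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None

# Hypothetical titles, not taken from either collection.
print(best_match("Victoria Harbour 1847",
                 ["Victoria Harbour, 1847", "Fraser River"]))
```

Anything that comes back from a matcher like this would still need human checking, since similar titles can belong to different maps.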
I regenerated my map_lookup.xml file with a bit of added data:
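xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
<maps xmlns="http://hcmc.uvic.ca">
{
  (: One <map> entry per TEI file, keyed by the file's xml:id. :)
  for $t in //tei:TEI
  return
    <map xml:id="{$t/@xml:id}">
    {
      (: ($t//tei:title)[1] takes the first title in document order;
         $t//tei:title[1] would return every title that is first among its siblings. :)
      if ($t//tei:title) then
        <title>{($t//tei:title)[1]/text()}</title>
      else
        ()
    }
    {
      (: Penfold number and document id, only where the file has a Penfold number. :)
      if ($t//tei:idno[@type="penfoldNum"]) then
        (
          <penfold>{$t//tei:idno[@type="penfoldNum"]/text()}</penfold>,
          <docId>{$t//tei:idno[@type="doc_id"]/text()}</docId>
        )
      else
        ()
    }
    </map>
}
</maps>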
Completed the report for PCA, who signed off yesterday, and sent it on to SD and EG-W.
Four new correspondence documents from 1859 have been added to the collection, transcribed by Marion Massey and marked up by Petria Arienzale. The total document count is now 7151.
Added the first few new vessels to the vessels file, fixing some typos in the original transcription, confirming the existence and naming of the vessels, and finding some sources to get the researcher started. Lots more to do. I'm up to the John Stephenson.
Simple XQuery to pull out the data:
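xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
<maps xmlns="http://hcmc.uvic.ca">
{
  (: One <map> entry per TEI file, keyed by the file's xml:id. :)
  for $t in //tei:TEI
  return
    <map xml:id="{$t/@xml:id}">
    {
      (: ($t//tei:title)[1] takes the first title in document order;
         $t//tei:title[1] would return every title that is first among its siblings. :)
      if ($t//tei:title) then
        <title>{($t//tei:title)[1]/text()}</title>
      else
        ()
    }
    {
      (: Penfold number, where the file has one. :)
      if ($t//tei:idno[@type="penfoldNum"]) then
        <penfold>{$t//tei:idno[@type="penfoldNum"]/text()}</penfold>
      else
        ()
    }
    </map>
}
</maps>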
I might have to add more data points to the output; in fact it might be worth just pulling out the whole of the sourceDesc. I'm currently looking at the possibility of enhancing my UniSymMetric Java class so it could be called as an extension function from XSLT in Saxon; that would give me a fallback when there's no Penfold number, and it might be handy in all sorts of other ways too.
JD pointed me at an OAI feed from ContentDM, which is exactly what I need for my metadata harvesting. This is my plan:
I've started work on an XSLT stylesheet to do the job. The purpose of the stylesheet is to process detailed OAI metadata records which use Dublin Core identifiers into teiHeader elements suitable for adding to TEI documents in the Colonial Despatches project.
The OAI metadata is in the file oai_from_contentdm.xml, and originates in the UVic Library's ContentDM system. It contains 261 records relating to Early BC Maps, and most of these are maps also in the Colonial Despatches project collection. The ContentDM metadata is well-organized and has been considerably enhanced, so we're going to take that data and generate new teiHeader elements for our TEI files from it.
The first stage is to create a mapping between each of the fields in the OAI data and the location in the teiHeader where we propose to store it.
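As an illustration of the kind of mapping involved, here is a small Python sketch that reads the Dublin Core fields out of one record and files each value under a teiHeader location. The Dublin Core namespace is the standard one, but the mapping table, the function name, and the sample record are all invented for the example; they are not the project's actual mapping or data.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# HYPOTHETICAL mapping from Dublin Core fields to teiHeader locations;
# the real mapping for the project is not reproduced here.
FIELD_MAP = {
    "title": "fileDesc/titleStmt/title",
    "creator": "fileDesc/sourceDesc/bibl/author",
    "publisher": "fileDesc/sourceDesc/bibl/publisher",
    "date": "fileDesc/sourceDesc/bibl/date",
}

def map_record(record_xml):
    """Return {teiHeader path: [values]} for the mapped DC fields in one record."""
    root = ET.fromstring(record_xml)
    mapped = {}
    for elem in root.iter():
        # Skip anything that is not a Dublin Core element.
        if not elem.tag.startswith("{" + DC_NS + "}"):
            continue
        field = elem.tag.split("}", 1)[1]
        target = FIELD_MAP.get(field)
        if target and elem.text and elem.text.strip():
            mapped.setdefault(target, []).append(elem.text.strip())
    return mapped

# Invented sample record, for illustration only.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample map title</dc:title>
  <dc:creator>Sample surveyor</dc:creator>
  <dc:format>image/jpeg</dc:format>
</oai_dc:dc>"""

print(map_record(SAMPLE))
```

Fields with no entry in the table (dc:format in the sample) are simply dropped, which makes it easy to see at a glance which fields the mapping still has to account for.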
Input documents:
Output documents:
Adding this as a task for me, long-term, because it needs to be part of the plan for the next phase of the project.
I had pointed JT at fo_925-1650_pt_1_24_vic_harbour_1847, which is Penfold 576, for the Kellett map of Victoria Harbour, but it turns out he wanted Penfold 577, which is fo_925-1807_vic_1848. I've slightly enriched the metadata for 577 using data from ContentDM, manually, but there should be a way to do this mechanically because the ContentDM metadata is organized into clear fields. Ultimately, it would be a good idea to find some way to get at this metadata and pull it into our headers, so we'll have to write a mapping between the two. Here's an example of the ContentDM data in HTML:
http://contentdm.library.uvic.ca/cdm/singleitem/collection/collection5/id/130/rec/2
It claims to be XHTML, but it's not even well-formed, never mind valid, so it couldn't be parsed with e.g. XSLT unless it was tidied first. Hopefully there's a more helpful feed from it. I'm contacting JD about that.
Dating of maps is inconsistent for maps which have a notBefore and/or notAfter. Check them in the sorted gallery, find oddities, and normalize. Did some today.
Did some auditing of the "Marion's transcriptions" spreadsheet that we're using to keep track of the transcriptions awaiting markup, since PCA has been working on these; checked filenames and made updates and notes where appropriate. Also fixed file naming issue reported by PCA, and did some other housekeeping.
JT provided two new maps for the gallery, so I've added those. I had to refresh myself on the procedure for doing this, so I'll detail it here:
I've assigned the first five 1859 documents transcribed by MM to PCA; the 1858 documents are rather complicated, and the existing 1858 documents need some editing, so it's simpler to work on the 1859 documents for the moment. The Google spreadsheet records the status of each document.
DONE: The transcription of the document 58-01-21_HBC748.rtf is marked up as the file V585MI30, when it should be V585MI02_A. It is already up on the site.
All vessels referred to in the Schedules which have obvious existing vessel bios have now been linked (including one correction to a typo, "Fartar" instead of "Tartar"). The remaining vessels, for which new vessel bios will be required, are:
Alexandra, Cameleon, Devastation, East Lotherian, John Bright, John Stephenson, John Stevenson, Kingfisher, Nanaimo Packet, Ossifree, Prince of the Seas, Random, Royal Charlie, Scout, Scylla, Severn, Shenandoah, Sutlej
It's likely that the John Stephenson and John Stevenson are the same vessel, and possible that they're actually the John Stevens.
The William Allen was tagged as "william", which made it confusable with the Brig William ("william_brig"). I've now changed the vessel bio and all references to it to show "william_allen". Also fixed an encoding issue in an 1854 document that I stumbled across.
Thanks to some excellent work from Petria Arienzale, abstracts have now been added for all 1854 documents. We now have abstracts for all years between 1846 and 1854.
Reviewed PCA's latest work (excellent) and sent comments. Also noticed a couple of issues in other documents and fixed them.
DONE 2012-03-26: The xml:id for the William Allen is currently "william", which is very confusing; change it to "william_allen", and change refs to it, so it's not confused with the Brig William.
NOTE: Completed 2012-04-23. Many new vessel entries have resulted from this work, and they will need to be completed when time permits.
Try this, first in /db/coldesp/correspondence, and then in /db/coldesp/:
xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
for $r in //name[@type='vessel'][not(@key)]
return $r
The vessel tags inside the correspondence seem mainly to be for vessels which HAVE write-ups; these should simply be correctly linked with @key. The broader set include vessels which may not have bios yet; bios need to be created, and those vessels linked.
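For checking a single file outside the database, the same test (vessel names with no @key) can be sketched in Python with the standard library's ElementTree. The function name and the sample document are hypothetical; ElementTree's limited XPath has no not(), so the missing-key check is done in Python.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def unlinked_vessels(tei_xml):
    """Return the text of each <name type="vessel"> that has no key attribute."""
    root = ET.fromstring(tei_xml)
    names = root.findall(".//tei:name[@type='vessel']", TEI_NS)
    return ["".join(n.itertext()).strip() for n in names if n.get("key") is None]

# Invented sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>The <name type="vessel" key="william_allen">William Allen</name> and the
<name type="vessel">Tartar</name> were at <name type="place">Victoria</name>.</p>
</body></text></TEI>"""

print(unlinked_vessels(SAMPLE))
```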
This is the state of play on TNB's work as of today:
There are issues with the search engine relating to both authors and addressees of correspondence. The drop-down lists are generated from distinct values of tags in the header. These tags, inherited from the Waterloo Script, contain plain text, and so the same individual is identified in a variety of different ways. It would be helpful if we could tag these names with ids from the personography, and then build our search engine drop-downs in a more intuitive fashion.
It seems best to start with the addressees, since they constitute a much smaller number (only 89 distinct values, listed below). The simplest approach would be this:
Addressees:
Following one of KSW's notes in this post, removed date tags from specific location in 17 files. This is presumably for consistency -- only 17 files had them -- and because I suspect some useful parsing can be done/is being done based on the first date in the text being the date the document was penned.
Items in the indexes have a link under their info popup which enables you to retrieve references to them in the correspondence, but sometimes there are no references (as in the case of peripheral bios, which are referred to in other bios but not in the actual correspondence). Previously, clicking on the "Mentions..." link simply did nothing in these cases, but I've now added a trap for this condition and an appropriate error message.
Added appropriate credit to MM for her transcription work, and began the process of pulling documents from Google Docs into the actual repo, which is a bit easier to keep track of. Found one suitable document to get PCA started with full-doc transcription, and created a simple guide to the file/id/naming convention for our collection. Wrote a detailed assignment for PCA and sent it. This process will include a check that our Guidelines document in fact provides enough guidance for encoding a complete new document. Most likely we will be expanding it in the next week or two as PCA starts to add new transcriptions.
Reviewed the extensive (and excellent) work completed by PCA, who is now nearly at the end of the 1854 abstracts. Wrote a number of notes for tweaks and fixes, as well as a couple of requests for further research and the transcription of a mysteriously-untranscribed despatch (V547102A).
PCA reported that mentions of the Brig William, wrecked in 1854, are linked to the vessel info for the William Allen, which is not the same ship at all. We dug around to find some references from which to construct a new vessel entry, and she's now going ahead with writing it.
Note: Francotoile and the Mysteries projects are not showing any stats.
Worked through the bios provided by TB, and made a couple of tweaks; found one person who has been misnamed for years. Also started work on TS's bios; I need to work through a couple of issues directly with him tomorrow.
PCA has completed abstracts for Jan-Feb 1854. Reviewed them and sent feedback, as well as updating her directed reading report.
Pulled down the server stats for January.
PA will move on to abstracts for 1854. Meanwhile, I've sent some comments on the other three bios, and in the process added some name markup to an 1852 file.
Reviewed the work PA has been doing -- it's excellent -- and wrote some feedback, as well as adding to the weekly report.
Several thousand OAI-PMH records have been regenerated to take account of updates to despatches files and other XML documents in the collection. The process currently takes a long time, so it's only done every few months. OAI metadata records for the collection are now up to date.
The Colonial Despatches project has been added to the DH Commons project list. We are hoping through this to attract collaborators from other institutions who may be interested in researching, writing, and proofreading.
Today PA did her first markup, encoding several of the peripheral bios that she's finished writing, and we posted them on the site, and added her to the credits page.
TS found an error in a peripheral bio, where the bio for James Johnstone had been created from that for hamilton_t, but the Hamilton info had been left in it; additionally, the name itself was "Johnson" instead of "Johnstone". Fixed this, and also updated a reference to the person in the Johnstone Strait place entry.
Changed old BC Geo Names urls from this format:
http://ilmbwww.gov.bc.ca/bcgn-bin/bcg10?name=51611
to this format:
http://apps.gov.bc.ca/pub/bcgnws/names/51611.html
The form seems to have changed, and 16 out of 85 of our references were still using the old URL form. TB noticed the problem.
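A change like this is mechanical enough to script. A minimal Python sketch, assuming the numeric id is the only part that carries over between the two URL forms quoted above (the function name is hypothetical, and this is not necessarily how the change was actually made):

```python
import re

# Old BC Geo Names form, capturing the numeric name id.
OLD_URL = re.compile(r"http://ilmbwww\.gov\.bc\.ca/bcgn-bin/bcg10\?name=(\d+)")

def update_bcgn_urls(text):
    """Rewrite every old-style BC Geo Names URL in text to the new form."""
    return OLD_URL.sub(r"http://apps.gov.bc.ca/pub/bcgnws/names/\1.html", text)

print(update_bcgn_urls("http://ilmbwww.gov.bc.ca/bcgn-bin/bcg10?name=51611"))
```

URLs already in the new form do not match the pattern, so running the function a second time leaves them untouched.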
Gathered some resources for PA, and assigned a list of bios to work on. She's also now familiar with the Linux OS.
Fixed this bug, which affected the display of catchwords which were between, rather than within, paragraph tags. This involved a change to the CSS, but then I was able to fix nearly 190 instances of <fw> tags that were inside <p> tags and shouldn't have been. Where a catchword now appears after the end of a paragraph, and the next page starts a new paragraph, the <fw type="catchword"> tag should be positioned after the end of the first paragraph, and before the <pb> tag.
Today the Colonial Despatches Project welcomes its fourth directed reading student, Petria Arienzale. She'll start work tomorrow on some of the biographies, and over the next few weeks she'll be learning about XML, TEI, oXygen and a host of other components of our work.
Saved from Megapode.
This is a list of stuff that KSW reports needs attention (from his Google Doc "Matters for deliberation"):
The Colonial Despatches is an XML database project which is creating a digital archive containing the original correspondence between the British Colonial Office and the colonies of Vancouver Island and British Columbia. The project lives at http://bcgenesis.uvic.ca, and the web application runs on the Pear dev Tomcat. The XML data is managed in SVN at http://revision.tapor.uvic.ca/svn/coldesp/.