This concerns documents that appear in both letter-book and original form, and how to handle this crossover.
For example, we found a dozen or so documents in 1859 that are part of both the 398/1 (BC series) and RG7G8C (VI series) collections. We decided that it was best to show both, but to alert the reader, in each document, to the existence of the other form, whether copy or original.
So, in the RG7G8C version of this file, we added this note:
<note xml:id="B597018_1">Please note that this document exists as a <ref type="doc" cRef="V597018.scx">letter-book copy</ref>, as part of the British Columbia collection.</note>
And in this document, the 398/1 version, we added this note:
<note xml:id="V597011_1">The original form of this correspondence <ref type="doc" cRef="B597011.scx">can be viewed here</ref>. Please note that the original was marked initially as part of the Vancouver Island collection, and changed thereafter, presumably after receipt, to the British Columbia collection.</note>
For now, we have worked through most of the 1859 collection for duplicates. We will have to check in the CO410 collection for the same issue, and do the same for all applicable years.
EDIT: This is fixed as of 2012-01-10.
This task has to do with catchwords and how to position them properly on the website (flush to the right margin of the body text) when they appear after the final paragraph of a given page, just prior to the page break.
For now, we have wrapped the FW tag in a P tag, as in the following example:
[...] with a view to their protection and civilization.</p>
<p><fw type="catchword" rend="text-align: right;">I</fw></p><pb n="rg7_g8c_08/rg7_g8c_08_00043v.jpg"/>
<p>I am glad to find that your sentiments respecting [...]
The above is a workaround, as it could be argued that the catchword does not, in itself, represent a paragraph. Eventually, we will need to develop a way to display these dangling catchwords appropriately and, in the process, remove the P tags we use now.
The complete collection of 712 page images for RG7 G8C Vol 9 (in three different sizes) has been added to the collection. These cover the second part of 1860-61 BC Despatches from London. We originally had only the first 25 of these images.
This is a call to remove B587055A.xml from the coldesp collection, as it is a duplicate of B587056A.xml.
B587055A.xml is incomplete in its transcription of a private letter found in the 398/1 image collection. B587056A.xml, however, provides a complete transcription and, moreover, it follows the correct sequence within the 398/1 image-collection.
361 new images (in three different sizes) have been added to the collection. These cover the second part of BC 1859 Despatches from London.
Stats show a noticeable increase in usage through the semester.
Over 400 new images (in three different sizes) have been added to the collection. These cover the BC 1859 Despatches from London.
Over 600 new images (in three different sizes) have been added to the collection. These cover the BC 1858 and 1860-61 Despatches from London.
439 new images (in three different sizes) have been added to the collection. These cover the Vancouver Island 1862-63 Despatches from London.
KSW pointed out that we had a little press coverage in the TC on Nov 20, so I've added that to the relevant page on the site. I've also added a couple of the new folks working on the site to the Credits page.
The default output size limit for an XQuery is 10000. When requesting a list of "Mentions of this place in the documents" when the place is Vancouver Island, an error was occurring because this limit was exceeded. I've fixed this in two ways:
1. Raised the limit in the server configuration:
<watchdog output-size-limit="20000" query-timeout="-1"/>
That change apparently does not take effect until the server is restarted. However, I didn't want or need to restart the server to fix the bug right now, because of #2.
2. Added this line to the getRefs.xq file to solve the immediate problem:
declare option exist:output-size-limit "20000";
If similar problems show up in future, we can make further increases.
719 new images (in three different sizes) have been added to the collection. These cover the Vancouver Island 1859-61 Despatches from London.
...from Megapode.
These two blocks of XQuery will search for page-image links in the header <biblScope> and in <pb> tags and report any that don't match the expected pattern. That doesn't mean they're bad, just that they need checking.
header <biblScope>s:
xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
for $b in //biblScope[@type="startPageImage"]
let $bits := tokenize($b/@facs, "/")
where not(starts-with($bits[2], $bits[1]))
or not(matches($b/@facs, '((co)|(rg7))_((g8c)|([0-9]{1,3}))_[0-9]{2,2}/((co)|(rg7))_((g8c)|([0-9]{1,3}))_[0-9]{2,2}_[0-9]{5,5}[rv]\.jpg'))
return (xs:string($b/ancestor::TEI/@xml:id), $b)
<pb> tags in the body:
xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
for $pb in //pb[@n]
let $bits := tokenize($pb/@n, "/")
where not(starts-with($bits[2], $bits[1]))
or not(matches($pb/@n, '((co)|(rg7))_((g8c)|([0-9]{1,3}))_[0-9]{2,2}/((co)|(rg7))_((g8c)|([0-9]{1,3}))_[0-9]{2,2}_[0-9]{5,5}[rv]\.jpg'))
return (xs:string($pb/ancestor::TEI/@xml:id), $pb)
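For checking a single link outside the database, the same two tests can be sketched in Python. This is only an illustration of the logic in the queries above, not part of the project code: the function name `needs_checking` is made up, the second sample path is hypothetical, and the sketch escapes the dot before "jpg" in the pattern.

```python
import re

# Pattern taken from the XQuery above; this sketch escapes the dot before "jpg".
LINK_PATTERN = re.compile(
    r"((co)|(rg7))_((g8c)|([0-9]{1,3}))_[0-9]{2}/"
    r"((co)|(rg7))_((g8c)|([0-9]{1,3}))_[0-9]{2}_[0-9]{5}[rv]\.jpg"
)


def needs_checking(link: str) -> bool:
    """Return True if a page-image link fails either test from the query."""
    bits = link.split("/")
    # Test 1: the file name (second segment) should start with the folder name.
    if len(bits) < 2 or not bits[1].startswith(bits[0]):
        return True
    # Test 2: the link should contain the expected pattern
    # (an unanchored search, like the XQuery matches() call).
    return LINK_PATTERN.search(link) is None


print(needs_checking("rg7_g8c_08/rg7_g8c_08_00043v.jpg"))  # False
print(needs_checking("co_305_13/co_305_17_00012r.jpg"))    # True (hypothetical mismatch)
```

As with the queries, a True result only means the link deserves a second look, not that it is necessarily wrong.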
A detailed write-up of the information below has been added to the Guidelines document. For now, the following examples, where the attributes are emphasized, should suffice:
Quick-and-dirty XQuery to generate stats for 1860:
xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
let $pbs := count(//TEI[substring(@xml:id, 2, 2) = '60']//pb[@n]),
$biblScopes := count(//TEI[substring(@xml:id, 2, 2) = '60']//biblScope[@type='startPageImage']),
$tot := ($pbs + $biblScopes)
return concat('Page-break tags: ', $pbs, '; biblScopes: ', $biblScopes, '; total: ', $tot)
xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
let $names := count(//TEI[substring(@xml:id, 2, 2) = '60']//name[@key])
return $names
Made some updates to the stats page (stats.htm) so that it displays more useful data about the state of completion of the documents.
A "peripheral_vessels.xml" file was created to house vessels mentioned in files other than the despatches. For example, in Captain Cook's biography, we might mention his ship, Discovery, which does not appear in the despatches, at least not in the content transcribed currently.
As we discussed as a team, it seems odd that the online reader should encounter some vessels tagged and others not. After all, readers do not know which vessels occur in the letters and which do not. The peripheral-vessels file cures this potential for confusion.
Lastly, should a vessel that appears in the peripheral-vessels file one day turn up in the despatches, say, if the enclosures are eventually transcribed, then we would move the respective vessel entry over to the "vessels.xml" file, a simple copy/paste operation.
144 new images (in three different sizes) have been added to the collection. These cover the Public Accounts for Vancouver Island, 1857-1860.
KSW noticed that TS had forgotten to commit his changes to SVN, but I was able to log into his machine as hcmc and sudo svn commit. Have to make sure that didn't result in any permissions changes that would prevent future updates or commits.
This page will list our SVN conundrums and how we solved them! And, should this page miss something, check this website.
svn diff -r [version number]:[version number]
As in this example:
svn diff -r 460:481 B60001.xml
This was used to compare two versions of the same file: B60001.xml at version 460 and at version 481. The SVN report uses little plus and minus signs to mark lines and content added or removed, respectively.
Set up KSW with upload privileges over most of the data areas of the db, so he can refresh files whenever he needs to.
TS joined the team today as a workstudy. I've set him up on Onion, and we've gone through the procedures around use of SVN and oXygen. KSW will take over on Tuesday.
As in title.
I have completed an image inventory for all the myriad collections we have on file, in a few locations. This can be viewed here as a webpage, which is updated automatically whenever any future changes are made.
One of our existing documents from 1860 was found in CO 305 vol 17 (no other 1860 documents are in there), so KSW has processed the first 20-odd pages of 305/17, and we've added them to the collection, but the rest remain to be done when 1861 is being processed.
526 new images (in three different sizes) have been added to the collection. These cover the Entrybooks of Correspondence for BC, 1858-1861. They are currently being linked into the transcription documents.
2382 new images to go up, and I realized I hadn't previously documented the changed paths resulting from the change from the old Rutabaga to the new DS hardware. Here's the command now: log into nfs.hcmc.uvic.ca, go to /home1t/coldesp/www, and run:
rsync --stats --recursive --times --delete --verbose -e ssh jpg_scans/ "mholmes@rutabaga.hcmc.uvic.ca:/volume1/homes/mholmes/Colonial_Despatches/www/jpg_scans/"
Three new sets of page-images have now been added to the collection:
In all, 2382 new images have been added (at three different sizes, as usual). Substantial updates have also been made to the document markup for 1860, and additions have been made to the biographies and the place database. (All this good work is of course Kim's; my role is just to integrate all the changes into the database, make backups, update the information pages on the site, etc.)
Retrieved ytd stats from Megapode.
Stats for the first half of the year retrieved from megapode.
Now the new Rutabaga is online, I was able to update the backups of the jpg_scans image tree with an almighty rsync operation.
A further 1348 page-images have been added to the manuscript image browser, covering Vancouver Island Public Offices and Miscellaneous correspondence, 1860. Transcriptions are now being linked into these images.
Yesterday I generated all the OAI metadata using my local copy of the main eXist 1.1.1 application, forgetting that its XQuery functionality is limited compared with the newer one. The resulting records were missing lots of lookup data from the ographies, so I've regenerated and re-uploaded them.
A page has been added to the site explaining the OAI-PMH metadata interface and how it works, with links to example queries returning XML responses.
My local copy of ColDesp was out of sync and out of date, so I've updated it. This took a bit of diffing to figure out which files had been removed from the set, along with their corresponding OAI files. Then I regenerated all the OAI records and uploaded them into the server db. Finally, I backed everything up, and took a complete local copy of Tomcat + eXist to copy to my laptop, for the conference, in case of connectivity issues.
The index pages (people, places, vessels) have some peculiar referencing complexities, in that any item on the page can have links which reference other items on the same page, or items which must be retrieved by AJAX. Previously, in the case of a local item, the JavaScript was moving the content from its normal place on the page into the popup, and then in theory putting it back again, but that actually resulted in a blank space in some situations (such as when the popup was closed, rather than being filled with new content). I've now rewritten the system so that it does what the Mariage site does: it clones the content of the item into the popup, and deletes it when it's done.
The problem: on the places index page, when you click on a place within a places write up, the clicked-place vanishes from the places index. Rather than cloning the content, the link appears to relocate it. This may be happening in the bios and vessels list as well.
A further 941 page-images have been added to the manuscript image browser, covering Despatches from Vancouver Island to London, 1860. Transcriptions are now being linked into these images.
YTD stats retrieved and stashed.
I've tweaked the XQuery which produces the KML file so that it can recognize when a <place> has <location type="path">, and in that case it produces a LineString element instead of a <Polygon> or a <Point>, and it doesn't supply the closing georef which repeats the first, which is what we use to close a polygon.
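The choice between the three geometries can be sketched in Python as follows. This is a rough illustration of the rule described above, not the XQuery itself: the function name `kml_geometry`, the `is_path` flag (standing in for type="path") and the sample coordinates are all assumptions.

```python
def kml_geometry(coords, is_path=False):
    """Build a KML geometry element from a list of (lon, lat) pairs."""

    def fmt(points):
        # KML coordinates are "lon,lat" tuples separated by whitespace.
        return " ".join(f"{lon},{lat}" for lon, lat in points)

    if len(coords) == 1:
        return f"<Point><coordinates>{fmt(coords)}</coordinates></Point>"
    if is_path:
        # A path is left open: the first coordinate is not repeated at the end.
        return f"<LineString><coordinates>{fmt(coords)}</coordinates></LineString>"
    # A polygon ring is closed by repeating the first coordinate at the end.
    ring = list(coords) + [coords[0]]
    return (
        "<Polygon><outerBoundaryIs><LinearRing><coordinates>"
        f"{fmt(ring)}"
        "</coordinates></LinearRing></outerBoundaryIs></Polygon>"
    )


# Hypothetical coordinates for a short stretch of river.
river = [(-123.1, 49.2), (-122.9, 49.3), (-122.5, 49.4)]
print(kml_geometry(river, is_path=True))
```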
We use paths for rivers. To make a path, or line, appear on Google Maps you need to override the code that automatically converts multiple (not single, of course) <geo> coordinates to polygons.
To do this, you need to add a type="path" attribute to the <location> tag, as follows: <location type="path">.
Although it's a bit ugly. No time to get too fancy, unfortunately.
Did most of a diagram of the OAI process today.
A further 1298 page-images have been added to the manuscript image browser, covering British North America 1860 Public Offices and Individuals. Transcriptions are now being linked into these images.
Today's progress, with help from KSW:
Retrieved the original launch documents and fixed missing images caused by hard-coded paths to raster graphics in the SVG. Updated the "Despatches by numbers" one. Over the next couple of days, we'll reformat these for an appropriate size, make an OAI one, and get them printed and laminated.
Got RM set up with SVN instructions and working on the places.xml file. He's now made some edits, so I've updated the db and added him to the project team page.
Nothing unusual.
This lists some of the errors that we have encountered, and their respective solutions!
Upon my morning svn update I received this error:
kim@dandelion:~/Desktop/coldesp_xml/xml$ svn update
svn: warning: cannot set LC_CTYPE locale
svn: warning: environment variable LANG is en_CA
svn: warning: please check that your locale name is correct
SOLUTION: type this into terminal
export LC_ALL=C
I found this info here.
Use this to find phrases excluding certain words. For example, if you want to find Hudson's Bay, but without "Company," "House," "Territory," and so on:
This info taken from here: http://www.regular-expressions.info/lookaround.html
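A minimal sketch of the technique, using a negative lookahead in Python's `re` module. The expression here is an illustration written for this note, not necessarily the one originally used, and the list of excluded words is just the three named above.

```python
import re

# Match "Hudson's Bay" unless one of the excluded words follows it.
# The excluded words are illustrative; extend the alternation as needed.
pattern = re.compile(r"Hudson's Bay(?!\s+(?:Company|House|Territory))")

samples = [
    "He sailed into Hudson's Bay in July.",
    "a letter from the Hudson's Bay Company",
]
for sample in samples:
    print(bool(pattern.search(sample)), "-", sample)
```

The lookahead `(?!...)` checks what comes next without consuming it, so the match itself is still just "Hudson's Bay".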
Catching up with recent changes -- set my local version of the app to regenerate all the OAI records from the latest XML files while out at the dentist, then copied them down from there, committed them to SVN, and uploaded them into the live app.
Completed the addition of new placenames from 1859; there are 59 additions. Hopefully, Theo and Shaun can get through their respective tasks in time to help complete them.
Following yesterday's post, I've rewritten the XQuery that handles generating the KML files that are passed to Google Maps, along with tweaks to the XSLT that creates the map links in place data display on the site:
getKml.xq file.
<LinearRing> (for a polygon) or a <Point> (for a point) tag into the KML.
<place>/<desc> tags is now output as plain text twice in the output KML: once as a <Snippet> element (which means it shows up on the left of the map, next to the placename), and once again as a <description> element (which means it shows up inside the "speech bubble"-style popup when you click on the location's pushpin marker).
Right now, we're handling the link between the places entries and Google maps in this way:
<geo> element) for the location, we're simply passing the coordinates and a placename to Google Maps, which then displays the point.
What we need to do:
The biographies files have been tuned up, and I have added an 1859 biographies file to the "Bios" folder. I have yet to add the list of new names from 1859, as I am waiting for Theo to finish his additions from 1858; the names I found in '59 may have occurred first in '58.
In all the current biography files, I have deleted duplicate entries and formatted correctly several of the entries, and where required, updated the necessary tags in the despatches. Along the way, this process has revealed ways in which we can standardize further the way we handle biography entries. I will post these amended standards on the Guidelines document.
I have completed my file-by-file pass of all the 1859 files, adding place, date, people, vessel, and First Nations tags.
Next, I will add the new-found names of people, places, and vessels to their respective content files. I expect this to take roughly two days, and then we will require entries to be written for, at least, the place names and vessels.
Retrieved from Urchin.
It seems I can only generate OAI files correctly using my recent Cocoon/eXist build locally. I really need to port the project over to that, but it might be complicated; it certainly requires a rebuild of my xqsearchutils library, which is currently generating errors...
Began the process of looking through M's incoming transcriptions, of which there are 156 files. The principal difficulties initially are these:
SM is working on abstracts, and we've now set him up with access to the SVN. He's working on 1852.
Compiled some stats for govlet.ca and coldesp for the ACDP performance indicators report, for CP.
Spent an hour looking at some sample scans of maps from HBC, and discussing with CP by email what we might order in the way of digital images. Also some requirements for reporting for the ACDP have come down the pipe, so I've asked sysadmin to add govlet.ca to my Urchin stats set.
Here is where we sit for the moment:
I have to have a last file-by-file pass at the 1859 files to scan for missed people, place, vessel, and mentions of Indigenous People. Much of this has already been done with find/replace, for example, with the common names and places, but I need to do a final pass to catch the stragglers.
Theo continues to add place-name tags to files from the 1858 collection, as this was missed in the last round. This is a welcome change for the poor man, as I had him parsing images for eons. I suspect that by the time his time with us is completed (as an RA), he will have completed this task, perhaps more.
Shaun has turned to the writing of abstracts, picking up where we had left off: 1852. As he is here as part of a Directed Reading, through the work of Susan Doyle in the Pro. Writing Department, he will be leaving us when his semester ends, around the first week of April.
I have completed the date tags for 1859. There may be the odd one missing, but I will catch them when I scan through each file for new place, people, and vessel names.
120 changes; these are the regexps:
(?<!abbr>)([Dd])ep(<hi rend="[^"]+super+[^"]+">t\.?</hi>\.?)
<choice><abbr>$1ep$2</abbr><expan>$1epartment</expan></choice>
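To try the find/replace pair on a sample string outside the editor, here is a rough Python equivalent. Python writes backreferences as `\1` and `\2` rather than `$1` and `$2`. The sketch keeps the literal "ep" inside `<abbr>` so that the abbreviated text survives intact; treat that as an assumption about the intended result. The function name `expand_dept` and the sample `rend` value are made up for the example.

```python
import re

# The negative lookbehind skips abbreviations that are already wrapped in <abbr>.
FIND = re.compile(r'(?<!abbr>)([Dd])ep(<hi rend="[^"]+super+[^"]+">t\.?</hi>\.?)')
# Assumption: the abbreviation keeps its "ep"; the expansion spells out the word.
REPLACE = r"<choice><abbr>\1ep\2</abbr><expan>\1epartment</expan></choice>"


def expand_dept(text: str) -> str:
    """Wrap each superscript "Dept" abbreviation in a <choice> with its expansion."""
    return FIND.sub(REPLACE, text)


# Hypothetical sample; the rend value is not taken from a real file.
sample = 'Colonial Dep<hi rend="vertical-align: super;">t.</hi>'
print(expand_dept(sample))
```

Because of the lookbehind, running the replacement a second time over the same text leaves it unchanged.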
I'm not going to do this often, but the OAI records were very out of date, so I've regenerated them all locally and uploaded them back into the db. Takes a long time -- I wonder if there's a way to script it so it could run unattended somehow...
Grabbed the stats for January.
A further 1422 page-images have been added to the manuscript image browser, covering British Columbia 1859 Public Offices Part 2, and Miscellaneous. Transcriptions are now being linked into these images.
To assist KSW with automating some of the name markup, I've been generating lists of distinct values for name variants, using this code:
xquery version "1.0";
declare namespace xdb="http://exist-db.org/xquery/xmldb";
declare namespace util="http://exist-db.org/xquery/util";
declare namespace f="http://exist-db.org/f-functions";
declare namespace tei="http://www.tei-c.org/ns/1.0";
(:declare namespace fn="http://www.w3.org/2005/xpath-functions";:)
declare function f:getContents($id as xs:string) as element()*
{
for $d in distinct-values(collection('/db/coldesp/correspondence/')//tei:name[not(@type)][@key = $id])
return
<name>{$d}</name>
};
<people>
{for $id in distinct-values(collection('/db/coldesp/bios/')//tei:person/@xml:id)
return
<person xml:id="{$id}">
{f:getContents($id)}
</person>}
</people>
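For spot-checking a single file without going through the database, the grouping step can be approximated in Python. This is a simplified sketch under stated assumptions: it reads one TEI document from a string instead of the whole correspondence collection, it only reports keys that actually occur in that document (so it cannot show people with no mentions), and the function name `name_variants` and the sample keys are invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"


def name_variants(xml_text: str) -> dict:
    """Map each @key to the sorted distinct name strings tagged with it."""
    root = ET.fromstring(xml_text)
    variants = defaultdict(set)
    for name in root.iter(TEI + "name"):
        # Mirror the XQuery filter: <name> with a @key but no @type.
        if "key" in name.attrib and "type" not in name.attrib:
            variants[name.get("key")].add("".join(name.itertext()).strip())
    return {key: sorted(values) for key, values in variants.items()}


# Hypothetical fragment; the keys are not real entries from the bios files.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    '<name key="douglas_j">Douglas</name> wrote as '
    '<name key="douglas_j">Governor Douglas</name> from '
    '<name key="victoria" type="place">Victoria</name>.'
    "</p></body></text></TEI>"
)
print(name_variants(sample))
```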
Throws up some interesting things that look like they might be typos, as well as many names that don't seem to have any mentions in the text. I'm investigating.
Added in the First Nations information where it's been tagged as <dc:subject> tags, and regenerated the records. This goes pretty quickly in the 1.4.1 version of the db; I should definitely move forward with the port asap.
Regenerated all the OAI records using my local version of the site based on eXist 1.4.1 (it seems to be faster, even without indexes properly configured), and also added a couple of links to the site as requested by CP, one to the Govlet site and one to the Libraries Early BC Maps page.
Generating these records takes a good while. I keep hitting little buglets in the XQuery which require me to restart the process. Hopefully we're pretty solid at this point.
Struggling with the strange behaviour I was seeing, where a function would execute correctly if called directly, but not if called from a for loop, I discovered two things: the same problem still exists in eXist 1.4.1, but there I see an actual error to the effect that the context is missing for a node; and I can eliminate the problem by rewriting some XPath inside the query. This is the XPath that was causing the problem:
for $n in (distinct-values($doc//@key[parent::tei:name[not(@type)]]))
Admittedly it's a bit perverse. If it's rewritten like this:
for $n in (distinct-values($doc//tei:name[not(@type)]/@key))
then the query works even when executed inside a loop. This means I can now generate all the complete OAI records that JD would like to see.
I have added Page-Break tags to the CO 60/5 files. All that remains for the 1859 files is CO 60/6, the images for which Theo will be done with by week's end. In the meantime, I can begin tagging people, places, ships, and mentions of First Nations in the 1859 files generally.
I've hit a snag with the OAI stuff which is almost certainly an eXist bug: when I generate an OAI record individually, passing an xml:id to the function that does it, then all the people, places etc. are included, but when I try to generate records in a loop by passing all the ids in, then that information is not included. I've tried configuring extra indexes and all sorts of other workarounds but it seems insuperable, and I'm certainly not going to generate 7,143 records one at a time. I'm now forced to look at updating to the new build of eXist/Cocoon to see if the bug is present there. If it's not, then that's a solid reason for doing the migration right now.
A further 1250 page-images have been added to the manuscript image browser, covering British North America 1859 Public Offices/Hudson's Bay Company. Transcriptions are now being linked into these images.
A further 1108 page-images have been added to the manuscript image browser, covering BC 1859 Despatches to London and Public Offices Part 1. Transcriptions are now being linked into these images.
oai_update.xq file, so that the <record> elements have the right schema and namespace data. I also realized that the <dc:identifier> element should contain a full working URL to the transcription file, so now it does.
@xml:id attribute no longer appears on the <record> element when outputting it in response to a request.
I'm now uploading the records to the main db, and I'll do the requisite testing on that using the online tool linked in my prior post. Once that's done, I'll consider it functional and ready for testing by the library folks.
I now have the paging and resumptionToken functionality working, and I've started testing the repository output using this online tool. Mostly it's working fine, but I have three issues with validation of the output of ListRecords: I'm storing the id of the document in an xml:id attribute in the record element, and that's not allowed; I have a broken xmlns:xsi attribute in some of the records, caused by a now-fixed bug in oai_update.xq; and the xsi:schemaLocation attribute seems not to be allowed (probably caused by the foregoing issue). I should be able to get these fixed by tomorrow; I'll have to regenerate all the OAI files, though, and then export/import them from my local machine to the Pear instance.
Updated changed files in the db; added the OAI files to SVN; and continued work on the OAI interface. I now have everything working except for the paging out of results with the resumption token, which I hope to complete tomorrow. Then I'll need to add better indexing for the OAI files, and we should be ready to go.
Shaun Macpherson will be joining the project this week, in an editing and copyediting capacity.
His work for the project is part of a directed study course that he and Susan Doyle have constructed, with input from Martin and me, as part of the English Department's Professional Writing program. This is an unpaid position, and Shaun is working for course credit. Shaun will work/learn for 6 hours per week under my guidance and direction, until sometime in April--roughly, the semester's end.
We look forward to his contributions! To start, he will incorporate Frank Leonard's latest batch of biographies from 1848; he will then move on to write, edit, and code abstracts.
Just a quick update to say that Theo is processing images for the CO 60/6 collection, of which there are 117 files in 1859. I will now move on to process the images for CO 60/5, of which there are 98 files in 1859.
Finished the addition of PB tags to the 1859 files found in CO 305/13 and CO 410/1. The latter, letterbook copies, pointed occasionally to enclosures that exist in the originals. Presumably, if and when we track down the originals, we will have to add these items.
The Cocoon/eXist build that's housing the current ColDesp application is pretty long in the tooth, and we're going to need some newer features in the coming months (see the previous post). So I've been preparing the way for migration by testing the current application as it runs inside our latest build. This is what I had to do:
<map:generators default="file">
<map:generator logger="sitemap.generator.text" name="text" src="org.apache.cocoon.generation.TextGenerator"/>
</map:generators>
and all instances of this:
<map:transform type="session"/>
I don't think the former was actually used (no sign so far), and the only possible function of the latter was to handle the authentication for blocking access to non-UVic users; that will need some careful testing.
These are my conclusions so far:
Working:
Failing or suspect:
This is not exhaustive, but it's quite heartening; it suggests that reworking the collection.xconf files (and perhaps adding some for the actual documents) could be all we need to do to get a basic working web application. I can then test it side-by-side with the old one, to determine relative speed and see if there are any slowdowns that need to be worked on in the new version. Assuming it performs at least as well as the old one, there's no reason not to migrate.
Today I added two new functions, f:pruneRecord($recId as xs:string) and f:pruneRecords(). These check one or all of the OAI records in the database to see if there's a parallel correspondence document for it; if not, it's presumably been deleted, and the OAI record is also deleted. Tested (locally) and working.
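The pruning rule itself is simple enough to sketch in a few lines of Python. This is not the XQuery implementation, only an illustration of the check it performs; it assumes that an OAI record and its correspondence document share the same id, and the function name `prune_records` is invented.

```python
def prune_records(oai_ids, document_ids):
    """Split OAI record ids into (kept, deleted), given the ids of documents that still exist.

    Assumption: an OAI record and its correspondence document share the same id.
    """
    existing = set(document_ids)
    kept = [rid for rid in oai_ids if rid in existing]
    deleted = [rid for rid in oai_ids if rid not in existing]
    return kept, deleted


# Illustration: a record whose document is gone is pruned; the other is kept.
kept, deleted = prune_records(
    oai_ids=["B587055A", "B587056A"],
    document_ids=["B587056A", "B60001"],
)
print(kept)     # ['B587056A']
print(deleted)  # ['B587055A']
```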
Although the actual API for OAI requests and responses is relatively simple, the specification allows for a harvester to make large numbers of requests which can require significant amounts of data from the database; this is especially so in the case of our collection, in which we have a great deal of metadata to offer about every document. There is the risk that a single harvester could hit the db with enough data requests to slow the site for other users. For that reason, I'm taking a three-stage approach to designing the OAI features:
First of all, I have a set of routines that generate and maintain individual OAI records for every document in the correspondence. This is basically complete, and I have a full set of OAI records in my development version of the db; I also have routines written which check all the existing documents and refresh the metadata if they've changed, delete obsolete metadata records, and so on. This part of the work is done.
The next stage will be to write the actual query interface which allows a harvester to request this metadata from the db. Once that's done, the db will be able to provide metadata to a harvester in accordance with the specification.
Finally, I need to address the issue of maintaining the OAI records in the live database. The process of checking and regenerating records takes quite a long time at the moment (between ten and thirty minutes, depending on what it has to do). I don't want to be running that kind of intensive process on the live db. In the meantime, I can run it periodically on my local copy of the db, and upload the resulting records, but ultimately we want to have a more flexible system whereby any change to a document results in an update to its associated OAI record. I can do this using triggers in the eXist database, but it will require an update to a more recent version of eXist. This is something I've been planning for a while, and it will bring other benefits, but it's something I'll have to test carefully before we deploy it.
So I'd expect actual OAI functionality to go live by the end of the month, if nothing unexpected intervenes, and then I'll start the port to the new version of eXist. If that goes smoothly, implementing triggers shouldn't be too complicated, and then the OAI records will maintain themselves.
KSW and I wrote some content for JL and CP, to help with the latest grant application.
Got the complete 2010 stats (the ones we care about) from Megapode.
The Colonial Despatches is an XML database project which is creating a digital archive containing the original correspondence between the British Colonial Office and the colonies of Vancouver Island and British Columbia. The project lives at http://bcgenesis.uvic.ca, and the web application runs on the Pear dev Tomcat. The XML data is managed in SVN at http://revision.tapor.uvic.ca/svn/coldesp/.