Long lunch, and leaving early to watch the last of my NLP lectures for this week. Getting those hours down...
Following our recent presentation, I've been thinking a lot about the idea that the true "map" of the project is the XML db, and it occurs to me that a good illustration of this would be a network map of locations in the database. This could be done by measuring, for every two locations that share a parent document, the proximity between them, and then by calculating the average proximity between each pair of locations across the whole db. Then you could use that to create a network graph.
I've been trying to think of ways to calculate the proximity of two XML tags, and I think it could be done with XPath:
string-length(concat($tagOne//following::text()[following::???$tagTwo], ''))
although I'm not quite sure how to phrase the last bit...
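A minimal sketch of the same measurement in Python rather than XPath, in case it helps pin the idea down. It assumes that "proximity" means the number of text characters between the end of the first element and the start of the second, and that the first element precedes (and does not contain) the second. The function names and the sample sentence are invented for illustration.

```python
import xml.etree.ElementTree as ET


def text_offsets(root):
    """Map each element to the (start, end) character offsets of its text content."""
    offsets = {}
    pos = 0

    def walk(el):
        nonlocal pos
        start = pos
        pos += len(el.text or "")
        for child in el:
            walk(child)
            # Text following a child belongs to the parent, but still advances the count.
            pos += len(child.tail or "")
        offsets[el] = (start, pos)

    walk(root)
    return offsets


def proximity(root, first, second):
    """Characters of text between the end of `first` and the start of `second`."""
    offsets = text_offsets(root)
    return offsets[second][0] - offsets[first][1]


# Made-up sample; real input would be a TEI document from the db.
doc = ET.fromstring(
    '<p>At <name key="a">Cheapside</name> we turned towards '
    '<name key="b">Cornhill</name>.</p>'
)
a, b = doc.findall("name")
print(proximity(doc, a, b))
```

Averaging this value over every document that two locations share would give the edge weight for the network graph.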
Simple XQuery to pull out the data:
xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
(: One <map> per TEI document, carrying its first title and its Penfold number if present. :)
<maps xmlns="http://hcmc.uvic.ca">
  {
    for $t in //tei:TEI
    return
      <map xml:id="{$t/@xml:id}">
        {
          (: Parenthesized so that [1] picks the first title in the document, not the first title under each parent. :)
          if ($t//tei:title) then
            <title>{($t//tei:title)[1]/text()}</title>
          else
            ()
        }
        {
          if ($t//tei:idno[@type="penfoldNum"]) then
            <penfold>{$t//tei:idno[@type="penfoldNum"]/text()}</penfold>
          else
            ()
        }
      </map>
  }
</maps>
I might have to add more data points to the output; in fact it might be worth just pulling out the whole of the sourceDesc. I'm currently looking at the possibility of enhancing my UniSymMetric Java class so it could be called as an extension function from XSLT in Saxon; that would give me a fallback when there's no Penfold number, and it might be handy in all sorts of other ways too.
JD pointed me at an OAI feed from ContentDM, which is exactly what I need for my metadata harvesting. This is my plan:
I've started work on an XSLT stylesheet to do the job. The purpose of the stylesheet is to process detailed OAI metadata records which use Dublin Core identifiers into teiHeader elements suitable for adding to TEI documents in the Colonial Despatches project.
The OAI metadata is in the file oai_from_contentdm.xml, and originates in the UVic Library's ContentDM system. It contains 261 records relating to Early BC Maps, and most of these are maps also in the Colonial Despatches project collection. The ContentDM metadata is well-organized and has been considerably enhanced, so we're going to take that data and generate new teiHeader elements for our TEI files from it.
The first stage is to create a mapping between each of the fields in the OAI data and the location in the teiHeader where we propose to store it.
Input documents:
Output documents:
Adding this as a task for me, long-term, because it needs to be part of the plan for the next phase of the project.
I had pointed JT at fo_925-1650_pt_1_24_vic_harbour_1847, which is Penfold 576, for the Kellett map of Victoria Harbour, but it turns out he wanted Penfold 577, which is fo_925-1807_vic_1848. I've slightly enriched the metadata for 577 using data from ContentDM, manually, but there should be a way to do this mechanically because the ContentDM metadata is organized into clear fields. Ultimately, it would be a good idea to find some way to get at this metadata and pull it into our headers, so we'll have to write a mapping between the two. Here's an example of the ContentDM data in HTML:
http://contentdm.library.uvic.ca/cdm/singleitem/collection/collection5/id/130/rec/2
It claims to be XHTML, but it's not even well-formed, never mind valid, so it couldn't be parsed with e.g. XSLT unless it was tidied first. Hopefully there's a more helpful feed from it. I'm contacting JD about that.
Dating of maps is inconsistent for maps which have a notBefore and/or notAfter. Check them in the sorted gallery, find oddities, and normalize. Did some today.
Late duty.
The bookstore is using lighter paper than we're used to, so my calculations for the spine width were off. I'd also omitted one of the editors from the cover, so a rework was necessary anyway. Got that done, and sent off the new PDF of the cover, and a new PDF for the document, incorporating Rudling's last-minute changes.
With SA, synthesized all the various changes and suggestions into a single document, then met with the folks from P & A and finalized them all. SA has merged them back into the final spreadsheet, and we're ready to get to work. I created the primary folders.
Did some auditing of the "Marion's transcriptions" spreadsheet that we're using to keep track of the transcriptions awaiting markup, since PCA has been working on these; checked filenames and made updates and notes where appropriate. Also fixed file naming issue reported by PCA, and did some other housekeeping.
note to self on nuts and bolts
on local file system:
create the folder structure you want (if you're copying an existing local instance of an svn project, you have to delete the .svn folder from each folder in that project)
on command line,
cd to parent folder of the one you want to add (that parent folder has to already be in svn)
svn add FOLDER_YOU_WANT_TO_ADD
svn commit -m "message about adding new folder"
JT provided two new maps for the gallery, so I've added those. I had to refresh myself on the procedure for doing this, so I'll detail it here:
There are three files in the site which contain database connection strings:
inc/config_EDIT_ME.inc
content/maps/include/conf_EDIT_ME.inc
content/maps/include/config_EDIT_ME.xml
In each of these three files, the values for the database connection string have been replaced with placeholders. You have to make a copy of each of those files with the following names:
inc/config.inc
content/maps/include/conf.inc
content/maps/include/config.xml
In the copies, substitute the correct values for your connection string.
If the folder is in svn (which it probably is), you'll need to use svn add to add each of the files to the repo, then do your svn commit.
... at the author's request, approved by JT.
I've assigned the first five 1859 documents transcribed by MM to PCA; the 1858 documents are rather complicated, and the existing 1858 documents need some editing, so it's simpler to work on the 1859 documents for the moment. The Google spreadsheet records the status of each document.
Used svn switch --relocate oldURL newURL to point my local copy at the new URL for the svn repo.
example:
svn switch --relocate https://revision.tapor.uvic.ca/svn/reponame https://revision.hcmc.uvic.ca/svn/reponame
updated my local files, then used the exist admin client to upload 4 modified data files to the database.
Root of svn tree is at https://revision.hcmc.uvic.ca/svn/hcmc/
Leaving early today for an appointment; taking tomorrow off with SA's agreement to work in peace on NLP coursework, and to burn up some of the G&T hours that have stacked up.
I'm half way through this, and I'll have to finish it at home. Deadline tonight, which I won't meet, but it's hard to get more than a few minutes of uninterrupted time during the work day.
DONE: The transcription of the document 58-01-21_HBC748.rtf is marked up as the file V585MI30, when it should be V585MI02_A. It is already up on the site.
All page numbers entered and proofing attribute removed at JT's request. Vol 19 remains unpublished for a few months.
Planning discussions.
All vessels referred to in the Schedules which have obvious existing vessel bios have now been linked (including one correction to a typo, "Fartar" instead of "Tartar"). The remaining vessels, for which new vessel bios will be required, are:
Alexandra, Cameleon, Devastation, East Lotherian, John Bright, John Stephenson, John Stevenson, Kingfisher, Nanaimo Packet, Ossifree, Prince of the Seas, Random, Royal Charlie, Scout, Scylla, Severn, Shenandoah, Sutlej
It's likely that the John Stephenson and John Stevenson are the same vessel, and possible that they're actually the John Stevens.
The William Allen was tagged as "william", which made it confusable with the Brig William ("william_brig"). I've now changed the vessel bio and all references to it to show "william_allen". Also fixed an encoding issue in an 1854 document that I stumbled across.
Thanks to some excellent work from Petria Arienzale, abstracts have now been added for all 1854 documents. We now have abstracts for all years between 1846 and 1854.
Finished off the last lecture (started at home), then worked through the problem set for the week. Got full marks first time, which was astonishing to me.
List of keywords for all articles entered.
More proofing corrections from JT. In the process, found another misplaced footnote tag. Also added superscript handling to XHTML rendering (it was oddly missing).
Leaving early to watch NLP lectures...
Reviewed PCA's latest work (excellent) and sent comments. Also noticed a couple of issues in other documents and fixed them.
Received and entered corrections from the author; found many other issues which I corrected and reported to JT. This needs another proofing, I think.
Working with very long, slow transformations -- have to do other tasks while transformations run, then examine the results, tweak, and set it off again. Very slow process.
Met with CC and examined some of the outcomes from our rulesets. There's obviously a huge amount of tuning still to do, but it's also clear that before each rule is run, the word needs to be checked against the dictionary in case it's already OK; if it is, then we don't need to keep working on it. I've now implemented that by turning the spell-check dictionary into an XML file which is then indexed with xsl:key (I tried other string-finding methods but they were much slower). The transformation now takes substantially longer than it used to, but it's clearer what's happening. One issue might be archaic forms in the spell-check dictionary, of course.
Another issue is u/v variation. When we change one to the other, we often end up changing it back in a later rule. It seems likely that a better approach would be to change all u/v to another unused symbol, and then write rules based on context for changing that symbol to the appropriate output.
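A rough sketch of the placeholder idea in Python, to show the shape of it rather than the real rules. The placeholder character, the vowel list and the three context rules are all assumptions made for illustration; the actual contexts would have to come out of the ruleset tuning.

```python
import re

PLACEHOLDER = "\u00a7"  # the section sign, assumed never to occur in the source texts
VOWELS = "aeioyàâéèêëîïô"

# Hypothetical context rules, applied in order to the neutralized word.
CONTEXT_RULES = [
    # Word-initial placeholder before a vowel becomes a consonant.
    (re.compile(rf"^{PLACEHOLDER}(?=[{VOWELS}])"), "v"),
    # Placeholder between two vowels becomes a consonant.
    (re.compile(rf"(?<=[{VOWELS}]){PLACEHOLDER}(?=[{VOWELS}])"), "v"),
    # Anything left over is treated as a vowel.
    (re.compile(PLACEHOLDER), "u"),
]


def resolve_uv(word):
    """Neutralize every u/v, then decide each one from its context."""
    neutral = re.sub("[uv]", PLACEHOLDER, word)
    for pattern, replacement in CONTEXT_RULES:
        neutral = pattern.sub(replacement, neutral)
    return neutral


for form in ["auoir", "vn", "vous", "que"]:
    print(form, "->", resolve_uv(form))
```

Because each u/v is decided exactly once, no later rule can flip it back. The three rules here are deliberately crude: a form like viure comes out unchanged rather than as vivre, which is the kind of gap the context rules would need to close.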
Posting time spent with LSPW figuring out how to port the old colloquium materials over to the Cascade site.
Met with PAB and JT to discuss moving Beck to Cascade. The decision is to wait until PAB's PhD is finished, at the end of the year.
The equations are taking time to understand...
Broken link reports came in with many links on Hist still broken from before, so I went through them and fixed any that are genuine (many are not -- Xenu seems to report lots of links which are perfectly OK). Reported reasons and changes to TG. Also checked out some odd items on the otherwise-empty WOST report which seem to be for the CFUV site, not from WOST at all. Reported that to CB, who fixed it.
Long lunch, and leaving a mite early.
I've been doing preliminary work on the text-extraction and normalization problem. I've completed the initial rather difficult task of extracting the text and linking it back to the original locations in the source document, and I've started playing around with normalization rules. I took the long series of substitutions CC sent me in an earlier email and encoded them as search/replace operations; I'm using a spreadsheet to store them, and generating the required XML block automatically from it. I've since tweaked a few of the rules. I'm working with duchesse_de_milan.xml as a test document initially, and I've hooked in a CSS stylesheet which makes it almost readable in original and "normalized".
More often than not, our current rules take a good word and turn it into something incorrect. That's partly because many of the rules are, as yet, underspecified; for instance, some rules should only act at the beginning of a word, and others only in very specific contexts. Working through the rules to improve them, based on the errors, will help a lot, and I think we'll also be able to improve the output by putting them in a particular order.
The other thing that's missing, at the moment, is a check on the word before it's normalized; I should be checking each word against a modern dictionary before anything is done to it, and only making changes if it turns out not to be a good dictionary word.
But I think we can see the scale of the task ahead of us. It'll take some months to refine our ruleset to the point where we're getting consistently good results.
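The dictionary check described above could look something like this in Python. It is only a sketch under stated assumptions: the word list and the two rules are tiny invented stand-ins for a real modern dictionary and for the rules held in the spreadsheet, and the function name is hypothetical.

```python
import re

# Hypothetical stand-ins for the real dictionary and ruleset.
MODERN_WORDS = {"est", "estime", "état", "vous", "que"}
RULES = [
    (re.compile("ſ"), "s"),            # long s to round s
    (re.compile(r"^es(?=t)"), "é"),    # illustrative only: estat -> état
]


def normalize(word, dictionary=MODERN_WORDS, rules=RULES):
    """Apply rules in order, stopping as soon as the word is a dictionary word."""
    current = word
    for pattern, replacement in rules:
        if current in dictionary:
            break  # already a good word: leave it alone
        current = pattern.sub(replacement, current)
    return current


for form in ["eſt", "estime", "estat"]:
    print(form, "->", normalize(form))
```

Checking before every rule, not just once at the start, means a word that becomes good part-way through (eſt after the long-s rule) is protected from the rules that follow.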
As part of our investigation of how the database becomes a new mapping tool, I ran the following code on the db:
xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
declare variable $placeId := "CHEA2";
let $containingDocs :=
distinct-values(for $r in //ref[@target = concat("mol:", $placeId)]
return $r/ancestor::TEI/@xml:id),
$linksInDocs := distinct-values(//TEI[@xml:id = $containingDocs]//ref/@target[starts-with(., "mol:")]/substring-after(., "mol:")),
$locationDocs := for $l in $linksInDocs where //TEI[@xml:id = $l]/facsimile order by $l return $l
for $d in $locationDocs
return concat($d, ': ', //TEI[@xml:id = $d]/descendant::title[1]/text()[1])
This reveals that 248 other locations are connected to Cheapside through documents in the database. On the map, my quick estimate is that around 50 items are connected (although we'll need to do a proper count of that).
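Once the place ids per document have been pulled out, the connections can be counted pairwise to give edges for a network graph. A minimal Python sketch follows; apart from CHEA2, the document and place ids are invented, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """docs maps a document id to the place ids referenced in it.

    Returns a Counter of (placeA, placeB) -> number of shared documents.
    """
    edges = Counter()
    for places in docs.values():
        # Each distinct pair counts once per document, in a fixed order.
        for a, b in combinations(sorted(set(places)), 2):
            edges[(a, b)] += 1
    return edges


def neighbours(edges, place):
    """All places sharing at least one document with `place`."""
    return sorted({b if a == place else a for (a, b) in edges if place in (a, b)})


# Made-up sample data.
docs = {
    "doc1": ["CHEA2", "STPA2", "CHEA2"],
    "doc2": ["CHEA2", "CORN1"],
    "doc3": ["STPA2", "CORN1", "CHEA2"],
}
edges = cooccurrence(docs)
print(neighbours(edges, "CHEA2"))
print(edges.most_common())
```

The counts could serve as edge weights, so that places mentioned together in many documents sit closer in the graph.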
Pushing on...
Slogging through these things. I'm watching them once at home, and then note-taking them again at work...
AT will be starting work on Tarr, so I've added him to the SVN users, and reorganized the repo so Nostromo and Tarr get different folders. I've written a more elaborate set of SVN instructions, and when AT has Oxygen set up on his laptop, he'll spend some time working alongside KT to get familiar with the workflow.
Basically followed these steps as I've done before. This time there are too many reviews to fit on the cover TOC, so I've replaced the list of reviews with a single Reviews entry pointing at the first page of the review section.
Made a start on the cover for Volume 20, but I can't proceed very far until I know what the exact year specification is going to be for the volume.
On late duty (but posting a bit early because I'm shutting down to clean my desk before leaving...).
The repos changed location today, so I updated the documentation on the site, and sent details to JJ.
I now have my XSLT module successfully reconstituting a line-broken word on both sides of the break, like this:
<ab corresp="mar:textnode#xpath(/*[1]/*[2]/*[2]/*[1]/*[2]/text()[10])"><seg>
</seg><w corresp="mar:offset#xpath(substring(., 22, 3))"><choice><orig>ant</orig><reg type="joined-2">imagiant</reg></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 26, 3))"><choice><orig>que</orig></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 30, 4))"><choice><orig>Vous</orig></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 35, 5))"><choice><orig>pren-</orig><reg type="joined-1">prendrez</reg></choice></w></ab><ab corresp="mar:textnode#xpath(/*[1]/*[2]/*[2]/*[1]/*[2]/text()[11])"><seg>
</seg><w corresp="mar:offset#xpath(substring(., 22, 4))"><choice><orig>drez</orig><reg type="joined-2">prendrez</reg></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 27, 7))"><choice><orig>quelque</orig></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 35, 8))"><choice><orig>intereſt</orig></choice></w><seg> </seg><w corresp="mar:offset#xpath(substring(., 44, 1))"><choice><orig>à</orig></choice></w></ab>
It's nasty-ugly but it's only intended for machines to read. Having the full form of the word on both sides of the linebreak means we'll be able to do n-grams properly, and having the two joined forms labelled differently (joined-1 and joined-2) means we'll be able to ignore one of them if we're reconstituting a continuous string.
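The joining logic itself is simple when stripped of the XML. The Python sketch below is not the XSLT module, just an illustration of the idea; it assumes that a trailing hyphen marks a broken word (the real data may also break words without one), and the function and key names are invented.

```python
def join_linebreaks(lines):
    """lines is a list of lines, each a list of word strings.

    Returns the same shape with one dict per word; both halves of a
    line-broken word carry the full joined form.
    """
    out = [[{"orig": w} for w in line] for line in lines]
    for i in range(len(out) - 1):
        if not out[i] or not out[i + 1]:
            continue
        last, first = out[i][-1], out[i + 1][0]
        if last["orig"].endswith("-"):
            joined = last["orig"][:-1] + first["orig"]
            last["reg"], last["type"] = joined, "joined-1"
            first["reg"], first["type"] = joined, "joined-2"
    return out


def continuous(tokens):
    """Rebuild running text, skipping the second copy of each joined word."""
    words = []
    for line in tokens:
        for tok in line:
            if tok.get("type") == "joined-2":
                continue
            words.append(tok.get("reg", tok["orig"]))
    return " ".join(words)


lines = [["que", "Vous", "pren-"], ["drez", "quelque", "intereſt", "à"]]
tokens = join_linebreaks(lines)
print(continuous(tokens))
```

For n-grams you would read the joined form on whichever side of the break you are on; for a continuous string you drop joined-2, as continuous() does.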
On to probability theory...
I've now fixed the streetcar map problem, by changing all filenames and references so that they're consistently referring to 1939 instead of 1936. The fixes have been committed to trunk, and put up on the website.
The other two issues remain outstanding; PD will get back to me with the correct firemap URL at Malaspina, and we'll wait until the DNS has been changed before addressing the problem with Firefox and captcha, since it seems to be cookie-related.
...on MK's instructions.
Review of resubmitted article.
If you look at Rudling 2, 20, you'll see an author with a pseudonym. In the biblio, the pseudonym is handled like this:
<biblStruct>
<monogr>
<author><name reg="Gunnarson, Karl: see Schulze, Karl Gunnar">Karl Gunnarson see Schulze, Karl Gunnar</name></author>
<title></title>
<imprint></imprint>
</monogr>
</biblStruct>
And the main entry looks like this:
<biblStruct>
<monogr>
<author><name reg="Schultze, Karl Gunnar (pseud. Karl Gunnarson)">Karl Gunnar Schultze (pseud. Karl Gunnarson)</name></author>
<title level="m">På Kanadas prärier</title>
<imprint>
<pubPlace>Stockholm</pubPlace>
<publisher>Folket i Bilds Förlag</publisher>
<date value="1939">1939</date>
</imprint>
</monogr>
</biblStruct>
Tweaks to Lange (which was missing para breaks for some reason) and Rudling arising out of discussions with JT on Friday. Still a couple of questions outstanding. New markup structure for handling pseudonym in biblio will be documented under "Hit by a bus".
PD has checked the new VIH site on taprlans/www, and reports only these issues:
The map itself has the date "1939" on it. The .map file (the basic definition file) is called "vicstreetcar1939.map". However, it contains pointers to images called victoria-streetcar-1936.png and victoria-streetcar-1936-key.png, and this page: http://hcmc.uvic.ca/~taprhist/content/maps/maps.php has 1936 in its caption. However, if you go to the map viewer, click on the Maps menu and drill down to it, you see the caption "1939 - Streetcar routes". I think that:
- The actual date is 1939.
- The images are wrongly named, as are the pointers to them in the .map file.
- The caption on the maps.php page is wrong, whereas the Maps menu in the map viewer is right.
Waiting for PD to confirm my analysis before changing the image file names, the .map file, and the caption/link on maps.php.
The link on the page describing the 1885 Fire Insurance Plans of Victoria needs to be changed. The page in question is located at: http://hcmc.uvic.ca/~taprhist/content/maps/firemap.php The hyperlink should be redirected to: http://www.mediastudies.viu.ca/steeple/index.htm [The link currently points to an obsolete server - http://cdi.mala.bc.ca/firemaps/]
However, the new URL is about 1891 panorama images; it has nothing about the 1885 Fire Insurance Plans. Waiting for the correct URL from PD.
Leaving early.
DONE 2012-03-26: The xml:id for the William Allen is currently "william", which is very confusing; change it to "william_allen", and change refs to it, so it's not confused with the Brig William.
On MK's instructions.
I've had to resort to a second pass through the data to count offsets, and that's now working reliably. I've also got the reconstitution of hyphenated words at linebreaks working, but only most of the time; for some reason, when the linebreak precedes a <fw> element, the reconstitution fails. I'm still working on that, but it's very mysterious. I'll probably have to create some test data rather than working on real files until I get it sorted out.
All in all, though, very promising progress.
Met with JT -- some issues discussed, leaving corrections to be made tomorrow, and a decision re the cover, where there will not be enough space for reviews in the TOC: we will have a single entry for reviews on the cover (but not inside, obviously).
LSPW and I have been trying to track the history of Boréal and how it relates to the Grad colloquium. This is what we learned:
LSPW will create a new page in current_students/graduate which has an accordion with one section for each year; the introductory material for each year can be copy/pasted from the index.php files, and links to the articles listed.
I now have the XSLT breaking down each text node into a series of components: either whitespace (passed through as plain text), punctuation sequences (tagged with <pc>) or word[-fragment]s (tagged with <w>, with much more tagging due in subsequent phases).
My current problem is the requirement to record the offset and length of each word in the original text node, so that a search engine can find its way from the modernized source back to the original text. Length is easy, but offset is proving difficult. I have a question posted on the XSLT list in the hope of some help, but it may be that we have to go in two stages: pre-process to create the <ab> element, which is stored in a variable, and then post-process, where the <ab> element and its contents are re-analyzed and additional tagging is added based on that analysis, before the resulting enriched ab is output.
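For comparison, here is how the offset-and-length bookkeeping looks outside XSLT, as a small Python sketch. The regular expression is a simplified assumption about what counts as a word or as punctuation, and the function name is invented. Offsets are 1-based so that they can be dropped straight into an XPath substring() call.

```python
import re

# One alternative per component type; every character falls into exactly one.
TOKEN = re.compile(r"(?P<ws>\s+)|(?P<w>\w+)|(?P<pc>[^\w\s]+)")


def tokenize(text):
    """Split a text node into (kind, string, offset, length) tuples.

    kind is "ws", "w" or "pc"; offset is 1-based, as in XPath substring().
    """
    tokens = []
    for m in TOKEN.finditer(text):
        tokens.append((m.lastgroup, m.group(), m.start() + 1, len(m.group())))
    return tokens


sample = "que Vous, pren-"
for kind, string, offset, length in tokenize(sample):
    print(kind, repr(string), offset, length)
```

Note that this sketch treats a trailing hyphen as punctuation rather than as part of the word fragment, so hyphenated line-end fragments would need their own rule.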
There's one special map that uses SVG for its interface, and which was originally working only on FF; DB had put a complete block on other browsers, but now most of them support SVG so I've removed that block. Browsers that don't support SVG should get with the program. Committed that change to SVN, but I was unable to commit the bulk of my additions (code which was not originally in SVN, but should have been) because we ran out of disk space on revision.tapor.uvic.ca. We'll be switching to the new SVN on Monday, so hopefully this problem will be solved.
Worked through the final video lecture in week 1.
NOTE: Completed 2012-04-23. Many new vessel entries have resulted from this work, and they will need to be completed when time permits.
Try this, first in /db/coldesp/correspondence, and then in /db/coldesp/:
xquery version "1.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
for $r in //name[@type='vessel'][not(@key)]
return $r
The vessel tags inside the correspondence seem mainly to be for vessels which HAVE write-ups; these should simply be correctly linked with @key. The broader set include vessels which may not have bios yet; bios need to be created, and those vessels linked.
This is the state of play on TNB's work as of today:
I've written the bones of an XSLT file to convert an original file to a framework for modernization and regularization. So far the code can create <ab> elements with full working xpath references back to the source text nodes. Now I need to start on tokenization, which I think I'll do with a regex initially, but it's going to be quite complicated.
Got the site basically working by doing this:
METADATA
"queryable" "true"
"tile_source" "cache" <-- This line has to be removed.
END
Our surmise is that this problem line causes the server to construct a broken path to a cache folder that doesn't exist or isn't writable, and therefore it cannot read or construct tiles.
We now propose to have the DNS repointed so that vihistory.uvic.ca points at taprhist/www, and keep the live site there.
Greg also noticed that vihistory.ca is broken; outside of the ring it's pointing at mala.bc.ca DNS servers, so he's emailed PD to get him to fix that on the domain host.
Working on standardizing some spelling variants across the P5 source. Mundane but necessary.
There are issues with the search engine relating to both authors and addressees of correspondence. The drop-down lists are generated from distinct values of tags in the header. These tags, inherited from the Waterloo Script, contain plain text, and so the same individual is identified in a variety of different ways. It would be helpful if we could tag these names with ids from the personography, and then build our search engine drop-downs in a more intuitive fashion.
It seems best to start with the addressees, since they constitute a much smaller number (only 89 distinct values, listed below). The simplest approach would be this:
Addressees:
Following one of KSW's notes in this post, removed date tags from specific location in 17 files. This is presumably for consistency -- only 17 files had them -- and because I suspect some useful parsing can be done/is being done based on the first date in the text being the date the document was penned.
Items in the indexes have a link under their info popup which enables you to retrieve references to them in the correspondence, but sometimes there are no references (as in the case of peripheral bios, which are referred to in other bios but not in the actual correspondence). Previously, clicking on the "Mentions..." link simply did nothing in these cases, but I've now added a trap for this condition and an appropriate error message.
I've been working out my ideas a little more clearly, and beginning to evolve the idea of a working pipeline and a target format for my documents. It would look something like this:
<ab> element. At this stage,
@xml:id, like this xml:base="mar:maladies_des_femmes"
<ab> element points back to the location of the original text node which gave rise to it, using a TEI pointer structure, something like this: <ab corresp="xpath1(*[20]/*[4]/*[3]/text()[2])">.
<w> tag, and that tag is linked back to the original source using XPath again: <w corresp="xpath1(substring(., 36, 10))">.
<w> tag. It is also stored in an attribute (possibly @n, or more likely a custom attribute), so that when the text content is normalized and modernized, the original form is still available.
<w> tags are run through a series of normalization rules which do things such as replacing long s.
<w> tags. This is going to require some serious processing, and will include algorithmic spelling modernization, dictionary lookups, etc.
@lemma attribute on the <w> tag.
For this, we'll need a range of tools, some of which exist and some of which appear not to exist yet (or, as in the case of the lemmatizer, not in an open-source form we can adapt for a Java web application).
Got a pic of the speaker from KE, and added it to the site; worked fine on the www-dev location, but the www location kept showing the old page until I deleted the file, loaded the page for a 404, then uploaded again. Strange.
Late duty...
... on KE's instructions.
ST solved a long-standing issue in the Xenu broken link report for the French site. Apparently when a number of pages were originally created, they were made by copy/pasting from the existing site. That site had links to a Contact page, which were then deleted -- except that what was deleted was only the text, not the anchor tag, so the links were still there, invisible. Deleted them all, except for one reported by ST on a page which no longer seems to exist (french/current-students/graduate/colloquium/index.php). Wrote to LSPW to find out what happened to that page.
Interesting discussions of personographies, referencing with private URI schemes, and centralized authority databases at the first meeting of the versioning group.
Posting time spent on timesheets (SA is away, so did TNB's timesheet too).
Getting towards the end of the week one materials.
Added appropriate credit to MM for her transcription work, and began the process of pulling documents from Google Docs into the actual repo, which is a bit easier to keep track of. Found one suitable document to get PCA started with full-doc transcription, and created a simple guide to the file/id/naming convention for our collection. Wrote a detailed assignment for PCA and sent it. This process will include a check that our Guidelines document in fact provides enough guidance for encoding a complete new document. Most likely we will be expanding it in the next week or two as PCA starts to add new transcriptions.
At PAB's request. And some preliminary discussion of moving it to Cascade.
Received email from PL with project summary.
Printed off waiver release forms as requested.
Purchased supplies for special project ("Interviews").
Completed reimbursement.
week of Mar 5 - Mar 9 M -3.0 CSG, T +1.0 beanstream, W -3.0 CSG, R +1.0 admin before vac, F +1.0 francotoile update
next week I'm coming in Tuesday for some kind of focus group which will take about 2 hours
This structure in the xml data file:
<ref type="info">pépés<note> : <mentioned>Pépé</mentioned> est généralement utilisé par les enfants.</note></ref>
Was originally processed by this xsl:
<xsl:template match="tei:ref[@type='info']">
<xhtml:a href="#" class="tooltip">
<xsl:value-of select="./child::text()"/>
<xhtml:span class="hover_off">
<xsl:value-of select="tei:note"/>
</xhtml:span>
</xhtml:a>
</xsl:template>
Generating this output (note the "Pépé" is passed through as plain text, whereas user wants it italicized)
<a class="tooltip" href="#">pépés<span class="hover_off">Pépé est généralement utilisé par les enfants.</span></a>
I modified the xsl to this:
<xsl:template match="tei:ref[@type='info']">
<xhtml:a href="#" class="tooltip">
<xsl:value-of select="./child::text()"/>
<xhtml:span class="hover_off">
<!--<xsl:value-of select="tei:note"/>-->
<xsl:apply-templates/>
</xhtml:span>
</xhtml:a>
</xsl:template>
Which generates this output (note the "pépés" appears in the span as well as outside it):
<a class="tooltip" href="#">pépés<span class="hover_off">pépés : <em>Pépé</em> est généralement utilisé par les enfants.</span></a>
I've got to come up with some xsl that gives me this output from the given input, but ran out of time today:
<a class="tooltip" href="#">pépés<span class="hover_off"> : <em>Pépé</em> est généralement utilisé par les enfants.</span></a>
When I do, I can delete the leading " : " which is only there as a kludge around this problem.
Leaving early -- need to keep these hours under control. Going home to read about NLP and historical spelling variance.
I now have a collection of a dozen or so papers I'm reading and annotating, and some ideas are getting clearer. At the moment (although I still have a lot of reading and consulting to do), this kind of approach looks promising:
Final tweaks received from department, and nav plan submitted to JS.
With critical input from Martin on the syntax of the java command, I managed to create a new rng file derived from the existing data files using the oddbyexample utility from TEI.
Here are my notes.
minimal instructions here: http://tei-l.970651.n3.nabble.com/ODD-by-example-utility-td2344937.html
download for saxon jar files : http://saxon.sourceforge.net/#F9.4HE
download for oddbyexample.xsl and getfiles.xsl : http://tei.svn.sourceforge.net/viewvc/tei/trunk/Stylesheets/tools/
my setup:
in folder: /System/Library/Java/Extensions (which is in the java classpath)
- saxon9he.jar (working jar file in System)
- saxon9-unpack.jar (working jar file in System)
all other files in folder: /Users/sarneil/Documents/Projects/french/FrancoToile/oddbyexample/
- data folder containing all the data files to use in creating the odd file (I removed child values folder)
- oddbyexample.xsl
- getfiles.xsl
- saxon9he.jar (backup of jar file in System, not used otherwise)
- saxon9-unpack.jar (backup of jar file in System, not used otherwise)
- ftodd (file created by running the java command below)
- francotoile.rng (file created by running ftodd file through Roma as detailed below)
- this readme file.
command I issued:
java -jar /System/Library/Java/Extensions/saxon9he.jar -it:main -o:/Users/sarneil/Documents/Projects/french/FrancoToile/oddbyexample/ftodd /Users/sarneil/Documents/Projects/french/FrancoToile/oddbyexample/oddbyexample.xsl corpus=/Users/sarneil/Documents/Projects/french/FrancoToile/oddbyexample/data
Everything (i.e. paths) is spelled out explicitly as otherwise there's just too much voodoo magic for me.
The -jar switch tells java to run the jar file specified in the following argument (i.e. saxon9he.jar)
The -it switch tells Saxon which named template in the stylesheet to start with (here, main), so no source document needs to be supplied.
The -o switch provides the path and file name for the output file (e.g. /root/path/path/path/nameOfODDfile)
The next argument provides the path and file name of the oddbyexample.xsl file to run
The corpus= argument provides the path to the folder containing the tei data files to run the oddbyexample.xsl against to generate the ftodd file
Once you have the ODD file:
Go to http://www.tei-c.org/Roma/
Click the Open existing customization button and browse to the odd file you've just created
Click the start button
In the Customize tab, change the filename to what you want your schema's filename to be (e.g. francotoile) without any extension
Click the save button
In the Schema tab, select RELAX NG schema (XML syntax) not compact syntax
Click the generate button
Roma will generate the file francotoile.rng (using the name you provided and the extension based on the schema format you selected)
Save that file and move it wherever you want it to go.
Where the data files are expecting that rng file to be for francotoile:
Will test shortly.
Reviewed the extensive (and excellent) work completed by PCA, who is now nearly at the end of the 1854 abstracts. Wrote a number of notes for tweaks and fixes, as well as a couple of requests for further research and the transcription of a mysteriously-untranscribed despatch (V547102A).
Did another lecture video.
Folks from three different projects needed attention, and with the morning taken up by the presentation it was difficult to get through emails before the end of the day...
Long discussion with EDR about possible approaches to encoding rhythm. I think we should use something like this:
<metDecl xml:id="fr_ip" type="met" pattern="AAT:AAT\|AAT:AATA">
  <metSym value="T">syllabe tonique</metSym>
  <metSym value="A">syllabe atone</metSym>
  <metSym value="|">césure</metSym>
  <metSym value=":">pause métrique</metSym>
</metDecl>
and then tie <l> tags to the specific <metDecl> using the @met they match. This would make for nice stand-off markup. We should probably replace the pipe with some other character that doesn't need escaping, for convenience. But looking at the Guidelines, you can't actually point at a metDecl; you have to reiterate the pattern in the @met attribute. I've already found and reported one bug in the source for the French example of <metDecl>, but I think this bit of the GL needs a more serious look.
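To see why the pipe needs escaping, here is a quick check of the pattern against a candidate @met value. It's a sketch in Python, whose regex dialect is not the one the TEI specifies, although for a pattern this simple the two behave the same; treating the pattern as matching the whole value is my assumption, and met_matches is a made-up helper name.

```python
import re

# Pattern as written in the metDecl: the backslash makes the pipe a literal character.
ESCAPED = r"AAT:AAT\|AAT:AATA"
# Same pattern without the escape: the pipe now means "either side".
UNESCAPED = "AAT:AAT|AAT:AATA"


def met_matches(pattern, met):
    """True if the whole @met value matches the metDecl pattern."""
    return re.fullmatch(pattern, met) is not None


if __name__ == "__main__":
    line_met = "AAT:AAT|AAT:AATA"  # what an <l> would carry in @met
    print(met_matches(ESCAPED, line_met))
    print(met_matches(UNESCAPED, line_met))
    print(met_matches(UNESCAPED, "AAT:AAT"))
```

Without the escape the pattern accepts either hemistich on its own and rejects the full line, which is the opposite of what we want; a caesura character with no regex meaning would avoid the problem entirely.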
Started work on first NLP course assignment. It's helping me polish up my regexes and dip a toe into Java again.
JJ + me gave keynote at CS IdeaFest.
Ok.
To get this bit to work your user has to have access to the administration / account settings / order settings area.
You must
1) provide the URLs for working pages in each of the Approval Redirect: and Decline Redirect: text boxes. I think those URLs can be to either pages hosted within the beanstream account or on your server - I've only tested the former so far.
2) uncheck the Require hash validation checkbox
3) check the Include hash validation checkbox
4) click the update button at the bottom of the page
I've created a simple shopping cart, but when I try to use it to buy, I get a "hashvalue missing" error. I had a similar problem with a form hosted on their service but created by me, and solved it by including a hashcode in the submission string as the documentation suggested.
I don't see how I'm going to be able to inject a hashcode into the submission string produced by one of the forms on their shopping cart, and there's no mention of handling hashcodes in the shopping cart documentation, so I've written to RE to find out what to do.
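For reference, the general shape of "including a hashcode in the submission string" looks something like the sketch below. Everything specific here is an assumption to be checked against the beanstream documentation: the hash algorithm (SHA-1 here), the rule that the hash is computed over the query string with the shared key appended, the parameter name hashValue, and the example field names. The function name sign_query is made up.

```python
import hashlib
from urllib.parse import urlencode


def sign_query(params, hash_key, param_name="hashValue"):
    """Append a hash of (query string + shared key) to the query string.

    ASSUMPTIONS: SHA-1, key appended after the query string, and the
    parameter name are all guesses; consult the service documentation.
    """
    query = urlencode(params)
    digest = hashlib.sha1((query + hash_key).encode("utf-8")).hexdigest()
    return f"{query}&{param_name}={digest}"


if __name__ == "__main__":
    # Hypothetical field names and key, for illustration only.
    print(sign_query({"merchant_id": "000000000", "trnAmount": "10.00"}, "not-a-real-key"))
```

The point is that the hash has to be computed by whatever builds the submission string, which is why a form generated entirely on their side leaves no obvious place to add one.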
I also created a simple page on my account on the UVic server which invokes the shopping cart and passes in the item I want to buy. That works fine, but if I then go on to try and actually buy the item, I get the same error as detailed above.
The issue with the simple form is that you'll have to write a lot of code to deal with various kinds of situations (errors, user changing their mind about items or quantities, etc.). I'm hoping the cart takes care of some of those hassles. First test is a cart using pages hosted on their server, then I'll try a cart with as many of the pages as possible (likely the product pages) on the Malahat site.
First, you've got to get the finance people to create a test account for you with the beanstream service, if you haven't got that set up already as described in the post on how to get going with a simple form.
Robert Elves has been my contact.
That account's permissions have to be set so that it has full access to the configuration / shopping cart and configuration/inventory areas.
I suspect strongly that you'll also want access to the configuration / shipping area too, but I haven't got far enough to know that for sure yet.
The procedure for the Simple Shopping Cart is described here: https://beanstream-manuals.pbworks.com/f/BEAN_Starter_Cart.pdf
I found it pretty straightforward to implement, other than I'm not sure how to rearrange the order in which the categories appear, and I'm not sure how to handle the various shipping charges the Malahat charges (e.g. the first item in any category attracts a higher price than subsequent items in the same category).
Leaving early to drive CA to an appointment.
- 4 recorders have already been set to: volume to 25; setting to "Meeting"
(don't reset to any other settings); recording room location: CLE B046
Procedure during Recording:
- 2 machines are recording at once
- 1 machine only is designated for student to control
(Students: only touch 1 button: Record - Pause - Record (all the same button))
- 2nd machine is the back-up machine and will run the entire time Non-Stop (students don't touch)
When ready to start interview:
Judy:
- turns on both recorders' power (power switch on side of recorder)
- places "backup recorder" on table - press record button - leaving machine in record
mode throughout the interview till end (students don't use this machine)
- places student recorder on table (screen facing student)
Student:
- press RED RECORD button to start recording
- press STOP button when interview is finished. Bring student recorder to Judy.
Judy:
- stops "backup recorder" when interview is finished.
Both recorders are kept in HCMC.
TO COPY INTERVIEW FILE FROM STUDENT RECORDER TO JUDY'S COMPUTER:
Connect Student Recorder to Judy's Computer:
- each recorder has a cable; plug cable into back of Judy's computer
- plug other end of cable into USB port located on side of student recorder
- recorder will automatically power up itself
On Judy's Computer Desktop:
1. See "VN8100PC" (name of digital recorder) icon on Judy's desktop.
2. - 2x click on VN8100PC icon on desktop to open
- will see VN8100PC screen with 3 folders list - see "Recorder" folder
3. Open "Recorder" folder and open "Folder A" (click once on arrow - opens - then click on
arrow to open Folder A) - see interview files listed e.g. VN810001.MP3, VN810002.MP3,
etc.
4. Select and drag interview file over from VN8100PC Screen to Judy's desktop
5. Rename that interview file now on Judy's desktop (click once, then click again in
field to rename file) with the naming convention of: student's surname_interviewee's surname_date_interview file#.
6. Open on Judy's desktop the "INTERVIEWS" folder.
7. Drag renamed interview file over to Judy's "INTERVIEWS" folder.
8. Click on "INTERVIEWS" folder to see the renamed interview file has been moved there.
9. To disconnect recorder from Judy's computer: Right click on VN8100PC icon on Judy's
desktop to disconnect.
10.Disconnect cables (from computer and recorder); put recorder and cable back in plastic
bag and lock up.
TO GIVE A COPY OF INTERVIEW FILE TO STUDENT ON THEIR USB MEMORY STICK:
1. Plug student's USB device into back of Judy's computer.
2. Student's USB icon then shows up on Judy's desktop.
3. Under "Device" (on left of screen) see the USB memory stick listed.
4. Open on Judy's desktop the "INTERVIEWS" folder.
5. Drag specified interview file from "INTERVIEWS" folder to their USB device icon.
6. To disconnect USB memory stick right click on USB icon to "Eject"
7. Unplug USB device from back of computer, return USB memory stick to student.
Student CD-R Disk copy:
1. Insert CD-R disk into Judy's computer (with disk printed-side facing me)
2. Wait
3. "Untitled CD" icon shows up on Judy's desktop
4. 2x click on "Untitled CD" icon (rename student's CD e.g. "Interview" by clicking 1x slowly in the field then click again in the same field of the icon and type "Interview")
5. Screen window "Untitled CD" shows up
6. 2x click on Judy's desktop the "Interviews" folder to open
7. Drag interview file from Judy's "Interviews" folder to Untitled CD (now renamed "Interviews")
8. Hit "Burn" button on right of screen
9. Will ask "Are you sure....." Leave burn speed as is
10. Hit "Burn"; will go through two times on its own (once, then again verifying); takes a bit of time
11.When done, screen disappears
12.Click on student's CD-R icon to see file has been dragged over (this is a copy, original interview file stays in Judy's desktop folder)
13.Right click on CD-R to "Eject"
Students are to provide a USB memory stick; a CD-R is used instead if no USB memory stick is provided. Students keep the CD-R disk.
Paperwork to be turned in:
At the end of each interview the student will turn in paperwork (signed Release form) to Judy to put in the desk file folder, for PL to pick up.
I've created the first version of the book and sent it to JT.
This XQuery outputs the XInclude statements for reviews, ordered by the author of the book being reviewed:
xquery version "1.0";
declare namespace xi="http://www.w3.org/2001/XInclude";
declare option exist:serialize "expand-xincludes=no";
for $d in //TEI.2[descendant::sourceDesc/descendant::biblScope[@type='vol'][contains(., '20')]][descendant::classCode[contains(., 'review')]]
order by $d/teiHeader/fileDesc/titleStmt/title/name/@reg
return
(
xs:string(concat('<!-- ', $d/@id, ': ', normalize-space($d/teiHeader/fileDesc/titleStmt/title), ' -->')),
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="{$d/@id}.xml">
<xi:fallback>{concat('MISSING XINCLUDE CONTENT: ', $d/@id, '.xml')}</xi:fallback>
</xi:include>
)
In the process, I've modified the XSLT so that it uses the U+202F (narrow no-break space) character when inserting guillemets, as well as for "double" punctuation marks; that results in better-looking output. I haven't done that for the XHTML output though; I'm still using U+00A0 (the ordinary no-break space) there, because various reports cast doubt on the reliability of U+202F on various browsers.
Presentation prep and keeping everything else ticking over, along with ScanCan work that's rather urgent...
PCA reported that mentions of the Brig William, wrecked in 1854, are linked to the vessel info for the William Allen, which is not the same ship at all. We dug around to find some references from which to construct a new vessel entry, and she's now going ahead with writing it.
With regard to spaces, French punctuation behaves like English, except in the case of the so-called "double" punctuation marks (;:!?). These should be preceded by U+202F, the "narrow no-break space". In the case of the Iglesias text, regular spaces were used, which meant that punctuation marks sometimes wrapped to the next line. I've now fixed that, and confirmed that XEP handles it OK.
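The substitution itself is simple enough to sketch. This is a Python illustration of the rule, not the code I actually used on the Iglesias text; the function name fix_double_punctuation is made up. It only converts spaces that are already there, on the assumption that the source text was typed with a space before each double punctuation mark, so colons in things like URLs are left alone.

```python
import re

NNBSP = "\u202f"  # U+202F, narrow no-break space

# One or more ordinary or no-break spaces directly before ; : ! ?
_SPACE_BEFORE_DOUBLE = re.compile(r"[ \u00a0]+(?=[;:!?])")


def fix_double_punctuation(text):
    """Replace the space(s) before ; : ! ? with a single U+202F."""
    return _SPACE_BEFORE_DOUBLE.sub(NNBSP, text)


if __name__ == "__main__":
    sample = "Quoi ? Oui : non ; enfin !"
    # repr() makes the otherwise invisible replacement characters visible.
    print(repr(fix_double_punctuation(sample)))
```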
Lots more to do on the Iglesias, though...
Mussari, Urberg, Rudling, Gudmundsson and Blackwell done (with a few red-circled questions to talk to JT about).
I've been working for the better part of the last couple of days trying to figure out why the imap maps in ViHistory are not appearing.
Problem appears with the production front end and a test front end connected to either the old db server or the new db server.
Problem appears with a test front end which is the production front end minus the captcha code (which is the only code that has changed since Jamie left us a working site).
We weren't getting errors when we trolled the server logs on lettuce.
We did get the following errors from the sysadmin:
Apache error log:
[Tue Mar 06 09:16:21 2012] [error] [client 96.54.151.99] [Tue Mar 6
09:16:21 2012].616424 loadSymbolSet(): Unable to access file.
(/home1t/taprhist/www/content/maps/user/symbol/generic.sym), referer:
http://vihistory.ca/content/maps/htdocs/index.php?map=vicbird1889
AND syslog:
2012-03-06T09:16:21-08:00 local@mustard.hcmc.uvic.ca user.notice
php-cgi: PHP Warning: [MapServer Error]: loadSymbolSet(): (/home1t/taprhist/www/content/maps/user/symbol/generic.sym)
2012-03-06T09:16:21-08:00 local@mustard.hcmc.uvic.ca user.notice in: /home1t/taprhist/vihdev/www/content/maps/htdocs/init.php on line 125
2012-03-06T09:16:21-08:00 local@mustard.hcmc.uvic.ca user.notice
php-cgi: PHP Warning: Failed to open map file /home1t/taprhist/vihdev/www/content/maps/user/map/vi1798.map in /home1t/taprhist/vihdev/www/content/maps/htdocs/init.php on line 125
2012-03-06T09:16:21-08:00 local@mustard.hcmc.uvic.ca user.notice
php-cgi: PHP Fatal error: Call to a member function getMetaData() on a non-object in /home1t/taprhist/vihdev/www/content/maps/htdocs/init.php on line 131
2012-03-06T09:16:21-08:00 local@mustard.hcmc.uvic.ca local0.debug
suphp_wrapper: 0 PHP5
/home1t/taprhist/vihdev/www/content/maps/htdocs/init.php
That init file is as provided by the imap people, so I really, really doubt it is causing the problem, though it is where the problem occurs. It looks like that file is trying to create objects based on what it reads from some kind of config file, and somehow that process is breaking down, so the object doesn't get created, so invoking a method on the (non-existent or empty) object throws the error. More precisely, it looks like the config.php file is supposed to create an array in the variable aszMapFiles and then those values are used in the init file, but for some reason something is failing with the way that array and associated variables are being populated.
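Since both the Apache log and the syslog complain about files that can't be accessed or opened, one cheap thing to rule out is whether those paths exist and are readable. A small Python sketch, using the two paths quoted in the logs above; check_paths is a made-up helper, and the result is only meaningful if the script is run as the same user the PHP process runs as, which is an assumption about how it would be used.

```python
import os


def check_paths(paths):
    """Map each path to 'ok', 'missing' or 'unreadable' for the current user."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = "missing"
        elif not os.access(path, os.R_OK):
            report[path] = "unreadable"
        else:
            report[path] = "ok"
    return report


if __name__ == "__main__":
    # Paths copied from the error messages above.
    logged = [
        "/home1t/taprhist/www/content/maps/user/symbol/generic.sym",
        "/home1t/taprhist/vihdev/www/content/maps/user/map/vi1798.map",
    ]
    for path, status in check_paths(logged).items():
        print(status, path)
```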
Worked through the first two lectures in the Stanford NLP course I'm taking this semester.
Entered JT's corrections for Stenberg, Higgins, Norrman and Sheffield. One outstanding issue on Stenberg and one on Norrman, waiting for JT to come by. Also looked up Chicago on ellipses; the suggested policy doesn't align with it, so I referred it back to JT for clarification.
Late duty and a presentation to create...
Collected images, made some details and diagrams, and wrote a first draft of my bit of the presentation in the form of a WP document with text and images. Tomorrow I'll turn it into an actual presentation.
Spent the morning turning the navigation plan for PAS into a spreadsheet; sent it to the team for comments before submitting it.
Produce a sorted list of recently changed files by running this:
find . -type f -printf '%TY-%Tm-%Td %TT %p\n' | sort
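Note that -printf is a GNU find extension, so that command won't work with the BSD find that ships with Mac OS X. A rough Python equivalent, as a sketch: the function name recently_changed is made up, and the timestamp is printed to the second rather than with the fractional seconds GNU find gives.

```python
import os
import stat
import time


def recently_changed(root="."):
    """Return (mtime, path) pairs for regular files under root, oldest first."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # lstat so symlinks are not followed
            if stat.S_ISREG(info.st_mode):
                entries.append((info.st_mtime, path))
    return sorted(entries)


if __name__ == "__main__":
    for mtime, path in recently_changed("."):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(stamp, path)
```

As with the find version, the most recently changed files end up at the bottom of the list.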
Sent out "invoices" to two French researchers (HC and EdR) for HCMC resources. Following the system I worked out with AS in dean's office:
I get research account number from researcher
I write up invoice with reference to dh cttee approval of terms and specified costs and research and hcmc account number
I send it to them and ask them to forward it to AS with authorization for Journal Entry.
AS does the journal entry.
Half hour with dean on:
allocation of space or other resources dedicated to one project for more than a year,
flexibility of terms for use of resources
participation in etcl workshop coming up
Leaving early.
Various edits. New decision to remove punctuation around ellipses still needs to be implemented.
Met with JJ and outlined the presentation for next week. I've started collecting materials, and I'll write a draft of my bit on Monday.
Note: Francotoile and the Mysteries projects are not showing any stats.
Marked up the review, and sent some queries to JT.
Meeting went on to the end of the day and then backups etc....
Met with folks from the dept, and worked through their spreadsheet. I'll turn the results into another spreadsheet on Monday, and we'll get it submitted ASAP.
Started some detailed reading on this topic, with some pointers from friends and people on TEI-L. It looks like a flurry of activity happened around 2005-2007, and there are some working examples such as EEBO with fully implemented systems, as well as lots of surveys of approaches, and some tools. It looks useful and interesting. Haven't found anything resembling a dictionary of variants for Early Modern French, though.
End of Year (eoy):
Met with SA today to review eoy issues. Forwarded to SA last year's
eoy information for perusal.
Subscription:
Renewed/mailed MAC subscription.(2 yr./24 issues subscription)
Cascade (HCMC site)
Attended Cascade drop-in session:
Issues:
- banner color (color will be implemented later today by communications staff)
- home-home: issue has already been reported by others; will be addressed by communications later; (HCMC -correct as is for now).
Reviewed:
- "page mirroring"
- content details
Responses to questions from yesterday prompted another round of edits to existing documents, including the normalization of the use of ellipsis (in vol 20 documents only) to have a space before and a space after.
Hérodiade Scène took a long time because of its speaker headings and line indents, but most of the rest were quite rapid. 47 out of 77 poems have now been processed into XML and HTML.
With dbs moving to Mango, I've now changed the connection string in the imap code (on home1t/london) to point to pgsql.hcmc.uvic.ca. This is working through Lettuce, but not yet on hcmc.uvic.ca/~london, because (presumably) the (machines behind) the load balancer don't yet have permission to talk to Mango's pgsql.