Wrote the initial phase of a system for building PDF volumes. I now have XQuery that can retrieve a set of documents based on volume and/or issue number, or an arbitrary sequence, and return the results as a <teiCorpus>. I still have to figure out how the corpus metadata should be dropped in (derived from the first document in the sequence, or some other approach), and how much of it should go in through XQuery as opposed to XSLT; both have access to the db's string resources, so either could do it.
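The core of the retrieval is simple enough. Here's a stripped-down sketch (the collection path and the idno types are placeholders, not the real ones, and I'm assuming an eXist-style collection() call):

xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: volume number passed in from the pipeline :)
declare variable $vol as xs:string external;

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- corpus-level metadata: derive from the first document, or build here? -->
    </teiHeader>
    {
        for $doc in collection('/db/journal')//tei:TEI
            [.//tei:idno[@type = 'volume'] = $vol]
        order by $doc//tei:idno[@type = 'order'][1]
        return $doc
    }
</teiCorpus>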
Made a couple of small changes to HM's preface article on her instructions, then started fixing some little niggles that have been annoying me:
- The Search function appears on both the regular TOC page and the proofing TOC. In both cases, it ran its search against the regular documents and bounced you back to the regular TOC page with the results. It now works independently for each page: if you search on the main TOC, you get results from the main collection, and if you search on the proofing TOC, you get results from docs in the proofing collection.
- Clicking on the column header sort links in the proofing TOC page also bounced you back from the proofing TOC to the main TOC. That's now fixed.
- Article XHTML output always included a link at the top to the main TOC, even when the document was in the proofing collection. It now distinguishes between the two locations of the document, and gives you a link back to the proofing TOC if the document is in the proofing collection.
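For the record, that last fix boils down to a test on the document's location. In the XSLT it's roughly this shape (the path test and the filenames here are illustrative guesses, not the actual values):

<xsl:variable name="tocUrl">
    <xsl:choose>
        <!-- proofing docs link back to the proofing TOC -->
        <xsl:when test="contains($docURI, '/proofing/')">proofing.htm</xsl:when>
        <!-- everything else links back to the main TOC -->
        <xsl:otherwise>index.htm</xsl:otherwise>
    </xsl:choose>
</xsl:variable>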
Received HM's preface article for vol 40, and marked it up. In the process, I marked up a series of references to the other articles in the issue, using relative links. This threw up a new situation in the PDF export. Previously, I was handling two distinct cases of links in PDF output:
- Those beginning with #, being internal links; easy to process into internal PDF links.
- Those beginning with http, being external Web links, also easy to process into external PDF links.
Now there was a third case:
- Those not beginning with a hash, and not containing a protocol, meaning that they're relative.
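In the XSLT that generates the PDF, the dispatch now has three branches. Schematically it looks like this (I'm assuming fo:basic-link output here, and $docPath is the requesting document's directory, derived as described below):

<xsl:template match="a[@href]">
    <fo:basic-link>
        <xsl:choose>
            <!-- case 1: internal link -->
            <xsl:when test="starts-with(@href, '#')">
                <xsl:attribute name="internal-destination">
                    <xsl:value-of select="substring-after(@href, '#')"/>
                </xsl:attribute>
            </xsl:when>
            <!-- case 2: absolute external link -->
            <xsl:when test="starts-with(@href, 'http')">
                <xsl:attribute name="external-destination">
                    <xsl:value-of select="concat('url(', @href, ')')"/>
                </xsl:attribute>
            </xsl:when>
            <!-- case 3: relative link, resolved against the original request's directory -->
            <xsl:otherwise>
                <xsl:attribute name="external-destination">
                    <xsl:value-of select="concat('url(', $docPath, @href, ')')"/>
                </xsl:attribute>
            </xsl:otherwise>
        </xsl:choose>
        <xsl:apply-templates/>
    </fo:basic-link>
</xsl:template>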
Relative links don't work in PDFs, because the PDF is typically downloaded to a temp folder and opened from there. I therefore needed to reconstruct the full original URL of the request, so that I could turn each relative link into a full URL and from that a proper PDF external link. The Cocoon {request:requestURI} variable, available in the sitemap, should, according to the documentation, give you the full URI, but it doesn't; it gives you only the path after the port. After some reading around, I found the solution. In the sitemap, you pass this into the XSLT transformation:
<map:parameter name="browserURI" value="{request:scheme}://{request:serverName}:{request:serverPort}{request:requestURI}" />
giving you a parameter with the full request URL, and then, in the XSLT, you can reduce that path to its directory:
<xsl:param name="browserURI" /> <xsl:variable name="uriDirLength" select="mdh:lastIndexOf($browserURI, '/')" /> <xsl:variable name="docPath" select="substring($browserURI, 1, $uriDirLength+1)" />
I'm blogging this in detail because the Cocoon documentation is ropy in this area, and I've previously tried to figure this out and failed.
We're having some preliminary discussions with members of the TEI community about the possibility of a SIG to work on a TEI-based journal publishing schema. I've spent some time over the last couple of days thinking about it and taking part in the discussion.
Looks like OJS will simply be taking the donated code for NLM handling, rather than writing their own, and the donated code uses NLM 2.3, so I'm now writing a converter to turn my 3.0 output into 2.3. There are about four major areas of difference, two of which I've already dealt with, and I've also gone back and elaborated one area of the original TEI-to-NLM-3.0 conversion for greater consistency.
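The converter itself is just an identity transform with overrides for each area of difference. A minimal sketch of the pattern (the citation rename shown is, as I understand it, a genuine 2.3/3.0 difference, but not necessarily one of my four areas):

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- identity: pass everything through unchanged by default -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- 3.0 split <citation> into <element-citation> and <mixed-citation>;
         2.3 wants <citation citation-type="..."> back -->
    <xsl:template match="element-citation | mixed-citation">
        <citation citation-type="{@publication-type}">
            <xsl:apply-templates select="(@* except @publication-type) | node()"/>
        </citation>
    </xsl:template>
</xsl:stylesheet>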
Basic backup (saving the output of a pipeline into a directory on the filesystem) is now working, thanks to some help setting permissions from sysadmin.
Now a proper GUI and plan need to be implemented. This is my first shot at a plan:
- There needs to be an editorial index page showing a list of all the documents (published and in proof) in the system.
- The page should have links to back up specific output formats for each document.
- Those links would invoke the pipeline which calls the flowscript, but they would do it through an AJAX call, so that the index page does not need to be replaced.
- The AJAX script would call the pipeline, and write the server response into a <div> on the page.
- The server response needs to be encoded in a TEI file of messages, stored in the db. This would be similar to the site.xml file which currently holds all the site rubric.
- The pipeline which sends back the message would retrieve the relevant block from the XML file, and pass it for processing to the site.xsl file, but in some manner which prevents site.xsl from building a full page; we only need an XHTML div, for insertion into the index page.
- One outcome is an error message; this should warn about permissions, the most likely cause of failure for the operation.
- Once all this is working for individual files/formats, the next stage is to enhance the AJAX page so that it can do the whole lot.
- This would work by having the JavaScript create a queue of URLs to be called; when it gets a successful response from one, it invokes the next, reporting its progress as it goes (see the sketch after this list). There would also need to be a way of bailing out of the process.
- A similar batch function should be available for each individual document, invoking all the formats.
- Finally, we need directory browsing to be available through the Cocoon sitemap, so the editor (or indeed regular readers) can see and access all the backup files.
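Here's a sketch of how that queue might work (everything in it — the URL list, the backupLog div, the bail-out flag — is an illustrative name, not written code yet, and I'm ignoring older-IE XMLHttpRequest fallbacks):

var queue = ['backup/doc1.pdf', 'backup/doc1.xhtml'];   // URLs of the backup pipelines, in order
var cancelled = false;                                  // set true by a Cancel button to bail out

function processNext() {
    if (cancelled || queue.length === 0) { return; }
    var url = queue.shift();
    var xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function() {
        if (xhr.readyState === 4) {
            // report progress: drop the server's XHTML div into the page
            document.getElementById('backupLog').innerHTML += xhr.responseText;
            if (xhr.status === 200) {
                processNext();       // success: invoke the next URL in the queue
            } else {
                cancelled = true;    // failure: stop the run
            }
        }
    };
    xhr.open('GET', url, true);
    xhr.send(null);
}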
This setup would give the editor the option to back up the whole collection, or just one changed file, in all its output formats (including the source XML, presumably). That way a regular backup of changes could be taken, and when a single file is edited, copies could be regenerated for that file alone.
It's long been a plan of mine (following my own recommendations from our Finland presentation) to build in a system whereby hard copies of all the XHTML, PDF and other output formats can be saved on the server (or perhaps are saved automatically), so that if Tomcat or Cocoon goes down in some catastrophic way, those files are still available to the editors. I implemented that today, following some instructions here. There was a lot of messing around initially getting the folder paths right in the parameters to the flowscript, but after tailing the Cocoon log for a while, I got that sorted.
However, it didn't work in the end, due to a permissions issue. The Tomcat process runs as apachsrv, which apparently doesn't have permission to write to those folders -- which makes perfect sense from a security point of view. We're working on that with sysadmin now.
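For reference, the guts of the flowscript are just a call to Cocoon's processPipelineTo(), streaming a pipeline's output into a file. A stripped-down sketch (the parameter names, pipeline URI and result page are placeholders, not my actual values):

function saveOutput() {
    // internal pipeline to run, and the filesystem destination, both passed from the sitemap
    var pipelineURI = cocoon.parameters["pipelineURI"];
    var outputPath = cocoon.parameters["outputPath"];
    var out = new Packages.java.io.FileOutputStream(outputPath);
    try {
        // run the pipeline and stream its output straight into the file
        cocoon.processPipelineTo(pipelineURI, {}, out);
    } finally {
        out.close();
    }
    cocoon.sendPage("backup-done", {"file": outputPath});
}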
NLM 3.0 output is now working and on the Website. Mostly did this yesterday, but blogged it in the wrong blog.
Wrote the XSLT to convert appendices, and began work on the reference list (bibliography) code. I've got the list framework working. I'm now looking at the rather odd NLM structures used for reference items. They don't seem to have any way of distinguishing authors from editors, other than by wrapping them in <person-group> tags with a person-group-type attribute; I guess that reflects the reality in scientific fields, where no-one publishes anything alone. The whole thing seems less structured than the TEI equivalent, being more of a loose agglomeration of tags.
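To illustrate, a reference item in NLM 3.0 ends up looking something like this (an invented example):

<ref id="bib1">
    <element-citation publication-type="book">
        <person-group person-group-type="author">
            <name><surname>Smith</surname><given-names>J.</given-names></name>
        </person-group>
        <person-group person-group-type="editor">
            <name><surname>Jones</surname><given-names>A.</given-names></name>
        </person-group>
        <source>Some Collected Essays</source>
        <year>2008</year>
    </element-citation>
</ref>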
My nine test files now convert without errors (except for the missing reference ids, which point at bibliography items in the reference list and aren't converting yet because the back matter isn't done). So two-thirds of the job is done. There are some oddities in the content model of NLM -- for instance, every section (<sec>) must have a <title>, which seems ridiculous, and links (<xref>, <uri> and <ext-link>) cannot contain abbreviation tags, which seems pointlessly restrictive when they can contain bold, italics etc. However, that's not really my problem, except that it requires me to throw away some information during the conversion.