Got the reference list output code ported from XHTML to XSL:FO. Everything seems to be working except that internal links are not doing their job. Need to work on that. Citations in the body should bounce to the relevant item in the reference list.
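A minimal sketch of how the citation links might be wired up, assuming citations are encoded as TEI `<ref target="#someId">` and reference-list items as `<biblStruct xml:id="someId">`. Both encodings are assumptions for illustration, not the actual teiJournal templates. The key point is that the `internal-destination` on `fo:basic-link` has to match an `id` that is actually emitted on some formatting object:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <!-- Assumed encoding: in-text citation as <ref target="#someId">. -->
  <xsl:template match="tei:ref[starts-with(@target, '#')]">
    <fo:basic-link internal-destination="{substring-after(@target, '#')}">
      <xsl:apply-templates/>
    </fo:basic-link>
  </xsl:template>

  <!-- Assumed encoding: reference-list item as <biblStruct xml:id="someId">.
       The id emitted here is what the link above points at. -->
  <xsl:template match="tei:biblStruct[@xml:id]">
    <fo:block id="{@xml:id}">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

</xsl:stylesheet>
```

If the links render but go nowhere, a first thing to check would be whether the `id` attributes are surviving into the FO output at all.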
I'm making some progress with PDF generation; I have divs, headings, paragraphs, external links, images and a variety of other features working. However, I've hit one problem:
XSL:FO has properties called keep-with-previous and keep-with-next, which are used to prevent the orphaning of titles at the bottom of pages, and similar layout oddities. The FO-to-PDF converter in Cocoon 2.1 is Apache FOP 0.20.5, which is old, and which doesn't support the keep-with properties. That causes an occasional output problem: a title can end up at the bottom of a page, separated from its following paragraph. For our serious print publishing, we use the RenderX XEP engine, which has virtually complete XSL:FO support. However, XEP costs a lot of money ($4,000 for a single-core, one-CPU licence), so it can't be a default part of the teiJournal project, which is wholly open-source.
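For reference, this is roughly what the property looks like in the FO output. It is a fragment rather than a complete document, and the text content is placeholder material. An engine with keep support should hold the heading on the same page as the block that follows it; as described above, FOP 0.20.5 does not honour it:

```xml
<fo:flow flow-name="xsl-region-body"
         xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <!-- Heading: ask the formatter to keep it on the same page as what follows. -->
  <fo:block font-weight="bold" keep-with-next.within-page="always">
    Section title
  </fo:block>
  <!-- The paragraph the heading should stay attached to. -->
  <fo:block>First paragraph of the section.</fo:block>
</fo:flow>
```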
Meanwhile, both FOP and Cocoon are moving forward; Cocoon 2.2 is out, and includes a more modern version of FOP which supports the properties we need. However, Cocoon 2.2 is a completely different animal from 2.1, with a totally different structure; moreover, the XML database we use (eXist) is available in a package with Cocoon 2.1, but no such package exists for Cocoon 2.2. So the situation is this:
Right now we can't ensure that PDFs avoid the orphaning problem in teiJournal (although if IALLT Journal wished to, they could pay $4,000 for XEP and solve the problem). In the future -- over about two to three years, I estimate -- eXist will probably move to Cocoon 2.2, or we'll learn to build Cocoon 2.2 with eXist, and teiJournal will be ported to Cocoon 2.2 and solve the problem. So we're looking at occasional orphaning problems occurring with some articles for two or three years.
While I'm posting, a reminder to myself about PDF output development and the caching problems we have with it. First, remember that the browser usually caches a PDF download, so you need to clear the browser cache before grabbing an updated copy when working on PDF output. Secondly -- and this is a killer, that I'd forgotten about -- when there are multiple XSLT stylesheets being called, Cocoon will cache the results of a transformation unless the root stylesheet has changed. Therefore, if you're actually coding in a different stylesheet, you need to make a quick edit to the root stylesheet and upload it in order to trigger a refresh in the Cocoon pipeline. It doesn't know that the root stylesheet invokes other files, so it doesn't check to see if they've changed.
Re-configured the XSL:FO output XSLT to take account of three possible levels of attribute-sets in the database (base, styleguide and user). Then I started the long task of writing the XSLT output code. I got a long way into it -- all the basics should be covered -- but I couldn't get anything to render correctly from FOP; the page layout and numbering worked, but the body document was simply unstyled. After an hour of hacking, I finally determined that this was caused by the accursed browser cache; switching browsers threw up a page that looked half-way ready.
I still have a lot of not-quite-right attributes in my output, according to the unofficial FOP schema I've downloaded. It tells me I need to get rid of rogue xsl:use-attribute-sets attributes that have found their way out into my output instead of being interpreted; it also tells me I must use units even where values are 0 (so 0 is wrong, but 0in is correct); and it tells me that hash expressions of colour values are wrong. I'll work on these, and see if I can get more stuff to work properly tomorrow.
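A small sketch of what the cleaned-up version might look like. The attribute-set name `para` and the unprefixed `p` match are hypothetical, and the named colour is used on the assumption that the schema accepts colour keywords where it rejects hash values:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Hypothetical attribute set for body paragraphs. -->
  <xsl:attribute-set name="para">
    <!-- Zero still carries a unit: 0in, not 0. -->
    <xsl:attribute name="space-before">0in</xsl:attribute>
    <xsl:attribute name="space-after">0.1in</xsl:attribute>
    <!-- Colour keyword rather than a hash value such as #000000. -->
    <xsl:attribute name="color">black</xsl:attribute>
  </xsl:attribute-set>

  <!-- On a literal result element the attribute takes the xsl: prefix,
       so the processor consumes it rather than writing it to the output. -->
  <xsl:template match="p">
    <fo:block xsl:use-attribute-sets="para">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

</xsl:stylesheet>
```

On `xsl:element` and `xsl:copy` the same attribute is written without the prefix (`use-attribute-sets`).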
I created the APA pdf.xsl file which calls the main article_to_pdf.xsl file, and tested it. Then I stored the pdf_page_masters.xsl file in the database itself, under db/teiJournal/settings/default/style/, and set up an XQuery file which can retrieve such files when passed the filename as a parameter. The XSL generation now calls that generation system through the cocoon:// protocol. The current setup assumes that if there's any kind of a user file in /settings/user/style/, it should use that; otherwise, it uses the default file. This is a large assumption, and a more graceful approach, similar to that of the CSS system, would be to find the default file, then iterate through all the named attribute sets and look for similar sets in the user file, substituting them where they exist. I'll implement that next. For the moment, basic PDF generation is still working after the XSLT rearrangement.
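The more graceful approach could be sketched like this: an identity transform run over the default file, which swaps in any same-named attribute set found in the user file. The `userFile` parameter and its default value are hypothetical, and the sketch assumes attribute-set names can be compared as plain strings:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical: URI of the user's override file. -->
  <xsl:param name="userFile" select="'user_pdf_page_masters.xsl'"/>

  <!-- All attribute sets in the user file, or nothing if the file is absent. -->
  <xsl:variable name="userSets"
      select="if (doc-available($userFile))
              then doc($userFile)//xsl:attribute-set
              else ()"/>

  <!-- Identity template: copy the default file through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- For each default attribute set, prefer a user set with the same name. -->
  <xsl:template match="xsl:attribute-set">
    <xsl:variable name="setName" select="@name"/>
    <xsl:variable name="override" select="$userSets[@name = $setName]"/>
    <xsl:choose>
      <xsl:when test="$override">
        <xsl:copy-of select="$override[1]"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

This only substitutes whole sets; merging individual attributes within a set would need one more level of the same logic.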
I followed my own instructions in the earlier post to set up FOP font configuration on the actual IALLT Journal site, and then started putting test transformations in place. I took some basic PDF attribute sets from ScanCan, and modified them heavily, primarily by converting the metric setup to inches as required by the IALLT Journal (paper size will be 8.5 x 11), and simplifying the page masters; IALLT requires only three page masters, one a recto for the article title page, then one each for recto and verso regular pages. I set up the headers and footers in a default manner, with the running titles as provided in the articles, and put page numbering at the bottom outside. Amazingly, the basics worked right out of the box; FOP is perhaps further advanced than we thought.
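For the record, a three-master setup of the kind described might look roughly like this. The master names, region names, margins and extents are all placeholder assumptions, not the values actually used for IALLT Journal; only the 8.5in x 11in page size comes from the requirements above:

```xml
<fo:layout-master-set xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Recto for the article title page: no running header. -->
  <fo:simple-page-master master-name="title-recto"
      page-width="8.5in" page-height="11in"
      margin-top="1in" margin-bottom="1in"
      margin-left="1.25in" margin-right="1in">
    <fo:region-body margin-bottom="0.5in"/>
    <fo:region-after extent="0.4in" region-name="footer-recto"/>
  </fo:simple-page-master>

  <!-- Regular recto page. -->
  <fo:simple-page-master master-name="recto"
      page-width="8.5in" page-height="11in"
      margin-top="1in" margin-bottom="1in"
      margin-left="1.25in" margin-right="1in">
    <fo:region-body margin-top="0.5in" margin-bottom="0.5in"/>
    <fo:region-before extent="0.4in" region-name="header-recto"/>
    <fo:region-after extent="0.4in" region-name="footer-recto"/>
  </fo:simple-page-master>

  <!-- Regular verso page: inner and outer margins mirrored. -->
  <fo:simple-page-master master-name="verso"
      page-width="8.5in" page-height="11in"
      margin-top="1in" margin-bottom="1in"
      margin-left="1in" margin-right="1.25in">
    <fo:region-body margin-top="0.5in" margin-bottom="0.5in"/>
    <fo:region-before extent="0.4in" region-name="header-verso"/>
    <fo:region-after extent="0.4in" region-name="footer-verso"/>
  </fo:simple-page-master>

  <!-- Pick a master by page position, then by odd/even. -->
  <fo:page-sequence-master master-name="article">
    <fo:repeatable-page-master-alternatives>
      <fo:conditional-page-master-reference
          master-reference="title-recto" page-position="first"/>
      <fo:conditional-page-master-reference
          master-reference="recto" odd-or-even="odd"/>
      <fo:conditional-page-master-reference
          master-reference="verso" odd-or-even="even"/>
    </fo:repeatable-page-master-alternatives>
  </fo:page-sequence-master>

</fo:layout-master-set>
```

Separate region names for recto and verso are what allow the page number to sit at the outside edge on both, since each gets its own `fo:static-content`.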
The next stages are:
- Decide which components of the layout and design should be abstracted for editorial control, and stored in the db. Make that work.
- Create the styleguide-controlled transformation system. Right now, the transformation is done using a root XSLT file, but actually a styleguide-based file should be called, as with the XHTML transformation, and the base file included in that.
- Add the article header system into the basic templates file (this is not styleguide-controlled).
- Create a blank pdf_references.xsl file in the styleguide folder, and include it. Later, the XHTML references document will act as the model to create this file.
- Look at the way images are included in the XML, and make sure you have a reliable method of deriving the hard path to image files, just as we do for fonts; FOP can't work with relative paths.
Then it's just a question of working through all the little templates, figuring out how best to handle them.
Worked through the editorial corrections to the Sawhill article, and made all the fixes. Also had to make two code changes. In the first, the need to add an acknowledgement that a previous version of the article had been a prize finalist was handled by placing a paragraph with a special rend attribute in the <sourceDesc> tag:
    <sourceDesc>
      <p rend="afterTitle">A version of this paper was selected as a Henderson Plenary Award Finalist at the IALLT 2007 Annual Conference.</p>
      [...]
    </sourceDesc>
Code in xhtml_article_base.xsl looks for this and renders it above the author names. This is only marginally satisfactory, but again, there are no absolutely appropriate tags for this kind of thing because it's peculiar to born-digital documents.
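The handling is along these lines. This is an illustrative sketch rather than the code in xhtml_article_base.xsl: the mode name, the CSS class and the TEI namespace binding are all assumptions:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="tei">

  <!-- Render the special paragraph; the mode keeps it out of normal
       header processing so it only appears where it is asked for. -->
  <xsl:template match="tei:sourceDesc/tei:p[@rend = 'afterTitle']"
                mode="afterTitle">
    <p class="afterTitle">
      <xsl:apply-templates/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

The template building the title block would then call `<xsl:apply-templates select="//tei:sourceDesc/tei:p[@rend = 'afterTitle']" mode="afterTitle"/>` between the title and the author names.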
The other change related to capitalization of initials in bibliographical references. My code was automatically doing this, but the case of "bell hooks" (an affected pseudonym; the author always uses lower-case) required that automatic upper-casing be suppressed. Conceivably there are actually first names that might legitimately begin with lower-case letters anyway, so it's no bad thing as far as I can see.
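In outline, the change amounts to dropping the case conversion when an initial is built. The named template and its parameter below are hypothetical stand-ins for the real reference code:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical helper: build an initial from a forename.
       An upper-casing version would wrap the substring in upper-case();
       here the author's own capitalization is left untouched. -->
  <xsl:template name="initial">
    <xsl:param name="forename"/>
    <xsl:value-of
        select="concat(substring(normalize-space($forename), 1, 1), '.')"/>
  </xsl:template>

</xsl:stylesheet>
```

With this version the case of the initial simply follows whatever is in the source XML, so correct capitalization becomes the encoder's responsibility.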
Here's what I had to do:
- Create the config file with a slightly different name, thus:

    [cocoon]/teiJournal/fop-config-src.xml

- In that file, replace paths to font metrics and font files with simple folder+file names, thus:

    <font metrics-file="fop-fonts/DejaVuSans.xml" kerning="yes" embed-file="fop-fonts/DejaVuSans.ttf">

- In the fo2pdf serializer section of the root sitemap, set the user-config setting to use a cocoon path, which will invoke a pipeline we create:

    <map:serializer logger="sitemap.serializer.fo2pdf" mime-type="application/pdf" name="fo2pdf" src="org.apache.cocoon.serialization.FOPSerializer">
      <user-config>cocoon://teiJournal/fop-config.xml</user-config>
    </map:serializer>

- Create an XSLT file to do the transformation. It receives the real filesystem path of Cocoon as a parameter (supplied by Cocoon's RealPath module in the pipeline in the next step), so we can use it to reconstruct the path to the fonts:

    <?xml version="1.0"?>
    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" exclude-result-prefixes="#all">
      <xsl:param name="fontPath" />
      <!-- XSLT Template to copy anything, priority="-1" -->
      <xsl:template match="@*|node()|text()|comment()|processing-instruction()" priority="-1">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()|text()|comment()|processing-instruction()"/>
        </xsl:copy>
      </xsl:template>
      <!-- Massage the path attributes. -->
      <xsl:template match="@metrics-file">
        <xsl:attribute name="metrics-file"><xsl:value-of select="$fontPath"/><xsl:value-of select="."/></xsl:attribute>
      </xsl:template>
      <xsl:template match="@embed-file">
        <xsl:attribute name="embed-file"><xsl:value-of select="$fontPath"/><xsl:value-of select="."/></xsl:attribute>
      </xsl:template>
    </xsl:stylesheet>

- In the main pipeline of the root sitemap, add a pipeline to handle generating, transforming and serializing the config file, like this:

    <map:match pattern="teiJournal/fop-config.xml">
      <map:generate src="teiJournal/fop-config-src.xml" />
      <map:transform type="saxon" src="teiJournal/xsl/fop-config.xsl">
        <map:parameter name="fontPath" value="{realpath:/}WEB-INF/" />
      </map:transform>
      <map:serialize type="xml"/>
    </map:match>

- Check that this worked OK, by accessing the fop-config.xml file in your browser, at the location pointed to by the pipeline.
- Get working fonts when you run a PDF generation.
The XML parsing classes are now complete, and can read and deconstruct the XML submission either from a POST variable or from a file on the filesystem (later, I'll add the ability to grab data from a table). They can reconstruct themselves in XML and feed that out, too.
Next the really hard work starts. We need to figure out how to build the complex SQL query to get the data back.
Entered HC's proofing corrections for the Yang article.
Got a set of corrections, including a new contribution type designation that I hadn't predicted ("Column"). Added all the necessary XSLT and made the corrections. I must remember to document the addition of "Column" on the teiJournal project Web site.