Five out of seven complete now.
Category: "Activity log"
IALLT are planning to move their site to a Drupal system, and incorporate the journal in it; Drupal will soon have support for TEI XML, so this is not a bad choice. I spent some time on a Skype call with the developers who will handle the IALLT site, and then packaged up the existing webapp with some basic instructions for them, so that they can test it on their desktops, and perhaps use it as a way of generating the XHTML and PDF content during the changeover.
The IALLT Journal is planning to move to a Drupal installation, but they want to maintain the XML versions of articles, so at their request I've used oddbyexample.xsl to generate a constrained schema and documentation for them. My own documentation, which is more intended for novice markup folks, will need to be updated to account for the changes in markup practice over the last year or two.
The backups page has never quite worked since we moved the project to Pear, and I've never been able to figure out why, but I finally worked through the issues in the flowscript that handles the backups, and discovered the problems were caused by paths, relative and absolute. Within any Cocoon pipeline, and it appears especially within flowscripts, the current context of execution is very difficult to figure out; what comes back from cocoon:/ in one script can be completely different from what comes back in another. In this case, three of the pipelines (the XHTML and the two NLMs) needed one path to call the backupLink.xq script, and the other two (TEI and PDF) needed another. I don't really know why, but it's all working now, at least.
My original NLM export code (from TEI to NLM 3.0) was written based on the old system of <biblStruct>s used in the bibliographies of the first two volumes. I'm now using a much simpler <bibl>-based system inherited from the Mariage project, so I had to add more code to handle this. I now have the NLM 3.0 export generating good valid bibliography lists. The NLM 2.3 is generated from the 3.0, so there's no need to change anything there.
US's article is now marked up and on the server, complete with its appendix (which was missing until today). Received the appendix from DC over the weekend.
Still have a question outstanding with the editor, and I need to make sure the live site is updated with XSLT changes before uploading.
Continued marking up US's article after getting some clarification on style from DC. This one has some more complex tables in it, in particular with cells spanning multiple rows, so I worked on ways to make those render acceptably. Using the TEI @rows attribute (and @cols) is fine when it comes to XHTML rendering, because XHTML tables have corresponding attributes that serve the same purpose, so I got that working for XHTML. However, although there are XSL-FO attributes called @number-rows-spanned and @number-columns-spanned, they do not seem to be supported in the ageing version of FOP we're using, so I was forced to try other strategies. My initial approach was to use the @cols attribute to generate additional fo:table-cell elements as appropriate, but for some reason this also crashed the FOP renderer. In the end, I resorted to coding in empty cells following each of the cells in question, and abandoned the use of @cols and @rows in the markup. This is disappointing, but might be overcome with FOP 1.0, which we may be able to build into Cocoon in future, following the work done by RvdB, based on our build scripts.
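The empty-cell padding is mechanical enough to automate. Here's an illustrative Python sketch (not the project's code, which hand-codes the cells in the TEI and renders via XSLT; names and the sample table are invented) of turning a @cols-spanning cell into the rectangular grid that old FOP expects:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def pad_cols(table):
    """Follow each cell that spans n columns with n-1 empty cells and
    drop @cols, so a naive renderer sees a plain rectangular grid."""
    for row in table.iter(TEI + "row"):
        padded = []
        for cell in row:
            span = int(cell.attrib.pop("cols", "1"))
            padded.append(cell)
            padded.extend(ET.Element(TEI + "cell") for _ in range(span - 1))
        for c in list(row):
            row.remove(c)
        row.extend(padded)
    return table

xml = ('<table xmlns="http://www.tei-c.org/ns/1.0">'
       '<row><cell cols="2">Wide</cell><cell>A</cell></row>'
       '<row><cell>B</cell><cell>C</cell><cell>D</cell></row></table>')
table = pad_cols(ET.fromstring(xml))
print([len(row) for row in table])  # [3, 3]
```

The same idea applies to @rows, except the filler cells go into the following rows rather than the same one.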
I'm about half-way through the document now.
Did a few paragraphs of the main text, but I'm reluctant to go too far right now because I have a significant query in to the editor about the author's use of single quotes, and I need an answer before I do too much more markup. Also, oXygen is annoyingly failing to prompt me with available values from the local document when I type <ref target="#, something that previous versions definitely used to do.
Marking up article by US. Did the biblio, header and abstract, and found and fixed some minor issues in the preceding article in the process.
Marked up the rest of the document (in-text tables, and appendices), and added a few new features and tweaks to the table rendering code. You can now do:
<table rend="layout">
To create a table which has no borders, for layout-only purposes, and you can control the width of individual columns in a table (in the PDF output) by setting a rend attribute in the appropriate cell in the first table row:
<table>
<row>
<cell rend="column-width: 20em;">...</cell>
<cell>...</cell>
</row>
[...]
</table>
This forces a width of 20em on the first column in the PDF. This is important because otherwise, the PDF renderer does no intelligent table-column width calculation; it just distributes the width equally across the cells. It's not so important with the XHTML output because browsers do a good job of calculating and laying out cells, so the setting is ignored in the XHTML output.
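The parsing the PDF stylesheet has to do on that rend value is straightforward; here's a hypothetical sketch in Python (the actual implementation is XSLT; the function name and sample values are invented):

```python
import re

def column_widths(rend_values):
    """Extract a column-width setting from each first-row cell's @rend;
    None means the renderer falls back to dividing the space equally."""
    widths = []
    for rend in rend_values:
        m = re.search(r"column-width:\s*([\d.]+\w+)", rend or "")
        widths.append(m.group(1) if m else None)
    return widths

print(column_widths(["column-width: 20em;", None]))  # ['20em', None]
```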
I'm looking forward to being able to integrate FOP 0.95 into a new Cocoon/eXist build, so we can get around some of the keeps-and-breaks problems with the ancient FOP we have. RVDB's recent Cocoon-building scripts should help a lot with that.
Four of the figures in this document are actually tables, so I'm now marking those up. The first one throws up something we haven't dealt with before: notes below tables. I've now added handling for those, and in the process I also realized that we've been processing the heading for a figure which is a table in the wrong place: in APA, Chicago and MLA, the figure heading goes below the figure, UNLESS it's actually a table, in which case it goes above. I've now revised the XHTML and PDF rendering to make that work properly.
There's more work to do here, but once it's done, I can document table markup properly.
Only the two appendices to go, and I've laid out the structure for them (they live in the <back>, before the bibliography).
The structure of the main document (the <text> element) looks like this:
<text> <body>(main text content)</body> <back>(bibliography)</back> </text>
Inside the <body> element are one or more <div>s, which can be nested. Each <div> may begin with a <head> element containing its heading, followed by a series of paragraphs (<p> tags) or other <div>s. A typical structure looks like this:
<div>
  <head>Introduction</head>
  <p>Intro paragraph...</p>
  <p>Intro paragraph...</p>
</div>
<div>
  <head>Section 1</head>
  <div>
    <head>Section 1.1</head>
    <p>Para in section 1.1</p>
    <p>Para in section 1.1</p>
  </div>
  <div>
    <head>Section 1.2</head>
    <p>Para in section 1.2</p>
    <p>Para in section 1.2</p>
  </div>
</div>
<div>
  <head>Section 2</head>
  [...]
</div>
[...]
The formatting of headings will be handled automatically, according to the APA styleguide, based on the level of nesting.
The first section, a review of the literature, is very dense with references, so it's quite time-consuming to mark up. Found one missing reference, and wrote to the author.
It was huge (52 items). Grr.
Perhaps this is one of those articles where the author includes everything s/he has ever read in the biblio, even if it's not referenced in the text...
... expanded the instructions for biblio entries, and also added an item on marking up abbreviations.
All abbreviations need to be tagged, so the system can provide an appropriate mouseover hint to help readers who aren't sure what they mean (and to help populate our abbreviation index). This is how to do it:
- The first time a specific abbreviation appears in the text, tag it in full, with its expansion, like this:
<choice> <abbr>CAIN</abbr> <expan>Computer Anxiety Index</expan> </choice>
- Tag any subsequent instances of the abbreviation using just the abbr tag, like this:
<abbr>CAIN</abbr>
The system will be able to look back in the text to find the expansion for each instance of the abbreviation, taking it from the first, fully marked-up version.
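The lookback can be sketched as follows; this is an illustrative Python version (the real system does this in XSLT, and namespaces are omitted here for brevity):

```python
import xml.etree.ElementTree as ET

def expansion_for(root, abbr_text):
    """Return the <expan> recorded at the first fully tagged occurrence
    (<choice><abbr/><expan/></choice>) of the given abbreviation."""
    for choice in root.iter("choice"):
        abbr, expan = choice.find("abbr"), choice.find("expan")
        if abbr is not None and expan is not None and abbr.text == abbr_text:
            return expan.text
    return None

doc = ET.fromstring(
    "<p><choice><abbr>CAIN</abbr><expan>Computer Anxiety Index</expan></choice>"
    " scores rose; later mentions use just <abbr>CAIN</abbr>.</p>")
print(expansion_for(doc, "CAIN"))  # Computer Anxiety Index
```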
Marking up the bibliography of an article can take as much as half the entire markup time for the article. This is because the information in a biblio reference is quite detailed, and in order to be harvestable and useful it needs to be marked up carefully. The bibliography also needs to be marked up before the text itself, because the text will be full of links to items in the bibliography, so their @xml:id attributes must be known before we can mark up the text. The bibliography of the article appears in the <back> element of the <text> element, and it looks like this:
<back>
  <div type="bibliogr">
    <head>References</head>
    <listBibl>
      <bibl>[...]</bibl>
      <bibl>[...]</bibl>
    </listBibl>
  </div>
</back>
Each item in the bibliography is contained by a <bibl> element, which looks like this:
<bibl xml:id="aida_1994"> <author> <name><surname>Aida</surname>, <forename>Y.</forename></name> </author> (<date when="1994">1994</date>). <title level="a">Examination of Horwitz, Horwitz, and Cope’s construct of foreign language anxiety: The case of students of Japanese.</title> <title level="j">The Modern Language Journal</title>, <biblScope type="vol">78</biblScope>, <biblScope type="pp">155-168</biblScope>. </bibl>
The original reference in this case looked like this:
Aida, Y. (1994). Examination of Horwitz, Horwitz, and Cope’s construct of foreign language anxiety: The case of students of Japanese. The Modern Language Journal, 78, 155-168.
Key points:
- Each <bibl> element must have a unique @xml:id attribute, created from the lower-case surname(s) of the author(s), followed by an underscore and the year of the document. In the case of multiple documents from the same year, add a suffix such as a, b, c etc. The @xml:id attribute is what will be used to link references in the text to the bibliographical items they refer to.
- The components of the reference remain in their normal sequence. All we need to do is to apply some markup to them; we don't change the actual text, or the order of items, at all.
- We attempt to mark up everything we can usefully mark up.
- Titles are marked up using the <title> tag, with the @level attribute showing the kind of title it is. These are the values for the @level attribute:
  <title level="a">This is an article title</title>
  <title level="m">This is a book or monograph title</title>
  <title level="j">This is a journal or periodical title</title>
  <title level="s">This is a series title</title>
  <title level="u">This is an unpublished title</title>
- Authors are marked up with the <author> tag, which contains a <name> tag; inside the <name> tag, the <surname> and <forename> are tagged. For multiple forenames or initials, just use a single <forename> tag, like this:
  <author><name><surname>Holmes</surname>, <forename>Martin David</forename></name></author>
  Any punctuation (such as the comma between surname and forenames) should be left outside the <surname> and <forename> tags.
- Editors are tagged just like authors, but they use the <editor> tag instead.
- Dates are wrapped in a <date> tag, and the value of the date is added in the @when attribute of the date tag. Normally, in the case of a year, this will be identical to the content of the tag:
  <date when="1994">1994</date>
  but the @when attribute takes a formal ISO date in the form YYYY-MM-DD, with optional MM and DD, so in some cases the @when attribute will be different from the tag content, like this:
  <date when="1994-01">January, 1994</date>
- Enclose a publisher in the <publisher> tag.
- Enclose the publication place in a <pubPlace> tag.
- Edition information should be wrapped in an <edition> tag:
  <edition>3rd ed.</edition>
- For volume/issue numbers, use <biblScope> tags, with the appropriate @type attribute:
  <biblScope type="vol">28</biblScope>
  <biblScope type="issue">3</biblScope>
- For page numbers, use the same tag with @type="pp":
  <biblScope type="pp">26-45</biblScope>
- For links to external URLs, use a <ref> tag, with the URL in the @target attribute; whatever you would like to show as the linked text (usually the URL itself) should be inside the <ref> tag:
  <ref target="http://hotpot.uvic.ca/">http://hotpot.uvic.ca/</ref>
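The @xml:id convention is mechanical enough to sketch in code. Here's a hypothetical Python illustration (the entry data is invented) of the surname_year scheme with a/b/c disambiguation:

```python
import re, collections

def make_ids(entries):
    """entries: (surnames, year) pairs in bibliography order. Build
    lower-case surname_year ids, suffixing a, b, c... only when the
    same author(s)/year combination occurs more than once."""
    bases = ["_".join(re.sub(r"\W", "", s.lower()) for s in surnames) + "_" + year
             for surnames, year in entries]
    totals = collections.Counter(bases)
    seen = collections.Counter()
    ids = []
    for base in bases:
        if totals[base] > 1:
            seen[base] += 1
            ids.append(base + "abcdefgh"[seen[base] - 1])
        else:
            ids.append(base)
    return ids

print(make_ids([(["Aida"], "1994"),
                (["Chun", "Plass"], "2000"),
                (["Chun", "Plass"], "2000")]))
# ['aida_1994', 'chun_plass_2000a', 'chun_plass_2000b']
```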
Here are some more real-life examples:
<bibl xml:id="chun_plass_2000"><author><name><surname>Chun</surname>, <forename>D. M.</forename></name></author> & <author><name><surname>Plass</surname>, <forename>J. L.</forename></name></author> (<date when="2000">2000</date>). <title level="a">Networked multimedia environments for second language acquisition.</title> In <editor><name><forename>M.</forename> <surname>Warshauer</surname></name></editor> & <editor><name><surname>Kern</surname>, <forename>R. G.</forename></name></editor> (Eds.), <title level="m">Network-Based Language Teaching: Concepts and Practice</title> (pp. <biblScope type="pp">151-170</biblScope>). <pubPlace>Cambridge</pubPlace>: <publisher>Cambridge University Press</publisher>.</bibl>
<bibl xml:id="daly_1991"><author><name><surname>Daly</surname>, <forename>J.</forename></name></author> (<date when="1991">1991</date>). <title level="a">Understanding communication apprehension: An introduction for language educators.</title> In <editor><name><forename>E. K.</forename> <surname>Horwitz</surname></name></editor> & <editor><name><forename>D. J.</forename> <surname>Young</surname></name></editor> (Eds.), <title level="m">Language Anxiety: From Theory and Research to Classroom Implications</title>. <pubPlace>Englewood Cliffs, NJ</pubPlace>: <publisher>Prentice Hall</publisher>.</bibl>
<bibl xml:id="blake_2000"><author><name><surname>Blake</surname>, <forename>R.</forename></name></author> (<date when="2000">2000</date>). <title level="a">Computer-mediated communication: A window in L2 Spanish Interlanguage.</title> <title level="j">Language Learning and Technology</title>, <biblScope type="vol">4</biblScope>, <biblScope type="pp">120-136</biblScope>. Retrieved <date notAfter="2004-03-05">March 5, 2004</date> from <ref target="http://llt.msu.edu/vol4num1/blake/default.html"></ref>.</bibl>
I've tweaked the XSLT so that bibl is now acceptable in place of biblStruct, making the markup of bibliographies (and the maintenance of the biblio rendering code) much simpler. This is working for both XHTML and PDF rendering, in the /apa/ rendering pipeline. I still need to look at NLM output code. I've marked up a few items in the first article, and also tested the popup biblio references.
I haven't yet migrated the changes to the main site.
I've taken a local copy of the whole application off the server, so I can work on it without messing with the server install. If this project turns out to run long-term, it should really be ported to eXist 1.4, but that seems impractical right now; there's a lot of stuff in there (such as FOP, and XUpdate) which will need special attention, and there's no time. For that reason, I'm going to continue working on the existing webapp.
My first task is to simplify the bibliography code, by moving from a biblStruct to a bibl system. I'll leave the existing biblStruct handling code in place, because I don't want to have to re-encode the existing articles, but the time saved by using bibl is so significant that it'll be worth writing a bit more XSLT to handle it.
The first new content came in from the editor last week, in the shape of one file. Set up the file, did the header, and began the biblio, which is huge. Also made one entry in the new Hit by a bus category, in which I'll document my markup practices, ready to hand off to the editorial team as soon as that is practical. Wrote a section on the title statement. This biblio is large enough to give me a significant opportunity to document the bibliography practices, although I might also take this opportunity to change the system over to using <bibl>s instead of <biblStruct>s, for simplicity.
The title statement (<titleStmt>) tag, in <TEI/teiHeader/fileDesc>, contains the key information about the author and title which is used to generate the document title on the page, the title in the table of contents, and the running titles. It looks like this:
<fileDesc>
  <titleStmt>
    <title level="a" type="main">The effects of Asynchronous Computer Voice Conferencing on L2 Learners’ Speaking Anxiety</title>
    <title level="a" type="sub"></title>
    <title level="a" type="runningRecto">Asynchronous Voice Conferencing and Speaking Anxiety</title>
    <title level="a" type="runningVerso">Poza</title>
    <author>
      <name>
        <forename>María Isabel Charle</forename>
        <surname>Poza</surname>
        <affiliation><name type="org">Lincoln University</name></affiliation>
      </name>
    </author>
  </titleStmt>
</fileDesc>
If there is a subtitle as well as a main title, enter it into the subtitle field (@type="sub"), which is optional. The running title (@type="runningRecto") will appear as the running title on the recto page; by convention, this is a shortened form of the main title. The runningVerso title is the surname of the author. The author's name should be broken up into <forename> and <surname> tags, with all forenames and initials going into the <forename> tag.
The UVic part of the setup is done, and I've let the IALLT folks know they need to point the domain's DNS records at the UVic servers.
Began the process of homing the domain with a request to sysadmin. Looks like the project is coming alive again!
Just heard over the weekend that the board has agreed to open up the journal, so I've taken the necessary steps to do that. This post explains how we were locking down the system; I started by changing the ialltjournal.xml file on the server so that it was set to allow="*". Then we restarted Tomcat, but it didn't work; nothing was accessible. So in the end I just deleted the file on the server, which achieved the desired result. We're still working out the details of how the ialltjournal.org site will interact with the lettuce site.
Following a request from BB (one yesterday, another today), I changed the allowed-ip settings in the Tomcat config to accommodate their moving to a new provider. This raised the issue of the future of the journal. I'm pushing hard for open-access, since I think it will die without it, and in any case it's difficult to justify our time going into development on a project which can never be demonstrated or shared with the wider community. The IALLT conference is going on right now, so I hope they'll address this issue. If they do, and they vote to continue closed-access, we can continue to host, but I think we'll have to train up their folks to do the markup; I'll shift the bulk of my development efforts on this project to CJBS. If they go for open-access, then the project becomes live again, and I can cross-develop with the two journals; that's the ideal situation. They may also prefer to migrate the content to static formats and go for their own install of e.g. OJS on their own server; I've prepared for that by making all the content available in static packages ready for porting to any system that'll handle XHTML or PDF. I think the main thing is to get a decision one way or the other before the summer, so I can do my own project planning.
Got the latest version from LR this morning, and then worked through the whole document, refining and rewriting my sections, standardizing formatting, and reorganizing a bit. Then I wrote a conclusion that tries to bring it all together. There will definitely be a few more hours to put in over the weekend, but I think we'll meet the deadline (Sunday).
Drafted an abstract as a way of getting a handle on reorganizing the structure, and wrote one more section, on publication engines.
Tweaked the configuration of Tomcat (in the conf section, as discussed in a previous posting) to add a new IP address to the "allowed" IPs, since the IALLT Journal is moving to a new server location. Asked the sysops to restart Tomcat.
More redaction of the first draft of the book chapter, culminating in a decision that we need to go back to the beginning, answer some basic questions about our intentions, and then rewrite. Sent materials and lots of questions back to LR for his comments. The deadline is already looming, but it's only after you've written most of it that you discover what you should have been writing about. At least I'm beginning to know more or less what my opinions are, which helps a bit...
Got a draft of the chapter from LR in MSWord format, and began some collaborative editing; I'm able to open a docx file in OOo, but I can only save a doc file, and we don't yet know whether the "change tracking" from OOo will be saved appropriately in doc format so LR can see my changes as changes. I did only six pages as an initial test; if we can work this way, I'll continue. The deadline looms.
It appears that CJBS will continue as "a publication of Nalanda College", so I've now redone the banner image for the flyer, incorporating that information in the same way it appeared in the original cover art.
The Canadian Journal of Buddhist Studies is moving here under the auspices of MTA, so we'll be creating a website for it and a teiJournal installation. In the meantime, they'd like a flyer to take to the Congress at the end of May, so I've started work on logo/banner graphics, and a gatefold flyer like the one for ScanCan.
Did the second half of Dalvi. Lots of footnotes and references, so it's a bit time-consuming. One issue that needs some thought is the frequent appearance of Sanskrit terms, which are italicized; so far, I've been marking them as <foreign>, but most (not all) are actually terms in the sense of TEI <term>, and probably ought to be marked as such; however, if there's no plan to do anything with terms (such as index and define them), then that would be needless work. We can always find all instances of <foreign> and re-examine them at a later date, if we want to re-mark some of them as <term>; and since both would have the same effect on the output (italicization), this would change nothing.
As I work through the Dalvi text, I'm implementing what was described in the previous post, and I thought I'd post a couple of examples of markup for later reference, subject to modification as the wrinkles are ironed out. Here's a blockquote:
<cit rend="block">
<quote>Why have I left that undeclared? Because it is unbeneficial, it does not belong to the fundamentals of spiritual life, it does not lead to disenchantment, to dispassion, to cessation, to peace, to direct knowledge, to Enlightenment, to <foreign>Nibbāna</foreign>. That is why I have left it undeclared.<note corresp="#bodhi_2005"><title level="m">In the Buddha’s Words</title>, 230-33.</note></quote>
</cit>
The <note> element here occurs within the <quote> element; I'm not sure if that's the right place for it, but it's probably simplest because it ensures that the note number occurs attached to the quote, as opposed to wrapped to the block below. In contrast, with inline quotes, we have a different setup:
<cit><quote>It is by understanding the nature of reasoned inquiry, epistemology and debating theory that one attains the highest goal (nihṣreyasa).</quote><note corresp="#vidyabhusana_1990"><title level="m">Nyāya Sutra</title> 1.1.1.</note></cit>
Here we don't want the <note> inside the <quote> element, because that would place it inside the quotation marks if these are being supplied by the rendering code; however, it remains inside the <cit> tag, ensuring that the reference is associated closely with the quotation.
So far so good...
In marking up the next CJBS article, I've discovered that it has both note-references and a full bibliography. This raises a slight issue that I'll have to deal with. Up to now, I have the following patterns:
- Notes are notes -- text, rather than pointers to references. In this case, the <note> element has text in it.
- Notes point directly to a full citation, in which case the <note> element contains nothing at all; it has @corresp pointing to the bibliographical item in the biblio list at the end. In the Sumegi text, this is the pattern, and the bibliography itself is not reproduced other than in the notes.
- References are done through author-date items in the text itself, which are linked through a <ref> element to the relevant biblio item (<ref target="#spodark_2005">Spodark, 2005</ref>); in this case, the biblio list is reproduced at the end in full. The IALLT Journal style works this way.
Now we have a fourth pattern exemplified in the Dalvi CJBS text:
- Note numbers in the text point to footnotes containing brief citations (e.g. In the Buddha’s Words, 230-33.). The brief citation actually refers to a text which is in the full bibliography, which is also reproduced (above the endnotes), but there's no explicit link between the short endnote and the biblio item; the reader has to figure out that relationship for him or herself.
To deal with this, I think we need to operate as follows:
- The note element should contain the text of the note, as normal. However, it should also have @corresp, pointing to the bibliographical entry, if there is one.
- The processing code for XHTML should automatically provide a link to pop up the full biblio entry from the endnote, so the user can click on the endnote number to see the endnote itself (the short citation), and then click on something else to show the full biblio entry. Another option is to make the short citation in the endnote a <ref> element itself, thus turning it into a link automatically. This would perhaps be more consistent with the other types of referencing above.
- We need a way in the XML to distinguish between a listBibl which is not displayed (as in the Sumegi text) and one which is (as in the Dalvi). Previously I was assuming that the existence of @corresp attributes on note elements would be enough to distinguish a text using that form of referencing (in which the actual listBibl is not displayed) from one in which it is displayed; but now that won't work (and in any case it was a bit arbitrary). So perhaps the best option is to distinguish two types of div in the back element:
  //text/back/div[@type="bibliogr"] (the normal type up to now, which is displayed)
  //text/back/div[@type="refList"] (the type exemplified by the Sumegi text, which is not displayed, but from which the endnote references are drawn)
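If we go with that distinction, the rendering decision becomes a simple test on @type. A hypothetical sketch in Python (the journal's code is XSLT; the element names follow the proposal above):

```python
import xml.etree.ElementTree as ET

def displayed_bibliographies(text_el):
    """div[@type='bibliogr'] is rendered in the output; div[@type='refList']
    only feeds the endnote references and is suppressed."""
    return [d for d in text_el.findall("back/div") if d.get("type") == "bibliogr"]

doc = ET.fromstring('<text><back><div type="refList"><listBibl/></div>'
                    '<div type="bibliogr"><listBibl/></div></back></text>')
print(len(displayed_bibliographies(doc)))  # 1
```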
Having just marked up five book reviews for ScanCan in P4, I'm coming to the first instance of a book review for the teiJournal system, and I'm anxious to avoid the error-prone repetition that the old system suffers from. I'll document all my steps here:
- A book review is a little odd in that it has two titles: the title of the review, and the title of the book itself. It also needs to have a solid encoding of the bibliographical information relating to the book being reviewed, and ideally this info is only encoded in one place. I've elected to place it in a <biblStruct> element directly inside the main article title (<title level="a">). This is a provisional decision; I'm still not sure how I'd handle the range of possible situations. I imagine, though, that this setup will handle three situations: where there's one book; where there's more than one book (just two <biblStruct>s); and where there's an actual title for the review that isn't the same as the book(s) reviewed; in the last case, I'll probably just add text before the first <biblStruct>, and have the XSLT detect that text and respond accordingly.
- There are two levels of affiliation information, short and long (the print volume includes a section with long affiliations). The <affiliation> tag itself doesn't allow @type (why? this seems arbitrary), so I've elected simply to use two tags, with the short affiliation first, and the long one second.
With HC's departure as editor, I've amended the masthead to remove her and one editorial assistant, but put them on a secondary page which lists previous editorial staff.
With some help from the Cocoon list and from my own experimentation, it seems that {system-properties:os.name}, {system-properties:file.separator} and other similar variables are available in the sitemap using the SystemPropertiesModule input module, which is enabled by default.
Migrated fop-config.xsl to the production server (see previous message). PDF fonts seem to be working fine.
The problems with Windows continue, this time with PDF generation, which depends on FOP being able to find the TTF font files. The trailing-slash issue seemed to be at the heart of it, but then I discovered a second problem: the root path contains backslashes, while the relative path component for the fonts uses forward slashes.
After much pain and suffering, looking for a way to determine what host os Cocoon is running on, I decided that I can make that decision based on the presence of backslashes in the root path. If there are backslashes, then I conclude that it's running on Windows, so I need to change forward slashes to backslashes, and make sure the trailing slash is added to the end of the root path.
This is done in the file:
fop-config.xsl
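The decision logic can be sketched in Python for reference (a hypothetical helper; the real logic lives in fop-config.xsl as XSLT string handling):

```python
def normalize_realpath(path):
    """Guess the host OS from the separators in the root path: if it
    contains backslashes, assume Windows, unify every separator to a
    backslash, and guarantee a trailing separator either way."""
    sep = "\\" if "\\" in path else "/"
    path = path.replace("/", sep)
    if not path.endswith(sep):
        path += sep
    return path

print(normalize_realpath("C:\\apps\\cocoon/webapps"))  # C:\apps\cocoon\webapps\
print(normalize_realpath("/usr/local/cocoon"))         # /usr/local/cocoon/
```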
Tomorrow, I'll try migrating this solution to the live server, with the usual caution.
Cautiously migrated the four changed files listed in the previous post to the main site, and tested; PDFs are working fine, with embedded images, and backups are also working.
Implemented the #3 solution from my previous post, on the local machine. In the process, I discovered an oddity in flowscript: if you have a string variable in flowscript (in this case mine is created from a Java String object), it doesn't seem to have a .length property, but it does seem to have a .length() method. Annoying and confusing.
I now have PDF generation with images working locally, and backups working (they also depend on the realpath value).
This was done by making changes to these files:
save_transformations.flow
sitemap.xmap
utilities.xsl
pdf_general.xsl
I should be able to migrate those changes to the main server without endangering much else, for testing; I'll try that on Monday. Once that's out of the way, and if it's confirmed to be working, then the whole site is portable again and I can go back to working on the corpus PDF code, which is quite complicated.
To further my work on the corpus PDF document, I've set up a local copy of the journal on my system. This has thrown up a couple of issues related to platform compatibility. The main one is that the Cocoon realpath:/ protocol returns a path ending in a trailing slash on Linux, but on Windows there's no slash. This means (among other things) that images are not retrieved and used for PDF generation, because the path to the images folder is created based on realpath, and ends up lacking a slash.
Cross-platform compatibility is important for this project, so I need to solve this. These are some of the options:
- Instead of passing in a fully-constructed path from the sitemap to the XSLT transformation, pass in only the realpath variable. The XSLT can then check for a trailing slash before building the full path by adding "teiJournal/images". This means that part of the path to the images is hard-coded into the XSLT, which perhaps makes it harder to change.
- Find some way to check the value returned by realpath inside the sitemap, and branch based on that. (I don't know of any way to do that, other than flowscript.)
- A modified version of #1, which would pass both realpath and the subsequent component (teiJournal/images in our current setup), then have the XSLT combine the two to create the full path.
On balance, I think #3 is best because it allows for configuration within the sitemap, but platform flexibility through the XSLT.
This needs to be tested live on the server, though.
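Option #3 reduces to a small join-with-slash-check; here's a hypothetical Python rendering of it (the real combination happens in the XSLT, and the paths are invented):

```python
def image_dir(realpath, component="teiJournal/images"):
    """Join the realpath value (with or without its trailing slash,
    depending on platform) to the component passed in from the sitemap."""
    if not realpath.endswith(("/", "\\")):
        realpath += "/"  # Java/FOP generally tolerate a forward slash on Windows
    return realpath + component

print(image_dir("/var/cocoon/webapp/"))  # /var/cocoon/webapp/teiJournal/images
print(image_dir("C:\\cocoon\\webapp"))   # C:\cocoon\webapp/teiJournal/images
```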
Set up the pipeline for generating a PDF of an issue or annotation, and began work on it. First of all, I had to modify some of the base PDF code, which wasn't designed to handle a situation in which there were multiple documents. The most obvious case was the handling of end notes for each article, which is now fixed. Another case, still not fixed, is the handling of running titles and headers. These need to be derived per-TEI, rather than globally; my templates are rather too general, and result in all the items from all the articles being agglomerated and used en masse everywhere. One issue with this is that right now, there's presumably quite a lot of activity on the site, so I need to be careful not to break the existing PDF code.
However, I have a composite document rendering, with page-numbering and the like working OK. Before I go any further, I think I need to build a local copy of the system and work on that, though, so the main site isn't at risk.
The "corpus output" (a composite file consisting of a collection of individual contributions -- an issue, or an ad-hoc anthology) now has a complete <teiHeader> element, created by cherry-picking and combining various bits and pieces from the component documents. This is rather a complicated business, and is not always precisely what you want -- for instance, if you combine documents from two vol/issue sets, you'll get idno elements for each volume and issue, but no clear indication of which volume number goes with which issue number. Still thinking about that.
Completed the last two features of the backup system: there's now a button which can back up all the files, one by one; and there's a link to download a zip archive of all the files.
Because of (very reasonable) browser restrictions to prevent cross-domain scripting, it proves to be impossible to retrieve the response from the server when doing either an individual file backup or all of them, when the site is proxied through Bruno's server. What happens is that the JavaScript on the client runs, but the XMLHttpRequest object quite sensibly says that it can't do anything with the response because the response originates from a server which is not the one serving the page. Fair enough; I have some error-trapping in place which gives reasonable feedback for that, and given that (we hope) this password-protection nonsense will go away soon, we can live with it.
The other feature is pretty cool, and worth documenting in some detail. This is how it works:
- A directory generator is called to create Cocoon's standard directory generator XML output.
- This is run through an XSLT transformation which formats it as expected by Cocoon's zip serializer.
- The zip serializer streams out a zip file, as if it were a normal download of a static file. Whether it's creating a temp file first, then serializing it out, or serializing the zip file on the fly, file by file, is difficult to know, but it's very fast and it works perfectly.
So now, with one click, the editor can create backups of every output format of every file on the system; and then download a zip archive of the results.
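The three-step pipeline above might look something like this in the sitemap (a hedged sketch: the match pattern, stylesheet name and directory path are my own assumptions, though the directory generator and zip serializer are stock Cocoon components):

```xml
<map:match pattern="backups.zip">
  <!-- 1. List the backup directory as XML. -->
  <map:generate type="directory" src="backups/"/>
  <!-- 2. Reshape the listing into the form the zip serializer expects. -->
  <map:transform src="xsl/dir_to_zip.xsl"/>
  <!-- 3. Stream the result out as a zip download. -->
  <map:serialize type="zip"/>
</map:match>
```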
Having got the actual mechanisms of the backup system working (generating files and saving them onto the server), I started integrating the AJAX scripting so that backup calls don't actually leave the page. Getting it working with Firefox was easy -- I initially had a pipeline that returned a single XHTML anchor tag, of which the text was the new modified date of the file -- but the predictable problems surfaced with IE, so I had to rethink. In the end, I have this:
- The "Generate backup" button makes a call to the appropriate flowscript function for the type of backup being made.
- The flowscript creates the backup, then reads its last modified date.
- The flowscript hands off (through a cocoon.redirectTo call) to an XQuery script, passing the last modified date.
- The XQuery script generates a single XML node, which is actually another anchor tag.
- The output from the XQuery is processed into text by an XSLT script.
- The text is received by the AJAX object, which uses it to replace the text in the existing anchor.
While the page is waiting, the target anchor tag has its text replaced with a series of dots, which shows a simple progress system. The effect is that the date/time shown on the link is updated on the page, every time a new backup is generated.
This is tested and working on FF, IE, Safari and Opera. The next stage is to create a sequence wrapper which can call the script many times, causing the items to update one at a time, while showing per-item progress somewhere at the top of the page. That shouldn't be too hard now.
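The client-side piece of the flow above can be sketched roughly like this (the element id, URL and function name are my own illustrative choices, not the project's actual script):

```javascript
// Fire the backup pipeline, show dots while waiting, then replace the
// anchor text with the returned last-modified date.
function generateBackup(anchorId, url) {
  var anchor = document.getElementById(anchorId);
  anchor.textContent = '...';   // simple progress indicator
  var xhr = new XMLHttpRequest();
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4) {
      if (xhr.status === 200) {
        // The pipeline returns plain text: the new modified date.
        anchor.textContent = xhr.responseText;
      } else {
        anchor.textContent = 'Error: check folder permissions';
      }
    }
  };
  xhr.open('GET', url, true);
  xhr.send(null);
}
```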
A kind soul on the Cocoon list pointed out a missing variable in my XHTML code (the "realpath" component was missing from the path to the folder I was trying to create -- a dumb typo), and I've now got XHTML backups working fine, including saving copies of the CSS and JS into the backup folders, to create independent packages for each document.
The next stage is to make this all function through AJAX. This isn't too complicated, but it will require some thought; I need to return some useful content (the date/time of the file created), in a usable format, to replace the current info on the page, so I'll need to read that using Java once the file is created.
Rewrote my backup page so that it now shows a useful and more readable table of all the files that exist, along with buttons to generate or re-generate hard copies of any documents. Documents that exist are shown as the date/times of their last modification. The generation pipelines now redirect, on success, to the backups page they were called from, with a hash component to make sure the page doesn't jump around too badly when there's a lot of content on it. What's there so far works great.
However, I've spent all day struggling with the problem of creating directories in which to store the individual XHTML files. They should each have their own directory so that copies of the JS and CSS can also be stored there; that's required in order to make them actually functional and portable. However, I can't get the flowscript to create a new directory. After many trials, I've finally posted a message to the Cocoon mailing list. We'll see if anyone can suggest what might be causing the problem.
Updated some of the site pages in the protected version of the journal to match the external copies maintained on the journal website by the editors. This duplication is annoying, but it should go away if and when the journal is opened up to the world. We're also having preliminary discussions about digitizing old print versions, including the possibility of marking up some of the content.
Wrote the pipelines and flowscript for saving PDF, TEI, NLM 3.0 and NLM 2.3 files to the file system, and created a basic list page which links to the files that are there, so you can create backups and view them.
Next, I want to create the XHTML output, which involves creating a separate folder for each document, and dumping not only the output but also copies of all the CSS, JS and so on into the folder. Then it might be possible to zip the contents of the folder into a zip file.
The backups control page should also be arranged as a table, showing file generation dates, to make it clearer exactly what's happening.
After discussion with JF, WP and LR, I've made a formal proposal for a TEI SIG to work on proposals for a schema and for some controlled vocabulary lists (for bibliographic item types, and journal contribution types).
Came across this interesting list of biblio item types today: http://vocab.org/biblio/schema. I need to compare this with my list, to provide the basis for discussions on the bibliographic item type list.
HC pointed out that sorting by Vol/Issue (by clicking on the TOC table column headers) ignored the editorial sequence of items within issues. That's now fixed. Whenever the TOC is accessed "unsorted", or sorted ascending or descending by Vol/Issue, the individual items are sorted by editorial sequence within the issue.
The sort order in the contents page was a bit of a hack, based on an ad-hoc @n attribute in the root element; it worked fine for a single issue, but multiple issues require more sophisticated sorting, so I've replaced that attribute with a more formal designation of each document's position within its issue:
<publicationStmt>
  ...
  <idno type="itemNo" n="03"></idno>
  ...
</publicationStmt>
The leading zero makes for easier sorting in cases where perhaps the XPath or XSLT engine doesn't handle strict typing well. The <idno> element is mapped to the <elocation-id> element in the NLM output, giving us better NLM metadata.
The default sort sequence is now more sophisticated: it sorts by volume and issue descending (40, then 39, etc.), then by item number ascending (01, 02, 03...), so that the most recent issues sort to the top, but content within each issue sorts in sequence.
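The default sort can be restated as a small comparison function (field names are illustrative; the actual implementation is in the XSLT):

```javascript
// Volume and issue descending, item number ascending within each issue,
// so recent issues sort to the top but content stays in editorial order.
function sortContents(items) {
  return items.slice().sort(function (a, b) {
    if (a.volume !== b.volume) return b.volume - a.volume; // descending
    if (a.issue !== b.issue) return b.issue - a.issue;     // descending
    // Zero-padded item numbers compare correctly as strings: "01" < "02".
    return a.itemNo < b.itemNo ? -1 : a.itemNo > b.itemNo ? 1 : 0;
  });
}
```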
The documents for 40.1 have now been promoted to the published collection, and the editors will announce the publication after they've done a final check.
Following the recent loss of Joe Sheehan and Jim Pusack, the editors sent an in memoriam piece to be included in volume 40.1. This is now marked up, and necessitated also the addition of a new contribution type, which is tagged as "obituary".
I'm in the process of building a system whereby editors can go to a specific page on the site, and from it, view all the possible output formats for every document in the db; they can also generate "hard" copies of each of these formats and store them on the server as a permanent record (in case the db or Cocoon become unavailable).
I made the following progress today:
- A sitemap pipeline is in place to use the Directory Generator to create a list of the documents already in the backup directory (which is backups/{styleGuide}).
- Another pipeline, called by backups.xml?styleGuide={styleGuide}, generates a TEI file which lists all the documents in the database by xml:id; it also has an XInclude instruction in it which is then processed, pulling in the directory listing from the above pipeline and creating an XML file which has a list of the contribution documents, along with the details of all the "hard backup" copies that already exist.
- Another pipeline, backups.xhtml?styleGuide={styleGuide}, processes that through an XSL transformation, which is currently only in skeleton form but will eventually create a rich page with links to view and generate any or all of the hard backups available.
So I guess I'm about a third of the way through this; the next stage will be generating individual items on demand, and the final stage will be generation of multiple items (all the PDFs, say, or all the XHTML files), along with a decent GUI reflecting progress and completion.
Wrote the initial phase of a system for building PDF volumes. I now have XQuery that can retrieve a set of documents based on volume and/or issue number, or an arbitrary sequence, and return the results as a <teiCorpus>. I still have to figure out how metadata should be dropped in (derived from the first in the sequence, or some other approach), and how much metadata should go in through XQuery as opposed to XSLT. Both have access to the db's string resources, so both could do it.
Made a couple of small changes to HM's preface article on her instructions, then started fixing some little niggles that have been annoying me:
- The Search function appears on both the regular TOC page and the proofing TOC. In both cases, it ran its search against the regular documents, bouncing back to the regular TOC page with the results. It now works independently for each page, so if you search on the main TOC, you get results from the main collection, but if you search on the proofing TOC you get results from docs in the proofing collection.
- Clicking on the column header sort links in the proofing TOC page also bounced you back from the proofing TOC to the main TOC. That's now fixed.
- Article XHTML output always included a link at the top to the main TOC, even when the document was in the proofing collection. It now distinguishes between the two locations of the document, and gives you a link back to the proofing TOC if the document is in the proofing collection.
Received HM's preface article for vol 40, and marked it up. In the process, I marked up a series of references to the other articles in the issue, using relative links. This threw up a new situation in the PDF export. Previously, I was handling two distinct cases of links in PDF output:
- Those beginning with #, being internal links; easy to process into internal PDF links.
- Those beginning with http, being external Web links, also easy to process into external PDF links.
Now there was a third case:
- Those not beginning with a hash, and not containing a protocol, meaning that they're relative.
Relative links don't work in PDFs because they're typically downloaded to a temp folder and opened from there. It was necessary to reconstruct the full original URL of the request so that I could build a full link, to create a proper PDF external link. The Cocoon {request:requestURI} variable, available in the sitemap, should, according to the documentation, give you the full URI, but it doesn't; it gives you only the path after the port. After some reading around, I found the solution. In the sitemap, you pass this into the XSLT transformation:
<map:parameter name="browserURI" value="{request:scheme}://{request:serverName}:{request:serverPort}{request:requestURI}" />
giving you a variable with the full request path, and then you can do this, to reduce the path to its directory, in the XSLT:
<xsl:param name="browserURI" />
<xsl:variable name="uriDirLength" select="mdh:lastIndexOf($browserURI, '/')" />
<xsl:variable name="docPath" select="substring($browserURI, 1, $uriDirLength+1)" />
I'm blogging this in detail because the Cocoon documentation is ropy in this area, and I've previously tried to figure this out and failed.
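For reference, the link triage and the directory reduction can be restated in a few lines (hypothetical function names; the journal's real code is the XSLT shown above):

```javascript
// Classify an href into the three cases handled in PDF output.
function classifyLink(href) {
  if (href.indexOf('#') === 0) return 'internal';   // internal PDF link
  if (/^[a-z]+:\/\//.test(href)) return 'external'; // external Web link
  return 'relative';                                // needs the full request URI
}

// Reduce the full request URI to its directory, so a relative link
// can be resolved against it to build a full external PDF link.
function requestDir(browserURI) {
  return browserURI.substring(0, browserURI.lastIndexOf('/') + 1);
}
```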
We're having some preliminary discussions with some members of the TEI community about the possibility of a SIG to work on a TEI-based journal publishing schema. I've spent some time in the last couple of days thinking and participating.
Looks like OJS will simply be taking the donated code for NLM handling, rather than writing their own, and the donated code uses NLM 2.3, so I'm now writing a converter to turn my 3.0 output into 2.3. There are about four major areas of difference, two of which I've already dealt with, and I've also gone back and elaborated one area of the original TEI-to-NLM-3.0 conversion for greater consistency.
Basic backup (saving the output of a pipeline into a directory on the filesystem) is now working, thanks to some help setting permissions from sysadmin.
Now a proper GUI and plan needs to be implemented. This is my first shot at a plan:
- There needs to be an editorial index page which shows a list of all the documents (published and in proof) which are in the system.
- The page should have links to back up specific output formats for each document.
- Those links would invoke the pipeline which calls the flowscript, but they would do it through an AJAX call, so that the index page does not need to be replaced.
- The AJAX script would call the pipeline, and write the server response to a <div> on the page.
- The server response needs to be encoded in a TEI file of messages, which is stored in the db. This would be similar to the site.xml file which currently holds all the site rubric.
- The pipeline which sends back the message would retrieve a block of something from the XML file, and pass it for processing to the site.xsl file, but in some manner which prevents site.xsl from building a full page; we only need an XHTML div, for insertion into the index page.
- One outcome is an error message; this message should give a warning about permissions, the most likely cause of failure of the operation.
- Once all this is working for individual files/formats, the next stage is to enhance the AJAX page so that it can do the whole lot.
- This would work by having the JavaScript create a queue of URLs to be called, and when it gets a successful response from each one, it invokes the next one, also reporting its progress as it goes. There would also need to be a method for bailing on the process.
- A similar batch function should be available for each individual document, invoking all the formats.
- Finally, we need directory browsing to be available through the Cocoon sitemap, so the editor (or indeed regular readers) can see and access all the backup files.
This setup would give the option for the editor to backup the whole collection, or just one changed file, in all its output formats (including the source XML, presumably), so that a regular backup of changes could be taken, and also when a single file is edited, copies could be regenerated for only that file.
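The queued-call idea in the plan could be sketched like this (all names are illustrative; this is a design sketch, not the project's code):

```javascript
// Call each backup URL in turn, continuing only when the previous one
// succeeds, reporting progress as we go, and allowing a bail-out.
function runQueue(urls, fetchOne, report) {
  var cancelled = false;
  var i = 0;
  function next() {
    if (cancelled || i >= urls.length) return;
    var url = urls[i++];
    fetchOne(url, function (ok) {
      report(url, ok, i, urls.length);
      if (ok) next(); // stop the queue on the first failure
    });
  }
  next();
  return { cancel: function () { cancelled = true; } };
}
```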
It's long been a plan of mine (following my own recommendations from our Finland presentation) to build in a system whereby hard copies of all the XHTML, PDF and other output formats can be (or perhaps are automatically) saved on the server, so that should Tomcat or Cocoon go down in some catastrophic way, those files are all available to the editors. I implemented that today, following some instructions here. There was a lot of messing around initially getting the folder paths right in the parameters to the flowscript, but after tailing the Cocoon log for a while, I got that sorted.
However, it didn't work in the end, due to a permissions issue. The Tomcat process runs as apachsrv, which apparently doesn't have permission to write to those folders -- which makes perfect sense from a security point of view. We're working on that with sysadmin now.
NLM 3.0 output is now working and on the Website. Mostly did this yesterday, but blogged it in the wrong blog.
Wrote the XSLT to convert appendices, and began work on the reference list (bibliography) code. I've got the list framework working. I'm now looking at the rather odd NLM structures used for reference items. They don't seem to have any way of distinguishing authors from editors, other than by wrapping them in <person-group> tags with a person-group-type attribute; I guess that reflects the reality in scientific fields, where no-one publishes anything alone. The whole thing seems less structured than a TEI equivalent, being more of a loose agglomeration of tags.
My nine test files now convert without errors (except the missing reference ids that refer to the bibliography items in the reference list, which are not converting yet because they're back matter). So two-thirds of the job is done. There are some oddities in the model structure of NLM -- for instance, every section (<sec>) must have a <title>, which seems ridiculous, and links (<xref>, <uri> and <ext-link>) cannot contain abbreviation tags, which seems pointlessly restrictive, when they can contain bold, italics etc. However, that's not really my problem, except that it requires me to throw away some information during the conversion.
I now have teiJournal-to-NLM motoring along quite well; I've got all the way through handling body elements, as far as I can tell. Next I'll try bulk-converting the whole set of articles and bug-fixing on those, before I move on to the difficult issue of the reference list and appendices.
Added some of the basic body block-level element handling. I was able to take advantage of some of the XHTML processing already written to deal with tables, because NLM purports to use the XHTML table model, but there are still apparently some issues; <caption> seems to require another block-level element below it, and the @class attribute is not supported. Still, we're making progress.
The latest version of the schema generated from my ODD file now allows <pb> tags, which I can now use for forcing page-breaks in PDFs. It also includes a recent P5 fix, brought about by LR, which allows the @type attribute on <biblStruct> elements. I've been using @rend, because it was the best of many bad options, but @type is exactly what I need, so I've changed all the XML data and the reference rendering code so that it now uses @type instead of @rend.
In the process, I took another look at the PDF rendering, and decided that the first page template looked a bit odd, with no footer, and a large bottom margin. I've now extended the page, and added the normal recto footer, and it looks a lot better. This re-paginated all the documents, so I had to do more PDF fixes with <pb>.
The editors' intros are due soon, and then this issue should go out. I need to decide whether PDF links should be on the contents page, as well as on the actual XHTML version of the document. I think probably they should, but I'll check with the editors.
The page break thing is now working. It's a kludge, but it'll do until we can move to a newer version of FOP.
It occurred to me that I could make judicious use of a little hack element such as:
<pb type="fopFix" />
to force pagebreaks where our ancient version of FOP is not able to provide them. Having written the necessary code -- to process the element in the PDF output, and to suppress it in the XHTML -- I found one example in our current data of a heading at the bottom of a page (Conclusions, in the McIntosh paper), and started trying to add the pb, only to discover that I'd deleted it from the schema through my ODD file. Having remedied that, I now find that ROMA isn't able to produce schemas -- it seems to be unresponsive and/or broken, and produces only empty files, where it produces anything at all. I guess I'll come back to this tomorrow.
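For the record, the two templates amount to something like this (assumed code, not the project's exact stylesheets):

```xml
<!-- In the PDF (XSL:FO) stylesheet: force a page break. -->
<xsl:template match="pb[@type='fopFix']">
  <fo:block break-after="page"/>
</xsl:template>

<!-- In the XHTML stylesheet: suppress the element entirely. -->
<xsl:template match="pb[@type='fopFix']"/>
```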
Embarcadero/CodeGear are hosting an online "CodeRage" conference with live presentations using MS Live Meeting, all week, presenting new features in Delphi 2009. This is important stuff, because D2009 now has full, core Unicode support. I've tuned in and out of four presentations during the day, and the best have been the afternoon sessions on Unicode and the RTL features. D2009 is probably going to be the basis for the next generation of the Image Markup Tool. There's no sign of academic pricing for D2009, though; I'm waiting for CodeGear's response to my query about this.
Began work on the NLM output for teiJournal, which will tie it into OJS. The metadata is basically done, and validates; it takes the <teiHeader> and converts it into the NLM <front> element. So far so good, but I am noticing that there is some information that really ought to be in my teiJournal structure, but which isn't, as yet, because IALLT has not required it. For instance, there should be provision for an abstract, which in NLM belongs in the metadata area. This conversion work will give me some clearer ideas about how to fill out the teiJournal tagset and encoding practices.
Based on some editorial feedback from HM, I've reduced the size of the images on the page to 4" wide (discovering in the process that FOP 0.20.5 doesn't support content-width, but if you just set width, it does scale the content proportionately). I also increased the space between the footer top border and the text above to a quarter of an inch.
Then I tackled endnote numbers, discovering that if you set the font size, that seems to undermine the application of the vertical-align and/or baseline-shift properties, so I've left the font size for endnote numbers in the text as default, and now I have superscripting. The link is also working, although FOP's positioning of the hot area is always slightly off. I fixed a bug that was causing all in-text footnote numbers to be "1" (xsl:number never does quite what you expect). Finally, I noticed a situation in which a URL was not having zero-width wrapping spaces inserted, and it turned out to be because it starts with https, and I was crudely trapping for http://. I'm now just looking for http, which should do the job. We may have to complicate it a bit for ftp and other protocols at some stage, but you don't see them as much as you used to.
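The URL-wrapping trap described above might look like this in outline (a sketch only; the real implementation is in the XSLT):

```javascript
// Insert zero-width spaces (U+200B) after slashes so FOP can break
// long URLs across lines, applied to anything whose href starts with
// "http" -- which covers https as well.
function makeUrlWrappable(url) {
  if (url.indexOf('http') !== 0) return url;
  var m = url.match(/^([a-z]+:\/\/)(.*)$/);
  if (!m) return url;
  // Leave the scheme's "//" alone; break points go after later slashes.
  return m[1] + m[2].replace(/\//g, '/\u200B');
}
```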
I've now added a PDF link to the document XHTML output, so all articles are available as PDF.
Finally got the tables to display according to APA guidelines. It appears that borders only work reliably on table cells, and you can only get them to display different top/bottom from right/left borders by explicitly specifying every style, width and colour for the borders. It was setting the right/left borders to white that finally "suppressed" them. For the bottom border of the table, I had to add a bottom border to cells in the final row.
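The trick, roughly, is to style every border on every cell explicitly, painting the vertical ones white; a hedged sketch of the kind of attributes involved (illustrative names and values, not the actual attribute sets):

```xml
<xsl:attribute-set name="apaTableCell">
  <!-- Horizontal rules are drawn where needed... -->
  <xsl:attribute name="border-top-style">solid</xsl:attribute>
  <xsl:attribute name="border-top-width">0.5pt</xsl:attribute>
  <xsl:attribute name="border-top-color">black</xsl:attribute>
  <!-- ...while vertical borders are "suppressed" by drawing them white. -->
  <xsl:attribute name="border-left-style">solid</xsl:attribute>
  <xsl:attribute name="border-left-width">0.5pt</xsl:attribute>
  <xsl:attribute name="border-left-color">white</xsl:attribute>
  <!-- border-right-* and border-bottom-* follow the same pattern. -->
</xsl:attribute-set>
```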
I then trimmed the space-after attribute for all headings, resulting in text that looked better -- it appears you want gaps between sections, but not between headings and their following paragraphs -- and virtually no instances of widows (which is luck, but the odds are better when there's a smaller space-after).
As far as I can see, I have only one remaining problem: I simply can't find a way to make footnote numbers appear superscripted in the text. According to the FOP compliance table, it should support baseline-shift: super, but it just doesn't work; and vertical-align, which should also be able to take a value of super, fails too. There may be some combination of settings I can use to make it function; I'll need to experiment.
Got the endnotes working properly, with somewhat simpler code than I used in ScanCan. Also managed to fix the basic-link to internal-destination problem; links within the document now work.
After that, I started work on tables. This is not simple, because the styling requirements for tables in APA are quite unusual and quite strict; you need to have top and bottom borders on tables and label rows, but no borders on cells, and no vertical borders at all. I have what I think is basically the right xsl:fo code going into FOP, but right now it's failing to do right by my borders. I'll have to hack around at it a bit more, but I may end up having to compromise and just go for a straight grid table.
Got the reference list output code ported from XHTML to XSL:FO. Everything seems to be working except that internal links are not doing their job. Need to work on that. Citations in the body should bounce to the relevant item in the reference list.
I'm making some progress with PDF generation; I have divs, headings paragraphs, external links, images and a variety of other features working. However, I've hit one problem:
XSL:FO has properties called keep-with-previous and keep-with-next, which are used to prevent the orphaning of titles at the bottom of pages, and similar layout oddities. The FO-to-PDF converter in Cocoon 2.1 is Apache FOP 0.20.5, which is old, and which doesn't support the keep-with properties. That causes an occasional output problem: a title can end up at the bottom of a page, separated from its following paragraph. For our serious print publishing, we use the RenderX XEP engine, which has virtually complete XSL:FO support. However, XEP costs a lot of money ($4,000 for a single-core, one-CPU license), so it can't be a default part of the teiJournal project, which is wholly open-source.
Meanwhile, both FOP and Cocoon are moving forward; Cocoon 2.2 is out, and includes a more modern version of FOP which supports the properties we need. However, Cocoon 2.2 is a completely different animal from 2.1, with a totally different structure; moreover, the XML database we use (eXist) is available in a package with Cocoon 2.1, but no such package exists for Cocoon 2.2. So the situation is this:
Right now we can't ensure that PDFs avoid the orphaning problem in teiJournal (although if IALLT Journal wished to, they could pay $4,000 for XEP and solve the problem). In the future -- over about two to three years, I estimate -- eXist will probably move to Cocoon 2.2, or we'll learn to build Cocoon 2.2 with eXist, and teiJournal will be ported to Cocoon 2.2 and solve the problem. So we're looking at occasional orphaning problems occurring with some articles for two or three years.
While I'm posting, a reminder to myself about PDF output development and the caching problems we have with it. First, remember that the browser usually caches a PDF download, so you need to clear the browser cache before grabbing an updated copy when working on PDF output. Secondly -- and this is a killer that I'd forgotten about -- when there are multiple XSLT stylesheets being called, Cocoon will cache the results of a transformation unless the root stylesheet has changed. Therefore, if you're actually coding in a different stylesheet, you need to make a quick edit to the root stylesheet and upload it in order to trigger a refresh in the Cocoon pipeline. Cocoon doesn't know that the root stylesheet invokes other files, so it doesn't check to see whether they've changed.
Re-configured the XSL:FO output XSLT to take account of three possible levels of attribute-sets in the database (base, styleguide and user). Then I started the long task of writing the XSLT output code. I got a long way into it -- all the basics should be covered -- but I couldn't get anything to render correctly from FOP; the page layout and numbering worked, but the body document was simply unstyled. After an hour of hacking, I finally determined that this was caused by the accursed browser cache; switching browsers threw up a page that looked half-way ready.
I still have a lot of not-quite-right attributes in my output, according to the unofficial FOP schema I've downloaded; it tells me I need to get rid of rogue xsl:use-attribute-sets attributes that have found their way out into my code instead of being interpreted; it also tells me I must use units where values are 0 (so 0 is wrong, but 0in is correct); and it also tells me that hash expressions of colour values are wrong. I'll work on these, and see if I can get more stuff to work properly tomorrow.
I created the APA pdf.xsl file which calls the main article_to_pdf.xsl file, and tested it. Then I stored the pdf_page_masters.xsl file in the database itself, under db/teiJournal/settings/default/style/, and set up an XQuery file which can retrieve such files when passed the filename as a parameter. The XSL generation now calls that generation system through the cocoon:// protocol. The current setup assumes that if there's any kind of a user file in /settings/user/style/, it should use that; otherwise, it uses the default file. This is a large assumption, and a more graceful approach, similar to that of the CSS system, would be to find the default file, then iterate through all the named attribute sets and look for similar sets in the user file, substituting them where they exist. I'll implement that next. For the moment, basic PDF generation is still working after the XSLT rearrangement.
I followed my own instructions in the earlier post to set up FOP font configuration on the actual IALLT Journal site, and then started putting test transformations in place. I took some basic PDF attribute sets from ScanCan, and modified them heavily, primarily by converting the metric setup to inches as required by the IALLT Journal (paper size will be 8.5 x 11), and simplifying the page masters; IALLT requires only three page masters, one a recto for the article title page, then one each for recto and verso regular pages. I set up the headers and footers in a default manner, with the running titles as provided in the articles, and put page numbering at the bottom outside. Amazingly, the basics worked right out of the box; FOP is perhaps further advanced than we thought.
The next stages are:
- Decide which components of the layout and design should be abstracted for editorial control, and stored in the db. Make that work.
- Create the styleguide-controlled transformation system. Right now, the transformation is done using a root XSLT file, but actually a styleguide-based file should be called, as with the XHTML transformation, and the base file included in that.
- Add the article header system into the basic templates file (this is not styleguide-controlled).
- Create a blank pdf_references.xsl file in the styleguide folder, and include it. Later, the XHTML references document will act as the model to create this file.
- Look at the way images are included in the XML, and make sure there's a reliable method of deriving the hard path to image files, just as we do for fonts; FOP can't work with relative paths.
Then it's just a question of working through all the little templates, figuring out how best to handle them.
Worked through the editorial corrections to the Sawhill article, and made all the fixes. Also had to make two code changes: in the first, the need to add an acknowledgement that a previous version of the article had been a prize finalist was handled by placing a paragraph with a special rend attribute in the <sourceDesc> tag:
<sourceDesc>
  <p rend="afterTitle">A version of this paper was selected as a Henderson Plenary Award Finalist at the IALLT 2007 Annual Conference.</p>
  [...]
</sourceDesc>
Code in xhtml_article_base.xsl looks for this and renders it above the author names. This is only marginally satisfactory, but again, there are no absolutely appropriate tags for this kind of thing because it's peculiar to born-digital documents.
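The handling in xhtml_article_base.xsl works along these lines. This is a simplified sketch only: the mode name and the output class are illustrative, not the actual names in the code:

```xml
<!-- Sketch: pull the specially-flagged sourceDesc paragraph up into
     the article header, so it renders above the author names. -->
<xsl:template match="p[@rend='afterTitle']" mode="articleHeader">
  <div class="afterTitle">
    <xsl:apply-templates/>
  </div>
</xsl:template>
```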
The other change related to capitalization of initials in bibliographical references. My code was automatically doing this, but the case of "bell hooks" (an affected pseudonym; the author always uses lower-case) required that the automatic upper-casing be suppressed. Conceivably there are first names that legitimately begin with lower-case letters anyway, so suppressing it is no bad thing as far as I can see.
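The change amounts to trusting the case as encoded rather than forcing upper-case. Roughly (a sketch only; the template name is hypothetical and the real code does more):

```xml
<!-- Before, the sketch would have wrapped the initial in upper-case();
     now the initial is kept exactly as encoded, so "bell hooks"
     stays lower-case in the reference list. -->
<xsl:template name="refInitial">
  <xsl:param name="forename"/>
  <xsl:value-of select="concat(substring(normalize-space($forename), 1, 1), '.')"/>
</xsl:template>
```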
Here's what I had to do:
- Create the config file with a slightly different name: [cocoon]/teiJournal/fop-config-src.xml.
- In that file, replace paths to font metrics and font files with simple folder+file names, thus:

  <font metrics-file="fop-fonts/DejaVuSans.xml" kerning="yes"
        embed-file="fop-fonts/DejaVuSans.ttf">

- In the fo2pdf serializer section of the root sitemap, set the user-config setting to use a cocoon path, which will invoke a pipeline we create:

  <map:serializer logger="sitemap.serializer.fo2pdf" mime-type="application/pdf"
      name="fo2pdf" src="org.apache.cocoon.serialization.FOPSerializer">
    <user-config>cocoon://teiJournal/fop-config.xml</user-config>
  </map:serializer>

- Create an XSLT file to do the transformation. This invokes Cocoon's RealPath module to find the real path of Cocoon on the filesystem, so we can use it to reconstruct the path to the fonts:

  <?xml version="1.0"?>
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      exclude-result-prefixes="#all">
    <xsl:param name="fontPath"/>

    <!-- Identity template: copy everything through unchanged. -->
    <xsl:template match="@*|node()" priority="-1">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- Massage the path attributes, prefixing the real font path. -->
    <xsl:template match="@metrics-file">
      <xsl:attribute name="metrics-file">
        <xsl:value-of select="$fontPath"/>
        <xsl:value-of select="."/>
      </xsl:attribute>
    </xsl:template>
    <xsl:template match="@embed-file">
      <xsl:attribute name="embed-file">
        <xsl:value-of select="$fontPath"/>
        <xsl:value-of select="."/>
      </xsl:attribute>
    </xsl:template>
  </xsl:stylesheet>

- In the main pipeline of the root sitemap, add a pipeline to handle generating, transforming and serializing the config file, like this:

  <map:match pattern="teiJournal/fop-config.xml">
    <map:generate src="teiJournal/fop-config-src.xml"/>
    <map:transform type="saxon" src="teiJournal/xsl/fop-config.xsl">
      <map:parameter name="fontPath" value="{realpath:/}WEB-INF/"/>
    </map:transform>
    <map:serialize type="xml"/>
  </map:match>

- Check that this worked, by requesting the generated fop-config.xml in your browser at the location the pipeline matches.
- Confirm that you get working fonts when you run a PDF generation.
The XML parsing classes are now complete, and can read and deconstruct the XML submission either from a POST variable or from a file on the filesystem (later, I'll add the ability to grab data from a table). They can reconstruct themselves in XML and feed that out, too.
Next the really hard work starts. We need to figure out how to build the complex SQL query to get the data back.
Entered HC's proofing corrections for the Yang article.
Got a set of corrections, including a new contribution type designation that I hadn't predicted ("Column"). Added all the necessary XSLT and made the corrections. I must remember to document the addition of "Column" on the teiJournal project Web site.
Put Tomcat and teiJournal on Radicchio and tested it -- working fine. Now I can hammer it without killing tomcat-dev on Lettuce.
Greg generated font-metrics files for all the DejaVu and Gentium files -- he's blogged that process elsewhere. It only worked on Lettuce; even with the same Cocoon jars, it fails on OSX and Windows.
Then I was able to create a fop-config.xml file, which I'll reproduce at the end of this message. That file was placed in [cocoon]/WEB-INF. Then the root sitemap, which is where the fo2pdf serializer is defined, was modified, to add the <user-config> tag here:
<map:serializer logger="sitemap.serializer.fo2pdf" mime-type="application/pdf"
    name="fo2pdf" src="org.apache.cocoon.serialization.FOPSerializer">
  <user-config>context:/WEB-INF/fop-config.xml</user-config>
</map:serializer>
We know this works, because FOP would fail when we got it slightly wrong, then succeed when it was able to find the file. The key here is that this file path is relative, so the Cocoon instance remains portable.
Next, I put the fonts themselves, along with the font-metrics files, in a subfolder of WEB-INF, [cocoon]/WEB-INF/fop-fonts.
But here's the big problem: FOP was not able to find and use the fonts unless the paths to them were absolute. We tried a number of variants of relative paths, including file:fop-fonts/..., file://fop-fonts/..., ./fop-fonts/... and so on. Every attempt required a restart of Cocoon, and every second attempt required a restart of Tomcat (Tomcat dies on the second attempt to restart one of its webapps). This is not a workable way to proceed, so I'll have to set up a working Tomcat stack on a local machine to play with this. It's possible that setting the <base> or <base-font> elements correctly will do it; and it's also possible that we're incorrectly assuming these paths need to be relative to the fop-config.xml file, when actually they should be relative to something else (such as the lib directory where the FOP jar file is).
For the record, here's our fop-config.xml file, with ./ relative paths that don't work.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<fonts>
<font metrics-file="./fop-fonts/DejaVuSans.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSans.ttf">
<font-triplet name="DejaVu Sans" style="normal" weight="normal"/>
<font-triplet name="DejaVuSans" style="normal" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSans-Bold.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSans-Bold.ttf">
<font-triplet name="DejaVu Sans" style="normal" weight="bold"/>
<font-triplet name="DejaVuSans" style="normal" weight="bold"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSans-BoldOblique.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSans-BoldOblique.ttf">
<font-triplet name="DejaVu Sans" style="italic" weight="bold"/>
<font-triplet name="DejaVuSans" style="italic" weight="bold"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSans-Oblique.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSans-Oblique.ttf">
<font-triplet name="DejaVu Sans" style="italic" weight="normal"/>
<font-triplet name="DejaVuSans" style="italic" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSansCondensed.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSansCondensed.ttf">
<font-triplet name="DejaVu Sans Condensed" style="normal" weight="normal"/>
<font-triplet name="DejaVuSansCondensed" style="normal" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSansCondensed-Bold.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSansCondensed-Bold.ttf">
<font-triplet name="DejaVu Sans Condensed" style="normal" weight="bold"/>
<font-triplet name="DejaVuSansCondensed" style="normal" weight="bold"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSansCondensed-BoldOblique.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSansCondensed-BoldOblique.ttf">
<font-triplet name="DejaVu Sans Condensed" style="italic" weight="bold"/>
<font-triplet name="DejaVuSansCondensed" style="italic" weight="bold"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSansCondensed-Oblique.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSansCondensed-Oblique.ttf">
<font-triplet name="DejaVu Sans Condensed" style="italic" weight="normal"/>
<font-triplet name="DejaVuSansCondensed" style="italic" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSansMono.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSansMono.ttf">
<font-triplet name="DejaVu Sans Mono" style="normal" weight="normal"/>
<font-triplet name="DejaVuSansMono" style="normal" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSansMono-Bold.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSansMono-Bold.ttf">
<font-triplet name="DejaVu Sans Mono" style="normal" weight="bold"/>
<font-triplet name="DejaVuSansMono" style="normal" weight="bold"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSansMono-BoldOblique.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSansMono-BoldOblique.ttf">
<font-triplet name="DejaVu Sans Mono" style="italic" weight="bold"/>
<font-triplet name="DejaVuSansMono" style="italic" weight="bold"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSansMono-Oblique.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSansMono-Oblique.ttf">
<font-triplet name="DejaVu Sans Mono" style="italic" weight="normal"/>
<font-triplet name="DejaVuSansMono" style="italic" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSans-ExtraLight.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSans-ExtraLight.ttf">
<font-triplet name="DejaVu Sans Condensed" style="normal" weight="400"/>
<font-triplet name="DejaVuSansCondensed" style="normal" weight="400"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSerif.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSerif.ttf">
<font-triplet name="DejaVu Serif" style="normal" weight="normal"/>
<font-triplet name="DejaVuSerif" style="normal" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSerif-Bold.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSerif-Bold.ttf">
<font-triplet name="DejaVu Serif" style="normal" weight="bold"/>
<font-triplet name="DejaVuSerif" style="normal" weight="bold"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSerif-BoldItalic.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSerif-BoldItalic.ttf">
<font-triplet name="DejaVu Serif" style="italic" weight="bold"/>
<font-triplet name="DejaVuSerif" style="italic" weight="bold"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSerif-Italic.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSerif-Italic.ttf">
<font-triplet name="DejaVu Serif" style="italic" weight="normal"/>
<font-triplet name="DejaVuSerif" style="italic" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSerifCondensed.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSerifCondensed.ttf">
<font-triplet name="DejaVu Serif Condensed" style="normal" weight="normal"/>
<font-triplet name="DejaVuSerifCondensed" style="normal" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSerifCondensed-Bold.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSerifCondensed-Bold.ttf">
<font-triplet name="DejaVu Serif Condensed" style="normal" weight="bold"/>
<font-triplet name="DejaVuSerifCondensed" style="normal" weight="bold"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSerifCondensed-BoldItalic.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSerifCondensed-BoldItalic.ttf">
<font-triplet name="DejaVu Serif Condensed" style="italic" weight="bold"/>
<font-triplet name="DejaVuSerifCondensed" style="italic" weight="bold"/>
</font>
<font metrics-file="./fop-fonts/DejaVuSerifCondensed-Italic.xml"
kerning="yes" embed-file="./fop-fonts/DejaVuSerifCondensed-Italic.ttf">
<font-triplet name="DejaVu Serif Condensed" style="italic" weight="normal"/>
<font-triplet name="DejaVuSerifCondensed" style="italic" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/GenI102-Italic.xml"
kerning="yes" embed-file="./fop-fonts/GenI102.ttf">
<font-triplet name="Gentium" style="italic" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/GenR102.xml"
kerning="yes" embed-file="./fop-fonts/GenR102.ttf">
<font-triplet name="Gentium" style="normal" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/GenAI102-Italic.xml"
kerning="yes" embed-file="./fop-fonts/GenAI102.ttf">
<font-triplet name="Gentium Alt" style="italic" weight="normal"/>
<font-triplet name="GentiumAlt" style="italic" weight="normal"/>
</font>
<font metrics-file="./fop-fonts/GenAR102.xml"
kerning="yes" embed-file="./fop-fonts/GenAR102.ttf">
<font-triplet name="Gentium Alt" style="normal" weight="normal"/>
<font-triplet name="GentiumAlt" style="normal" weight="normal"/>
</font>
</fonts>
</configuration>
Actually, that's a misleading title; FOP is already working. The issue is how to configure the font settings, and how, then, we might deploy the system with working font settings. Here are the basics:
- FOP comes with the capability to handle all the Base-14 fonts (the fonts guaranteed to be available in every PDF reader); it seems to know their metrics without any configuration. That means I can start developing my code using those fonts (which means Helvetica, Times and Courier), without addressing any of the problems below, but that's very limiting.
- To use other fonts, you first need to generate font-metrics files for them. These are XML files encoding the metrics of the fonts, which FOP uses to calculate layout and so on. The instructions on the FOP site for generating these files don't work; the TTFReader class is not found.
- Other documentation on the FOP site suggests that the latest version (0.95, which we think corresponds to the fop-0.20.5.jar in our Cocoon 2.11 installations) can work out the metrics for itself if you tell it where to find the fonts, using the fop.xconf file. However, we don't know where to put such a fop.xconf file in Cocoon, or how to tell FOP where to find it.
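For reference, if the jar really were a 0.9x-series FOP, the newer config format can auto-detect installed fonts on a per-renderer basis. This is the standalone fop.xconf layout as I understand it; whether Cocoon's serializer would actually read this format is exactly what we don't know:

```xml
<fop version="1.0">
  <renderers>
    <renderer mime="application/pdf">
      <fonts>
        <!-- Scan the system for fonts and derive metrics automatically. -->
        <auto-detect/>
      </fonts>
    </renderer>
  </renderers>
</fop>
```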
- Paths to fonts and font-metrics files need to be full paths for FOP to use them, as far as we can tell. That makes deployment difficult. Up to now, teiJournal is completely config-free, in that you can dump it into any Tomcat and it will just work. If these paths have to be hard-coded, then some kind of configuration script will have to be run when the app is first started, or by the administrator when deploying it. That's a disappointing step backwards.
We're still working on this. We'd very much like to distribute teiJournal with all the DejaVu fonts and metrics files for them, using them by default in the PDF generation; that's allowed under their licenses, and they're more attractive than the Base-14 fonts.
I'm just beginning the PDF file generation code, so I've written to the IALLT Journal editors for some feedback on the design. As far as I remember, we've never addressed the issue of page size and layout for the PDFs. These are things we need to think about:
- What page size are we producing? Letter (8.5 x 11) is probably the best choice, because most people will be printing the PDF on a regular printer, if they print it at all, but we can choose any size we like.
- Do we want to go with separate layouts for recto and verso? When you're producing a print volume, you normally have slightly different margins for recto and verso pages, as well as a different running header, and the page numbers will typically be located in different places. However, when the target audience is likely to print the document off on their inkjet or laserjet printer, this is a bit pointless, because the pages are not bound in book form. On the other hand, more and more network printers have duplexing capabilities, so people may well be able to create a little booklet for themselves, or may put the pages into a binder, so perhaps we should allow for that.
- Bearing in mind the factors above, where should the page numbers go -- top or bottom? Left (verso) and right (recto), or centred?
- Again, bearing in mind the above, what should we use for the running title(s)? Each article already has a custom running title based on its title, but if we have separate recto and verso designs, we can have a second running title; that might be the author name(s), or the journal name/vol/issue, bearing in mind that a printout will be out of context of the site, so the journal name needs to be there somewhere. Alternatively, we could put the journal name in the gutter.
These decisions don't have to be set in stone; they'll just be encoded in an XSLT file stored in the database, and can be modified easily, but it would be easier to start with a firm plan even if we change it later. My instincts say that we should design for a situation where people would print off the document in duplex and insert it into a binder, so:
- 8.5 x 11 paper, with a larger right margin on the recto and a larger left margin on the verso
- page numbers on the top left (verso) and top right (recto)
- running article title at the top of the verso, and author name(s) at the top of the recto
- journal name in the gutter of the verso, and volume/issue in the gutter of the recto
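If we go that way, the running heads would be plain static-content blocks, something like the sketch below. The flow names are placeholders, and the title and author strings would of course be pulled from the article header rather than hard-coded:

```xml
<!-- Verso header: page number at the outside (left), running article title. -->
<fo:static-content flow-name="header-verso"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block font-size="9pt" text-align-last="justify">
    <fo:page-number/>
    <fo:leader leader-pattern="space"/>
    <fo:inline>Running article title here</fo:inline>
  </fo:block>
</fo:static-content>
<!-- Recto header: author name(s), page number at the outside (right). -->
<fo:static-content flow-name="header-recto"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block font-size="9pt" text-align-last="justify">
    <fo:inline>Author name(s) here</fo:inline>
    <fo:leader leader-pattern="space"/>
    <fo:page-number/>
  </fo:block>
</fo:static-content>
```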
I'm waiting for feedback from the editors on this. Meanwhile, Greg and I have been looking at how to get FOP working (see next post).
I'm about a third of the way into the paper, which is long and full of references. There are five images to deal with, but I haven't got to them yet. So far so good -- I have a couple of queries in with the author about puzzling bits.
Finished marking up the bibliography of the last article. Found a couple of typos, and also tweaked a lot of the XSLT to display presentation and online journal data. Dates are highly variable with online content. Where a full date exists, APA requires that it be in long form (2008, September 15), but the code has to allow for the day being missing, and sometimes the month too. This is coming along nicely, though, and I'll be ready to start on the PDF code soon.
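The date logic boils down to something like this. This is a sketch assuming the date arrives in W3C form (yyyy, yyyy-mm or yyyy-mm-dd); the template name is hypothetical and the project's actual encoding may differ:

```xml
<xsl:template name="apaDate"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:param name="when"/>
  <xsl:variable name="months"
      select="('January','February','March','April','May','June',
               'July','August','September','October','November','December')"/>
  <!-- Year is always present. -->
  <xsl:value-of select="substring($when, 1, 4)"/>
  <!-- Month, if present: "2008, September". -->
  <xsl:if test="string-length($when) ge 7">
    <xsl:text>, </xsl:text>
    <xsl:value-of select="$months[xs:integer(substring($when, 6, 2))]"/>
  </xsl:if>
  <!-- Day, if present: "2008, September 15". -->
  <xsl:if test="string-length($when) ge 10">
    <xsl:text> </xsl:text>
    <xsl:value-of select="xs:integer(substring($when, 9, 2))"/>
  </xsl:if>
</xsl:template>
```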
Started marking up the bibliography of the latest paper, and adding new handlers to the XSLT code as needed. We now have handlers for online letters to the editor, online reports, and blog postings. I'm about half-way through the bibliography. Only one more item (online transcript of forum presentation) looks problematic.
The third, and probably final, article for volume 40 has arrived, and it's a big one, with lots of new types of reference item to handle. I started by moving all the footnote references into a list of alphabetically-ordered commented items in the bibliography area, ready to be marked up, and looking around for some additional info for some of the items. Many are of types which are not clearly or directly handled by APA guidelines (for instance, online transcription of presentation, or letter to the editor of an online journal in response to a previous article). This article will take quite a while to mark up, but it'll help add more handlers to the reference system, before I start work on the PDF export.
The journal system allows for many different contribution types, specified through the @rend attribute on the <TEI> tag and constrained by a customization of the schema. I'd never actually used these before, but the latest contribution is a "Lab notes" item, and HM asked that this fact be made noticeable somewhere on the page, so I implemented a system which supplies an absolutely-positioned contribution-type label, based on XSLT string variables. Four document types are handled in the XSLT, although there are string variables for all the variants. The default "Note" is overridden in the user stylesheet in the db, and reads "Lab notes" as per the IALLT Journal requirements.
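The mechanism is just a set of overridable string variables plus one labelled block, roughly as below. All the names here (the variable, the mode, the @rend value, the output class) are illustrative; the class would be positioned absolutely via CSS:

```xml
<!-- In the base XSLT: a default label string per contribution type. -->
<xsl:variable name="contributionTypeNote" select="'Note'"/>
<!-- The user stylesheet in the db overrides the default, e.g.:
     <xsl:variable name="contributionTypeNote" select="'Lab notes'"/> -->

<!-- Emit the absolutely-positioned label when the type matches. -->
<xsl:template match="TEI[@rend='note']" mode="typeLabel">
  <div class="contributionType">
    <xsl:value-of select="$contributionTypeNote"/>
  </div>
</xsl:template>
```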
In the process, I fixed an annoying IE JavaScript error, resulting from the fact that IE doesn't support hasAttribute.
The second new article is short and simple, with no bibliography, so it was quick to mark up.
Tweaked the contents handling so that proofing documents are now simply stored in a different subcollection. This prevents them from being indexed, and keeps their authors and other metadata out of the TOCs and indexes.