Finished marking up the new article, and sent the URL to the editors for proofing. Added handlers for two different groups of new reference/biblio item: onlineCommunity (which has its own handler), and the other group of onlineFactSheet, onlineBrochure, onlinePublicAnnouncement and software, all of which are handled as part of the old book handler, since they're all similar monograph-type things. Had to add a new function to handle "n.d." dates; this is called from the big monograph template, but not from anywhere else yet. When the time comes, I'll migrate it into all the other bits which render dates.
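The "n.d." logic amounts to falling back to a placeholder when no usable date is available; something along these lines (the template name and structure here are guesses for illustration, not the actual code):

```xslt
<!-- Sketch only: render a date in APA style, substituting "n.d."
     when the source has no date. Names are assumptions. -->
<xsl:template name="formatPubDate">
  <xsl:param name="date"/>
  <xsl:choose>
    <xsl:when test="normalize-space($date) = ''">(n.d.)</xsl:when>
    <xsl:otherwise>(<xsl:value-of select="normalize-space($date)"/>)</xsl:otherwise>
  </xsl:choose>
</xsl:template>
```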
Category: "Activity log"
Received the first article for the new edition of the IALLT Journal, so I started marking it up. I've nearly finished that article. Notes at this point:
- I haven't yet marked up all the names and abbreviations in the body of the document. There are lots. I'll do that tomorrow.
- This article has four new types of bibliographical item, which I've tagged as onlineBrochure, onlineFactSheet, onlinePublicAnnouncement, and software. The first two can probably be handled identically, but I'll need to write three XSLT handlers for outputting these types in APA format, as recommended in the APA electronic guidelines supplement.
- I wrote a couple of bits into the sitemap and the contents.xq file so that going to a specific page will show the articles uploaded into the db but not released yet; they're tagged as type="proofing" in their <teiHeader> tag. The editors will be able to view these documents through a special contents page.
- I still have to ensure that searching and indexing doesn't include proofing documents. That'll need a bit of work. It might be a better approach to store them in a separate collection, and write handlers based on the collection.
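Until the separate-collection approach is decided, the exclusion could be a simple header test at query time. A hypothetical sketch (the collection path is an assumption, not the real code):

```xquery
(: Sketch: skip any document flagged for proofing in its header.
   Note that a missing @type makes the comparison false, so the
   not() correctly lets released documents through. :)
for $doc in collection('/db/teiJournal/data')//TEI
where not($doc/teiHeader/@type = 'proofing')
return $doc
```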
Met with RS, KA, and two of the OJS folks to discuss our development plans. My task is to get stuck into the conversion XSLT for teiJournal to NLM, and also to familiarize myself with the XML Galleys plugin for OJS.
For the last little while, we've been talking to the Open Journal Systems guys about integrating TEI support into OJS, and adding code to produce PDFs from XML in OJS. This is pertinent especially now because our library is touting for UVic journals to put into their OJS system (IK is the contact there), and ETCL is beginning markup of DHCN journal articles for generating PDFs, using my teiJournal system. Over the last week or so, we've been discussing this in detail on email, and it looks as though a collaboration will take place.
My initial task will be to work with KA as she marks up DHCN articles, to expand and nail down the TEI schema for journal articles, and then to write a converter to NLM. NLM is already supported by OJS, so we should be able to use that as a method of adding TEI support without too much pain.
This post is to begin logging the time spent on discussions and planning, and then on coding.
I noticed that the menu was incomplete on the IALLT Journal site. A little poking around revealed that it was an XQuery problem related to testing the equality of strings. If I do this:
declare function f:buildMainMenuItems() as element()*
{
let $site := collection('/db/teiJournal/site')//TEI[@xml:id='sitePages']/text/body
for $div in $site/div
return (
if ($div/@rend = 'link_only') then
<li><a href="{$div/head/ref/@target}">{data($div/head/@n)}</a></li>
else
if ($div/@rend != 'page_only') then
<li><a href="site.xhtml?id={$div/@xml:id}">{data($div/head/@n)}</a></li>
else
()
)
};
then the "not equals" test will fail for some reason; it even fails if I use compare(). However, if I turn the test around:
declare function f:buildMainMenuItems() as element()*
{
let $site := collection('/db/teiJournal/site')//TEI[@xml:id='sitePages']/text/body
for $div in $site/div
return (
if ($div/@rend = 'link_only') then
<li><a href="{$div/head/ref/@target}">{data($div/head/@n)}</a></li>
else
if ($div/@rend = 'page_only') then
()
else
<li><a href="site.xhtml?id={$div/@xml:id}">{data($div/head/@n)}</a></li>
)
};
then it works. I suspect the cause is that some <div> elements have no @rend attribute at all: in XPath, a general comparison against an empty sequence is always false, so for those divs the != test fails just like the = test (and compare() against an empty sequence returns an empty sequence, which also evaluates as false). In any case, it's a live site, so I had to do a quick fix to get it working. If this shows up on any other site, it'll be worth confirming that's really the problem.
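The empty-sequence behaviour is easy to demonstrate in isolation (this is standard XPath/XQuery semantics, not eXist-specific):

```xquery
(: A div with no @rend attribute: BOTH comparisons come out false,
   because a general comparison against () is always false. :)
let $div := <div/>
return ($div/@rend = 'page_only',    (: false :)
        $div/@rend != 'page_only')   (: also false :)
```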
More struggles trying to get authentication both working and pretty. At one point the site crumbled, and we had to restart Tomcat; restarting Cocoon alone twice from the Tomcat manager seems to be enough to throw it into disaster. Finally we settled for functional but ugly. In the long term, I need to work closely with Bruno to build in some proper authentication.
Added a new page to the site.xml file, and tweaked the menu/site pages system slightly, so that you can now have a page which explicitly does not show up on the menu. This is for an access-denied page, but there are other circumstances in which this would be useful. Also added a link from the top of articles back up to the TOC page.
The requirement to authenticate users through their IALLT credentials before letting them access the content eventually came down to an IP address restriction, which would enable the IALLT server to get our content and serve it on to its authenticated clients, and no-one else to access it. After much research, and struggling with the remarkably unhelpful (in this respect) Tomcat 6 documentation, we seem to have figured out how to do this.
The key is to create a Context object for the web application, and then apply a Valve (actually, a <RemoteAddrValve>) to control access to the <Context>. There are various ways to create a Context, but on a default Tomcat installation, you need to know that the default Engine name is Catalina, and the default Host name is localhost. Then, you go to [Tomcat]/conf, and create a folder structure inside it consisting of [Engine name]/[Host name]. So in our default setup, we end up with:
[Tomcat]/conf/Catalina/localhost
Next, you create an XML file inside that folder named for the Web application you're trying to protect. In our case, the Web application is called ialltjournal, and so the filename is ialltjournal.xml. That file needs to look like this:
<?xml version="1.0" encoding="UTF-8"?>
<Context>
<Valve className="org.apache.catalina.valves.RemoteAddrValve"
       allow="192.168.*,142.104.128.*"/>
</Context>
The only bit of the file you would change is the allow attribute, which consists of a comma-separated list of regular expressions matching the IP addresses or ranges you want to allow to access your content. You can also use a deny attribute instead, if you just want to block some IPs.
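For the record, the deny-based variant of the same file might look like this (the address pattern is just an example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Context>
  <Valve className="org.apache.catalina.valves.RemoteAddrValve"
         deny="10\.0\.0\..*"/>
</Context>
```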
Then you have to restart Tomcat (restarting the web application doesn't seem to be enough).
Stuff we don't yet know, and are still investigating:
I think it should be possible to make a Context more sophisticated by using a path attribute on the <Context> element, so that the restriction might be confined to specific folders or paths. However, I haven't been able to make that work. (One possible reason: for context files placed under conf/[Engine]/[Host], Tomcat is documented to derive the context path from the XML filename, and to ignore an explicit path attribute.)
Added an XML link to the article rendering, allowing the reader to view the XML markup.
As the site menus expand, the need for a drop-down system becomes more apparent. These are the problems:
- We must have something that works without JavaScript being enabled (so "pure" CSS).
- It must survive font resizing.
- It must work in IE7 (which means there must be more than pure CSS, because IE7 CSS positioning is screwy).
- It must degrade gracefully on IE6.
- It must be functional with no CSS or JS at all.
It took me all day, but I got there. The CSS is fairly straightforward, and conditions 1 and 2 are satisfied on all the decent browsers by that. However, to position the submenus correctly in IE7, I needed to add some extra code in, which I included in a conditional comment:
<!--[if IE 7]>
<script type="text/ecmascript" src="textresizedetector.js"></script>
<script type="text/ecmascript">
// <![CDATA[
window.onload = startUp;
function findPos(obj) {
var curleft, curtop;
curleft = curtop = 0;
if (obj.offsetParent) {
curleft = obj.offsetLeft;
curtop = obj.offsetTop;
while (obj = obj.offsetParent) {
curleft += obj.offsetLeft;
curtop += obj.offsetTop;
}
}
return [curleft,curtop];
}
function fixMenuPositions(){
var uls = document.getElementsByTagName('ul');
for (var i=0; i<uls.length; i++){
if (uls[i].className == 'mainMenu'){
var lis = uls[i].getElementsByTagName('li');
for (var j=0; j<lis.length; j++){
if (lis[j].getElementsByTagName('ul').length > 0){
var anc = lis[j].getElementsByTagName('a')[0];
var ancPos = findPos(anc);
var sub = lis[j].getElementsByTagName('ul')[0];
sub.style.left = (ancPos[0] + 1) + 'px';
sub.style.top = (ancPos[1] + parseInt(anc.offsetHeight)) + 'px';
}
}
}
}
}
function init(){
var iBase = TextResizeDetector.addEventListener(fixMenuPositions, null);
//alert( "The base font size = " + iBase );
}
function startUp(){
fixMenuPositions();
/* id of element to check for and insert test SPAN into */
TextResizeDetector.TARGET_ELEMENT_ID = 'theBody';
/* function to call once TextResizeDetector was initialized */
TextResizeDetector.USER_INIT_FUNC = init;
}
// ]]>
</script>
<![endif]-->
The resize detector code is taken from a library I found on the Web, by Lawrence Carvalho. What this does is detect when the font size changes in the browser (when the user resizes it), and you can use that event to re-trigger the layout code which positions the menus correctly. So far so good; condition 3 is now satisfied.
Next, the IE6 issue. Here, there's no point in trying to get the submenus working; my plan for the site is that when you click on a top-level menu item, you'll get the complete page with all subdivs rendered, but if you click on a sublevel item you'll get just that bit. This means that only having the top level menu items still makes all the info accessible, so the thing to do is remove all the submenus completely:
<!--[if lt IE 7]>
<script type="text/ecmascript">
// <![CDATA[
window.onload = removeSubmenus;
function removeSubmenus(){
var uls = document.getElementsByTagName('ul');
for (var i=0; i<uls.length; i++){
if (uls[i].className == 'mainMenu'){
var subs = uls[i].getElementsByTagName('ul');
for (var j=subs.length-1; j!=-1; j=j-1){
subs[j].parentNode.removeChild(subs[j]);
}
var lis = uls[i].getElementsByTagName('li');
for (var j=0; j<lis.length; j++){
lis[j].style.display = 'inline';
}
}
}
}
// ]]>
</script>
<![endif]-->
Note the buggering about to get a decrementing loop working when you're inside a comment (so > and -- are not allowed).
So the basic problems are now solved. Next, I need to look at spitting out these menu structures from the actual site.xml code.
Made some changes to the site at the request of the editors.
The global settings I had for security were requiring that I authenticate twice to get at the eXist JNLP, which didn't make sense, so I've limited authentication to the actual articles.
Completed the remaining steps in the post from November 22, so that search hits are now correctly highlighted when you go from a search to a document. We're not currently doing any sophisticated stuff when it comes to highlighting index items in documents -- all items are treated as loose search strings with no phrases or pluses -- but I think that's probably the best approach, given that you can never predict whether any given index term is a phrase or an agglomeration of some kind. So I think search functionality is now basically complete.
Our stopgap plan for the journal authentication problem is now to apply simple authentication with a single user/pw at our end, and then have the IALLT proxy draw content from our server using that authentication, while serving out to end users based on authentication against the IALLT users database.
In order to do this, we had to get basic authentication working on Tomcat/Cocoon, which Greg and I have spent the morning figuring out. Predictably, this is not clearly documented, and you have to hack around a bit to figure it out.
Throughout this documentation, I'll use a fictitious user "fred" with a password "bloggs", and a role "nerd".
The first thing you need to do is to edit the tomcat-users.xml file, in [tomcat]/conf. Add a new role, and a new user/password to go with it:
<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
<role rolename="manager"/>
[...]
<role rolename="nerd"/>
<user [...]/>
<user [...]/>
<user username="fred" password="bloggs" roles="nerd"/>
</tomcat-users>
This gives you a user and a role you can then use in the web application. Next, you need to edit the webapp's web.xml file in [tomcat]/webapps/[app_name]/WEB-INF. First, find the first <servlet> tag inside <web-app>, and add this at the end of it:
<security-role-ref>
<role-name>nerd</role-name>
<role-link>nerd</role-link>
</security-role-ref>
This enables Cocoon to be configured to use that role when specifying security constraints. Now we need to specify those constraints. Find the closing </web-app> tag at the end of the file, and add this just before it:
<security-constraint>
<web-resource-collection>
<web-resource-name>[Name of your web application]</web-resource-name>
<url-pattern>/*</url-pattern>
<url-pattern>/[whatever]/*</url-pattern>
<http-method>GET</http-method>
</web-resource-collection>
<auth-constraint>
<role-name>nerd</role-name>
</auth-constraint>
</security-constraint>
<!-- Define the Login Configuration for this Application -->
<login-config>
<auth-method>BASIC</auth-method>
<realm-name>[Name of your web application]</realm-name>
</login-config>
<!-- Security roles referenced by this web application -->
<security-role>
<description>
The role that is required to log in to the application
</description>
<role-name>nerd</role-name>
</security-role>
Save, upload to the server, and restart Tomcat (or just Cocoon, if you've already restarted Tomcat so it reads the users file). Any user name belonging to the "nerd" group can now log in and see the materials protected in this way.
Later note on restarting Tomcat: We tried restarting Tomcat (in this case, tomcat-dev), logged on as me using sudo on the init script as documented elsewhere. This generated a problem we have seen before: the running instance of tomcat-dev failed to stop, and when a new process was started, we ended up with two running. The only way we could find to solve this was to su to hcmc (under which the processes run), and kill both of them. The first (the new one) was effectively killed with just kill, but the other required kill -9. Once both were stopped, we were able to run the init script (as me, using sudo), and Tomcat restarted. Greg is asking sysadmin to figure out what the problem might be here; we've encountered it several times now.
Glossing of "XML" was inaccurate (my fault!).
Links from searches and indexes to articles should cause hits or keywords to be highlighted in the articles. This is a problem in several parts:
- Pass the highlight items to the article rendering pipeline through parameters on the URL (partly done; will need to be more sophisticated in the case of indexes).
- Run the search in XQuery (done, in doc.xq).
- Embed the phrases to be highlighted into the XML of the retrieved document (quite hard -- need to dismember the article to do that, unless I can find a hack to insert a node into a document somehow in doc.xq).
- Include our custom XSLT in the pipeline, to highlight the phrases.
- Render the match tags to highlighting in the output (done).
So there's more to do. What we have works in a basic way with non-phrase searches, but we'll need to work on 3 and 4, then go back to work on 1 once we know how it's all functioning.
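For step 3, one possible hack is to treat the serialized document as a string and wrap the phrases before re-parsing, rather than dismembering the node tree. A rough sketch of the idea in JavaScript (purely illustrative, not the doc.xq code; a real version would also need to handle entities and overlapping terms):

```javascript
// Sketch: wrap each search term in <match> tags, touching only the
// text between tags so existing markup and attributes are left alone.
function embedMatches(xml, terms) {
  // Split into alternating text and tag chunks; rewrite only the text.
  return xml.split(/(<[^>]+>)/).map(function (part) {
    if (part.charAt(0) === '<') return part; // leave markup untouched
    var out = part;
    terms.forEach(function (t) {
      // Escape regex metacharacters in the search term.
      var re = new RegExp('(' + t.replace(/[.*+?^${}()|[\]\\]/g, '\\$&') + ')', 'gi');
      out = out.replace(re, '<match>$1</match>');
    });
    return out;
  }).join('');
}
```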
The Dublin Core metadata embedded in the pages as <meta> tags is important, and should be well-formatted. I'd originally written that code to handle the single <teiHeader> elements in individual articles, but it's also being invoked on the site pages, especially on the Contents/Search page, which lists multiple articles. This was causing huge duplication of elements, because each article's header was included. I rewrote the XSLT code so that it handles both situations elegantly, with lots of distinct-values calls, and the results are now much better. This is especially important because I'm planning that this metadata will be displayed to the end-user through a JavaScript popup; this makes it readable by the user, but also parsable automatically by tools such as Zotero.
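The distinct-values() pattern that makes this work for both a single header and many looks roughly like this (paths and the metadata name are illustrative, not the real stylesheet):

```xslt
<!-- Sketch: emit one DC.creator <meta> per distinct author, however
     many <teiHeader>s happen to be in scope on this page. -->
<xsl:for-each select="distinct-values(//teiHeader//author/normalize-space(.))">
  <meta name="DC.creator" content="{.}"/>
</xsl:for-each>
```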
Pulled a long day on this one, starting from scratch with designing the system, through the XML data for the DB, XQuery to pull it out, and XSLT to configure and link it. This is how it works:
There's a file called indexes.xml that sits in the settings/default subcollection of the database. This is a TEI file which contains a list of items, like this:
<text>
<body>
<head>Indexes</head>
<list>
<item>
<name>People</name>
<code lang="xpath" n="find">/text//name[(@type='person') or not(@type)]</code>
<code n="display">if ($hit/surname and $hit/forename) then (concat($hit/surname, ', ', $hit/forename)) else (string($hit))</code>
</item>
<item>
<name>Organizations</name>
<code lang="xpath" n="find">/text//name[@type='org']</code>
</item>
[... and so on.]
Each item constitutes an index that will be created and displayed. The <name> element specifies the heading which will show up on the page, the first <code> element specifies the XPath which finds all the XML nodes you want to index, and the second (optional) <code> element is some XPath which can format the items you find in a particular way.
Next, an XQuery document called indexes.xq parses this, and constructs actual queries from it, which it throws at the database, producing all the actual indexes, again in the form of a TEI file with lots of lists and items. That file is passed to indexes.xsl, which figures out which index has been selected (based on a parameter from the URL, which XQuery has used to select an index by adding an attribute to it), and then renders each distinct item in the list, in order, followed by links to each of the documents that contains it.
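The core of that query can be sketched with eXist's util:eval(), which evaluates a string as XQuery; the paths and element details here are illustrative, not the real indexes.xq:

```xquery
(: Sketch (assumed paths): build each index by evaluating the stored
   XPath from indexes.xml against the article collection. :)
declare namespace util = "http://exist-db.org/xquery/util";

let $config := doc('/db/teiJournal/settings/default/indexes.xml')
for $item in $config//item
let $hits := util:eval(concat("collection('/db/teiJournal/data')//TEI",
                              string($item/code[@n='find'])))
return
  <list type="index" n="{$item/name}">
    { for $hit in $hits return <item>{string($hit)}</item> }
  </list>
```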
The XSLT took a while, because some items need to be rendered using existing templates (e.g. abbreviations), while others are plain text and just need to be displayed. Also, this uses some of the new grouping features in XSLT 2.0, which are relatively new to me.
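For the record, the XSLT 2.0 grouping pattern involved looks roughly like this (element names and the link form are assumptions, not the real indexes.xsl):

```xslt
<!-- Sketch: group identical index entries, sort them, and follow each
     with links to the documents that contain it. -->
<xsl:for-each-group select="//item" group-by="normalize-space(.)">
  <xsl:sort select="current-grouping-key()"/>
  <li>
    <xsl:value-of select="current-grouping-key()"/>
    <xsl:for-each select="current-group()">
      <a href="doc.xhtml?id={ancestor::TEI/@xml:id}">[doc]</a>
    </xsl:for-each>
  </li>
</xsl:for-each-group>
```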
Right now, each index item is followed by a list of links to the articles that contain it. My last task for this week is to make those links, along with the links to articles found through searches in the Contents/Search page, display with the relevant text highlighted. Once that's done, I think phase one is complete.
Found this post on the Cocoon gmane list that may be very useful:
With a clear cache action you can clear (most) caches.
In the components:
<map:actions>
<map:action name="clear-cache" logger="sitemap.action.clear-cache" src="org.apache.cocoon.acting.ClearCacheAction"/>
</map:actions>
<!-- ... -->
In the pipelines:
<map:pipeline type="noncaching" internal-only="false">
<map:match pattern="clearcache.html">
<map:act type="clear-cache">
<map:generate src="status" type="status"/>
<map:serialize type="xml"/>
</map:act>
</map:match>
</map:pipeline>
The value of this may come when we have XUpdate actions which write configuration and preference information to the database; any pipeline which does this could invalidate the cache immediately afterwards, so any changes would take effect.
Spent most of the day hacking at the appearance of the site, with (I think) pretty good results. I've added variables in the db storage code so that a logo can be added to the journal title, and configured a range of different elements in the base and user CSS.
One major problem I'm having is with pipeline caching. Even though I've specified type="noncaching" on the whole pipeline, the system whereby data stored in the db as xsl variable elements, and pulled out via XQuery to be used as part of an XSLT transformation, is still suffering from caching. This will be a problem when we have a GUI which enables users to change that data. I'll have to find a way to tell Cocoon to clear its cache.
Doug sent over a lot of new versions of the logo, which has changed a little; circles now touch where previously they didn't. I traced this version and created a range of PNGs at various sizes and opacity.
The current versions of the IALLT logo on the site are rather pixellated GIFs which won't be resizable, and we need versions suitable for various locations and sizes, as well as for watermarking, so I traced one to create an SVG version and rendered it into bitmaps of various sizes. Sent the results to D. and H. for approval.
I decided that since the site rubric pages are mostly drawn from the site.xml file in the database, it would make sense to build the menu from there too, so I replaced the code which constructs the menu from a little block of XHTML in the database with XQuery which parses the site.xml file to create a menu. This entails adding a new type of <div> to the site.xml file, which simply constitutes a link, rather than content; this is used for the contents page, which needs its own XQuery and XSLT to function. This seems to be working fine.
Whatever was wrong with Tomcat seems to be fixed, so we were able to get on with deploying teiJournal on it. Rather than try to deploy our own custom-built WAR file, with time running out for the release deadline, we decided to go the tried and trusted route and start from the default Cocoon+eXist WAR file, and at the same time refine our documentation for doing such things. The full documentation is posted here.
Spent most of the day with Greg trying to deploy ANY webapp successfully on the new development Tomcat. No joy whatsoever. Not even an original download of the Cocoon/eXist WAR file will deploy. There's clearly something wrong with the way this Tomcat is set up, but we're frustrated in trying to figure it out by the fact that we don't even have read access to the Tomcat configuration files.
This is completely frustrating. If we can't get this going tomorrow, I'll have to consider deploying the teiJournal code into the existing Tomcat, new Java stuff and all. That's a bit scary, but we can't afford to spend a week trying to debug a screwed-up Tomcat install when we have a deadline looming.
We have a dev Tomcat working on Lettuce now, so that'll be the place to deploy the app. The problem now is how best to deploy the application itself under this Tomcat.
What we'll probably do is this:
- Change web.xml to rename the application from "Cocoon" to "teiJournal".
- Also change web.xml to set all encodings to UTF-8 (some are still 8859-1 in the default setup).
- Deploy the application itself in a directory called ialltjournal in the webapps directory of Tomcat.
Right now, we don't have permission to restart Tomcat, so we can't have it find the webapp where we've placed it; we've tried "deploying" it using the Tomcat manager, but since it's not a WAR file, that doesn't seem to work; telling it to deploy an application from the directory where its files are seems to cause those files to be deleted. Once Greg has control over Tomcat, we can restart it and see if it finds the new webapp or not.
Completed the system for the main site pages. They're based on a single document, in TEI, which has a number of <div> elements in its body. The document is stored in the db, and in response to a request asking for a particular id, the correct <div> is turned into a page. In the process, I abstracted quite a lot of XSLT content from the article XSL base file into a general file (general.xsl), so that it's available to other processing pipelines.
One thing remains to be done: we need to set up a page of indexes which harvests all the key data (names, references, etc.) into lists that can be browsed, linking back into articles. I'm not exactly sure how to approach that yet.
When search hits are displayed in the contents page, they now use only the preceding and following text() elements, unless these are less than 60 characters in length respectively, in which case another one is retrieved.
Spent the day on this, since we have a deadline to release the journal site at the end of November. The search/contents page is basically finished. Details:
- The search text box is visible by default, along with a search button, and a More... button.
- Pressing the More... button shows the search filters.
- This is the case UNLESS a search has previously been done using the filters; in that case, the filters are visible, and the correct items are selected in them.
- The page itself configures its main heading based on whether a search has been conducted, or it's just showing the whole list of articles.
- When there are search hits, the hits are shown in context, highlighted, in a block below the article entry in the TOC table.
Tested on a range of browsers. Seems to work well everywhere, although it looks ugly on IE6.
One remaining issue is trimming the size of the search hit contexts. Right now, I'm taking the two preceding and the two following text() nodes, without checking how big they are; I should trim them to a maximum length just to keep the output under control.
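The trimming could be as simple as capping each context string and cutting back to a word boundary. A sketch of the idea (illustrative only, not the live XQuery; the ellipsis handling is an assumption):

```javascript
// Sketch: keep at most maxLen characters of context on each side of a
// hit, trimming partial words at the cut and marking cuts with "...".
function trimContext(before, after, maxLen) {
  var b = before.length > maxLen
    ? '...' + before.slice(before.length - maxLen).replace(/^\S*\s/, '')
    : before;
  var a = after.length > maxLen
    ? after.slice(0, maxLen).replace(/\s\S*$/, '') + '...'
    : after;
  return [b, a];
}
```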
The next stage is to implement a site title bar and menu, probably via XSL includes, based on strings in the database; the menu will be a horizontal single-level thing, because we have hardly any content to put in it. I've written to the editors to get some content for the site pages.
Following that, I need to implement an addition to the article display code which allows the same search to be run against an article, and for it to have the same text highlighted as on the search page, so that, having found hits in an article, you can go to it and see them all highlighted there too. Not quite sure yet of the best way to approach that.
Began working on the XSLT for XHTML output of the search. Got the search form and fields all rendering themselves, with the correct selected data, and got started with the table of documents. The form itself seems to be losing all its data when it's submitted -- have to look at that. :-)
Rewrote and extended the contents.xq code to add:
- Sorting by document type (previously missing from the sort parameters, for some reason)
- XML <list> elements containing items for each distinct value of year, author, title (truncated), vol/issue, and content type.
- Addition of an @rend attribute for an item in the list which matches an existing search parameter.
This will enable easy creation of drop-down lists with correctly-selected values in the XSLT which processes the contents page.
The next stage is to plan how that XSLT will be written -- in particular, how it will work with the rest of the site. I'm assuming at this stage that we'll want a standard set of processing libraries for the non-article site content, and that this content will be drawn from the db as well (although perhaps from another area of the db). I'll need to look at the strings components we have in the db, to see if it's going to be sufficient to build all the surrounding content with those alone, or whether we'll need some user-editable XML files for menus, headers, footers etc.
Turned off Java binding in eXist's conf.xml file (in cocoon/WEB-INF), and the Java module continues to work fine, so we've nailed that.
Looking at the rendering of documents in my pilot install on Greg's machine, I noticed whitespace missing from mixed content. This is a problem we've encountered before, but it took a while to remember what needs doing. This part of conf.xml needs to be changed, from:
<indexer caseSensitive="yes" index-depth="5"
         preserve-whitespace-mixed-content="no"
         stemming="no" suppress-whitespace="both"
         tokenizer="org.exist.storage.analysis.SimpleTokenizer"
         track-term-freq="yes" validation="none">
to:
<indexer caseSensitive="yes" index-depth="5"
         preserve-whitespace-mixed-content="yes"
         stemming="no" suppress-whitespace="none"
         tokenizer="org.exist.storage.analysis.SimpleTokenizer"
         track-term-freq="yes" validation="none">
In other words, two changes to attribute values: preserve-whitespace-mixed-content from "no" to "yes", and suppress-whitespace from "both" to "none". Then you have to re-upload all the documents in the db.
Built my Java module and got it working on the test Tomcat stack on Greg's machine. As usual, the eXist documentation was slightly wrong, adding overhead to the existing pain. For the record, this is what works when you're importing the Java module in XQuery:
import module namespace su="http://hcmc.uvic.ca/namespaces/xqsearchparser" at "java:ca.uvic.hcmc.xqsearchutils.SearchParserModule";
The next thing to try is turning off Java binding in the config file. If it still works with binding turned off, then we have no security issue, and I can then try putting my module into the existing Cocoon setup on Lettuce; that would let me test and get the TOC page working for teiJournal while we're still waiting for it to get its own Tomcat stack.
Marked up Doug's intro, and in the process added handling for the <foreign> tag, and documented use of <hi> and <foreign> on the Website.
The president and president-elect's intro to the journal was marked up and added to the db.
Finally got the driver and function wrapper classes to build against the eXist jars, so I can now flesh out the wrapper that turns my search parser classes into a module I can call from XQuery without turning on Java binding. This was harder than I wanted it to be, mainly due to my unfamiliarity with the eXist source.
I've spent a lot of the day struggling to find a setup which will allow me to write extension modules for eXist. This is probably a fairly simple issue, but I'm new to Java so I'm having problems. I tried both in Eclipse and in NetBeans, and got a bit closer with the latter. The issue is figuring out how to link in the eXist source, and the libraries on which it depends such as apache and anntlr files. So far I haven't found a setup which enables my IDE to successfully build my classes while finding all of the related classes in the various locations.
Finally, I went through and added all the JAR files in the eXist tree to the project. That seems to have helped a bit, but it's still failing to find some stuff (QName and SequenceType). Back to this tomorrow.
Started looking through the instructions here about writing XQuery modules for Java. Downloaded the latest stable eXist release to get the source, so I can link into it with Eclipse. Discovered that the instructions refer to org.exist.xpath, which isn't in the source; these functions seem to have moved to org.exist.xquery.
I started with the decision that my own packages ought to be properly hierarchical and named, so I created a new package called ca.uvic.hcmc.xqsearchutils, and moved the current search classes into it. Then I recompiled the jar file, and checked that I can still invoke the classes in the old way, based on Java binding. This means I have to use this line:
declare namespace su="java:ca.uvic.hcmc.xqsearchutils.SearchParser";
instead of the simple class name to invoke the class. That works fine on our test system.
Next, I tried linking in the eXist source to my Eclipse project. There are lots of errors and warnings when I do this, so linking to an external folder might not be the way to go. I'm going to try putting the source inside my existing Java library.
This documentation covers the changes you have to make to the default download of Cocoon + eXist built as a WAR file to get everything working for our regular projects (of which teiJournal is a good example). This process was tested using a clean download of Cocoon+eXist today.
- First, rename the WAR file to "cocoon". If you don't do that, you'll be dealing with ungainly long directory structures for ever.
- Next, deploy the WAR file (i.e. stop Tomcat if it's running, put the WAR file in its webapps directory, and restart it).
- Check that Cocoon has deployed OK by going into the Tomcat manager application to see it, and also going to :8080/cocoon/.
- Stop Cocoon from within the Tomcat manager so you can add some custom Java libraries.
- For teiJournal, add the following libraries to cocoon/WEB-INF/lib:
  - TitleSortComparator.jar (sorts titles ignoring leading articles, etc.)
  - xqSearchUtils.jar (contains the search-string parsing functionality used for teiJournal's search)
- For support of XSLT 2.0, you need to add and configure the Saxon 8 libraries. Get the Saxon-B download from here. Then extract all the .jar libraries into cocoon/WEB-INF/lib.
- Now we need to configure Cocoon so that Saxon can be called. First, open cocoon/WEB-INF/cocoon.xconf, and find the bit that refers to Saxon XSLT, which is commented out by default. Uncomment the code and change it according to the instructions in the file, so that it enables Saxon 8:

  <component logger="core.xslt"
             role="org.apache.excalibur.xml.xslt.XSLTProcessor/saxon"
             class="org.apache.cocoon.components.xslt.TraxProcessor">
    <parameter name="use-store" value="true"/>
    <parameter name="transformer-factory" value="net.sf.saxon.TransformerFactoryImpl"/>
  </component>

- Now we need to edit cocoon/sitemap.xmap to enable the Saxon transformer. In the <map:transformers> section, add this below the other XSLT transformers:

  <map:transformer name="saxon" pool-grow="2" pool-max="32" pool-min="8"
                   src="org.apache.cocoon.transformation.TraxTransformer">
    <use-request-parameters>false</use-request-parameters>
    <use-browser-capabilities-db>false</use-browser-capabilities-db>
    <xslt-processor-role>saxon</xslt-processor-role>
  </map:transformer>

- Add the following in the <map:serializers> section, to enable a couple more useful output formats:

  <!-- Customization: compatibility setting for IE6 -->
  <map:serializer logger="sitemap.serializer.xhtml" mime-type="text/html"
                  name="xhtml11_compat" pool-grow="2" pool-max="64" pool-min="2"
                  src="org.apache.cocoon.serialization.XMLSerializer">
    <doctype-public>-//W3C//DTD XHTML 1.1//EN</doctype-public>
    <doctype-system>http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd</doctype-system>
    <encoding>UTF-8</encoding>
  </map:serializer>

  <!-- Customization: set text output to UTF-8 -->
  <map:serializer logger="sitemap.serializer.text" mime-type="text/plain"
                  name="text" src="org.apache.cocoon.serialization.TextSerializer">
    <encoding>UTF-8</encoding>
  </map:serializer>

- Now, if you're going to use Java Binding in eXist XQuery modules, you'll need to enable it. Open cocoon/WEB-INF/conf.xml, and change <xquery enable-java-binding="no"> to <xquery enable-java-binding="yes">. IF YOU DO THIS, BE AWARE OF THE SECURITY ISSUES -- see below.
- Restart Tomcat (restarting just Cocoon doesn't seem to be enough).
- Check that Cocoon is running (:8080/cocoon/).
- Check that eXist is running (:8080/cocoon/samples/blocks/exist/).
- Start the eXist client for the first time (using the Webstart Client Launch link on the menu of the page above).
- There will be no admin password at first; just log in as admin with no password, then change the admin password. You'll immediately get an error; don't worry, just close the client down and restart it, then log in with the new password.
- Now you can add data and a project folder as necessary, and test the results.
- Now, IF YOU'RE USING JAVA BINDING, you'll need to configure some security using XACML. We don't know how to do that yet :-)
Finished writing and doing basic testing of highlight-matching.xsl. The purpose of this file is to supplement the normal way eXist tags
search hits. When single items are searched for using the normal eXist &= and
|= operators, each hit is tagged with an <exist:match> tag. This tag can then
be used in the output to highlight the hits. However, when we search for an
"exact phrase", we have to use the XPath contains() function in the XQuery.
In this situation, the hits are not highlighted, because eXist isn't using its tokenized
index. Therefore we need to keep track of the phrases which the user has
searched on (only those searched using the + or AND operators), and then
process each text node in the output to see if any of those phrases appear
in the node. If they do, we need to highlight them.
This is achieved by first harvesting the search phrases from the <teiHeader> of the
search results document, and putting them into a sequence called $phrasesToTag.
Then we match on each text() node, and (assuming there are any phrases to
highlight), we pass that node to a recursive template called tagMatches. This
checks each of the elements in $phrasesToTag against the text node, using
the contains() function; if it finds the phrase within the text node, it passes the
text node and the phrase to another recursive function called tagMatch, which
highlights all instances of that phrase.
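For illustration, the core of the tagMatch step can be pictured in Java terms like this (a sketch only; the real implementation is a recursive XSLT template, and it emits <exist:match> elements rather than the invented markers used here):

```java
// Illustrative sketch of the tagMatch logic: every occurrence of a phrase
// in a text node's string value is wrapped in an open/close marker pair.
// Class and method names are invented for this example.
public class PhraseTagger {
    // Wraps each occurrence of phrase in text with the given markers.
    public static String tagMatch(String text, String phrase, String open, String close) {
        StringBuilder out = new StringBuilder();
        int from = 0;
        int idx;
        while ((idx = text.indexOf(phrase, from)) >= 0) {
            out.append(text, from, idx);                   // untouched text before the hit
            out.append(open).append(phrase).append(close); // the tagged hit
            from = idx + phrase.length();
        }
        out.append(text.substring(from));                  // trailing text after the last hit
        return out.toString();
    }
}
```

The XSLT version has to do the same thing recursively, since it builds nodes rather than strings, but the shape of the problem is identical.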
The only limitation at the moment is that if there are multiple phrases, and more than one of them shows up in the same text node, only the first will be tagged. This is annoying, but it's not really a big deal; it's unlikely that multiple matches of phrases will show up often in the same text node, and in any case, the text node itself (or its ancestor node) will be returned as part of the hit report due to the first phrase being tagged in it, so we won't worry so much about this.
There's also probably room for some optimization in this code, but that will come later.
I've found the solution to an old problem: how to get the request URI into the output. I can pass it as a parameter into an XSLT transformation like this:
<map:transform type="saxon" src="xsl/highlight_matches.xsl">
<map:parameter name="browserURI" value="{request:requestURI}?{request:queryString}" />
</map:transform>
It doesn't include the protocol, server, or Tomcat port, but those are not important anyway, really.
I'm also beginning work on the code which highlights the phrase matches through XSLT. Actually, it seems most likely that this will be achieved through a call to a Java class, as documented here.
Working on the Java classes alongside the XQuery code that's calling them, I've now extensively revised both, to give results which seem to work well, and give valid, usable output. The list of phrases to highlight is now provided as an array by the Java class, and the XQuery iterates through that array to create a list of items, resulting in actual XML elements rather than textual data. When there's no search string, a dummy text body is returned for each document header, so that the resulting corpus file is valid.
Because testing with a Java class can be risky, Greg's set up a test environment for me to hack at, with a working Tomcat/Cocoon/eXist stack containing teiJournal.
In this environment, I've deployed the Java classes I wrote, and integrated them into the XQuery. This is how you do it:
- Add the jar file which contains your classes to the Cocoon/WEB-INF/lib directory.
- Add a namespace for your class like this:

  declare namespace su="java:SearchParser";

  My class is not qualified by a hierarchy; it's a top-level class in a library called xqSearchUtils.jar. It appears that the name of the jar file does not have to match the name of the class; the two classes in my jar file are called SearchParser and SearchBit.
- Construct an instance of your class like this:

  declare variable $sp := su:new($search);

  This is equivalent to Java: SearchParser sp = new SearchParser(search);
  In this case, $search is the variable containing the search string the user has submitted.
- Call functions on the object like this:

  declare variable $searchStringBit := su:escGetXQueryClauses($sp);

  which is equivalent to Java: String searchStringBit = sp.escGetXQueryClauses();
The only slight complications I encountered concerned the escaping of strings coming back from the Java object; to handle this, I added extra functions to the object to escape strings for XML. However, I have one slight issue remaining: the "phrases to highlight" stuff is coming back as a string, even though it's a set of XML <seg> elements. I need to figure out how to get elements from it.
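The extra escaping functions mentioned above amount to the standard entity substitutions; roughly this (a sketch, with an invented class and method name; the real methods live on the SearchParser object):

```java
// Sketch of the kind of XML-escaping helper described above, added so
// that strings handed back from Java to XQuery are safe to embed in XML.
// Note: '&' must be replaced first, or the other entities get mangled.
public class XmlEscaper {
    public static String escapeForXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```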
Once that's all working, I need to turn on XACML, and figure out how to nail down the use of Java so only my classes can be called.
Added handling for the AND, NOT and OR keywords, so they are converted to the equivalent +, - and nothing-at-all forms before the search string is parsed. Strings which use either syntax now give equivalent results.
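The conversion can be sketched like this (class name invented; the real code lives in SearchParser, and a real version would also have to avoid rewriting keywords inside quoted phrases):

```java
// Sketch of the keyword-normalization step described above:
// AND becomes a leading + on the following term, NOT becomes a leading -,
// and OR simply drops out, since un-prefixed terms are already optional.
public class KeywordNormalizer {
    public static String normalize(String query) {
        return query
            .replaceAll("\\bAND\\s+", "+")  // AND x  ->  +x
            .replaceAll("\\bNOT\\s+", "-")  // NOT x  ->  -x
            .replaceAll("\\bOR\\s+", "");   // OR x   ->  x (plain optional term)
    }
}
```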
Added another function to the Java class, to enable the calling XQuery to embed XML detailing any phrases which will need to be highlighted in output. This is explained in the comments in the class file:
 * This is how you would use this class:
 *
 * 1. Get the search string as input from a form, and collect it
 *    in the XQuery using request.getParameter().
 *
 * 2. Create a SearchParser instance, passing the search string to
 *    the constructor.
 *
 * 3. When you need the XQuery clauses to tack onto your XQuery
 *    code, call the SearchParser's getXQueryClauses() method. What
 *    you get back can form part of a string passed to util:eval().
 *
 * 4. When you need to embed phrases that your XSLT can find and
 *    highlight, call the SearchParser's getPhrasesToHighlight() method.
 *    This gives you back a <seg type="searchHitPhrase">...</seg> element
 *    for each phrase which is included in the search. This is necessary
 *    because, when we search using eXist's &= or |= operators, the hits
 *    come back tagged with <exist:match> elements, but when we use the
 *    XPath contains() function (which is necessary for doing phrase
 *    searches), they don't come back highlighted. XSLT can use the <seg>
 *    elements to perform some later highlighting on search results,
 *    perhaps by adding <exist:match> elements which can be rendered in
 *    the normal way later in the XSLT.
Having decided, for the moment, to offload the heavy lifting of the search string parsing to Java code, I rewrote the contents.xq page so that it just does a simple |= job on the search string contents, so I can carry on working on the rest of the page. This seems to work fine. Next, we move on to the XSLT, which is going to be complicated too: it will have to find all the matches that were achieved with contains(), which will not be tagged up with <exist:match> tags, and either tag them up or highlight them in some other way. That means I really do need to know what the search string was, even at the XSLT level. Perhaps this tagging should take place in a preliminary XSLT operation, prior to the main rendering of the page.
Also, it strikes me that the Java class I'm writing ought also to be able to spit out a list of the phrases that will need highlighting, in a format that's easily parsed by XSLT. That would make things quicker and simpler.
My two simple classes for parsing a Google-style search string and emitting XQuery clauses are now finished and tested; I've added extra features for handling redundant spaces, returns, and other deformities in the input, and everything seems to be working pretty well. The only slight difficulty is the problem of handling optional phrase matches; with mandatory phrases, we can use [contains(., "the phrase")], but with optional phrases (not preceded by +) we have to fall back to adding all the components of the phrase to a [. |= "the phrase and other stuff"] clause. This is not ideal, but I don't see a comparable XQuery clause I can use, and I don't want to break the system down any further. Another option would be to treat all phrases which are not minused ("must not contain") as being mandatory; we should consider that, or even make it a switch.
The next stage is to build the jar package, code links to it into the XQuery, and test it on a sandbox system that Greg is setting up, where we can mess with the XACML settings and turn on Java binding.
After some serious brainstorming with others in the lab, it seems that it's going to be virtually impossible to parse a Google-style search string in XQuery. The best approach is probably going to be using a Java class, which can be called from inside XQuery, to parse out the components of the search string and return a block of XQuery clauses which can be appended to the query as appropriate. This will entail enabling and configuring XACML in order to protect the db against the running of arbitrary Java code.
Started writing the Java package in Eclipse; so far it can parse out each of the components of the search string successfully, and decide which type of components they are. Next I need to build the XQuery in the most efficient way (for instance, amalgamating all of the single-word must-contain components into one &= clause).
Marked up the editorial introduction to the volume.
The author of the third article wanted to include much more detailed information in the affiliation tag, so I've added the new info, and changed the XSLT so that it simply inserts a linebreak after the name and then processes the rest of the contents as they are. <lb> tags have to be used to divide lines, because we can't have <p> tags inside <affiliation>.
Today I got the sorting stuff written and tested. It's basically a bunch of nested if-then-elses, with some fancy footwork to handle sorting on two items at once (such as volume+issue). This sorting does include sorting by title, but with XSLT we can also call our own little Java library containing a sort comparator, which enables us to handle e.g. sorting titles while ignoring leading articles. We'll probably sort the returns in XQuery anyway, but pass the sort key on to the XSLT so that it can decide if it needs to sort them again with a more sophisticated comparator.
I also started work on the text search, which is a more complicated business. First, we need to split up the search string into a sequence of items. Items can be:
- quoted strings (for exact matches), or
- individual words.
Each item can be preceded by
- + (must match),
- - (must not match), or
- nothing (optional match).
Each non-quoted string may contain wildcards:
- * (any char(s))
- ? (any single char)
For each item, we must generate an XQuery expression, and then we have to string them all together in an appropriate way to create a single clause.
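The item grammar above can be captured in a first tokenizing pass, sketched here in Java (names are invented for the example; the real classes are SearchParser and SearchBit, and wildcards pass through the tokenizer untouched):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the first stage described above: split a Google-style search
// string into items, each either a quoted phrase or a single word, with
// an optional +/- prefix preserved on the front of each item.
public class SearchTokenizer {
    // First alternative: optional prefix, then a quoted phrase.
    // Second alternative: optional prefix, then a bare word (may contain * or ?).
    private static final Pattern ITEM =
        Pattern.compile("([+-]?)\"([^\"]+)\"|([+-]?)(\\S+)");

    public static List<String> tokenize(String query) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(query);
        while (m.find()) {
            if (m.group(2) != null) {
                // quoted phrase: keep the prefix and the quotes
                items.add(m.group(1) + "\"" + m.group(2) + "\"");
            } else {
                items.add(m.group(3) + m.group(4));
            }
        }
        return items;
    }
}
```

Each resulting item would then be turned into the appropriate XQuery clause in a second pass.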
I've done this job in a less efficient and more constrained way for the Mariage project, but this is an opportunity to create a really good mapping between Google-style search syntax and eXist search functionality. There are significant problems, of course, especially with match-highlighting, but I think we can do a reasonable job.
Got the framework of the XQuery done, with search filters such as volume, issue, author, keyword etc. working. Still have to get the search system working, with useful match returns -- should the XQuery do the KWIC construction, or the XSLT? -- and I also need to get ordering working.
The plan to merge TOC and Search is a good one, though. We could actually put the search form on every page, if we want. Right now, the TOC/search is handling only texts in the /texts/ subcollection, but if the rest of the site content is to be stored in the db too -- which I think it should -- then we ought to allow an option to sort that material as well. That should perhaps be a switch, but I'm not sure yet how best to implement it.
Got the third article completed and posted. In the process, I made a couple of tweaks to the CSS (line-height and font sizes) to make the page look a bit easier on the eye.
Did some work on the Bibliographical Markup page of the encoding documentation. It's still in need of worked examples of <biblStruct> elements, but I'll add those soon.
Managed to make some progress this morning, despite network issues. This is my report to the group:
I've just had a chance to do the biblio on the latest article. I looked through the APA guide, including the electronic resources PDF, and there's no specific mention of video or DVD at all, as far as I can see; the key category seems to be "Motion picture". Accordingly, I've decided to treat the item as a motion picture, and relegate the information about the format to a note, like this:
<biblStruct xml:id="dörnyei_2005" rend="video">
<monogr>
<respStmt>
<resp>Speaker</resp>
<name>
<forename>Z.</forename>
<surname>Dörnyei</surname>
</name>
</respStmt>
<title level="m">A closer look at Motivation in the language learning classroom</title>
<imprint>
<pubPlace>Stirling</pubPlace>
<publisher>Scottish Centre for Information on Language Teaching and Research (Professional Services), University of Stirling</publisher>
<date when="2005"></date>
</imprint>
</monogr>
<note>DVD and online video.</note>
<note><ref target="http://www.scilt.stir.ac.uk/dvd/index.html"><date notAfter="2007-09-18">September 18, 2007</date></ref></note>
</biblStruct>
This has the advantage of neutrality over which format (DVD or online video) is primary. I've written rendering code for it, and you can see the output on the page. There's no content in the doc at the moment, just the framework and the bibliography. So far we're handling the following reference types:
- book
- book chapter
- journal article
- presentation
- video
and we'll be adding more as we go along. Meanwhile, I'm refining the basic code every time, to handle different contingencies that crop up (such as the need to make Dörnyei a "speaker" on the video, rather than an author or an editor).
I've also begun work on the table-of-contents code, which is still in its early stages; it needs to be fairly complex, because I'm envisaging it as combined with the search system, like this. The idea is that the complete TOC of articles would be available on one page, but there would be a "Search/Filter" button at the top; if you click that, a form will appear, where you can choose to filter the article list by a range of different criteria, including volume, start date, end date, author, type of article, keywords, and even search text; and you'll be able to sort the results in various ways. Clicking an "Apply" button would retrieve the new results and show them as a TOC. If you've searched for text, then you'll also see gobbets of text with the search term highlighted, in the TOC list. Any of these searches or TOCs would be accessible through a URL (all the parameters are in GET variables), so anyone could send out the URL of a list of articles as a "collection". I'm hoping to have a basic TOC available within a week or so, but the searching/filtering bit might take a little longer.
I'm working on a script called contents.xq which will retrieve a list of documents from the db based on a very flexible series of parameters, and return them as a well-formed TEI document containing a <listBibl> of <biblStruct>s. The basics are in place, but I'm stuck on one really annoying detail: I'd like to include the request URL from which the data was generated, and I can't find a way to get that information, or pass it into the XQuery. I'm sure it's possible through the sitemap, but I can't figure it out yet. If it proves too difficult, I can probably pass it into the transformation instead.
Wrote a bit more of the XSLT-to-CSS code.
Converting between CSS and XSL attribute sets in XSLT is necessary for storing styles in the db, and a bit fiddly to do manually, so I've been working on a Java app to handle it. CSS to attribute-sets is working, and I'm halfway through the reverse. This code may be repurposed in a Java library that could be called from a Cocoon pipeline, as part of the editor's GUI.
Made some final fixes to XSLT and CSS based on the W3C validator, and reported to the editors as follows:
I've finished a first pass through generating the XHTML output from the two documents we've marked up so far. The results are here and here.
Some points to note:
- Both XHTML and CSS validate, according to the W3C validator.
- My choices as to fonts, spacing, sizing etc. are just arbitrary; I'll need your input into that.
- Wherever APA has anything to say about how something should be presented, I've tried to follow it (for instance, in the display of tables with no vertical and few horizontal lines), but I may have missed some APA diktats, so do let me know if you see anything odd.
- Notes and references work as popups on the right of the text (hence the larger margin on the right). However, if the user has JavaScript turned off, they should just work as straightforward internal links.
- Images embedded in the text are links to full-size versions of themselves.
- Metadata is embedded in the source of the document as Dublin Core meta tags. I plan to add some JavaScript which can parse it out from the header and present it to the user in a more human-readable form, but the meta tags are important for machine-reading, and for folks who turn off their JavaScript.
I think it would be useful at this stage to concentrate on making sure all the display features are working as they should, and following APA, and on making some basic choices about font style and display characteristics. Once we've got an XHTML appearance we're happy with, I can start porting that over into the XSL:FO/PDF output, which is a bit more tricky.
Other things on my mind:
I'm wondering if it would be useful to have a view of the text in which the JavaScript and CSS is embedded directly into the document, so that it would function as a single portable file. This portability would be undermined by the fact that images would still have to be externally linked, though, so perhaps it's pointless.
Tables have a minimum width which is determined by the minimum wrappable size of their content cells, so they sometimes stick out beyond the text column -- see, for instance, Table 3 in Yaden, with your browser window sized a bit smaller than usual.
This is probably unavoidable, but if you'd like to put some thought into ways to avoid it, I'd be happy to have suggestions.
URIs in output are often problematic because they have no spaces which allow text-wrapping to trigger. Added a custom XSLT function to add zero-width spaces after each slash and period, to make text-wrapping feasible for the user agent.
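In Java terms, the replacement the XSLT function performs amounts to something like this (a sketch; class and method names are invented, and the real code is an XSLT 2.0 function):

```java
// Minimal sketch of the URI-wrapping trick described above: insert a
// zero-width space (U+200B) after each slash and period so the user
// agent has legal break points inside long URIs.
public class UriBreaker {
    public static String addBreakPoints(String uri) {
        return uri.replaceAll("([/.])", "$1\u200B");
    }
}
```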
Got some basic layout styles done, and then added the note and reference popup code. The way I've done it means that if JavaScript is turned off, note and reference links simply bounce you to the bottom of the page (or the appropriate place); if JS is on, then the href attribute is removed, and an onclick event pops up the relevant info in the right margin. This is basically working, except for some types of link (internal links to tables, appendices etc.), which need to be looked at.
But we're nearly there for the XHTML!
Document types handled so far are books, journal articles (with and without authors), book chapters, and presentations. This covers everything in the first two documents. Checking the previous entry to see if the name list is the same, and replacing with a dash, actually turns out to be unnecessary; the APA style guide doesn't seem to mention it, and shows examples of multiple items by the same authors with names shown in full (p.220, section 4.04).
Abstracted the regexp period-adding code into an external function:
<xsl:function name="mdh:addPeriodIfNeeded" as="xs:string">
<!-- Incoming parameters -->
<xsl:param name="inNode" as="node()" />
<xsl:sequence
select="if (not(matches($inNode//text()[last()], '.*[\.\?!]{1}$')))
then '.'
else ''" />
</xsl:function>
According to the eXist docs, following:: and preceding:: will work, but not with wildcards; so following::text() is worth a shot...
Today:
- Added rendering for the document title and authors to the output.
- Re-thought the db structure for default strings and styles. It turns out we'll need GUI strings specific to individual style guides (e.g. "Retrieved [date], from [URL]" for APA), so I've subdivided the default subcollection into strings and styles subcollections, added an apa_strings.xsl file, and updated the getGuiStrings.xq code so that it collects strings from the whole default/strings subcollection. Similarly, the code for retrieving styles had to be slightly updated to take account of the db structure change.
- Greatly expanded the db style information. We're now getting down to the nitty-gritty of rendering styles as the XHTML code moves forward, and I'm making a lot of decisions about what goes in the base styles and what goes into the style-guide style document.
- Added rendering for appendices.
- Began serious work on the bibliography rendering. So far I've done books and journal articles; a lot of stuff that needs to be done for all document types is now complete, including retrieval information for electronic references, name rendering, and title handling.
Tomorrow I'll try to finish the first pass through the biblio code. One major issue still remains: finding out if the current set of authors (or editors, or whatever) is the same as the previous set, so that a dash should be used. That may take some thought.
Some of the code written today makes good use of the new features in XSLT/XPath 2.0 (for instance, I use a regular expression match to determine whether a title ends with punctuation or not, so I can add a period in the reference only when it's needed). The power of this has got me thinking about the possibility that the old thorny issue of commas inside quote marks could be handled this way. Imagine that, when rendering an article title in the text, the code looks ahead to the next text() element to see if it starts with punctuation. If it does, it grabs that punctuation and includes it before the closing quote; similarly, a text() matching template could check the preceding sibling to see if it's an element that would be rendered with quotes, and if so, any leading punctuation is removed. That would be cool. It would require a list of all elements that are rendered with quotes, to check against. The only thing not quite clear to me yet is how to reliably find the text() element immediately following the quoted element. There is a following:: axis, so following::text() should do it, but IIRC eXist doesn't yet support this axis.
Cracked a major set of problems for teiJournal today.
Display styles are stored in three different places in the database: base_styles.xsl,
[styleguide]_styles.xsl, and the user's styles.xsl (containing customized styles). Each file contains a block of <xsl:attribute-set> elements, each of which represents a ruleset, and contains a set of <xsl:attribute> elements, each of which represents a property and a value.
These blocks then have to be combined in an intelligent way. The basic hierarchy is that base styles are overridden by any applicable styles in the style guide, and those styles are overridden by any in the user styles file. So any user styles replace styles from the other two files, style guide styles replace base styles, and base styles are output where there are no overrides in the other two files. Furthermore, where there are rulesets or rules in either of the two lower files which are not represented in their ancestors, these need to be output as well.
The big step forward today was the creation of an XQuery file capable of doing this cascading combination. The resulting code is pretty small, and worth documenting in full. There are two functions, f:getCombinedDoc(), which retrieves all the rulesets from the three documents in the database and joins them together into one file, and then f:getAttributeSets(), which combines rulesets together, ignoring any overridden rules, to produce a single source in the form of an <xsl:stylesheet> document. The functions look like this:
declare namespace xsl="http://www.w3.org/1999/XSL/Transform";

declare function f:getCombinedDoc() as element(){
let $guideId := request:get-parameter('guide', 'apa'),
$base := doc('/db/teiJournal/settings/default/base_styles.xsl'),
$guidePath := concat('/db/teiJournal/settings/default/', $guideId, '_styles.xsl'),
$guide := doc($guidePath),
$user := doc('/db/teiJournal/settings/user/styles.xsl')
return
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
{$base//xsl:attribute-set}
{$guide//xsl:attribute-set}
{$user//xsl:attribute-set}
</xsl:stylesheet>
};
declare function f:getAttributeSets() as element()*{
let $doc := f:getCombinedDoc()
for $setName in distinct-values($doc//xsl:attribute-set/@name)
return
<xsl:attribute-set name="{$setName}">
{
for $attName in distinct-values($doc/xsl:attribute-set[@name=$setName]/xsl:attribute/@name)
return
$doc//xsl:attribute[@name=$attName][./parent::xsl:attribute-set[@name=$setName]][position() = last()]
}
</xsl:attribute-set>
};
The resulting document is then passed on to an XSLT transformation which turns it into a real CSS stylesheet. The pipeline looks like this:
<map:match pattern="*/style.css">
<map:generate src="xq/getStyleSheet.xq" type="xquery">
<map:parameter name="guide" value="{1}" />
</map:generate>
<map:transform type="saxon" src="xsl/attribute_sets_to_css.xsl"/>
<map:serialize type="text" mime-type="text/css" />
</map:match>
Note the mime-type attribute on the serializer: without this, the file is served as text/plain, and the browser fails to interpret it as CSS, so it doesn't apply it to the Web page (this took half an hour to figure out).
So styles are now being applied to the XHTML output, and I can begin refining those styles by building the attribute-set files. Meanwhile, the last bits and pieces of the XHTML output itself need to be completed (appendix and biblio handling).
All the XML tags we're currently using in the body text are now handled, including all the <hi> variants, <mentioned>, <term>, <soCalled>, etc. Images are also handled -- they show up with a class which will constrain their size, but they are also links to the full image. There's a slight oddity with the way FF shows images when they're not part of a document, though; I've posted a query on the Cocoon list about that.
Notes are handled as in the Mariage site, except that they're simpler (no refs to annotations to handle). The JS that pops them up will follow the model of that on EMLS.
Actually, I'm wondering if we could manage to handle note links (from markers to actual notes) in a couple of ways, depending on whether JavaScript is turned on. We could default to a straight link (<a> tag) which would bounce you down the page to the note in the note list; then, on page load, if JS is turned on we could have a function which iterates through all the note links, and hides the <a> tag, inserting another tag right after it which pops up the note in the right margin, as on EMLS. This would be elegant and flexible.
This is handy when you're checking whether your XSLT handles all the elements your documents happen to use:
distinct-values(//*/descendant-or-self::*/name(.))

Started working through the basic structure of the XHTML output. The headings (APA stuff) were a bit tricky, but I've figured it out; headings are always h2 down to h6 in XHTML tag terms, but they also get a class attribute which is based on the level they're at and the number of levels, so we can style them appropriately. Lists, tables and quotes are handled, as are names and abbreviations. There are still titles, figures, graphics, notes, mentioned/soCalled/term etc., and the dreaded bibliography to do. There's also the wrinkle that appendices may have nested headings, and those headings are styled based on THEIR nesting level, not the levels in the main text. However, I think it's reasonable to assume no more than three levels in appendix headers, so we can avoid a lot of calculation that way.
We have a slightly interesting dilemma which is the result of some oddities in the APA style. Articles may have multiple levels of header in them, if they're divided into sections (both the articles I've worked on so far have two levels of header). APA, rather strangely, chooses to style headers based on the number of levels that happen to be present in the article; so, for example, where there are two or three levels, the second level is aligned left, but where there are four levels, the second level is centred, and the third level is left-aligned instead. For full details, see the APA Style Guide section 3.32.
Quite frankly, I think this is astoundingly silly, and so does everyone else I've shown it to. It means that the second-level heading in one article may well be styled differently from the second-level heading in another article. When we get to five levels, it gets even sillier; an ugly all-caps header is inserted at the top level, pushing all the other levels down, and making that particular article look radically different from others which use fewer levels of heading.
I've never really worked seriously with APA before, so this is new to me. Chicago and MLA seem to have nothing to say about it, other than Chicago's pragmatic assertion that levels of heading are "differentiated by type style and placement" (1.74). It makes for an interesting problem for XSLT and CSS, to say the least!
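In XSLT 2.0 terms, the level-plus-depth computation could look something like this (an untested sketch; the class-naming scheme and element structure are my own assumptions, not project code):

```xslt
<xsl:template match="div/head">
  <!-- depth of this heading: how many div ancestors it has -->
  <xsl:variable name="level" select="count(ancestor::div)"/>
  <!-- total number of heading levels present anywhere in the body -->
  <xsl:variable name="maxLevel"
    select="max(//body//head/count(ancestor::div))"/>
  <xsl:element name="h{$level + 1}">
    <!-- e.g. class="apaHead2of4": APA styles level 2 differently
         depending on whether there are 2-3 or 4-5 levels in total -->
    <xsl:attribute name="class"
      select="concat('apaHead', $level, 'of', $maxLevel)"/>
    <xsl:apply-templates/>
  </xsl:element>
</xsl:template>
```

The CSS then only has to provide rules for each level-of-depth combination, rather than the XSLT having to know APA's placement rules.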
I've been kicking around some ideas for how the CSS system should operate, and I think I've come to some conclusions:
We have a range of different types of CSS data that will need to be combined. First of all, we have the base (default) CSS for the Website; then there will be customization overrides for that CSS (user). This distinction is the same as that for the interface strings. Next, we have the base CSS for article/contribution display; then, layered on top of that, is the CSS specific to each style guide (APA, MLA etc.); then on top of that come the user customizations of that display. It's really complicated trying to figure out exactly how all of this should fit together.
I need a system which enables a single stylesheet (or possibly two) to be created from the database through an XQuery, which intelligently merges all of these layers of data, which are stored in a range of different XSLT attribute-sets in the db. The problem is that some of the rulesets may be additive (at the base level, for instance, <h2> tags may be set to text-align: center, while further down they're specified as a particular font size), while others may be replacements for each other; it will be difficult to tell the difference. One approach is to gather all the instances of a particular selector from all the db files, then iterate through their attribute @name values, using only the lowest-level one (so that an h2{text-align: center;} at the lowest level would be overridden by a text-align: left setting in a user file further down the cascade). That this is possible shows the advantages of storing this info in XML format -- it's easy to parse and compare values -- but the problem is slightly complicated by the existence of shorthand properties. However, as long as user rules are included after any default rules with which they might clash, the cascade will take care of the problem, albeit at the cost of a larger file.
So the "hard" problem is really how to divide up the data amongst the various types of file. The cascade should probably go like this:
- "Base" file -- controls the site interface, and sets core font choices etc.
- "Style" file (APA, etc.). This is another base file which is added onto the top of the site style, to provide new styles required for the display of articles in the default style, etc.
- "User" file. This is a set of customizations which are added at the end of the cascade. When the user is customizing, they start from the style guide they've chosen (in other words, their GUI will present them with a combined set of existing CSS, based on the combination of the "base" and "style" data). Their customizations are modifications to this.
The only wrinkle here is that, if the user then changes the style choice AFTER adding customizations, those customizations might not interact perfectly with the new style. However, that's to be expected, and the user can be warned about it, and can modify the user styles appropriately after choosing a different style.
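The merge itself could work much like the merge-with-override approach I'm using for the GUI strings: for each selector/property pair, walk the layers from most to least specific and take the first value found. A rough XQuery sketch (the file paths and function name are illustrative, not the real db layout):

```xquery
declare namespace xsl = "http://www.w3.org/1999/XSL/Transform";

declare function local:cssValue($sel as xs:string, $prop as xs:string)
    as xs:string? {
  (: layers in ascending priority: base, style guide, user :)
  let $layers := ('base/css.xsl', 'apa/css.xsl', 'user/css.xsl')
  return (
    for $file in reverse($layers)
    return doc(concat('/db/teiJournal/settings/', $file))
           //xsl:attribute-set[@name = $sel]
           /xsl:attribute[@name = $prop]/string(.)
  )[1]
};
```

So local:cssValue('h2', 'text-align') would return the user's value if one exists, failing that the style guide's, failing that the base value — though as noted above, simply emitting user rules after default rules would let the browser's cascade do the same job.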
Final details were small but took a long time to figure out. Among them were issues of subelements. For instance, there's a need to include the names of editors and encoders in the data, but the only tag available for this is DC.contributor, which is vague. This can be qualified, like this: DC.contributor.editor. However, there's no official set of terms for this. There was a proposal to add "editor" but it was voted down a few years ago. Therefore I've had to use my own unofficial extensions, and I'm simply basing these on the value of the <resp> element in the relevant <respStmt> tags.
Beyond that, everything is fairly straightforward, although the implementation took a little while. My plan now is to base the Web view of the metadata on the Dublin Core <meta> tags that are already in the header, parsed through JavaScript, so that (for instance) the JS code would iterate through each <meta> tag, and if its name begins with "DC.", take the string after the last dot (e.g. "editor"), upper-case the first letter ("Editor") and then show the value in the content attribute. That would make a neat little JS class that could be used for any page with DC metadata in it.
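A minimal sketch of that JS idea (the function names are mine; the pure string logic is separated from the DOM part so the latter stays trivial):

```javascript
// Turn Dublin Core <meta> tags into displayable label/value pairs:
// take the segment after the last dot of the name ("DC.contributor.editor"
// -> "editor"), capitalize it, and pair it with the content attribute.

function dcLabel(name) {
  // "DC.contributor.editor" -> "Editor"; "DC.title" -> "Title"
  var last = name.slice(name.lastIndexOf('.') + 1);
  return last.charAt(0).toUpperCase() + last.slice(1);
}

function collectDcMeta(metas) {
  // metas: any list of objects with name/content properties, e.g.
  // document.getElementsByTagName('meta') in the browser
  var out = [];
  for (var i = 0; i < metas.length; i++) {
    var name = metas[i].name || '';
    if (name.indexOf('DC.') === 0) {
      out.push({ label: dcLabel(name), value: metas[i].content });
    }
  }
  return out;
}
```

In the browser, collectDcMeta(document.getElementsByTagName('meta')) would give a list of label/value pairs ready to render into the metadata view.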
Began working on the XSLT for XHTML output of articles. Noticed a problem with caching in Cocoon: where an XSLT file includes another XSLT file using <xsl:import>, changes to the imported file will not be registered by the transformer; it seems to use a cached copy. Only an update to the root XSLT file will trigger reloading of all the cached files. Reported this to sysadmin, because I've never seen this behaviour before, and I use imports a lot.
Spent the morning working through the Dublin Core spec, and making sure as much as possible of the <teiHeader> metadata is represented in HTML <meta> tags in the header of the document. I still have a few bits of info that need to be included, but it's basically done.
We need to run customized XSLT transformations based on path (so that e.g. teiJournal/apa/doc.htm gives us a page rendered using the APA style). My intention was to pass the directory name into the XSLT transformation as a parameter, which I can do, but I can't then use that to selectively import other XSLT files because imports happen before params can be declared.
The solution, therefore, is to have an apa/xhtml.xsl file, which then imports a range of files, including one which is the framework file (containing the root template match). This will (I hope) enable me to use the same basic page code, down to the body tag, from a single file in the xsl folder root, but then render the details using files in the apa folder. I'm working on that now.
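The skeleton of apa/xhtml.xsl would be very short — something like this (the imported file names are placeholders for whatever the modules end up being called):

```xslt
<!-- apa/xhtml.xsl: entry point selected by the sitemap for APA output -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
  <!-- shared framework: root template, page shell down to <body> -->
  <xsl:import href="../xhtml_framework.xsl"/>
  <!-- APA-specific renderings layered on top -->
  <xsl:import href="headings.xsl"/>
  <xsl:import href="bibliography.xsl"/>
</xsl:stylesheet>
```

An mla/xhtml.xsl would import the same framework but different style modules, so the path component effectively does the selection that a parameter can't.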
Dealt with a fairly complicated problem this morning, and it should be documented in detail because it's the sort of thing you can only figure out by trial and error. The problem is this:
We need to store user preferences in the eXist database, because they should be easily backed up, and they should be editable through an XQuery/XUpdate-based GUI (in the long run). By preferences here, I mean a range of different things, including user strings (labels and captions for the GUI), colours and fonts, and straightforward settings choices, such as the choice to use APA style. We've already figured out how to store CSS information in <xsl:attribute-set> nodes, then use XQuery to retrieve it formatted as a CSS file for the browser; that's a relatively simple issue, because the browser will always request the file directly, through a call to a URL which triggers a Cocoon pipeline. A more complex issue concerns strings for GUI captions etc. These are typically required DURING an XSLT transformation. An added wrinkle is that there are default values and possible user overrides, and the system needs to be able to deliver a set of values where the user overrides are chosen if they exist, but the defaults are returned if they're not.
This is the way I'm doing it:
First, I store the two sets of strings in two separate files in the database:
/db/teiJournal/settings/default/strings.xsl
/db/teiJournal/settings/user/strings.xsl
The format of these files is straight XSLT 2.0, and each file consists simply of a list of <xsl:variable> elements. Next, I create an XQuery file, getGuiStrings.xq, which can merge the two files to create one file, with user values where they exist, and default values where they don't. This is the meat of the file:
declare function f:getStrings() as element()*{
for $defV in doc('/db/teiJournal/settings/default/strings.xsl')//xsl:variable
let $varName := $defV/@name,
$userV := doc('/db/teiJournal/settings/user/strings.xsl')//xsl:variable[@name=$varName]
return if ($userV) then
$userV
else
$defV
};
(:
===================================================
DOCUMENT NODESET
---------------------------------------------------
:)
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
{f:getStrings()}
</xsl:stylesheet>
(:
======================== END ========================
:)
Then we need a sitemap pipeline which makes it possible to access the output from this XQuery through a URL:
<map:match pattern="xsl/db/guiStrings.xsl">
  <map:generate src="xq/getGuiStrings.xq" type="xquery"/>
  <map:serialize type="xml"/>
</map:match>
Finally, the actual base XSLT file on the filesystem, which is called when producing output, must be able to import that file. That took a little figuring out; it requires the use of the cocoon:/ protocol:
<xsl:import href="cocoon:/xsl/db/guiStrings.xsl"/>
Without the Cocoon protocol, it won't work because the XSL engine will look on the filesystem instead of invoking the Cocoon pipeline.
Got this working with a simple example using the plain text rendering I built yesterday.
I have a basic (or rather better than basic) text rendering system in place now. The difficult questions are what type of information should be included or left out (for instance, links to external documents need to have their targets rendered, but links to internal components such as the ids of biblio items need not be included), and how to present information in such a way that it's passably complete and detailed, but not actually in any particular style (because we're going to have only one generic output format for text).
The finished system covers all the tags we've used in the two articles so far, and does a reasonable job, IMHO, of the bibliography. Spacing is rendered quite well, as is punctuation; the routines for handling these will come in handy when I move on to the XHTML, which comes next.
Tried setting up an ialltjournal folder in home1t on Lettuce, but found that I couldn't get the Cocoon URLs to work, so I think there's some config that needs to be done by Greg, or perhaps it's required that the folder be the home folder of an actual user. In any case, I can wait till next week when Greg's back before I set that up. In the meantime, I can work on XSLT etc. in the teiJournal folder of my own account.
Set up a basic sitemap and some XQuery to retrieve a document, and tested serializing as XML, which works fine. Then I started working on the text output, which is not crucial (it's only really intended for text analysis) but makes a good learning tool for looking at all the textual features we're encoding.
No big surprises, although both appendices seem to me to be representations of documents that might better be shown as screencaps or scans.
A bit more thought suggests that the underlines for signature etc. referred to in the preceding post should be encoded as a string of non-breaking spaces inside a <hi rend="underline"> tag.
Spent most of the day working on the second article, which raises all sorts of interesting issues. Among them are these:
- The article has one place where there are two tables, each with their own captions, and an overall caption for the pair of tables. The only way I can think of to capture this is to have the tables as components of a <figure> element. This has the added bonus that it reduces the number of block elements that appear in a paragraph, and enables us to say that all breakout/embedded components (images, graphics, tables etc.) are (from a markup point of view) figures, and appear in a <figure> tag.
- Another such element in the document consists only of text, being a representation of a form filled out by students. This, IMHO, ought actually to be a scan of the original document, but there are situations in which I can imagine text being used in this way. There is a <floatingText> element in TEI, but this is not really floating text; instead, I'm marking it up in an <ab> tag, which is itself inside a <figure> tag.
- All this has also brought to light the issue of caption placement (it should be above tables and below figures in APA), and also numbering (which, the APA suggests, should be distinct for tables and figures, with Figure 1, Figure 2 etc. alongside Table 1, Table 2). Our markup needs to be flexible enough to accommodate not only this but other systems prescribed by other styles.
- The "form" illustration mentioned above contains some horizontal lines (for signing and dating). I'm not sure how I'm supposed to do those in TEI; that'll need some research. I've looked around a bit, but I can't find anything so far.
- I had to add the tagdocs module to the schema to get the <code> tag, but I stripped out most of its components, and took the opportunity to strip out some stuff from elsewhere at the same time, so the resulting schema came in a bit smaller. That's not a bad approach -- every time you add or change something, take a look through one or two of the modules you already have and see if there's anything you feel you can dump.
- Had to add <hi rend="compLabel"> to handle the captions of buttons, menus etc. in computer GUIs. There might be more of this kind of thing.
- Lots of back-and-forth on the TEI list about my suggestion to allow <ref> as a child of <biblStruct>, so we can include the URIs of electronic references.
- Also lots of discussion over whether lists can/should appear as siblings of paragraphs, or whether they might as well be deemed to be inside paragraphs (perhaps as the only constituent). My feeling is that my schema would be much simpler if everything like that had to be a child of <p>.
Created files for all the graphics in the 2nd article. The simplest way to do this is to select and copy them in OpenOffice, then do Acquire/From Clipboard in the GIMP. Saved them as JPEG at full quality.
The appendix issue is clear from the guidelines; each appendix should be a <div type="appendix"> inside the <back> element.
Got the metadata from the editors and created the header for the article, then started doing the markup. This article has two items we haven't yet dealt with, tables and images, as well as a number of inline features which raised questions that I referred back to the editors. The tables are straightforward grid tables, so they're easy, but the images are fairly low-res (inline in the wp document), so I asked for originals, and began the discussion of acceptable and desired formats. There are also issues relating to URIs mentioned in the text (should ref URIs be in footnotes?), and there are appendices at the end; I need to do more reading in the Guidelines about appendices to find out whether they go in the <back> element or not, and what they should actually be (<div>s, presumably, but it's not clear yet).
The biblio for the second article is shorter than that in the first, but it has some new features (electronic sources, conference presentations) for which I've had to devise markup strategies. That's now done, although some of these strategies may change if I get responses from the TEI list on some of my questions.
Picked up a copy of the APA Style Guide from the bookstore, and began looking at differences between that and Chicago. The guide is from 2001, and is supplemented by a PDF which is only available for purchase online, relating to electronic resources; we'll need that, but right now we don't have a credit card to purchase it, and I can't find it at the library (I've written to them to ask if they have it, and how to access it).
In the meantime, I started work on marking up the second IALLT document. The reference section of this one is shorter, but it has electronic references, which the previous one didn't, and I haven't made any decisions on how to mark up URLs and last-accessed dates yet. The TEI P5 Guidelines don't have any reference to bibliographic info for electronic documents either, so it's not clear how you're supposed to mark them up. I posted a query to the TEI list.
Spent a long time writing more explanatory material on the markup instructions page, based on the tags used so far in the article completed today. Much more to do, but we're making progress.
Completed the markup of this article, in the process expanding my documentation, and also further refining the schema so that the @type attribute on <name> tags is now restricted to a fixed value list (with a default value of "person").
To check that all instances of <abbr> tags have counterparts which are part of a <choice> tag (meaning they have an expansion somewhere in the text):
//abbr[not(.=//choice/abbr)]

The @target attribute on <ref> tags should always (so far, I think) point to the @xml:id attribute on a <biblStruct> tag in the bibliography. The editor doesn't check this automatically. There may be a way to force it to do so, using the schema, but I haven't figured that out yet. In the meantime, this piece of XPath evaluated as XPath 2.0 in oXygen will do the trick:
//ref/@target[not(substring(., 2)=//biblStruct/@xml:id)]
It gives back a list of any @target attributes that don't point to a <biblStruct> (a condition which suggests they're mistyped, or that a <biblStruct> is missing or has the wrong xml:id).
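One way the editor might eventually be made to enforce this is with a Schematron rule expressing the same XPath — an untested sketch, whether embedded in the ODD or kept in a standalone schema:

```xml
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="ref[@target]">
    <!-- strip the leading "#" and look for a matching biblStruct id -->
    <sch:assert test="substring(@target, 2) = //biblStruct/@xml:id">
      The target of this ref does not match the xml:id of any
      biblStruct in the bibliography.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```

oXygen can validate against Schematron as you type, which would surface mistyped targets immediately rather than at XPath-check time.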
Did further work on the ODD file to add all the values proposed for different document types in the taxonomy here. The taxonomy itself needs to be updated, though; it assumes that the document type will be specified through a <classCode> tag, as was the case with the P4 projects, but I've decided to go with the @rend on the root tag instead, because a) it's actually a categorization which is made solely for the purposes of output rendering, and b) using an attribute list allows me to force its presence and restrict its values in a way that's helpful to encoders.
As I get further into marking up the first text, I've begun to flesh out the documentation for encoders. I've added a new Markup instructions document, which will grow as we go along, and also made some basic changes to the Textual features page, which overlaps it a bit. I've also posted a detailed query to the TEI list about whether block items such as lists and linegroups should ever, or need ever, appear outside of paragraphs. My sense right now is that they sometimes have to be inside paragraphs, but I can't think of any situations in which they must be outside paragraphs, so I might be able to customize the schema heavily to restrict them in this way, and make life simpler for encoders.
I've begun adding more blocks to the ODD file used to generate the schema, to restrict the range of values available for specific attributes. It took a lot of experimentation to figure out exactly how to do this; the only really useful documentation I was able to find was this, but even then I had to hack around at the values of the @mode attribute (they have to be "change" on the <elementSpec> tag and "replace" on the <attDef> tag, for some reason). This is what I have so far (both blocks are simplifications, as a proof of concept):
<elementSpec module="core" mode="change" ident="hi">
<attList>
<attDef ident="rend" mode="replace" usage="req">
<valList type="closed">
<valItem ident="bold">
<gloss>Rendered in bold.</gloss>
</valItem>
<valItem ident="italic">
<gloss>Rendered in italics</gloss>
</valItem>
<valItem ident="strikethrough">
<gloss>Rendered with a horizontal line through the middle of the text.</gloss>
</valItem>
<valItem ident="underline">
<gloss>Rendered with a line below the text.</gloss>
</valItem>
<valItem ident="foreign">
<gloss>Rendered in a distinct manner to highlight the fact that this word is not in the main language of the text.</gloss>
</valItem>
</valList>
</attDef>
</attList>
</elementSpec>
<elementSpec mode="change" ident="TEI">
<attList>
<attDef ident="rend" mode="replace" usage="req">
<valList type="closed">
<valItem ident="article">
<gloss>This is a full journal article.</gloss>
</valItem>
<valItem ident="review">
<gloss>This is a review of a publication.</gloss>
</valItem>
<valItem ident="editorial">
<gloss>This is editorial content.</gloss>
</valItem>
</valList>
</attDef>
</attList>
</elementSpec>
I'd previously removed commas in inline citations to comply with Chicago, in the part of the text I've marked up so far; went back and replaced them, to comply with APA.
Just heard from the editors that the board have decided to go with APA style for the online journal, so work done so far towards Chicago will have to be redone. First I have to get an APA style guide, then I'll have to re-edit some of the document I've already worked on, to undo changes I made to conform to Chicago (such as deleting commas between author and date in inline citations). Differences don't appear to be major, but they'll need some thought. This will delay us a bit.
Early decisions:
- Where acronyms appear in the text, they might appear in two forms simultaneously, if the authors have provided an inline expansion: with the <expan> first, followed by the <abbr>, or vice versa. In either case, the only one that should be marked up is the <abbr>, and it should be done as a <choice> tag: <choice><abbr>SLA</abbr><expan>Second Language Acquisition</expan></choice>. Subsequent instances of the abbreviation can be marked simply with an <abbr> tag, and the XSLT can then retrieve the expansion as required from wherever it occurs in the document.
- Quotes will be rendered like this: <cit><quote>“a growing body of research indicating that mechanical drills do not facilitate the development of explicit or implicit knowledge”</quote> <ref target="#aski_2005">(Aski, 2005, p. 333)</ref></cit>. However, it's not yet clear whether the original quotes should be left in place or not; awaiting editorial decisions on this.
- Where a quotation appears without any associated reference (because the reference is elsewhere in the text, or implied), the reference is supplied in an empty ref tag: ...finding themselves in the familiar position of having to <cit><quote>"constantly reevaluate their services to meet the changing needs of their service population"</quote><ref target="#farkas_2007"></ref></cit> and continuing in the struggle to <cit><quote>"define themselves as more than a repository for books"</quote><ref target="#farkas_2007"></ref></cit>, ... This enables all quotes to be linked to a reference.
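The expansion lookup mentioned in the first point could be handled with an XSLT key — a sketch with my own key name, not project code:

```xslt
<!-- Index every <choice> by the string value of its <abbr> child -->
<xsl:key name="expansions" match="choice" use="abbr"/>

<!-- A bare <abbr> (one with no <choice> parent) pulls its expansion
     from wherever the full <choice> occurs in the document -->
<xsl:template match="abbr[not(parent::choice)]">
  <abbr title="{key('expansions', string(.))[1]/expan}">
    <xsl:apply-templates/>
  </abbr>
</xsl:template>
```

That way encoders only ever supply the expansion once, and the key lookup is cheap however often the abbreviation recurs.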
One decision made during this markup: when an item has no author or editor under which to list it, the <title> element will have a child <name> element which will be detected by the XSLT; the XSLT will then use that as the initial part of the entry.
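The detection side might look something like this in the XSLT (the mode names are mine, and this is a sketch rather than finished code):

```xslt
<!-- No author or editor: open the entry with the <name> embedded
     in the title, then render the rest of the item as usual -->
<xsl:template match="biblStruct[not(.//author) and not(.//editor)]">
  <xsl:apply-templates select=".//title/name" mode="entryHead"/>
  <xsl:apply-templates select="." mode="entryBody"/>
</xsl:template>
```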
It turns out that <textDesc> will absolutely not do for the document type, so I'm now looking at using an @rend attribute on the <fileDesc> element; that seems appropriate, given that the distinction is primarily being made for the purposes of rendering the output.
Lots more back and forth on this through TEI-L, along with discussion of how to handle forenames. I also marked up a large block of the bibliography for the first article, and sent a query over to the editors about one item.
Some useful feedback from Lou on the TEI list about where to encode the info relating to the journal issue and volume. As a result, I've now moved the journal title itself into <seriesStmt>, along with the other related info:
<seriesStmt>
<title level="j">IALLT Journal</title>
<idno type="vol">39</idno>
<idno type="issue">1</idno>
<respStmt>
<resp>editor</resp>
<name>
<forename>Heather</forename>
<surname>McCullough</surname>
</name>
<name>
<forename>Douglas W.</forename>
<surname>Canfield</surname>
</name>
</respStmt>
</seriesStmt>
I have a couple of issues still in discussion on the list, including how to encode the text type; I think the best solution would be this:
<profileDesc>
<textDesc n="article"></textDesc>
</profileDesc>
where the @n attribute is constrained to a list of fixed values by a modification in the ODD file. Still waiting for feedback from the list on this.
Finally, I wrote to the IALLT folks to check whether it would be OK to use a block of their metadata in the TEI Guidelines, once all this is agreed on.
Mapped out a header for the first of the sample documents, with only a couple of pieces of info still missing (journal volume and issue, and page numbers). Sent that to the TEI list for some comments, along with a request for suggestions about how to encode the missing info.
Then continued encoding the bibliography of the sample doc. Decided on two things:
- Article titles will have trailing periods inside their tags (because sometimes they have question marks or exclamation marks instead, and it's simpler just to include whatever punctuation should be there, rather than try to add it in later).
- Dates should have no content, just a when attribute: <date when="2006"></date>.
Got the first sample documents from the IALLT folks. First off, I went through the previous edition they sent (which I also happen to have in hard copy), and got confirmation of a number of details with regard to layout, running titles etc. (see the email correspondence for details). Next, I started doing the markup, beginning with the header and then going on to the reference list. As I come across each specific type of reference, I'm documenting it by example, and also writing the XSLT template(s) needed to get it into XHTML. This will be a long slow process, but it should give us a very flexible system in the end. So far I've done only journal articles.
Started working on the schema for teiJournal, with Roma and a sample document, which I'm also marking up, bit by bit. I hit two problems today:
- The <affiliation> element was not available as a sibling of <name>. I want to use it as part of the <biblStruct> which will appear in the <sourceDesc>, from which all other title/author info in the document is derived, and so I need to put the affiliation into the author tag, but <affiliation> is not part of model.addressLike. I confirmed that with the TEI list, and then customized my ODD file and generated a new schema which allows me to use it where I need it.
- My plan to use XInclude to reduce duplication and redundancy in the TEI file has been somewhat undermined by the fact that oXygen cannot validate a file which contains XIncludes pointing at other locations in the same document. There seems to be no workaround. I've posted a message to the oXygen forums (which were down for most of the day). If this doesn't work, I need to make sure at the very least that eXist can handle the expansion of such XIncludes, otherwise the whole plan will be scuppered, and we'll have to go back to having three copies of the title and authors in the TEI file. That's really bad from the point of view of making life bearable for folks doing markup.
Finished off the presentation by creating a couple of diagrams, and then added a block of explanatory text to each slide to act as a "script" for Karin. Used the CSS and scripting hacks we used for CASTA to get a decent printable version of the presentation direct from the XML, incorporating the script stuff. Posted it on the teiJournal site, and handed the printout over to Karin for her to take a look at. We have plenty of time for tweaking before July.