Four out of five done...
Looks like OJS will simply be taking the donated code for NLM handling, rather than writing their own, and the donated code uses NLM 2.3, so I'm now writing a converter to turn my 3.0 output into 2.3. There are about four major areas of difference, two of which I've already dealt with, and I've also gone back and elaborated one area of the original TEI-to-NLM-3.0 conversion for greater consistency.
Met with PAB to discuss the plan for her database prototype of (initially) 100 images from Old Norse mythology. This prototype is part of her dissertation and will be described in Chapter Two. Outcomes:
- Initially, metadata only will be stored, with a search engine built on that metadata.
- These are the metadata components that need to be encapsulated, in a normal <teiHeader> for each image:
  - Title (possibly)
  - Topic, as descriptive caption
  - Topic, as keyword list
  - Source (full bibliographical record of source document)
  - Artist (lookup to centralized <personList>)
  - Medium (oil on canvas, etc.)
  - Anything else from msDesc?
- Initially, PAB will mark up around 10 test documents.
- Once that's done, we'll create a Web-based form to generate the file, which will help her create the remaining files, and possibly later be available to outside contributors to suggest new images.
- Data for names of the gods etc. will be encoded twice, in English and standardized Old Norse, using @xml:lang attributes. Other data will be in English only.
- It will also be necessary to include elements and/or attributes to encode the degree of certainty with regard to any of these pieces of data, and responsibility.
- The setup will be the usual Cocoon + eXist (but, if possible, Cocoon 2.2, very sparse).
- The search page will start off by showing all 100 thumbnails, and selections from dropdowns etc. will simply reduce the thumbnail list.
- "Related images" values for any given image can be derived mechanically from the metadata.
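To make the plan above concrete, here is a sketch in Python of what generating one image's record might look like, using the standard library's ElementTree. The element names other than <teiHeader> (persName, material, term) and the function itself are my assumptions for illustration, not a decided structure; only the @xml:lang doubling for English and Old Norse names follows the plan directly.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

def make_record(title, name_en, name_non, medium, keywords):
    """Build a minimal, hypothetical metadata record for one image."""
    header = ET.Element("teiHeader")
    ET.SubElement(header, "title").text = title
    # God-names are encoded twice, English and standardized Old Norse,
    # distinguished by @xml:lang ("en" / "non").
    for lang, name in (("en", name_en), ("non", name_non)):
        el = ET.SubElement(header, "persName", {f"{{{XML_NS}}}lang": lang})
        el.text = name
    ET.SubElement(header, "material").text = medium
    for kw in keywords:
        ET.SubElement(header, "term").text = kw
    return ET.tostring(header, encoding="unicode")

print(make_record("Odin on Sleipnir", "Odin", "Óðinn",
                  "oil on canvas", ["horse", "mythology"]))
```

A Web form generating files of this shape would be straightforward, since every field is a flat string or list.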
Off a bit early. Bringing those hours down.
Basic backup (saving the output of a pipeline into a directory on the filesystem) is now working, thanks to some help setting permissions from sysadmin.
Now a proper GUI and plan needs to be implemented. This is my first shot at a plan:
- There needs to be an editorial index page which shows a list of all the documents (published and in proof) which are in the system.
- The page should have links to back up specific output formats for each document.
- Those links would invoke the pipeline which calls the flowscript, but they would do it through an AJAX call, so that the index page does not need to be replaced.
- The AJAX script would call the pipeline, and write the server response to a <div> on the page.
- The server response needs to be encoded in a TEI file of messages, which is stored in the db. This would be similar to the site.xml file which currently holds all the site rubric.
- The pipeline which sends back the message would retrieve a block of something from the XML file, and pass it for processing to the site.xsl file, but in some manner which prevents site.xsl from building a full page; we only need an XHTML div, for insertion into the index page.
- One outcome is an error message; this message should give a warning about permissions, the most likely cause of failure for the operation.
- Once all this is working for individual files/formats, the next stage is to enhance the AJAX page so that it can do the whole lot.
- This would work by having the JavaScript create a queue of URLs to be called, and when it gets a successful response from each one, it invokes the next one, also reporting its progress as it goes. There would also need to be a method for bailing on the process.
- A similar batch function should be available for each individual document, invoking all the formats.
- Finally, we need directory browsing to be available through the Cocoon sitemap, so the editor (or indeed regular readers) can see and access all the backup files.
This setup would give the option for the editor to backup the whole collection, or just one changed file, in all its output formats (including the source XML, presumably), so that a regular backup of changes could be taken, and also when a single file is edited, copies could be regenerated for only that file.
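The queue-and-bail logic for the batch step is simple enough to sketch. The control flow below is written in Python for clarity; the real thing would be client-side JavaScript firing the AJAX calls, and the function and callback names here are invented for illustration.

```python
def run_backup_queue(urls, fetch, report):
    """Call each backup-pipeline URL in order; stop at the first failure.

    `fetch(url)` returns (ok, message) for one pipeline call;
    `report(url, ok, message)` is the progress callback that would
    write into the status <div> on the index page.
    """
    completed = []
    for url in urls:
        ok, message = fetch(url)
        report(url, ok, message)
        if not ok:
            # Bail on the whole batch; per the plan, the most likely
            # cause of failure is a filesystem permissions problem.
            return completed, url
        completed.append(url)
    return completed, None
```

The same loop serves both batch cases: all formats for one document, or all documents in all formats, depending on which URL list is queued.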
It's long been a plan of mine (following my own recommendations from our Finland presentation) to build in a system whereby hard copies of all the XHTML, PDF and other output formats can be (or perhaps are automatically) saved on the server, so that should Tomcat or Cocoon go down in some catastrophic way, those files are all available to the editors. I implemented that today, following some instructions here. There was a lot of messing around initially getting the folder paths right in the parameters to the flowscript, but after tailing the Cocoon log for a while, I got that sorted.
However, it didn't work in the end, due to a permissions issue. The Tomcat process runs as apachsrv, which apparently doesn't have permission to write to those folders -- which makes perfect sense from a security point of view. We're working on that with sysadmin now.
NLM 3.0 output is now working and on the Website. Mostly did this yesterday, but blogged it in the wrong blog.
Noticed stuff like this:
“Poikileâ€
littering the data. Thought I got that in the first round of tidying.
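That pattern is the classic double-encoding artifact: UTF-8 bytes for curly quotes read back as Windows-1252. Assuming that is what happened here, the damage is mechanically reversible, sketched in Python (note the trailing quote in the logged example has already lost a byte in display, so it may not round-trip):

```python
# "â€œ" is what the three UTF-8 bytes for a left curly quote
# (E2 80 9C) look like when decoded as Windows-1252.
garbled = "â€œPoikile"

# Reverse the damage: re-encode as cp1252 to recover the raw bytes,
# then decode them properly as UTF-8.
fixed = garbled.encode("cp1252").decode("utf-8")
print(fixed)  # “Poikile
```

Running something like this over the affected fields (or just the known garbled sequences) would be a second-round tidying pass.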
Notes regarding new schema:
Date ranges - found entries using "1.5 mil BP" that refer to Paleolithic objects
General uncertainty is registered (using ?) in approximately 540 records
There are 1837 locations. Some of them look like this <1400 BC.>, this <-->, or this <ms. illus.>. More to the point, the location field frequently includes strings that are not city names (like country names and site names). The distribution is heavily skewed: 21 locations account for 2700 objects, and only 169 locations have 10 or more objects.
Sites - only about 3500 objects have the site noted. There are 284 sites in the db.
Keywords - wanted to use TAPoR tools to do a word count etc. but it gacked 4 tries out of 4, so I did it on the CLI. Out of the 12000+ individual words in the keywords field, only 780-some-odd occurred 10 times or more; 110-ish occurred 50 times or more, and about 30 occurred 100 times or more. I did not remove many words (single-letter words, abbreviations like BC, and numbers), so the true counts are likely lower. The real problem is that most aren't actually keywords (Rome, Greece etc.), are effectively duplicates (erotic and eroticism), or are repeated in other fields in the database. I attached 2 files to this post: all keywords, and the top keywords. The numbers indicate frequency.
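For the record, the counting step is only a few lines of Python. This is a sketch, not the exact CLI run; it also strips the single-letter words and bare numbers that the original pass kept, and the threshold parameter is mine.

```python
from collections import Counter
import re

def keyword_frequencies(lines, min_count=10):
    """Count word frequencies across keyword-field values,
    skipping single letters and numbers."""
    counts = Counter()
    for line in lines:
        # Keep only alphabetic tokens of two letters or more,
        # case-folded so "Rome" and "rome" count together.
        counts.update(re.findall(r"[a-zA-Z]{2,}", line.lower()))
    return {word: n for word, n in counts.items() if n >= min_count}
```

Fed the exported keywords field line by line, this reproduces the frequency lists attached to the post (modulo the extra filtering noted above).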
Notes - I haven't done as much work there as I have on keywords, but I expect it to look similarly bloated. As a matter of interest, the keywords and notes fields are character for character duplicates in 2878 records. I did that with this handy little query:
SELECT keywords, notes
FROM `aggregated_view`
WHERE keywords = notes
  AND keywords != ''
  AND notes != ''
View - the data look similar in nature to other fields, most often keywords. We need clarification on how important this field is and how to distinguish it from the other fields.
The title says it all...
PC reported that the About box was showing two versions on Win98 and WinME, the old 1.7.2.6 version and the new 1.8.0.0. On investigation, I found the application options "Product version" setting was set to the old version; however, I don't use this information when creating the About box, so I suspect that Win9x is incorrectly reading the product version info from the executable. Since the problem doesn't show up on Win2000+, and we don't officially support Win9x, I've just made a silent change of the Product Version to "1.8" (which is what it should be, rather than the full dotted version, which is stored in a different place). Waiting to see if that makes a difference. What seems to be happening is this:
In my app, I output a string between the copyright info and the link, based on the "Comments" field in the executable version info. In IMT, there is no comments field, so there's nothing to output; on Win2000 and above, nothing appears there at all (in fact there isn't even an empty line). However, on Win9x, it seems that if there's no Comments field, Windows just reads the Product Version field instead. I think that's where the information is coming from.