I used Tika to create XHTML versions of the OCRed volumes, and integrated the results into the existing search; it now seems to work pretty well, so I pushed it to the live Jetty.
The headers for the translation itself were appearing at the bottom of a page in the PDF. I set up a special case of div0[@type='NewPage'], which the XSLT now handles to force a page break in this case. Republished the XAR.
I've extracted the editorial content from the master volumes, and I'm now building a separate index page for each of the volumes we've created, including the editorial stuff. On the other listings pages, volume numbers are now links to the volume pages. This answers (I think) the last of HT's requirements for the new site, other than a redesign.
I've also tested Apache Tika with the old volume PDFs, and the results are very promising; I think we may be able to process them to ugly XHTML, which eXist could then index, and provide people with search capabilities and linking out to the specific page in which the hit is found.
The new translation is done, thanks to numerous hacky shortcuts for converting ODT to TEI, including search-and-replace on the content.xml file. The results highlighted a couple of minor layout annoyances, so I've also fixed those. The document comes in at about 100 pages. It's now posted for proofing.
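For anyone repeating the ODT shortcut, here is a minimal Python sketch of the general idea, not the actual process I used: pull content.xml out of the ODT (which is just a zip archive) and turn styled spans into TEI-like elements with a regular expression. The style names and the TEI elements in the mapping are hypothetical, and the regex only copes with spans that are not nested.

```python
import io
import re
import zipfile

# Hypothetical mapping from word-processor character styles to TEI elements.
STYLE_TO_TEI = {"Title_Cited": "title", "Foreign": "foreign"}

SPAN_RE = re.compile(
    r'<text:span text:style-name="([^"]+)">(.*?)</text:span>', re.S
)


def read_odt_content(odt_bytes: bytes) -> str:
    """Return the content.xml payload of an ODT file as text."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as odt:
        return odt.read("content.xml").decode("utf-8")


def spans_to_tei(xml: str, mapping=STYLE_TO_TEI) -> str:
    """Rewrite non-nested styled spans as TEI-like elements.

    Spans whose style is not in the mapping are left untouched.
    """
    def repl(match):
        tag = mapping.get(match.group(1))
        if tag is None:
            return match.group(0)
        return f"<{tag}>{match.group(2)}</{tag}>"

    return SPAN_RE.sub(repl, xml)
```

The output still needs hand-checking; a regex pass like this is a shortcut, not a real ODT-to-TEI conversion.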
Received a new translation and started working on it. I've used the LibreOffice macro search tool to good effect, enabling me to add some tagging in a semi-automated way to the word-processor doc using styles, and I'm now transferring that content into the XML document. The metadata is done, the bibliography is done, and I'm working through the editorial intro.
A simple redirect XQuery module that handles both the PDF and the HTML versions is now implemented.
Worked a lot on the search today, constraining it to published documents only, and tweaking how it returns results. I've also added VNU validation of the HTML to the build process, and fixed some problems arising out of that; I've turned popup notes into <aside> elements so that their inline text can be ignored by the indexer, while the footnote rendering at the bottom of the document will be indexed; and I've refined the indexing after using the monex profiler. I also tweaked the P5 output so that it validates with the correct schema links. I think we're more or less there now; what we have is already much better than what's on the site, and I see no outstanding problems at all.
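To illustrate the effect of the <aside> change, here is a small standalone Python sketch (not the eXist index configuration itself) that collects the text an indexer would see when <aside> content is skipped. The function name and the sample markup in the usage note are made up for the example, and it assumes well-formed XHTML in the standard XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ASIDE = f"{{{XHTML_NS}}}aside"


def indexable_text(xhtml: str) -> str:
    """Return the document text with the contents of <aside> elements left out."""
    root = ET.fromstring(xhtml)
    parts = []

    def walk(el):
        if el.tag == ASIDE:
            return  # skip the popup note; its tail is added by the caller
        if el.text:
            parts.append(el.text)
        for child in el:
            walk(child)
            if child.tail:
                parts.append(child.tail)

    walk(root)
    # Collapse runs of whitespace so the result is easy to compare.
    return " ".join("".join(parts).split())
```

Given a page where a note appears once as an inline <aside> and once in the footnote block at the bottom, the note text comes back only once, from the footnote.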
I'm now happy with the way everything is working. There are tweaks I could make -- I need to put in place redirects for the old URLs, and I should revisit the collection.xconf and indexing, and there will be more pages that need to be created for the new site -- but it's all basically there, and what we have now could replace the current webapp immediately. Will start that process next week.
I've added an eXist app build to the process and I'm now testing and bugfixing locally to bring this project thoroughly into the Endings fold. Everything is basically working, but I'm finding I now want to enhance some aspects of the rendering and display so that it's a cleaner and simpler setup than the previous app; I'm adding citation stuff in a footer, as well as getting the search working, based on Mariage. Should all be done soon.
The new static build process is now complete, including all the DC header stuff, the teiHeader display widget, and the search page (which of course won't do anything until it's in the context of a webapp). With HT's approval, I have now switched all the old URLs over to the new on the existing site, and added 301 redirects through the sitemap.xmap. See the relevant stanzas there to see how it's done; I doubt we'll ever need to do this sort of thing again with such an old Cocoon, but it took me a while to figure out. I still need to add validation to the build process.
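The real rules live in sitemap.xmap, but the logic behind them is just a pattern table, which can be sketched in Python like this. The URL patterns below are hypothetical placeholders, not the site's actual old or new paths.

```python
import re
from typing import Optional, Tuple

# Hypothetical old-to-new URL patterns, for illustration only.
REDIRECT_RULES = [
    (re.compile(r"^/old/article/(?P<id>[\w-]+)\.html$"), "/articles/{id}.html"),
    (re.compile(r"^/old/pdf/(?P<id>[\w-]+)\.pdf$"), "/pdfs/{id}.pdf"),
]


def redirect_for(path: str) -> Optional[Tuple[int, str]]:
    """Return (301, new_location) for a known old path, or None if no rule matches."""
    for pattern, template in REDIRECT_RULES:
        match = pattern.match(path)
        if match:
            return 301, template.format(**match.groupdict())
    return None
```

A 301 (permanent) status is the point here: it tells search engines and bookmarking tools to replace the old address with the new one.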
This is the blog for volumes 15 to 19 of the journal Scandinavian-Canadian Studies / Études scandinaves au Canada. Our aim is to provide Web-based access to the contents of the print journal in a range of different formats, including PDF, HTML, XML (TEI P5), and plain text (UTF-8).