Fixed the bugs JT reported when an empty search was performed; diagnosed the Heritrix/Wayback Machine bug, reported it to them, and tried various workarounds in the JavaScript (none successful yet); implemented the 404 page; and started work on the eXist library that handles legacy URLs. It already works for xbrowse.xq URLs, and extending it to xgallery.xq and xscan.xq should be relatively straightforward.
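The legacy-URL library boils down to reading the old query-string parameters and redirecting to the equivalent page on the new site. A minimal sketch of the xbrowse.xq case, using eXist's request and response modules; the "id" parameter name and the target path scheme are illustrative placeholders, not the real ones:

```xquery
xquery version "3.0";
(: Sketch of legacy-URL handling for old xbrowse.xq links. The "id"
   parameter and the new path scheme are hypothetical placeholders. :)
import module namespace request="http://exist-db.org/xquery/request";
import module namespace response="http://exist-db.org/xquery/response";

let $id := request:get-parameter("id", ())
return
  if (exists($id)) then
    (: send the visitor to the equivalent page on the new site :)
    response:redirect-to(xs:anyURI("/graves/letters/" || $id || ".html"))
  else
    (: no usable parameters: fall through to the 404 page :)
    response:set-status-code(404)
```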
I've completely reworked the creation of the gallery pages and their functionality to eliminate the need for any JavaScript, and in the process found and fixed a few bugs. I've also tweaked the search display so that KWIC results don't accidentally include calendar content, added some helpful captions and explanations in a couple of places, and fixed a lot of markup inconsistencies involving literal square brackets in supplied elements and stray whitespace. I think all bugs are now fixed.
I've also wrestled quite a bit with the eXist .sh scripts, which, courtesy of the installer, are littered with hard-coded paths; I can now discover the installation path relative to the script's own location, and this works locally. I found I was unable to stop a running Jetty instance on Peach, so RE is looking into this; meanwhile, once I'm able to stop and restart the Graves app, I may port these relative-path fixes to the server instance to make sure they work there.
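The fix is the usual shell trick of deriving the install directory from the script's own location. A sketch of the idea, not the exact edit to eXist's scripts, assuming the standard layout where the scripts sit in $EXIST_HOME/bin:

```sh
#!/bin/sh
# Derive EXIST_HOME from wherever this script actually lives, instead
# of the absolute path the installer wrote in.
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
EXIST_HOME="$(dirname "$SCRIPT_DIR")"
export EXIST_HOME
exec java -jar "$EXIST_HOME/start.jar" jetty "$@"
```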
CC's, CD's and mine are now done.
There were a bunch of hard-coded references to image locations that needed to be filtered out, and a couple of other bugs that were preventing the display of datelines and links to transcriptions in scan pages. Then I started looking at a better approach to the gallery scans, which currently use an ugly iframe hack to display documents. I have a plan, and I've disabled the old approach, but I haven't yet implemented the new one.
After some wrestling with scoping issues, I've finally got a discrete module working that performs query result caching in a fairly elegant manner. The payoff is quite significant: subsequent calls to the same search, for (e.g.) the next page of results, drop from around 300 milliseconds to 1 millisecond. The pruning strategy, which uses two maps (one keyed by the hash of the query, the other keyed by timestamp), works a treat. This will be used in all of our eXist projects going forward.
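For the record, the core of the approach looks something like this. It's a simplified sketch, assuming XQuery 3.1 maps and eXist's session and util modules (session attributes can hold arbitrary items, including maps); the function and attribute names are illustrative, not the module's actual API:

```xquery
xquery version "3.1";
(: Sketch of the two-map cache: one map from query-hash to cached
   results, one from timestamp to query-hash, so that pruning only has
   to scan the timestamp keys. Names here are illustrative. :)
module namespace qc="http://hcmc.uvic.ca/ns/query-cache";

import module namespace session="http://exist-db.org/xquery/session";
import module namespace util="http://exist-db.org/xquery/util";

declare variable $qc:max-age := xs:dayTimeDuration("PT10M");

(: fetch a map stored in the session, or start with an empty one :)
declare %private function qc:session-map($name as xs:string) as map(*) {
  let $m := session:get-attribute($name)
  return if ($m instance of map(*)) then $m else map {}
};

declare function qc:get-or-run($query as xs:string,
                               $run as function(xs:string) as item()*) as item()* {
  let $key     := util:hash($query, "md5")
  let $results := qc:session-map("qc.results")
  return
    if (map:contains($results, $key)) then
      map:get($results, $key)                     (: hit: ~1ms :)
    else
      let $hits   := $run($query)                 (: miss: run the search :)
      let $pruned := qc:prune($results, qc:session-map("qc.times"))
      return (
        session:set-attribute("qc.results",
            map:put($pruned?results, $key, $hits)),
        session:set-attribute("qc.times",
            map:put($pruned?times, string(current-dateTime()), $key)),
        $hits
      )
};

(: drop entries older than $qc:max-age from both maps; the
   timestamp-keyed map tells us which hashes are stale. A real
   implementation would also guard against timestamp collisions. :)
declare %private function qc:prune($results as map(*), $times as map(*)) as map(*) {
  let $cutoff := current-dateTime() - $qc:max-age
  let $stale  := map:keys($times)[xs:dateTime(.) lt $cutoff]
  return map {
    "results": fold-left($stale, $results,
                   function($m, $t) { map:remove($m, map:get($times, $t)) }),
    "times":   fold-left($stale, $times,
                   function($m, $t) { map:remove($m, $t) })
  }
};
```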
I have a couple of small bugs to fix now: one where a result set of exactly eleven items shows only ten of them, and one where the large image on a gallery scan page doesn't show up.
I'm attempting to implement session-based query caching for the Graves project, since it's the simplest project and the first in line, so that I can use the approach in the other projects once it's perfected. There don't seem to be any clear working examples of this in the wild; the Shakespeare example caches only a single result set, and as a consequence has bugs if you do multiple searches in different tabs. I have an approach mapped out and am beginning to write code for it, but I'm a little surprised that there are no examples of anyone else doing it. I'm pretty sure it'll be really quite important for speed and responsiveness in the larger projects.
Fixed a couple of bugs last night; determined that the failure on the Myths eXist/Jetty instance is because it's running eXist 2.2; and, after poking around on Peach to find out how the current Myths setup works, wrote to RE to start figuring out a procedure for deploying new eXist instances.
Noticed that the gallery is not functioning as it should; it looks like perhaps JT didn't finish the work on that after all, so I'll have to go back and look at it. Meanwhile, I'm also looking at approaches to caching search result sets in eXist; I posted a naive question to the eXist list after trying to figure out how (or whether) the Shakespeare example project is doing it. I have an approach in mind, based on hashing the query and storing the result set in a map keyed by that hash; when the same session submits another query, you hash it and check the map for an existing entry (sketched below). But if there's existing, well-tried sample code, I'd rather start with that.
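The gist of the idea, assuming eXist's util:hash and session attributes; the attribute name and local:run-search are hypothetical placeholders:

```xquery
(: look up the cached result set for this query, or run and cache it :)
let $key   := util:hash($query, "md5")
let $cache := (session:get-attribute("search-cache"), map {})[1]
return
  if (map:contains($cache, $key)) then
    map:get($cache, $key)
  else
    let $hits := local:run-search($query)   (: hypothetical search function :)
    return (
      session:set-attribute("search-cache", map:put($cache, $key, $hits)),
      $hits
    )
```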
Much progress today:
- Wrote some XSLT to populate the metadata page. It's still very crude; it should eventually be replaced by hand-written content from EGW.
- Fixed the remaining four reference errors from the diagnostics.
- Rewrote the diagnostics so they link to the current product rather than to the old site.
- Parameterized the build process so that it can either link to the images remotely on graves.uvic.ca (as up to now) or use local copies of the images, which are then built into the XAR file; a command-line switch controls which.
- Put a built copy of the site on the web for crawling; CD will have the crawler fetch both the original site and the new one, and we'll see what the differences are.
- Tested the new search on the Myths install of eXist, and THE SEARCH FAILS. Trying to find out what the differences might be between that instance and my local one. In the long run this is not a problem -- we'll deploy with a known-good eXist containing no other projects -- but this is puzzling.
I've got the new search working, and it's a lot cleaner and more straightforward. I've also added keyword highlighting when you go from the search results to a document containing a hit, a feature the original site didn't have.
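In eXist this falls out fairly naturally: if the document page re-runs the query against just the target document, the hits come back with match information attached, and util:expand() materializes that as exist:match elements which CSS or XSLT can style. Roughly, assuming a Lucene index on the element being queried, and with illustrative parameter names and data path:

```xquery
xquery version "3.0";
(: Sketch: highlight search hits on arrival from the results page.
   The "mark" and "id" parameters and the data path are illustrative. :)
declare namespace tei="http://www.tei-c.org/ns/1.0";

let $q    := request:get-parameter("mark", ())
let $doc  := doc("/db/apps/graves/data/" || request:get-parameter("id", ()) || ".xml")
let $hits := if (exists($q)) then $doc//tei:text[ft:query(., $q)] else ()
return
  if (exists($hits)) then
    (: util:expand() returns an in-memory copy with each match wrapped
       in an exist:match element, ready for styling :)
    util:expand($hits)
  else
    $doc//tei:text
```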
As far as I can see, the Graves project is pretty much ready for Endings now. We have:
- A complete static build, created fresh after every change here on Jenkins. (No functional search in the static build, of course, but the browse options work fine.)
- An eXist XAR app which can be deployed in a standalone eXist instance, also built on Jenkins.
- A diagnostics report which shows only four minor issues (all instances of the same thing) remaining to be fixed in the data.
We're now ready to do some testing with the Heritrix crawler. I know the current version of Graves is useless for the crawler, because everything is a db query rather than a URL; the new one is all URL-based, so I'd like to set up a test crawl asap and compare the results with what's in archive.org at the moment. I'll enlist CD's help with that.
The other thing I'd like to start working on is creating a site configuration for Solr. I think we have a decision to make at this point, though: do we go ahead and deploy the new site on a fresh eXist instance and then point the graves.uvic.ca domain name at it, or should we be a little cautious and wait till the Heritrix testing is done before trying that?
One more thing to think about:
The page-images all currently live in a folder on our Tomcat server, served by the old web application, and the static build simply points at them. Obviously we should eventually pull them into SVN and build them into both the eXist webapp and the static build. That will add about 160MB to the size of the SVN repo (which is no problem) and twice that much to the products on Jenkins (which is a problem, because the Jenkins server has very little disk space free at this point). What I propose to do is to parameterize the build so that it can be run with a switch that says either "point to remote images" (the default for now) or "assume local images" (which I won't use till we're ready). Then I need to get the sysadmin to give me more disk space on Jenkins, and we can go ahead and switch to a local build.
I've implemented the search, a process which required some changes to the way metadata (particularly dates) is stored in the HTML header; I've made a start on the results-paging functionality, with a lot more to do, and I've fixed some oddities and bugs in the rendering. I also want to give some more attention to turning a submitted query string into a proper Lucene query; that work will carry over to all the projects.
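The likely shape of that query-mangling is to tokenize the user's input and build eXist's XML form of a Lucene query, rather than handing the raw string to the Lucene parser. A first sketch; the rules here (AND all terms together, pass wildcards through) are placeholders, not the final design:

```xquery
xquery version "3.0";
(: Sketch: turn a raw search string into eXist's XML query syntax for
   ft:query(), so user input never reaches the Lucene parser unescaped. :)
declare function local:to-lucene($input as xs:string) as element(query) {
  <query>
    <bool>{
      for $token in tokenize(normalize-space(lower-case($input)), '\s+')
      return
        if (matches($token, '[*?]')) then
          <wildcard occur="must">{$token}</wildcard>
        else
          <term occur="must">{$token}</term>
    }</bool>
  </query>
};

(: usage, e.g.: collection($data)//tei:p[ft:query(., local:to-lucene($q))] :)
```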