I now have (I hope) full crawls of the ISE site, and I will carry on to get full crawls of the others. I'm now at the stage where I'm trying to fix problems, such as the prominent message touting a user survey, which needs to be removed from every page, and the fact that the nbsp entity is referenced everywhere (it should be a numerical entity). A few more hours of this sort of work will be needed, given the scale of the project (19GB+ for ISE itself).
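For the batch cleanup, something along these lines should do the trick (a rough sketch, not the final script: the mirror directory name is a placeholder, and the survey banner's markup is my guess, so the pattern will need adjusting to match the real div):
# Sketch only: walk the local mirror, strip the (assumed) survey banner div,
# and swap the named nbsp entity for its numerical equivalent.
import re
from pathlib import Path

MIRROR = Path("ise_mirror")                              # placeholder directory name
SURVEY_PATTERN = re.compile(                             # assumed banner markup
    r'<div[^>]*class="[^"]*survey[^"]*"[^>]*>.*?</div>',
    re.DOTALL | re.IGNORECASE)

for f in MIRROR.rglob("*.html"):
    text = f.read_text(encoding="utf-8", errors="replace")
    cleaned = SURVEY_PATTERN.sub("", text).replace("&nbsp;", "&#160;")
    if cleaned != text:
        f.write_text(cleaned, encoding="utf-8")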
Doing a re-crawl with a deeper depth setting...
Modified the code in the db_to_servitengasse_xml, db_to_person_html and db_to_building_inc files so that, for those few instances where a person has two addresses on Servitengasse, we don't get two entries (e.g. in the list of people associated with a building, or in the servitengasse.xml file, which is used to generate the servitengasse.json file).
Also made some tweaks to the punctuation etc. in the output and ensured that the occupancy_notes field is displayed under all output circumstances in voluntary and collected instances of occupancy.
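The de-duplication guard amounts to something like this (a hypothetical sketch: the real field names in the database output differ, but the idea is one entry per person regardless of how many Servitengasse addresses they have):
# Hypothetical sketch of the guard added to the db_to_* scripts: emit only one
# entry per person per output context. The person_id key is an assumption.
def unique_people(occupancies):
    """Yield one occupancy record per person, keeping the first seen."""
    seen = set()
    for occ in occupancies:
        pid = occ["person_id"]          # assumed key name
        if pid in seen:
            continue                    # skip the second Servitengasse address
        seen.add(pid)
        yield occ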
Meeting with MN and SA today:
- Wheelock site: add page for ANKI deck download.
- Athenaze site: add page for ANKI deck download.
- Wheelock site: redesign as the testbed for HotPot redesign/HTML5. (Deadline one month.)
- Wheelock site: re-encode tabular exercise layouts to make them mobile-friendly. (No deadline; examples to be done by HCMC; exhaustive implementation to happen in the fall with next workstudy student.)
- Both sites: Add page for credits, and find out all the historical contributors and add them.
- Both sites: get logs added to our current set of log dumps.
From MT, got a list of all the additional files that will need to be downloaded. I've built a test script for QME, as the smallest of the sites, and I'm now testing it.
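The test script is essentially a loop over MT's list (sketched here with placeholder filenames and none of the throttling or retry logic the real one will need):
# Sketch of the downloader, assuming the list is a plain-text file of one URL
# per line; filenames and the output directory are placeholders.
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

OUT = Path("qme_extra")                                  # placeholder output directory

for url in Path("qme_file_list.txt").read_text().split():
    target = OUT / urlparse(url).path.lstrip("/")        # preserve the site's path layout
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        urllib.request.urlretrieve(url, target)
    except Exception as e:
        print(f"FAILED {url}: {e}")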
Following the download, I'll still have to build a local server setup so I can test using the actual URLs, because they're hard-coded throughout, and can't be changed because the same fragment may be pulled into any page at any point in the tree. Grrr.
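The likeliest setup (an assumption at this point, not a decision) is to point the domain at 127.0.0.1 in /etc/hosts and serve the mirror locally, so the hard-coded absolute URLs resolve against the local copy, e.g.:
# Quick local-serving sketch: with internetshakespeare.uvic.ca mapped to
# 127.0.0.1 in /etc/hosts, serve the mirror on port 80. Directory name is a
# placeholder; binding to port 80 needs root.
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

handler = partial(SimpleHTTPRequestHandler, directory="ise_mirror")
HTTPServer(("127.0.0.1", 80), handler).serve_forever()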
I've now also pulled the content from the DRE and QME sites. All three sets of data have the same range of issues, the most serious of which is a preponderance of hard-coded links to the domain (http://internetshakespeare.uvic.ca for example) rather than relative links. The WGET tool I used to pull the content should have been able to make those links relative, but in most cases it didn't, possibly because the content is not valid HTML so couldn't easily be parsed.
So I'm going to have to fix all those links using a script, which I'm writing in Python now. It's doubly tricky because of the mobile vs. desktop issue, which produces links that look like this:
<a href="../../../frommobile.html%3Fto=http:%252F%252Fdigitalrenaissance.uvic.ca%252FFoyer%252Fcopyright.html">
So it will take a while before I can get this all normalized and working. Meanwhile, I don't yet have the list of AJAX files that need to be pulled from the server, so I won't be able to get those until MT sends it to me.
After several false starts, due to work still being done on the site despite yesterday being the deadline, eventually got a standard WGET crawl of the site. Got over 67,000 files, but a) lots of them still link to the live site with a full URL, despite telling WGET to convert those links, and b) MT tells me that there are AJAX files that I won't have because there's no sitemap linking them. Will have to try to address this on Monday.
Following the issue of the missing NFLD newspaper documents, we determined that the problem was bad linking in the documents themselves, which DH fixed; I then had to rebuild the schema to get things to validate. I also got tired of the tedious process whereby XML documents are validated one by one, something that was necessary because jing suffered from a stack overflow due (I think) to the complexity of the folder structure within which the XML files were found. I discovered that if I just copied the XML files to a temp directory in a flat layout, jing would validate them just fine, and I could also give it a bit more memory anyway by setting my ANT_OPTS like this in .bashrc:
export ANT_OPTS="-Xmx8G -Xss8m"
I think the -Xss solves the stack overflow. This means that I can now validate the XML in a few seconds. I then started looking at the Schematron, which has never been run as part of the build. Borrowing from other more recent project builds, I'm now generating a static .sch file as part of the ODD build process, and then compiling that immediately to create an XSLT file, following the model of DVPP. The validation process then runs that against the document collection, stores the results in a temp directory, and a second process parses those results to generate errors and fail the build if necessary. Found and fixed several errors in this process.
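The result-parsing step itself is done with Ant and XSLT, but its logic boils down to something like this sketch (assuming the compiled Schematron writes standard SVRL reports into the temp directory; the directory name is a placeholder):
# Not the actual Ant/XSLT step, just its logic: scan the SVRL reports for
# failed assertions and fail the build if any are found.
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

SVRL = "{http://purl.oclc.org/dsdl/svrl}"
failures = []

for report in Path("schematron_tmp").glob("*.xml"):      # assumed results location
    for fa in ET.parse(report).getroot().iter(f"{SVRL}failed-assert"):
        msg = (fa.findtext(f"{SVRL}text") or "").strip()
        failures.append(f'{report.name}: {fa.get("location")}: {msg}')

if failures:
    print("\n".join(failures))
    sys.exit(1)                                           # fail the build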
So now the NFLD documents are included in the site, and our build process is WAY faster.
Usual weekly meeting with DH, at which we hacked at the problem of the missing Newfoundlander documents, and determined that they're on the site, but just lack entries in the contents documents, because their metadata doesn't match the taxonomies. DH will fix.
Met with AW (Ling) to discuss possible text-encoding project for First Nations texts. This will probably solidify into a project proposal for initial work leading to a SSHRC application next year.