I've been working on generating JSON files to support the search, particularly the search filters, as we do in other projects like Keats. The work is going well, but what I'm not yet clear about is where to draw the line between convenience and file size: for example, I could include details of the periodical title and folder path with every poem retrieved under any category, or I could include only an id and keep a small separate JSON file of periodical information to look that info up when needed. Still thinking about all of this.
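To make the tradeoff concrete, here's a minimal sketch of the two shapes (all field names and values are invented for illustration). Denormalized, with everything inline per poem:

{"id": "po_4012", "title": "The Emigrant", "periodical": "Chambers's Edinburgh Journal", "folder": "chambers/1840"}

versus normalized, with just an id and a small lookup file:

{"id": "po_4012", "title": "The Emigrant", "periodical": "ch"}

periodicals.json:
{"ch": {"title": "Chambers's Edinburgh Journal", "folder": "chambers"}}

The first costs bytes in every poem record; the second costs an extra fetch and a lookup in the JS.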
Today's meeting was productive, and following it I implemented a process for reporting on the main rhyme-scheme of a poem and variant stanzas that don't follow it. This threw up a couple of issues which I fixed with Schematron and cleanup. I've also re-worked the way the page-image pops up when you mouse over it, and parameterized the URLs used in building poems so that on Jenkins we should (if all goes well) get proper relative links to other site pages, whereas on the local build for encoders, you get fixed links to the Jenkins build. 360 minutes.
In our usual weekly meeting we talked about a few things, including the common font-style: italics typo, for which I added a Schematron rule; the asterisk line encoding pattern, which was hastily conceived by me and basically silly, so I've replaced it with a saner and more extensible approach, with changed rendering; and various issues with rhyme. I also fixed a bunch of first-line issues in the db and re-worked the diagnostic which discovers those problems so that it triggers far fewer false positives; and I rebuilt the TEI for Once a Week, which has lots of new content. 210 minutes.
Macs can't display the black triangle Unicode characters, so I switched to plus and minus signs for the TOC.
I've now finished and documented the rhyme-finding tool, and the team are testing it. Meanwhile, there's a need to be able to nimbly merge some components of the metadata db into a small subset of the TEI files -- specifically, a single year for a single periodical -- so that indexing fixes and updates can be propagated on a folder-by-folder basis and encoding can proceed without running the whole massive operation. I've therefore modularized that process so that it can be called with parameters for periodical folder and year, and I've tested the result successfully with Chambers 1840, which is next on the encoding list. I did the same thing to the OCR process, which usually needs to be run after the db merge anyway. This will make life easier going forward. 240 minutes.
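Hypothetically, the invocations now look something like this (the actual target and property names may differ):

ant mergeDb -DperiodicalFolder=chambers -Dyear=1840
ant ocr -DperiodicalFolder=chambers -Dyear=1840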
Had to meld together four long poems into one, as a result of an indexing error that had to be corrected. In the process, I worked out a way to use the What Rhymes With functionality to find candidate existing tagged rhymes inside the poem you're currently working on, which should help speed up the rhyme labelling for longer poems. I'll show the team tomorrow. 240 minutes.
Tagged some echo figures in 1840 Chartist, and in the process refined the detection algorithm a bit to cope with full-stanza echoes.
Met with K & K and talked through our process regarding imperfect rhymes; wrote up the result in our documentation. Got PS to help with debugging a particular rendering issue with inverted line wraps, then documented the method of encoding those in the schema. Fixed a couple of bugs reported by KAF, and started thinking through a process that might help encoders detect when rhymes in a long poem are echoes of rhymes earlier on.
2nd half of Feb timesheets done. No TS for VL because she didn't clock any hours.
KAF reported that her December timesheets had not been processed; went through the whole history for her and KSHF with Payroll, and she was right, so we re-did and re-submitted those timesheets. Then worked on some CSS to show how to handle a line wrapped above its end. After that, reworked the CSS that shows expanding page-images in the rendered view to work around a problem with Chrome, and made the poem div scroll so that you can see the rhymes and the poem easily. Tagged a poem myself in the middle of all this, and began tracking down and fixing instances of rhymeHalf where both the first and the second instances are tagged that way; the first should be tagged normally. There are nearly 400 of those, but fixing them is not too hard, and it's also an opportunity to see how people have been tagging.
After tagging one more 1840 Chartist poem (I'm moving slowly but steadily through them), I went back to our documentation and revised the section on rhyme tagging to reflect changes we've made in our practices due to growing experience.
Met with the RAs doing indexing and discussed yesterday's file-naming problem; they raised a couple more questions about how to handle illustrations which are inserted rather than part of the regular page run, and we also made a long-term plan for a mechanical operation to incorporate the series numbers into filenames, to be done only after the indexing work on the two current periodicals has finished. Wrote to AC to get feedback on these questions.
After doing some tweaks to the poem rendering and encoding a poem, I looked at the diagnostics to discover there were 609 broken links from the db to images that were no longer where the db said they were. It turned out the RAs had been re-organizing the Cornhill and Once a Week images into series folders, but not updating the db. After some careful analysis I was able to do a series of search-and-replace operations that fixed the problems, but it's a bit hairy doing that on the live db. We're having a meeting tomorrow to discuss filenaming.
Met with the encoders and talked about some issues, including broken lines (now documented properly and handled in XSLT) and original footnotes (handled OK, but not properly documented yet). Fixed some reported bugs in XSLT and JS handling of milestones. Encoded a couple of poems myself.
Spent some time with SK learning about the indexing process, and took a couple of pictures with a view to documenting it in a more user-friendly way through a page on the site. Then spent some time adding the first couple of sections to a new part of the ODD/HTML documentation aimed at programmers, so that anyone in HCMC could step in and do the tasks needed to keep the process moving along if I'm not available. I need to do a lot more of this.
Tweaked the diagnostics to allow case-insensitivity in hashtags, then fixed a couple of issues showing in the diagnostics.
Changed some project ids which didn't adequately distinguish members of our team. Met with everyone and answered a few queries; some CSS issues on KAF's drama are down to me to solve. Also added encoded line-count to the stats, during which I did two things manually: checked out previous revisions of the repo to get old stats for the file, and pruned the stats file a bit. It seems to me that both of these things could be automated, so I'll look at doing that in case we want to add new stats to our tracking.
Did some work to make the editor's HTML poem view more user-friendly, including some JS to tweak the margins of horizontal lines nested within left-margined ancestors to make them span the poem div correctly. I'm again reconsidering the use of grid layout for the metadata panels; I think a flex layout where I could actually fix the position of the metadata while the poem scrolls would be good (although of course sometimes the metadata scrolls too, which complicates things).
Started thinking about how to code for useful analysis of encoded poems, and wrote one pilot function, along with XSpec testing for it. Added some new drama components to the schema to support work KF2 is doing, added some enhancements/fixes to the poem rendering, encoded one new poem, re-encoded an old one.
Did another poem from 1840 for testing purposes; added some folks' photos and bios to the personography and then put them into the About page; tweaked the build process a bit; gathered and signed all Jan 1-15 timesheets except for KF2's and sent them in.
Had some very useful discussion on the use of echo figures and the team's possible presentation on the topic. Fixed a couple of XSLT issues and enhanced some documentation a bit.
- Put planning document in svn and made it available on Jenkins site as a PDF.
- Updated footer in db.
- Added library and HCMC to footer of new site.
- Removed OLD Translator field from db. Changed diagnostics to point to new field.
- Combined top two diagnostic categories.
- Made reader view of relational db public.
- Added link to reader view to menu of new site.
Updated the faulty rhyme-label fixing code to handle the lg/@rhyme attribute as well, which of course it has to do; updated the XSpec to test that; added handling for the drama elements that KF is now using; and tagged a poem myself.
KF asked me today if we could have a fix for rhyme-labelling issues, in particular the case where you get through an entire labelling sequence and then discover that where you assigned label "m" you failed to notice it was the same rhyme as "c", so you need to change it to "c" and re-label everything after "m" to move it down one slot. This is a slightly trickier problem than you might think, so rather than squash it into a QuickFix, which wouldn't allow for graceful termination with useful messages when you ask it to do something that doesn't make sense, I've created it as an XSLT transform with a framing transformation scenario; that's not the default for XML documents, but you can run it on them using "Transform with". I've also written a fairly extensive XSpec test suite for it. In the process of developing and testing, I fixed some old encoding from back in the day. That's going to be a long steady process.
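A minimal sketch of the heart of the transform, assuming single-letter rhyme labels in a @label attribute (the real version carries more checks and user messages):

<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- The label that turned out to be a duplicate, and the earlier label it duplicates. -->
  <xsl:param name="dupLabel" as="xs:string" select="'m'"/>
  <xsl:param name="realLabel" as="xs:string" select="'c'"/>
  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:template match="@label">
    <xsl:variable name="cp" select="string-to-codepoints(.)[1]"/>
    <xsl:attribute name="label" select="
      if (. = $dupLabel) then $realLabel
      else if ($cp gt string-to-codepoints($dupLabel)[1]) then codepoints-to-string($cp - 1)
      else string(.)"/>
  </xsl:template>
</xsl:stylesheet>

The three branches: merge the duplicate into the earlier label, shift every later label down one slot, and leave everything before the duplicate alone.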
Met with AC and drafted the plan for 2019; that's in the repo now, along with a todo list for both of us.
CSS selector conversion now properly written, tested and working. XSpec file now includes test for that function. Extra poem encoded for testing. Fixes to a couple of other poems done.
I've also now finished re-creating the original site banner using higher-res sources to get something that will actually scale. Results look pretty good in isolation.
SQL-to-TEI conversion now updated to take account of the changes to the db; ditto with schema and documentation; and finally the poem rendering XSLT.
Did the updates on the dev db first, then on the live db. Problems encountered were that extensions to field lengths in the poems table seem to have hit MySQL limits, in particular the size limit on a row, which is 65,535 bytes. Converting some columns to TEXT instead of VARCHAR solves the problem, although of course there's a performance penalty. I also had to delete some indexes which were hitting limits. Below is the process in half-code-half-comments. Now I have to update my TEI generation code to take account of the changes.
/* This file is the working SQL file for changes to the db made in December 2018, per
 * instructions from AC.
 *
 * Make these changes step by step and confirm/check/test/backup before continuing. */

/* FIRST THE SIMPLE THINGS: MAKING TEXT FIELDS LONGER. */

/* Pseudonym field needs triple the length of characters. */
/* First we have to drop some indexes this field is involved in. */
ALTER TABLE `poems` DROP INDEX `idx_po_general`;
ALTER TABLE `poems` DROP INDEX `idx_po_pseudonym`;
/* Now set the length. */
ALTER TABLE `poems` MODIFY `po_pseudonym` VARCHAR(300);

/* Display name ditto. */
ALTER TABLE `persons` MODIFY `prs_displayName` VARCHAR(300);

/* Images field needs to handle up to 70 images. This involves changing its type to TEXT. */
ALTER TABLE `poems` MODIFY `po_images` TEXT(4096);

/* Add a new allonym text field. */
ALTER TABLE `poems` ADD COLUMN `po_allonym` VARCHAR(300) AFTER `po_allonymous`;

/* Add new hashtag field. */
ALTER TABLE `poems` ADD COLUMN `po_hashtags` TEXT(1024) AFTER `po_links`;

/* SERIES FIELD FOR POEMS. */
ALTER TABLE `poems` ADD COLUMN `po_series` int(11) default NULL AFTER `po_organ`;

/* NOW UPDATE local_classes.php and test. */
/* local_classes.php:
 *   set prs_displayName to 300 length.
 *   set po_pseudonym to 300 length.
 *   set po_images to 4096 length.
 *   add new allonym field.
 *   add new hashtags field.
 *   add new series field.
 */

/* NOW THE HARD STUFF: TURN THE TRANSLATOR FIELD INTO A ONE-TO-MANY LINK TO PERSONS. */

/* First create the linking table. */
DROP TABLE IF EXISTS `poems_to_translators`;
CREATE TABLE `poems_to_translators` (
  `ptt_id` int(11) NOT NULL auto_increment,
  `ptt_po_id` int(11) default NULL,
  `ptt_tr_id` int(11) default NULL,
  PRIMARY KEY (`ptt_id`),
  KEY `fk_ptt_translator` (`ptt_tr_id`),
  KEY `fk_ptt_poem` (`ptt_po_id`),
  CONSTRAINT `fk_ptt_translator` FOREIGN KEY (`ptt_tr_id`) REFERENCES `persons` (`prs_id`) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `fk_ptt_poem` FOREIGN KEY (`ptt_po_id`) REFERENCES `poems` (`po_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

/* NOW UPDATE local_classes.php and test. */

/* Next, we try to discover candidates for translators in the persons table. */
/* This is the XQuery to generate the SQL: */
---------------------------------------
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method "text";
let $poemsWithTranslators := //table_data[@name='poems']/row[string-length(field[@name='po_translator']) gt 0],
    $links := for $p in $poemsWithTranslators
              let $transName := normalize-space($p/field[@name='po_translator']/text()),
                  $candidates := //table_data[@name='persons']/row[field[@name='prs_displayName'] = $transName]
              return
                if (count($candidates) = 1) then
                  let $poId := $p/field[@name='po_id'],
                      $prsId := $candidates/field[@name='prs_id']
                  return concat('INSERT INTO `poems_to_translators` (`ptt_po_id`, `ptt_tr_id`) VALUES ("', $poId, '", "', $prsId, '");', '&#10;')
                else
                  concat('/* No match found for ', $transName, '. */&#10;')
return $links
---------------------------------------
/* Run the resulting SQL against the db to insert
 * the new records. */
/* Change the local_classes.php file to show "OLD Translator field". */
/* Download fresh versions of the db and commit. */
/* Run XPath against the db to get comma-separated lists of poem ids where:
 * a) a new record has been inserted linking to the poem table, and
 * b) no match was found so a record will have to be manually created. */
Tomorrow is update day for the db, so I've built a full plan with SQL and XQuery code ready to execute. It looks like we can link about 730 of the 1530 or so translators directly to existing person records, so although those will need to be checked, that's a lot faster than doing them all manually. The remaining ones will have to be handled manually, though. I've also pulled the banner from the old VPN site ready to make a rough home page for the site, and done some more thinking about a more sophisticated CSS parser in XSLT.
It's very common to find the same pattern of indents throughout the stanzas of a poem. Right now, people are encoding these mechanically and repetitively, which is OK but clutters the XML and takes time. A better option would be to use the TEI rendition element with the @selector attribute, like this:
<rendition selector="lg">
  margin-left: 6rem;
</rendition>
<rendition selector="lg>l:nth-child(2), lg>l:nth-child(4), lg>l:nth-child(7), lg>l:nth-child(10)">
  margin-left: 1.1em;
</rendition>
to specify that all stanzas have a left margin of 6rem, and lines 2, 4, 7, and 10 of each stanza are additionally indented.
This is easy to encode but hard to process. I've had a first shot at figuring out how to do it, and so far so good, although as the selectors get gnarlier the code will have to be revisited. It's good enough for testing purposes at any rate.
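The core of that first shot is a function rewriting the simple selector forms we actually use into XPath; a minimal sketch (the function name and namespace prefix are hypothetical, and it only copes with child combinators and nth-child):

<xsl:function name="dvpp:selectorToXPath" as="xs:string">
  <xsl:param name="selector" as="xs:string"/>
  <!-- Split on commas, turn each child combinator into a path step and each
       nth-child into a positional predicate, then union the results. -->
  <xsl:sequence select="
    string-join(
      for $s in tokenize($selector, '\s*,\s*')
      return replace(replace($s, '\s*>\s*', '/'), ':nth-child\((\d+)\)', '[$1]'),
      ' | ')"/>
</xsl:function>

So "lg>l:nth-child(2), lg>l:nth-child(4)" comes out as "lg/l[2] | lg/l[4]", which can then be evaluated to decide which lines a given rendition applies to.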
The progress tracking output was borked in a couple of ways, one cosmetic (the chart display had hundreds of stacked labels on the X axis) and one arithmetical (I was miscalculating the projected duration based on current progress). I've fixed both of those issues.
Worked on my poem-encoding Quick Fix so that it can now tag a whole poem in one go. As part of developing and testing, I also encoded a couple of poems myself, and did some tweaks to rendering and processing. I also ran the OCR task against the 1840 poems to give myself a bit more choice in picking test poems. Updated the documentation as well.
Met with KF and AC and discussed a number of issues. As a result, I've added schema support, processing support and documentation for handling refrains, eliminated the hack that was used to handle them before, tightened up the rhyme label attribute constraints accordingly, and purged the old encoding from the data. I've done the same for ornamental horizontal lines. In the process I encoded a couple of poems myself, and fixed some rendering bugs that were annoying me.
More work on this, and lots of work to fix bad rhyme encoding which is now obvious in the results. Much tedious re-encoding of old poems. Fixed some XSLT bugs too.
I've been meaning to do this for a long time, and I got the time today to do a quick-and-dirty implementation of a search for all endings that rhyme with a given ending supplied as a param. The results are intriguing, suggesting many encoding errors in the older files. These can all be fixed, of course, but it'll be interesting to follow up on how some of them happened. More work to do, too, on the interface to the feature.
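A minimal sketch of the sort of query involved, treating a shared orthographic ending as a rough proxy for rhyme (the collection URI is hypothetical and Saxon-specific, and this assumes rhymes are tagged with TEI's rhyme element):

declare variable $ending external := 'ing';
for $r in collection('../data/poems?select=*.xml')//*:rhyme
where ends-with(lower-case(normalize-space($r)), lower-case($ending))
return concat(base-uri($r), ': ', normalize-space($r))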
Added the Schematron and the QuickFix, documented them, and then trawled through all the existing documents to fix problems (there were hundreds). This applies only to the text element descendants for now, but I also fixed some apostrophes/straight quotes in the db itself to avoid future problems.
After adding some documentation on the SQF QuickFix features to the ODD file, I got annoyed by the fact that the SQF code in the ODD file was making it technically invalid. Actually, it was the TEI code embedded in XSLT embedded in SQF which was causing the problem, so I refactored it to use XSL element and attribute constructors instead of literal elements, and the problem was fixed. You have to be a little careful with namespaces when doing that, though.
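For illustration, the shift looks something like this (a hypothetical stanza-tagging fragment, not the actual fix). A literal TEI element, which the ODD's validator sees and objects to:

<lg xmlns="http://www.tei-c.org/ns/1.0" type="stanza">
  <xsl:apply-templates/>
</lg>

becomes a constructed one, invisible to the validator:

<xsl:element name="lg" namespace="http://www.tei-c.org/ns/1.0">
  <xsl:attribute name="type" select="'stanza'"/>
  <xsl:apply-templates/>
</xsl:element>

The namespace attribute is the easy thing to forget; leave it off and the element lands in no namespace (or whatever the default happens to be).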
Then I moved on to implementing the requirement for curly apostrophes. Actually I'm going to generalize that to curly quotes everywhere. Before we can make rules and enforce them, we need to make sure that we're not actually importing more of these things whenever we do SQL-to-TEI processing, so I've been working on those conversion routines to make them handle the curly apostrophes and quotes in the db. In the process, I learned a bit about XSpec and wrote my first XSpec unit tests. This looks like it may be a valuable testing tool.
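My first test looked something like this (the function name, prefix, namespace URI and stylesheet path are all hypothetical):

<x:description xmlns:x="http://www.jenitennison.com/xslt/xspec"
               xmlns:dvpp="urn:dvpp:functions"
               stylesheet="sql-to-tei.xsl">
  <x:scenario label="Straight apostrophes become curly on conversion">
    <x:call function="dvpp:curlifyApostrophes">
      <x:param select="'Don''t and won''t'"/>
    </x:call>
    <x:expect label="U+2019 in place of U+0027"
              select="'Don’t and won’t'"/>
  </x:scenario>
</x:description>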
Meeting and group tagging session, during which I did the following:
- Created a link inside the documentation HTML for a "cheatsheet" which is actually a constrained view of some of the documentation, intended for printing.
- Discussed with the RAs the need for a simpler way to check your rhyme encoding, which resulted in a new feature in the poem rendering that enables you to turn on and off individual rhyme label highlighting.
- A new constraint on lg/@rhyme, which uses a regex to constrain the content, and includes a new value of "NONE", which we will add some processing for (see the sketch after this list).
- Fixed a bunch of old encodings which were no longer valid against the new constraint.
- Discussed encoding of prose content before and inside poems, developed tagging guidelines for it, and added them to the documentation.
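Something like this, as a Schematron rendering of the idea (the regex here is a guess; the real one in the ODD is more carefully considered):

<sch:rule context="lg[@rhyme]" xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:assert test="matches(@rhyme, '^([a-zA-Z]+|NONE)$')">
    @rhyme should be a sequence of rhyme labels, or NONE.
  </sch:assert>
</sch:rule>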
Made a bit of progress before and after the morning's training; also tagged a longish poem myself as part of testing. All basically working as intended.
After a lot of reading and experimentation, I think I have a robust way to enable automated tagging of blocks of content, using Schematron Quick Fixes. Right now I have one for turning double-dashes into em dashes, and (more important, and more difficult) one for auto-tagging a block of text as a stanza. I've also added processing into the ODD file build to retrieve the code template keystroke shortcuts from the Oxygen file and build an explanatory table in the documentation.
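The double-dash fix is the simpler of the two; a minimal sketch of the shape (the context and wording here are invented, not the rule as committed):

<sch:rule context="l"
          xmlns:sch="http://purl.oclc.org/dsdl/schematron"
          xmlns:sqf="http://www.schematron-quickfix.com/validator/process">
  <sch:report test="contains(., '--')" sqf:fix="emDash">Double hyphen; should probably be an em dash.</sch:report>
  <sqf:fix id="emDash">
    <sqf:description>
      <sqf:title>Replace -- with an em dash</sqf:title>
    </sqf:description>
    <sqf:stringReplace match="text()" regex="--">—</sqf:stringReplace>
  </sqf:fix>
</sch:rule>

The stanza-tagging one uses the same machinery, with an sqf:replace building an lg around the selected text.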
It's now a one-line command to add OCR to any given year in the TEI files. Did both 1820 and 1830 already.
My HOCR process is now able to find all the candidate poems in a year, download all the images, run HOCR on them, and then start to process the original file to include the comment in it. But the last phase is a little tricky.
Did another round of group training, where we all discussed a lot of our processes and tagging practices. We've decided to dispense with any encoding which can be derived from the structure or other tagging -- so for instance lg/@type is not needed right now, because all the simple types are inferrable, and the complex ones need expertise.
Planned the stages of the OCR process and created the framework build file for it.
KFu found some tags she was using that were not being correctly handled, so I've made some additions and revisions to the poem rendering code.
Poem rendering is now part of the build process (assuming the Jenkins build doesn't break). Also spent some time building on the diagnostic process I wrote yesterday, which determines whether encoding data has been lost to the obsoletes folder during the sql-to-tei conversion, and wrote a retrieval process that puts it back. All data now retrieved, and I think the problem that caused the loss has been fixed; in any case I now have an instant method to check and fix after the next run of the TEI-building process.
Converted six old handouts from svn training and English 500 to use with our DVPP training, then we did the training. Afterwards, followed up with some fixes to the project file and transformation of poems, and also noticed that some encoded poem data was being lost to the obsoletes folder, so wrote a diagnostic to find the extent of that and generate info for fixing it. Will fix it next, and figure out why it's happening.
Added to and tweaked the documentation. Made a change to the diagnostics chart so it better reflects our progress towards 15,000 poems. Fixed some bugs. Rebuilt all the TEI.
I've basically completed the documentation in the ODD file ahead of Wednesday's training session, drawing on the old VPN materials but rewriting a lot to bring it in line with our current practice.
Met with AC and generated the following TODO list, some of which I've now done:
- Port over existing respStmts from old file (DONE).
- Reconfigure taxonomies to remove stanza types in favour of linegroup types (DONE).
- Reconfigure schema build process to incorporate full glosses and descs for linegroup types into schema (DONE).
- Reconfigure authorship taxonomy for better nesting and allonymy, and update existing headers accordingly (DONE).
- Add total poems per periodical to the stats (DONE).
- In the HTML poem rendering, replace links to images with thumbnails (DONE).
- In SQL to TEI process, incorporate authorship taxonomy data as catRefs.
- Add automated OCR to TEI building process for specific years (starting with 1820).
Also did a fresh rebuild of the TEI, incorporating new poems since the last one, to confirm that respStmt handling works correctly.
Ported over the original code from the VPN project and updated and tweaked it quite a lot to get better layout options. There's still the main layout to do, and I'll use CSS grid for that. In the process of today's work, I found errors in lots of poem encodings, which I fixed; added Schematron rules to prevent some of them; added missing bits to the taxonomies; and various other updates and fixes. This is all good progress.
There will be some tweaks that we need to handle, but the basic process is complete and we now have over 10,000 TEI files in our repository. Next is working on the Oxygen configuration for encoding. Posting time spent today (including a long phone call with AC) as well as time spent at the weekend getting the last wrinkles ironed out.
Spent most of the day tidying up, finishing off, and dealing with edge cases; I'm now able to generate all the TEI files containing all of the information we care about (minus the quoted-in-article thing for now, and the original language of a translation), but I'm just struggling with the final two tasks: generate scripts for moving obsolete files out of the xml folder, and for svn-adding new files which didn't exist before. This is just a question of getting my head round sed, which I don't use very often. Many changes to the schema and documentation in this process too.
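The shape of it, I think, is to diff sorted before-and-after file lists and turn each difference into a command; a hypothetical sketch with invented paths, not the final script:

# Sorted lists of TEI files before and after generation.
ls xml/*.xml | sort > old-files.txt
ls staging/*.xml | sed 's|^staging/|xml/|' | sort > new-files.txt

# Lines only in the new list are files to svn-add;
# lines only in the old list are obsolete files to move aside.
comm -13 old-files.txt new-files.txt | sed 's/^/svn add /' > add-new.sh
comm -23 old-files.txt new-files.txt | sed 's|^|mv |; s|$| obsolete/|' > move-obsolete.sh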
For ease of reading and convenience, I've added a section to the documentation which is a rendering of the taxonomies.
I'm now able to correctly re-generate the personography and bibliography data from the database, creating a valid output file. I'm also pulling in page-image data into the facsimile element of the poem files. Steady progress.
We need to do three kinds of things when processing the canonical metadata database into TEI XML: 1. update metadata in existing files, 2. create new files for poems never before processed, and 3. detect problems such as multiple existing files for the same id. I'm a good way into this, and the basic structure is set up and appears to be working. Ultimately, I think we'll have to manage the whole operation with ant, to construct all the output stuff in a new location, then copy it back over the original tree, because we won't be able to both read and write the same file in one XSLT transformation.
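Roughly the shape I have in mind, with hypothetical target and path names:

<target name="buildTei">
  <!-- Generate everything into a clean staging directory... -->
  <delete dir="staging"/>
  <mkdir dir="staging"/>
  <xslt in="db/dump.xml" out="staging/report.txt" style="xsl/sql-to-tei.xsl"/>
  <!-- ...then copy the results back over the canonical tree. -->
  <copy todir="xml" overwrite="true">
    <fileset dir="staging" includes="**/*.xml"/>
  </copy>
</target>

The XSLT would write its per-poem files as secondary result documents under staging; the main output is just a report.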
Also reconstructed the two required triggers in the live database to get the related-poems dropdown working again.
Per AC, extended the pseudonym field to 100 characters, and renamed the Anonymous field to Allonymous. Tested in dev, done in live, tested again in both. Later bug report came in: the myriad triggers were not handling the change correctly. Deleted the triggers completely as a quick way to get back to a working db, but I will have to rewrite the ones for related poems based on the earlier backups.
AC needed to generate a printable document of pages for a specific year, for the encoders to use. Got a list of images from the SQL XML dump like this:
let $images := //table_data[@name='poems']/row[starts-with(field[@name='po_date'], '1820')]/field[@name='po_images']/tokenize(., '\s+')
return string-join($images, '&#10;')
Then used wget to pull down the images from the server, and generated the PDF:
img2pdf -o 1820.pdf *.jpg
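For the record, the wget step between the XQuery and the img2pdf run was along these lines (filename hypothetical), feeding it the URL list the query produced:

wget -i 1820-image-urls.txt -P 1820/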