I've written an XSLT transformation which generates a report on broken cross-references, and run it. I then fixed a few of the obvious, easy ones. There are quite a few left.
I've just worked through the handful of remaining instances of bibls containing parentheses. These were problematic because it wasn't exactly clear how to assign responsibility (@corresp) values to them.
I've left the contents of the bibls alone, so we can easily find them all again with this regex:
<bibl[^<]*\([^<]+</bibl>
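As an illustration only, the same regex can be run from a short Python script rather than an editor's search box. The function name and the sample string are made up for this sketch; the pattern itself is the one quoted above, unchanged.

```python
import re

# The regex quoted above, unchanged.
PAREN_BIBL = re.compile(r"<bibl[^<]*\([^<]+</bibl>")

def find_paren_bibls(xml_text):
    """Return the raw text of every <bibl> whose content contains '('."""
    return PAREN_BIBL.findall(xml_text)

# Invented sample data built from bibls quoted in these posts.
sample = (
    '<bibl corresp="psn:K psn:AM psn:JM">K(Y40.29)</bibl>'
    "<bibl>4.56</bibl>"
    "<bibl>(q.v.)</bibl>"
)
hits = find_paren_bibls(sample)
for hit in hits:
    print(hit)
```

Because the pattern excludes `<` everywhere, it only matches bibls with plain text content, so a bibl containing a child element or comment would not be found this way.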
My basic approach has been to credit everyone whose initial-key appears in the entry, without trying to figure out the relationship between them, on the basis that this is safer than ignoring some people based on question marks or parentheses. So, for instance:
<bibl>K(Y40.29)</bibl>
becomes
<bibl corresp="psn:K psn:AM psn:JM">K(Y40.29)</bibl>
(K maps to psn:K; Y stands for AM and JM together.)
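The credit-everyone rule could be sketched in Python roughly as follows. This is a hedged illustration, not the code actually used: the lookup table holds only the two keys explained in this post, the function name is hypothetical, and it assumes that an initial-key is a run of ASCII capital letters.

```python
import re

# Hypothetical lookup table: only the two keys explained in this post.
KEY_TO_PEOPLE = {
    "K": ["K"],
    "Y": ["AM", "JM"],
}

def corresp_for(bibl_text, key_to_people=KEY_TO_PEOPLE):
    """Build a corresp value crediting everyone whose key appears in the text.

    Question marks and parentheses are deliberately ignored: every key found
    is credited. Keys missing from the table are skipped.
    """
    pointers = []
    # Assumption: an initial-key is a run of ASCII capitals.
    for key in re.findall(r"[A-Z]+", bibl_text):
        for person in key_to_people.get(key, []):
            pointer = "psn:" + person
            if pointer not in pointers:  # credit each person once, in order
                pointers.append(pointer)
    return " ".join(pointers)

print(corresp_for("K(Y40.29)"))
```

An entry such as (q.v.) contains no capitals, so it yields an empty string and still has to be resolved by hand.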
The only remaining problems are two instances in the rescued.xml file of this:
<bibl>(q.v.)</bibl>
I have no idea what to do with this, and a search through the original lexware files doesn't really help. I think someone will have to go back to the filecards for those.
There are 10,124 bibls remaining that have no @corresp, but 8,029 of them are <bibl><!--[No source]--></bibl>. That still leaves 2,095 which are a problem, though, and many of these are in already-completed files; affix_aspectual.xml, for instance, has 186 instances of this:
<bibl>4.56</bibl>
which I would guess is supposed to be W4.56 or G4.56, both of which appear elsewhere.
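A rough way to break down the remaining unattributed bibls is sketched below in Python. The bucket names and the function are invented for this example, and it works on raw text with a regex rather than a real XML parser, so it assumes bibls are not nested.

```python
import re
from collections import Counter

# A <bibl> start tag with no corresp attribute, plus its content.
# Assumption: bibls are not nested, so a lazy match up to </bibl> is safe.
UNATTRIBUTED = re.compile(
    r"<bibl\b(?![^>]*\bcorresp=)[^>]*>(.*?)</bibl>", re.DOTALL
)

def classify_unattributed(xml_text):
    """Count bibls lacking corresp, split into three rough buckets."""
    counts = Counter()
    for content in UNATTRIBUTED.findall(xml_text):
        content = content.strip()
        if content == "<!--[No source]-->":
            counts["no_source"] += 1      # explicitly marked as sourceless
        elif re.fullmatch(r"\d+\.\d+", content):
            counts["number_only"] += 1    # e.g. 4.56 with the key missing
        else:
            counts["other"] += 1
    return counts

# Invented sample data built from bibls quoted in these posts.
sample = (
    "<bibl><!--[No source]--></bibl>"
    "<bibl>4.56</bibl>"
    '<bibl corresp="psn:K psn:AM psn:JM">K(Y40.29)</bibl>'
    "<bibl>(q.v.)</bibl>"
)
print(classify_unattributed(sample))
```

The number_only bucket corresponds to cases like 4.56, where the initial-key seems to have been dropped.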
I have a working table of contents for the book, based on divs with xml:ids. I'm sure this will get more complicated in time, but it works for the moment. I think we're going to need to organize some individual section title pages which have only large text in the middle of them. This might be done by deciding that any div with a head but no other content is a big fat title page.
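The proposed title-page rule (a div with a head but no other content) is easy to state as a test. The Python below is only a sketch of that rule, with a made-up sample document; the real check would presumably live in the XSLT.

```python
import xml.etree.ElementTree as ET

def local_name(element):
    """Strip any namespace from an element's tag."""
    return element.tag.rsplit("}", 1)[-1]

def is_title_page(div):
    """True if the div has at least one head and nothing else at all."""
    children = list(div)
    has_head = any(local_name(child) == "head" for child in children)
    only_heads = all(local_name(child) == "head" for child in children)
    # Text directly inside the div (before or between children) counts as content.
    has_text = bool((div.text or "").strip()) or any(
        (child.tail or "").strip() for child in children
    )
    return has_head and only_heads and not has_text

# Invented sample: one bare section title, one normal section, one headless div.
doc = ET.fromstring(
    '<body>'
    '<div xml:id="a"><head>Part One</head></div>'
    '<div xml:id="b"><head>Intro</head><p>Text</p></div>'
    '<div xml:id="c"><p>No head</p></div>'
    '</body>'
)
flags = [is_title_page(div) for div in doc.findall("div")]
print(flags)
```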
As requested in SMK's post of March 4, I've added the four indexes in an appendix. They add only 17 pages to the length of the dictionary.
Yesterday ECH requested the following three fixes, which I've now done:
- Allomorphs are now separated by tildes in main entries.
- The "capitals" in small-caps are now explicitly 9pt, alongside 7pt for the "small" letters, instead of inheriting the default 10pt from the context.
- In the root-based index, we're using the first def instead of the gloss, since glosses often don't constitute short definitions; instead they're partial definitions whose purpose is to generate the English-Moses lookup index.
ECH and I met and went through all the existing tasks and documentation. We've put together a Gantt chart (using GanttProject -- very straightforward to use), and we'll be able to move forward in a more organized way through the next phase of the project.
Made a few bugfixes, and added title page info and some other placeholder content to the XML intro file for the book. Then added rendering for the title pages, and fixed the blank page issue. Also added some XSL messages to track rendering progress, and rationalized the build directory by deleting some old files with hard-coded orths (not needed now that we're creating them on the fly). Did another proof of the LD & C article too.
Today:
- Spacing is fixed (all explicitly inserted and controlled now).
- Leading/trailing spaces in text nodes are clipped.
- Affix prefixes and suffixes are working.
- The small-caps implementation is working, and is used in the root-based-index too.
- The RBI is now formatted very nicely, with no confusing indents, and with some decent settings for keeping headers and headwords with following blocks, to avoid widows.
- Labels are now included in entries for affixes.
- Sections now all begin on rectos.
New tasks arising:
- Because of auto-blank pages, we need a page template for blank pages and it needs to be included with sequences that have headers and footers, so that blank is really blank (the technique is adaptable from ScanCan corpus-to-pdf code).
- Sarah's remaining instructions below need to be implemented (some are straightforward, others less so).
A certain amount of one-step-forward-two-steps-back today. We've determined that there's no way to do small-caps with XEP, because the font does not have a small-caps variant, so I had to write some code to make small-caps programmatically with text analysis and styles. This caused horrible problems because it introduced lots of whitespace, due to the indenting between FO nodes. The only way to get rid of this was to set indent="no" on the XSL output, which leaves us with the need to manually insert spaces all over the place; that will take me a while, but ultimately it'll probably be the right thing to do. So the results are currently ugly as sin, but will eventually look a bit better.
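The general idea behind faking small-caps can be shown in a few lines of Python. This is a sketch of the technique, not the XSLT actually written: it splits the text into runs of lowercase and non-lowercase characters, uppercases everything, and wraps each run in an fo:inline with its own font size. The 9pt and 7pt values are the ones mentioned in the post about ECH's fixes; the function name is hypothetical.

```python
from itertools import groupby
from xml.sax.saxutils import escape

CAP_SIZE = "9pt"    # size for characters that were already capitals
SMALL_SIZE = "7pt"  # size for lowercase letters rendered as capitals

def small_caps_fo(text):
    """Return XSL-FO markup imitating small-caps with two font sizes."""
    parts = []
    # Group consecutive characters by whether they are lowercase letters.
    for is_lower, chars in groupby(text, key=str.islower):
        run = "".join(chars)
        size = SMALL_SIZE if is_lower else CAP_SIZE
        parts.append(
            '<fo:inline font-size="%s">%s</fo:inline>'
            % (size, escape(run.upper()))
        )
    # Joined with no whitespace, so no stray spaces appear inside a word.
    return "".join(parts)

print(small_caps_fo("Attrib"))
```

Emitting the inlines back to back is the string-level equivalent of turning indentation off: any newline or indent between two fo:inline elements would show up as a space in the middle of a word.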
Lots of other decisions as detailed in SMK's posts today.
If there are two instances of the same root's xml:id in a word's hyph, it's because the root morpheme is split up by an infix. These infixes need to be handled as follows:
-- when printing the hyph, replace the following strings:
-ʔ- with <ʔ>
+a+ with <a>
+C₂+ with <C₂>
+CVC+ with <CVC>
-- when generating the translated hyph,
a) Delete the second/rightmost instance of the root after these morphemes: inchoative (xml:id="ʔ"), characteristic (xml:id="CHAR"), out of control (xml:id="OC"):
For example: [[√ʔiɬ<CVC>n-úl • √eat<char>-attrib]]
BUT, if the root has no gloss, DO keep the second part of the root:
For example: [[k-√cúwˀ<CVC>x=ánaʔ • loc-√cúwˀ<char>x=ear]]
b) Delete the first/leftmost instance of the root before the repetitive morpheme (xml:id="REP"), and put the root symbol before the second part of the root.
For example: [[√p<a>tix̣ʷ • <rep>√test]]
Again, if the root has no gloss, keep the first part of the root.
For example: [[√p<a>tix̣ʷ • √p<rep>tix̣ʷ]]
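The print-side replacements listed above amount to a fixed string mapping, sketched here in Python. The raw input forms used in the example (such as √p+a+tix̣ʷ) are an assumption about how the hyph is stored, inferred from the replacement list, and the function name is hypothetical. The translated-hyph rules are not covered, since they also depend on morpheme ids and glosses.

```python
# Infix markers as stored (left) and as printed (right), from the list above.
INFIX_PRINT_FORMS = {
    "-ʔ-": "<ʔ>",
    "+a+": "<a>",
    "+C₂+": "<C₂>",
    "+CVC+": "<CVC>",
}

def print_hyph(hyph):
    """Replace stored infix markers with their angle-bracket print forms."""
    for stored, printed in INFIX_PRINT_FORMS.items():
        hyph = hyph.replace(stored, printed)
    return hyph

# Assumed stored forms for two of the examples above.
print(print_hyph("√p+a+tix̣ʷ"))
print(print_hyph("√ʔiɬ+CVC+n-úl"))
```

Since the output contains literal angle brackets, it would need escaping if it were written back into XML or FO rather than printed as plain text.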