While in the UK I've continued trying to fix up the XML in the existing documents. The more I looked at them, the worse they turned out to be; it seems unlikely that any of them would ever have validated, unless they were written against a heavily modified schema which redefined the way some core tags should be used. My strategy has been to fix errors that occur only once or twice by hand as I encounter them, and to roll all the other fixes into a single large XSLT transformation, which I include below because it amounts to documentation of the fixes required. The transformation, which I think is finished as of this morning, leaves all documents valid against tei_all, with three exceptions: two documents which have peculiarities relating to mapping the routes of pageants, and which will have to be entirely rewritten manually; and the credits.xml file, which uses XInclude to pull in data from another document, and can't actually be validated because oXygen can't handle the xpointer syntax used to pull in the relevant content (although eXist can). After this, I'm going to convert all of the remaining PHP pages on the site to XML (that has to be done manually); then we'll have a good data set and I can start coding things like the search engine.
That said, many documents still have problems despite being valid, and those will have to be fixed manually. The most common case is a document with internal sections: these should be encoded as nested divs, each with its own head element, but they've instead been marked up with special paragraph or seg elements acting as inline headings, which validates but is clearly incorrect. Those can be dealt with in time, by an RA, once we have good detailed markup documentation.
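To make the difference concrete, here is a minimal sketch of the two encodings. The content is invented for illustration and is not taken from any of the project documents; only the element names and the rend="heading" value come from the markup discussed here.

```xml
<!-- Validates, but the section heading is only a styled paragraph. -->
<div>
  <head>Main title</head>
  <p>Introductory text.</p>
  <p><seg rend="heading">A section title</seg></p>
  <p>Section text.</p>
</div>

<!-- Preferred: the section becomes a nested div with its own head. -->
<div>
  <head>Main title</head>
  <p>Introductory text.</p>
  <div>
    <head>A section title</head>
    <p>Section text.</p>
  </div>
</div>
```

Deciding where each nested div should end depends on reading the text, which is why this restructuring is left as a manual job.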
Here's the XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl"
xpath-default-namespace="http://www.tei-c.org/ns/1.0"
xmlns="http://www.tei-c.org/ns/1.0"
exclude-result-prefixes="xs xd"
version="2.0">
<xd:doc scope="stylesheet">
<xd:desc>
<xd:p><xd:b>Created on:</xd:b> Oct 1, 2011</xd:p>
<xd:p><xd:b>Author:</xd:b> mholmes</xd:p>
<xd:p>This is just a utility identity transform for any purposes required.</xd:p>
</xd:desc>
</xd:doc>
<!-- This template finds instances where a group element has at most
one child text element, and unwraps it: the group and the child text
are removed, and the content of the text is kept. -->
<xsl:template match="group[count(child::text) lt 2]">
<xsl:apply-templates select="child::text/*"/>
</xsl:template>
<!-- body elements should have their content in a div, but usually they don't. -->
<xsl:template match="body[not(child::div)]">
<body><xsl:text>
</xsl:text>
<div><xsl:text>
</xsl:text>
<xsl:if test="not(child::head)">
<head><xsl:copy-of select="//teiHeader/fileDesc/titleStmt/title[1]/text()"/></head><xsl:text>
</xsl:text>
</xsl:if>
<xsl:apply-templates select="*"/><xsl:text>
</xsl:text>
</div><xsl:text>
</xsl:text>
</body><xsl:text>
</xsl:text>
</xsl:template>
<!-- This template adds a generic revision change element, to capture the range
of different changes I've made over the past couple of weeks. -->
<xsl:template match="revisionDesc">
<revisionDesc><xsl:text>
</xsl:text>
<xsl:apply-templates select="change"/>
<change when="2011-11" who="mol:HOLM1">Various updates and fixes made through XSLT, to standardize and normalize encoding practices.</change><xsl:text>
</xsl:text>
</revisionDesc>
</xsl:template>
<!-- At the same time, we might as well remove useless <change> elements
which are empty, and also wrong, derived from a template. -->
<xsl:template match="change[string-length(normalize-space(.)) lt 2]"></xsl:template>
<!-- Lots of <change> elements contain <l> and <p> elements, which are invalid. These are used to
separate multiple changes made during the same session. The simplest thing is
to replace them with line-break tags, I think. Many have no content,
so those can be suppressed. -->
<xsl:template match="change/l">
<xsl:if test="string-length(normalize-space(.)) ge 2">
<xsl:value-of select="."/>
<xsl:if test="following-sibling::l[string-length(normalize-space(.)) ge 2]">
<lb/>
</xsl:if>
</xsl:if>
</xsl:template>
<xsl:template match="change/p">
<xsl:if test="string-length(normalize-space(.)) ge 2">
<xsl:value-of select="."/>
<xsl:if test="following-sibling::p[string-length(normalize-space(.)) ge 2]">
<lb/>
</xsl:if>
</xsl:if>
</xsl:template>
<!-- Normalize previously-inserted change elements. -->
<xsl:template match="change[contains(@who, 'staff.php#martin')]">
<change who="mol:HOLM1">
<xsl:copy-of select="@when"/>
<xsl:apply-templates select="* | text()"/>
</change>
</xsl:template>
<xsl:template match="change[child::name[@ref='mol:MDH']]">
<change who="mol:HOLM1" when="2011-09">
<xsl:value-of select="text()"/>
</change>
</xsl:template>
<!-- Some files have no revisionDesc, so we'll insert it. -->
<xsl:template match="encodingDesc[not(following-sibling::revisionDesc)]">
<encodingDesc>
<xsl:apply-templates></xsl:apply-templates>
</encodingDesc>
<revisionDesc><xsl:text>
</xsl:text>
<change when="2011-11" who="mol:HOLM1">Various updates and fixes made through XSLT, to standardize and normalize encoding practices.</change><xsl:text>
</xsl:text>
</revisionDesc>
</xsl:template>
<!-- We need to handle previously-transformed internal links like this:
<ref type="internal" target="/licenced.htm"> -->
<xsl:template match="ref[@type='internal'][matches(@target, '/[a-zA-Z0-9_]*\.htm')]">
<ref target="mol:{substring-before(substring-after(@target, '/'), '.htm')}"><xsl:value-of select="."/></ref>
</xsl:template>
<!-- @rend attributes are very problematic. Here are the elements which carry them:
emph (bold)
graphic (center, imgmap) Don't know what to do about this.
group (multi ... WTF?)
head (italics)
hi (sup)
l (various: see below)
lg (center)
name (book_title, no_index. The former fixed below; the latter mysterious, and left alone.)
p (center; center italics; center, italics; copyright; indent)
ref (noindex, no_index)
reg (fixed manually)
seg (see below)
title (fixed manually)
-->
<!-- These are the distinct values of seg/@rend. They include both inline and block formatting
commands.
bold
italics
superscript
bold italics
center
heading center
heading
center italics
italics center
italics indent_10
Each will have to be handled slightly differently, depending on its context and usage.
-->
<!-- There's only one of these, and it's just flat-out wrong. -->
<xsl:template match="name[@rend='book_title']">
<title level="m"><xsl:apply-templates/></title>
</xsl:template>
<!-- There are two distinct usages of ref/@rend regarding "no indexing":
no_index and noindex. We'll normalize to the former. -->
<xsl:template match="ref[contains(@rend, 'index')]">
<ref rend="no_index">
<xsl:copy-of select="@*[not(local-name() = 'rend')]"></xsl:copy-of>
<xsl:copy-of select="* | text()"/>
</ref>
</xsl:template>
<!-- These are the simple ones, which are all inline. <seg rend="bold">, for instance, occurs in <p>, <q>, <l>, and <item>; in all those contexts it can be replaced by <hi>. -->
<xsl:template match="seg[@rend='bold']">
<hi rend="font-weight: bold;">
<xsl:apply-templates/>
</hi>
</xsl:template>
<!-- <emph> is only used for bold, and should be kept as <emph>. -->
<xsl:template match="emph[@rend='bold']">
<emph rend="font-weight: bold;">
<xsl:apply-templates/>
</emph>
</xsl:template>
<xsl:template match="seg[@rend='italics']">
<xsl:choose>
<xsl:when test="contains(., 'MoEML') or contains(., 'Map of Early Modern London')">
<title level="m"><xsl:apply-templates select="* | text()"/></title>
</xsl:when>
<xsl:otherwise>
<hi rend="font-style: italic;">
<xsl:apply-templates/>
</hi>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template match="seg[@rend='bold italics']">
<hi rend="font-weight: bold; font-style: italic;">
<xsl:apply-templates/>
</hi>
</xsl:template>
<xsl:template match="seg[@rend='superscript'] | hi[@rend='sup']">
<hi rend="vertical-align: super;">
<xsl:apply-templates/>
</hi>
</xsl:template>
<!-- These are the more complicated ones. They require that the style be realized on the containing element, not the <seg>. Many are caused by abuse of <lg> and <l> to create a heading. In the right
context, these can simply be converted to <head> elements. -->
<xsl:template match="lg[preceding-sibling::*[1][local-name() = 'head']][count(child::l) = 1]">
<head>
<xsl:attribute name="rend">text-align: center;<xsl:if test="contains(l/@rend, 'italics')"> font-style: italic;</xsl:if></xsl:attribute>
<xsl:apply-templates select="l/* | l/text()"/>
</head>
</xsl:template>
<!-- Normal <lg>s and <l>s. There's a lot of redundancy in these (rend="center" on both
<lg> and its child <l>) Only "center" exists.-->
<xsl:template match="lg[@rend]">
<lg rend="text-align: center;">
<xsl:apply-templates/>
</lg>
</xsl:template>
<!-- <l> has a wide range of values for @rend.
We need to handle each of them correctly.
-->
<xsl:template match="l[@rend]">
<xsl:variable name="style">
<xsl:if test="contains(@rend, 'center')">text-align: center;</xsl:if>
<xsl:if test="contains(@rend, 'bold')">font-weight: bold;</xsl:if>
<xsl:if test="contains(@rend, 'italic')">font-style: italic;</xsl:if>
<xsl:if test="contains(@rend, 'align_right')">text-align: right;</xsl:if>
<!-- Indents are of the form indent_3, where 3 = number of spaces to indent by. We will use
em as the value for this. In all cases, the indent value appears at the end of the
@rend. -->
<xsl:if test="contains(@rend, 'indent_')">text-indent: <xsl:value-of select=" normalize-space(substring-after(@rend, 'indent_'))"/>em;</xsl:if>
</xsl:variable>
<l>
<xsl:attribute name="rend"><xsl:value-of select="normalize-space($style)"/></xsl:attribute>
<xsl:apply-templates select="* | text()"/>
</l>
</xsl:template>
<!-- <p> elements have some values which can be converted to css, but others
that we'll just leave in the rend attribute. -->
<xsl:template match="p[contains(@rend, 'italics') or contains(@rend, 'center')]">
<xsl:variable name="style">
<xsl:if test="contains(@rend, 'center')">text-align: center;</xsl:if>
<xsl:if test="contains(@rend, 'italic')">font-style: italic;</xsl:if>
</xsl:variable>
<p rend="{normalize-space($style)}">
<xsl:apply-templates/>
</p>
</xsl:template>
<!-- There are several instances of <p> elements which contain only a seg with @rend
= "heading" or "heading center". These can't be turned into <head> elements
because they appear right in the middle of divs. They SHOULD be handled by
sub-divs, but that's impossible to do algorithmically. -->
<xsl:template match="p[child::seg[contains(@rend, 'heading')]]">
<xsl:comment>The following p should NOT be a p tag; it should be
a heading, using a head tag, and the text should be subdivided
to provide a nested div at this point.
</xsl:comment>
<p rend="heading">
<xsl:if test="contains(seg/@rend, 'center')">
<xsl:attribute name="rend">text-align: center;</xsl:attribute>
</xsl:if>
<xsl:apply-templates select="seg/* | seg/text()"/>
</p>
</xsl:template>
<!-- old internal references
<ref type="internal" target="render_page.php?id=HENS2&n=16&searchterm=manuscript"> -->
<xsl:template match="ref[@type='internal'][starts-with(@target, 'render_page.php?id=')]">
<xsl:choose>
<xsl:when test="contains(@target, '&amp;')">
<!-- There are other parameters which we need to retain for the moment.
The id is everything between 'id=' and the first ampersand; the
remaining parameters are appended unchanged. -->
<ref>
<xsl:copy-of select="@*[not(local-name() = 'target')]"/>
<xsl:attribute name="target">mol:<xsl:value-of select="substring-before(substring-after(@target, 'id='), '&amp;')"/>&amp;<xsl:value-of select="substring-after(@target, '&amp;')"/></xsl:attribute>
<!-- Keep the link text and any child elements. -->
<xsl:apply-templates/>
</ref>
</xsl:when>
<xsl:otherwise>
<ref>
<xsl:copy-of select="@*[not(local-name() = 'target')]"/>
<xsl:attribute name="target">mol:<xsl:value-of select="substring-after(@target, 'id=')"/></xsl:attribute>
<!-- Keep the link text and any child elements. -->
<xsl:apply-templates/>
</ref>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!-- <date calendar="Julian" value="1613"> -->
<xsl:template match="date[@value]">
<date when="{@value}">
<xsl:apply-templates select="@*[not(local-name() = 'value')] | * | text()"/>
</date>
</xsl:template>
<!--
<text>s in multiple groups often have this: <text corresp="THRE3">.
They point at e.g. locations, and should look like this:
<text corresp="mol:THRE3">
-->
<xsl:template match="text[@corresp]">
<text>
<xsl:copy-of select="@n"/>
<xsl:attribute name="corresp" select="concat('mol:', @corresp)"/>
<xsl:apply-templates/>
</text>
</xsl:template>
<!-- Copy everything else as-is. -->
<xsl:template match="node()|@*" priority="-1">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
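As a rough illustration of what a few of these templates do, here is a before-and-after sketch. The input fragment is hypothetical (the link text and sentence are made up; only the attribute values echo the examples in the comments above), and it assumes the elements are in the TEI namespace, as the stylesheet's xpath-default-namespace requires.

```xml
<!-- Hypothetical input fragment, not taken from the corpus. -->
<p>See the <ref type="internal" target="/licenced.htm">licence page</ref>,
  <seg rend="bold">revised</seg> in <date calendar="Julian" value="1613">1613</date>.</p>

<!-- What the stylesheet should produce for it. Attribute order may differ
     between processors. -->
<p>See the <ref target="mol:licenced">licence page</ref>,
  <hi rend="font-weight: bold;">revised</hi> in <date when="1613" calendar="Julian">1613</date>.</p>
```

The p itself is passed through by the identity template at the end of the stylesheet, since it has no rend attribute and no heading seg; only the three inline elements are rewritten.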