Leaving early today.
My original approach to sequencing the Carrier dictionary was to create a Java collation, but it occurred to me when I looked at the actual data and the proposed sequence that this could easily be done in XSLT, using replacements. My strategy has been to remove all accents, then replace key pairs of characters with other pairs that will sort in the right location relative to existing data, and then to remove all apostrophes. This, as far as I can tell, gives the right results, although I'm still waiting for confirmation from the FV and Carrier folks. Here's the XSLT that does it:
<xsl:function name="mdh:tweak" as="xs:string">
<xsl:param name="inString" as="xs:string"/>
<xsl:variable name="output" select="$inString"/>
<xsl:choose>
<xsl:when test="string-length($inString) gt 0">
<!-- Replace all accented vowels with their unaccented equivalents. -->
<xsl:variable name="accentsGone" select="translate(normalize-space(lower-case($inString)), 'áéíóú', 'aeiou')"/>
<!-- Get rid of all combining underscores (u+0331 and u+0332).-->
<xsl:variable name="underscoresGone" select="replace($accentsGone, '̱|̲', '')"/>
<!-- Now before we remove the apostrophes, we need to replace some pairs of letters that need to sort as if they were one.
We can use a character following z to replace a character that needs to sort after all the rest. For example:
kh needs to sort after ka, kb, kc, kz, so we can replace kh with k}
So we do:
g becomes {
h becomes }
l becomes ~
o becomes ¥
s becomes ¦
w becomes §
z becomes ©
-->
<xsl:variable name="gReplaced" select="replace($underscoresGone, 'ng', 'n{')"/>
<xsl:variable name="hReplaced" select="replace($gReplaced, '(c|g|k|l|s|w)h', '$1}')"/>
<xsl:variable name="lReplaced" select="replace($hReplaced, '(d|t)l', '$1~')"/>
<xsl:variable name="oReplaced" select="replace($lReplaced, 'oo', 'o¥')"/>
<xsl:variable name="sReplaced" select="replace($oReplaced, 'ts', 't¦')"/>
<xsl:variable name="wReplaced" select="replace($sReplaced, '(g|k)w', '$1§')"/>
<xsl:variable name="zReplaced" select="replace($wReplaced, 'dz', 'd©')"/>
<xsl:variable name="aposGone" select="replace($zReplaced, '''', '')"/>
<xsl:value-of select="$aposGone"/>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="noTerm">__NO DAKELH TERM IN THIS ENTRY.</xsl:variable>
<xsl:value-of select="$noTerm"/>
</xsl:otherwise>
</xsl:choose>
</xsl:function>
I can sort the actual records on the Dakelh Term passed through this function.
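To check the replacement scheme outside the XSLT pipeline, the same steps can be sketched in Python. This is only a rough port under some assumptions: the sample strings in the usage note are made up rather than real dictionary entries, and Python compares strings by code point, which may not match the collation the XSLT processor uses when sorting.

```python
import re

# Placeholder used when an entry has no term, copied from the XSLT function.
NO_TERM = "__NO DAKELH TERM IN THIS ENTRY."

# (pattern, replacement) pairs, applied in the same order as the XSLT variables.
# Every replacement character has a code point above "z".
DIGRAPH_RULES = [
    (r"ng", "n{"),
    (r"([cgklsw])h", r"\1}"),
    (r"([dt])l", r"\1~"),
    (r"oo", "o\u00a5"),            # yen sign
    (r"ts", "t\u00a6"),            # broken bar
    (r"([gk])w", r"\1" + "\u00a7"),  # section sign
    (r"dz", "d\u00a9"),            # copyright sign
]

# Precomposed accented vowels mapped to their plain equivalents.
ACCENTS = str.maketrans("\u00e1\u00e9\u00ed\u00f3\u00fa", "aeiou")


def tweak(term: str) -> str:
    """Return a sort key for a term, following the XSLT step by step."""
    if not term:
        return NO_TERM
    # lower-case() plus normalize-space(): trim and collapse whitespace.
    key = " ".join(term.lower().split())
    key = key.translate(ACCENTS)
    # Drop combining characters U+0331 and U+0332.
    key = re.sub("[\u0331\u0332]", "", key)
    for pattern, replacement in DIGRAPH_RULES:
        key = re.sub(pattern, replacement, key)
    # Apostrophes go last, so they still block digraph matches above.
    return key.replace("'", "")
```

With made-up strings, `sorted(["khas", "kuz", "kas"], key=tweak)` puts "khas" last, because its key is "k}as" and "}" sorts after "z". Removing apostrophes last also matters: "t's" keeps the key "ts", whereas "ts" itself is treated as a digraph.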
The long-term plan for rapid generation of spreadsheet output and niftier searches is to create a VIEW of the core table which contains all the data simply in the form of strings, which can be indexed, searched and rendered more rapidly. The first stage of this is figuring out how to turn the adaptive DB field types into string fields (or integers where they're simple integers). This is a bit tricky in the case of the one-to-many fields and the custom fields. This is a block of working code which covers a few of the field types:
CREATE VIEW `documents_view` AS SELECT
`doc_id` AS `doc_id`,
(SELECT `dt_name` FROM `docTypes` WHERE `dt_id` = `documents`.`doc_to_docTypes_id`) AS `doc_to_docTypes_id`,
`doc_archive` AS `doc_archive`,
(SELECT GROUP_CONCAT(`disTypes`.`dt_name` SEPARATOR ', ') FROM `docs_to_disTypes` INNER JOIN `disTypes` ON `disTypes`.`dt_id` = `docs_to_disTypes`.`dtd_disType_id_fk` INNER JOIN `documents` AS `temp_docs` ON `temp_docs`.`doc_id` = `docs_to_disTypes`.`dtd_doc_id_fk` WHERE `docs_to_disTypes`.`dtd_doc_id_fk` = `documents`.`doc_id`) AS `doc_disTypes`
FROM `documents`
This is based on JW's documents table. The first is a straight MdhIntField; the second is an MdhStrSelectField; the third is an MdhStrLookupField; and the fourth is the trickiest one so far (thanks Jamie and Greg), an MdhOneToManyField.
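For trying out the shape of this view without touching the real database, here is a small sketch using Python's built-in sqlite3 module. It is a stand-in, not the MySQL setup itself: SQLite writes the separator as a second argument to GROUP_CONCAT instead of using SEPARATOR, the column types are guesses, and all of the sample rows are invented.

```python
import sqlite3


def build_demo_view() -> list:
    """Create the four tables in memory, define the view, and return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE docTypes (dt_id INTEGER PRIMARY KEY, dt_name TEXT);
        CREATE TABLE disTypes (dt_id INTEGER PRIMARY KEY, dt_name TEXT);
        CREATE TABLE documents (doc_id INTEGER PRIMARY KEY,
                                doc_to_docTypes_id INTEGER,
                                doc_archive TEXT);
        CREATE TABLE docs_to_disTypes (dtd_doc_id_fk INTEGER,
                                       dtd_disType_id_fk INTEGER);

        -- Invented sample data.
        INSERT INTO docTypes VALUES (1, 'Letter'), (2, 'Report');
        INSERT INTO disTypes VALUES (1, 'Type A'), (2, 'Type B');
        INSERT INTO documents VALUES (1, 1, 'Archive X'), (2, 2, 'Archive Y');
        INSERT INTO docs_to_disTypes VALUES (1, 1), (1, 2);

        CREATE VIEW documents_view AS SELECT
            doc_id AS doc_id,
            (SELECT dt_name FROM docTypes
              WHERE dt_id = documents.doc_to_docTypes_id) AS doc_to_docTypes_id,
            doc_archive AS doc_archive,
            (SELECT GROUP_CONCAT(disTypes.dt_name, ', ')
               FROM docs_to_disTypes
               INNER JOIN disTypes
                  ON disTypes.dt_id = docs_to_disTypes.dtd_disType_id_fk
               INNER JOIN documents AS temp_docs
                  ON temp_docs.doc_id = docs_to_disTypes.dtd_doc_id_fk
              WHERE docs_to_disTypes.dtd_doc_id_fk = documents.doc_id) AS doc_disTypes
        FROM documents;
    """)
    rows = conn.execute("SELECT * FROM documents_view ORDER BY doc_id").fetchall()
    conn.close()
    return rows
```

In this sketch the document with two linked rows gets both names joined into a single string, and the document with no linked rows gets NULL in the one-to-many column.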
Once I've worked through generation code for all of the field types, and tested it, I need to add an abstract function called getViewGenerationCode() to MdhBaseField, and implement it in the descendants, so that MdhRecord can call this function on all its fields and use it to construct an SQL command that will generate a view.
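A rough sketch of how that design might look, written in Python purely for illustration. Only MdhBaseField, MdhIntField, MdhRecord and the idea of getViewGenerationCode() come from the notes above; the constructor arguments, the NameLookupField class, build_view_sql() and the snake_case method name are all hypothetical.

```python
from abc import ABC, abstractmethod


class MdhBaseField(ABC):
    """Base class: every field must say how it appears in the view."""

    def __init__(self, column: str):
        self.column = column

    @abstractmethod
    def get_view_generation_code(self, table: str) -> str:
        """Return one column expression for the SELECT list of the view."""


class MdhIntField(MdhBaseField):
    def get_view_generation_code(self, table: str) -> str:
        # Plain columns pass straight through.
        return f"`{self.column}` AS `{self.column}`"


class NameLookupField(MdhBaseField):
    """Hypothetical stand-in for a field that stores a key into another table."""

    def __init__(self, column: str, lookup_table: str, key_col: str, label_col: str):
        super().__init__(column)
        self.lookup_table = lookup_table
        self.key_col = key_col
        self.label_col = label_col

    def get_view_generation_code(self, table: str) -> str:
        # Replace the stored key with the human-readable label.
        return (
            f"(SELECT `{self.label_col}` FROM `{self.lookup_table}` "
            f"WHERE `{self.key_col}` = `{table}`.`{self.column}`) AS `{self.column}`"
        )


class MdhRecord:
    def __init__(self, table: str, fields: list):
        self.table = table
        self.fields = fields

    def build_view_sql(self) -> str:
        # Ask each field for its fragment and join them into one statement.
        columns = ",\n".join(f.get_view_generation_code(self.table) for f in self.fields)
        return f"CREATE VIEW `{self.table}_view` AS SELECT\n{columns}\nFROM `{self.table}`"
```

The point of the abstract method is that the record never needs to know which field types it holds: adding a new field type only means implementing one more fragment generator.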
One issue that remains unresolved (we're still working on it) is what happens when a one-to-many field exceeds MySQL's group_concat_max_len. The default value is 1024 bytes, which works out to 341 characters in a three-byte utf8 character set, and that is presumably why any field in a view created with group_concat currently seems to end up as varchar(341). We don't yet know what will happen if the output of the group_concat is longer than that: will it be truncated, or will the view field be expanded? The former seems more likely, unless we can increase the group_concat_max_len config setting, which can apparently be done for a single session (SET @@group_concat_max_len = 9999999;). Jury's out on this until we can do more research and testing.
RES-Z-1746-P-114.jpg arrived. I've made a start on cleaning it up, then I'll build the other sizes and the XML file.
Final meeting of PDSA committee.
To assist KSW with automating some of the name markup, I've been generating lists of distinct values for name variants, using this code:
xquery version "1.0";
declare namespace xdb="http://exist-db.org/xquery/xmldb";
declare namespace util="http://exist-db.org/xquery/util";
declare namespace f="http://exist-db.org/f-functions";
declare namespace tei="http://www.tei-c.org/ns/1.0";
(:declare namespace fn="http://www.w3.org/2005/xpath-functions";:)
declare function f:getContents($id as xs:string) as element()*
{
for $d in distinct-values(collection('/db/coldesp/correspondence/')//tei:name[not(@type)][@key = $id])
return
<name>{$d}</name>
};
<people>
{for $id in distinct-values(collection('/db/coldesp/bios/')//tei:person/@xml:id)
return
<person xml:id="{$id}">
{f:getContents($id)}
</person>}
</people>
Throws up some interesting things that look like they might be typos, as well as many names that don't seem to have any mentions in the text. I'm investigating.
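For spot-checking a single file outside eXist, the grouping the query performs can be approximated in Python with the standard library. This is a simplified sketch: it reads one XML string rather than the whole collection, and the person ids and sample markup in the usage note are invented.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def name_variants(correspondence_xml: str, person_ids: list) -> dict:
    """Map each person id to the sorted distinct name strings keyed to it."""
    root = ET.fromstring(correspondence_xml)
    found = {pid: set() for pid in person_ids}
    for name in root.iter(f"{{{TEI_NS}}}name"):
        key = name.get("key")
        # Mirror the [not(@type)][@key = $id] predicates in the query.
        if name.get("type") is None and key in found:
            # The string value of the element: all descendant text, joined.
            found[key].add("".join(name.itertext()))
    return {pid: sorted(variants) for pid, variants in found.items()}
```

Given a fragment in which the invented id "smith_j" is tagged twice as "Smith" and once as "Mr. Smith", the function returns two variants for that id, and an empty list for any id that is never mentioned, which is the same pattern that shows up in the real output.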
We're now working out of a Subversion repository instead of editing XML directly in home1t; this is basically essential, since we have five or more people editing files in the same collection. I've set up LSPW and EGB on their editing machines with Subversion and instructions, and all seems to be going well. I've also set up Radish so that it's ready for GBS when she comes in on Monday. I haven't yet set up the Mac for LCC.
Went to web dev meeting about Cascade, which looks very impressive. Looking forward to working with it.
Five timesheets completed and submitted.
Met with SD today so that I could get more familiar with the project. SD went over how he'd like the data displayed, and we talked about the current state of the project in general. He is pleased with the progress and is excited to see what lies ahead. He'll double-check the list of crimes, respites and outcomes before I send him the XML file for proofing. Once we nail down the conversion and validation process for the current data set (1800s), the rest of the data (i.e. the rest of the 19th century) should fall into place relatively quickly.
So, the project is progressing well, and both SD and I are happy with the results thus far.