<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="paper_170_butler">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Improving Access to Encoded Primary Texts</title>
            <author>
               <name reg="Butler, Terry">Terry Butler</name>
            </author>
            <respStmt>
               <resp>Marked up by </resp>
               <name reg="Holmes, Martin">Martin Holmes</name>
               <lb/>
               <name reg="Baer, Patricia">Patricia Baer</name>
            </respStmt>
         </titleStmt>
         <publicationStmt>
            <p>Marked up to be included in the ACH/ALLC 2005 Conference Abstracts book.</p>
         </publicationStmt>
         <sourceDesc>
            <p>None</p>
         </sourceDesc>
      </fileDesc>
      <profileDesc>
         <textClass>
            <classCode>paper</classCode>
            <keywords>
               <list>
                  <item>text encoding</item>
                  <item>automatic text processing</item>
                  <item>online delivery of primary text</item>
               </list>
            </keywords>
         </textClass>
      </profileDesc>
      <revisionDesc>
         <list>
            <item>MDH: Created from John Bradley's XML <date value="2005-03">March 2005</date>
            </item>
            <item>PAB: Marked up <date value="2005-03-10">10 March 2005</date>
            </item>
            <item>MDH: Made revisions from author's proofing <date value="2005-03-11">11 March 2005</date>
            </item>
         </list>
      </revisionDesc>
   </teiHeader>
   <text>
      <front>
         <docTitle n="Improving Access to Encoded Primary Texts">
            <titlePart>Improving Access to Encoded Primary Texts</titlePart>
         </docTitle>
         <docAuthor>
            <name reg="Butler, Terry">Terry Butler</name>
            <address>
               <addrLine>Terry.Butler@UAlberta.ca</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">University of Alberta</titlePart>
      </front>
      <body>
         <div0>
            <head>The Access Problem</head>
            <p>An impressive amount of our literary heritage has now been put into digital editions. Much of it is encoded in XML, often using recognized encoding standards such as the TEI. One of the primary scholarly goals behind this activity has been to increase access to the texts, both by publishing them online and by making them amenable to searching. The XML tagging adds further value for searching and display. Metadata, however, where it exists at all, is mostly at the collection level, or provides only a broad guide to the contents of a specific work.</p>
            <p>Between high-level metadata access and a direct search on the word forms of the text, there is little help for the reader. Because of the immense labour involved in creating detailed subject indexing, very few scholarly electronic texts have indexes or finding aids that would draw the reader to specific sections of the work.</p>
            <p>To address this deficiency, we have made a first trial of automatic indexing on a substantial non-fiction work. The notebooks of Samuel Taylor Coleridge are a rich treasury of the thought of one of the 19th century's most important intellectuals. Comprising over 6,500 individual entries (ranging in scope from a single phrase to complete essays), they are a valuable record of his thought and of the active intellectual currents of the time. We have captured the text of the notebooks in electronic form, encoded with TEI. As a first step to building a coherent subject index to this material, we have generated a mapping between this material and a contemporary subject index (Roget's first edition of his celebrated <title level="m">Thesaurus</title>).</p>
            <p>Our strategy has been to construct connections between the conceptual categories in the <title level="m">Thesaurus</title> and Coleridge's individual notes, based upon a weighted measure of similarity between the words of the note and the terms and sub-terms in the <title level="m">Thesaurus</title>. Common words are weighted lightly; rarely used words heavily. Using this measure, we can connect each note to one or more thesaurus entries, which then makes the note accessible to searching through the thesaural categories. Implementing these connections through topic map technology, we have a stand-off tagging structure that relates these two encoded works but still leaves both of them unchanged, available to be delivered and shared with colleagues.</p>
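            <p>The weighting idea described above can be illustrated with a minimal sketch. It is not the project's implementation: it assumes an inverse-frequency weight (a word occurring in few notes counts for more than one occurring in many), a simple overlap score normalised by the total weight of the note's words, and an arbitrary cut-off. The function names, the threshold, the sample notes, and the category headings are all hypothetical and are not taken from the <title level="m">Notebooks</title> or from Roget. The resulting links are held in a separate structure, as a simplified stand-in for the stand-off topic map associations, so neither text is altered.</p>
            <p><eg><![CDATA[
```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word forms."""
    return re.findall(r"[a-z]+", text.lower())


def inverse_frequency_weights(notes):
    """Weight each word by log(N / notes containing it) + 1.

    Words found in every note get the minimum weight of 1.0;
    rarely used words get a higher weight.
    """
    doc_freq = Counter()
    for text in notes.values():
        doc_freq.update(set(tokenize(text)))
    n = len(notes)
    return {word: math.log(n / df) + 1.0 for word, df in doc_freq.items()}


def score(note_text, category_terms, weights):
    """Weighted share of the note's distinct words found in the category."""
    words = set(tokenize(note_text))
    if not words:
        return 0.0
    total = sum(weights.get(w, 1.0) for w in words)
    shared = sum(weights.get(w, 1.0) for w in words & category_terms)
    return shared / total


def map_notes(notes, categories, threshold=0.05, top_n=3):
    """Return stand-off links: note id -> best-matching category names.

    The threshold and top_n values are arbitrary assumptions.
    """
    weights = inverse_frequency_weights(notes)
    term_sets = {
        name: {tok for term in terms for tok in tokenize(term)}
        for name, terms in categories.items()
    }
    links = {}
    for note_id, text in notes.items():
        scored = [(score(text, ts, weights), name) for name, ts in term_sets.items()]
        kept = [pair for pair in scored if pair[0] >= threshold]
        # Highest score first; ties broken alphabetically by category name.
        ranked = sorted(kept, key=lambda pair: (-pair[0], pair[1]))
        links[note_id] = [name for _, name in ranked[:top_n]]
    return links


# Hypothetical sample data, invented for illustration only.
NOTES = {
    "n1": "The imagination dissolves and diffuses in order to recreate",
    "n2": "Observed the light of the moon upon the water and the motion of the waves",
    "n3": "The imagination and the fancy",
}
CATEGORIES = {
    "Imagination": ["imagination", "fancy", "invention"],
    "Light": ["light", "moon", "ray"],
    "Motion": ["motion", "movement"],
}

if __name__ == "__main__":
    for note_id, names in map_notes(NOTES, CATEGORIES).items():
        print(note_id, names)
```
]]></eg></p>
            <p>In this sketch a note may be linked to more than one category, and a search on a category name returns every note linked to it. A fuller treatment would at least normalise inflected forms (for example "waves" and "wave"), which this sketch does not attempt.</p>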
            <p>This presentation will describe the process by which we create an appropriate mapping between Coleridge's text and Roget's hierarchy, demonstrate the environment for creating and managing the stand-off tagging, and describe the utility of the resulting product.</p>
            <p>The resulting structure illustrates three important advantages for access to scholarly text: the index connects and relates sections of the text to larger, consistent conceptual categories; it provides access for searching that is complementary to the text's own idiosyncratic terminology; and it uses stand-off tagging to provide access without direct intervention in the electronic source text. This indexing structure, of value to researchers in its own right, is also the scaffolding upon which we will construct our subject index of the <title level="m">Notebooks</title>, using modern terminology and accessible conceptual categories.</p>
         </div0>
         <div0>
            <head>Background to the Project</head>
            <p>The notebooks of Samuel Taylor Coleridge are a valuable and almost unknown resource. Much of Coleridge's work as poet, philosopher, scientist, linguist, and theologian was published only partially and fitfully in his time; the notebooks contain some of his most innovative and interesting work. They have been published in print in five large double volumes (text and notes) by Princeton University Press, with indexes to selected titles, names, and places; but there is no subject index. The intention for the series was to publish a thematic index to the whole, as volume 6. However, we argued (to the Canadian Social Sciences and Humanities Research Council, who are funding this work) that at the present time an electronic index to the work would be of much greater utility to scholars who wish to know how Coleridge's thought emerged and developed over the 40 years which these notebooks cover.</p>
            <p>The overall goals for the project include:
               <list type="unordered">
                  <item>creating an accurate electronic text of the entire notebook corpus;</item>
                  <item>creating an index and thesaurus for the notebooks as a first step toward a synthetic index to Coleridge's thought;</item>
                  <item>providing a web-based search and discovery system which will meet the needs of scholars, making his thought on a vast variety of topics more accessible.</item>
               </list>
            </p>
         </div0>
      </body>
      <back>
         <div type="Bibliography">
            <head>Bibliography</head>
            <listBibl>
               <biblStruct>
                  <monogr>
                     <author>
                        <name reg="Coleridge, S.T.">S.T. Coleridge</name>
                     </author>
                     <editor>
                        <name reg="Coburn, Kathleen">Kathleen Coburn</name>
                     </editor>
                     <title level="m">The Collected Works of Samuel Taylor Coleridge</title>
                     <title level="s">Bollingen Series 75</title>
                     <imprint>
                        <publisher>Princeton University Press</publisher>
                        <pubPlace>Princeton</pubPlace>
                        <date value="1969">1969</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <author>
                        <name reg="Hüllen, W.">W. Hüllen</name>
                     </author>
                     <title level="m">A history of Roget's thesaurus: origins, development, and design</title>
                     <imprint>
                        <publisher>Oxford University Press</publisher>
                        <pubPlace>Oxford</pubPlace>
                        <date value="2004">2004</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <author>
                        <name reg="Pepper, S.">S. Pepper</name>
                     </author>
                     <title level="m">The TAO of Topic Maps</title>
                     <imprint>
                        <date value="2001">2001</date>
                     </imprint>
                  </monogr>
                  <note>
                     <xptr crdate="2005-03-15" to="http://www.ontopia.net/topicmaps/materials/tao.html"/>
                  </note>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Sebastiani, F.">F. Sebastiani</name>
                     </author>
                     <title level="a">Machine learning in automated text categorization</title>
                  </analytic>
                  <monogr>
                     <title level="j">ACM Computing Surveys</title>
                     <imprint>
                        <biblScope type="vol">34.1</biblScope>
                        <biblScope type="pages">1-47</biblScope>
                        <date value="2002">2002</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <author>
                        <name reg="Thompson, H.S.">H.S. Thompson</name>
                     </author>
                     <author>
                        <name reg="McKelvie, D.">D. McKelvie</name>
                     </author>
                     <title level="m" type="WWW document">Hyperlink semantics for standoff markup of read-only documents</title>
                     <imprint>
                        <publisher>Language Technology Group, HCRC, University of Edinburgh</publisher>
                        <date value="1997-05">1997</date>
                     </imprint>
                  </monogr>
                  <note>
                     <xptr crdate="2005-03-15" to="http://www.ltg.ed.ac.uk/~ht/sgmleu97.html"/>
                  </note>
               </biblStruct>
            </listBibl>
         </div>
      </back>
   </text>
</TEI.2>