<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="paper_165_cayless">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>DocScapes: Visualizing Document Structures with SVG</title>
            <author>
               <name reg="Cayless, Hugh">Hugh Cayless</name>
            </author>
            <respStmt>
               <resp>Marked up by </resp>
               <name reg="Holmes, Martin">Martin Holmes</name>
               <lb/>
               <name reg="Baer, Patricia">Patricia Baer</name>
            </respStmt>
         </titleStmt>
         <publicationStmt>
            <p>Marked up to be included in the ACH/ALLC 2005 Conference Abstracts book.</p>
         </publicationStmt>
         <sourceDesc>
            <p>None</p>
         </sourceDesc>
      </fileDesc>
      <profileDesc>
         <textClass>
            <classCode>paper</classCode>
            <keywords>
               <list>
                  <item>visualization</item>
                  <item>TEI</item>
                  <item>SVG</item>
               </list>
            </keywords>
         </textClass>
      </profileDesc>
      <revisionDesc>
         <list>
            <item>MDH: Created from John Bradley's XML <date value="2005-03-09">09 March 2005</date>
            </item>
            <item>MDH: Marked up <date value="2005-03-10">10 March 2005</date>
            </item>
         </list>
      </revisionDesc>
   </teiHeader>
   <text>
      <front>
         <docTitle n="DocScapes: Visualizing Document Structures with SVG">
            <titlePart>
               <title level="m">DocScapes</title>: Visualizing Document Structures with SVG</titlePart>
         </docTitle>
         <docAuthor>
            <name reg="Cayless, Hugh">Hugh Cayless</name>
            <address>
               <addrLine>hcayless@lulu.com</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">Lulu (<xptr to="http://lulu.com"/>)</titlePart>
      </front>
      <body>
         <div0>
            <p>Searching for and browsing documents online can be frustrating. Search results typically treat documents as atomic units rather than as structured collections of information. This paper proposes some ideas for enhancing search and browsing by producing graphical <soCalled>document-scapes</soCalled> that summarize document characteristics and provide links into the content of documents. Such a summary can compensate for some of the visual cues, available when browsing bookshelves, that are lost in the digital environment. It is possible to summarize document size, structure, density, and the presence of metadata visually, so that users can tell at a glance the difference between (for example) an interview and a monograph, or a play and a catalog. The work in this paper focuses on a particular vocabulary of document markup, TEI, and a particular collection, <title level="m">Documenting the American South</title> at the University of North Carolina at Chapel Hill (<xptr to="http://docsouth.unc.edu"/>).</p>
            <p>A great deal of work has been done on the visualization of collections and search results (see <xptr to="http://www.cs.umd.edu/hcil/research/visualization.shtml"/> for a summary of online material). There is, however, a remarkable paucity of scholarship focusing on the visualization of documents themselves. No doubt this has to do with the difficulties of dealing with heterogeneous collections. Comparing the varying structures of text, XML, and PDF documents, for example, might not be an especially useful exercise. The technique discussed in this paper can easily be applied to relatively homogeneous collections of XML documents, however, and could in theory be generalized to other document types.</p>
            <p>The techniques used in this project are relatively simple. Essentially, they involve transforming XML from one vocabulary to another, in this case TEI to SVG. Scalable Vector Graphics (SVG) is an XML vocabulary for describing vector graphics. This means that the structure of a document in, for example, TEI can be turned into an image via the same processes used to display the document in HTML or to convert it to PDF for printing. Since other document formats can be parsed to generate SAX (Simple API for XML) events, they too could be fed into an XML processing pipeline and turned into <title level="m">DocScape</title> images.</p>
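            <p>As an illustration of this kind of vocabulary-to-vocabulary transformation, the following XSLT 1.0 stylesheet is a minimal sketch and not the stylesheet used to produce the <title level="m">DocScape</title> images. It assumes un-namespaced TEI P4 input with a <hi rend="code">&lt;TEI.2&gt;</hi> root and draws one bar per top-level division of the body. The bar length (four units per paragraph), the colors, and the dimensions are arbitrary choices made for the example.</p>
            <eg><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/2000/svg">

  <xsl:output method="xml" indent="yes"/>

  <!-- One SVG image per TEI P4 document (un-namespaced TEI.2 root). -->
  <xsl:template match="/TEI.2">
    <!-- Top-level divisions of the body, numbered or unnumbered. -->
    <xsl:variable name="divs"
        select="text/body/div | text/body/div0 | text/body/div1"/>
    <svg version="1.1" width="100%" height="{count($divs) * 14 + 6}">
      <xsl:for-each select="$divs">
        <!-- Illustrative scale: 4 units of bar length per paragraph. -->
        <rect x="3" y="{(position() - 1) * 14 + 3}"
              width="{count(.//p) * 4 + 4}" height="10"
              fill="#9c9" stroke="#363"/>
      </xsl:for-each>
    </svg>
  </xsl:template>
</xsl:stylesheet>
]]></eg>
            <p>Because TEI P4 elements are in no namespace, the unprefixed names in the XPath expressions select them directly, while the literal result elements fall in the SVG namespace declared as the stylesheet's default. Any XSLT 1.0 processor should be able to apply a stylesheet of this kind.</p>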
            <p>There are a number of variables which may be used to distinguish documents marked up in TEI without recourse to semantic distinctions like subject vocabularies. Since TEI documents are subdivided by division (<hi rend="code">&lt;div&gt;</hi>, <hi rend="code">&lt;divN&gt;</hi>, <hi rend="code">&lt;front&gt;</hi>, <hi rend="code">&lt;back&gt;</hi>, etc.), each document has its own internal structure. Different types of document may have very different internal structures. For example, a dictionary will consist of a set of entries (<hi rend="code">&lt;entry&gt;</hi> tags) inside its divisions while a monograph will contain chapters, sections, and paragraphs (<hi rend="code">&lt;p&gt;</hi>). The relative size and structure of nested divisions can be represented graphically in a fairly compact space. Differing types of content, on the other hand, can be represented using color.</p>
            <p>TEI documents also differ in size, and this can be an important metric. Size can be represented visually in a number of ways. <title level="m">DocSouth</title>'s collection varies widely in absolute size, from short pamphlets to large books and government documents (up to 800 pages in length). The representation of relative size must therefore be considered quite carefully. The first iteration of <title level="m">DocScapes</title> did this using border thickness: one pixel was added to the border width for each 100 pages. This scale cannot capture the important distinction between the moderately sized document (10-50 pages) and the very short one (1-2 pages), a distinction which encompasses important differences of genre. The next generation of <title level="m">DocScapes</title> will use more complex SVG capabilities, such as drop shadows, to indicate relative size.</p>
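            <p>The border-thickness rule can be sketched in XSLT 1.0 as follows. This is an illustration under stated assumptions rather than the original implementation: the page count is approximated by counting <hi rend="code">&lt;pb&gt;</hi> elements in the text, and a base border of one pixel is assumed before the per-100-page increment is added.</p>
            <eg><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/2000/svg">

  <xsl:template match="/TEI.2">
    <!-- Assumption: page count is approximated by counting pb elements. -->
    <xsl:variable name="pages" select="count(text//pb)"/>
    <!-- Assumption: 1 pixel base border, plus 1 pixel per full 100 pages. -->
    <xsl:variable name="border" select="1 + floor($pages div 100)"/>
    <svg version="1.1" width="220" height="140">
      <rect x="10" y="10" width="200" height="120"
            fill="none" stroke="black" stroke-width="{$border}"/>
    </svg>
  </xsl:template>
</xsl:stylesheet>
]]></eg>
            <p>Under these assumptions a document with 2 page breaks and one with 50 both receive a 1-pixel border, while one with 800 receives 9 pixels, which shows why a linear scale of this kind fails to separate the shorter genres.</p>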
            <p>Another important metric is the relative size and complexity of the TEI Header metadata. <title level="m">DocSouth</title>, whose documents are largely derived from catalogued library holdings, has very detailed and thorough header information. By contrast, a TEI document that was <soCalled>born digital</soCalled> might have fairly minimal metadata. Visually distinguishing levels of metadata density should be useful to collection managers and searchers alike.</p>
            <p>A <title level="m">DocScape</title> image is composed of the elements outlined above: the document itself, any header metadata, and structural container elements (e.g. <hi rend="code">&lt;div&gt;</hi>s in TEI, <hi rend="code">&lt;section&gt;</hi>s in <title level="m">DocBook</title>, etc.). The four TEI Header sections are represented by blocks of color at the top of the image. The nested divisions are visualized as nested blocks, laid out first left-to-right, then top-to-bottom, and so on. Successive nesting levels alternate between the two ends of the light/dark scale: top-level containers are light green, their children dark green, and so on. In addition, the image attempts to quantify the number of paragraphs per page or section using color saturation. The relative size of the document is indicated by the border thickness of the entire image (see figures 1 and 2).</p>
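            <p>One possible reading of this layout is sketched below in XSLT 1.0. It is an illustration, not the stylesheet behind the figures. It assumes that sibling divisions share their parent's area equally, that the layout direction alternates with nesting depth, and that the hex colors stand in for the light and dark greens; the saturation encoding of paragraph density is not modeled. The named template <hi rend="code">block</hi> and its parameters are hypothetical.</p>
            <eg><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/2000/svg">

  <xsl:template match="/TEI.2">
    <xsl:variable name="top"
        select="text/body/div | text/body/div0 | text/body/div1"/>
    <svg version="1.1" width="320" height="240">
      <!-- Header strip: one block per teiHeader section that is present. -->
      <xsl:for-each select="teiHeader/*">
        <rect x="{10 + (position() - 1) * 75}" y="10"
              width="70" height="16" fill="#69c"/>
      </xsl:for-each>
      <!-- Top-level divisions run left to right below the header strip. -->
      <xsl:for-each select="$top">
        <xsl:call-template name="block">
          <xsl:with-param name="x"
              select="10 + (position() - 1) * (300 div count($top))"/>
          <xsl:with-param name="y" select="32"/>
          <xsl:with-param name="w" select="300 div count($top)"/>
          <xsl:with-param name="h" select="198"/>
          <xsl:with-param name="depth" select="0"/>
        </xsl:call-template>
      </xsl:for-each>
    </svg>
  </xsl:template>

  <!-- Draws the current division, then recurses into its child divisions. -->
  <xsl:template name="block">
    <xsl:param name="x"/>
    <xsl:param name="y"/>
    <xsl:param name="w"/>
    <xsl:param name="h"/>
    <xsl:param name="depth"/>
    <!-- Even depths are light green, odd depths dark green. -->
    <xsl:variable name="fill">
      <xsl:choose>
        <xsl:when test="$depth mod 2 = 0">#cfc</xsl:when>
        <xsl:otherwise>#060</xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <rect x="{$x}" y="{$y}" width="{$w}" height="{$h}"
          fill="{$fill}" stroke="#030"/>
    <xsl:variable name="kids"
        select="div | div1 | div2 | div3 | div4 | div5 | div6 | div7"/>
    <xsl:variable name="n" select="count($kids)"/>
    <!-- Stop recursing once the block cannot hold a 3-unit inset. -->
    <xsl:if test="$w > 6 and $h > 6">
      <xsl:for-each select="$kids">
        <xsl:variable name="i" select="position() - 1"/>
        <xsl:choose>
          <xsl:when test="$depth mod 2 = 0">
            <!-- Children of an even-depth block stack top to bottom. -->
            <xsl:call-template name="block">
              <xsl:with-param name="x" select="$x + 3"/>
              <xsl:with-param name="y"
                  select="$y + 3 + $i * (($h - 6) div $n)"/>
              <xsl:with-param name="w" select="$w - 6"/>
              <xsl:with-param name="h" select="($h - 6) div $n"/>
              <xsl:with-param name="depth" select="$depth + 1"/>
            </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
            <!-- Children of an odd-depth block run left to right. -->
            <xsl:call-template name="block">
              <xsl:with-param name="x"
                  select="$x + 3 + $i * (($w - 6) div $n)"/>
              <xsl:with-param name="y" select="$y + 3"/>
              <xsl:with-param name="w" select="($w - 6) div $n"/>
              <xsl:with-param name="h" select="$h - 6"/>
              <xsl:with-param name="depth" select="$depth + 1"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
]]></eg>
            <p>Splitting the area equally keeps the sketch short. Sizing each block by the amount of text it contains, as the description of relative division size suggests, would need an additional weighting step that is omitted here.</p>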
            <figure rend="ImageLink">
               <head>Figure 1</head>
               <p>
                  <xref>paper_165_cayless_1.jpg</xref>
               </p>
               <figDesc>Figure 1</figDesc>
            </figure>
            <figure rend="ImageLink">
               <head>Figure 2</head>
               <p>
                  <xref>paper_165_cayless_2.jpg</xref>
               </p>
               <figDesc>Figure 2</figDesc>
            </figure>
            <p>Figure 1 provides a good example of a document with a very heterogeneous internal structure. The first section is a catalog, with many nested TEI <hi rend="code">&lt;div&gt;</hi>s, while the following divisions are more narrative in nature. Figure 2, on the other hand, represents an interview. The more densely packed paragraph structure in this document is represented by the lighter shade of green in the nested sections.</p>
            <p>In addition to these basic elements, it is possible to use the capabilities of SVG to group many documents on a single page and dynamically zoom into the ones that are of interest. The document sections may also be linked to the documents themselves, so that it is possible to drill into the texts from their visual representations. Finally, it is possible to layer other information, such as the occurrence of search terms, onto the documents. Figure 3 is an example of a <title level="m">DocScape</title> with personal names, locations, and dates plotted on the image surface. My paper will outline the techniques and principles involved in developing <title level="m">DocScape</title> visualizations and will discuss ways in which they may be used in digital libraries as a means to browse textual content.</p>
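            <p>Linking a block back to the text it represents can be done with the SVG <hi rend="code">&lt;a&gt;</hi> element. The XSLT 1.0 sketch below is a hedged illustration only: the <hi rend="code">docUrl</hi> parameter and its default value are hypothetical, and the sketch assumes that an HTML rendering of the document exposes anchors matching the TEI <hi rend="code">id</hi> attributes of its divisions.</p>
            <eg><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/2000/svg"
    xmlns:xlink="http://www.w3.org/1999/xlink">

  <!-- Hypothetical parameter: address of the HTML rendering of the document. -->
  <xsl:param name="docUrl" select="'document.html'"/>

  <xsl:template match="/TEI.2">
    <!-- Only top-level divisions that carry an id can be link targets. -->
    <xsl:variable name="divs"
        select="text/body/*[(self::div or self::div0 or self::div1) and @id]"/>
    <svg version="1.1" width="320" height="{count($divs) * 14 + 6}">
      <xsl:for-each select="$divs">
        <a xlink:href="{$docUrl}#{@id}">
          <!-- The division heading, if any, becomes the link's title. -->
          <title><xsl:value-of select="head"/></title>
          <rect x="3" y="{(position() - 1) * 14 + 3}"
                width="300" height="10" fill="#9c9" stroke="#363"/>
        </a>
      </xsl:for-each>
    </svg>
  </xsl:template>
</xsl:stylesheet>
]]></eg>
            <p>Some SVG viewers display the <hi rend="code">&lt;title&gt;</hi> content as a tooltip, which would give the reader a preview of a section before following the link.</p>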
            <figure rend="ImageLink">
               <head>Figure 3</head>
               <p>
                  <xref>paper_165_cayless_3.jpg</xref>
               </p>
               <figDesc>Figure 3</figDesc>
            </figure>
         </div0>
      </body>
      <back>
         <div type="Bibliography">
            <head>Bibliography</head>
            <listBibl>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Börner, K.">K. Börner</name>
                     </author>
                     <title level="a">Extracting and Visualizing Semantic Structures in Retrieval Results for Browsing</title>
                  </analytic>
                  <monogr>
                     <title level="m">Proceedings of the Fifth ACM Conference on Digital Libraries</title>
                     <imprint>
                        <date value="2000">2000</date>
                        <biblScope type="pages">234-235</biblScope>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <author>
                        <name reg="Campesato, O.">O. Campesato</name>
                     </author>
                     <title level="m">Fundamentals of SVG Programming: Concepts to Source Code</title>
                     <imprint>
                        <publisher>Charles River Media, Inc.</publisher>
                        <pubPlace>Hingham, MA</pubPlace>
                        <date value="2003">2003</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <author>
                        <name reg="Clark, James">James Clark</name>
                     </author>
                     <author>
                        <name reg="DeRose, Steve">Steve DeRose</name>
                     </author>
                     <title level="m" type="WWW document">XML Path Language (XPath), Version 1.0 (W3C Recommendation)</title>
                     <imprint>
                        <publisher>W3C</publisher>
                        <date value="1999">1999</date>
                     </imprint>
                  </monogr>
                  <note>
                     <xptr crdate="2005-03-15" to="http://www.w3.org/TR/1999/REC-xpath-19991116"/>
                  </note>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <author>
                        <name reg="Clark, James">James Clark</name>
                     </author>
                     <title level="m" type="WWW document">XSL Transformations (XSLT), Version 1.0 (W3C Recommendation)</title>
                     <imprint>
                        <publisher>W3C</publisher>
                        <date value="1999">1999</date>
                     </imprint>
                  </monogr>
                  <note>
                     <xptr crdate="2005-03-15" to="http://www.w3.org/TR/1999/REC-xslt-19991116"/>
                  </note>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <author>
                        <name reg="Ferraiolo, Jon">Jon Ferraiolo</name>
                     </author>
                     <author>
                        <name reg="Fujisawa, Jun">Jun Fujisawa</name>
                     </author>
                     <author>
                        <name reg="Jackson, Dean">Dean Jackson</name>
                     </author>
                     <title level="m" type="WWW document">Scalable Vector Graphics (SVG) 1.1 Specification (W3C Recommendation)</title>
                     <imprint>
                        <publisher>W3C</publisher>
                        <date value="2003">2003</date>
                     </imprint>
                  </monogr>
                  <note>
                     <xptr crdate="2005-03-15" to="http://www.w3.org/TR/2003/REC-SVG11-20030114"/>
                  </note>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <title level="m">
                        <name reg="Documenting the American South">Documenting the American South</name>
                     </title>
                     <imprint>
                        <publisher>University of North Carolina at Chapel Hill</publisher>
                     </imprint>
                  </monogr>
                  <note>
                     <xptr crdate="2005-03-15" to="http://docsouth.unc.edu"/>
                  </note>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <title level="m">
                        <name reg="Visualization">Visualization</name>
                     </title>
                     <imprint>
                        <publisher>Human-Computer Interaction Lab / University of Maryland</publisher>
                     </imprint>
                  </monogr>
                  <note>
                     <xptr crdate="2005-03-15"
                           to="http://www.cs.umd.edu/hcil/research/visualization.shtml"/>
                  </note>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Hornbæk, K.">K. Hornbæk</name>
                     </author>
                     <author>
                        <name reg="Frøkjær, Erik">Erik Frøkjær</name>
                     </author>
                     <title level="a">Reading Patterns and Usability in Visualizations of Electronic Documents</title>
                  </analytic>
                  <monogr>
                     <title level="j">ACM Transactions on Computer-Human Interaction (TOCHI)</title>
                     <imprint>
                        <biblScope type="vol">10.2</biblScope>
                        <biblScope type="pages">119-149</biblScope>
                        <date value="2003">2003</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <editor>
                        <name reg="Sperberg-McQueen, C.M.">C.M. Sperberg-McQueen</name>
                     </editor>
                     <editor>
                        <name reg="Burnard, L.">L. Burnard</name>
                     </editor>
                     <title level="m">TEI P4: Guidelines for Electronic Text Encoding and Interchange</title>
                     <imprint>
                        <publisher>Text Encoding Initiative Consortium</publisher>
                        <date value="2002">2002</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <author>
                        <name reg="Venn, B.">B. Venn</name>
                     </author>
                     <title level="m" type="WWW document">Add Interactivity to Your SVG</title>
                     <imprint>
                        <publisher>IBM developerWorks</publisher>
                        <date value="2003-12-11">11 December 2003</date>
                     </imprint>
                  </monogr>
                  <note>
                     <xptr crdate="2005-03-15"
                           to="http://www-106.ibm.com/developerworks/web/library/x-svgint/"/>
                  </note>
               </biblStruct>
            </listBibl>
         </div>
      </back>
   </text>
</TEI.2>