<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="panel_201_downie">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>A Revolutionary Approach to Humanities Computing?: Tools Development and the D2K Data-Mining Framework</title>
            <author>
               <name reg="Downie, J. Stephen">J. Stephen Downie</name>
            </author>
            <author>
               <name reg="Unsworth, John">John Unsworth</name>
            </author>
            <author>
               <name reg="Yu, Bei">Bei Yu</name>
            </author>
            <author>
               <name reg="Tcheng, David">David Tcheng</name>
            </author>
            <author>
               <name reg="Rockwell, Geoffrey">Geoffrey Rockwell</name>
            </author>
            <author>
               <name reg="Ramsay, Stephen J.">Stephen J. Ramsay</name>
            </author>
            <respStmt>
               <resp>Marked up by </resp>
               <name reg="Holmes, Martin">Martin Holmes</name>
               <lb/>
               <name reg="Baer, Patricia">Patricia Baer</name>
            </respStmt>
         </titleStmt>
         <publicationStmt>
            <p>Marked up to be included in the ACH/ALLC 2005 Conference Abstracts book.</p>
         </publicationStmt>
         <sourceDesc>
            <p>None</p>
         </sourceDesc>
      </fileDesc>
      <profileDesc>
         <textClass>
            <classCode>panel</classCode>
            <keywords>
               <list>
                  <item>text mining</item>
                  <item>data mining</item>
                  <item>tools development</item>
               </list>
            </keywords>
         </textClass>
      </profileDesc>
      <revisionDesc>
         <list>
            <item>MDH: Created from John Bradley's XML <date value="2005-03">March 2005</date>
            </item>
            <item>MDH: RS proofed and signed off without changes <date value="2005-05-18">18 May 2005</date>.</item>
         </list>
      </revisionDesc>
   </teiHeader>
   <text>
      <front>
         <docTitle n="A Revolutionary Approach to Humanities Computing?: Tools Development and the D2K Data-Mining Framework">
            <titlePart>A Revolutionary Approach to Humanities Computing?: Tools Development and the <title level="m">D2K</title> Data-Mining Framework</titlePart>
         </docTitle>
         <docAuthor>
            <name reg="Downie, J. Stephen">J. Stephen Downie</name>
            <address>
               <addrLine>jdownie@uiuc.edu</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">University of Illinois at Urbana-Champaign</titlePart>
         <docAuthor>
            <name reg="Unsworth, John">John Unsworth</name>
            <address>
               <addrLine>unsworth@uiuc.edu</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">University of Illinois at Urbana-Champaign</titlePart>
         <docAuthor>
            <name reg="Yu, Bei">Bei Yu</name>
            <address>
               <addrLine>beiyu@uiuc.edu</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">University of Illinois at Urbana-Champaign</titlePart>
         <docAuthor>
            <name reg="Tcheng, David">David Tcheng</name>
            <address>
               <addrLine>dtcheng@ncsa.uiuc.edu</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">University of Illinois at Urbana-Champaign</titlePart>
         <docAuthor>
            <name reg="Rockwell, Geoffrey">Geoffrey Rockwell</name>
            <address>
               <addrLine>georock@mcmaster.ca</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">McMaster University</titlePart>
         <docAuthor>
            <name reg="Ramsay, Stephen J.">Stephen J. Ramsay</name>
            <address>
               <addrLine>sramsay@uga.edu</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">University of Georgia</titlePart>
      </front>
      <body>
         <div0>
            <head>Introduction</head>
            <p>A new set of humanities computing (HC) research projects is underway
               that could revolutionize how the HC community works together to build,
               use, and share HC tools. The set of projects under consideration all
               play a role in the development work currently being done to extend the
               <title level="m">D2K</title> (Data-to-Knowledge)<note n="1">See <xptr to="http://alg.ncsa.uiuc.edu/do/tools/d2k"/>.</note> data-mining framework into the realm of
               HC. <hi rend="bold">John Unsworth</hi> and <hi rend="bold">Stephen J. Ramsay</hi> were recently
               awarded a significant Andrew W. Mellon Foundation grant<note n="2">See <xptr to="http://www.news.uiuc.edu/news/04/1025mellon.html"/>.</note> to develop
               a suite of HC data-mining tools using <title level="m">D2K</title> and its child framework, <title level="m">T2K</title>
               (Text-to-Knowledge). Drs. Unsworth and Ramsay, along with
               research assistant <hi rend="bold">Bei Yu</hi>,
               are working closely with <hi rend="bold">Geoffrey
                  Rockwell</hi>. Dr. Rockwell is the project leader for the <title level="m">CFI</title> (<title level="m">Canada
                     Foundation for Innovation</title>) funded project, <title level="m">TAPoR</title> (<title level="m">Text Analysis Portal
                        for Research</title>)<note n="3">
                  <xptr to="http://www.tapor.ca/"/>
               </note>, which is developing a text tool portal for
               researchers
               who work with electronic texts. <hi rend="bold">J.
                  Stephen Downie</hi> and <hi rend="bold">David Tcheng</hi>,
               through their work in creating the <title level="m">International Music Information
                  Retrieval Systems Evaluation Laboratory</title>
               (<title level="m">IMIRSEL</title>)<note n="4">See <xptr to="http://music-ir.org/evaluation"/>.</note>, are leading an international group of researchers to
               develop another <title level="m">D2K</title> child system called <title level="m">M2K</title>
               (<gloss>Music-to-Knowledge</gloss>). This panel session demonstrates how all of these
               projects come together to form a comprehensive whole. The session has
               four major themes designed, through presentations and demonstrations,
               to highlight the individual project components being developed and
               their collective impact on the future of HC research. These themes are:</p>
            <list type="ordered">
               <item>
                  <title level="m">D2K</title> as the overarching framework </item>
               <item>
                  <title level="m">T2K</title> and its ties
                  to traditional text-based HC techniques
               </item>
               <item>
                  <title level="m">M2K</title> and its ties to multimedia-based HC
                  techniques
               </item>
               <item>The issues surrounding the HC community's development,
                  validation, distribution, and re-use of <title level="m">D2K</title>/<title level="m">T2K</title>/<title level="m">M2K</title> modules. </item>
            </list>
         </div0>
         <div0>
            <head>Participants</head>
            <p><hi rend="bold">J. Stephen Downie</hi>, Graduate
            School of Library and Information Science (GSLIS), University of
            Illinois at Urbana-Champaign (UIUC)<lb/>
               <hi rend="bold">John Unsworth</hi>, GSLIS, UIUC<lb/>
               <hi rend="bold">Bei Yu</hi>, GSLIS, UIUC<lb/>
               <hi rend="bold">David Tcheng</hi>, National
            Center for Supercomputing Applications (NCSA), UIUC<lb/>
               <hi rend="bold">Geoffrey Rockwell</hi>, School
            of the Arts, McMaster University<lb/>
               <hi rend="bold">Stephen J. Ramsay</hi>, 
            Department of English, University of Georgia<lb/>
            </p>
         </div0>
         <div0>
            <head>Presentations, Demonstrations, and Discussions (in order)</head>
            <p/>
         </div0>
         <div0 type="EmbeddedDoc">
            <div1>
               <head>Overview of the NORA (No One
                  Remembers Acronyms) project</head>
               <p rend="Presenter">John Unsworth</p>
               <p>For decades, humanities computing
                  researchers have been developing software tools and statistical
                  techniques for text analysis, but those same researchers have not
                  succeeded in producing tools of interest to the majority of humanities
                  researchers, nor (with the exception of some very recent work in the
                  Canadian <title level="m">TAPoR</title> project) have they produced tools that work over the
                  web. Meanwhile, large collections of web-accessible structured texts in
                  the humanities have been created and collected by libraries over the
                  last fifteen years. During that same period, with
                  improvements in database and other information technologies, data-mining
                  has become a practical tool, albeit one mostly used in business
                  applications. We believe data-mining (or more specifically,
                  text-mining) techniques can be applied to digital library collections
                  to discover unanticipated patterns, for further exploration either
                  through traditional criticism or through web-based text analysis. 
                  Existing humanities e-text collections from Virginia, Michigan,
                  Indiana, North Carolina, and other research universities form the
                  corpus for the project. <title level="m">NORA</title> brings NCSA's <title level="m">D2K</title> data-mining architecture
                  to bear on the challenges of text-mining in digital libraries, with
                  special emphasis on leveraging markup, and on visualizations as
                  interface and as part of an iterative process of exploration.</p>
            </div1>
         </div0>
         <div0 type="EmbeddedDoc">
            <div1>
               <head>Introduction to the <title level="m">D2K</title> framework</head>
               <p rend="Presenter">David Tcheng </p>
               <p>Released in 1999, <title level="m">D2K</title> was developed by the Automated Learning Group
                  (ALG) at NCSA. <title level="m">D2K</title> has been used to solve many problems for both
                  industry (e.g., Sears, Caterpillar) and government agencies
                  (e.g., NSF, NASA, NIH). Academic uses include bioinformatics,
                  seismology, hydrology, and astronomy. <title level="m">D2K</title> uses a data flow
                  paradigm where a <soCalled>program</soCalled> is a network (directed graph) of processing
                  modules. Modules can be <soCalled>primitive</soCalled>, defined as a single piece of
                  source code that implements a single well-defined task, or
                  <soCalled>nested</soCalled>, defined as a network of previously defined <title level="m">D2K</title>
                  modules. Decomposing programs into modules that each implement a
                  well-defined input-output relationship promotes the creation of
                  reusable code. Nesting modules into higher-level modules helps to
                  manage complexity. <title level="m">D2K</title> parallelizes across any number of different
                  computers by simply running a copy of "D2K Server" on each
                  available machine. The <title level="m">D2K</title> software distribution comes as a basic <title level="m">D2K</title>
                  package, with core modules capable of doing general purpose
                  data-mining, as well as such task-specific add-on packages as text
                  analysis (T2K), image analysis (I2K), and now music analysis (M2K).
               </p>
            </div1>
         </div0>
         <div0 type="EmbeddedDoc">
            <div1>
               <head>Introduction to <title level="m">T2K</title>
               </head>
               <p rend="Presenter">Bei Yu</p>
               <p>Like
               many data-mining tools, <title level="m">T2K</title> implements a number of automatic
               classification and clustering algorithms. Compared to commercial
               text mining tools such as SAS Text Miner, <title level="m">T2K</title> has richer NLP
               preprocessing tools, especially after its integration with GATE. These
               include a stemmer, a tokenizer, a PoS tagger, and data-cleaning and named-entity
               extraction tools. The clustering visualization is tailored for thematic
               analysis. On one hand, <title level="m">T2K</title> provides a text mining platform for the HC
               community. On the other hand, <title level="m">T2K</title> is also a platform for automating HC
               research results and thus facilitating their application in the text
               mining community in general. For example, most text mining tasks
               are still topicality-oriented, but affect analysis has emerged in
               the last couple of years. The affect of a document includes its
               subjectivity or objectivity, its positive, neutral, or negative attitude, and
               the strength of the emotions it expresses. Some researchers have adapted stylistic
               analysis techniques from HC to analyze customer reviews. The
               non-thematic features found in this way can also be used as predictors of document
               genre, readability, clarity, and many other document properties. </p>
            </div1>
         </div0>
         <div0 type="EmbeddedDoc">
            <div1>
               <head>The TAPoR portal and <title level="m">D2K</title>
               </head>
               <p rend="Presenter">Geoffrey Rockwell</p>
               <p>
                  <title level="m">TAPoR</title>
                  has released an alpha of the portal and will have the beta ready by
                  June 2005. The portal is designed to allow researchers to run tools
                  (which can be local or remote web services) on texts (which can be
                  local or remote). The <title level="m">TAPoR</title> portal has been designed to work with other
                  systems like <title level="m">D2K</title> in three ways:</p>
               <list type="ordered">
                  <item>Particular tools or chains of tools can be <soCalled>published</soCalled> so that
                     they are available as post-process tools directly in the interface of
                     another system. Thus one can have a button that appears on the
                     appropriate results screens of a <title level="m">D2K</title> process that allows the user to
                     pass results to <title level="m">TAPoR</title> tools.</item>
                  <item>The portal has been released as open source, and we are working on
                     models for projects to run customized versions of the portal that work
                     within their own environments.</item>
                  <item>The portal can initiate queries to remote systems and then pass
                     results to other <title level="m">TAPoR</title> tools. Thus users can see tools like <title level="m">D2K</title> (where
                     they have permission) within their portal account. </item>
               </list>
            </div1>
         </div0>
         <div0 type="EmbeddedDoc">
            <div1>
               <head>The Tamarind project and <title level="m">D2K</title>
               </head>
               <p rend="Presenter">Stephen J. Ramsay</p>
               <p>Tamarind began with the observation that the most basic text analysis
                  procedure of all — search — does not typically operate on the text
                  archive itself. It operates, rather, on a specially designed data
                  structure (typically an inverted file or pat trie index) that contains
                  string locations and byte offsets. Tamarind's primary goal is
                  to facilitate access to analytical data gleaned from large-scale full-text
                  archives. Our working prototype of Tamarind, for example, can quickly
                  generate a relational database of graph properties in a text which can
                  in turn be mined for structural information about the texts in
                  question. Tamarind creates a generalized database schema for
                  holding text properties and allows you to specify this structure as one
                  that should be isolated and loaded into the database. Work is
                  proceeding on a module that will allow the user to load a Tamarind
                  database with millions of word-frequency data points drawn from several
                  gigabytes of encoded data. Unlike existing tools, this newest
                  module includes information about where those counts occur within the
                  tag structure of the document (something that is impossible to do
                  without the raw XML). For the purposes of this project, we intend to
                  use <title level="m">D2K</title> and <title level="m">T2K</title> as the primary clients for Tamarind data stores. </p>
            </div1>
         </div0>
         <div0 type="EmbeddedDoc">
            <div1>
               <head>The <title level="m">M2K</title> project</head>
               <p rend="Presenter">J. Stephen Downie</p>
               <p><title level="m">M2K</title> is being
                  developed to provide the Music Information Retrieval (MIR) community
                  with a mechanism to access a secure store of copyright-sensitive music
                  materials in symbolic, audio and graphic formats. <title level="m">M2K</title> is a set of
                  open-source, music-specific, <title level="m">D2K</title> modules jointly developed by members
                  of the <title level="m">IMIRSEL</title> project and the wider MIR community. <title level="m">M2K</title> modules include
                  such classic signal-processing functions as Fast Fourier Transforms and
                  Spectral Flux. In combination with <title level="m">D2K</title>'s built-in classification
                  functions (e.g., Bayesian Networks and Decision Trees), the <title level="m">M2K</title>
                  modules allow MIR researchers to quickly construct and evaluate
                  prototype MIR systems that perform such sophisticated tasks as genre
                  recognition, artist identification, audio transcription, score
                  analysis, and similarity clustering. </p>
            </div1>
         </div0>
         <div0>
            <p><hi rend="bold">John Unsworth</hi> and <hi rend="bold">J. Stephen Downie</hi>
                  will lead an open-forum discussion of wrap-up and future work. For
                  ambitious, multi-institutional
                  projects like those presented in this panel, many issues arise that can
                  affect the sustainability and impact of the projects. In particular,
                  the issues surrounding the HC community's development, validation,
                  distribution, and re-use of <title level="m">D2K</title>/<title level="m">T2K</title>/<title level="m">M2K</title> modules will be
                  addressed.</p>
         </div0>
      </body>
   </text>
</TEI.2>