<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="paper_91_westman">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Approaches to Searching for Language and Diversity in a 'Whitebread City' Digital Corpus: The Charlotte Conversation and Narrative Collection</title>
            <author>
               <name reg="Westman, Stephen">Stephen Westman</name>
            </author>
            <author>
               <name reg="Davis, Boyd">Boyd Davis</name>
            </author>
            <respStmt>
               <resp>Marked up by </resp>
               <name reg="Holmes, Martin">Martin Holmes</name>
               <lb/>
               <name reg="Baer, Patricia">Patricia Baer</name>
            </respStmt>
         </titleStmt>
         <publicationStmt>
            <p>Marked up to be included in the ACH/ALLC 2005 Conference Abstracts book.</p>
         </publicationStmt>
         <sourceDesc>
            <p>None</p>
         </sourceDesc>
      </fileDesc>
      <profileDesc>
         <textClass>
            <classCode>paper</classCode>
            <keywords>
               <list>
                  <item>automated textual analysis</item>
                  <item>open source tools</item>
                  <item>narrative corpora</item>
               </list>
            </keywords>
         </textClass>
      </profileDesc>
      <revisionDesc>
         <list>
            <item>MDH: Created from John Bradley's XML <date value="2005-04">April 2005</date>
            </item>
            <item>MDH: Marked up <date value="2005-04-11">11 April 2005</date>
            </item>
            <item>MDH: Entered proofing corrections from PGL <date value="2005-05-25">25 May 2005</date>
            </item>
         </list>
      </revisionDesc>
   </teiHeader>
   <text>
      <front>
         <docTitle n="Approaches to Searching for Language and Diversity in a 'Whitebread City' Digital Corpus: The Charlotte Conversation and Narrative Collection">
            <titlePart>Approaches to Searching for Language and Diversity in a 'Whitebread City' Digital Corpus: The Charlotte Conversation and Narrative Collection</titlePart>
         </docTitle>
         <docAuthor>
            <name reg="Westman, Stephen">Stephen Westman</name>
            <address>
               <addrLine>srwestma@email.uncc.edu</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">University of North Carolina at Charlotte</titlePart>
         <docAuthor>
            <name reg="Davis, Boyd">Boyd Davis</name>
            <address>
               <addrLine>bdavis@email.uncc.edu</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">University of North Carolina at Charlotte</titlePart>
      </front>
      <body>
         <div0>
            <p>Macaulay comments that <cit>
                  <q>Dialects, like languages, have both a unifying and a separatist function. We speak the way we do to be like those we wish to associate with and to distinguish ourselves from others. When that association is based on where we live. . . a distinctive form of speech is likely to survive. However, we need to look at the whole configuration of linguistic features and not a few features which may or may not be the critical ones for the speakers.</q>
                  <bibl>239</bibl>
               </cit> Why? In addition to the <cit>
                  <q>grammatical, phonetic, and lexical</q>
               </cit> features traditionally posited as characterizing a dialect, Macaulay adds <cit>
                  <q>prosodic features and possibly also voice quality and discourse characteristics. There is no reason to believe that dialects have fewer features than other forms of language, and we do not know in advance which features will be important to distinguish the dialect.</q>
                  <bibl>229</bibl>
               </cit>
            </p>
            <p>In a discussion of features of southern style that warrant further investigation, Barbara Johnstone cites several which can be searched at word- and text-level: these lexicogrammatical features contribute to the reader/hearer's assessment of style, and include rhetorical genres triggered by particular discourse markers; style shifts into regional colloquialism, stylization and self-parody signaled by shifts into nonstandard verbs, for example, or a judicious sprinkling of double modals to suggest temporary intimacy. She asks for an investigation of regional styles of interacting that <cit>
                  <q>makes strategic use of nostalgia for neighborhood, local community, or region.</q>
                  <bibl>206</bibl>
               </cit> Well, they said southernly, we have a gracious plenty of data that accommodates such investigation; the issue is, of course, how to access, identify, and retrieve it.</p>
            <p>The corpus that we are using to investigate these questions is <title level="m">Project MORE</title>'s expanded <title level="m">Charlotte Conversation and Narrative Collection</title> (<title level="m">CNCC</title>), which is part of the 11.5 million words in the First Release of the <title level="m">American National Corpus</title> (<title level="m">ANC</title>). The <title level="m">CNCC</title> is considered a satellite corpus to the core of the <title level="m">ANC</title>, which parallels the organization of the <title level="m">British National Corpus</title> (Reppen &amp; Ide). Its goal is far more modest but still difficult: to create a corpus of conversation and conversational narration in a New South city at the beginning of the 21st century. And that, of course, brings us smack up against issues of region (Macaulay) and representativeness (Douglas), of dialect diversity (Wolfram &amp; Dannenberg), and distinctions between rural and metropolitan features (Tillery, Bailey &amp; Wikle).</p>
            <p>The <title level="m">CNCC</title> is a hybrid in some ways, similar to the <title level="m">ONZE</title> corpus in its evolution through multiple formats and purposes (Gordon et al.). In addition to being a part of the <title level="m">ANC</title>, it is also included in the <title level="m">New South Voices</title> (<title level="m">NSV</title>) digital resource housed at the University of North Carolina at Charlotte Library. <title level="m">NSV</title> includes interviews that cover a wide range of historical subjects, from African American churches and Billy Graham crusades to women's basketball and World War II. Other interviews, narratives and conversations document the experiences and language of recent immigrants to the area. As such, it seeks to address a wide variety of audiences, from local historians and historic preservationists to public school students. By using <title level="m">NSV</title>, we are able to expand the number and range of interviews available for linguistic as well as historical analysis.</p>
            <p>If the corpus is to be inclusive of the range of spoken styles that conglomerate within the elastic borders of a New South city, it must begin by identifying what they are. In today's Charlotte, today's North Carolina, this is no longer simple. As Tillery, Bailey &amp; Wikle note, metropolitanization, foreign and domestic migration, and expanding ethnic diversity have <cit>
                  <q>eliminated many of the vestiges of traditional regional culture and . . . are radically reshaping the United States</q>
                  <bibl>228</bibl>
                </cit>. Their painstaking study of what they see as the impact of demographic change on American speech is keyed to 22 socio-demographic and linguistic variables: 14 phonological features, 3 lexical, and 5 that are lexicogrammatical. They see a balkanization (241; cf. 244), with increased divergence of rural and urban ways of speaking, and they ask whether <cit>
                   <q>old towns with new populations</q>
                </cit>, such as Charlotte, will create new communities and new ways of speaking.</p>
            <p>Investigating these phenomena within a database environment requires a variety of tools and approaches if we are to extract the information contained in these transcripts. The reason lies in the kinds of information we need to obtain from these interviews and in the question of how best to obtain each kind.</p>
            <p>On the one hand, we need to be able to perform textual analysis on the interviews and to examine subjects' speech patterns and linguistic characteristics. This in turn requires that we be able to extract information that is embedded within discursive text — looking at how they use language <soCalled>
                  <hi rend="foreign">in situ</hi>
                </soCalled>. On the other hand, there are discrete pieces of information (metadata, if you will) about the participants, such as place of origin, current residence, gender, and ethnicity, to which we need access if we are to do anything meaningful with what we discover from our textual analysis. This dichotomy arises in any field that performs textual analysis. As Ronald Bourret notes, its roots lie in the two types of information with which we are dealing: document-centric and data-centric.</p>
            <p>Because the two types of information, linguistic analysis and descriptive metadata, differ in nature, we have found that a single approach does not allow us to fully explore the types of correlations we are seeking. While our XML database lets us search tagged information within the interviews effectively, it does not provide sufficient flexibility in searching data-centric information. Relational database technology presents exactly the opposite situation: it is flexible for data-centric information but poorly suited to searching tagged text.</p>
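            <p>The document-centric side of this contrast can be illustrated with a minimal sketch. It does not show our eXist queries. It uses Python's standard <hi rend="foreign">xml.etree.ElementTree</hi> module, and the markup (<hi rend="foreign">u</hi> elements with a <hi rend="foreign">who</hi> attribute, <hi rend="foreign">distinct</hi> spans with a <hi rend="foreign">type</hi> of "double-modal"), the speaker codes, and the sample utterances are all assumptions invented for illustration.</p>
            <eg><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical transcript fragment; tags, attributes and text are invented.
SAMPLE = """
<text>
  <u who="S1">I <distinct type="double-modal">might could</distinct> do that.</u>
  <u who="S2">We used to live out in the county.</u>
  <u who="S1">You <distinct type="double-modal">may can</distinct> find it there.</u>
</text>
"""

def tagged_features(xml_text, feature_type):
    """Return (speaker, tagged text) pairs for every tagged span of one type."""
    root = ET.fromstring(xml_text)
    pairs = []
    for utterance in root.iter("u"):
        for span in utterance.iter("distinct"):
            if span.get("type") == feature_type:
                pairs.append((utterance.get("who"), span.text))
    return pairs

double_modals = tagged_features(SAMPLE, "double-modal")
```
]]></eg>
            <p>A search of this kind finds tagged features and the speaker who produced them, but it says nothing about that speaker's place of origin or other demographic attributes unless those are looked up elsewhere.</p>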
            <p>In addition, during the course of our investigation, we discovered that certain types of textual information were amenable to neither an XML database nor a classic relational database approach. These include word and phrase frequency, retrieving and isolating particular words and phrases within their context in a document, and looking for particular words or phrases within a certain proximity of each other. To address this need, we decided that a third option, inverted indexes, was needed to allow us to look for such patterns. As noted in Zaïane, this technique greatly enhances the ability to search text-based information.</p>
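            <p>As a rough illustration of why an inverted index suits these tasks, the following sketch builds a small positional index in Python and uses it for frequency counts, proximity search, and keyword-in-context retrieval. It is not our production code: the function names, the simple tokenizer, and the two sample sentences are assumptions made up for the example.</p>
            <eg><![CDATA[
```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(docs):
    """Map each word to {doc_id: [token positions]}: a positional inverted index."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index

def frequency(index, word):
    """Total occurrences of a word across all documents."""
    return sum(len(positions) for positions in index.get(word, {}).values())

def near(index, a, b, window):
    """Ids of documents where a and b occur within `window` tokens of each other."""
    hits = set()
    for doc_id, a_positions in index.get(a, {}).items():
        b_positions = index.get(b, {}).get(doc_id, [])
        if any(abs(i - j) <= window for i in a_positions for j in b_positions):
            hits.add(doc_id)
    return hits

def kwic(docs, index, word, width=3):
    """Keyword-in-context: the word with up to `width` tokens on each side."""
    lines = []
    for doc_id, positions in index.get(word, {}).items():
        tokens = tokenize(docs[doc_id])
        for pos in positions:
            context = tokens[max(0, pos - width):pos + width + 1]
            lines.append((doc_id, " ".join(context)))
    return lines

# Invented sample transcripts (hypothetical ids and text).
docs = {
    "t1": "I might could help you with that, I reckon.",
    "t2": "She might help, and he could too if we ask him later on.",
}
index = build_index(docs)
adjacent = near(index, "might", "could", 1)   # candidate double modals
```
]]></eg>
            <p>Because each entry records token positions, a proximity query only compares the position lists of the two words rather than rescanning every transcript. With a window of 1, only the first sample transcript matches, which is the pattern one would expect for a double modal.</p>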
            <p>Therefore, in designing our database system, we decided to adopt a mixed approach, one that allowed us to utilize the strengths of each system without running afoul of its limitations. In doing so, we use both XML and inverted indexing to do textual analysis and then a relational database to correlate that information with relevant demographic criteria. The result is a hybrid that allows us to do more than any single approach could provide.</p>
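            <p>The correlation step can be sketched as follows. This is an illustration under stated assumptions, not a description of our schema: it uses Python's built-in <hi rend="foreign">sqlite3</hi> module as a stand-in for MySQL, and the <hi rend="foreign">speakers</hi> table, its columns, the transcript ids, and the rows are all hypothetical. The set of transcript ids is assumed to come from a prior text search.</p>
            <eg><![CDATA[
```python
import sqlite3

def correlate(conn, transcript_ids, field):
    """Count the matching transcripts for each value of one demographic field."""
    # Column names cannot be bound as SQL parameters, so restrict them explicitly.
    allowed = {"place_of_origin", "gender"}
    if field not in allowed:
        raise ValueError(f"unknown field: {field}")
    ids = sorted(transcript_ids)
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    sql = (f"SELECT {field}, COUNT(*) FROM speakers "
           f"WHERE transcript_id IN ({placeholders}) GROUP BY {field}")
    return dict(conn.execute(sql, ids).fetchall())

# Hypothetical metadata table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE speakers ("
    "transcript_id TEXT PRIMARY KEY, place_of_origin TEXT, gender TEXT)"
)
conn.executemany("INSERT INTO speakers VALUES (?, ?, ?)", [
    ("t1", "Charlotte", "F"),
    ("t2", "Ohio", "M"),
    ("t3", "Charlotte", "M"),
])

text_hits = {"t1", "t3"}   # ids assumed to be returned by a text search
by_origin = correlate(conn, text_hits, "place_of_origin")
```
]]></eg>
            <p>Keeping the text search and the metadata lookup separate means that either side can be replaced or extended without changing the other, which is the main attraction of the mixed design.</p>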
            <p>This paper will first present how we are using readily available tools to implement a search system that supports demographic correlation with textual features (including some features of proximity search and frequency of occurrence). These tools, all of which are open source, allow us to build and configure with ease a system that not long ago would have required extensive and non-trivial programming. They include:</p>
            <list type="unordered">
               <item>
                  <title level="m">eXist</title> (XML) and <title level="m">MySQL</title> (relational) database managers</item>
               <item>PHP, Perl and Java programming languages</item>
               <item>
                  <title level="m">Apache</title> Web server</item>
            </list>
            <p>In concluding, we will earnestly solicit advice on how best to make this collection of roughly 1,000 transcribed oral interviews, conversations and narratives more useful to researchers, particularly for text-based online searching.</p>
         </div0>
      </body>
      <back>
         <div type="Bibliography">
            <head>Bibliography</head>
            <listBibl>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Gordon, Elizabeth">Elizabeth Gordon</name>
                     </author>
                     <author>
                        <name reg="Maclagan, Margaret">Margaret Maclagan</name>
                     </author>
                     <author>
                        <name reg="Hay, Jennifer">Jennifer Hay</name>
                     </author>
                     <title level="a">The ONZE corpus. Manuscript</title>
                  </analytic>
                  <monogr>
                     <editor>
                        <name reg="Beal, J.C.">J.C. Beal</name>
                     </editor>
                     <editor>
                        <name reg="Corrigan, K.P.">K.P. Corrigan</name>
                     </editor>
                     <editor>
                        <name reg="Moisl, H.">H. Moisl</name>
                     </editor>
                     <title level="m">Models and Methods in the Handling of Unconventional Digital Corpora. Volume 2: Diachronic Corpora</title>
                     <imprint>
                        <publisher>Palgrave</publisher>
                        <pubPlace>Houndmills</pubPlace>
                        <date>Forthcoming</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Douglas, Fiona">Fiona Douglas</name>
                     </author>
                     <title level="a">The Scottish Corpus of Texts and Speech: problems of corpus design</title>
                  </analytic>
                  <monogr>
                     <title level="j">Literary and Linguistic Computing</title>
                     <imprint>
                        <biblScope type="vol">18</biblScope>
                        <biblScope type="pages">23-37</biblScope>
                        <date value="2003">2003</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Hudson-Ettle, Diana">Diana Hudson-Ettle</name>
                     </author>
                     <title level="a">Nominal that clauses in three regional varieties of English: a study of the relevance of text type, medium, and syntactic function</title>
                  </analytic>
                  <monogr>
                     <title level="j">Journal of English Linguistics</title>
                     <imprint>
                        <biblScope type="vol">30</biblScope>
                        <biblScope type="pages">258-273</biblScope>
                        <date value="2002">2002</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Johnstone, Barbara">Barbara Johnstone</name>
                     </author>
                     <title level="a">Features and uses of southern style</title>
                  </analytic>
                  <monogr>
                     <editor>
                        <name reg="Nagle, S.">S. Nagle</name>
                     </editor>
                     <editor>
                        <name reg="Sanders, S.">S. Sanders</name>
                     </editor>
                     <title level="m">English in the Southern United States</title>
                     <imprint>
                        <publisher>Cambridge University Press</publisher>
                        <pubPlace>Cambridge</pubPlace>
                        <date value="2003">2003</date>
                        <biblScope type="pages">189-207</biblScope>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Kjellmer, Goeran">Goeran Kjellmer</name>
                     </author>
                     <title level="a">A modal shock absorber, empathizer/emphasizer and qualifier</title>
                  </analytic>
                  <monogr>
                     <title level="j">International Journal of Corpus Linguistics</title>
                     <imprint>
                        <biblScope type="vol">8</biblScope>
                        <biblScope type="pages">145-168</biblScope>
                        <date value="2003">2003</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Macaulay, Ronald">Ronald Macaulay</name>
                     </author>
                     <title level="a">I'm off to Philadelphia in the morning</title>
                  </analytic>
                  <monogr>
                     <title level="j">American Speech</title>
                     <imprint>
                        <biblScope type="vol">77</biblScope>
                        <biblScope type="pages">227-241</biblScope>
                        <date value="2002">2002</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Reppen, Randi">Randi Reppen</name>
                     </author>
                     <author>
                        <name reg="Ide, Nancy">Nancy Ide</name>
                     </author>
                     <title level="a">The American National Corpus: Overall goals and the first release</title>
                  </analytic>
                  <monogr>
                     <title level="j">Journal of English Linguistics</title>
                     <imprint>
                        <biblScope type="vol">32</biblScope>
                        <biblScope type="pages">105-113</biblScope>
                        <date value="2004">2004</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Tillery, Jan">Jan Tillery</name>
                     </author>
                     <author>
                        <name reg="Bailey, Guy">Guy Bailey</name>
                     </author>
                     <author>
                        <name reg="Wikle, Tom">Tom Wikle</name>
                     </author>
                     <title level="a">Demographic change and American dialectology in the twenty-first century</title>
                  </analytic>
                  <monogr>
                     <title level="j">American Speech</title>
                     <imprint>
                        <biblScope type="vol">79</biblScope>
                        <biblScope type="pages">227-249</biblScope>
                        <date value="2004">2004</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Wolfram, Walt">Walt Wolfram</name>
                     </author>
                     <author>
                        <name reg="Dannenberg, Clare">Clare Dannenberg</name>
                     </author>
                     <title level="a">Dialect identity in a tri-ethnic context: The case of Lumbee American Indian English</title>
                  </analytic>
                  <monogr>
                     <title level="j">English World-Wide</title>
                     <imprint>
                        <biblScope type="vol">20</biblScope>
                        <biblScope type="pages">179-216</biblScope>
                        <date value="1999">1999</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <author>
                        <name reg="Zaïane, Osmar">Osmar Zaïane</name>
                     </author>
                     <title level="m">Inverted Index for Information Retrieval (Slides keyed to Chapter 22 of unlisted textbook: CMPUT 391: Database Management Systems)</title>
                     <imprint>
                        <publisher>University of Alberta</publisher>
                        <date value="2001">2001</date>
                     </imprint>
                  </monogr>
                  <note>
                     <xptr crdate="2005-04-11"
                           to="http://www.cs.ualberta.ca/~zaiane/courses/cmput391-02/slides/Lect7/"/>
                  </note>
               </biblStruct>
            </listBibl>
         </div>
      </back>
   </text>
</TEI.2>