<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="paper_87_salmon-alt">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Modeling Diachrony in Dictionaries</title>
            <author>
               <name reg="Salmon-Alt, Susanne">Susanne Salmon-Alt</name>
            </author>
            <author>
               <name reg="Romary, Laurent">Laurent Romary</name>
            </author>
            <author>
               <name reg="Buchi, Eva">Eva Buchi</name>
            </author>
            <respStmt>
               <resp>Marked up by </resp>
               <name reg="Holmes, Martin">Martin Holmes</name>
               <lb/>
               <name reg="Baer, Patricia">Patricia Baer</name>
            </respStmt>
         </titleStmt>
         <publicationStmt>
            <p>Marked up to be included in the ACH/ALLC 2005 Conference Abstracts book.</p>
         </publicationStmt>
         <sourceDesc>
            <p>None</p>
         </sourceDesc>
      </fileDesc>
      <profileDesc>
         <textClass>
            <classCode>paper</classCode>
            <keywords>
               <list>
                  <item>lexicography</item>
                  <item>diachronics</item>
                  <item>encoding standards</item>
               </list>
            </keywords>
         </textClass>
      </profileDesc>
      <revisionDesc>
         <list>
            <item>MDH: Created from John Bradley's XML <date value="2005-04-05">5 April 2005</date>
            </item>
            <item>MDH: Added corrections from PGL's proofing <date value="2005-05-27">27 May 2005</date>
            </item>
         </list>
      </revisionDesc>
   </teiHeader>
   <text>
      <front>
         <docTitle n="Modeling Diachrony in Dictionaries">
            <titlePart>Modeling Diachrony in Dictionaries</titlePart>
         </docTitle>
         <docAuthor>
            <name reg="Salmon-Alt, Susanne">Susanne Salmon-Alt</name>
            <address>
               <addrLine>salt@atilf.fr</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">ATILF-CNRS</titlePart>
         <docAuthor>
            <name reg="Romary, Laurent">Laurent Romary</name>
            <address>
               <addrLine>romary@loria.fr</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">Loria-Inria</titlePart>
         <docAuthor>
            <name reg="Buchi, Eva">Eva Buchi</name>
            <address>
               <addrLine>eva.buchi@atilf.fr</addrLine>
            </address>
         </docAuthor>
         <titlePart type="affil">ATILF-CNRS</titlePart>
      </front>
      <body>
         <div0>
            <head>Introduction: The variety of lexical structures</head>
            <p>Lexical data appear in a wide variety of forms. These range from basic morpho-syntactic structures (Romary et al.) intended for language engineering applications to large editorial projects that cover multiple levels of lexicographic description: morphological information, syntactic constructs, sense-related information (definitions, examples, usage notes, etc.) or historical information. Entries can also vary in their internal organization. Among other factors, the fundamental choice between an onomasiological (concept to word) and a semasiological (word to concept) representation directly affects the internal structure of entries, as well as the possible choice of descriptors attached to them. From a computational point of view, this situation prevents the design of a single data structure that would fit all possible needs, whereas one would like to have uniform access to similar information across heterogeneous lexical sources. This has been the source of intense debate, leading for instance to the ubiquitous <title level="a">Print Dictionaries</title> chapter of the <title level="m">TEI</title> (<title level="m">Text Encoding Initiative</title>), which tries to combine structured and unstructured views of lexical entries. Still, we want to show in this paper that it is possible to apply coherent modeling principles to deal with this variety of structures while providing a precise account of complex sub-components such as diachronic information as they appear in dictionaries with wide lexical coverage. We also want to show that such modeling principles can guide the possible evolution of the TEI towards a more flexible data model for the concrete representation of dictionaries.</p>
         </div0>
         <div0>
            <head>Diachronic information in dictionary entries</head>
            <p>We consider diachronic information in its modern, broad sense of <cit>
                  <q>a word's biography</q>
                  <bibl>Baldinger</bibl>
               </cit>. As such, it covers both etymological information in a restricted sense (tracing the origin and original meaning of a lexeme in its source language) and historical notes about successive changes of form and meaning once the lexeme has entered the target language. This type of information can be found, for instance, in the <title level="m">Oxford English Dictionary</title> (<title level="m">OED</title>), in the <title level="m">Deutsches Wörterbuch</title> (<title level="m">DWB</title>) or in the <title level="m">Trésor de la Langue Française</title> (<title level="m">TLF</title>). Figure 1 illustrates the organization of diachronic information within the micro-structure of a <title level="m">TLF</title> lexical entry (<mentioned>pamplemousse</mentioned>): here, the Etymol. et Hist. section, separated from the synchronic description of the lexeme, consists of two parts, the first dedicated to the lexeme’s history within the target language (modern French) and the second to etymology proper, i.e. origin and word sense in the source language (Dutch).</p>
            <figure rend="ImageLink">
               <head>Figure 1</head>
               <p>
                  <xref>paper_87_salmon-alt_1.gif</xref>
               </p>
               <figDesc>Figure 1</figDesc>
            </figure>
         </div0>
         <div0>
            <head>Historical Notes</head>
            <p>The main objective of the historical notes is to provide the (earliest) written testimony for each of the senses, and possibly for different usages of a sense, with respect to the synchronic description of the entry. Temporal information and quoted source text associated with bibliographical references therefore play a central role in this section. Whereas the <title level="m">OED</title> and the <title level="m">DWB</title> realize the projection from the synchronic sense organization explicitly by subordinating historical notes to sense descriptions, the diachronic part of the <title level="m">TLF</title> entry in Figure 1 takes up each of the four synchronic senses with a sense identifier, a date, a quotation and a bibliographical reference. The latter may be complex when secondary literature is used. One may also notice that differences in word spelling led to two testimonies for sense 1a. Although the sense projection principle is applied very strictly here (which is far from being the case throughout the dictionary, as mentioned for example in Hausmann et al.), the synchronization has not been made explicit, for example through the use of the same sense identifiers within the synchronic and diachronic sections.</p>
         </div0>
         <div0>
            <head>Etymological Notes</head>
            <p>The etymology section of a dictionary is concerned with the origin and development of the lexeme before it entered the target language. As a central task, it identifies one or more etymons and determines the etymological class (inheritance, loan word, word generation) for the oldest sense of the lexeme under consideration. As a consequence, it is not directly related to individual senses in the modern stage of the language considered. In the example, the etymon for the oldest sense of <mentioned>pamplemousse</mentioned> (1a) is the Dutch <mentioned>pompelmoes</mentioned>, itself a word generated by composition from <mentioned>pompel</mentioned> and <mentioned>limoes</mentioned>. Although there have been attempts to formalize etymological notes further (cf. etymological formulas, Ross), they are generally not subject to well-defined organization principles, at least in current dictionaries. In addition to core information about the etymon, etymological notes may indicate the bibliographical sources of the etymological hypotheses and discuss other related issues (phonetic evolution, competing hypotheses, confidence statements, secondary etymons, testimony of etymons, intermediate states, etc.).</p>
         </div0>
         <div0>
            <head>A representational model for diachronic information</head>
            <p>In the following sections, we apply the main modeling principles of the <title>LMF</title> (<title>Lexical Markup Framework</title>) project within ISO committee TC 37/SC 4 to outline the structure of diachronic information in dictionary entries. Those principles (Ide &amp; Romary) allow one to combine a meta-model, which captures the main agreed-upon practices within a given field, with data categories, which correspond to elementary information units attached to the nodes of the meta-model. In the case of lexical structures, a meta-model is itself the combination of a core meta-model (a simple structure organizing a lexical entry with form-related information and a hierarchy of senses) and lexical extensions, seen as additional modules attached to the core meta-model. In our case, we consider what kinds of lexical extensions are needed for both etymological and historical information.</p>
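            <p>The combination of a core meta-model, lexical extensions and data categories can be pictured with a small data-structure sketch. The Python below is an illustration only, not part of LMF: the class <hi rend="code">Component</hi>, the function <hi rend="code">core_entry</hi> and the component names <hi rend="code">LexicalEntry</hi>, <hi rend="code">Form</hi>, <hi rend="code">Sense</hi>, <hi rend="code">Etymology</hi> and <hi rend="code">HistoricalNotes</hi> are hypothetical, and the sense hierarchy is reduced to a flat list.</p>
            <p>
               <hi rend="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Component:
    """A node of the meta-model; data categories are attached as name/value pairs."""
    name: str
    data_categories: Dict[str, str] = field(default_factory=dict)
    children: List["Component"] = field(default_factory=list)

    def add(self, child: "Component") -> "Component":
        """Attach a child component and return it."""
        self.children.append(child)
        return child


def core_entry(lemma: str, sense_ids: List[str]) -> Component:
    """Core structure: a lexical entry with form information and its senses."""
    entry = Component("LexicalEntry")
    entry.add(Component("Form", {"orthography": lemma}))
    for sense_id in sense_ids:
        entry.add(Component("Sense", {"id": sense_id}))
    return entry


# Only sense 1a is named in the text; the other senses are left out here.
entry = core_entry("pamplemousse", ["1a"])

# Lexical extensions are additional modules attached to the core structure
# (both extension names are assumptions made for this sketch).
entry.add(Component("Etymology"))
entry.add(Component("HistoricalNotes"))

component_names = [child.name for child in entry.children]
```
]]></hi>
            </p>
            <p>In this reading, the core part of the entry stays the same whatever extensions are attached, which is what allows uniform access to the shared information across differently organized dictionaries.</p>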
         </div0>
         <div0>
            <head>A Lexical Extension for Etymological Structure</head>
            <p>We propose a basic lexical extension for etymological notes (<term>Etymology</term>, cf. Figure 2), i.e. a structure that accounts for the description of links to etymons. The <term>Etymology</term> component may occur at most once for a given lexical entry, under the assumption that lexical entries are purely polysemous and exclude homonyms, since the distinction between the two is drawn on historical criteria, cf. <mentioned>adresse</mentioned>¹ (<title level="m">TLF</title>). This etymological information is further structured by means of <term>Etymological Unit</term> and <term>Etymological Link</term> components. <term>Etymological Unit</term> components are word forms playing the role of etymons. As such, they may be characterized by any existing data category defined for the description of lexical entries, i.e. lemmata and inflected forms (<term>language</term>, <term>orthography</term>, <term>sense</term>, <term>part-of-speech</term>, <term>inflectional information</term>, etc.). Two points should be noted: first, the coverage of <term>language</term> should be extended to more fine-grained geographical and diachronic variants than those currently available from the ISO 639 series. Second, depending on available resources, all or part of this information could be recovered by a pointing mechanism. <term>Etymological Link</term> components stand for the etymological relation between linguistic units. A link is basically characterized by an <term>etymological target</term> and an <term>etymological source</term>, i.e. pointers to external resources, including lexical entries of the current dictionary and etymological units previously described. Etymological links are typed by the <term>etymological class</term> (loan word, inheritance, etc.). They may additionally bear information about the bibliographical source, the confidence level or other types of notes.
The full paper will show how this data structure accounts for different types of etymological notes in current dictionaries, including cases of competing, popular, secondary and multiple etymons.</p>
            <figure rend="ImageLink">
               <head>Figure 2</head>
               <p>
                  <xref>paper_87_salmon-alt_2.gif</xref>
               </p>
               <figDesc>Figure 2</figDesc>
            </figure>
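            <p>The etymological extension described above can be sketched as a few record types. The Python below is a minimal illustration under stated assumptions, not the encoding proposed for the TEI: all class, field and method names are hypothetical, pointers are reduced to plain string identifiers, and the class labels attached to the <mentioned>pamplemousse</mentioned> links are our reading of the example rather than data taken from the <title level="m">TLF</title>.</p>
            <p>
               <hi rend="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EtymologicalUnit:
    """A word form playing the role of an etymon."""
    unit_id: str
    orthography: str
    language: Optional[str] = None  # None when the text does not state it


@dataclass
class EtymologicalLink:
    """A typed relation between two linguistic units, given by identifier."""
    source: str                      # pointer to the etymon
    target: str                      # pointer to a unit or a lexical entry
    etymological_class: str          # e.g. "loan word", "inheritance"
    bibliography: Optional[str] = None
    confidence: Optional[str] = None


@dataclass
class Etymology:
    units: List[EtymologicalUnit] = field(default_factory=list)
    links: List[EtymologicalLink] = field(default_factory=list)

    def sources_of(self, target: str) -> List[str]:
        """Identifiers of all units linked as sources of the given target."""
        return [link.source for link in self.links if link.target == target]


@dataclass
class LexicalEntry:
    lemma: str
    etymology: Optional[Etymology] = None  # occurs at most once per entry


# The pamplemousse example; class labels are assumptions made for this sketch.
etymology = Etymology(
    units=[
        EtymologicalUnit("pompelmoes", "pompelmoes", language="Dutch"),
        EtymologicalUnit("pompel", "pompel"),
        EtymologicalUnit("limoes", "limoes"),
    ],
    links=[
        EtymologicalLink("pompelmoes", "pamplemousse", "loan word"),
        EtymologicalLink("pompel", "pompelmoes", "word generation"),
        EtymologicalLink("limoes", "pompelmoes", "word generation"),
    ],
)
entry = LexicalEntry("pamplemousse", etymology)

direct_etymons = entry.etymology.sources_of("pamplemousse")
components = entry.etymology.sources_of("pompelmoes")
```
]]></hi>
            </p>
            <p>Because links refer to their source and target by identifier, the same mechanism can point either to a unit described inside the entry or to another lexical entry, and several links sharing one target can represent multiple or competing etymons.</p>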
         </div0>
         <div0>
            <head>A Lexical Extension for Historical Notes</head>
            <p>The modeling of historical notes can be seen from two complementary and somewhat sequentially organized perspectives. First, we observe that historical notes are organized as a hierarchy of sense-like objects, which leads to the simple historical extension depicted in Figure 3.</p>
            <figure rend="ImageLink">
               <head>Figure 3</head>
               <p>
                  <xref>paper_87_salmon-alt_3.gif</xref>
               </p>
               <figDesc>Figure 3</figDesc>
            </figure>
            <p>This extension takes up the sense component that already exists in the core LMF meta-model, while further characterizing it with specific dating (<hi rend="code">/date/</hi>) and bibliographic (<hi rend="code">/bibliography/</hi>) information. Such an extension accounts for situations where there is no a priori editorial coherence between the sense organization in the lexical entry and its possible counterparts in the historical notes, as encountered in, e.g., the <title level="m">TLFi</title> or the <title level="m">OED</title>. In that case, we keep open the possibility of establishing links (<hi rend="code">/synchronic reference/</hi>) between components of the historical notes and senses in the main entry. To model a more controlled editorial project, we suggest moving from the previous extension to an integrated view (cf. Figure 4), which anchors historical descriptions directly on the corresponding senses. It would then still be possible to externalize the corresponding information and derive an autonomous representation conforming to Figure 3.</p>
            <figure rend="ImageLink">
               <head>Figure 4</head>
               <p>
                  <xref>paper_87_salmon-alt_4.gif</xref>
               </p>
               <figDesc>Figure 4</figDesc>
            </figure>
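            <p>The relation between the two views can be sketched in code. The Python below is a minimal illustration, assuming hypothetical class and function names (<hi rend="code">HistoricalSense</hi>, <hi rend="code">Attestation</hi>, <hi rend="code">Sense</hi>, <hi rend="code">externalize</hi>) and placeholder values for dates and sources. It derives autonomous historical notes from the integrated view, keeping a synchronic reference to each originating sense; for brevity the result is a flat list rather than a hierarchy.</p>
            <p>
               <hi rend="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HistoricalSense:
    """Sense-like node of the autonomous historical extension."""
    date: str
    bibliography: str
    synchronic_reference: Optional[str] = None  # link to a sense of the entry
    subsenses: List["HistoricalSense"] = field(default_factory=list)


@dataclass
class Attestation:
    """Dating and bibliographic information for one written testimony."""
    date: str
    bibliography: str


@dataclass
class Sense:
    """Sense of the main entry carrying its own historical description."""
    sense_id: str
    attestations: List[Attestation] = field(default_factory=list)
    subsenses: List["Sense"] = field(default_factory=list)


def externalize(senses: List[Sense]) -> List[HistoricalSense]:
    """Derive autonomous historical notes from the integrated view.

    Each attestation becomes one historical note that points back to its
    sense; subsenses are visited recursively and appended to the same list.
    """
    notes: List[HistoricalSense] = []
    for sense in senses:
        for attestation in sense.attestations:
            notes.append(
                HistoricalSense(
                    date=attestation.date,
                    bibliography=attestation.bibliography,
                    synchronic_reference=sense.sense_id,
                )
            )
        notes.extend(externalize(sense.subsenses))
    return notes


# Sense 1a with two testimonies; dates and sources are placeholders only.
integrated = [
    Sense(
        "1a",
        attestations=[
            Attestation("DATE_1", "SOURCE_1"),
            Attestation("DATE_2", "SOURCE_2"),
        ],
    )
]
notes = externalize(integrated)
references = [note.synchronic_reference for note in notes]
```
]]></hi>
            </p>
            <p>The conversion only works in this direction without further information: going from autonomous notes back to the integrated view requires every note to carry a synchronic reference, which is precisely what a less controlled editorial project may lack.</p>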
         </div0>
         <div0>
            <head>Implementation in the framework of the TEI</head>
            <p>The final paper will show precisely how the two types of structures described above can be implemented using the latest version of the specification platform of the TEI (ODD, One Document Does it all; Burnard &amp; Rahtz). In particular, we will show that, on the one hand, we can extend the scope of the existing <hi rend="code">&lt;etym&gt;</hi> element from the P4 guidelines and that, on the other hand, it is necessary to introduce a new element dedicated to the representation of historical notes. This element mimics the behavior of related entries (a sub-structure with a strong structural analogy to a full entry), combined with dating and bibliographical descriptors. Depending on the feedback we receive from the lexicographic community, these extensions could be incorporated into the next version (P5) of the TEI guidelines.</p>
         </div0>
      </body>
      <back>
         <div type="Bibliography">
            <head>Bibliography</head>
            <listBibl>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Baldinger, K.">K. Baldinger</name>
                     </author>
                     <title level="a">L’étymologie d’hier et d’aujourd’hui</title>
                  </analytic>
                  <monogr>
                     <editor>
                        <name reg="Schmitt, R.">R. Schmitt</name>
                     </editor>
                     <title level="m">Etymologie</title>
                     <imprint>
                        <pubPlace>Darmstadt</pubPlace>
                        <date value="1977">1977</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Burnard, L.">L. Burnard</name>
                     </author>
                     <author>
                        <name reg="Rahtz, S.">S. Rahtz</name>
                     </author>
                     <title level="a">Relaxing with Son of ODD, or What the TEI did Next</title>
                  </analytic>
                  <monogr>
                     <title level="u">Paper delivered at the Extreme Markup Languages conference, Montréal (Canada), 2-6 August 2004</title>
                     <imprint>
                        <date value="2004-08">2004</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Ide, N.">N. Ide</name>
                     </author>
                     <author>
                        <name reg="Romary, L.">L. Romary</name>
                     </author>
                     <title level="a">International Standard for a Linguistic Annotation Framework</title>
                  </analytic>
                  <monogr>
                     <title level="j">International Journal on Natural Language Engineering</title>
                     <imprint>
                        <date>Forthcoming</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <analytic>
                     <author>
                        <name reg="Romary, L.">L. Romary</name>
                     </author>
                     <author>
                        <name reg="Salmon-Alt, S.">S. Salmon-Alt</name>
                     </author>
                     <author>
                        <name reg="Francopoulo, G.">G. Francopoulo</name>
                     </author>
                     <title level="a">Standards going concrete: from LMF to Morphalou</title>
                  </analytic>
                  <monogr>
                     <title level="u">Coling Workshop on Enhancing and Using Electronic Dictionaries, Geneva, 29 August 2004</title>
                     <imprint>
                        <date value="2004-08-29">2004</date>
                     </imprint>
                  </monogr>
               </biblStruct>
               <biblStruct>
                  <monogr>
                     <editor>
                        <name reg="Hausmann, F.J.">F.J. Hausmann</name>
                     </editor>
                     <editor>
                        <name reg="Reichmann, O.">O. Reichmann</name>
                     </editor>
                     <editor>
                        <name reg="Wiegand, H.E.">H.E. Wiegand</name>
                     </editor>
                     <editor>
                        <name reg="Zgusta, L.">L. Zgusta</name>
                     </editor>
                     <title level="m">Wörterbücher. Ein internationales Handbuch zur Lexikographie</title>
                     <imprint>
                        <publisher>Walter de Gruyter</publisher>
                        <pubPlace>Berlin / New York</pubPlace>
                        <date value="1990">1990</date>
                     </imprint>
                  </monogr>
               </biblStruct>
            </listBibl>
         </div>
      </back>
   </text>
</TEI.2>