TEI 2017



Using TEI for Language Documentation Projects: The Nxaʔamxčín database and dictionary

Ewa Czaykowska-Higgins and Sarah M. Kell

Ewa Czaykowska-Higgins (Linguistics, University of Victoria) has been a scholar of Salish languages for about 30 years, has contributed to research on both the Nxaʔamxčin and the SENĆOŦEN Salish languages, and has worked as an ally and partner in Indigenous Language Revitalization. Her current research projects include completion, in collaboration with colleagues from Colville Tribes’ Nxaʔamxčin Language Program, of an online Nxaʔamxčin Database and Dictionary, based on legacy materials from the 1960s and 70s.

Sarah Kell has a BA in Linguistics and an M.Ed in Indigenous Language Revitalization from the University of Victoria. She has been co-editor of the Nxaʔamxčin Database and Dictionary since 2010. Sarah also assists UVic linguists with research on other Salish languages, and consults on Indigenous language curriculum development with First Nations, school districts, and the British Columbia Ministry of Education.

The TEI Consortium has extensive guidelines for encoding lexical resources, including “monolingual and multilingual dictionaries, glossaries, and similar documents”; for the most part, these guidelines have been used to construct dictionaries for European languages and to encode already extant print dictionaries, which requires an encoding to be “faithful to an original printed version” or to capture lexical information “in a form suitable for further processing” (TEI Guidelines 2017).
In this paper, we describe a language documentation project employing TEI encoding to construct lexical resources for the Native American language Nxaʔamxčín (Salish) (Czaykowska-Higgins et al. 2014). The project is significant from the point of view of TEI because (1) the language is small, non-European, and endangered; (2) the lexical resource aims to encode lexical information listed on filecards created from fieldnotes, rather than information from an already printed dictionary; and (3) the lexical resource is morphologically based, encoding affixes as well as roots, stems, and words.
While linguists involved in language documentation for small endangered languages do use XML for language data, the majority use either off-the-shelf software, such as FLEx or TshwaneLex, or customized coding systems to create lexical resources. Endangered language projects using TEI-XML schemas are rare (e.g., Spence & Liu 2017, Lillehaugen et al. 2016); we know of only one other TEI-based dictionary project (Bates & Lonsdale 2010). Off-the-shelf dictionary-making software in particular provides an easy interface for inputting lexical material, and has built-in programming to generate dictionary output, thus limiting the need for linguists to work with programmers. This is a decided advantage for language documentation of small languages, which is often undertaken with small budgets and in challenging fieldwork conditions. However, as Budin et al. (2012) point out for lexicography more generally, there is comparatively little explicit consensus about standards.
In this paper, therefore, we illustrate our use of the TEI encoding schemas, and the print dictionary and web-based lexical resource these have allowed us to produce. As in other Salish languages, words in Nxaʔamxčín can be highly complex, containing multiple morphemes representing inflectional, derivational, and lexical content. For example, scḥàwˀiyáɬxʷəxʷ “he is building a house” contains the root ḥawˀiy “make, do, build” and the bound lexical suffix -áɬxʷ “house.” The prefix sc- and the suffix -əxʷ together indicate imperfective aspect. One of the advantages for us in using TEI is that TEI’s mixed-content elements, combined with the ability to define the content of feature structures, allow maximal flexibility to encode this complex morphological information and provide linking between morphological units (cf. Fraser 2011). Although it was challenging to arrive at a system of coding that reflected variability in morphology, multiplicity of sources, and multiple definitions, TEI has allowed us the freedom to encode this variability through the use of multiple <pron>, <seg>, <def> and <bibl> elements. The flexibility of the <hyph> element has also allowed us to mark up root morphemes split by infixation, as in sḥəwˀḥáwˀwˀiy “what you have done,” where the “out of control” affix, here -wˀ-, appears within the root ḥawˀiy. As a result of having to work out appropriate use of TEI to capture the content of the material recorded on filecards and in fieldnotes, we have learned more about the structure of the language’s phonology, morphology, and syntax, including how best to classify certain morphemes as affixes, clitics, or particles. Now that all entries are fully encoded in TEI, we can use XSLT queries to find related structures across the dictionary database, and carry out complex linguistic analyses.
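To make the idea of a morphologically segmented, queryable entry more concrete, the following is a minimal sketch in Python rather than XSLT. The entry layout is invented for illustration: it reuses the <seg> and <def> elements named above, but the nesting, the type values, and the helper name `morphemes` are assumptions, not the project's actual schema, and the fragment has not been validated against any TEI schema. The segmentation simply restates the example word, its root, and its affixes as given above.

```python
import xml.etree.ElementTree as ET

# TEI namespace, bound to the prefix "tei" for path queries.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical entry: the structure and the type values are illustrative
# assumptions, not the encoding used in the Nxaʔamxčín database.
ENTRY_XML = """
<entry xmlns="http://www.tei-c.org/ns/1.0">
  <form>
    <orth>scḥàwˀiyáɬxʷəxʷ</orth>
    <seg type="prefix">sc-</seg>
    <seg type="root">ḥawˀiy</seg>
    <seg type="lexical-suffix">-áɬxʷ</seg>
    <seg type="suffix">-əxʷ</seg>
  </form>
  <sense>
    <def>he is building a house</def>
  </sense>
</entry>
"""


def morphemes(entry, kind=None):
    """Return (type, text) pairs for every <seg> in the entry, in document
    order, optionally keeping only segments whose type equals `kind`."""
    pairs = []
    for seg in entry.iterfind(".//tei:seg", NS):
        seg_type = seg.get("type")
        if kind is None or seg_type == kind:
            pairs.append((seg_type, seg.text))
    return pairs


entry = ET.fromstring(ENTRY_XML.strip())
definition = entry.find("tei:sense/tei:def", NS).text

print(definition)
print(morphemes(entry))               # all segments of the word
print(morphemes(entry, kind="root"))  # only the root
```

Run over a whole collection of entries, a filter of this kind is one way to gather every word built on a given root or containing a given lexical suffix, which is the sort of cross-entry search that the XSLT queries mentioned above make possible.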
In sum, TEI has served us well, giving us the freedom to create encoding structures that capture the information we need to capture, within a constrained but customizable standard. The standardized nature of TEI, combined with TEI Consortium support for maintaining and improving the standard, also ensures interoperability and sustainability (Bird & Simons 2003) of the end products.


