Category: Announcements

13/04/17

Permalink 12:13:21 pm, by skell, 203 words, 102 views   English (CA)
Categories: Activity log, Announcements; Mins. worked: 30

feature structures for numbers

Further to our discussions on numbers, I have added the following to feature_system.xml:

1) wordType numberStem. So ECH will add this <fs> to the number stems 1-10.

<fs>
<f name="numberStem">
<binary value="true"/>
</f>
</fs>

2) countingType "ten"

I have also added the following <fs> to lexical suffix "akst-2", so ECH can use this morpheme for marking up the numbers 30, 40 ... 90.

<fs>
<f name="baseType">
<symbol value="affix"/>
</f>
<f name="positionType">
<symbol value="suffix"/>
</f>
<f name="affixType">
<symbol value="derivational"/>
</f>
<f name="derivationalType">
<symbol value="lexical"/>
<symbol value="counting"/>
</f>
<f name="countingType">
<symbol value="ten"/>
</f>
</fs>

MDH will then search for entries with this <fs> to build a test column for the table of numerical expressions. We can subsequently add more countingType values to the feature system, and to the entries for the appropriate lexical suffixes with classifier functions, and generate more columns for the table.
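
For reference, a minimal XSLT sketch of that search. It assumes the entries are TEI <entry> elements with xml:id attributes in the TEI namespace; the layout around the <fs> is likewise an assumption:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:tei="http://www.tei-c.org/ns/1.0">
<xsl:output method="text"/>
<!-- Print the xml:id of every entry whose feature structure carries countingType = "ten": one id per line, the raw material for a column. -->
<xsl:template match="/">
<xsl:for-each select="//tei:entry[.//tei:f[@name = 'countingType']/tei:symbol/@value = 'ten']">
<xsl:value-of select="@xml:id"/>
<xsl:text>&#10;</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>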

30/06/14

Permalink 10:10:36 am, by skell, 123 words, 868 views   English (CA)
Categories: Announcements; Mins. worked: 10

Editing procedure after autohyphenation

Now that I have (almost) worked through the affix list and autohyphenated as many affixes as possible, I'm finding that editing the alphabetical files goes significantly more quickly. This is my editing procedure now:

-edit root entry

-autohyphenate all instances of that root in complex words in the file

-first pass: skim through all entries with that root and clean them up as best I can - mainly tagging any remaining morphemes and making sure gloss tags are placed properly in defs

-second pass: check that everything on the Lexware printout is present in the entries, and correct any autohyphing errors.

I could consider NOT even looking at the filecards from here on, to limit my obsessive need to proofread in triplicate. :-)

08/10/13

Permalink 02:54:40 pm, by skell, 117 words, 917 views   English (CA)
Categories: Announcements; Mins. worked: 0

Hyphs for compound lexical suffixes

I just happened to notice that the hyphs for compound lexical suffixes were commented out along with the hyphs of monomorphemic entries. There are only five compound lexical suffixes, so I fixed them manually. They should look like this:

<hyph>=<m corresp="m:qin m:wil">qnwíl</m></hyph>

They were commented out because they only have one <m> tag within their <hyph>s, but they are actually bimorphemic, with two @corresp values.

Note to selves that "monomorphemic" doesn't mean just "hyph with a single <m> child", but rather "hyph with a single <m> child with only one @corresp value".
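
A sketch of a check that would catch such cases automatically (again assuming the TEI namespace): flag every hyph with a single <m> child whose @corresp nonetheless holds two or more pointers:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:tei="http://www.tei-c.org/ns/1.0">
<xsl:output method="text"/>
<!-- A hyph is truly monomorphemic only if its lone <m> has a single-pointer @corresp; report the ones that do not. -->
<xsl:template match="/">
<xsl:for-each select="//tei:hyph[count(tei:m) = 1][contains(normalize-space(tei:m/@corresp), ' ')]">
<xsl:value-of select="tei:m"/>
<xsl:text>&#10;</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>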

19/03/10

Permalink 10:38:51 am, by skell, 38 words, 772 views   English (CA)
Categories: Activity log, Announcements; Mins. worked: 15

test file

In the tei_xml folder, I have added a file called qw-glot-test.xml.

It contains three entries copied from qw-glot.xml which should be good test cases for all the transformations we hope to be able to do.

16/11/06

Permalink 04:29:53 pm, by mholmes, 978 words, 718 views   English (CA)
Categories: Activity log, Tasks, Announcements; Mins. worked: 60

Project plan

  • Phase 1: Rescuing the old data (basically complete)

    The initial task was to retrieve the original data from its DOS/WordPerfect/Lexware form. The bulk of this work was done by Greg, with the assistance of a piece of software (Transformer) written by me to make certain complicated search-and-replace operations easier.

    • Convert the binary files to text (a process of converting escape characters and WordPerfect codes to predictable, readable text-strings).
    • Identify each of the escape sequences used to represent non-ASCII characters and character/diacritic combinations, and select an appropriate Unicode representation for each.
    • Implement search-and-replace operations in sequence to convert the data to Unicode.
    • Use the Lexware Band2XML converter (http://www.ling.unt.edu/~montler/Convert/Band2xml.htm) to turn the original data into rudimentary XML markup.
  • Phase 2: XML Encoding
    (ongoing: this ball is currently in Ewa's court)

    • Decide on a suitable XML format for the data. The requirements were:
      • portability (format must be easily parsed and transformed)
      • efficiency (data should not have to be duplicated; for example, the same information about the same item should not have to be encoded in two places, both as an independent entry and as a nested component of another entry).
      • standards-compliance (format must be based on an existing, well-accepted and documented standard; we don't want to have to rescue it again in future).

      We chose TEI P5 (http://www.tei-c.org/P5/), and we decided to avoid all nesting and do all linking through xml:id attributes. We also decided that each entry would be marked up in such a way as to break it down into individual morphemes, each of which would be linked through xml:id to the entry for that morpheme. In this way, most feature information for most entries need not be encoded at all, because it can be retrieved from the entries of the morphemes that constitute it. This makes the encoding simpler and cleaner, offloading much of the work onto the XML database that will store and handle the data. (A hypothetical sketch of this linking appears after the plan.)

    • Devise a method for migrating the rudimentary Band2XML data to the new format. This was achieved using a two-stage XSLT transformation:
      • Un-nesting all the entries. Nested entries were extracted and made into siblings, and derivations encoded as part of main entries were also split off into separate entries. After this stage, all entries are siblings at the same level.
      • Elaboration. The rudimentary entry information from the Lexware bands was expanded to produce a more elaborate and explicit TEI P5 structure; xml:ids, transcriptions and morphological breakdowns were created by the XSLT based on the original data, and where appropriate, linguistic descriptors were added using the TEI feature structure system (http://www.tei-c.org/release/doc/tei-p5-doc/html/FS.html).
    • Check and correct the results, based on the original printed/handwritten data (this is the real work!).
  • Phase 3: Storage and Presentation

    • The data will be stored in an XML database, from which all presentation forms will be created algorithmically. The database system we're using is eXist (http://www.exist-db.org/), an open-source native XML database system which we have used for some years. This project will eventually use the next-generation version of eXist (1.1 or 1.2), which is currently in development, but pilot work is being done with the 1.0 beta version.
    • The interface to the data will be built in Cocoon, an open-source XML publishing framework which provides a good basis for browser-based interaction with the eXist XML database.
    • The first output format we aim at will be a browser-based system which works like this:
      • A list of headwords is retrieved through a search.
      • Clicking on a headword retrieves the full entry for that item, which is inserted into the page. This is done on-the-fly using AJAX/XMLHttpRequest, which sends a query to the XML database; the database responds with a block of XML, which Cocoon converts to XHTML through an XSLT transformation before sending it back to the page; the page then inserts the data into the appropriate place.
      • Each morpheme in an entry is itself a link, and clicking on it retrieves the data for that morpheme's entry, which is then inserted into the page. Thus a kind of expanding tree is generated, and an entry can be expanded until it contains all the information on all its constituent morphemes.
      • Each morpheme entry includes a "see also" link which would call back through AJAX to retrieve links to all other items in the database which include that morpheme. Clicking on one of these would retrieve its entry and insert it.
    • The search functionality basically works like this:

      • There is a drop-down list of "what fields to search": Orthography, Transcription or English.
      • If you choose one of the first two, you get a button bar to enter special characters.
      • The choice also determines which fields will be searched.
      • A checkbox also allows fuzzy searching. This involves translating IPA to plain Latin, and searching against plain-Latin conversions of the fields searched. This will probably require the construction (through XSLT, on the fly) of parallel indexes of plain-Latin fields, so that it can be relatively fast; a minimal sketch appears after this plan.
      • Searches will pull back results which can be added selectively to your "notebook"; you can then take your notebook to the "print shop" where you can configure a page (accessible through URL) or a PDF printable output to create a vocab sheet or mini-dictionary.
    • Other projected output formats include:

      • Print dictionary (PDFs generated through XSLT -> XSL-FO and converted to PDF using the RenderX XEP engine). We envisage linguistic dictionaries as well as dictionaries aimed at language-learners.
      • Printable wordlists (Nxa'amxcin to English and English to Nxa'amxcin).
      • Graphical navigation devices for browsing the data, such as KirrKirr, the Java-based dictionary interface developed by Christopher Manning (http://www-nlp.stanford.edu/kirrkirr/doc/ach-allc2000-ver5-single.pdf).
  • Phase 4: Media

    We plan to integrate audio and visual media into the database in the future.
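
As promised under Phase 2, here is a hypothetical sketch of how the xml:id linking might play out between a complex word and one of its constituent morphemes. Every element name beyond <hyph> and <m>, and all feature content, is an illustrative assumption, not a description of the actual files:

<!-- A bimorphemic word: each pointer in @corresp leads to the entry for one constituent morpheme, so its features need not be repeated here. -->
<entry xml:id="qnwil">
<form><orth>qnwíl</orth></form>
<hyph>=<m corresp="m:qin m:wil">qnwíl</m></hyph>
</entry>

<!-- The morpheme's own entry carries its features exactly once; the pointer "m:qin" is assumed to resolve to this xml:id. -->
<entry xml:id="qin">
<form><orth>qin</orth></form>
<fs>
<f name="baseType">
<symbol value="affix"/>
</f>
</fs>
</entry>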

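And for the fuzzy-search step under Phase 3, a minimal sketch of an on-the-fly plain-Latin index. The <orth> element, the TEI namespace, and the handful of character mappings are illustrative assumptions, not the real inventory:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:tei="http://www.tei-c.org/ns/1.0">
<!-- Pair each entry's xml:id with a plain-Latin folding of its first orthographic form; translate() maps á to a, í to i, ɬ to l, and deletes ʔ. -->
<xsl:template match="/">
<index>
<xsl:for-each select="//tei:entry">
<item ref="{@xml:id}">
<xsl:value-of select="translate(.//tei:orth[1], 'áíɬʔ', 'ail')"/>
</item>
</xsl:for-each>
</index>
</xsl:template>
</xsl:stylesheet>
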
Nxaʔamxcín (Moses) Dictionary Blog

This is an XML dictionary project based primarily on the materials compiled by the late M. Dale Kinkade during fifteen years of work in the 1960s and 1970s with more than a dozen native speakers of the language; it also includes materials compiled by Ewa Czaykowska-Higgins in the early 1990s.
