Project plan
Phase 1: Rescuing the old data (basically complete)
The initial task was to retrieve the original data from its DOS/WordPerfect/Lexware form. The bulk of this work was done by Greg, with the assistance of a piece of software (Transformer) written by me to make certain complicated search-and-replace operations easier.
- Convert the binary files to text (a process of converting escape characters and WordPerfect codes to predictable, readable text-strings).
- Identify each of the escape sequences used to represent non-ASCII characters and character/diacritic combinations, and select an appropriate Unicode representation for each.
- Implement search-and-replace operations in sequence to convert the data to Unicode.
- Use the Lexware Band2XML converter (http://www.ling.unt.edu/~montler/Convert/Band2xml.htm) to turn the original data into rudimentary XML markup.
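The search-and-replace stage can be sketched roughly as follows. The escape codes and their Unicode targets below are invented for illustration; the real tables were built by inspecting the WordPerfect files.

```python
# Sketch of the escape-sequence conversion step. The escape codes and
# Unicode targets here are hypothetical stand-ins, not the actual
# WordPerfect codes used in the project.
ESCAPE_MAP = [
    ("<esc>s-wedge", "\u0161"),   # s with caron: š
    ("<esc>glottal", "\u0294"),   # glottal stop: ʔ
    ("<esc>barred-l", "\u026c"),  # voiceless lateral fricative: ɬ
]

def to_unicode(text: str) -> str:
    """Apply the replacements in a fixed order, longest pattern first,
    so that overlapping codes cannot clobber one another."""
    for code, char in sorted(ESCAPE_MAP, key=lambda p: -len(p[0])):
        text = text.replace(code, char)
    return text

print(to_unicode("<esc>glottal a <esc>s-wedge"))
```

Running the replacements in a fixed, longest-first order is what makes the operation predictable when one escape code is a prefix of another.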
Phase 2: XML Encoding
(ongoing: this ball is currently in Ewa's court)
- Decide on a suitable XML format for the data. The requirements were:
- portability (format must be easily parsed and transformed)
- efficiency (data should not have to be duplicated; for example, the same information about an item should not be encoded in two places, both as an independent entry and as a nested component of another entry)
- standards-compliance (format must be based on an existing, well-accepted and documented standard; we don't want to have to rescue it again in future).
We chose TEI P5 (http://www.tei-c.org/P5/), and we decided to avoid all nesting and do all linking through xml:id attributes. We also decided that each entry would be marked up in such a way as to break it down into individual morphemes, each of which would be linked through xml:id to the entry for that morpheme. In this way, most feature information for most entries need not be encoded at all, because it can be retrieved from the entries of the morphemes that constitute it. This makes the encoding simpler and cleaner, offloading much of the work onto the XML database that will store and handle the data.
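A toy illustration of the linking scheme, with invented element names and ids (the real markup is TEI P5): a compound entry carries only pointers to its constituent morphemes, and feature information is recovered by following those pointers to the morphemes' own entries.

```python
# Hypothetical sketch of xml:id linking: element names, ids and glosses
# are invented; the actual encoding is TEI P5.
import xml.etree.ElementTree as ET

XML = """<body>
  <entry xml:id="m1"><form>kn</form><gloss>1SG</gloss></entry>
  <entry xml:id="m2"><form>wik</form><gloss>see</gloss></entry>
  <entry xml:id="e1">
    <form>kn-wik</form>
    <m corresp="#m1"/><m corresp="#m2"/>
  </entry>
</body>"""

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def glosses(body, entry_id):
    """Follow each morpheme pointer back to its entry and collect glosses."""
    by_id = {e.get(XML_ID): e for e in body.iter("entry")}
    entry = by_id[entry_id]
    return [by_id[m.get("corresp").lstrip("#")].findtext("gloss")
            for m in entry.iter("m")]

body = ET.fromstring(XML)
print(glosses(body, "e1"))  # glosses recovered entirely via the links
```

Note that entry e1 itself stores no glosses: that is the "offloading" described above, with the database expected to resolve the pointers at query time.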
- Devise a method for migrating the rudimentary Band2XML data to the new format. This was achieved using a two-stage XSLT transformation:
- Un-nesting all the entries. Nested entries were extracted and made into siblings, and derivations encoded as part of main entries were also split off into separate entries. After this stage, all entries were siblings at the same level.
- Elaboration. The rudimentary entry information from the Lexware bands was expanded to produce a more elaborate and explicit TEI P5 structure; xml:ids, transcriptions and morphological breakdowns were created by the XSLT based on the original data, and where appropriate, linguistic descriptors were added using the TEI feature structure system (http://www.tei-c.org/release/doc/tei-p5-doc/html/FS.html).
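The actual migration was written in XSLT; purely to make the un-nesting stage concrete, here is a Python sketch of stage one applied to a toy structure:

```python
# Stage one of the migration (un-nesting), sketched in Python on a toy
# structure. The real transformation was an XSLT stylesheet over the
# Band2XML output.
import xml.etree.ElementTree as ET

def unnest(body):
    """Lift every <entry> nested inside another <entry> up to the top
    level, placing each parent before its former children."""
    flat = []
    def visit(entry):
        nested = [c for c in entry if c.tag == "entry"]
        for c in nested:
            entry.remove(c)
        flat.append(entry)
        for c in nested:
            visit(c)
    for entry in list(body):
        body.remove(entry)
        visit(entry)
    body.extend(flat)
    return body

body = ET.fromstring(
    "<body><entry n='a'><entry n='b'/></entry><entry n='c'/></body>")
unnest(body)
print([e.get("n") for e in body])  # all entries are now siblings
```

After this pass every entry is a sibling at the same level, which is the precondition for the elaboration stage.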
- Check and correct the results, based on the original printed/handwritten data (this is the real work!).
Phase 3: Storage and Presentation
- The data will be stored in an XML database, from which all presentation forms will be created algorithmically. The database system we're using is eXist (http://www.exist-db.org/), an open-source native XML database which we have used for some years. This project will eventually use the next-generation version of eXist (1.1 or 1.2), which is currently in development, but pilot work is being done with the 1.0 beta version.
- The interface to the data will be built in Cocoon, an open-source XML publishing framework which runs in a servlet container and provides a good basis for browser-based interaction with the eXist XML database.
- The first output target will be a browser-based system which works like this:
- A list of headwords is retrieved through a search.
- Clicking on a headword retrieves the full entry for that item, which is inserted into the page. This is done on the fly using AJAX/XMLHttpRequest, which sends a query to the XML database; the database responds with a block of XML, which Cocoon converts to XHTML through an XSLT transformation before sending it back to the page; the page then inserts the data into the appropriate place.
- Each morpheme in an entry is itself a link, and clicking on it retrieves the data for that morpheme's entry, which is then inserted into the page. Thus a kind of expanding tree is generated, and an entry can be expanded until it contains all the information on all its constituent morphemes.
- Each morpheme entry includes a "see also" link, which calls back through AJAX to retrieve links to all other items in the database which include that morpheme. Clicking on one of these retrieves its entry and inserts it into the page.
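For concreteness, the request behind such an AJAX call might look like the following. The collection path and the XQuery itself are hypothetical; the `_query` request parameter is how an ad-hoc query is passed to eXist's REST interface.

```python
# What an XMLHttpRequest call amounts to on the wire: a GET against the
# database with an XQuery in eXist's _query parameter. The collection
# path and query below are hypothetical.
from urllib.parse import urlencode

def entry_query_url(base: str, morpheme_id: str) -> str:
    """Build a REST URL that asks eXist for the entry with a given xml:id."""
    xquery = f'//entry[@xml:id = "{morpheme_id}"]'
    return base + "?" + urlencode({"_query": xquery})

url = entry_query_url("http://localhost:8080/exist/rest/db/dict", "m1")
print(url)
```

In the running system Cocoon sits between the page and the database, transforming the XML that comes back into XHTML before it reaches the browser.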
The search functionality basically works like this:
- There is a drop-down list of "what fields to search": Orthography, Transcription or English.
- If you choose one of the first two, you get a button bar to enter special characters.
- The choice also determines which fields will be searched.
- A checkbox also allows fuzzy searching. This involves translating IPA to plain Latin and searching against plain-Latin conversions of the searched fields. It will probably require the construction (through XSLT, on the fly) of parallel indexes of plain-Latin fields, so that searching can be relatively fast.
- Searches will pull back results which can be added selectively to your "notebook"; you can then take your notebook to the "print shop", where you can configure a web page (accessible through a URL) or a printable PDF to create a vocab sheet or mini-dictionary.
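The fold-to-plain-Latin step behind fuzzy searching might be sketched as follows. The mapping for symbols that have no Unicode decomposition is invented here; the real table would be derived from the project's character inventory.

```python
# Sketch of folding IPA/diacritic-laden text to a plain-Latin search key.
# The IPA_FOLD table is hypothetical; the project's actual inventory
# would drive the real mapping.
import unicodedata

IPA_FOLD = {"\u0294": "'", "\u026c": "l", "\u03bb": "l"}  # ʔ, ɬ, λ

def fold(text: str) -> str:
    """Strip diacritics via NFD decomposition, then map remaining
    special symbols to plain-Latin approximations."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(IPA_FOLD.get(c, c) for c in stripped).lower()

print(fold("\u0161\u0294\u00e1"))  # šʔá folded to a plain-Latin key
```

Precomputing these folded keys as a parallel index (rather than folding at query time) is what keeps the fuzzy search relatively fast.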
- Other projected output formats include:
- Print dictionaries (PDFs generated through an XSLT transformation to XSL-FO and converted to PDF using the RenderX XEP engine). We envisage linguistic dictionaries as well as dictionaries aimed at language learners.
- Printable wordlists (Nxa'amxcin to English and English to Nxa'amxcin).
- Graphical navigation devices for browsing the data, such as the KirrKirr Java model used by Christopher Manning (http://www-nlp.stanford.edu/kirrkirr/doc/ach-allc2000-ver5-single.pdf).
Phase 4: Media
We plan to integrate audio and visual media into the database in the future.