After some thought, we made a small change to our decision on how phonemic renderings should be encoded. The phonemic rendering will now look like this:
<form>
  <pron>
    <seg type="phonemic">blah</seg>
    <seg type="narrow">blah</seg>
  </pron>
  ...
</form>
Important points to note:
- There will always be a seg type="phonemic".
- It will always be the first seg in a pron element.
- Where there is an attested phonetic transcription from the field, that will be included with seg type="narrow"; if there is no such seg, that means there is no attested phonetic transcription.
Met with Ewa, and we made the following decisions:
- We'll use the comma-above rather than the hook-above, because community-friendly forms are more important than strict phonetic accuracy.
- All entries will have a form/pron/seg element with the attribute type="phonemic"; it will be the first seg in the pron element, and will constitute the phonemic/orthographic form which is displayed, and on which people will search.
- The search system will allow the use of all characters in the phonemic/orthographic markup, as well as plain-ascii versions of the entries (=phon/orth minus diacritics).
- We will try to generate the plain ascii versions for searching automatically, or (even better) find a way to configure searching so that it can do fuzzy matching against diacritic-free versions.
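As a sketch of what the automatic generation might look like (the function name is my own, and eXist-side configuration may end up doing this differently): Unicode NFD decomposition separates base characters from their combining diacritics, which can then simply be dropped.

```python
import unicodedata

def strip_diacritics(form):
    """Reduce a phonemic/orthographic form to a plain-ascii search key
    by decomposing it (NFD) and dropping all combining marks."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("ṣọ̀lạ̀mén"))  # -> solamen
```

Note that characters like the glottal stop are not combining marks, so they would need their own mapping (to an apostrophe, say) on top of this.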
- The dicteg elements will need rewriting. This is what one currently looks like:
<dicteg type="narrow">
  <cit>
    <quote>[blah]<gloss>He hasn't eaten.</gloss></quote>
    <bibl>W2.72</bibl>
  </cit>
</dicteg>

The target form looks like this:
<dicteg>
  <cit>
    <quote>
      <phr type="phonemic">[blah]</phr>
      <phr type="narrow">[blah]</phr>
      <gloss>He hasn't eaten.</gloss>
    </quote>
    <bibl>W2.72</bibl>
  </cit>
</dicteg>

This will make for a more community-friendly display, and mean that we can search the examples in the same way as we search the entries.
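Since the rewrite is mechanical, something along these lines could migrate the existing dictegs (a sketch only: it assumes the old quote text is the narrow transcription, and it leaves the phonemic phr empty, to be filled in during checking):

```python
import xml.etree.ElementTree as ET

OLD = ('<dicteg type="narrow"><cit><quote>[blah]'
       "<gloss>He hasn't eaten.</gloss></quote>"
       '<bibl>W2.72</bibl></cit></dicteg>')

def rewrite_dicteg(xml_text):
    """Move the type attribute off dicteg and wrap the transcription
    in phr elements, with the phonemic phr first."""
    dicteg = ET.fromstring(xml_text)
    narrow_type = dicteg.attrib.pop("type", "narrow")
    quote = dicteg.find("cit/quote")
    transcription = (quote.text or "").strip()
    quote.text = None
    narrow = ET.Element("phr", type=narrow_type)
    narrow.text = transcription
    quote.insert(0, narrow)
    quote.insert(0, ET.Element("phr", type="phonemic"))
    return ET.tostring(dicteg, encoding="unicode")

print(rewrite_dicteg(OLD))
```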
The list posted previously was missing many characters which already appear in the database; I sent a list to Ewa, who mapped each of the missing characters to existing characters as follows:
- i-bar → schwa
- a-acute → a
- a-acute with dot below → a-dot
- a-grave → a
- c-wedge → c
- i-acute → i
- i-acute with dot below → i-dot
- e → i
- e-acute → i
- e-acute with modifier schwa → i
- schwa-acute → schwa
- schwa-grave → schwa
- o → u
- o with dot below → u-dot
- o-acute → u
- o-grave → u
- s with modifier schwa → s
- u-acute → u
- u-acute with dot below → u-dot
- glottal stop → glottal stop
- glottal stop with modifier a → glottal stop (glottal stop is the first letter of the alphabet)
I'm not sure what this equivalence signifies, but it would certainly complicate the search process. For instance, one entry in the database is this:
ṣọ̀lạ̀mén
in other words:
s + dot below; o + dot below + grave; l; a + dot below + grave; m; e + acute; n
Now, we can assume that people are going to want to search for forms they actually see, so we'll have to have buttons for all of those character/diacritic combos, otherwise people won't be able to enter the actual form; at the same time, some people will want to search without the diacritics so we'll have to be able to map the form to this:
solamen
According to the aliases above, though, we'll also have to map:
-the second character to: u with dot or u (because o+dot = u+dot, and u+dot has to allow u);
-the fourth character to a with dot (disregarding the grave)
-the sixth character to i.
Combinatorially, we now have a total of:
[s]: 2 possibilities
[o]: 4 possibilities
[l]: 1
[a]: 3
[m]: 1
[e]: 3
[n]: 1
= 2 * 4 * 1 * 3 * 1 * 3 * 1 = 72 variations that we have to allow for, just for one word. I'm wondering if this isn't completely over the top. What does it actually mean to say that e-acute = i? If the database has e-acutes, where would the i's come from, and how would the user know about them?
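The count can be checked mechanically; the per-character alternative counts below are read off the alias list above, and are an assumption about how the search expansion would have to work:

```python
from math import prod

# alternatives per character of ṣọ̀lạ̀mén under the alias list
# (e.g. ọ̀ could be searched as o-dot-grave, o, u-dot or u)
alternatives = {"ṣ": 2, "ọ̀": 4, "l": 1, "ạ̀": 3, "m": 1, "é": 3, "n": 1}
print(prod(alternatives.values()))  # -> 72
```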
Wouldn't it be better to standardize on the form which is actually in the database, and for the purposes of searching, also map that to a simple ascii representation which consists of the same form but with all diacritics stripped off? In other words, wouldn't it be best to say that the user should either search specifically for this:
ṣọ̀lạ̀mén
or, more likely, would search for this:
solamen
and possibly find a small set of words, varying only by diacritics, amongst which they could choose?
I'm basically suggesting that each form be mapped in only two ways:
- completely, with all its diacritics intact
- stripped of all diacritics (with superscripts transformed to full forms, glottal changed to apostrophe, and something similar done with the reversed glottal).
Then the database can do two searches, one on the full form and one on the ascii-ized form. These searches can be fuzzy (using the fuzzy-search capabilities the db has built in), so there will still be some room for rough matching, but the number of actual operations will be reduced to a manageable level.
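A minimal sketch of the two-index approach (the names and data structures are my own; in the real system this would sit in eXist, and its built-in fuzzy matching would replace these exact lookups):

```python
import unicodedata
from collections import defaultdict

def ascii_key(form):
    """Strip combining diacritics via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_indexes(headwords):
    full = set(headwords)          # exact forms as they appear in the db
    plain = defaultdict(list)      # diacritic-free key -> original forms
    for hw in headwords:
        plain[ascii_key(hw)].append(hw)
    return full, plain

def search(term, full, plain):
    if term in full:                           # search on the full form first
        return [term]
    return plain.get(ascii_key(term), [])      # then on the ascii-ized form

full, plain = build_indexes(["ṣọ̀lạ̀mén"])
print(search("solamen", full, plain))  # -> ['ṣọ̀lạ̀mén']
```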
On the other hand, if the correspondences in your list (e-acute = i, etc.) actually amount to a form of orthography, and it's that orthography on which people will most likely want to search, then we should be entering orthographical forms directly into the database and using them as the presentational forms in the first place. However, my understanding was that there is no conventional orthography, and so I'm guessing that these correspondences must represent some kind of more abstract level of transcription; in which case, I'd suggest that we steer clear of them for the purposes of searching. I think people will really want to search either on what they have already seen in the database (perhaps looking up a form they found previously and wrote down), or on a rough ascii simplification without any diacritics.
In previous discussion, we planned a search system which would allow people to search on actual forms in the database, or using a simplified ascii representation which would make it easy to avoid struggling with diacritics. I'm now in the process of trying to create a taxonomy of glyphs and glyph-combos which may appear in the db, and figure out how to map them to simpler forms. This is my preliminary list, based on Ewa's handwritten note, in Unicode (so it may not appear correctly on the blog!):
Nxaʔamxcín Alphabetical Order:
ʔ a ạ c c̣ cˀ ə ə̣ h ḥ ḥʷ i ị k kˀ kʷ kˀʷ l ḷ lˀ ḷˀ ɬ ƛˀ m mˀ n nˀ p pˀ q qˀ qʷ qʷˀ r rˀ s ṣ t tˀ u ụ w w̓ x xʷ x̣ x̣ʷ y ỵ̓ ʕ ʕˀ ʕʷ ʕˀʷ
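Sorting by this alphabet needs a longest-match tokenizer, since many letters are multi-character glyph combinations (kˀʷ, x̣ʷ, and so on). A sketch, assuming input forms use exactly the glyph sequences listed above:

```python
ALPHABET = ("ʔ a ạ c c̣ cˀ ə ə̣ h ḥ ḥʷ i ị k kˀ kʷ kˀʷ l ḷ lˀ ḷˀ ɬ ƛˀ m mˀ "
            "n nˀ p pˀ q qˀ qʷ qʷˀ r rˀ s ṣ t tˀ u ụ w w̓ x xʷ x̣ x̣ʷ "
            "y ỵ̓ ʕ ʕˀ ʕʷ ʕˀʷ").split()
RANK = {glyph: i for i, glyph in enumerate(ALPHABET)}
LONGEST_FIRST = sorted(ALPHABET, key=len, reverse=True)

def sort_key(word):
    """Tokenize a word into alphabet glyphs (greedy longest match)
    and return the sequence of their ranks."""
    key, i = [], 0
    while i < len(word):
        for glyph in LONGEST_FIRST:
            if word.startswith(glyph, i):
                key.append(RANK[glyph])
                i += len(glyph)
                break
        else:
            i += 1  # character not in the alphabet: skipped (an assumption)
    return key

print(sorted(["m", "cˀ", "c", "ʔa"], key=sort_key))  # -> ['ʔa', 'c', 'cˀ', 'm']
```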
Phase 1: Rescuing the old data (basically complete)
The initial task was to retrieve the original data from its DOS/WordPerfect/Lexware form. The bulk of this work was done by Greg, with the assistance of a piece of software (Transformer) written by me to make certain complicated search-and-replace operations easier.
- Convert the binary files to text (a process of converting escape characters and WordPerfect codes to predictable, readable text-strings).
- Identify each of the escape sequences used to represent non-ascii characters and character/diacritic combinations, and select an appropriate Unicode representation for it.
- Implement search-and-replace operations in sequence to convert the data to Unicode.
- Use the Lexware Band2XML converter (http://www.ling.unt.edu/~montler/Convert/Band2xml.htm) to turn the original data into rudimentary XML markup.
Phase 2: XML Encoding
(ongoing: this ball is currently in Ewa's court)
- Decide on a suitable XML format for the data. The requirements were:
- portability (format must be easily parsed and transformed)
- efficiency (data should not have to be duplicated -- for example, the same information about the same item should not have to be encoded in two places, as an independent entry, and as a nested component of another entry).
- standards-compliance (format must be based on an existing, well-accepted and documented standard; we don't want to have to rescue it again in future).
We chose TEI P5 (http://www.tei-c.org/P5/), and we decided to avoid all nesting and do all linking through xml:id attributes. We also decided that each entry would be marked up in such a way as to break it down into individual morphemes, each of which would be linked through xml:id to the entry for that morpheme. In this way, most feature information for most entries need not be encoded at all, because it can be retrieved from the entries of the morphemes that constitute it. This makes the encoding simpler and cleaner, offloading much of the work onto the XML database that will store and handle the data.
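To illustrate the idea (the entry shapes and ids below are invented, not the actual TEI): feature information is stored once, on each morpheme's own entry, and a complex entry merely references those entries by id.

```python
# hypothetical mini-database: each entry stores its own features plus
# xml:id references to its constituent morphemes (ids invented here)
entries = {
    "m.root1": {"features": {"type": "root"}, "morphemes": []},
    "m.suf1": {"features": {"type": "suffix"}, "morphemes": []},
    "e.word1": {"features": {}, "morphemes": ["m.root1", "m.suf1"]},
}

def morpheme_features(entry_id, db):
    """Follow each xml:id reference and pull the feature structure
    from the morpheme's own entry, so nothing is encoded twice."""
    return [(ref, db[ref]["features"]) for ref in db[entry_id]["morphemes"]]

print(morpheme_features("e.word1", entries))
# -> [('m.root1', {'type': 'root'}), ('m.suf1', {'type': 'suffix'})]
```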
- Devise a method for migrating the rudimentary Band2XML data to the new format. This was achieved using a two-stage XSLT transformation:
- Un-nesting all the entries. Nested entries were extracted and made into siblings, and derivations encoded as part of main entries were also split off into separate entries. After this stage, all entries are siblings at the same level.
- Elaboration. The rudimentary entry information from the Lexware bands was expanded to produce a more elaborate and explicit TEI P5 structure; xml:ids, transcriptions and morphological breakdowns were created by the XSLT based on the original data, and where appropriate, linguistic descriptors were added using the TEI feature structure system (http://www.tei-c.org/release/doc/tei-p5-doc/html/FS.html).
- Check and correct the results, based on the original printed/handwritten data (this is the real work!).
Phase 3: Storage and Presentation
- The data will be stored in an XML database, from which all presentation forms will be created algorithmically. The database system we're using is eXist (http://www.exist-db.org/), an open-source native XML database system which we have used for some years. This project will eventually use the next-generation version of eXist (1.1 or 1.2), which is currently in development, but pilot work is being done with the 1.0 beta version.
- The interface to the data will be built in Cocoon, an open-source servlet container which provides a good basis for browser-based interaction with the eXist XML database.
- The first output format we aim at will be a browser-based system which works like this:
- A list of headwords is retrieved through a search.
- Clicking on a headword retrieves the full entry for that item, which is inserted into the page. This is done on-the-fly using AJAX/XMLHttpRequest, which sends a query to the XML database; the database responds with a block of XML, which Cocoon converts to XHTML through an XSLT transformation before sending it back to the page; the page then inserts the data into the appropriate place.
- Each morpheme in an entry is itself a link, and clicking on it retrieves the data for that morpheme's entry, which is then inserted into the page. Thus a kind of expanding tree is generated, and an entry can be expanded until it contains all the information on all its constituent morphemes.
- Each morpheme entry includes a "see also" link which would call back through AJAX to retrieve links to all the other items in the database which include that morpheme. Clicking on one of these would retrieve its entry and insert it.
The search functionality basically works like this:
- There is a drop-down list of "what fields to search": Orthography, Transcription or English.
- If you choose one of the first two, you get a button bar to enter special characters.
- The choice also determines which fields will be searched.
- A checkbox also allows fuzzy searching. This involves translating IPA to plain latin, and searching against plain-latin conversions of the fields searched. This will probably require the construction (through XSLT, on the fly) of parallel indexes of plain-latin fields, so that it can be relatively fast.
- Searches will pull back results which can be added selectively to your "notebook"; you can then take your notebook to the "print shop" where you can configure a page (accessible through URL) or a PDF printable output to create a vocab sheet or mini-dictionary.
- Other projected output formats include:
- Print dictionary (PDFs generated through XSLT -> XSL:FO and converted to PDF using the RenderX XEP engine). We envisage linguistic dictionaries as well as dictionaries aimed at language-learners.
- Printable wordlists (Nxa'amxcin to English and English to Nxa'amxcin).
- Graphical navigation devices for browsing the data, such as the KirrKirr Java model used by Christopher Manning (http://www-nlp.stanford.edu/kirrkirr/doc/ach-allc2000-ver5-single.pdf).
Phase 4: Media
We plan to integrate audio and visual media into the database in the future.
Met with Ewa to discuss the future of the project.
- Decided to set up the blog.
- Updated the plan to post it here for comment, so that we can later break it down into tasks.
- Discussed options for the search functions, and the possibility of user-generated vocabulary sheets.