Now that the site has been fully ported over to Lettuce, the old URL on Mustard is obsolete. I added a redirect to the Mustard sitemap pointing to the new site on Lettuce.
Nx has a specific sort order for its alphabet, which doesn't match the results you would get from a normal alphabetical sort. For instance, the glottal stop comes at the beginning and the reversed glottal at the end, whereas a default XSLT alpha sort, like a Unicode codepoint sort, places the glottal after the y.
In order to get sorting to work the way we need it to, we'll have to define a custom sort collation. Since we're using XSLT 2.0 and Saxon 8, we can do that by following the instructions on the Saxon site for implementing a collation sequence. What we have to do is write a Java class which implements the java.util.Comparator interface. This IBM page has more details of the interface.
Because Java isn't my thing, I'm now looking around for a simple example I can modify. So far I've only found this, which is too basic to be much use. It shows a simple example of a case-insensitive comparator:
public class CaseInsensitiveComparator implements java.util.Comparator {
    public int compare(Object o1, Object o2) {
        String s1 = o1.toString().toUpperCase();
        String s2 = o2.toString().toUpperCase();
        return s1.compareTo(s2);
    }
}
but it hands the real work off to the default string comparator. What we need is an example of an efficient way to compare individual characters. One possibility is to have a string comparator which also embodies a character comparator; the character comparator would tell you which of two characters sorts first, and the string comparator would call it for each character in the two strings until it found an inequality. I have no idea if I'm on the right track here, though, so I'll keep researching for the moment. This is something that will be called frequently, so it has to be pretty fast.
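To make the idea concrete, here's a minimal sketch of what I have in mind. The class name and the alphabet string are placeholders (the real alphabet would spell out the full Nx sort order, glottal stop first and reversed glottal last), and this isn't yet wired into Saxon's collation mechanism:

import java.util.Comparator;

public class NxComparator implements Comparator {

    // Placeholder alphabet: the real one would list every Nx character in
    // sort order, glottal stop (U+0294) first and reversed glottal (U+0295) last.
    private static final String ALPHABET =
        "\u0294abcdefghijklmnopqrstuvwxyz\u0295";

    public int compare(Object o1, Object o2) {
        String s1 = o1.toString();
        String s2 = o2.toString();
        int len = Math.min(s1.length(), s2.length());
        for (int i = 0; i < len; i++) {
            int r1 = rank(s1.charAt(i));
            int r2 = rank(s2.charAt(i));
            if (r1 != r2) {
                return r1 - r2; // first differing character decides the order
            }
        }
        return s1.length() - s2.length(); // a prefix sorts before a longer string
    }

    // Characters missing from the alphabet sort after everything else.
    private static int rank(char c) {
        int i = ALPHABET.indexOf(c);
        return (i < 0) ? ALPHABET.length() : i;
    }
}

If this is the right track, the indexOf() lookup should probably be replaced with a precomputed rank table, since compare() gets called constantly during a sort; and a plain char-by-char walk won't handle combining diacritics or multi-character units without more thought.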
I've finally had a chance to look at your questions in detail:
1. In the inchoative entry one of the allomorphs (the glottal stop) is an infix, while the other allomorph (the -p) is a suffix. In the feature structure it is possible to give two different symbol values for type of morpheme, so I used this possibility to list the inchoative morpheme as being both an infix and a suffix. • Question for Martin: But the question that I have is how can we indicate which allomorph is an infix and which is a suffix in the database? Does this pose any kind of problem for the database?
This is a question I hadn't thought about before, because it hadn't occurred to me that there would be two different morpheme types for one morpheme. However, a relatively simple solution suggests itself:
<form type="allomorph" n="1"> ... </form> <form type="allomorph" n="2"> ... </form> ... <fs> <f name="baseType"> <symbol value="infix" n="1"/> <symbol value="suffix" n="2" /> </f> ... </fs>
Then I can write code to detect the presence of the n attributes, and link the correct form to the correct symbol value. I've added this to the documentation, and I've also posted a link to the documentation on the site.
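Roughly, the detection and linking logic would look something like this (a Java/DOM sketch for illustration only, with hypothetical names; the real code will presumably live in the XSLT layer):

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class AllomorphLinkSketch {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(args[0]);
        XPath xp = XPathFactory.newInstance().newXPath();
        // Each symbol carrying an n attribute names one morpheme type...
        NodeList symbols = (NodeList) xp.evaluate(
                "//fs/f[@name='baseType']/symbol[@n]", doc, XPathConstants.NODESET);
        for (int i = 0; i < symbols.getLength(); i++) {
            Element symbol = (Element) symbols.item(i);
            String n = symbol.getAttribute("n");
            // ...and is linked to the allomorph form with the matching n.
            Element form = (Element) xp.evaluate(
                    "//form[@type='allomorph'][@n='" + n + "']",
                    doc, XPathConstants.NODE);
            System.out.println("allomorph " + n + " ("
                    + form.getTextContent().trim() + ") is a "
                    + symbol.getAttribute("value"));
        }
    }
}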
2. This has to do with markup of glosses in illustrations (dictegs). • Question for Martin: Do we want the English-Nx wordlist to be able to access illustration glosses, and if so, how do we mark this up? Can we use the same system of segs and glosses?
My original intention was that the gloss tag would be used inside a <def><seg> tag to signal a word or phrase which could be used to create the English-Nx wordlist, and that the wordlist would be constructed only based on <gloss> tags occurring in that context. In the <dicteg> tags, we're using <gloss> for something else:
If a gloss for the illustration is required, it can be included in the <quote> tag with a <gloss> tag, like this:

<cit>
  <quote>The quoted illustration<gloss>Translation of the illustration</gloss></quote>
  ...
</cit>

(from our guidelines)
Therefore it seems to me that using <gloss> in a different way inside <dicteg> will be confusing. I took a look at s-rtr.xml, and I found some bits that look like this:
<cit><!--check stress on this one-->
  <quote>
    <phr type="phonemic">ni?c'ikus ??p?i??</phr>
    <phr type="narrow">ne?c?ikos ??p?l?</phr>
    <seg>whole wheat <gloss>flour</gloss></seg>
  </quote>
  <bibl>Y41.7</bibl>
</cit>
This doesn't look anything like the guidelines, so I'm wondering what happened here. Was this based on the code already in the file, or did you construct this format with <phr> and <seg> tags?
It seems to me that if there's a word or phrase that can serve as a direct English equivalent to the headword appearing in an illustration, it might as well be in the <def> element, wrapped in a <gloss> tag; is there any good reason to take material from the illustrations for the English-Nx glossary?
3. Crossreferences: Here are the two different ways of doing cross-references. They are from the same entry, -ʔ-/-p ‘Inchoative’. Note that the format in (a) does not provide a gloss for the cross-reference. Presumably this is because the gloss is meant to be determined by looking at the entry of the word that is referred to in the cross-reference. The effect of xr is to point to the xml:id, and thus the entry, of the referred-to word. The format in (b) does provide a gloss, but does not point to the entry of the referred-to word. I assume that (a) is actually the format that we want to be following, but your input is needed here, Martin.

(a)
<dicteg>
  <cit>
    <quote>s-vt?a+?+x-m<gloss>it is getting sweet</gloss>
    </quote>
    <bibl><!--[No source]--></bibl>
  </cit>
  <xr>See <ref target="t??x">t??x</ref></xr>
</dicteg>

(b)
<dicteg>
  <cit>
    <quote>vk???-p<gloss>rope breaks</gloss>
      <note>cf. k?k'??'?n 'break a line'</note>
    </quote>
    <bibl><!--[No source]--></bibl>
  </cit>
</dicteg>
Our documentation shows this example:
<xr>See <ref target="idblah">Blah</ref> (English blah) and <ref target="idblah2">blah2</ref> (English blah2).</xr>
The intention is that the gloss, if needed, be simply in brackets. I think the structure quoted in your question is the result of the automatic conversion code doing the best it could with the source material; in this case, there was no gloss for the cross reference encoded with <xr>, and the second cross-reference was simply not encoded properly in the original source. If there's a difference between a link introduced by "See" and one introduced by "cf.", then we'll need to elaborate the tagging system a bit, but I suspect in this case the second cross-reference should be re-encoded using an <xr> tag.
The old location of the Moses project on the Mustard server is going obsolete, so we're moving all projects over to a newer Tomcat/Cocoon/eXist stack on the Lettuce server. I got stuck into that process today. The new site is here, and the old site will be pointed at it soon.
I immediately faced a major problem. The AJAX code which retrieves entry information from the server sends the id attribute as a GET request to the server. In previous projects we've had some problems with this when using characters between 127 and 255, but we worked around them by using the JavaScript escape() function. However, with these id attributes there's no hope of that: they have all sorts of characters above 255 in them.
After several hours of hacking around with the JavaScript, I eventually found a solution that could be implemented through a change to the Cocoon web.xml
file. I'll document that in detail on the Maintenance blog, since it's really a server configuration issue.
The main thing is that the new site is now working, and it gives us a lot of new opportunities in terms of performance, indexing and stability that were not there on the old server. Next, I'll be able to look in detail at the questions in Ewa's post below.
I fixed the errors in the s-rtr.xml file and then went back to working on the affix file. In the affix file I accomplished the following:
1. I corrected some feature structures at the beginning of the file which had errors.
2. I worked on restructuring the p entry. This entry was a mess because it had a lot of cross-references, and the conversion program placed all the cross-references at the beginning of the entry and not in the right dictegs; some of the cross-references also had their transcriptions and their glosses separated.
3. I moved the p entry into the glottal stop ‘Inchoative’ entry since p is an allomorph of the Inchoative.
Three issues arose:
1. In the ‘Inchoative’ entry one of the allomorphs (the glottal stop) is an infix, while the other allomorph (the -p) is a suffix. In the feature structure it is possible to give two different symbol values for type of morpheme, so I used this possibility to list the ‘Inchoative’ morpheme as being both an infix and a suffix.
• Question for Martin: But the question that I have is how can we indicate which allomorph is an infix and which is a suffix in the database? Does this pose any kind of problem for the database?
2. In the affix file there seem to be two different ways of doing cross-references: one way involves using the tag xr and the other involves using the note tag with dictegs and quote. I need to figure out what the difference is and whether there is a problem to be fixed here, or whether there is a consistent reason for the difference.
• Task for Ewa: How are cross-references used in dictegs?
3. The last question that arose has to do with markup of glosses in illustrations (dictegs).
• Question for Martin: Do we want the English-Nx wordlist to be able to access illustration glosses, and if so, how do we mark this up? Can we use the same system of segs and glosses?
I took a look at the file and it looks fine. I've uploaded it into the DB, and the list of words is a bit longer.
There are a couple of things I noticed. There's one asterisk still in there:
<gloss>he is real *gentle</gloss>
Also, there's one instance of a space at the beginning of a gloss element:
<seg>they are
<gloss>tame</gloss> or <gloss> gentle</gloss>
</seg>
which could be eliminated -- it'll save on processing time if we don't have to strip leading and trailing spaces from gloss tags when we process them.
There's one <m> element with no sameAs reference:
<m sameAs="">wílˀx</m>
which I presume is because you don't yet know what that reference would be, not having done all the affixes yet. This is a good reason for doing the affixes next, so that we can go back and fill any missing ones in before there are too many of them.
Finally, there are some things about the display of the entries which, now I look at them, I don't understand. The primary form (the headword on the web page) seems to be the <seg type="narrow"> form, rather than the phonemic form, while I suspect the sorting is being done based on the phonemic form. In addition, for an entry like "sweater" where the morpheme element can only point back to its own entry, the morpheme should not be a link; however, it is a link, and even worse, it doesn't point back to its own entry, it points to nothing.
So I'm adding this as a task for me to fix the sorting, headword display, and handling of morphemes which shouldn't link because they'd link only to themselves.
1. I have completed a version of the s-rtr.xml file.
2. Glosses:
Following Martin’s comments about marking up glosses, I have deleted all the asterisks (I think), and have tried to use seg and gloss tags in such a way as to ensure that the right words from the glosses will be singled out when creating the English-Moses wordlist.
3. Phonemic forms:
a) I have tried to be systematic about how I have indicated phonemic and narrow types. I think there is redundancy in what I have done, but the entries seem to me to be easier to read now. Martin, could you look over the file with redundancy in mind and tell me what you think?
b) In phonemicizing the dictegs, and the entries themselves, I have made the following decisions:
(i) The so-called ‘phonemic’ forms are not identical to underlying forms in all cases. For example, when a morpheme has more than one allomorph, one stressed, and one unstressed, and the unstressed allomorph involves changes in vowels, or some similar change, then I have stayed fairly close to the pronunciation of the unstressed variant in the ‘phonemic’ form. Thus, in the case of a suffix like =áw’s~=u?s, I am writing the unstressed variant as =u?s--writing =aw’s would make it hard for speakers/learners to know how to pronounce the form without learning complicated rules. If, however, in the phonetic form Kinkade has recorded =o?s, then I change the ‘o’ to ‘u’ in the phonemic form. Similarly for a morpheme like =míx~=mx~=ExW (E=schwa, W=raised w), I write =ExW in the phonemic form (rather than =mix, or =oxW).
(ii) I am leaving out all unstressed schwas in phonemic forms, except those that occur in reduplicative morphemes, those that occur in roots, and those that occur in a few suffixes (e.g., =ul’ExW), where they are unpredictable synchronically. This phonemic transcription thus differs from the orthographic representations found in the Nx Language Program dictionary edited by Nancy Mattina.
(iii) In loanwords, I am transcribing vowels as Kinkade transcribed them, even if they are not fully phonemic (e.g., spanyol I have phonemicized as spanyol, and not as spanyul). This decision is made to allow the spelling to reflect the loanword status and to make pronunciation more transparent.
(iv) In the s-rtr file, I have transcribed all the xml:id forms and the phonemic forms with an initial s-rtr, even if Kinkade did not transcribe them all that way. The reason for this is to aid in alphabetization.
4. Next task: I plan to work on the Affix file to try to finish it. The reason for this is that then all the affixes are available to be referred to in hyphs. Does that make sense to you, Martin?
I've read through what your post on entries and forms says, and it seems to me that the multiple form elements are redundant; if the forms are the same, then there only needs to be one form element. In other words, one form can have multiple senses, and one sense can have multiple forms; however, if it's a situation where there are two forms (a and b) and two senses (1 and 2), and form a only goes with sense 1, while form b only goes with sense 2, then they should be in separate entry elements.
Does that make sense?
The idea of the gloss tags was based on the original star idea: single words were starred, on the basis that they would be used to create a simple word-list in English. Phrases couldn't be starred, because there's no way to know when the phrase would end, so the original starring was restricted to single words. Our gloss tags are doing the same sort of thing, but now, if you want to put a phrase in a gloss tag, you're welcome to do that. However, bear in mind that what you're creating with a gloss tag is an entry in a simple English-Moses dictionary; if you wrap a phrase such as "a small cup" in a gloss tag, then that entry will show up under "a" in the E-M dictionary, which is presumably not what you want. In other words, the gloss tags reproduce exactly what was available in the original star system (they identify a single word that will be used to create an E-M wordlist); but they aren't restricted to a single word, as the starring system was.
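To make that concrete, here is a throwaway sketch of what a wordlist builder does with gloss tags (Java/DOM for illustration only; the real pipeline will be XSLT, and TEI namespace handling is glossed over). Every gloss becomes a sort key in the E-M list, which is exactly why a phrase like "a small cup" would file under "a":

import java.util.Iterator;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class WordlistSketch {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(args[0]);
        // Only glosses inside def/seg feed the E-M wordlist
        // (TEI namespace handling omitted for brevity).
        NodeList glosses = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//def/seg/gloss", doc, XPathConstants.NODESET);
        TreeSet keys = new TreeSet(); // alphabetical English sort keys
        for (int i = 0; i < glosses.getLength(); i++) {
            keys.add(glosses.item(i).getTextContent().trim());
        }
        for (Iterator it = keys.iterator(); it.hasNext();) {
            System.out.println(it.next()); // "a small cup" would print under "a"
        }
    }
}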
I've changed the markup documentation to use "phonemic" instead of "broad".
I've also edited Ewa's last-but-one post to add <pre></pre>
tags around the markup. This makes the linebreaks and indentation show up as expected in the output. You can use the "pre" button on the toolbar in the post editing window to do this. To make code show up as expected, first escape it with the &; button, then use the "pre" button on it.
<form>
  <pron>
    <seg type="phonemic">ṣə̣́nṣə̣nt</seg>
  </pron>
</form>
<sense>
  <def>
    <seg>
      <gloss>tame</gloss>
    </seg>
    <bibl>EP2.68.8</bibl>
  </def>
</sense>
<form>
  <pron>
    <seg type="phonemic">ṣə̣́nṣə̣nt</seg>
  </pron>
</form>
<sense>
  <def>
    <seg>it is <gloss>tame</gloss> or <gloss>gentle</gloss></seg>
    <bibl>JM3.21.11</bibl>
  </def>
</sense>

As far as the question about English glosses is concerned, I understand that the asterisks should be removed, and I will do that. But I don't entirely understand how the gloss function works. So if you look at the last example just above, the meaning is contained within two types of tags, the one referring to "segment" and the one referring to the gloss itself. I'm wondering if you could define for me exactly what the gloss tags should contain. In this case, for instance, you've got 'it is' and 'or' outside of the gloss tags. But they are also part of the meaning. However, they do not need to be targeted in an English-Nx rendition of the word-list. We would search for 'tame' or 'gentle', not 'it is tame or gentle'. I hope I am making myself clear. So if I go back to all the glosses and remove asterisks and add gloss tags, how exactly do I position the gloss tags?
This is my take on Ewa's questions in the preceding post:
1. I discovered that in some entries which have different forms the following occurs: we have several pron/seg cases, each of which is the same, and associated with each pron/seg we have a unique definition. The pron/segs are all phonemic, so in the past we did not mark them as type="phonemic", since they did not contrast with type="narrow". Should they be marked as type="phonemic"?
If I understand this correctly, the forms are the same in all cases, but the definitions are different. In this case, I suspect that they should be different entries, shouldn't they? If they're the same entry, then don't they need only one form, and multiple definitions?
Incidentally, in our documentation from last July, it says that we should be using "broad" vs "narrow", rather than "phonemic". Did we change our minds on this? The docs actually show that "broad" is the default, so you wouldn't need to add it. Could you read through the PDF and let me know if what's described there differs from the existing markup you're working on now?
2. We have at least two different ways of dealing with marking up glosses so as to create, eventually, English to Nx lists. First, we still have lots of cases which were marked up in Lexware with an asterisk. But we also have cases like the following definition, "it is tame or gentle", where it has been marked up with two "gloss" tags. Do we want to have a consistent way of marking up the meanings now, or shall we leave that task until a later date, given that there is already so much markup work to do?
My automatic conversion code should have converted any * items into gloss tags (* means nothing in the context of XML), but if you had already started work on this file when I added that feature, it wouldn't have been converted. The gloss tags need to be added, though.
Ewa is working on the s-rtr.xml file, and has noticed this:
"in a number of entries (e.g., saplil 'flour') there are several form elements, either because there is more than one phonetic transcription attested for the entry, or there is more than one source for the entry, or there is more than one gloss for the entry. The first form element in an entry like this will have a seg type="phonemic" and, if it is attested, a seg type="narrow". But should the remaining form elements have both seg types, when this means repeating the exact same seg type="phonemic" over and over again?"
I can certainly write the code so that in the absence of a phonemic element in a particular form element, it will look at preceding-sibling form elements until it finds one, and use that instead. If that's always going to be the right thing to do, it shouldn't be too hard to implement. Adding this as a task for April: check whether there is already any handling for this, and if not, implement it.
This went without a hitch. Uploaded the XML files (after a couple of hiccups till I realized I can't upload the biggest affix.xml file -- it's not finished or well-formed). Then Greg arranged to have the moses home folder moved out of tapor into home1t, and I created a Webapps directory in it, and pushed up the site materials. Worked out of the box.
The hyph form is currently expressed as a narrow transcription where one is attested, but as phonemic where it's not, without any indication of which is the case. It should actually be a broken-down rendering of the phonemic form. This change will need to be made throughout the markup already completed.
The existing hyphs were created automatically from the original data, where it was not specified whether the transcription was phonemic or attested-narrow, so there's no way to fix this mechanically. We'll continue rendering what's in the original data into m tags, but Ewa will look at each hyph when editing the data and, if it's narrow, will reformulate it as phonemic.
This change has not been made in any of the existing files yet. s-rtr.xml has been changed in other ways (adding of phonemic pron/seg elements), so it is the most up-to-date, but lacks any hyph updates as yet.
After some thought, we made a small change to our decision on how phonemic renderings should be encoded. The phonemic rendering will now look like this:
<form>
<pron>
<seg type="phonemic">blah</seg>
<seg type="narrow">blah</seg>
</pron>
...
</form>
Important points to note:
- There will always be a seg type="phonemic".
- It will always be the first seg in a pron element.
- Where there is an attested phonetic transcription from the field, that will be included with seg type="narrow"; if there is no such seg, that means there is no attested phonetic transcription.
Met with Ewa, and we made the following decisions:
- We'll use the comma-above rather than the hook-above, because community-friendly forms are more important than strict phonetic accuracy.
- All entries will have a form/pron/seg element with the attribute type="phonemic", which will be the first pron in the form element, and will constitute the phonemic/orthographic form which is displayed, and on which people will search.
- The search system will allow the use of all characters in the phonemic/orthographic markup, as well as plain-ascii versions of the entries (=phon/orth minus diacritics).
- We will try to generate the plain ascii versions for searching automatically, or (even better) find a way to configure searching so that it can do fuzzy matching against diacritic-free versions.
- The dicteg elements will need rewriting. This is what one currently looks like:
<dicteg type="narrow">
<cit>
<quote>[blah]<gloss>He hasn't eaten.</gloss>
</quote>
<bibl>W2.72</bibl>
</cit>
</dicteg>

The target form looks like this:
<dicteg>
<cit>
<quote>
<phr type="phonemic">[blah]</phr>
<phr type="narrow">[blah]</phr>
<gloss>He hasn't eaten.</gloss>
</quote>
<bibl>W2.72</bibl>
</cit>
</dicteg>

This will make for a more community-friendly display, and mean that we can search the examples in the same way as we search the entries.
The list posted previously was missing many characters which already appear in the database; I sent a list to Ewa, who mapped each of the missing characters to existing characters as follows:
i-bar = schwa
a-acute = a
a-acute with dot below = a-dot
a-grave = a
c-wedge = c
i-acute = i
i-acute with dot below = i-dot
e = i
e-acute = i
e-acute with modifier schwa = i
schwa-acute = schwa
schwa-grave = schwa
o = u
o with dot below = u-dot
o-acute = u
o-grave = u
s with modifier schwa = s
u-acute = u
u-acute with dot below = u-dot
glottal stop = glottal stop
glottal stop with modifier a = glottal stop (glottal stop is the first letter of the alphabet)
I'm not sure what this equivalence signifies, but it would certainly complicate the search process. For instance, one entry in the database is this:
ṣọ̀lạ̀mén
in other words:
s + dot below; o + dot below + grave; l; a + dot below + grave; m; e + acute; n
Now, we can assume that people are going to want to search for forms they actually see, so we'll have to have buttons for all of those character/diacritic combos; otherwise people won't be able to enter the actual form. At the same time, some people will want to search without the diacritics, so we'll have to be able to map the form to this:
solamen
According to the aliases below, though, we'll also have to map:
- the second character to u with dot or u (because o+dot = u+dot, and u+dot has to allow u);
- the fourth character to a with dot (disregarding the grave);
- the sixth character to i.
Combinatorially, we now have a total of:
[s]: 2 possibilities
[o]: 4 possibilities
[l]: 1
[a]: 3
[m]: 1
[e]: 3
[n]: 1
= 2 * 4 * 1 * 3 * 1 * 3 * 1 = 72 variations that we have to allow for, just for one word. I'm wondering if this isn't completely over the top. What does it actually mean to say that e-acute = i? If the database has e-acutes, where would the i's come from, and how would the user know about them?
Wouldn't it be better to standardize on the form which is actually in the database, and for the purposes of searching, also map that to a simple ascii representation which consists of the same form but with all diacritics stripped off? In other words, wouldn't it be best to say that the user should either search specifically for this:
ṣọ̀lạ̀mén
or, more likely, would search for this:
solamen
and possibly find a small set of words, varying only by diacritics, amongst which they could choose?
I'm basically suggesting that each form be mapped in only two ways:
- completely, with all its diacritics intact
- stripped of all diacritics (with superscripts transformed to full forms, glottal changed to apostrophe, and something similar done with the reversed glottal).
Then the database can do two searches, one on the full form, and one on the ascii-ized form. These searches can be fuzzy (implementing the fuzzy search capabilities the db has built-in), so there will still be some room for rough matching, but the number of actual operations will be reduced to a manageable level.
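For what it's worth, the diacritic-stripping half of that mapping is cheap to implement. Here's a minimal sketch using Unicode decomposition (class name hypothetical; java.text.Normalizer is Java 6, and ICU4J offers the same facility on older JVMs; the superscript and glottal substitutions would still need an explicit mapping pass):

import java.text.Normalizer;

public class AsciiizeSketch {

    // Decompose to base character + combining marks (NFD), then drop the
    // marks. Explicit substitutions (glottal to apostrophe, superscripts to
    // full forms) would have to be applied separately.
    public static String asciiize(String form) {
        String decomposed = Normalizer.normalize(form, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(asciiize("ṣọ̀lạ̀mén")); // prints: solamen
    }
}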
On the other hand, if the correspondences in your list (e-acute = i, etc.) actually amount to a form of orthography, and it's that orthography on which people will most likely want to search, then we should be entering orthographical forms directly into the database and using them as the presentational forms in the first place. However, my understanding was that there is no conventional orthography, and so I'm guessing that these correspondences must represent some kind of more abstract level of transcription; in which case, I'd suggest that we steer clear of them for the purposes of searching. I think people will really want to search either on what they have already seen in the database (perhaps looking up a form they found previously and wrote down), or on a rough ascii simplification without any diacritics.
In previous discussion, we planned a search system which would allow people to search on actual forms in the database, or using a simplified ascii representation which would make it easy to avoid struggling with diacritics. I'm now in the process of trying to create a taxonomy of glyphs and glyph-combos which may appear in the db, and figure out how to map them to simpler forms. This is my preliminary list, based on Ewa's handwritten note, in Unicode (so it may not appear correctly on the blog!):
Nxaʔamxcín Alphabetical Order:
ʔ a ạ c c̣ cˀ ə ə̣ h ḥ ḥʷ i ị k kˀ kʷ kˀʷ l ḷ lˀ ḷˀ ɬ ƛˀ m mˀ n nˀ p pˀ q qˀ qʷ qʷˀ r rˀ s ṣ t tˀ u ụ w w̓ x xʷ x̣ x̣ʷ y ỵ̓ ʕ ʕˀ ʕʷ ʕˀʷ
Phase 1: Rescuing the old data (basically complete)
The initial task was to retrieve the original data from its DOS/WordPerfect/Lexware form. The bulk of this work was done by Greg, with the assistance of a piece of software (Transformer) written by me to make certain complicated search-and-replace operations easier.
- Convert the binary files to text (a process of converting escape characters and WordPerfect codes to predictable, readable text-strings).
- Identify each of the escape sequences used to represent non-ascii characters and character/diacritic combinations, and select an appropriate Unicode representation for it.
- Implement search-and-replace operations in sequence to convert the data to Unicode.
- Use the Lexware Band2XML converter (http://www.ling.unt.edu/~montler/Convert/Band2xml.htm) to turn the original data into rudimentary XML markup.
Phase 2: XML Encoding (ongoing: this ball is currently in Ewa's court)
- Decide on a suitable XML format for the data. The requirements were:
- portability (format must be easily parsed and transformed)
- efficiency (data should not have to be duplicated -- for example, the same information about the same item should not have to be encoded in two places, as an independent entry, and as a nested component of another entry).
- standards-compliance (format must be based on an existing, well-accepted and documented standard; we don't want to have to rescue it again in future).
We chose TEI P5 (http://www.tei-c.org/P5/), and we decided to avoid all nesting and do all linking through xml:id attributes. We also decided that each entry would be marked up in such a way as to break it down into individual morphemes, each of which would be linked through xml:id to the entry for that morpheme. In this way, most feature information for most entries need not be encoded at all, because it can be retrieved from the entries of the morphemes that constitute it. This makes the encoding simpler and cleaner, offloading much of the work onto the XML database that will store and handle the data.
- Devise a method for migrating the rudimentary Band2XML data to the new format. This was achieved using a two-stage XSLT transformation:
- Un-nesting all the entries. Nested entries were extracted and made into siblings, and derivations encoded as part of main entries were also split off into separate entries. After this stage, all entries are siblings at the same level.
- Elaboration. The rudimentary entry information from the Lexware bands was expanded to produce a more elaborate and explicit TEI P5 structure; xml:ids, transcriptions and morphological breakdowns were created by the XSLT based on the original data, and where appropriate, linguistic descriptors were added using the TEI feature structure system (http://www.tei-c.org/release/doc/tei-p5-doc/html/FS.html).
- Check and correct the results, based on the original printed/handwritten data (this is the real work!).
Phase 3: Storage and Presentation
- The data will be stored in an XML database, from which all presentation forms will be created algorithmically. The database system we're using is eXist (http://www.exist-db.org/), an open-source native XML database system which we have used for some years. This project will eventually use the next-generation version of eXist (1.1 or 1.2), which is currently in development, but pilot work is being done with the 1.0 beta version.
- The interface to the data will be built in Cocoon, an open-source servlet container which provides a good basis for browser-based interaction with the eXist XML database.
- The first output format we aim at will be a browser-based system which works like this:
- A list of headwords is retrieved through a search.
- Clicking on a headword retrieves the full entry for that item, which is inserted into the page. This is done on-the-fly using AJAX/XMLHttpRequest, which sends a query to the XML database; the database responds with a block of XML, which Cocoon converts to XHTML through an XSLT transformation before sending it back to the page; the page then inserts the data into the appropriate place.
- Each morpheme in an entry is itself a link, and clicking on it retrieves the data for that morpheme's entry, which is then inserted into the page. Thus a kind of expanding tree is generated, and an entry can be expanded until it contains all the information on all its constituent morphemes.
- Each morpheme entry includes a "see also" link which would call back through AJAX to get a list of all other items in the database which include that morpheme. Clicking on one of these would retrieve its entry and insert it.
The search functionality basically works like this:
- There is a drop-down list of "what fields to search": Orthography, Transcription or English.
- If you choose one of the first two, you get a button bar to enter special characters.
- The choice also determines which fields will be searched.
- A checkbox also allows fuzzy searching. This involves translating IPA to plain latin, and searching against plain latin conversions of the fields searched. This will probably require the construction (through XSLT, on the fly) of parallel indexes of plain latin fields, so it can be relatively fast; there's a sketch of the idea after this list.
- Searches will pull back results which can be added selectively to your "notebook"; you can then take your notebook to the "print shop" where you can configure a page (accessible through URL) or a PDF printable output to create a vocab sheet or mini-dictionary.
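By way of illustration, here is a rough sketch of what such a parallel index amounts to (names hypothetical; in practice it would be generated through XSLT rather than Java, and superscripts and glottals need explicit substitutions on top of the diacritic stripping):

import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PlainIndexSketch {

    // Map each plain-latin key to the full forms that reduce to it, so a
    // diacritic-free search can pull back every matching full form.
    public static Map buildIndex(List fullForms) {
        Map index = new HashMap();
        for (int i = 0; i < fullForms.size(); i++) {
            String full = (String) fullForms.get(i);
            String plain = Normalizer.normalize(full, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}+", ""); // strip combining diacritics
            List hits = (List) index.get(plain);
            if (hits == null) {
                hits = new ArrayList();
                index.put(plain, hits);
            }
            hits.add(full);
        }
        return index;
    }
}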
- Other projected output formats include:
- Print dictionary (PDFs generated through XSLT -> XSL:FO and converted to PDF using the RenderX XEP engine). We envisage linguistic dictionaries as well as dictionaries aimed at language-learners.
- Printable wordlists (Nxa'amxcin to English and English to Nxa'amxcin).
- Graphical navigation devices for browsing the data, such as the KirrKirr Java model used by Christopher Manning (http://www-nlp.stanford.edu/kirrkirr/doc/ach-allc2000-ver5-single.pdf).
Phase 4: Media
We plan to integrate audio and visual media into the database in the future.
Met with Ewa to discuss the future of the project.
- Decided to set up the blog.
- Updated the plan to post it here for comment, so that we can later break it down into tasks.
- Discussed options for the search functions, and the possibility of user-generated vocabulary sheets.