In the data files I've looked at, there seem to be a few cases where glottalized w and glottalized y have a raised comma written directly above them. This is not a standard representation either in the IPA or in the Americanist forms traditionally used for writing Moses. So I would definitely change any w or y that has a raised comma above it into either w/y followed by a raised comma or w/y followed by a superscript glottal (depending on which we finally decide to do). I don't know why these w/y with a raised comma above them are there, since that is not how those sounds were represented in the Lexware database.
Here are the considerations from my perspective. I am laying them out for you below because I am not sure from reading your various comments whether I have succeeded in making my thinking on this clear to you. Once you’ve read what I write below, if you still think that we should use superscript glottals throughout, then that is fine with me.
1. In the IPA, a distinction is made between the way glottalization is marked on ejective stops and the way it is marked on glottalized sonorant/resonant consonants.
a. In the case of ejective stops, the raised comma transcribed after the stop is preferred [p’, t’, k’, etc.]. Although it is possible to transcribe ejective stops with a superscript glottal, this is not the preferred way to do so.
b. In the case of glottalized sonorant/resonant consonants, a superscript glottal diacritic is used, and it can be positioned either before or after the sonorant. So glottalized m or y, for instance, would be transcribed [yˀ, mˀ].
2. In the Americanist tradition in which Moses has always been written up until now, all ejective stops and all glottalized sonorants are marked in the same way: namely, they always have a raised comma after them: [p’, t’, k’, etc., m’, l’, y’, etc.].
3. In the xml:ids, we can’t use raised commas so we have been using superscript glottals throughout (I think, Martin, that you wrote a little conversion for this at some point).
4. Neither the IPA nor the Americanist tradition according to which Moses has been transcribed so far uses a superscript glottal for ejective stops and affricates. I therefore argued that we should use the raised comma representation for these sounds, not the superscript glottal.
5. Because the Moses/Americanist tradition has up until now used the raised comma for glottalized sonorants/resonants, I also argued that we should use the raised comma for the glottalized sonorants of Moses.
6. If, however, you guys think that the raised comma representation is problematic for the search functions, then I will accept your recommendation to use the superscript glottal for the ejective stops, ejective affricates, and for the glottalized sonorants.
7. My concern for consistency is the same as yours. I believe that at the moment we have several different kinds of representations. For example, I think that there are glottalized sonorants which have the raised comma written directly above them, and others which have the raised comma written just after them. And there are clearly ejectives written both with raised comma and with superscript glottal.
8. Whatever you think is the best representation, I will be happy to check for consistency as I go through the files.
Created a package called MosesIndexer, and wrote and tested a class called MosesConverter, which implements the NFKD and ASCII conversion algorithms described in the preceding post. Wrote a JUnit test for it, and debugged it.
Next is to implement the Indexer class, which will read a series of files from disk, and process each one to build an index XML file, then save it.
This post represents the results of research Greg and I have been doing on the issue of searching complex Unicode data.
The search interface for this project presents a range of interesting challenges. Searching the English components of entries is no problem at all, but when it comes to searching the Moses text fields, we have to deal with the issue of diacritics. For instance, take the word "ṣạpḷị́ḷ". If your browser is using an appropriate font, you'll see this sequence of characters:
- s with dot below
- a with dot below
- p
- l with dot below
- i with acute accent and dot below
- l with dot below
We can envisage two extremes of user. On the one hand, a researcher familiar with the language might be searching for exactly this string of characters, with all the accents in place. In this case, we have only one type of problem. Consider the "i with acute accent and dot below". This could legitimately be created through any of the following sequences:
- i + combining acute accent + combining dot below
- i + combining dot below + combining acute accent
- (composite) i with acute accent + combining dot below
- (composite) i with dot below + combining acute accent
These all look the same, and are equivalent; there's no way to know which one the user typed in, and there's also no way to be sure which form might be in the database. This makes matching a search string against fields in the database rather difficult.
Unicode provides a solution for this, which will work pretty well for Moses. In Unicode, each combining character is assigned a combining class -- a number between 0 and 255. The classes are based on the position of the diacritic with respect to the character it's joined to. (See here for more info.) Now, Unicode provides a set of normalization forms for any string. Among them is NFKD, or "Compatibility Decomposition". What this process does is:
- Decompose all composite characters into character + diacritic.
- Apply canonical ordering to diacritics, based on their combining class.
Applying NFKD normalization to any of the sequences above would produce the same output:
- i + combining dot below + combining acute accent
because they would all be decomposed into three components, and then the i would come first (as the base character), then the dot below (combining class 220), and finally the acute accent (because its combining class is 230, which is higher than that of the dot below).
Therefore we have a solution to the problem of our advanced searcher: if we perform NFKD normalization on BOTH the strings in the db against which we're searching, AND the search string entered by the user, then we'll be able to do a useful search.
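To make this concrete, here's a minimal Java demo (using the java.text.Normalizer class; the class and variable names are just for illustration) showing that all four encodings of "i with acute accent and dot below" collapse to the same NFKD form:

    import java.text.Normalizer;

    public class NfkdDemo {
        public static void main(String[] args) {
            // Four visually identical encodings of i-with-acute-and-dot-below:
            String a = "i\u0301\u0323"; // i + combining acute + combining dot below
            String b = "i\u0323\u0301"; // i + combining dot below + combining acute
            String c = "\u00ED\u0323";  // composite i-with-acute + combining dot below
            String d = "\u1ECB\u0301";  // composite i-with-dot-below + combining acute

            // NFKD decomposes all four, then orders dot below (class 220)
            // before acute (class 230), so they all come out identical.
            String norm = Normalizer.normalize(a, Normalizer.Form.NFKD);
            System.out.println(norm.equals(Normalizer.normalize(b, Normalizer.Form.NFKD))); // true
            System.out.println(norm.equals(Normalizer.normalize(c, Normalizer.Form.NFKD))); // true
            System.out.println(norm.equals(Normalizer.normalize(d, Normalizer.Form.NFKD))); // true
        }
    }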
The second type of user, a casual surfer, or someone who is not linguistically sophisticated or familiar with the language, presents a different type of problem. They most likely have no idea what diacritics should go where, and even if we provide onscreen buttons or keystroke methods for entering diacritics or modifier letters, they won't be able or willing to use them. Their search for "ṣạpḷị́ḷ" is likely to be entered as "saplil". Nevertheless, they'll still want to get results.
Another application of Unicode normalization form NFKD, followed by some extra processing, can solve this problem. First of all, it will split off the combining diacritics. We can then remove them from the string, turning "ṣạpḷị́ḷ" into "saplil". If we do this for the search string entered by the user, and for the db strings against which we're searching, then we can obtain meaningful matches whatever the user enters.
In addition to splitting out the combining diacritics, compatibility decomposition will also convert some characters into their "compatibility" equivalents. For example, modifier letter w (raised w) will be converted to a regular w. This solves yet another problem: people will tend to type a w rather than finding the keystroke or button for raised w. Some characters, however, do not have compatibility equivalents where we would want them. Modifier raised glottal, for instance, doesn't have a compatibility equivalent, even though there is a regular glottal character. When we process the string to strip out the diacritics, though, we could do that conversion too.
These are the conversions we would need to make in order to create an "ascii representation" of a Moses string (a sketch in Java follows the list):
- Split out combining diacritics (NFKD does this).
- Convert characters to their compatibility equivalents (NFKD does this for some characters).
- Discard combining diacritics.
- Convert raised w to w.
- Convert raised glottal to glottal.
- Convert belted l to l.
- Convert barred lambda to l.
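Here's a rough Java sketch of that conversion chain, again based on java.text.Normalizer; the method name and the exact set of mappings are illustrative, not final:

    import java.text.Normalizer;

    public class AsciiSketch {
        public static String asciiize(String moses) {
            // NFKD splits out the combining diacritics and applies the
            // compatibility conversions it knows about (e.g. raised w -> w).
            String s = Normalizer.normalize(moses, Normalizer.Form.NFKD);
            // Discard all combining diacritics (Unicode category Mn).
            s = s.replaceAll("\\p{Mn}+", "");
            // Hand-rolled conversions for characters NFKD doesn't cover:
            s = s.replace('\u02C0', '\u0294'); // raised glottal -> glottal
            s = s.replace('\u026C', 'l');      // belted l -> l
            s = s.replace('\u019B', 'l');      // barred lambda -> l
            return s;
        }
    }

Note that NFKD alone already handles raised w, so only the last three mappings need to be hand-rolled.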
Now we have a decision to make. There are two non-ascii (potentially "confusing-to-the-layman") characters still in the mix: glottal and epiglottal. We could either leave them there, or we could replace them with something more bland. If we replace them, we need appropriate replacements. One option would be to replace both by the apostrophe; another would be to use a convention such as X-SAMPA, which replaces the glottal by a question mark, and the epiglottal by "?\". A decision on this should be guided by our sense of what semi-sophisticated users (such as band members familiar with the basic orthography) might be expected to use.
So we have a situation where we need to map two representations of a search string against two representations of the data. The data itself does not contain any normalized representations at all, and we would prefer to avoid cluttering it up with them; furthermore, because they can be generated programmatically, it makes no sense to have them in the data anyway. However, generating two representations of every Moses string in the db every time a search is done makes no sense either; it would make searching ridiculously slow.
The solution to this is to create an index to the entries, consisting of a series of parallel entries. Each one would have the same xml:id attribute as the entry it's based on, as well as a copy of the representative string for that entry (which is currently the first <seg> in the first <pron> in the first <form> element). It would then have two blocks of tags, one for the NFKD forms of all Moses strings in that entry, and one for the ascii-ized forms. This index can be loaded into a special subcollection on the database. Searching can be done against the index, to retrieve the xml:id and the representative string directly from the index, instead of from the db itself. A list of such hits can be returned to the user, who can then click on any entry to retrieve the real data for that entry from the database.
This index would have to be generated manually from the existing data. The best way to do this would probably be to write a Java application which can be passed a list of markup files. It loads each XML file into a DOM, then generates an index entry for each entry in the source file; it then uses a node iterator to run through each of the tags containing Moses text, and generates one of each type of compatibility form for them. This index file can go into the database, and eXist range indexes can be defined for it, making searching even more effective.
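As a sketch of what that application might look like (the index element names and the MosesConverter method names are assumptions, and real code would need to escape the text content properly):

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class Indexer {
        private static final String TEI = "http://www.tei-c.org/ns/1.0";
        private static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

        public static void indexFile(File in, StringBuilder out) throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder builder = dbf.newDocumentBuilder();
            Document doc = builder.parse(in);
            NodeList entries = doc.getElementsByTagNameNS(TEI, "entry");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                String id = entry.getAttributeNS(XML_NS, "id");
                out.append("<indexEntry xml:id=\"").append(id).append("\">");
                // One pair of compatibility forms per Moses-text element.
                NodeList segs = entry.getElementsByTagNameNS(TEI, "seg");
                for (int j = 0; j < segs.getLength(); j++) {
                    String moses = segs.item(j).getTextContent();
                    out.append("<nfkd>").append(MosesConverter.nfkd(moses)).append("</nfkd>");
                    out.append("<ascii>").append(MosesConverter.asciiize(moses)).append("</ascii>");
                }
                out.append("</indexEntry>");
            }
        }
    }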
That solves the problem of creating searchable indexes; now we have to deal with the issue of the search string, which cannot be processed ahead of time, because we don't know what the user will type in. We need to render this string into two forms as well, to search against the two types of item in the index. There are two possible ways to handle this:
- We could try to produce the two forms using XSLT, by predicting all possible variants of characters and diacritics, and searching/replacing on them. This sounds like it would be inefficient and complicated, but since search strings will typically be very short, it might be a perfectly good approach.
- We could write a new Transformer for Cocoon using Java, which does the same job, and which can be called from a Cocoon pipeline. This would most likely be a little faster, and more reliable since we could depend on Java to do the NFKD normalization for us. However, it would involve learning how to write and deploy a customized transformer, which would take some time. On the other hand, knowing how to do this would be very handy in the future.
Whatever method we use to massage the search string, we'll have to integrate it into a pipeline, which will pass the results as two parameters to an XQuery page. The XQuery code will then perform two searches, with the normalized form against the equivalent forms in the index, and the ascii-ized form against its equivalent. This will result in two sets of hits; the first set should be prioritized because they're more likely to be more precise matches. Duplicates would need to be discarded, and then the resulting set of hits (xml:id and representational form for the entry) would be returned to the pipeline to be styled as hits and be inserted into the search page. This would be done through AJAX.
This will obviously need more thought to work out the fine details, but I think we have the basis of an excellent strategy here, enabling us to provide fast searching which combines precision with fuzziness, and which should suit any type of user.
Got the sort system working by rewriting the XSLT file till it worked with Saxon. Saxon 8 is very finicky about things like namespaces. The other XSLT stylesheets will also have to be ported over to work with Saxon 8, so that we have a fully XSLT-2.0-based system.
Then I added the extra code to disregard accents in the sort comparator. This simply involves stripping the accents out of the strings before doing the comparison. It's an extra stage, so it may add processing time, but on the other hand the comparison character string is now shorter (no accents), and the input strings will often be shorter once their accents are stripped out, so the net effect may not be noticeable. We'll have to see how fast this code runs once we've got more data in the db.
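The extra stage amounts to something like this inside the comparator (the method names here are illustrative):

    // Strip combining acute (U+0301) and grave (U+0300) before comparing;
    // every other diacritic remains significant for the sort.
    private static String stripAccents(String s) {
        return s.replace("\u0301", "").replace("\u0300", "");
    }

    public int compare(String s1, String s2) {
        // compareByPosition stands in for the existing position-based comparison.
        return compareByPosition(stripAccents(s1), stripAccents(s2));
    }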
In the process of doing this, we found a bug in the implementation (groups of identical entries were not being sorted together). Fixed that bug.
Some answers to your questions:
1. Question: There are several cases of two or more different affixes having almost identical meanings. This means that they have identical feature structures. Is this going to be a problem? For example, xix and xax are both baseType, suffix, morphoSyntactic, indefinite-object.
I don't see why this would be a problem. We identify them by their xml:id, and display/sort them by their first <pron>, so I don't see any conflict.
2. Question about prons and hyphs of reduplicative morphemes: How should prons and hyphs of reduplications be represented? Reduplicative morphemes have changeable form, depending on what the shape of the base of the reduplication is. For example, if the root is of the shape xit, the reduplicative suffix “characteristic” will be xit (xit-xit), but if the root is of the shape quc, the reduplicative suffix “characteristic” will be quc (quc-quc). The basic shape of the reduplication is thus CVC (consonant-vowel-consonant), but what the exact segmental content of the suffix is depends on the segments found in the root. The simplest thing for a pron would be to specify the CV-shape of each reduplication. For example, the pron for the reduplicative suffix whose meaning is “characteristic” would be CVC, for the distributive it would be CEC (where E=schwa), for repetitive it would be Ca, for out of control it would be VC, and for diminutive it would be C-. For the hyph forms, it would be the same type of thing. For example, for characteristic the hyph would thus include sameAs=”CHAR”>CVC. Is it possible/desirable to do this in an xml markup?
As long as the xml:id attributes are distinct, I don't think it matters. If each has a unique CV-shape, then that would be a good way to characterize them, given that they have no default or normalized representation at all.
3. I have completed to the end of hard copy affix10 of the affix files, except for fixing cross-references in the last entry, which is the DIM form. There is one more of these files left in the affix set.
I'm not sure what this means. On the server, I can only see one affix.xml file. Is "affix10" above a typo, or is there such a file somewhere?
1. combining glottal and combining comma. By combining comma do you mean the combining apostrophe? If so, then combining glottal and combining comma/apostrophe are two different symbols for the same sound: they both represent glottalization. We decided at some point in December I think that we will actually replace all the combining glottals with combining commas/apostrophes in order to be completely consistent throughout. The one constraint on this is that we have to continue to use combining glottals in the xml:ids.
As we've said before, Greg and I both think replacing the glottals with commas is a bad idea, because it amounts to misrepresenting the data. It's also rather pointless, because for any particular context in which we're displaying this data, we can do a translation from glottal to comma on the fly; there's no need to store misleading data just so we can see it on the page. The combining comma I'm talking about, though, is one which appears above the w and y characters in the handwritten alphabetical order I've been working from. That character is "U+0313 : COMBINING COMMA ABOVE", whereas the combining glottal is "U+02C0 : MODIFIER LETTER GLOTTAL STOP". The former shows up above the modified letter, the latter shows up to the right of it. If I understand you correctly, these are intended to represent the same sound -- a glottal -- but you want the modifier to appear above the letter when the letter is w or y, and on the right of it when it's any other letter. This is problematic because if you convert them all to combining comma above, they'll appear above the letter everywhere; so using commas won't even solve the display problem it's intended to solve.
My recommendation is to keep your data correct and pure, and use the right character throughout (the glottal). Then for display purposes we write display code that does a translation in some circumstances (e.g. it substitutes a comma above for w and y, and adds an apostrophe after for other letters, if that's what you want). If the data itself is corrupted by display preferences, then it's going to be less useful for research and display in the future, in other contexts. That's my opinion, anyway.
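For instance, a minimal sketch of such a display translation (assuming the data stores U+02C0 after the glottalized letter; the method name is hypothetical):

    public static String displayGlottals(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\u02C0' && i > 0) {
                char prev = s.charAt(i - 1);
                if (prev == 'w' || prev == 'y') {
                    out.append('\u0313'); // combining comma above the w or y
                } else {
                    out.append('\u2019'); // apostrophe after other letters
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }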
2. Acute and grave accents are irrelevant for alphabetical order. In other words there is no difference in alphabetical order between [a] with no accent, [a] with acute accent, and [a] with grave accent; and this is similar for all the other vowels. Does this mean that the java sorter can ignore the accents?
I'll have to go away and think about that. We've actually got the java class sorting successfully, but right now, it needs a position for every character in the alphabet (which includes accents); I'll have to add some code to strip out the accents before comparing words. I think it should be fairly straightforward.
3. What I can’t determine from your presentation of the material is whether there is significance to the order that you have given for the diacritics. Why have you placed [dot below] before [combining glottal], etc.? Can you explain this to me?
This is based on your own handwritten list, in which c-with-dot-below appears before c-with-glottal (and the same for all other combinations of these diacritics with other characters). If c-with-dot-below comes before c-with-glottal, then dot-below comes before glottal.
1. Question: There are several cases of two or more different affixes having almost identical meanings. This means that they have identical feature structures. Is this going to be a problem? For example, xix and xax are both baseType, suffix, morphoSyntactic, indefinite-object.
2. Question about prons and hyphs of reduplicative morphemes: How should prons and hyphs of reduplications be represented?
Reduplicative morphemes have changeable form, depending on what the shape of the base of the reduplication is. For example, if the root is of the shape xit, the reduplicative suffix “characteristic” will be xit (xit-xit), but if the root is of the shape quc, the reduplicative suffix “characteristic” will be quc (quc-quc). The basic shape of the reduplication is thus CVC (consonant-vowel-consonant), but what the exact segmental content of the suffix is depends on the segments found in the root. The simplest thing for a pron would be to specify the CV-shape of each reduplication. For example, the pron for the reduplicative suffix whose meaning is “characteristic” would be CVC, for the distributive it would be CEC (where E=schwa), for repetitive it would be Ca, for out of control it would be VC, and for diminutive it would be C-.
For the hyph forms, it would be the same type of thing. For example, for characteristic the hyph would thus include
sameAs=”CHAR”>CVC
Is it possible/desirable to do this in an xml markup?
3. I have completed to the end of hard copy affix10 of the affix files, except for fixing cross-references in the last entry, which is the DIM form. There is one more of these files left in the affix set.
May 9, 2007
1. combining glottal and combining comma. By combining comma do you mean the combining apostrophe? If so, then combining glottal and combining comma/apostrophe are two different symbols for the same sound: they both represent glottalization. We decided at some point in December I think that we will actually replace all the combining glottals with combining commas/apostrophes in order to be completely consistent throughout. The one constraint on this is that we have to continue to use combining glottals in the xml:ids.
2. Acute and grave accents are irrelevant for alphabetical order. In other words there is no difference in alphabetical order between [a] with no accent, [a] with acute accent, and [a] with grave accent; and this is similar for all the other vowels. Does this mean that the java sorter can ignore the accents?
3. What I can’t determine from your presentation of the material is whether there is significance to the order that you have given for the diacritics. Why have you placed [dot below] before [combining glottal], etc.? Can you explain this to me?
Today I learned some Java, which is pretty much new to me. I had to implement and test a Java class which implements the java.util.Comparator interface, and which can then be used as a sort of plug-in to Saxon, invoked from XSLT, to do custom sorting. I downloaded and installed Eclipse, and set myself up with a new package, including a source file for the class and one for a JUnit test class for trying it out.
With lots of help from Stew, I eventually got the class working, based on a list of all the characters in a simple string. The first useful discovery was the Java Normalizer class; this can be used to solve the problem of sorting strings which may contain pre-composed characters or strings of char+combining characters, which are equivalent. The Normalizer can be used to do a canonical decomposition of the strings before comparing them. Very handy -- and it might also be handy for normalizing actual data permanently at some point.
Testing of the results of sorting revealed that my initial assumption -- that the diacritics etc. should go at the beginning -- was wrong; to get the desired behaviour, they actually need to be at the end. That was easily fixed.
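A condensed sketch of the idea (not the actual class; the alphabet string below is abbreviated and purely illustrative, where the real one has to encode the full Moses sort order):

    import java.text.Normalizer;
    import java.util.Comparator;

    public class MosesSortComparator implements Comparator<String> {
        // Letters first, combining diacritics at the end (as testing showed).
        private static final String ORDER =
            "\u0294ac\u0259hikl\u026C\u019Bmnpqrstuwxy\u0295" + // letters, abbreviated
            "\u0323\u02C0\u02B7\u0313";                         // diacritics last

        public int compare(String s1, String s2) {
            // Canonical decomposition first, so that precomposed characters
            // and char+combining sequences compare as equivalent.
            String a = Normalizer.normalize(s1, Normalizer.Form.NFD);
            String b = Normalizer.normalize(s2, Normalizer.Form.NFD);
            int n = Math.min(a.length(), b.length());
            for (int i = 0; i < n; i++) {
                int pa = ORDER.indexOf(a.charAt(i));
                int pb = ORDER.indexOf(b.charAt(i));
                if (pa != pb) return pa - pb;
            }
            return a.length() - b.length();
        }
    }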
Once the class was working, we started trying to test it. The main requirement is that it be invoked using a URI, in a manner which is implementation-dependent. Our intention is to use it with Saxon 8, and the instructions for this are here. The code looks like this:
<xsl:sort select="tei:form[1]/tei:pron[1]/tei:seg[1]" collation="http://saxon.sf.net/collation?class=MosesSortComparator" />
Next, you have to put the class somewhere on the Java classpath, so it can be found by Saxon. We presume this means that it should go in with the other Java libraries in Cocoon, so I generated a JAR file (File / Export in Eclipse), and added it to the other JAR files on the server, in /usr/local/apache-tomcat-6.0.2/webapps/cocoon/WEB-INF/lib.
Initial testing failed, and I was puzzled, so I went back to the sitemap and discovered that although the file was XSLT 2.0, it was being run through the default XSLT processor, which is Xalan. When I changed the sitemap to call the Saxon processor, I got no results at all (an empty page). This was the case both with and without the new comparator being used, so the problem isn't the comparator; the stylesheet is not written correctly for Saxon, so we'll need to rewrite it before we can see if the sort actually works. That's for tomorrow.
I have a list of characters showing Moses sort order:
ʔ a ạ c c̣ cˀ ə ə̣ h ḥ ḥʷ i ị k kˀ kʷ kˀʷ l ḷ lˀ ḷˀ ɬ ƛˀ m mˀ n nˀ p pˀ q qˀ qʷ qʷˀ r rˀ s ṣ t tˀ u ụ w w̓ x xʷ x̣ x̣ʷ y ỵ̓ ʕ ʕˀ ʕʷ ʕˀʷ
However, when I look at the entries themselves, I see lots of instances of acute and grave accents. I need to know how those accents fit into the sort order.
Some background to this: in order to sort the entries according to the Moses sort order, I'm having to write a Java class that can be called on the server, which does the sorting. This class has to encapsulate the sort order in sequence. The actual sequence suggested by the list above is this:
[dot below], [combining glottal], [combining w], [combining comma above], ʔ a c ə h i k l ɬ ƛ m n p q r s t u w x y ʕ
In other words, the combining diacritics come first, followed by all the letters. Acute and grave presumably have to fit into the sequence of combining diacritics at the beginning. Can you tell me where they should show up?