Missing characters and aliasing

Posted by Martin on 04 Dec 2006 in Activity log

The list posted previously was missing many characters which already appear in the database; I sent a list to Ewa, who mapped each of the missing characters to existing characters as follows:

i-bar = schwa
a-acute = a
a-acute with dot below = a-dot
a-grave = a
c-wedge = c
i-acute = i
i-acute with dot below = i-dot
e = i
e-acute = i
e-acute with modifier schwa = i
schwa-acute = schwa
schwa-grave = schwa
o = u
o with dot below = u-dot
o-acute = u
o-grave = u
s with modifier schwa = s
u-acute = u
u-acute with dot below = u-dot
glottal stop = glottal stop
glottal stop with modifier a = glottal stop (glottal stop is the first letter of the alphabet)

I'm not sure what this equivalence signifies, but it would certainly complicate the search process. For instance, one entry in the database is this:

ṣọ̀lạ̀mén

in other words:

s + dot below; o + dot below + grave; l; a + dot below + grave; m; e acute; n

Now, we can assume that people are going to want to search for forms they actually see, so we'll need buttons for all of those character/diacritic combinations; otherwise people won't be able to enter the actual form. At the same time, some people will want to search without the diacritics, so we'll have to be able to map the form to this:

solamen

According to the aliases above, though, we'll also have to map:

- the second character to u with dot or to u (because o+dot = u+dot, and u+dot has to allow u);

- the fourth character to a with dot (disregarding the grave);

- the sixth character to i.

Combinatorially, we now have a total of:

[s]: 2 possibilities
[o]: 4 possibilities
[l]: 1
[a]: 3
[m]: 1
[e]: 3
[n]: 1

= 2 * 4 * 1 * 3 * 1 * 3 * 1 = 72 variations that we have to allow for, just for one word. I'm wondering if this isn't completely over the top. What does it actually mean to say that e-acute = i? If the database has e-acutes, where would the i's come from, and how would the user know about them?
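To make the count concrete, here is a small sketch that enumerates the variations. The per-position alternatives are my reading of the counts above (full form, stripped form, and the aliased forms); the ASCII labels such as "o-dot-grave" are purely illustrative and are not values from the database.

```python
from itertools import product

# Alternatives for each position of the example word. The labels are
# illustrative ASCII descriptions, assumed for this sketch only.
ALTERNATIVES = [
    ["s-dot", "s"],                      # [s]: 2
    ["o-dot-grave", "o", "u-dot", "u"],  # [o]: 4
    ["l"],                               # [l]: 1
    ["a-dot-grave", "a-dot", "a"],       # [a]: 3
    ["m"],                               # [m]: 1
    ["e-acute", "e", "i"],               # [e]: 3
    ["n"],                               # [n]: 1
]

# One variant per combination of choices, one choice per position.
variants = [" ".join(combo) for combo in product(*ALTERNATIVES)]
print(len(variants))  # 72
```

Every extra alias on any one character multiplies the total, which is why the number grows so quickly even for a seven-letter word.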

Wouldn't it be better to standardize on the form which is actually in the database, and for the purposes of searching, also map that to a simple ascii representation which consists of the same form but with all diacritics stripped off? In other words, wouldn't it be best to say that the user should either search specifically for this:

ṣọ̀lạ̀mén

or, more likely, search for this:

solamen

and possibly find a small set of words, varying only by diacritics, amongst which they could choose?

I'm basically suggesting that each form be mapped in only two ways:

- completely, with all its diacritics intact
- stripped of all diacritics (with superscripts transformed to full forms, glottal changed to apostrophe, and something similar done with the reversed glottal).
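
The two mappings could be sketched roughly as follows, using Python's standard unicodedata module. This is only an illustration under assumptions: the function names are made up, I'm assuming the glottal stop is stored as U+0294, and I've left out the reversed glottal because its replacement hasn't been decided. Characters such as a plain schwa are not touched here and would need their own rule.

```python
import unicodedata

GLOTTAL_STOP = "\u0294"  # assumed code point for the glottal stop in the data


def full_key(form):
    """Key for exact searches: all diacritics kept, in one canonical encoding."""
    return unicodedata.normalize("NFC", form)


def stripped_key(form):
    """Key for rough searches: combining marks removed, superscripts expanded."""
    # NFKD separates base letters from combining marks and also expands
    # characters with a compatibility decomposition, which covers many
    # superscript (modifier) letters.
    decomposed = unicodedata.normalize("NFKD", form)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.replace(GLOTTAL_STOP, "'")


# s + dot below; o + dot below + grave; l; a + dot below + grave; m; e acute; n
form = "s\u0323o\u0323\u0300la\u0323\u0300me\u0301n"
print(stripped_key(form))  # solamen
```

Normalizing the full form matters because the same visible character can be stored either precomposed or as a base letter plus combining marks; without it, an exact search could miss a form that looks identical on screen.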

Then the database can do two searches, one on the full form and one on the ascii-ized form. These searches can be fuzzy (using the fuzzy search capabilities built into the db), so there will still be some room for rough matching, but the number of actual operations will be reduced to a manageable level.
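
As a rough illustration of the two-search idea, the sketch below keeps two in-memory indexes and queries both. It is not how the database would do it: difflib from the Python standard library stands in for the db's own fuzzy search, and the function names and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib
import unicodedata


def strip_marks(form):
    """Remove combining diacritics, leaving the base letters."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(query, forms, cutoff=0.8):
    """Return stored forms that match on the full key or on the stripped key."""
    full_index = {}
    stripped_index = {}
    for form in forms:
        full_index.setdefault(unicodedata.normalize("NFC", form), []).append(form)
        stripped_index.setdefault(strip_marks(form), []).append(form)

    lookups = [
        (unicodedata.normalize("NFC", query), full_index),  # search 1: full form
        (strip_marks(query), stripped_index),               # search 2: stripped form
    ]
    hits = []
    for key, index in lookups:
        # difflib is a stand-in here for the database's fuzzy matching.
        for match in difflib.get_close_matches(key, list(index), n=10, cutoff=cutoff):
            for form in index[match]:
                if form not in hits:
                    hits.append(form)
    return hits


forms = ["s\u0323o\u0323\u0300la\u0323\u0300me\u0301n", "kwat"]
print(search("solamen", forms) == [forms[0]])  # True
```

Each query costs two lookups regardless of how many diacritics the word carries, and words that differ only in their diacritics share a stripped key, so they come back together for the user to choose between.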

On the other hand, if the correspondences in your list (e-acute = i, etc.) actually amount to a form of orthography, and it's that orthography on which people will most likely want to search, then we should be entering orthographical forms directly into the database and using them as the presentational forms in the first place. However, my understanding was that there is no conventional orthography, and so I'm guessing that these correspondences must represent some kind of more abstract level of transcription; in which case, I'd suggest that we steer clear of them for the purposes of searching. I think people will really want to search either on what they have already seen in the database (perhaps looking up a form they found previously and wrote down), or on a rough ascii simplification without any diacritics.

This entry was posted by Martin and filed under Activity log.