Made some more changes to improve the layout, in consultation with EC and SK. Much more to do...
As blogged elsewhere, we now have some <m> components of <hyph> which break down into multiple morphemes; we're expressing these using @corresp with multiple values, instead of @sameAs with one. I've now written some basic handling for that situation. In the process, I had to migrate the stylesheet from XSLT 1.0 to 2.0. The site is a mix of both. It really needs to be remodelled using a new Cocoon/eXist stack. I'll do that as soon as we have the new Tomcat box up and running.
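To make the @sameAs / @corresp distinction concrete, here is a minimal sketch in Python (not the XSLT stylesheet itself) of how a consumer could resolve either attribute to a list of morpheme ids. The function name `morpheme_targets` is made up for illustration, and the sketch assumes that @corresp values are separated by whitespace and that the ids contain no spaces.

```python
import xml.etree.ElementTree as ET

def morpheme_targets(m):
    """Return the morpheme ids an <m> element points to.

    Assumption: @corresp holds one or more whitespace-separated ids,
    while @sameAs holds exactly one id.
    """
    corresp = m.get("corresp")
    if corresp:
        return corresp.split()
    same_as = m.get("sameAs")
    return [same_as] if same_as else []

# One fused segment that breaks down into three morphemes.
fused = ET.fromstring('<m corresp="nt sa s">c</m>')
# An ordinary one-to-one segment.
simple = ET.fromstring('<m sameAs="nom">s</m>')

print(morpheme_targets(fused))
print(morpheme_targets(simple))
```

A site script could use the length of the returned list to decide whether a click should go straight to the morpheme or first expand into separate morpheme links.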
In the phar-w file, I have put bibl tags within both prons and defs, so that every seg has a sister bibl, e.g.:
<entry>
<form>
<pron><seg type="phonemic">sʕʷáʔʕʷaʔ</seg><bibl>Y1.68; MW; EP</bibl></pron>
<pron><seg type="narrow">swáʔwaʔ</seg><bibl>G48; J3.1; A4</bibl>
</pron>
<pron><seg type="narrow">swˀáʔwˀaʔ</seg><bibl>CS18</bibl></pron>
<hyph><m sameAs="nom">s</m>‐√<m sameAs="ʕWaʔ">ʕʷáʔ</m>+<m sameAs="CHAR">CVC</m></hyph>
<note>onomatopoeic<bibl>Y</bibl></note>
</form>
<sense>
<def><seg>cougar</seg>
<bibl>Y1.68; MW; EP</bibl>
<bibl>G48; J3.1; A4</bibl>
<bibl>CS18</bibl>
</def>
</sense>
</entry>
I am going to do this even in cases where the entry has only one form and one def (rather than just putting the bibl on the whole entry).
This also serves to distinguish among phonemic representations:
-If it was transcribed by MDK, it has a bibl.
-If it was derived by ECH from a narrow transcription, it has no bibl.
I checked with Martin, and he agreed that this way makes sense, so I am going to change the s-rtr file to use this system too.
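The bibl / no-bibl convention for phonemic representations could be read back out of the markup roughly like this. It is only a sketch in Python: the function name `phonemic_origin` is hypothetical, it assumes the phar-w layout in which the bibl sits inside the pron next to the seg, and the second test element is an invented placeholder rather than real data.

```python
import xml.etree.ElementTree as ET

def phonemic_origin(pron):
    """Classify the phonemic seg in a <pron>.

    Returns "transcribed" if the pron carries a <bibl>, "derived" if it
    does not, and None if the pron has no phonemic seg at all.
    """
    if pron.find("seg[@type='phonemic']") is None:
        return None
    return "transcribed" if pron.find("bibl") is not None else "derived"

with_source = ET.fromstring(
    '<pron><seg type="phonemic">sʕʷáʔʕʷaʔ</seg><bibl>Y1.68; MW; EP</bibl></pron>'
)
# Placeholder content, not an attested form.
without_source = ET.fromstring('<pron><seg type="phonemic">xxx</seg></pron>')

print(phonemic_origin(with_source))
print(phonemic_origin(without_source))
```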
Martin questioned the stacking of bibl tags in the example above, where multiple sources all give the same definition. These could all be in a single bibl tag, but Martin can collapse them into one programmatically later. I will keep an eye out for any cases where this would NOT work.
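For what it's worth, the collapse could look something like the following. This is a sketch in Python, not Martin's actual method; `collapse_bibls` is a made-up name, and it assumes each bibl contains only text (no child elements) and that the sources should be joined with "; " as in the existing bibls.

```python
import xml.etree.ElementTree as ET

def collapse_bibls(parent):
    """Merge all direct-child <bibl> elements of `parent` into the first one."""
    bibls = parent.findall("bibl")
    if len(bibls) < 2:
        return
    # Join the source lists in document order.
    bibls[0].text = "; ".join((b.text or "").strip() for b in bibls)
    for extra in bibls[1:]:
        parent.remove(extra)

d = ET.fromstring(
    "<def><seg>cougar</seg>"
    "<bibl>Y1.68; MW; EP</bibl>"
    "<bibl>G48; J3.1; A4</bibl>"
    "<bibl>CS18</bibl></def>"
)
collapse_bibls(d)
print(ET.tostring(d, encoding="unicode"))
```

Because the merge is applied per parent element, two defs with different sources would each keep their own bibl.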
As we noted last week:
----------
<bibl> elements need to be applied in many different locations, especially inside <def> and <form>, to make it absolutely clear what the source for each of them is. Right now, a <bibl> tends to appear in a <def>, and that actually means that it applies not only to the <def>'s parent <sense> element, but also to the preceding sibling <form> element. Since this is not reliably the case, though, we need to be explicit about it.
----------
I went through the s-rtr file and made the placement of bibl tags more explicit. For simple entries with one form and one sense it looks like:
<form>
<pron>
<seg type="phonemic">stíks</seg>
</pron>
<bibl>Y37.45</bibl>
<hyph><m sameAs="stiks">stíks</m></hyph>
</form>
<sense>
<def>
<seg>big male mountain <gloss>goat</gloss></seg>
<bibl>Y37.45</bibl>
</def>
</sense>
The most complex combination I've found so far is this kind, one form with two defs:
<form>
<pron>
<seg type="phonemic">ṣạpḷị́ḷ </seg>
<seg type="narrow">sàpᵊlél</seg>
</pron>
<bibl>G7.32; Y6.151, 305; Y16.189; Y21.11</bibl>
<bibl>W9.100</bibl>
</form>
<sense>
<def>
<seg><gloss>flour</gloss></seg>
<bibl>G7.32; Y6.151, 305; Y16.189; Y21.11</bibl>
</def>
<def>
<seg><gloss>bread</gloss></seg>
<bibl>W9.100</bibl>
</def>
</sense>
So the markup shows that the form was given by two speakers (well, the first bibl is actually at least two speakers), but each had a different definition for it.
If we didn't have the bibls in the form here, the formula Martin mentioned would actually still work, I think:
"A <bibl> appears in a <def>, and that actually means that it applies not only to the <def>'s parent <sense> element, but also to the preceding sibling <form> element."
That should actually cover all possible variations:
-different sources are all listed in the same bibl
-different definitions are handled as above
-different forms of the same word get different <form>s and <sense>s anyway.
So can I get away with not putting bibls in the form after all?
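To test the idea, here is a rough Python sketch of the inference rule: every bibl found in a def is credited to the nearest preceding sibling form. The function name `inferred_form_sources` is hypothetical, and the sketch assumes forms and senses are direct children of the entry. The sample is the flour/bread entry with the form-level bibls left out.

```python
import xml.etree.ElementTree as ET

def inferred_form_sources(entry):
    """Return one list of source strings per <form>, in document order.

    Each <bibl> inside a <def> is credited to the nearest preceding
    sibling <form> of the <sense> that contains it.
    """
    per_form = []
    current = None
    for child in entry:
        if child.tag == "form":
            current = []
            per_form.append(current)
        elif child.tag == "sense" and current is not None:
            for bibl in child.findall("def/bibl"):
                text = (bibl.text or "").strip()
                if text and text not in current:
                    current.append(text)
    return per_form

entry = ET.fromstring("""
<entry>
<form><pron><seg type="phonemic">ṣạpḷị́ḷ</seg></pron></form>
<sense>
<def><seg><gloss>flour</gloss></seg><bibl>G7.32; Y6.151, 305; Y16.189; Y21.11</bibl></def>
<def><seg><gloss>bread</gloss></seg><bibl>W9.100</bibl></def>
</sense>
</entry>
""")
for sources in inferred_form_sources(entry):
    print(sources)
```

For this entry the inferred sources for the form are the same two source lists that appear in the explicit form-level bibls above.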
Here are some more decisions we made in Friday's meeting:
1) Yes, we still need seg tags within quote tags in example phrases, even if there is no breakdown of the gloss. For example ...
This one has a breakdown of the gloss:
<seg>whole wheat <gloss>flour</gloss></seg>
This one doesn't, but it should still have <seg>s:
<seg><gloss>flour mill</gloss></seg>
I have fixed all these in the s-rtr file.
2) There does not have to be a hyph line for every form in an entry - just the first form in each entry.
3) For zero morphemes, we will use the LATIN CAPITAL LETTER O WITH STROKE character. Ewa will create entries in the affix file for Ø1 and Ø2.
4) When Ewa has added a gloss which was not attested in the original data, we will format it as in this example:
<sense>
<def>
<seg><gloss>stretch</gloss></seg>
<note resp="ECH">[definition by ECH]</note>
</def>
</sense>
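Decisions 1 and 2 lend themselves to an automatic check. The following is a minimal Python sketch under some assumptions: `check_entry` is a made-up name, forms are direct children of the entry, and "needs seg tags" is read as "every gloss must have a seg as its parent". The second sample entry is deliberately broken and invented for the test.

```python
import xml.etree.ElementTree as ET

def check_entry(entry):
    """Return a list of problems found in one <entry> element."""
    problems = []
    # Decision 2: only the first form in an entry must have a hyph.
    forms = entry.findall("form")
    if forms and forms[0].find("hyph") is None:
        problems.append("first <form> has no <hyph>")
    # Decision 1: a gloss must always sit inside a seg.
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in entry.iter() for child in parent}
    for gloss in entry.iter("gloss"):
        parent = parents.get(gloss)
        if parent is None or parent.tag != "seg":
            problems.append("<gloss> '%s' is not inside a <seg>" % (gloss.text or ""))
    return problems

good = ET.fromstring(
    '<entry><form><pron><seg type="phonemic">stíks</seg></pron>'
    '<hyph><m sameAs="stiks">stíks</m></hyph></form>'
    '<sense><def><seg>big male mountain <gloss>goat</gloss></seg></def></sense></entry>'
)
# Invented broken entry: no hyph, and a gloss without a seg.
bad = ET.fromstring(
    "<entry><form/><sense><def><gloss>flour mill</gloss></def></sense></entry>"
)
print(check_entry(good))
print(check_entry(bad))
```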
Things I have to remember:
- When @corresp is used to bracket multiple morpheme components for a single segment, the site needs to handle this. When you first click on the segment, it should expand into separate morpheme representations; then clicking on each of those should take you into the morpheme.
- <bibl> elements need to be applied in many different locations, especially inside <def> and <form>, to make it absolutely clear what the source for each of them is. Right now, a <bibl> tends to appear in a <def>, and that actually means that it applies not only to the <def>'s parent <sense> element, but also to the preceding sibling <form> element. Since this is not reliably the case, though, we need to be explicit about it.
Ewa wrote yesterday:
Basically there are three crucial things to keep in mind when phonemicizing:
1. The “alphabet” is phonemic, so wherever a transcription deviates from the alphabet it is phonetic.
2. Raised vowels are schwas which are clearly phonetic in nature.
3. Schwas in general tend to be phonetic except for a few cases which are systematic. Only the latter kinds of schwas should be left in phonemic representations.
Here are the changes that Martin added to the XML markup documentation:
1) In section 4.1:
-The contents of hyph should be based on the phonemic transcription of the word.
-If the word contains reduplication, mark it in hyph with CV, etc., rather than the actual segments of the reduplicant, e.g.:
<entry xml:id="ṣə̣nṣə̣nt">
<form>
<pron>
<seg type="phonemic">ṣə̣́nṣə̣nt</seg>
<seg type="narrow">sə́nsə̀nt</seg>
</pron>
<hyph>√<m sameAs="ṣə̣n">ṣə̣̣́n</m>+<m sameAs="CHAR">CVC</m>-<m sameAs="t-STAT">t</m>
</hyph>
</form>
The types of reduplication include:
Characteristic = CVC
Distributive = CəC
Repetitive = Ca
Diminutive = C1
Out of Control = C2
Ewa is making sure all these are in the affix file.
2) For entries in which multiple morphemes combine inseparably to form a single-phoneme item (e.g., c = nt + sa + s), use @corresp instead of @sameAs, with the morphemes separated by spaces, e.g.:
<hyph><m corresp="nt sa s">c</m></hyph>
3) In section 4.2: Something else I noticed on the blog that would be good to have in the markup documentation:
Where there is no attested definition, Ewa will supply one in this form:
<def>
<note resp="ECH">[The definition/explanation]</note>
</def>
4) In section 4.4: cross references do not need to include an English gloss, so the format for <xr>s should be
<xr>See <ref target="idblah">blah</ref> and <ref target="idblah2">blah2</ref>.</xr>
(NOT: <xr>See <ref target="idblah">blah</ref> (English blah) <ref target="idblah2">blah2</ref>(English blah2).</xr>)
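If cross references are ever generated rather than typed by hand, the agreed <xr> format could be produced along these lines. This is only a Python sketch; `build_xr` is a hypothetical helper, and the comma handling for three or more refs is my own assumption, since the documentation only shows the two-ref case.

```python
import xml.etree.ElementTree as ET

def build_xr(refs):
    """Build an <xr> from a list of (target_id, label) pairs, with no English glosses."""
    if not refs:
        raise ValueError("an <xr> needs at least one reference")
    xr = ET.Element("xr")
    xr.text = "See "
    last = len(refs) - 1
    for i, (target, label) in enumerate(refs):
        ref = ET.SubElement(xr, "ref", target=target)
        ref.text = label
        if i == last:
            ref.tail = "."
        elif i == last - 1:
            ref.tail = " and "
        else:
            ref.tail = ", "  # assumption: commas between earlier refs
    return xr

xr = build_xr([("idblah", "blah"), ("idblah2", "blah2")])
print(ET.tostring(xr, encoding="unicode"))
```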