Made some changes to provide suggested psn: values in some attributes for convenience; also wrote some XSLT to generate said values for the ODD file from the personography, and created a Schematron schema (still to be linked into the XML files) which has two constraints, and will have more, designed to really tighten up our encoding practice.
Did a lot of manual work on unattested glosses in cits after the transform, because the variety of formations turned out to be too complex for a couple of regexes.
Unattested glosses, originally indicated with angle brackets, appear in two places: <def>s and <cit>s. Those in <def>s should be converted into something more suitable; once that's done, those in <cit>s can be deleted. This is the form (a complex instance):
<def>
<!--Generated from: [ *clabber|ed milk <*sour>; *cottage~*cheese <*sour> ]-->
<seg><gloss>clabbered</gloss> milk &lt;<gloss>sour&gt;</gloss>; <gloss>cottage~*cheese</gloss> &lt;<gloss>sour&gt;</gloss> </seg>
<bibl corresp="psn:JM psn:AM">Y14.182</bibl>
</def>
This needs to be converted such that the unattested gloss is lifted out of the context, and turned into a new <seg> with a <bibl> ascribing it to ECH:
<def>
<!--Generated from: [ *clabber|ed milk <*sour>; *cottage~*cheese <*sour> ]-->
<seg> <gloss>clabbered</gloss> milk; <gloss>cottage~*cheese</gloss> </seg>
<bibl corresp="psn:JM psn:AM">Y14.182</bibl>
<seg><gloss type="i">sour</gloss></seg><bibl corresp="psn:ECH">ECH</bibl>
</def>
Note that there are two instances of the same unattested gloss in the original, but we should have only one in the output, so I'm using distinct-values in the XSLT. Also note that the opening angle-bracket entity is outside the tag, but also needs to be removed. I've now written the XSLT for this, and I'll run it tomorrow morning.
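The lift described above can be roughed out in Python as well. This is only a sketch of the idea, not the XSLT itself. It assumes the brackets are stored as the entities &lt; and &gt;, with the opening one outside the <gloss> tag; the function name lift_unattested is hypothetical, and emitting one <seg>/<bibl> pair per distinct gloss is an assumption for defs that contain more than one.

```python
import re

# Assumption: unattested glosses look like  &lt;<gloss>sour&gt;</gloss>
# (opening entity outside the tag, closing entity inside it).
UNATTESTED = re.compile(r"\s*&lt;<gloss>([^<&]+)&gt;</gloss>")


def lift_unattested(def_xml: str) -> str:
    """Remove unattested glosses from a <def> and re-add each distinct
    one as a new <seg> ascribed to ECH, just before </def>."""
    found = []

    def collect(match):
        gloss = match.group(1).strip()
        if gloss not in found:  # same role as distinct-values(): keep one copy
            found.append(gloss)
        return ""  # drop the gloss (and the whitespace before it) in place

    cleaned = UNATTESTED.sub(collect, def_xml)
    if not found:
        return def_xml

    extra = "".join(
        f'<seg><gloss type="i">{g}</gloss></seg><bibl corresp="psn:ECH">ECH</bibl>'
        for g in found
    )
    # A function replacement avoids any backslash handling in the new text.
    return re.sub(r"\s*</def>\s*$", lambda m: extra + "</def>", cleaned, count=1)
```

Working on the serialized string rather than a parsed tree is deliberate here: the entity sits in mixed content on one side of a tag boundary, which is awkward to pick apart node by node.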
Once that job is done, the only remaining unattested glosses will be in <cit>s, and they can be commented out. In the XML source the angle brackets are stored as the entities &lt; and &gt;, so you can find them with this regex:
(&lt;<gloss>[^<]+&gt;</gloss>)
and replace them with:
<!-- $1 -->
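The same find-and-replace can be tried from a script. A minimal sketch, assuming the brackets are stored as the entities &lt; and &gt; and that the string passed in is the content of a <cit>; comment_out_unattested is a hypothetical name. Python's re module writes the back-reference as \1 rather than $1.

```python
import re

# Assumption: the source form is  &lt;<gloss>...&gt;</gloss>
UNATTESTED_IN_CIT = re.compile(r"(&lt;<gloss>[^<]+&gt;</gloss>)")


def comment_out_unattested(cit_xml: str) -> str:
    """Wrap each unattested gloss in an XML comment, leaving the rest alone."""
    return UNATTESTED_IN_CIT.sub(r"<!-- \1 -->", cit_xml)
```

One caveat: an XML comment may not contain two hyphens in a row, so any gloss text containing "--" would need handling by hand.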
There are also instances of these things without gloss tags:
<seg> <gloss>clabber|ed</gloss> milk &lt;*sour&gt;; *cottage~*cheese &lt;*sour&gt; </seg>
Those can be matched with:
(&lt;[^&<]+&gt;)
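To get a list of the untagged cases for manual review, something like the following might do. It is a sketch under the same assumption that the brackets are stored as &lt; and &gt; entities; find_untagged is a hypothetical name. Excluding both & and < inside the brackets keeps each match to a single bracketed gloss and skips the tagged form, where a <gloss> tag follows the opening entity.

```python
import re

# Assumption: untagged unattested glosses look like  &lt;*sour&gt;
UNTAGGED = re.compile(r"&lt;[^&<]+&gt;")


def find_untagged(seg_xml: str) -> list:
    """Return every untagged bracketed gloss found in the string, in order."""
    return UNTAGGED.findall(seg_xml)
```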
Updated the XQuery and XSLT to account for changes spelled out in recent posts. The online db is now working again with the current XML files.
Accomplished all the planned changes to XML files through XSLT, and also tweaked the schema a bit. More work is to be done on the schema, probably using oddbyexample.
Arising out of today's meeting:
- dictegs to be removed wholesale; their contents just sit in place.
- hyphs which consist only of a single morpheme pointing at the xml:id of their ancestor entry to be commented out.
- note type="referToElders" where ancestor::entry does NOT have n="referToElders" should be changed to type="referToEwa".
- note type="referToElders" where ancestor::entry DOES have n="referToElders" should be left alone.
- entry/@n should be deleted.
- Online db code needs to be updated to take account of all these changes.
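The two note rules and the entry/@n deletion from the list above could be prototyped like this before committing them to XSLT. It is a sketch only: it assumes the files are in the TEI namespace, reads "ancestor::entry" as "any ancestor entry", and leaves the dicteg and hyph items alone. The function name apply_note_changes is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"  # assumption: files use the TEI namespace
ENTRY = f"{{{TEI}}}entry"
NOTE = f"{{{TEI}}}note"


def apply_note_changes(root):
    """Retag referToElders notes outside flagged entries, then drop entry/@n."""
    # Pass 1: remember every note that sits inside an entry with n="referToElders".
    protected = set()
    for entry in root.iter(ENTRY):
        if entry.get("n") == "referToElders":
            protected.update(id(note) for note in entry.iter(NOTE))

    # Pass 2: any other referToElders note inside an entry becomes referToEwa.
    for entry in root.iter(ENTRY):
        for note in entry.iter(NOTE):
            if note.get("type") == "referToElders" and id(note) not in protected:
                note.set("type", "referToEwa")

    # Pass 3: only now is it safe to delete @n, since pass 1 depended on it.
    for entry in root.iter(ENTRY):
        entry.attrib.pop("n", None)
    return root
```

The order matters: deleting entry/@n first would lose the information needed to decide which notes to leave alone.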
We have decided to always hyphenate compound words into ALL their components, as in the following examples.
√<m sameAs="ḥawˀy">ḥáwˀy</m>-<m sameAs="aɬ">a</m>-
<m sameAs="s">s</m>-<m sameAs="n">n</m><m sameAs="DIM">C₁</m>√<m sameAs="cwˀaxaʔ">cwˀáxaʔ</m>
<m sameAs="kas">kas</m>-√<m sameAs="ḥawˀy">ḥáwˀiy</m>-<m sameAs="aɬ">ɬ</m>-√<m sameAs="təmnayˀ">təmnayˀ</m>-<m sameAs="mix">əxʷ</m>
That is, we will NOT just divide compounds into stem-connector-stem. If we keep the structure flat, as in the examples above, it reduces the number of inferred entries we have to create, and means we don't have to interpret potentially ambiguous morphological structures. (The first example above is clearly [√ḥáwˀy]-a-[s-n-c-√cwˀáxaʔ], but the second could be [kas-√ḥáwˀiy]-ɬ-[√təmnayˀ-əxʷ] or kas-[√ḥáwˀiy-ɬ-√təmnayˀ]-əxʷ.)
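With the flat structure, working out which inferred entries a compound needs comes down to listing the targets of its <m> elements. A small sketch, assuming each sameAs value is the xml:id of an entry and that a set of existing ids is available from somewhere; morpheme_targets, missing_targets and known_ids are all hypothetical names.

```python
import re

# Assumption: each morpheme is marked up as <m sameAs="some-xml-id">...</m>
SAME_AS = re.compile(r'<m\s+sameAs="([^"]+)"')


def morpheme_targets(hyph_xml: str) -> list:
    """List the sameAs targets of a hyphenated form, in order of appearance."""
    return SAME_AS.findall(hyph_xml)


def missing_targets(hyph_xml: str, known_ids: set) -> list:
    """Targets with no existing entry, i.e. candidates for inferred entries."""
    missing = []
    for target in morpheme_targets(hyph_xml):
        if target not in known_ids and target not in missing:
            missing.append(target)
    return missing
```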
ALL compound entries will have this feature structure:
<fs>
<f name="baseType">
<symbol value="compound"/>
</f>
</fs>
We will create an inferred root entry for the root of the second stem, if it does not already exist in the database, and add a <note type="referToElders"> to the compound entry, asking whether the second stem can stand on its own as a word. If the Elders say yes, we will create a new entry for the stem, and add <xr>s to and from the compound entry.
Password protection got removed when I uploaded my local copy of Moses to the server. I tried copying the SVN version of web.xml up there, but that killed the app and Tomcat couldn't restart it, so I put the original web.xml back and instead merged the changes from the SVN version of web.xml into the server copy by hand. That seems to be working. Had to restart Tomcat a couple of times, but that has been smooth and problem-free since Tomcat-stable was moved to Grape.
I have entered all the lexical suffixes from MDK's cards, but have not entered their dictegs, on the assumption that these words exist elsewhere in the database, organized by their prefix or root.
As I was entering the lexical suffixes, I only checked for their dictegs elsewhere in the data if the dictegs were:
-personal names, or
-examples of lexical suffixes with Meaning Not Determined.
I found that the vast majority of these dictegs do exist elsewhere in the data, BUT:
-sometimes additional info is on the lexical suffix card (e.g., Sam Miller)
-sometimes the word only exists in another dicteg (e.g., shotgun)
-sometimes the morpheme breakdown is different (e.g., Nellie Leo, Canada goose)
-sometimes the entry is not all there due to a bad conversion from Lexware (e.g., Paul Timentwa)
Our approach to this issue will therefore be:
-wait until all the alphabetical files are edited
-check that all dictegs NOT yet entered from the lexical suffix cards exist as entries elsewhere in the data. Enter any missing information, and address differing morpheme breakdown.
-check lexical suffix dictegs that WERE entered at the Lexware stage:
--check cards against Lexware. Pencil any changes onto the Lexware printout.
--check whether the dictegs exist as entries in other files. Refer to phr_to_seg_matches_2.odt.
---if phr to seg is a perfect match in the list, search on the xml:id to view the entry. Check whether the translation is also a match. If the translation adds any new information that's not in the entry already, copy the new information from the lex-suf file into the alphabetical file, with a Comment about where it came from. Then Comment out the dicteg in the lex-suf file.
---if phr to seg is NOT a perfect match in the list, search more carefully on the phr and/or the translation to try to find the entry, and why it didn't match. Check Lexware printout for discrepancies. Inform Martin of any perfect matches NOT found by the search.
----if the entry can be found, copy any relevant information from the lex-suf file into the alphabetical file, with a Comment about where it came from. Then Comment out the dicteg in the lex-suf file.
----if the entry really cannot be found, copy the whole dicteg into the alphabetical file (near its root), with a Comment about where it came from. Build a well-formed entry, changing the <phr> to pron:seg and the <seg> to def:seg. Then Comment out the dicteg in the lex-suf file, noting that it could not be found elsewhere and has been copied into the appropriate alphabetical file.
Tomcat went into a tailspin yesterday while I was in the middle of uploading to the Moses db, and the db (presumably) got corrupted; in any case, Moses would not come back even after two restarts of Tomcat. This morning, I brought down Tomcat and replaced the Moses webapp (in webapps-dev) with a copy of my local version. All seems to be working now.