TEI 2017: Dictionary Encoding Based on CSS and XML/HTML Parsers

TEI 2017, Victoria, British Columbia, Canada, November 11-15

Dictionary Encoding Based on CSS and XML/HTML Parsers (poster)

Kadyr Momunaliev (Kyrgyzstan-Turkey Manas University), Joseph Ten (Kyrgyz State Technical University), and Nella Israilova (Kyrgyz State Technical University)

Kadyr Momunaliev has worked for over five years on the creation of electronic versions of dictionaries and encyclopedias in the Kyrgyz Republic, and is currently based in Bishkek, the capital city of the Kyrgyz Republic.

1. Source File Structure

The source file was obtained by pdf-to-doc-to-(x)html conversion. The resulting XHTML file preserves almost all typographic features except correct column rendering. The purpose of the conversion is to annotate the text at the source level using the OHCO (Ordered Hierarchy of Content Objects) model, rather than in WYSIWYG editors as we did previously. CSS and XHTML parsing is based on the following tag hierarchy:

                    
<html>
  <head>
  <!-- … -->
  <body>
    <div>+
      <p>+
        <span>text</span>+
      <h3>+
        <span>text</span>+
      <any_element>+
        <span>text</span>+

Figure 1. Object hierarchy of the source file.
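
The hierarchy above can be walked with a standard HTML parser to recover each text run together with its typographic features. The following is a minimal sketch, not the parser described in this poster. It assumes that bold and italic are carried by an inline `style` attribute on each `<span>` (a converted file may use CSS classes instead), and the names `SpanCollector` and `collect_spans` are hypothetical.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect (text, bold, italic) triples from <span> elements."""

    def __init__(self):
        super().__init__()
        self.spans = []      # collected (text, bold, italic) triples
        self._style = None   # style of the currently open <span>, if any
        self._text = []      # text chunks seen inside the open <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # Assumption: typography is encoded in an inline style attribute.
            style = dict(attrs).get("style") or ""
            self._style = style.replace(" ", "").lower()
            self._text = []

    def handle_data(self, data):
        if self._style is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._style is not None:
            bold = "font-weight:bold" in self._style
            italic = "font-style:italic" in self._style
            self.spans.append(("".join(self._text), bold, italic))
            self._style = None


def collect_spans(xhtml):
    """Return the list of (text, bold, italic) triples found in xhtml."""
    parser = SpanCollector()
    parser.feed(xhtml)
    parser.close()
    return parser.spans
```

Nested `<span>` elements are not handled in this sketch; Figure 1 suggests that spans are leaf elements, so a flat pass may be enough.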

2. Parser Output Structure

                     
<dict>
  <entry>+
    <form>
      <orth>
      <form type={"variant"|"inflected"|"compound"}>?
        <orth>
    <senseList>
      <sense n={"1."|"2."|"3."|"4."}>
        <transList>
          <trans>+
        <example>?
          <form>
          <sense>
    <subEntryList>?
      <subEntry>+
        <form>
        <sense>

Figure 2. Outline of the structure in XHTML.

dict_p = element dict{entry_p+}
entry_p = element entry{form_p, senseList_p, subEntryList_p?}
form_p = element form{orth_p, element form{attribute type{"variant"|"inflected"|"compound"}, orth_p}?}
orth_p = element orth{text}
senseList_p = element senseList{sense_p+}
sense_p = element sense{translation_p+, element example{form_p, sense_p}?}
translation_p = element translation{text}
subEntryList_p = element subEntryList{subEntry_p+}
subEntry_p = element subEntry{form_p, senseList_p}

Figure 3. RELAX NG version of schema.

3. Interpretation of Text Features

The lexicographic structure of the dictionary is conveyed by typography (font features, layout) and syntax (predefined indicators and punctuation). Below are some of the interpretation rules we used for parsing, in [text feature] : [interpretation] format:

1. Font Features
    1.1. [bold] AND [NOT[enumeration, reference]]: [form element]
    1.2. [non-bold] AND [non-italic]: [sense element] 
    1.3. [non-bold italic] AND [Latin Encoding]: [international names of flora and fauna]
2. Indicators/Punctuation
    2.1. [‘1.’, ‘2.’ etc.]: [entry sense number] 
    2.2. [‘1)’, ‘2)’ etc.]: [sub-entry sense number] 
    2.3. [Roman numeral]: [the homograph is to the left]
    2.4. [‘:’]: [delimits <orth> element from its compound form]
    2.5. [‘~’]: [headword placeholder]=[<oRef/> in TEI]
    2.6. [‘, -’]: [delimits <orth> element from its inflected form]
    2.7. [‘,’ between bold words]: [delimits <orth> element from its variant form]
    2.8. [non-Latin text between parentheses]: [definition of sense] OR [context] OR [directing information] etc.

Figure 4. Dictionary interpretation rules.
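
A few of these rules can be expressed as a small classifier over text runs. The sketch below is illustrative only and covers rules 1.1 to 1.3 and 2.1 to 2.3. The function name `classify` and the label strings are hypothetical, "Latin encoding" is approximated by an ASCII check, and the "reference" exception of rule 1.1 is not modelled because its surface form is not specified here.

```python
import re

SENSE_NUMBER = re.compile(r"^\d+\.$")     # rule 2.1: '1.', '2.' etc.
SUBSENSE_NUMBER = re.compile(r"^\d+\)$")  # rule 2.2: '1)', '2)' etc.
ROMAN_NUMERAL = re.compile(r"^[IVX]+$")   # rule 2.3: homograph number


def classify(text, bold=False, italic=False):
    """Map one text run and its font features to an interpretation label."""
    token = text.strip()
    # Enumerations are tested first so that a bold '1.' is not taken
    # for a form (the NOT[enumeration] part of rule 1.1).
    if SENSE_NUMBER.match(token):
        return "sense_number"
    if SUBSENSE_NUMBER.match(token):
        return "subsense_number"
    if ROMAN_NUMERAL.match(token):
        return "homograph_number"
    if bold:
        return "form"                     # rule 1.1
    if italic and token.isascii():
        return "scientific_name"          # rule 1.3 (ASCII is an assumption)
    if not italic:
        return "sense"                    # rule 1.2
    return "other"                        # non-bold, non-Latin italic text
```

The order of the checks matters: the punctuation-based indicators take precedence over the font-based rules, which mirrors the exception written into rule 1.1.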

4. Parsing Workflow

The parser follows a simple scheme to keep the logic clear and the complexity minimal. The main principle is that structural tokens are defined first; key tokens are then used to identify the desired elements or their boundaries.

1. Retrieve lexical data from the source file.
2. Mark up dictionary entries.
3. Mark up <form> with the @type attribute.
4. Mark up <senseList> and <subEntryList> elements.
5. Mark up sense items inside <senseList>.
6. Mark up sense examples, if any, and move them into <sense> elements.
7. Mark up all sub-entries inside <subEntryList>.
8. Once the general structure is in place, mark up the indicated constructions (<lbl>, <usg>, <def> etc.) and move them, if needed, into the appropriate elements.

Figure 5. The workflow steps.
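
As an illustration of the step that marks up sense items inside <senseList>, the sketch below groups a flat stream of labelled tokens into <sense> elements, using sense numbers such as '1.' as boundaries. It is a simplified sketch under stated assumptions: the input format (a list of `(label, text)` pairs with the hypothetical labels `sense_number` and `sense`) and the function name `build_sense_list` are not taken from the parser itself, and each sense token is assumed to be a single translation.

```python
import xml.etree.ElementTree as ET


def _trans_list(sense):
    """Return the <transList> child of a sense, creating it if needed."""
    found = sense.find("transList")
    return found if found is not None else ET.SubElement(sense, "transList")


def build_sense_list(tokens):
    """Group (label, text) tokens into a <senseList> element."""
    sense_list = ET.Element("senseList")
    current = None
    for label, text in tokens:
        if label == "sense_number":
            # A sense number is a key token: it opens a new <sense>.
            current = ET.SubElement(sense_list, "sense", {"n": text})
        elif label == "sense":
            if current is None:
                # No number seen yet: treat as a single unnumbered sense.
                current = ET.SubElement(sense_list, "sense")
            ET.SubElement(_trans_list(current), "trans").text = text
    return sense_list
```

Tokens with any other label are ignored here; in a fuller pipeline they would be handled by the other workflow steps.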

5. Structure According to TEI P5

                  
<dict>
  <superEntry>?
  <entry>+
    <form>
      <orth>
      <form type={"variant"|"inflected"|"compound"}>?
        <orth>
    <sense>
      <sense n={"1."|"2."|"3."|"4."}>+
        <cit type="translation">
          <quote>+
          <cit type="example">?
            <quote>
            <cit type="translation">
              <quote>
    <re>*
      <form>
        <orth>
      <sense n={"1)"|"2)"|"3)"|"4)"}>+
        <cit type="translation">
          <quote>+

Figure 6. Outline of the structure in TEI P5.

dict_p = element dict{(superEntry_p | entry_p)+}
   superEntry_p = element superEntry{entry_p+}
   entry_p = element entry{form_p, sense_1, re_p*}
      form_p = element form{orth_p, element form{attribute type{"compound"|"inflected"|"variant"}, orth_p}?}
         orth_p = element orth{text}
      sense_1 = element sense{sense_2+}
         sense_2 = element sense{attribute n{"1."|"2."|"3."|"4."}, cit_1}
            cit_1 = element cit{attribute type{"translation"}, quote_1+, cit_ex?}
               quote_1 = element quote{text}
               cit_ex = element cit{attribute type{"example"}, quote_2, cit_2}
                  quote_2 = element quote{text}
                  cit_2 = element cit{attribute type{"translation"}, quote_3}
                     quote_3 = element quote{text}
      re_p = element re{element form{orth_p}, sense_3}
         sense_3 = element sense{sense_4+}
            sense_4 = element sense{attribute n{"1)"|"2)"|"3)"|"4)"}, cit_3}
            cit_3 = element cit{attribute type{"translation"}, quote_4+}
               quote_4 = element quote{text}

Figure 7. RELAX NG version of the XML/TEI structure (<xr>, <usg>, <lbl> and <def> elements are not shown).

6. Mapping Parser Output to TEI P5

To summarize, the structure of the dictionary is recursive. In general, a main entry contains a list of senses followed by an optional list of recursive entries, which we call sub-entries. The only structural difference between a main entry and a sub-entry is that the senses of the latter cannot contain examples. Figure 2 and Figure 6 show that sense elements are represented differently: TEI P5 uses <cit> elements and our XHTML does not. Additionally, our schema does not deal with homographs.
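
The difference in sense representation can be made concrete with a small transformation from the parser output of Figure 2 to the <cit>-based form of Figure 6. The following is a minimal sketch rather than the actual conversion: the function names `sense_to_tei` and `_text` are hypothetical, the TEI namespace is omitted, and an example's form and sense are each flattened into a single <quote>.

```python
import xml.etree.ElementTree as ET


def _text(element):
    """Concatenate all text inside an element (empty string if missing)."""
    return "".join(element.itertext()).strip() if element is not None else ""


def sense_to_tei(sense):
    """Convert a parser-output <sense> into a TEI-style <sense> with <cit>."""
    tei = ET.Element("sense", dict(sense.attrib))  # keeps @n if present
    cit = ET.SubElement(tei, "cit", {"type": "translation"})
    # Each <trans> of the sense becomes a <quote>.
    for trans in sense.findall("transList/trans"):
        ET.SubElement(cit, "quote").text = trans.text
    example = sense.find("example")
    if example is not None:
        # The example nests inside the translation <cit>, as in Figure 6.
        ex_cit = ET.SubElement(cit, "cit", {"type": "example"})
        ET.SubElement(ex_cit, "quote").text = _text(example.find("form"))
        ex_trans = ET.SubElement(ex_cit, "cit", {"type": "translation"})
        ET.SubElement(ex_trans, "quote").text = _text(example.find("sense"))
    return tei
```

Because sub-entry senses cannot contain examples, the same function could in principle be applied to them as well; the `example` branch would simply never run.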