Greek document conversion
Posted by mholmes on 20 Mar 2009 in Activity log
First proofable output created and sent as PDF to PC. Here are points from today:
- The most difficult part turned out to be converting the text styles so that they used Gentium instead of SymbolGreekII. Even when we thought we had all the right settings in the XML files from the ODT archive, OOo insisted on rendering all the Greek text in Times New Roman. In the end, Greg opened the doc file in Word and used Word's rather obscure search-and-replace, specifying the font name in the formatting options, to replace all instances of the original font name with "Gentium". Then he saved in Word, and I took it from there.
- I opened it in Open Office Writer and saved it as ODT.
- I opened the ODT file in Archive Manager, and extracted it to a folder.
- I ran my now fiendishly-complicated XSLT on the content.xml file and the styles.xml file. This is basically a combination of huge translate and replace operations, carefully sequenced, along with some special cases (see below).
- I placed the resulting files back into the ODT file in Archive Manager, and opened the ODT file in Open Office Writer to save it as a PDF.
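The extract, transform, and repack steps above were done by hand in Archive Manager. As a rough sketch only, the same round trip could be scripted in Python, since an ODT file is a ZIP archive. The function name `rewrite_odt` and the `transform` callable are hypothetical stand-ins; in the real workflow the transformation is the XSLT, which is not reproduced here.

```python
import zipfile


def rewrite_odt(src_path, dst_path, transform,
                members=("content.xml", "styles.xml")):
    """Copy an ODT archive, passing selected XML members through `transform`.

    `transform` is a hypothetical callable taking (member_name, xml_bytes)
    and returning the converted bytes. It stands in for the XSLT step.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w") as dst:
        # ODF packages expect the 'mimetype' entry to come first and to be
        # stored uncompressed, so move it to the front of the list.
        names = sorted(src.namelist(), key=lambda n: n != "mimetype")
        for name in names:
            data = src.read(name)
            if name in members:
                data = transform(name, data)
            compression = (zipfile.ZIP_STORED if name == "mimetype"
                           else zipfile.ZIP_DEFLATED)
            dst.writestr(name, data, compress_type=compression)
```

A call such as `rewrite_odt("in.odt", "out.odt", my_transform)` would leave every other member of the archive untouched; the file names here are placeholders.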
So one lesson learned is that you need to do the font name conversion right at the beginning. It's also best to keep the Greek (or other language) font distinct from the main text font (in this case Times New Roman), so that the XSLT can still use the font name to identify text in the target language for conversion.
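To illustrate why the distinct font name matters, here is a minimal Python sketch (not the XSLT actually used) that finds the styles declaring a given font and converts only the spans that use those styles. It assumes the font is set through a `style:font-name` attribute on `style:text-properties`, and it only touches the direct text of each `text:span`. The function names and the `convert` callable are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def q(prefix, local):
    """Build an ElementTree qualified name such as {uri}span."""
    return "{%s}%s" % (NS[prefix], local)


def styles_using_font(root, font_name):
    """Return the names of all styles whose text properties name the font."""
    names = set()
    for style in root.iter(q("style", "style")):
        props = style.find("style:text-properties", NS)
        if props is not None and props.get(q("style", "font-name")) == font_name:
            names.add(style.get(q("style", "name")))
    return names


def convert_spans(root, font_name, convert):
    """Apply `convert` to the text of spans styled with `font_name`.

    `convert` is a hypothetical callable (str -> str) standing in for the
    translate/replace logic. Text in nested elements is left alone.
    """
    targets = styles_using_font(root, font_name)
    for span in root.iter(q("text", "span")):
        if span.get(q("text", "style-name")) in targets and span.text:
            span.text = convert(span.text)
```

If the Greek text shared a font with the main text, `styles_using_font` would return the main text styles as well, and there would be no way to restrict the conversion to the Greek.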
This particular conversion has the following special cases which are idiosyncratic to this text:
- There are many cases in which normal sequences of character + diacritic have been reversed, and also reversed with an injected space, where the character is a capital letter. This was done in order to manipulate the appearance of the output, where diacritics are supposed to be off to the left of the character to which they apply.
- A combining dot below (which is actually the greater-than-or-equals sign in Unicode) was used to mark characters whose transcription was uncertain. Since this symbol was not available in SymbolGreekII, these characters typically appear in a <text:span> element of their own, with a different style setting. I fixed this by replacing the style of this span with the style of the preceding span, which is (I assume) guaranteed to contain the character to which the diacritic is intended to apply.
- Some footnotes had deviated from Times New Roman into the old Times font, and PC requested that we convert them. The same font/style substitution code which had failed to work for converting SymbolGreekII to Gentium in the XSLT did in fact work for this conversion, which seems to have gone OK (although it'll take a lot of checking to be sure).
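The first two special cases above can be sketched in Python as follows. This is an illustration under assumptions, not the XSLT that was run. It assumes the text has already been converted to Unicode with the reversed ordering still in place, that every combining mark found before a Greek letter belongs to that letter, and that the dot marker is passed in by the caller. The function names are hypothetical.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SPAN = "{%s}span" % TEXT_NS
STYLE_NAME = "{%s}style-name" % TEXT_NS

# Combining marks followed either by a space and a Greek capital,
# or directly by a Greek letter (capital or lowercase).
REVERSED = re.compile(
    "([\u0300-\u036f]+)(?: ([\u0391-\u03a9])|([\u0391-\u03a9\u03b1-\u03c9]))"
)


def reorder_diacritics(text):
    """Move leading combining marks after their letter, then compose (NFC)."""
    def swap(match):
        letter = match.group(2) or match.group(3)
        return letter + match.group(1)
    return unicodedata.normalize("NFC", REVERSED.sub(swap, text))


def restyle_marker_spans(root, marker):
    """Give each span holding only `marker` the style of the preceding span.

    Returns the number of spans changed. Only spans that immediately
    follow another span under the same parent are touched.
    """
    changed = 0
    for parent in root.iter():
        previous = None
        for child in parent:
            if child.tag != SPAN:
                previous = None
                continue
            style = previous.get(STYLE_NAME) if previous is not None else None
            if child.text == marker and len(child) == 0 and style is not None:
                child.set(STYLE_NAME, style)
                changed += 1
            previous = child
    return changed
```

For example, `reorder_diacritics("\u0313 \u0391")` moves the breathing mark after the capital alpha and composes the pair into a single precomposed character. The restyling function would be called with whichever character marks the dot in the file, for instance `restyle_marker_spans(root, "\u2265")` if it is still encoded as the greater-than-or-equals sign.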
PC will proof the results and come back with any inconsistencies or corrections.