Complete: created a reasonable workaround for this on Feb 7.
Paolo Cutini is using Transformer to recover some old WordStar word-processor files, and encountered a problem reloading the sequence file he had saved. The file appears to be corrupted; a control character occurs throughout the document. oXygen reports:
F An invalid XML character (Unicode: 0x1c) was found in the element content of the document.
That character is "INFORMATION SEPARATOR FOUR", also known as the file separator. It wouldn't normally be found in a Unicode document. However, the character is in the Unicode specification, so it ought to be possible to encode it somehow in a format that UTF-8 can handle. This may be a limitation of the XDom engine I'm using for XML file handling; it could be a bug in my code; or it could be that I should automatically exclude control characters on the basis that they shouldn't show up in a text document. I'll look into it. Transformer is intended for working on Unicode texts rather than ancient word-processor formats, but I do like the idea of using it to retrieve this old data; the program was written as part of a project to rescue some old DOS WordPerfect data, after all.
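As a rough illustration of the "exclude control characters" option, here is a minimal sketch in Python (not Delphi, and not Transformer's actual code). The function name `strip_invalid_xml_chars` is hypothetical; the allowed ranges are taken from the "Char" production in the XML 1.0 specification, which permits tab, line feed and carriage return but no other C0 control characters.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str, replacement: str = "") -> str:
    """Remove (or replace) characters that XML 1.0 does not allow.

    Hypothetical helper for illustration only.
    """
    return _INVALID_XML10.sub(replacement, text)


if __name__ == "__main__":
    sample = "Word\x1cStar\ttext\n"
    # The file separator (0x1C) is dropped; tab and newline are kept.
    print(repr(strip_invalid_xml_chars(sample)))
```

Passing a visible `replacement` instead of the empty string would make it easier to see afterwards where characters were dropped, at the cost of altering the text.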
I'm entering this as a task with a long deadline, because it's not a major thing. What needs to be done is to investigate the code which uses XDom to save files and see whether the file data is being correctly encoded in UTF-8. If it is, the next step is to look into the specs and see whether UTF-8 is supposed to handle this character and, if so, whether it should be encoded or escaped in some way.
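The two checks can be tried outside Transformer first. The sketch below uses Python and its standard-library XML parser, not XDom, so it says nothing about how XDom itself behaves; the helper names `utf8_round_trips` and `parses_as_xml` are hypothetical. It separates the two questions: whether UTF-8 can carry U+001C (it can, as the single byte 0x1C), and whether an XML 1.0 parser will accept that character in element content.

```python
import xml.etree.ElementTree as ET

FS = "\x1c"  # U+001C, the file separator reported by oXygen


def utf8_round_trips(ch: str) -> bool:
    """True if the character survives encoding to UTF-8 and back."""
    return ch.encode("utf-8").decode("utf-8") == ch


def parses_as_xml(document: str) -> bool:
    """True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # Question 1: is the character encodable in UTF-8?
    print("UTF-8 round trip:", utf8_round_trips(FS))
    # Question 2: does an XML parser accept it in element content?
    print("Parses as XML:", parses_as_xml(f"<seq>a{FS}b</seq>"))
```

If the first check passes and the second fails, the problem lies in XML's character rules rather than in the UTF-8 encoding, which would point toward filtering or replacing the character before saving.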
Transformer is an open-source Windows application written in Delphi 2005 by Martin Holmes. Transformer loads Unicode text files and performs sequences of search-and-replace operations on them. It provides you with an interface to create and test these sequences of search-replace operations before running them in batch mode on a set of files.