Unicode and ANSI file handling: huge steps forward

Posted by Martin on 19 Feb 2009 in Activity log

Made fantastic progress today. This is basically what I've implemented:

Several LoadFileToString functions with a range of different input parameters.
Functions for detecting the character set on load, including peeking into XML and HTML headers, and detecting UTF-8 byte-sequences.
An inventory (TDictionary) of code pages known to Windows, which makes it possible to look up any code page id found in a header.
A dialog box which will let you test any of the different code pages against your text until you find the right one, with live conversion and font control.
Lots of testing with a variety of languages, encodings and BOMs.
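For readers who want to see the detection order spelled out, here is a minimal sketch in Python rather than Delphi. It is not the code described above, just one plausible arrangement of the same ideas: check for a BOM first, then peek at an XML or HTML header, then test whether the bytes are valid UTF-8, and finally fall back to an ANSI code page. The names detect_encoding and load_file_to_string, the 1024-byte peek window, and the cp1252 fallback are all assumptions made for illustration. The sketch also lists the UTF-32 BOMs, which go beyond what is working so far.

```python
import codecs
import re

# Known BOMs. UTF-32 LE must be checked before UTF-16 LE,
# because the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Header patterns (illustrative, not exhaustive).
XML_DECL = re.compile(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
HTML_META = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.I)


def detect_encoding(data, fallback="cp1252"):
    """Return (encoding_name, bom_length) for a byte string."""
    # 1. A BOM is the strongest signal.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)

    # 2. Peek into the first bytes for an XML or HTML declaration.
    head = data[:1024]
    for pattern in (XML_DECL, HTML_META):
        match = pattern.search(head)
        if match:
            name = match.group(1).decode("ascii").lower()
            try:
                codecs.lookup(name)  # only trust names Python knows
                return name, 0
            except LookupError:
                pass

    # 3. Bytes that decode cleanly as UTF-8 are very likely UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8", 0
    except UnicodeDecodeError:
        # 4. Otherwise assume an ANSI code page (a guess, not a detection).
        return fallback, 0


def load_file_to_string(path, fallback="cp1252"):
    """Read a file as bytes, detect its encoding, and return text."""
    with open(path, "rb") as f:
        data = f.read()
    name, skip = detect_encoding(data, fallback)
    return data[skip:].decode(name)
```

The final fallback is the weak point: it cannot tell one ANSI code page from another, which is exactly the gap a dialog for trying code pages by hand fills.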

With the exception of UTF-32, I now have all of this stuff working. I'll have to add the UTF-32 handling, and then work on finding a decent open-source implementation of regular expressions for Delphi. At some stage, it might be worth trying to take on the broken port of the Mozilla code, which has functions for recognizing likely ANSI encodings by their byte sequences, but that might be overkill.

This really has been hard, but quite rewarding, and infinitely valuable. I can now add to Transformer the ability to specify an input code page as well as an output encoding.
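As a rough illustration of what "input code page in, output encoding out" could look like, here is a small Python sketch (Python rather than Delphi, and not Transformer's actual code). The CODE_PAGES table is a tiny hypothetical stand-in for a full inventory of Windows code pages, and the transcode function and its parameters are names invented for this example.

```python
# A tiny, hypothetical subset of Windows code page ids mapped to
# Python codec names. A real inventory would be far larger.
CODE_PAGES = {
    932: "cp932",        # Japanese (Shift-JIS)
    1200: "utf-16-le",   # UTF-16 little endian
    1201: "utf-16-be",   # UTF-16 big endian
    1250: "cp1250",      # Central European
    1251: "cp1251",      # Cyrillic
    1252: "cp1252",      # Western European
    28591: "iso-8859-1", # Latin-1
    65001: "utf-8",      # UTF-8
}


def transcode(data, input_code_page, output_encoding):
    """Decode bytes using a Windows code page id, re-encode as output_encoding."""
    try:
        codec = CODE_PAGES[input_code_page]
    except KeyError:
        raise ValueError(f"unknown code page id: {input_code_page}")
    text = data.decode(codec)          # bytes -> Unicode text
    return text.encode(output_encoding)  # Unicode text -> bytes


# Example: Windows-1252 bytes for "café" converted to UTF-8.
utf8_bytes = transcode(b"caf\xe9", 1252, "utf-8")
```

Going through Unicode text in the middle keeps the input and output sides independent, so any supported code page can be paired with any output encoding.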

This entry was posted by Martin and filed under Activity log.