Working on Unicode file handling: this is hard
Delphi 2009 has better file I/O for Unicode text files than any previous version, but there are still plenty of holes. It's good at loading a file that has a BOM, but if there's no BOM, it just falls back to the system's default encoding. I need to do much better than that, so I'm writing a lot of new code, and repurposing old code, to make that happen.

What I've got so far is a function which automatically detects a UTF-8, UTF-16 or UTF-32 BOM, and, failing that, examines the bytes of the file to judge whether it's likely to be UTF-8. Now that's working, I need to go further and check for explicit character encodings named in the preamble of the file itself, in HTML, XHTML or XML files. This will involve assuming ANSI (which is reasonable for a preamble), loading the file that way, then searching all the likely locations for an encoding declaration, allowing for differences in case and so on. I've done something a little like this before, but this time it has to be rather more bulletproof.

Then I have to follow Marco Cantu's example of defining a custom encoding class to create the second of the two UTF-32 encodings (one for each byte order), so my apps can load files in UTF-32.
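The BOM-detection step can be sketched as follows. My actual code is Delphi, so this is just an illustrative Python sketch of the logic, with names of my own invention; the one subtlety it shows is that the UTF-32 little-endian BOM (FF FE 00 00) starts with the UTF-16 little-endian BOM (FF FE), so UTF-32 has to be checked first.

```python
# Illustrative sketch of BOM-based detection (hypothetical names; the
# real implementation is Delphi code). Order matters: UTF-32-LE's BOM
# begins with UTF-16-LE's BOM, so the longer patterns come first.
BOMS = [
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xff\xfe\x00\x00', 'utf-32-le'),
    (b'\x00\x00\xfe\xff', 'utf-32-be'),
    (b'\xff\xfe', 'utf-16-le'),
    (b'\xfe\xff', 'utf-16-be'),
]

def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

With the BOM length returned as well, the caller can skip past the BOM before decoding the rest of the file.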
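The "is it likely UTF-8?" fallback rests on the fact that UTF-8 multi-byte sequences follow strict rules, so random ANSI text rarely passes validation by accident. A minimal Python sketch of that idea (not the Delphi routine itself, which scans the bytes directly):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes are valid UTF-8 *and* contain at
    least one multi-byte sequence. Pure ASCII is valid UTF-8 too, but
    offers no evidence either way, so it returns False here."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in data)
```

Text saved in a single-byte ANSI code page tends to fail the decode as soon as a high byte appears without valid continuation bytes after it.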
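The preamble search amounts to: read the first chunk of the file as single-byte text, then look case-insensitively for an XML declaration's encoding attribute or an HTML meta charset. A rough Python sketch of that search, with patterns and names that are my own illustration rather than the real Delphi code:

```python
import re

# Hypothetical patterns for two of the likely declaration sites:
# the XML declaration and the HTML <meta> charset (both forms of it).
XML_DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', re.I)
META_CHARSET = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.I)

def sniff_declared_encoding(data: bytes):
    """Search the file's preamble for an explicit encoding name."""
    head = data[:1024]  # declarations appear early, so only scan the start
    for pattern in (XML_DECL, META_CHARSET):
        m = pattern.search(head)
        if m:
            return m.group(1).decode('ascii').lower()
    return None
```

The `re.I` flag handles the case differences; a bulletproof version would also have to cope with odd whitespace and declarations split across the chunk boundary.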
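What a UTF-32 encoding class actually has to do is simple at heart: each character is a fixed four-byte code point in a given byte order. As a sketch of that byte-to-codepoint mapping (in Python for illustration; the real thing is a Delphi encoding class in the style of Marco Cantu's example, and this helper name is hypothetical):

```python
def decode_utf32le(data: bytes) -> str:
    """Manually decode UTF-32 little-endian: four bytes per code point,
    rejecting out-of-range values and surrogates."""
    if len(data) % 4:
        raise ValueError('UTF-32 data must be a multiple of 4 bytes')
    chars = []
    for i in range(0, len(data), 4):
        cp = int.from_bytes(data[i:i + 4], 'little')
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError('invalid code point U+%X' % cp)
        chars.append(chr(cp))
    return ''.join(chars)
```

The big-endian variant differs only in the byte order passed to the conversion, which is why the second encoding is so cheap to add once the first exists.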