
On Mon, Apr 06, 2009 at 02:38:15PM +0200, Johann Gail wrote:
\u syntax is Java syntax, and is *NOT* UTF-8 encoding!
Correct. For example, \u2020 (the dagger symbol, †) would be \xe2\x80\xa0 or \342\200\240 in the UTF-8 encoding and \x20\x20 or \40\40 in UTF-16 (no matter if big or little endian, in this case). In the octal and hex notations, each escape denotes one 8-bit byte. I think that it is much more readable to write \u2020 for U+2020 than \xe2\x80\xa0. The \u notation is also part of C and C++ syntax (as universal character names, since C99 and C++98).
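To make the byte sequences above easy to check, here is a small Python sketch. The helper names as_hex and as_octal are made up for this illustration; only str.encode and the codec names come from the standard library.

```python
# U+2020 DAGGER, written with the \u notation discussed above.
dagger = "\u2020"

utf8 = dagger.encode("utf-8")
utf16_be = dagger.encode("utf-16-be")  # big endian, no byte order mark
utf16_le = dagger.encode("utf-16-le")  # little endian, no byte order mark


def as_hex(data: bytes) -> str:
    """Render each byte as a \\xHH escape (hypothetical helper)."""
    return "".join(f"\\x{b:02x}" for b in data)


def as_octal(data: bytes) -> str:
    """Render each byte as a \\OOO octal escape (hypothetical helper)."""
    return "".join(f"\\{b:o}" for b in data)


print("UTF-8:    ", as_hex(utf8), as_octal(utf8))
print("UTF-16 BE:", as_hex(utf16_be), as_octal(utf16_be))
print("UTF-16 LE:", as_hex(utf16_le), as_octal(utf16_le))
```

Both UTF-16 byte orders give the same two bytes here only because the high and low bytes of U+2020 happen to be equal.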
Both of them are Unicode, but the encoding scheme is different. At the moment it works fine if you use an editor that can handle Unicode properly.
I'm not sure I understand your comment. As I understand it, java.lang.String uses something like UTF-16 internally. I have never seen a text file containing Unicode characters that was encoded in anything other than UTF-8. As far as I understand, the MySQL database (which I develop for a living) accepts UTF-16 string literals (called "ucs2"), but the bug reports I've seen have always been in ASCII, ISO 8859-1, or UTF-8.
But it is a good idea to use an existing standard, e.g. the \u4 notation, instead of introducing a new proprietary ~[xx] style.
That was exactly my point. It should be trivial to implement all three notations (\x hex bytes, \ octal bytes, \u hex Unicode).

Marko
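As a rough illustration of how little code the three notations need, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than the actual parser under discussion: the function name unescape, the exact grammar (\xHH with two hex digits, \OOO with one to three octal digits, \uHHHH with four hex digits), and the choice to emit bytes in a caller-selected encoding.

```python
import re

# Assumed grammar: \xHH = one byte in hex, \OOO = one byte in octal,
# \uHHHH = one Unicode code point in hex. Alternatives are tried in order.
_ESCAPE = re.compile(r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|([0-7]{1,3}))")


def unescape(text: str, encoding: str = "utf-8") -> bytes:
    """Expand \\x, octal and \\u escapes in text into a byte string.

    \\x and octal escapes yield raw bytes; \\u escapes and all unescaped
    characters are encoded with the given encoding.
    """
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Copy the literal text between the previous escape and this one.
        out += text[pos:match.start()].encode(encoding)
        hex_byte, code_point, octal_byte = match.groups()
        if hex_byte is not None:
            out.append(int(hex_byte, 16))
        elif octal_byte is not None:
            value = int(octal_byte, 8)
            if value > 0xFF:
                raise ValueError(f"octal escape out of byte range: {match.group(0)}")
            out.append(value)
        else:
            # Surrogate code points (U+D800..U+DFFF) make encode() raise.
            out += chr(int(code_point, 16)).encode(encoding)
        pos = match.end()
    out += text[pos:].encode(encoding)
    return bytes(out)


# All three spellings of the dagger give the same UTF-8 bytes.
print(unescape(r"\u2020"), unescape(r"\xe2\x80\xa0"), unescape(r"\342\200\240"))
```

Keeping \x and octal as raw bytes while \u names a code point mirrors the distinction drawn earlier in the thread; a real implementation would also need a rule for a literal backslash, which this sketch leaves out.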