Kenneth Whistler
2006-09-07 00:35:28 UTC
As for utf-8 vs. Unicode, this is a bit tricky. I agree that merely
specifying Unicode isn't sufficient given the potential for
incompatible CESs.
UTF-8 vs. Unicode is an incomplete way of specifying the
distinctions to be made. It is a level-appropriateness issue.
If your concern is specification of the character semantics,
then you designate the Unicode Standard (or the equivalent
ISO/IEC 10646) and a version level to get the exact
repertoire.
If your concern is memory representation or API support then
you designate one of the 3 Character Encoding Forms formally
and normatively defined in the Unicode Standard (and equivalently
in ISO/IEC 10646): UTF-8, UTF-16, or UTF-32.
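To make the three encoding forms concrete, here is a small sketch in Python (the choice of character, U+4E8C, is arbitrary and just for illustration): the same code point becomes three 8-bit code units in UTF-8, one 16-bit code unit in UTF-16, and one 32-bit code unit in UTF-32.

```python
# One code point, three encoding forms (big-endian shown for the
# multi-byte forms so the code-unit values are easy to read).
ch = "\u4e8c"  # CJK ideograph meaning "two"

print(ch.encode("utf-8"))      # three 8-bit code units: E4 BA 8C
print(ch.encode("utf-16-be"))  # one 16-bit code unit:  4E8C
print(ch.encode("utf-32-be"))  # one 32-bit code unit:  00004E8C
```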
If your concern is serial byte representation in a char-oriented
protocol or stream, then you designate one of the CES's formally
and normatively defined in the Unicode Standard: UTF-8,
(UTF-16BE, UTF-16LE, UTF-16 with BOM), (UTF-32BE, UTF-32LE, UTF-32 with
BOM).
All of the CES's are fully interoperable and compatible with
each other. And only those CES's normatively defined in
the Unicode Standard should be considered CES's of Unicode.
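The byte-order distinctions among the CES's can be seen by serializing the same text under each. A sketch using Python's codec names, where plain "utf-16" corresponds to the "UTF-16 with BOM" scheme and the -be/-le variants omit the BOM:

```python
# One character, three UTF-16 encoding schemes.
print("A".encode("utf-16-be"))  # b'\x00A'  -- big-endian, no BOM
print("A".encode("utf-16-le"))  # b'A\x00'  -- little-endian, no BOM
print("A".encode("utf-16"))     # BOM first (byte order is platform-
                                # dependent), then the 'A' code unit
```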
And yet I'm sympathetic to the notion that UTF-8
pessimizes storage and transmission of text written in certain
languages. IMHO it's unreasonable to exclude the potential for a
Unicode based CES that has more-or-less equivalent information
density across a wide variety of languages.
Ah, but that is precisely none other than UTF-16, and is in
widespread use for that reason and other reasons. But it doesn't
make much sense for the web or for most internet protocols,
because of the already existing ubiquity of UTF-8 in those
contexts.
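The storage trade-off is easy to demonstrate. A sketch (the sample strings are illustrative only): ASCII-range text doubles in size under UTF-16, while BMP ideographic text shrinks from three bytes per character in UTF-8 to two in UTF-16.

```python
# Byte counts for the same text under UTF-8 and UTF-16 (LE, no BOM).
samples = {
    "ascii": "hello world",     # 11 chars: 11 bytes UTF-8, 22 UTF-16
    "cjk":   "统一码字符编码",    # 7 chars: 21 bytes UTF-8, 14 UTF-16
}
for name, s in samples.items():
    print(name, len(s.encode("utf-8")), len(s.encode("utf-16-le")))
```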
But I do think that use of
multiple CESs in a new protocol should require substantial
justification, and that UTF-8 should be presumed to be the CES of
choice for any new protocol that requires ASCII compatibility for its
character representation.
I agree completely with that assessment.
--Ken