Are charset names supposed to be case sensitive?

Are charset names supposed to be case sensitive?

-Shawn

http://blogs.msdn.com/shawnste

Shawn Steele

2011-12-15 19:00:06 UTC

That's what I thought; it was unclear to me whether Leif's proposal was making a distinction between utf16 and UTF16 :)

-----Original Message-----
From: Bjoern Hoehrmann [mailto:***@gmx.net]
Sent: Thursday, December 15, 2011 10:59 AM
To: Shawn Steele
Cc: ietf-***@iana.org
Subject: Re: Are charset names supposed to be case sensitive?

Post by Shawn Steele
Are charset names supposed to be case sensitive?

No, RFC 2978 implies they are case-insensitive and so they are pretty much everywhere.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
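
Björn's answer can be checked against at least one real implementation. The sketch below assumes Python's standard `codecs` registry as a stand-in for "a product"; `same_charset` is a hypothetical helper name introduced here for illustration, not something defined by RFC 2978 or by anyone in this thread.

```python
import codecs


def same_charset(label_a: str, label_b: str) -> bool:
    """Return True if both labels resolve to the same codec in Python's registry."""
    # codecs.lookup() normalizes the label before searching, so spellings
    # that differ only in case resolve to the same CodecInfo name.
    return codecs.lookup(label_a).name == codecs.lookup(label_b).name


print(same_charset("UTF-16", "utf-16"))     # labels differ only in case
print(same_charset("UTF-16", "Utf-16"))     # mixed case
print(same_charset("utf-16", "utf-16-le"))  # genuinely different labels
```

This only shows how one implementation behaves; it says nothing about how IE or any other product mentioned in the thread matches labels.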

Leif Halvard Silli

2011-12-17 08:01:11 UTC

Hi Shawn and all,

Magic vs semantics: I don't attach magic to the casing. But I do
recognize that there nevertheless is semantics attached to the casing -
by others than myself. Thus I've tried to be consistent with the casing
found in the IANA registry (UPPERCASE) and in the Microsoft listing
(lowercase and mixedCASE). This consistency has the following benefits:

* It makes it easier to separate the semantics of the utf-16 alias
in the Microsoft listing from the semantics of the UTF-16 name in
the IANA registry.
* It separates 'unicode' from the trademarked/registered 'UNICODE'.
* The casing 'unicodeFFFE' is more readable than 'unicodefffe' or
'UNICODEFFFE'.

BTW: Here are 3 new, preliminary findings from the test suite I work on:

* Microsoft's products add the BOM *and* the <meta> charset
declaration. However, my new tests show that <meta> charset
declarations inside UTF-16 flavor files actually are IE-*incompatible*:
For a file without BOM or HTTP charset info, the <meta> charset
declaration, regardless of its value, causes IE to not sniff the
encoding - if one deletes the <meta> charset, however, *then* it sniffs
it. This fact serves to underline that 'unicode' and 'unicodeFFFE'
require the BOM, as it would actually be unsafe to say charset=unicode
unless there is a BOM.

* HORROR: IE is not alone in treating 'UTF-16/utf-16' as an alias for
'unicode': Webkit (Safari, Chrome) behave the same way. Thus, if HTTP
announces 'UTF-16' for a file without the BOM, then instead of starting
to sniff, Webkit - just like IE - defaults to LE, resulting in
mojibake in both IE and Webkit.

* Contrary to my HTML5 process based impression, IE has zero problems
with guessing the encoding and the endianness of a BOMless UTF-16 file
that doesn't get any encoding info from the HTTP Content-Type or a
<meta> element. Safari/Chrome, however, *they* default incorrectly in
such cases.

Leif H Silli
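
The second finding above (a BOM-less file labelled 'UTF-16' being read as little-endian) can be illustrated with a small sketch. It is not how IE or WebKit is implemented; `decode_utf16` and its `default` parameter are hypothetical names, and the little-endian fallback is simply the behavior described in the finding.

```python
BOM_LE = b"\xff\xfe"
BOM_BE = b"\xfe\xff"


def decode_utf16(data: bytes, default: str = "utf-16-le") -> str:
    """Decode UTF-16 bytes: honor a BOM if present, else fall back to `default`."""
    if data.startswith(BOM_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(BOM_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: nothing in the bytes themselves says which byte order was used.
    return data.decode(default)


text = "charset"
bomless_be = text.encode("utf-16-be")  # big-endian bytes without a BOM
with_bom_be = BOM_BE + bomless_be      # the same bytes preceded by a BOM

print(decode_utf16(with_bom_be) == text)  # the BOM resolves the byte order
print(decode_utf16(bomless_be) == text)   # the LE default misreads BE bytes
print(repr(decode_utf16(bomless_be)))     # the resulting mojibake
```

With the BOM present the byte order is unambiguous; without it, a fixed little-endian default turns big-endian input into unrelated characters.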


Martin J. Dürst

2011-12-17 13:09:31 UTC

Post by Leif Halvard Silli
Hi Shawn and all,
Magic vs semantics: I don't attach magic to the casing. But I do
recognize that there nevertheless is semantics attached to the casing -
by others than myself.

Who exactly? Does any product (Microsoft or otherwise) produce different
results when they see different case in the same place?

If yes, which product, and what are the differences (and what happens
with mixed case cases)?

If not, I would strongly suggest not to use case differences to refer to
different usages of the same label, because this may cause a lot of
confusion. (It already had Shawn confused, and me, too.)

Post by Leif Halvard Silli
Thus I've tried to be consistent with the casing
found in the IANA registry (UPPERCASE) and in the Microsoft listing
(lowercase and mixedCASE).

The fact that the IANA registry lists the charset labels with uppercase
characters isn't more than a random convention, and this may also be so
for the Microsoft listings. Please use a single case version unless case
is really significant in the sense that one and the same product, in one
and the same protocol slot, reacts different to different case forms.

Post by Leif Halvard Silli
* It makes it easier to separate the semantics of the utf-16 alias
in the Microsoft listing from the semantics of the UTF-16 name in
the IANA registry.

For all intents and purposes, these are one and the same charset label.
If you want to distinguish them, please do so with additional words, not
with case.

Post by Leif Halvard Silli
* It separates 'unicode' from the trademarked/registered 'UNICODE'.
* The casing 'unicodeFFFE' is more readable than 'unicodefffe' or
'UNICODEFFFE'.

For these two, you have only used a single casing, so there's no
confusion. So these should be fine.

Regards, Martin.


Shawn Steele

2011-12-17 15:35:04 UTC

Post by Leif Halvard Silli
* It makes it easier to separate the semantics of the utf-16 alias
in the Microsoft listing from the semantics of the UTF-16 name in
the IANA registry.

Microsoft's behavior is identical, regardless of casing; whatever you're observing is accidental and shouldn't be relied on.

Post by Leif Halvard Silli
* It separates 'unicode' from the trademarked/registered 'UNICODE'.

Not a lawyer, but I don't think the trademark office is going to consider cased versions of the same word sufficiently distinct as to be separate things.

-Shawn

Leif Halvard Silli

2011-12-17 17:59:20 UTC

Post by Shawn Steele

Post by Leif Halvard Silli
* It makes it easier to separate the semantics of the utf-16 alias
in the Microsoft listing from the semantics of the UTF-16 name in
the IANA registry.

Microsoft's behavior is identical, regardless of casing, whatever
you're observing is accidental and shouldn't be relied on.

The test suite I work on does not contain - and has never contained - a
single test that focuses on the casing.

--
Leif Halvard Silli

Leif Halvard Silli

2011-12-20 00:45:20 UTC

Post by Shawn Steele

Post by Leif Halvard Silli
It separates 'unicode' from the trademarked/registered 'UNICODE'.

Not a lawyer, but I don't think the trademark office is going to
consider cased versions of the same word sufficiently distinct as to
be separate things.

Post by Ken Whistler
http://www.unicode.org/policies/logo_policy.html
Apparently it is an internet meme that the CAPITALIZED version is
somehow official and refers to the standard and/or the trademark,
but "Unicode" is not any kind of acronym or abbreviation.

Interesting. But their logo uses something that in my view can be
interpreted as uppercase. Also, if you have ever announced something on
AdWords, then you might know that they do not permit the use of
all-caps, except in company name. So, in a way, this meme does not need
to be linked to all-caps as something that is 'internetish' - it can
instead be explained by typical layout conventions.

Post by Ken Whistler
That said, Shawn's point is still valid -- it is unlikely that
"Unicode", "unicode", and "UNICODE" would be considered as
referring to different things based on their
capitalization (or lack thereof).

Regardless: Lowercase is the variant that is *the least* likely to be
perceived as the official name. Unless, of course, we were to pick
'unicODE' or something ...

Post by Ken Whistler
I realize this is somewhat peripheral to the discussion about these
new registration proposals, but I just want to make sure that
whatever happens with them doesn't further contribute to confusion
about the identity of "Unicode".

If we were hindered from registering this charset, then it would be
logical for Unicode Inc to take action against Microsoft too ... But
then, 'unicode' as a charset name is 'too good' - I'm pretty sure that
many users of Microsoft's Office products pick 'Unicode' - whichever
capitalization they have used - because that name sounds more familiar
than the mysterious 'UTF-8'.

With regard to Unicode Inc, then I think it should be considered an
improvement if 'unicode' was registered as an 'of limited use' or - if
possible - 'obsolete' charset name, as either of those should mean that
'unicode' and 'unicodeFFFE' would have to be avoided in standard
compliant documents.

--
Leif H Silli

Ken Whistler

2011-12-20 00:13:26 UTC

Post by Shawn Steele

Post by Leif Halvard Silli
It separates 'unicode' from the trademarked/registered 'UNICODE'.

Not a lawyer, but I don't think the trademark office is going to consider cased versions of the same word sufficiently distinct as to be separate things.

I am also not a lawyer, but on this point wanted to note that "Unicode"
(titlecased) is the actual term used by the Unicode Consortium for the
standards, the name of the consortium, and the trademarked word. The
Unicode Consortium does *not* favor or promote the allcaps spelling
"UNICODE". For details on the trademark and logo usage guidelines, see:

http://www.unicode.org/policies/logo_policy.html

Apparently it is an internet meme that the CAPITALIZED version is
somehow official and refers to the standard and/or the trademark, but
"Unicode" is not any kind of acronym or abbreviation. That said, Shawn's
point is still valid -- it is unlikely that "Unicode", "unicode", and
"UNICODE" would be considered as referring to different things based on
their capitalization (or lack thereof).

I realize this is somewhat peripheral to the discussion about these new
registration proposals, but I just want to make sure that whatever
happens with them doesn't further contribute to confusion about the
identity of "Unicode".

--Ken

Leif Halvard Silli

2011-12-17 22:02:37 UTC

Post by Leif Halvard Silli
Magic vs semantics: I don't attach magic to the casing. But I do
recognize that there nevertheless is semantics attached to the casing -
by others than myself.

...

Post by Martin J. Dürst
If not,

We have a 'if not' situation.

Post by Martin J. Dürst
I would strongly suggest not to use case differences to refer
to different usages of the same label, because this may cause a lot
of confusion. (It already had Shawn confused, and me, too.)

Looking at my registration letter for 'unicode', I think it isn't the
very casing, but the language I use about the casing that is possibly
confusing:

''' NB! Alias: At the time of this registration, the spec upon which
the registration of the 'unicode' and the 'unicodeFFFE' charset is
based, defines 'utf-16' (lowercase) as alias for 'unicode'.[2] '''

If I remove the '(lowercase)', then the above should be clear enough,
no? Also: In the same letter I say that 'utf-16' lowercase *cannot* be
registered as alias for 'unicode' due to the fact that 'UTF-16'
uppercase is already registered as a charset name. So there is material
enough to at least avoid jumping to conclusions ...

Post by Leif Halvard Silli
Thus I've tried to be consistent with the casing
found in the IANA registry (UPPERCASE) and in the Microsoft listing
(lowercase and mixedCASE).

The fact that the IANA registry lists the charset labels with
uppercase characters isn't more than a random convention, and this
may also be so for the Microsoft listings.

The registry says: ''However, no distinction is made between use of
upper and lower case letters.'' I read this to mean that products are
not supposed to make distinctions based on the casing. Hence I assumed
this mailing list would read *me* the same way ... Am I wrong w.r.t
the registry? Does the casing in the registry matter, except for
registry conventions or sorts?

Post by Martin J. Dürst
Please use a single case
version unless case is really significant in the sense that one and
the same product, in one and the same protocol slot, reacts different
to different case forms.

When I quote Microsoft or the IANA registry, I must of course use the
casing used in those documents. But - OK - else: Until I eventually
discover a case where the casing matters, I will try to use a
single casing.

Post by Leif Halvard Silli
* It makes it easier to separate the semantics of the utf-16 alias
in the Microsoft listing from the semantics of the UTF-16 name in
the IANA registry.

For all intents and purposes, these are one and the same charset
label. If you want to distinguish them, please do so with additional
words, not with case.

I did that: I said 'utf-16 *alias*' versus 'UTF-16 *name*' ... But the
casing apparently nullified the effect ...

Post by Leif Halvard Silli
* It separates 'unicode' from the trademarked/registered 'UNICODE'.
* The casing 'unicodeFFFE' is more readable than 'unicodefffe' or
'UNICODEFFFE'.

For these two, you have only used a single casing, so there's no
confusion. So th

Martin J. Dürst

2011-12-18 08:33:40 UTC

Hello Leif,

Post by Leif Halvard Silli
Magic vs semantics: I don't attach magic to the casing. But I do
recognize that there nevertheless is semantics attached to the casing -
by others than myself.

...

Post by Martin J. Dürst
If not,

Post by Leif Halvard Silli
We have a 'if not' situation.

Thanks for the confirmation.

Post by Martin J. Dürst
I would strongly suggest not to use case differences to refer
to different usages of the same label, because this may cause a lot
of confusion. (It already had Shawn confused, and me, too.)

Post by Leif Halvard Silli
Looking at my registration letter for 'unicode', I think it isn't the
very casing, but the language I use about the casing that is possibly
confusing:

''' NB! Alias: At the time of this registration, the spec upon which
the registration of the 'unicode' and the 'unicodeFFFE' charset is
based, defines 'utf-16' (lowercase) as alias for 'unicode'.[2] '''

If I remove the '(lowercase)', then the above should be clear enough,
no? Also: In the same letter I say that 'utf-16' lowercase *cannot* be
registered as alias for 'unicode' due to the fact that 'UTF-16'
uppercase is already registered as a charset name. So there is material
enough to at least avoid jumping to conclusions ...

Can you go over your templates and check for these and similar places
and fix and resend them?

Post by Leif Halvard Silli
Thus I've tried to be consistent with the casing
found in the IANA registry (UPPERCASE) and in the Microsoft listing
(lowercase and mixedCASE).

Post by Martin J. Dürst
The fact that the IANA registry lists the charset labels with
uppercase characters isn't more than a random convention, and this
may also be so for the Microsoft listings.

Post by Leif Halvard Silli
The registry says: ''However, no distinction is made between use of
upper and lower case letters.'' I read this to mean that products are
not supposed to make distinctions based on the casing.

Yes indeed.

Post by Leif Halvard Silli
Hence I assumed
this mailing list would read *me* the same way ...

Well, in general, it would. But you were so consistent in your case
distinctions and were talking about all kinds of edge cases, and that
made at least Shawn and me, and probably others, think that there really
was a case distinction.

Post by Leif Halvard Silli
Am I wrong w.r.t
the registry? Does the casing in the registry matter, except for
registry conventions or sorts?

No, casing doesn't matter for charset labels, neither in the registry,
nor, as Shawn and you have fortunately confirmed, in any implementations
we know of.

Post by Martin J. Dürst
Please use a single case
version unless case is really significant in the sense that one and
the same product, in one and the same protocol slot, reacts different
to different case forms.

Post by Leif Halvard Silli
When I quote Microsoft or the IANA registry, I must of course use the
casing used in those documents.

If you quote a whole sentence, then probably yes. But not when just
quoting a label, or when just using information from these places.

Post by Leif Halvard Silli
But - OK - else: Until I eventually
discover a case where the casing matters, I will try to use the a
single casing.

Great, thanks.

Post by Leif Halvard Silli
* It makes it easier to separate the semantics of the utf-16 alias
in the Microsoft listing from the semantics of the UTF-16 name in
the IANA registry.

Post by Martin J. Dürst
For all intents and purposes, these are one and the same charset
label. If you want to distinguish them, please do so with additional
words, not with case.

Post by Leif Halvard Silli
I did that: I said 'utf-16 *alias*' versus 'UTF-16 *name*' ... But the
casing apparently nullified the effect ...

Yes, and "alias" vs. "name" isn't very direct either. I'd personally use
"UTF-16 according to RFC..." vs. "UTF-16 according to Microsoft" or some
such.

Regards, Martin.

Post by Leif Halvard Silli
* It separates 'unicode' from the trademarked/registered 'UNICODE'.
* The casing 'unicodeFFFE' is more readable than 'unicodefffe' or
'UNICODEFFFE'.

For these two, you have only used a single casing, so there's no
confusion. So these should be fine.

Good.

Leif Halvard Silli

2011-12-18 12:51:57 UTC

Hi Martin,

Looking at my registration letter for 'unicode', I think it isn't the
very casing, but the language I use about the casing that is possibly
''' NB! Alias: At the time of this registration, the spec upon which
the registration of the 'unicode' and the 'unicodeFFFE' charset is
based, defines 'utf-16' (lowercase) as alias for 'unicode'.[2] '''
If I remove the '(lowercase)', then the above should be clear enough,
no? Also: In the same letter I say that 'utf-16' lowercase *cannot* be
registered as alias for 'unicode' due to the fact that 'UTF-16'
uppercase is already registered as a charset name. So there is material
enough to at least avoid jumping to conclusions ...

Can you go over your templates and check for these and similar places
and fix and resend them?

Yes, I will do that.

... snip ...

Post by Martin J. Dürst
Yes, and "alias" vs. "name" isn't very direct either. I'd personally
use "UTF-16 according to RFC..." vs. "UTF-16 according to Microsoft"
or some such.

That's a good w

Leif Halvard Silli

2011-12-20 00:09:50 UTC

Post by Martin J. Dürst
Can you go over your templates and check for these and similar places
and fix and resend them?

Yes, I will do that.

I just resent the 'unicode' registration. The 'unicodeFFFE'
registration will be nearly identical - would you like me to prepare
it ASAP? Or can we discuss 'unicode' first?

I reworked the letter quite a lot: It is shorter, except for the
'additional info' section. As for casing, I think I used lowercase
everywhere, except once where I said 'Unicode' because it began a sentence.

--
Leif H Silli

Doug Ewell

2011-12-19 15:36:10 UTC

I guess I would like to see some sort of table breaking down the various
flavors of UTF-16 and/or UCS-2 that would need to be tagged separately:

* big-endian or little-endian by default
* accepts BOM
* requires BOM
* supports all 17 planes or just BMP
* etc.

That way I would have a clearer sense of what can be currently tagged,
what cannot be tagged and needs to be, and what is just an application
quirk or bug.

It seems Leif might be trying to tag the incomplete or erroneous
behavior of individual applications, even if they don't correspond to
documented behavior, or to tag mis-documented behavior that may not
actually be implemented (like "unicode" meaning "BMP only"). I'm not
sure that's a goal of registering charsets. It also seemed to me -- though
I assume I'm wrong here -- that he was trying to call particular attention
to errors in Microsoft implementations, but I'm sure Shawn and others
can speak to that.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
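
Some of the columns in the table Doug asks for can be probed mechanically. The sketch below does so for Python's own UTF-16 codecs only, as an assumption-laden illustration of the method; `consumes_bom` and `handles_supplementary` are hypothetical helper names, and the results say nothing about Microsoft's 'unicode' or 'unicodeFFFE'.

```python
def consumes_bom(label: str) -> bool:
    """True if the codec strips a leading little-endian BOM instead of decoding it."""
    # Bytes: FF FE (little-endian BOM) followed by 'h' in little-endian order.
    decoded = b"\xff\xfeh\x00".decode(label)
    return decoded == "h"


def handles_supplementary(label: str) -> bool:
    """True if a character outside the BMP survives an encode/decode round trip."""
    ch = "\U0001F600"  # a code point in plane 1, needs a surrogate pair in UTF-16
    return ch.encode(label).decode(label) == ch


for label in ("utf-16", "utf-16-le"):
    print(label, consumes_bom(label), handles_supplementary(label))
```

Running the same kind of probe against the implementation being registered would yield the "accepts BOM" and "all 17 planes or just BMP" cells directly from test data.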

Bjoern Hoehrmann

2011-12-19 21:31:04 UTC

Post by Doug Ewell
I guess I would like to see some sort of table breaking down the various
flavors of UTF-16 and/or UCS-2 that would need to be tagged separately:
* big-endian or little-endian by default
* accepts BOM
* requires BOM
* supports all 17 planes or just BMP
* etc.

I think it would be helpful to start with separating what the encodings
are and what the particular behavior of "HTML implementations" is. The
registry is not really meant to cover the encoding detection rules for
"HTML when served over HTTP" with handling of <meta> elements and such,
it's more for "you have a label and you have bytes, this is how you get
characters", where the definition of the label, and not the data format
tells you how you get the characters.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
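
The "you have a label and you have bytes" model can be shown in a few lines. This is a minimal sketch using Python's built-in codecs; `characters` is a hypothetical function name, and the two labels are picked only because both are registered and interpret the same byte differently.

```python
def characters(label: str, data: bytes) -> str:
    """Map bytes to characters using only the definition behind `label`."""
    return data.decode(label)


data = b"\xe9"  # one byte, no markup, no transport metadata

# The same byte yields different characters depending solely on the label.
print(characters("iso-8859-1", data))
print(characters("koi8-r", data))
```

Nothing about HTTP, <meta> elements or sniffing enters into it: the label's definition alone determines the result, which is the scope Björn describes for the registry.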

Leif Halvard Silli

2011-12-20 00:56:26 UTC

I guess I would like to see some sort of table breaking down the various
* big-endian or little-endian by default
* accepts BOM
* requires BOM
* supports all 17 planes or just BMP
* etc.

I think it would be helpful to start with separating what the encodings
are and what the particular behavior of "HTML implementations" is.

Agreed.

The
registry is not really meant to cover the encoding detection rules for
"HTML when served over HTTP" with handling of <meta> elements and such,
it's more for "you have a label and you have bytes, this is how you get
characters", where the definition of the label, and not the data format
tells you how you get the characters.

Well, the registry is supposed to say whether the label should be seen as
obsolete, of limited use or 'normal'. These judgements are not simply a
question of 'you have label and you have bytes, this is how you get
characters'. The reasons why products *may* need to have some kind of
support for 'unicode' and 'unicodeFFFE' are the same as why they
probably should be considered 'obsolete' or 'of limited use': They
interfere in a negative way with the stability of 'utf-16', 'utf-16le'
and 'utf-16be'. And these negativities need to be reflected somewhere.
It is also in order to get a picture of those issues that I have
focused on what happens if so and so.

Meanwhile, perhaps my new version of the 'unicode' registration looks
better?

--
Leif H Silli

Bjoern Hoehrmann

2011-12-20 02:07:34 UTC

Post by Leif Halvard Silli
Well, the registry is supposed say whether the label should be seen as
obsolete, of limited use or 'normal'. These judgements are not simply a
question of 'you have label and you have bytes, this is how you get
characters'. The reasons why products *may* need to have some kind of
support for 'unicode' and 'unicodeFFFE' are the same as why they
probably should be considered 'obsolete' or 'of limited use': They
interfere in a negative way on the stability of 'utf-16', 'utf-16le'
and 'utf-16be'. And these negativities need to be reflected somewhere.
It also, in order to try to get a picture of those issues that I have
focused on what happens if so and so.

Reasons for why a label is problematic should be part of the registry,
information on how certain browsers handle a certain name in the <meta>
element in the process of detecting the encoding of a HTML document
should not be. Right now I have trouble telling how to implement the
two encodings you would like to register. What I would do is probably
using my http://search.cpan.org/dist/Win32-MultiLanguage/ module to
convert from the encodings to UTF-8 and look at the results, like if
a "BOM" matters, how surrogates are handled, and so on. With test data
you could then say this is how stuff works independently of HTML. If
there are any issues with that, say things are different from how you
handle UTF-16/LE/BE, that would be useful as well.

How HTML implementations might treat the labels, or whether someone may
or not want to implement the encoding, and other things like that, are
secondary and should be looked at when the definition of the encoding,
or perhaps the difficulties in defining the label, are clear.

Post by Leif Halvard Silli
Meanwhile, perhaps my new version of the 'unicode' registration looks
better?

You lost me at

The 'unicode' spec defines 'utf-16' as its alias, but this of
course contradicts with 'utf-16' as defined in the IANA registry.

already. I can't tell for instance whether this would be still true if
the label would be registered as you propose.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Leif Halvard Silli

2011-12-20 04:15:17 UTC

Reasons for why a label is problematic should be part of the registry,
information on how certain browsers handle a certain name in the <meta>
element in the process of detecting the encoding of a HTML document
should not be. Right now I have trouble telling how to implement the
two encodings you would like to register. What I would do is probably
using my http://search.cpan.org/dist/Win32-MultiLanguage/ module to
convert from the encodings to UTF-8 and look at the results, like if
a "BOM" matters, how surrogates are handled, and so on. With test data
you could then say this is how stuff works independently of HTML. If
there are any issues with that, say things are different from how you
handle UTF-16/LE/BE, that would be useful aswell.

If it helps, this is my 'test bed': <http://malform.no/testing/utf/>. But
it isn't ready yet - especially the first column, with KOI8-R. And
there are more tests that could be added. As you can see, I only focus
on HTML and XML. Though I also had a brief look at plain text - it
seemed like at least IE did not accept UTF-16 as plain text.

The reason why there are so many tests - and some more to come - is
exactly because I wanted to check for false positives/negatives etc -
things that aren't what they seem to be. So for instance Webkit seems
to run some encoding sniffing against the XML declaration, both for HTML
and XML.

The weirdest thing I have discovered is that IE sniffs *un-labelled*
UTF-16BE just fine *when served on my computer* but not when served
from the above web site. I tried to check the HTTP headers, but could
not spot any things that should have mattered.

Right now I am not in front of my Windows computer, but it seems that
IE9 in XML mode, is much better at coping with different flavors of
UTF-16 than it is at handling the same in HTML. My suspicion is that
this is primarily due to the nature of XML and not because it doesn't
implement MS 'unicode' and MS 'unicodeFFFE' in XML mode.

Speaking about XML, there is the issue that an XML parser has got
to *know* the encoding label, or else it is supposed to be a fatal
error. So, for instance, xmllint spits out a fatal error when faced with
<?xml version='1.0' encoding='unicode' ?> - but web browsers do not do that.
But Firefox does react if it comes via HTTP's charset parameter.

How HTML implementations might treat the labels, or whether somone may
or not want to implement the encoding, and other things like that, are
secondary and should be looked at when the definition of the encoding,
or perhaps the difficulties in defining the label, are clear.

I guess that makes sense, yes.

Post by Leif Halvard Silli
Meanwhile, perhaps my new version of the 'unicode' registration looks
better?

You lost me at
The 'unicode' spec defines 'utf-16' as its alias, but this of
course contradicts with 'utf-16' as defined in the IANA registry.
already. I can't tell for instance whether this would be still true if
the label would be registered as you propose.

You have a point there. In the first iteration, I answered with a firm 'no
aliases'. Nevertheless, 'utf-16' is seen as an alias by the mentioned
browsers - and perhaps even by HTML5? So I agree I must add back the
firm 'no aliases'.

Other reactions?

--
Leif H Silli

Ira McDonald

2011-12-20 02:11:15 UTC

Hi,

I appreciate Leif's efforts - but...

I'm very uncomfortable about registering (in whatever case)
'unicode' as a deprecated limited use charset name for some
flavor of UTF-16 (just the BMP?).

This has great potential to add confusion to an already confused
situation, IMHO.

It's audibly (in a screen reader) indistinguishable from the *real*
Unicode - technically aligned w/ ISO 10646 and the name for the
whole enchilada.

It's too bad that users don't know what UTF-8 means and that it
is NOT just another alternative to UTF-16 - but in fact the strongly
preferred alternative when handling message catalogs, streams
of text (marked up or not), and generally IETF and other protocol
elements.

Cheers,
- Ira

Ira McDonald (Musician / Software Architect)
Chair - Linux Foundation Open Printing WG
Secretary - IEEE-ISTO Printer Working Group
Co-Chair - IEEE-ISTO PWG IPP WG
Co-Chair - TCG Trusted Mobility Solutions WG
Chair - TCG Embedded Systems Hardcopy SG
IETF Designated Expert - IPP & Printer MIB
Blue Roof Music/High North Inc
http://sites.google.com/site/blueroofmusic
http://sites.google.com/site/highnorthinc
mailto:***@gmail.com
Winter 579 Park Place Saline, MI 48176 734-944-0094
Summer PO Box 221 Grand Marais, MI 49839 906-494-2434

On Mon, Dec 19, 2011 at 7:56 PM, Leif Halvard Silli <

I guess I would like to see some sort of table breaking down the various
* big-endian or little-endian by default
* accepts BOM
* requires BOM
* supports all 17 planes or just BMP
* etc.

I think it would be helpful to start with separating what the encodings
are and what the particular behavior of "HTML implementations" is.

Agreed.

Well, the registry is supposed to say whether the label should be seen as
obsolete, of limited use or 'normal'. These judgements are not simply a
question of 'you have label and you have bytes, this is how you get
characters'. The reasons why products *may* need to have some kind of
support for 'unicode' and 'unicodeFFFE' are the same as why they
probably should be considered 'obsolete' or 'of limited use': They
interfere in a negative way with the stability of 'utf-16', 'utf-16le'
and 'utf-16be'. And these negative effects need to be reflected somewhere.
It is also in order to get a picture of those issues that I have
focused on what happens if so and so.
Meanwhile, perhaps my new version of the 'unicode' registration looks
better?
--
Leif H Silli

Bjoern Hoehrmann

2011-12-20 02:20:12 UTC

Post by Ira McDonald
I'm very uncomfortable about registering (in whatever case)
'unicode' as a deprecated limited use charset name for some
flavor of UTF-16 (just the BMP?).
This has great potential to add confusion to an already confused
situation, IMHO.

Would you rather say nothing about 'unicode' than saying "This was a
very bad idea, don't do anything like it ever again" in the registry?
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Ira McDonald

2011-12-20 02:53:16 UTC

Hi,

No - you make a compelling point, Bjoern.

Leif - I'd like to see every use of 'unicode' in that registration
which means "the Microsoft specification or usage of 'unicode'
as a charset tag" be explicitly rewritten to instead say, e.g.,
"the MS 'unicode' charset" to very clearly distinguish it from
the Unicode/ISO10646 character set and UTF-x encodings.

When I read your current text, the sense of "which unicode"
wanders from paragraph to paragraph (for me).

Cheers,
- Ira


Post by Bjoern Hoehrmann

Would you rather say nothing about 'unicode' than saying "This was a
very bad idea, don't do anything like it ever again" in the registry?
--
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Leif Halvard Silli

2011-12-20 03:56:36 UTC

Post by Ira McDonald
Leif - I'd like to see every use of 'unicode' in that registration
which means "the Microsoft specification or usage of 'unicode'
as a charset tag" be explicitly rewritten to instead say, e.g.,
"the MS 'unicode' charset" to very clearly distinguish it from
the Unicode/ISO10646 character set and UTF-x encodings.
When I read your current text, the sense of "which unicode"
wanders from paragraph to paragraph (for me).

That sounds like a very good point. For next iteration. Thanks!

As for your point about the MS 'unicode' charset being limited to the BMP:
we then - eventually - have the same situation as with, for instance,
US-ASCII, which on the Web is treated like Windows-1252. Or the same
issue we had in XML 1.0 when it limited certain features to a version
of Unicode which did not contain Burmese or Cambodian letters: It isn't
in tune with the facts - at least not in all contexts.

Or perhaps Microsoft will update their spec *now*? Why not?

--
leif h silli

Martin J. Dürst

2011-12-20 05:23:42 UTC

Post by Leif Halvard Silli
As for your point about the MS 'unicode' charset being limited to the BMP:
we then - eventually - have the same situation as with, for instance,
US-ASCII, which on the Web is treated like Windows-1252. Or the same
issue we had in XML 1.0 when it limited certain features to a version
of Unicode which did not contain Burmese or Cambodian letters: It isn't
in tune with the facts - at least not in all contexts.
Or perhaps Microsoft will update their spec *now*? Why not?

If it differs from reality, it would be a good thing to get it updated.
However, in the registration, I'd try to describe reality, and not get
too excited about what the documents say.

I hope that we can get confirmations from Microsoft as to what actually
is implemented.

Regards, Martin.

Leif Halvard Silli

2011-12-20 03:46:35 UTC

Post by Doug Ewell
It seems Leif might be trying to tag the incomplete or erroneous
behavior of individual applications, even if they don't correspond to
documented behavior, or to tag mis-documented behavior that may not
actually be implemented (like "unicode" meaning "BMP only").

* BMP: The only reason the registration says 'BMP' is that the written
spec says so and that the registration template asked for such data.

* Products: References to products are made in order to document that
the 'unicode'/'unicodeFFFE' specs actually are implemented. In that
regard, the possible 'BMP'-incorrectness seems far less important
w.r.t. practical 'real' problems than the endianness issues.

* Actually implemented: It is a fact that 'unicode' and 'utf-16' (in the
Microsoft spec) are names for little-endian UTF-16, while 'unicodeFFFE'
is the name for big-endian UTF-16. To verify, try the following web page
in Chrome, Safari or IE - the clue being that the page is
'utf-16be'-encoded while HTTP says 'utf-16':
http://malform.no/testing/utf/html/16be/http.utf16
For reference, an identical, but little-endian encoded page:
http://malform.no/testing/utf/html/16le/http.utf16
If IE and Safari/Chrome implemented the official UTF-16
specification, the first page should have worked fine, while the latter
perhaps did not need to work. Instead, we see the opposite: The first
page fails in the mentioned browsers.

* 'Actually implemented' has reached Web standards: HTML5 specifies:
«The requirement to default UTF-16 to little-endian rather than
big-endian is a willful violation of RFC 2781, motivated by a desire
for compatibility with legacy content. [RFC2781]»
<http://dev.w3.org/html5/spec/parsing.html#character-encodings-0>
Whether it is 'legacy content' - as HTML5 claims - or implementation
of the Microsoft spec - or both things - that makes HTML5 say this, is
perhaps an open question.
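
To make the endianness point above concrete, here is a minimal Python sketch (not part of the original message). It decodes BOM-less big-endian bytes twice: once with the big-endian default that RFC 2781 recommends for text labelled 'utf-16', and once with the little-endian default that HTML5 describes. The helper name decode_utf16, its 'default' parameter and the sample markup are assumptions made for illustration only; this is not how any of the mentioned browsers is implemented.

```python
import codecs


def decode_utf16(data: bytes, default: str = "be") -> str:
    """Decode UTF-16 bytes: honor a BOM if present, else use `default`.

    `default` is a hypothetical knob: "be" mirrors the RFC 2781
    recommendation, "le" mirrors the little-endian default quoted
    from HTML5 above.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    codec = "utf-16-be" if default == "be" else "utf-16-le"
    return data.decode(codec)


# A big-endian page served without a BOM (sample content is made up).
page = "<p>hi</p>".encode("utf-16-be")

rfc_reading = decode_utf16(page, default="be")     # readable markup
legacy_reading = decode_utf16(page, default="le")  # byte-swapped characters

print(repr(rfc_reading))
print(repr(legacy_reading))
```

With a BOM at the start of the data, both defaults give the same result, which is why the BOM-less case is the one where the two readings of 'utf-16' diverge.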

Post by Doug Ewell
I'm not sure that's a goal of registering charsets.

The goal of these registrations is to comply with section 2.5. In
particular, this seemed relevant: «the use of a large number of
undocumented and/or unlabeled charsets hampers interoperability even
more.»
<http://tools.ietf.org/html/bcp19#section-2.5>

Post by Doug Ewell
It also seemed to
me—though I assume I'm wrong here—that he was trying to call
particular attention to errors in Microsoft implementations, but I'm
sure Shawn and others can speak to that.

It is not only products of Microsoft: Webkit is backed by Apple,
Google, HTML5 ...

But with Microsoft's positive attitude toward Unicode, including UTF-16, it
seems reasonable to ask: Is it certain that Microsoft - and the
community at large - is aware of how they operate with a shadow spec
that contradicts UTF-16 - and the impacts of this? Perhaps, with a
little attention to this, they will update or fine

Shawn Steele

2011-12-21 02:15:46 UTC

FWIW: I'm on vacation, so I'll probably forget to check in on this thread ;-). However:

* I don't mind clarifying the name(s) being used.
* Changing names will break a ton of stuff probably, so anything you see that seems odd is probably "stuck" that way :(
* So, it'd be nice if any document noted cases where things have been historically different, e.g. "AAAAA means XXXXX, but sometimes people have done YYYY with it", or "behavior ZZZZ should be named BBBB, but sometimes people call it CCCC"

Doug's suggestion of a table sounds good, but I suspect some columns may end up with multiple meanings. However, maybe we could then easily see where stuff is unambiguous?

-Shawn

http://blogs.msdn.com/shawnste

________________________________________
From: Doug Ewell [***@ewellic.org]
Sent: Monday, December 19, 2011 7:36 AM
To: ietf-***@iana.org
Subject: Re: Are charset names supposed to be case sensitive?

I guess I would like to see some sort of table breaking down the various
flavors of UTF-16 and/or UCS-2 that would need to be tagged separately:

* big-endian or little-endian by default
* accepts BOM
* requires BOM
* supports all 17 planes or just BMP
* etc.

That way I would have a clearer sense of what can be currently tagged,
what cannot be tagged and needs to be, and what is just an application
quirk or bug.
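
As a rough sketch of what such a table could look like as data (added for illustration, not part of the original message), the Python below models one row per label. The rows for the three IANA labels follow RFC 2781 as far as I understand it; the rows for the Microsoft labels record only the byte order claimed elsewhere in this thread, and None marks everything the thread leaves open. The class name, field names and helper functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Utf16Flavor:
    label: str
    default_order: str           # "be" or "le": byte order used when no BOM is present
    accepts_bom: Optional[bool]  # None = not settled in this discussion
    requires_bom: Optional[bool]
    all_planes: Optional[bool]   # False would mean "BMP only"


FLAVORS: List[Utf16Flavor] = [
    # IANA labels, per RFC 2781 (BE/LE labels have a fixed order and no BOM).
    Utf16Flavor("UTF-16", "be", True, False, True),
    Utf16Flavor("UTF-16BE", "be", False, False, True),
    Utf16Flavor("UTF-16LE", "le", False, False, True),
    # Microsoft labels: byte order as claimed in this thread; the rest is open.
    Utf16Flavor("unicode", "le", None, None, None),
    Utf16Flavor("unicodeFFFE", "be", None, None, None),
]


def by_default_order(flavors: List[Utf16Flavor]) -> Dict[str, List[str]]:
    """Group labels by the byte order they use without a BOM."""
    groups: Dict[str, List[str]] = {}
    for flavor in flavors:
        groups.setdefault(flavor.default_order, []).append(flavor.label)
    return groups


def open_questions(flavors: List[Utf16Flavor]) -> List[str]:
    """List labels that still have at least one unsettled column."""
    return [
        f.label
        for f in flavors
        if None in (f.accepts_bom, f.requires_bom, f.all_planes)
    ]


print(by_default_order(FLAVORS))
print(open_questions(FLAVORS))
```

Keeping "unknown" as an explicit value, rather than guessing, makes it easy to see which cells need confirmation from implementers before anything is registered.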

It seems Leif might be trying to tag the incomplete or erroneous
behavior of individual applications, even if they don't correspond to
documented behavior, or to tag mis-documented behavior that may not
actually be implemented (like "unicode" meaning "BMP only"). I'm not
sure that's a goal of registering charsets. It also seemed to me—though
I assume I'm wrong here—that he was trying to call particular attention
to errors in Microsoft implementations, but I'm sure Shawn and others
can speak to that.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwe

Bjoern Hoehrmann

2011-12-15 18:58:55 UTC