Discussion:
Encoding Standard (mostly complete)
Anne van Kesteren
2012-04-17 09:38:47 UTC
Permalink
Hi,

Apart from big5, all encoders and decoders are now defined.

http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

Feedback is much appreciated.

(big5 is somewhat complicated unfortunately. See
http://annevankesteren.nl/2012/04/big5 for more details.)

Kind regards,
--
Anne van Kesteren
http://annevankesteren.nl/
Shawn Steele
2012-04-17 18:09:31 UTC
Permalink
I'm a little confused about what the purpose of the document is?
Doug Ewell
2012-04-17 18:20:34 UTC
Permalink
Post by Shawn Steele
I'm a little confused about what the purpose of the document is?
I assume it was intended to document the encodings deemed permissible in
HTML5, which I guess is supposed to be synonymous with "the web
platform."

I was surprised by some of the choices of "permissible," such as
including ibm864 and ibm866 but none of the other, much more widespread,
legacy OEM code pages. I was also puzzled by the reference to utf-16 and
utf-16be as "legacy" encodings.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Shawn Steele
2012-04-17 18:31:47 UTC
Permalink
Shouldn't the W3C be pointing to the charset registry then? Also is this doc on some sort of standards track?

FWIW: It'd be nice if like in section 0 it said "Encodings are scary, use UTF-8 because the rest are implemented inconsistently across platforms".

-Shawn
Anne van Kesteren
2012-04-18 06:49:31 UTC
Permalink
On Tue, 17 Apr 2012 20:31:47 +0200, Shawn Steele
Post by Shawn Steele
Shouldn't the W3C be pointing to the charset registry then? Also is
this doc on some sort of standards track?
The charset registry is woefully inadequate and as you know is a far cry
from how encodings are actually implemented and what labels they have in
practice. Not to mention that it is unbounded and lists many encodings
that would be bad to support.

As for standards track, the W3C might standardize this in due course.
Post by Shawn Steele
FWIW: It'd be nice if like in section 0 it said "Encodings are scary,
use UTF-8 because the rest are implemented inconsistently across
platforms".
I added a note to the Preface section, reading "Note: This standard is
primarily intended for dealing with legacy content, it requires new
content and formats to use the utf-8 encoding exclusively."
--
Anne van Kesteren
http://annevankesteren.nl/
Shawn Steele
2012-04-18 20:31:27 UTC
Permalink
Post by Anne van Kesteren
The charset registry is woefully inadequate and as you know is a far cry from how encodings
are actually implemented and what labels they have in practice. Not to mention that it is
unbounded and lists many encodings that would be bad to support.
Well, maybe point to the defined ones then that you want to support in HTML. We can't change our behavior, so if your document happens to diverge, it'll introduce additional confusion. Current cross-vendor/platform implementations already vary, but those ways should be well understood by people who hit those issues.
Anne van Kesteren
2012-04-18 06:41:03 UTC
Permalink
Post by Doug Ewell
Post by Shawn Steele
I'm a little confused about what the purpose of the document is?
I assume it was intended to document the encodings deemed permissible in
HTML5, which I guess is supposed to be synonymous with "the web
platform."
More or less, yes. Encodings to be used by HTML, CSS, browser
implementations of XML, etc. As I explained before on this mailing list
http://mail.apps.ietf.org/ietf/charsets/msg02027.html the idea is to:

* Make the encodings that can be supported a finite list
* Carefully define the labels for these encodings
* Carefully define the algorithms to implement these encodings
** Including error and end-of-file handling
* Carefully define the indexes for these encodings, including any poorly
documented extensions

The idea is to make the web platform completely predictable with respect
to encodings rather than the morass it is now. This should help existing
implementations compete more effectively as well as help new
implementations enter the market more easily without significant reverse
engineering costs.
Post by Doug Ewell
I was surprised by some of the choices of "permissible," such as
including ibm864 and ibm866 but none of the other, much more widespread,
legacy OEM code pages. I was also puzzled by the reference to utf-16 and
utf-16be as "legacy" encodings.
I'm not quite sure if ibm864 and ibm866 should stay, they are not
universally supported but four out of five user agents have them if I
remember correctly. The list of encodings is based roughly on the
intersection of what browsers support. If I missed an encoding that is
actually "widely" used on pages it would be good to add it of course. My
assumption has been that if only one browser supports the encoding it is
probably not or not widely used.

I classified utf-16 as legacy because of its many gotchas and because most
web technology works entirely with utf-8 or does not work with utf-16.
E.g. form submission does not do utf-16, XMLHttpRequest only sends utf-8
encoded strings, several new formats are utf-8 only.
--
Anne van Kesteren
http://annevankesteren.nl/
Shawn Steele
2012-04-18 20:25:07 UTC
Permalink
* Carefully define the algorithms to implement these encodings

Encodings are really icky and inconsistent. Additional definitions are likely to only cause further divergence of actual implementations from each other. IMO it would be better to point to the already defined definitions than trying to do it all again. In fact that's what the charset registry was intended to help with?

I'd also really push UTF-8, and really discourage many of the encodings. If you really want to say which ones should be legal for HTML, I'd pick the smallest useful subset of encodings, rather than superset of all of the encodings that happen to be supported by any browser.

-Shawn
Anne van Kesteren
2012-04-19 08:01:27 UTC
Permalink
On Wed, 18 Apr 2012 22:25:07 +0200, Shawn Steele
Post by Anne van Kesteren
* Carefully define the algorithms to implement these encodings
Encodings are really icky and inconsistent. Additional definitions are
likely to only cause further divergence of actual implementations from
each other. IMO it would be better to point to the already defined
definitions than trying to do it all again. In fact that's what the
charset registry was intended to help with?
My experience is that by defining a feature in detail and writing a test
suite implementations will converge over time. E.g. it was once
controversial that HTML parsing could be defined and implemented in the
same manner across browsers. (HTML parsers were really icky and
inconsistent, and a lot more complicated than decoder/encoder algorithms
if you look at their interaction with script execution.)

I can maybe explain why http://www.iana.org/assignments/character-sets
does not help implementors.

This is its entry for shift_jis:

===
Name: Shift_JIS (preferred MIME name)
MIBenum: 17
Source: This charset is an extension of csHalfWidthKatakana by
adding graphic characters in JIS X 0208. The CCS's are
JIS X0201:1997 and JIS X0208:1997. The
complete definition is shown in Appendix 1 of JIS
X0208:1997.
This charset can be used for the top-level media type "text".
Alias: MS_Kanji
Alias: csShiftJIS
===

This does not tell you the other labels you need to recognize, such as
"shift-jis" or "x-sjis". It references an extremely old document that does
not detail error handling end-of-file handling or a clear mapping to
Unicode or their relation to other Japanese encodings. It does not detail
the extensions to shift_jis made by Microsoft that you need to implement
in order to work with sites. It indeed misses all the critical details.

Entries for euc-kr, gb_2312-80, ... are similarly not helpful. euc-kr does
not mention you need to support Unified Hangul Code as Internet Explorer
does in order to work with Korean content and gb_2312-80 does not mention
you should really use your gbk decoder/encoder instead.

The registry is an interesting collection of entries, but does not help
implementors.
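
To make the label point concrete, here is a rough, unofficial sketch (in Python) of the kind of finite label-to-encoding table the Encoding Standard pins down. The label set below is an assumption drawn only from the labels and aliases mentioned in this thread; the spec's actual table is larger and authoritative.

===
# Hypothetical fragment of a label-to-encoding table; the real Encoding
# Standard enumerates many more labels per encoding.
LABELS = {
    # shift_jis, including labels the IANA entry never mentions
    "shift_jis": "shift_jis",
    "shift-jis": "shift_jis",
    "x-sjis": "shift_jis",
    "ms_kanji": "shift_jis",
    "csshiftjis": "shift_jis",
    # gb_2312-80/gb2312 content is decoded with the gbk decoder in practice
    "gb2312": "gbk",
    "gb_2312-80": "gbk",
    "gbk": "gbk",
    # euc-kr content needs the Unified Hangul Code repertoire to work
    "euc-kr": "euc-kr",
}

def get_encoding(label):
    """Strip ASCII whitespace, lowercase, then look the label up."""
    return LABELS.get(label.strip("\t\n\f\r ").lower())

print(get_encoding("  X-SJIS "))   # shift_jis
print(get_encoding("GB_2312-80"))  # gbk
print(get_encoding("utf8"))        # None in this fragment
===

The point of a finite, shared table like this is that every implementation resolves the same label to the same decoder, which the registry's open-ended prose entries cannot guarantee.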
Post by Anne van Kesteren
I'd also really push UTF-8, and really discourage many of the
encodings. If you really want to say which ones should be legal for
HTML, I'd pick the smallest useful subset of encodings, rather than
superset of all of the encodings that happen to be supported by any
browser.
The Encoding Standard definitely does not define a superset.


From another reply:

On Wed, 18 Apr 2012 22:31:27 +0200, Shawn Steele
Post by Anne van Kesteren
Well, maybe point to the defined ones then that you want to support in
HTML. We can't change our behavior, so if your document happens to
diverge, it'll introduce additional confusion. Current
cross-vendor/platform implementations already vary, but those ways
should be well understood by people who hit those issues.
No they are not well understood. I do not know about Internet Explorer,
but browsers other than Internet Explorer continue to hit compatibility
issues in this part of their code and continue to make changes because of
it, without clear guidance thus far as what the end goal ought to be and
what everyone else is aiming for.
--
Anne van Kesteren
http://annevankesteren.nl/
Shawn Steele
2012-04-19 18:00:53 UTC
Permalink
Post by Anne van Kesteren
My experience is that by defining a feature in detail and writing a test
suite implementations will converge over time.
Our implementation of encodings WILL NOT change. Ever. I also don't have the resources to validate that your standard matches our behavior.

I'm not at all trying to say that our implementations are "perfect". (On the contrary, I've blogged quite often about how there are lots of variations in the wild. We even have slight differences in our various SDKs.) However, there are millions of our customers that depend on our current behavior. If that behavior changes even slightly, then that will "corrupt" their data.

Our handling of shift_jis is probably the most severe of those, where the standard has evolved beyond what we support, but we can't change without breaking people.
Post by Anne van Kesteren
Entries for euc-kr, gb_2312-80, ... are similarly not helpful. euc-kr does
not mention you need to support Unified Hangul Code as Internet Explorer
does in order to work with Korean content and gb_2312-80 does not mention
you should really use your gbk decoder/encoder instead.
Use Unicode.
Post by Anne van Kesteren
No they are not well understood. I do not know about Internet Explorer,
but browsers other than Internet Explorer continue to hit compatibility
issues in this part of their code and continue to make changes because of
it, without clear guidance thus far as what the end goal ought to be and
what everyone else is aiming for.
Use Unicode. Even if you figure out exactly what every browser is doing, you still have no idea what browser/version the page was targeting. Even if you created a perfect version of the ABC encoding (placeholder for your favorite encoding), and convinced all of the browsers to adopt the perfect ABC encoding, you'll continue to have encoding problems because there are millions of pages implemented with the existing variations of ABC encoding. If you want to convince them to update their pages to the "correct" ABC, then it'd be far better to get them to move to UTF-8. For that matter, getting them to correctly tag their existing data would solve most of the most egregious problems.

IMO I would MUCH rather see this much effort put into encouraging Unicode, than to pin down the existing rat's nest and accidentally encourage people to continue with the bad practice of using encodings.
Anne van Kesteren
2012-04-20 06:47:54 UTC
Permalink
On Thu, 19 Apr 2012 20:00:53 +0200, Shawn Steele
Post by Shawn Steele
Post by Anne van Kesteren
Entries for euc-kr, gb_2312-80, ... are similarly not helpful. euc-kr does
not mention you need to support Unified Hangul Code as Internet Explorer
does in order to work with Korean content and gb_2312-80 does not mention
you should really use your gbk decoder/encoder instead.
Use Unicode.
You keep saying in this context of me trying to explain why a browser
handling *legacy* pages has a hard time knowing what to implement. It is
starting to get annoying. If those pages used Unicode we would not
continue to get bug reports.
Post by Shawn Steele
Post by Anne van Kesteren
No they are not well understood. I do not know about Internet Explorer,
but browsers other than Internet Explorer continue to hit compatibility
issues in this part of their code and continue to make changes because
of it, without clear guidance thus far as what the end goal ought to be
and
what everyone else is aiming for.
Use Unicode. Even if you figure out exactly what every browser is
doing, you still have no idea what browser/version the page was
targeting. Even if you created a perfect version of the ABC encoding
(placeholder for your favorite encoding), and convinced all of the
browsers to adopt the perfect ABC encoding, you'll continue to have
encoding problems because there are millions of pages implemented with
the existing variations of ABC encoding.
Yes, that is why we perform content analysis to figure out what the best
way to decode data would be. See e.g.
http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/
Post by Shawn Steele
If you want to convince them to update their pages to the "correct" ABC,
then it'd be far better to get them to move to UTF-8. For that matter,
getting them to correctly tag their existing data would solve most of
the most egregious problems.
The assumption is that neither of those is going to happen for data we
still want to read in say a hundred years time.
Post by Shawn Steele
IMO I would MUCH rather see this much effort put into encouraging
Unicode, than to pin down the existing rat's nest and accidentally
encourage people to continue with the bad practice of using encodings.
This effort is not aimed at content authors.

Speaking of which, I've been a tireless advocate of utf-8 since before I
knew how it worked. I wrote e.g.

http://annevankesteren.nl/2004/06/utf-8
http://annevankesteren.nl/2009/09/utf-8-reasons

And last night while you wrote your email I presented on the topic at a
local developer meetup:

http://annevankesteren.nl/presentations/1F4A9.html

This is not about that. This is about handling existing *legacy* content
that is unlikely to change.
--
Anne van Kesteren
http://annevankesteren.nl/
Shawn Steele
2012-04-20 18:05:55 UTC
Permalink
You keep saying in this context of me trying to explain why a browser handling *legacy*
pages has a hard time knowing what to implement. It is starting to get annoying.
If those pages used Unicode we would not continue to get bug reports.
Exactly. Opera isn't the only one that gets those bug reports. I get bug reports too.

User A visits web site W that was written to match JIS's latest and greatest EXACTLY. That fails on IE because we don't know everything about JIS.
User B visits web site X that was written based on an older Linux interpretation. That fails on IE because the Linux version differed from ours.
User C visits web site Y that wasn't tagged perfectly with an exact variation, but it succeeds on IE because it happens to be our behavior.
User D visits web site Z that is tagged perfectly, but it fails in IE because we don't recognize the name. The developer didn't bother with IE so didn't realize the problem.

These are all mutually exclusive. If I "fix" user A, then I risk breaking the rest, etc. If I break user C & web site Y, they get really mad because they used to work, even though it wasn't "right." The only way to resolve this is to update all (or at least many) of the documents (which will never happen). The only Encoding which is (reasonably) consistent across Windows/.Net/Linux/Unix/OS X/IE/notepad/Office/Firefox/Chrome/Opera/etc. is Unicode. (Even then, there are still oddities with the PUA and stuff, but at least it constrains the problem).

We support operating systems for many years, and data that originated much older than that. I have a billion machines in use, and a lot of those aren't going to upgrade until they die. I have no idea how to figure out how many documents there are. If I touch a code page that breaks those documents and those machines, then I get lots of colorful feedback to broaden my horizons.

I've tried to help update the IETF lists to point to our current behaviors in the interest of compatibility, and I think all of the code page data is published.
The assumption is that neither of those is going to happen for data we still want to read in say a hundred years time.
This is not about that. This is about handling existing *legacy* content that is unlikely to change.
But if that content isn't tagged with variation 123 of ABC standard, how can you read it? Like all those files generated by notepad? The author undoubtedly was able to read the document on her machine with her configuration. It's when that file crosses machines that things most often get confused, and we don't have the precision to distinguish between them.

IMO trying to guess variations heuristically makes it worse if you want cross-everything compatibility, because everyone's heuristics differ. (Indeed we had lots of push-back about IE's old code page autodetection/autocorrection behavior).

-S
Masatoshi Kimura
2012-04-21 15:22:08 UTC
Permalink
Post by Shawn Steele
Our implementation of encodings WILL NOT change. Ever.
Actually MS11-057 changed the IE decoder [1].
Post by Shawn Steele
However, there are millions of our customers that depend on our
current behavior. If that behavior changes even slightly, then
that will "corrupt" their data.
Indeed, MS10-090 broke some ISO-2022-JP encoded Web pages [2]. Please
make your action consistent with your words.
Post by Shawn Steele
Use Unicode.
Even if we ignored non-Unicode encodings, we will need documentation
about UTF-16 anyway due to IE quirks (BOM overrides everything, default
endian is little-endian contrary to the Unicode Standard, etc).
Otherwise other browser implementors will have to (and did)
reverse-engineer to develop a competing browser. Obviously TUS is
useless about IE quirks.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=690225#c14
[2] http://support.microsoft.com/?kbid=2467659
Doug Ewell
2012-04-21 21:57:07 UTC
Permalink
Post by Masatoshi Kimura
Post by Shawn Steele
Our implementation of encodings WILL NOT change. Ever.
Actually MS11-057 changed the IE decoder [1].
It looks like this was a bug fix to correct bad error handling of
invalid data that could enable a cross-site scripting attack. Is this
undesirable?
Post by Masatoshi Kimura
Post by Shawn Steele
However, there are millions of our customers that depend on our
current behavior. If that behavior changes even slightly, then
that will "corrupt" their data.
Indeed, MS10-090 broke some ISO-2022-JP encoded Web pages [2]. Please
make your action consistent with your words.
MS10-090 seems to be about auto-detection, not encoding behavior.
Auto-detection becomes relevant when pages are not properly tagged.
Post by Masatoshi Kimura
Post by Shawn Steele
Use Unicode.
Even if we ignored non-Unicode encodings, we will need documentation
about UTF-16 anyway due to IE quirks (BOM overrides everything,
default endian is little-endian contrary to the Unicode Standard,
etc). Otherwise other browser implementors will have to (and did)
reverse-engineer to develop a competing browser. Obviously TUS is
useless about IE quirks.
Is that the goal here—to help browser implementers compete with
Microsoft?

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Anne van Kesteren
2012-04-23 07:10:52 UTC
Permalink
Post by Doug Ewell
Is that the goal here—to help browser implementers compete with
Microsoft?
In brief, the goal is to document the "web platform". So that existing
implementations and new implementations have no trouble figuring out what
to implement. In part that requires figuring out what existing content
relies upon, which can be determined to some extent by figuring out what
the market leader is doing.
--
Anne van Kesteren
http://annevankesteren.nl/
Julian Reschke
2012-04-23 09:06:12 UTC
Permalink
Post by Anne van Kesteren
Post by Doug Ewell
Is that the goal here—to help browser implementers compete with
Microsoft?
In brief, the goal is to document the "web platform". So that existing
implementations and new implementations have no trouble figuring out
what to implement. In part that requires figuring out what existing
content relies upon, which can be determined to some extent by figuring
out what the market leader is doing.
But then, Microsoft isn't the market leader anymore, right?

Given the combined market share of Chrome and Firefox, maybe it's not
needed anymore to do exactly the same?

Best regards, Julian
Anne van Kesteren
2012-04-23 10:07:11 UTC
Permalink
Post by Julian Reschke
Post by Anne van Kesteren
In brief, the goal is to document the "web platform". So that existing
implementations and new implementations have no trouble figuring out
what to implement. In part that requires figuring out what existing
content relies upon, which can be determined to some extent by figuring out
what the market leader is doing.
But then, Microsoft isn't the market leader anymore, right?
That depends on the market. E.g. in Taiwan they still seem to be.
Post by Julian Reschke
Given the combined market share of Chrome and Firefox, maybe it's not
needed anymore to do exactly the same?
The Encoding Standard is not codifying Internet Explorer at the moment.
It's a balance between compatibility with deployed content, existing
browsers, avoiding PUA, and XSS concerns. (We might have to cave on
avoiding PUA, we'll see.)
--
Anne van Kesteren
http://annevankesteren.nl/
Julian Reschke
2012-04-23 11:57:03 UTC
Permalink
On Mon, 23 Apr 2012 11:06:12 +0200, Julian Reschke
Post by Julian Reschke
Post by Anne van Kesteren
In brief, the goal is to document the "web platform". So that existing
implementations and new implementations have no trouble figuring out
what to implement. In part that requires figuring out what existing
content relies upon, which can be determined to some extent by figuring out
what the market leader is doing.
But then, Microsoft isn't the market leader anymore, right?
That depends on the market. E.g. in Taiwan they still seem to be.
Post by Julian Reschke
Given the combined market share of Chrome and Firefox, maybe it's not
needed anymore to do exactly the same?
The Encoding Standard is not codifying Internet Explorer at the moment.
It's a balance between compatibility with deployed content, existing
browsers, avoiding PUA, and XSS concerns. (We might have to cave on
avoiding PUA, we'll see.)
Ack.

PUA?
Anne van Kesteren
2012-04-23 12:05:10 UTC
Permalink
PUA?
Private Use Area of Unicode. In some browsers certain byte sequences in
legacy encodings will yield PUA code points (rather than an assigned code
point or U+FFFD). Combined with certain fonts they can end up as a
"meaningful glyph".

Thus far only
http://dvcs.w3.org/hg/encoding/raw-file/tip/index-macintosh.txt contains
<Private Use> (Apple's logo), but it has been suggested for CJK indexes
(and indeed typically Gecko/WebKit/Trident have such extensions).
--
Anne van Kesteren
http://annevankesteren.nl/
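To make the PUA example concrete, here is a minimal, illustrative sketch (in Python); the single index value shown is the commonly cited Mac OS Roman mapping of the Apple-logo byte and should be checked against index-macintosh.txt rather than taken as spec text.

===
# Illustrative one-entry slice of a "macintosh" decoder index: byte 0xF0 is
# the Apple logo, which has no assigned Unicode character and is commonly
# mapped to Private Use Area code point U+F8FF. (Assumption; verify against
# index-macintosh.txt.)
MAC_INDEX = {0xF0: 0xF8FF}

def decode_macintosh_byte(byte):
    if byte < 0x80:
        return chr(byte)               # ASCII range passes through
    code_point = MAC_INDEX.get(byte)   # a full index covers 0x80-0xFF
    return chr(code_point) if code_point is not None else "\ufffd"

print(hex(ord(decode_macintosh_byte(0xF0))))  # 0xf8ff, a PUA code point
===

With a font that maps U+F8FF to the Apple logo, the byte renders as a "meaningful glyph"; without one, the code point means nothing to anyone else, which is exactly the interoperability trade-off being weighed.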
Shawn Steele
2012-04-23 18:24:20 UTC
Permalink
Post by Anne van Kesteren
Post by Julian Reschke
Given the combined market share of Chrome and Firefox, maybe it's not
needed anymore to do exactly the same?
The Encoding Standard is not codifying Internet Explorer at the moment.
It's a balance between compatibility with deployed content, existing browsers, avoiding
PUA, and XSS concerns. (We might have to cave on avoiding PUA, we'll see.)
As noted, FYI, IE is likely not to change, particularly if that'd require supporting code pages or variations that Windows doesn't know about.
Doug Ewell
2012-04-17 19:03:58 UTC
Permalink
Post by Shawn Steele
FWIW: It'd be nice if like in section 0 it said "Encodings are scary,
use UTF-8 because the rest are implemented inconsistently across
platforms".
Way down in Section 7, a third of the way through the document, it does
say, "New content and formats must exclusively use the utf-8 encoding."
That's a start, although of course there's no way to enforce "new
content."

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Leif Halvard Silli
2012-04-17 19:09:24 UTC
Permalink
Post by Anne van Kesteren
Apart from big5, all encoders and decoders are now defined.
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
Feedback is much appreciated.
Egoist remark regarding "byte order mark is considered more
authoritative than anything else": I think that our discussion in the
WHATWG - where what I was able to say was based on a lot of preceding
documentation work in relation to an HTML5 bug {which you subsequently
"re-filed" as another, broader bug for the whole "Web platform"} - may
have contributed to what the document now says on this. See for example
this message from you:
<http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-December/034271.html>

If you agree that that is so, then I would appreciate being mentioned
in the acknowledgments section.
--
Leif H Silli
Anne van Kesteren
2012-04-18 06:26:14 UTC
Permalink
On Tue, 17 Apr 2012 21:09:24 +0200, Leif Halvard Silli
Post by Leif Halvard Silli
Egoist remark regarding "byte order mark is considered more
authoritative than anything else": I think that our discussion in the
WHATWG - where what I was able to say was based on a lot of preceding
documentation work in relation to an HTML5 bug {which you subsequently
"re-filed" as another, broader bug for the whole "Web platform"} - may
have contributed to what the document now says on this. See for example
this message from you:
<http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-December/034271.html>
If you agree that that is so, then I would appreciate being mentioned
in the acknowledgments section.
You're right, thanks for reminding me!
--
Anne van Kesteren
http://annevankesteren.nl/
Bjoern Hoehrmann
2012-04-18 05:11:14 UTC
Permalink
Post by Anne van Kesteren
Apart from big5, all encoders and decoders are now defined.
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
What is your reasoning behind "defining" how to decode UTF-8? It seems
to me this is well understood and does not require yet another
specification. Anyone wanting to implement a UTF-8 decoder would have to
compare your proposal to the other specifications to see if there are
any differences, and if there are any differences, find out or decide
if that's due to errors in your specification, and whether they want to
adopt your specification rather than any of the others. That's not a
good use of anyone's resources.

I don't feel like reverse-engineering your assembly code and clicking
through half a dozen definitions to confirm this, but it seems as
though your decoder is rather buggy; there is nothing obvious, for
instance, that would protect against overlong sequences.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Anne van Kesteren
2012-04-18 06:24:13 UTC
Permalink
Post by Bjoern Hoehrmann
Post by Anne van Kesteren
Apart from big5, all encoders and decoders are now defined.
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
What is your reasoning behind "defining" how to decode UTF-8?
The idea is to remove the need for
http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#utf-8
Post by Bjoern Hoehrmann
It seems
to me this is well understood and does not require yet another speci-
fication. Anyone wanting to implement a UTF-8 decoder would have to
compare your proposal to the other specifications to see if there are
any differences, and if there are any differences, find out or decide
if that's due to errors in your specification, and whether they want to
adopt your specification rather than any of the others. That's not a
good use of anyone's resources.
I agree, but referencing another specification and then trying to
carefully subset it because it does not do what you want does not seem
ideal either. And it would be inconsistent with the rest of the standard.

I don't really mind changing this though. We could do something like what
HTML has done instead, but it just seems rather messy to me.
Post by Bjoern Hoehrmann
I don't feel like reverse-engineering your assembly code and clicking
through half a dozen of definitions to confirm this, but it seems as
though your decoder is rather buggy, there is nothing obvious for in-
stance that would protect against overlong sequences.
"utf-8 lower boundary" takes care of that.
--
Anne van Kesteren
http://annevankesteren.nl/
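Since Bjoern's concern is how overlong sequences get caught, a rough, unofficial Python sketch of that boundary mechanism may help; the variable names and structure are mine, and the spec's prose algorithm is the authoritative version.

===
def utf8_decode(data):
    # Sketch of a boundary-checked UTF-8 decoder: the lower boundary is
    # raised for E0/F0 lead bytes so overlong forms are rejected, and the
    # upper boundary is lowered for ED/F4 to reject surrogates and > U+10FFFF.
    out, i = [], 0
    while i < len(data):
        b = data[i]
        i += 1
        if b <= 0x7F:
            out.append(chr(b))
            continue
        if 0xC2 <= b <= 0xDF:
            needed, cp, lower, upper = 1, b & 0x1F, 0x80, 0xBF
        elif 0xE0 <= b <= 0xEF:
            needed, cp = 2, b & 0x0F
            lower = 0xA0 if b == 0xE0 else 0x80
            upper = 0x9F if b == 0xED else 0xBF
        elif 0xF0 <= b <= 0xF4:
            needed, cp = 3, b & 0x07
            lower = 0x90 if b == 0xF0 else 0x80
            upper = 0x8F if b == 0xF4 else 0xBF
        else:
            out.append("\ufffd")        # C0, C1, F5-FF, lone continuations
            continue
        for _ in range(needed):
            if i < len(data) and lower <= data[i] <= upper:
                cp = (cp << 6) | (data[i] & 0x3F)
                i += 1
                lower, upper = 0x80, 0xBF   # the boundaries only apply once
            else:
                cp = None                   # error; do not consume this byte
                break
        out.append(chr(cp) if cp is not None else "\ufffd")
    return "".join(out)

print(utf8_decode(b"\xe4\xbd\xa0\xe5\xa5\xbd"))  # ok: two CJK characters
print(utf8_decode(b"\xc0\xafA"))  # overlong "/" is rejected: '\ufffd\ufffdA'
===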
Doug Ewell
2012-04-19 00:33:35 UTC
Permalink
What is your reasoning behind "defining" how to decode UTF-8? It seems
to me this is well understood and does not require yet another
specification. Anyone wanting to implement a UTF-8 decoder would have to
compare your proposal to the other specifications to see if there are
any differences, and if there are any differences, find out or decide
if that's due to errors in your specification, and whether they want
to adopt your specification rather than any of the others. That's not
a good use of anyone's resources.
Indeed, Anne's definition already diverges from the one supplied by the
Unicode Standard.

Given the sequence F8 80 80 80 80, the Unicode Standard specifies that a
decoder should recognize F5 as an invalid UTF-8 code unit, do whatever
it does on an error condition, and then continue with the next byte.
This will generate 5 error conditions if handling of errors includes
trying to continue.

Anne's decoder (section 7.1) will accept the entire sequence, convert it
to the value 0x200000, and then emit a single decoder error, generating
only one error condition. The algorithm described on the
infrastructure.html page, while worded differently, does the same.

Considering that Anne said the existing UTF-8 definition was "minus
[e.g. lacking] some error details," discrepancies like this seem
especially egregious.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Anne van Kesteren
2012-04-19 07:09:56 UTC
Permalink
Indeed, Anne's definition already diverges from the one supplied by the
Unicode Standard.
Given the sequence F8 80 80 80 80, the Unicode Standard specifies that a
decoder should recognize F5 as an invalid UTF-8 code unit, do whatever
it does on an error condition, and then continue with the next byte.
This will generate 5 error conditions if handling of errors includes
trying to continue.
Anne's decoder (section 7.1) will accept the entire sequence, convert it
to the value 0x200000, and then emit a single decoder error, generating
only one error condition. The algorithm described on the
infrastructure.html page, while worded differently, does the same.
Considering that Anne said the existing UTF-8 definition was "minus
[e.g. lacking] some error details," discrepancies like this seem
especially egregious.
My apologies, I just went with HTML on this, but it seems Internet
Explorer / Safari / Chrome handle this as you say, so we should just
remove the handling of those byte sequences in this manner and make sure
Opera and Gecko are fixed.

The bug I filed on Gecko can be found here:
https://bugzilla.mozilla.org/show_bug.cgi?id=746900

The bug I filed on Opera can be found at CORE-45840 if you have access to
our system.

The specification is fixed: http://dvcs.w3.org/hg/encoding/rev/f2f234e98474

Thank you!
--
Anne van Kesteren
http://annevankesteren.nl/
Masatoshi Kimura
2012-04-19 16:48:20 UTC
Permalink
Given the sequence F8 80 80 80 80, the Unicode Standard specifies that a
decoder should recognize F5 as an invalid UTF-8 code unit, do whatever
it does on an error condition, and then continue with the next byte.
This will generate 5 error conditions if handling of errors includes
trying to continue.
Where does TUS define this? It seems to contradict TUS 6.1.0 p.96:
http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf#page=42
|Although a UTF-8 conversion process is required to never consume
|well-formed subsequences as part of its error handling for ill-formed
|subsequences, such a process is not otherwise constrained in how it
|deals with any ill-formed subsequence itself. An ill-formed subsequence
|consisting of more than one code unit could be treated as a single
|error or as multiple errors. For example, in processing the UTF-8 code
|unit sequence <F0 80 80 41>, the only formal requirement mandated by
|Unicode conformance for a converter is that the <41> be processed and
|correctly interpreted as <U+0041>. The converter could return
|<U+FFFD, U+0041>, handling <F0 80 80> as a single error, or
|<U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a
|separate error, or could take other approaches to signalling <F0 80 80>
|as an ill-formed code unit subsequence.
It is exactly a purpose of the Encoding Standard to avoid this kind of
vagueness.
--
***@nifty.ne.jp
Doug Ewell
2012-04-19 00:39:03 UTC
Permalink
Given the sequence F8 80 80 80 80, the Unicode Standard specifies that
a decoder should recognize F5 as an invalid UTF-8 code unit,
Sorry, obviously that should have been "recognize F8 as an invalid code
unit," though of course F5 is one too.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Doug Ewell
2012-04-19 17:58:57 UTC
Permalink
Copying the Unicode mailing list.
Post by Masatoshi Kimura
Post by Doug Ewell
Given the sequence F8 80 80 80 80, the Unicode Standard specifies
that a decoder should recognize F5 as an invalid UTF-8 code unit, do
whatever it does on an error condition, and then continue with the
next byte. This will generate 5 error conditions if handling of
errors includes trying to continue.
http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf#page=42
|Although a UTF-8 conversion process is required to never consume
|well-formed subsequences as part of its error handling for ill-formed
|subsequences, such a process is not otherwise constrained in how it
|deals with any ill-formed subsequence itself. An ill-formed
|subsequence consisting of more than one code unit could be treated as
|a single error or as multiple errors. For example, in processing the
|UTF-8 code unit sequence <F0 80 80 41>, the only formal requirement
|mandated by Unicode conformance for a converter is that the <41> be
|processed and correctly interpreted as <U+0041>. The converter could
|return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or
|<U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as
|a separate error, or could take other approaches to signalling <F0 80
|80> as an ill-formed code unit subsequence.
I remembered reading a statement from UTC that interpretation of an ill-
formed sequence was supposed to terminate as soon as the sequence was
determined to be ill-formed:
Post by Masatoshi Kimura
For example, in UTF-8 every byte of the form 110xxxxx₂ must be
followed with a byte of the form 10xxxxxx₂. A sequence such as
<110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When
faced with this illegal byte sequence while transforming or
interpreting, a UTF-8 conformant process must treat the first byte
110xxxxx₂ as an illegal termination error: for example, either
signaling an error, filtering the byte out, or representing the byte
with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two
cases, it will continue processing at the second byte 0xxxxxxx₂.
A lead byte of 11111000₂ is ill-formed.
Post by Masatoshi Kimura
Using the definition for maximal subpart, the best practice can be
stated as follows: Whenever an unconvertible offset is reached during
conversion of a code unit sequence:
1. The maximal subpart at that offset should be replaced by a single
U+FFFD.
2. The conversion should proceed at the offset immediately after the
maximal subpart.
However, this description does use the word "should," not "must," and it
goes on (on the same page) to offer a table with three "possible
alternative approaches" for mapping an ill-formed UTF-8 sequence into
characters. It recommends the method described above, but allows the
other two.

So the bottom line is that Masatoshi is right: the Unicode Standard does
not specify that a decoder *must* respond to an invalid lead byte as I
said, only that it *should*. I agree that this is unnecessarily vague.

Whether this calls for a complete recasting of the definition of UTF-8
by WHATWG, or by any individual contributors therein, is of course a
different matter.
Post by Masatoshi Kimura
It is exactly a purpose of Encoding Standard to avoid these kind of
vagueness.
Again, I'm not sure whether it is within the authority or responsibility
of WHATWG or any individual to provide a "better" definition of a
Unicode encoding form than that provided by Unicode. I do understand the
desire to nail down the various legacy encodings, such as Shift-JIS,
that have been interpreted over the years in very flexible and confusing
ways. I don't think UTF-8 falls into this category at all.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
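As a concrete illustration of the "maximal subpart" recommendation being discussed, a decoder that already follows it (I'm assuming a reasonably recent Python 3 here, purely as an illustration, not as evidence of what any browser does) produces one U+FFFD per maximal subpart:

===
# <F8 80 80 80 80>: F8 is an invalid lead byte, and each lone 0x80 is its
# own error, so the recommended practice yields five replacement characters.
# <F0 80 80 41>: F0 cannot be followed by 0x80, so the maximal subpart is
# <F0> alone, then each 0x80 errs, then 0x41 decodes as "A".
print(b"\xf8\x80\x80\x80\x80".decode("utf-8", errors="replace"))  # 5 x U+FFFD
print(b"\xf0\x80\x80\x41".decode("utf-8", errors="replace"))      # 3 x U+FFFD + "A"
===

Treating <F8 80 80 80 80> as one error with a single U+FFFD, as the earlier draft text did, is the alternative the standard permits but does not recommend.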
Anne van Kesteren
2012-04-20 06:28:44 UTC
Permalink
Post by Doug Ewell
Again, I'm not sure whether it is within the authority or responsibility
of WHATWG or any individual to provide a "better" definition of a
Unicode encoding form than that provided by Unicode. I do understand the
desire to nail down the various legacy encodings, such as Shift-JIS,
that have been interpreted over the years in very flexible and confusing
ways. I don't think UTF-8 falls into this category at all.
I think having a single specification to address all encoding questions is
useful. It presents encoding algorithms in a consistent style and gives
other specifications a simple reference.


The Unicode standard

* Does not address labels (e.g. I cannot find "utf8" in the PDF)
* Deals with byte order mark and utf-16 in a manner that is not matched by
implementations
* Does not go far enough in defining error handling
* Is a PDF, which is really annoying on the web
--
Anne van Kesteren
http://annevankesteren.nl/
Shawn Steele
2012-04-20 17:36:39 UTC
Permalink
I think having a single specification to address all encoding questions is useful.
It presents encoding algorithms in a consistent style and gives other specifications a simple reference.
Unfortunately this document doesn't "own" any of the other standards that it's summarizing. As a software developer I'd rather "go to the source" than rely on an intermediary document that may have inadvertently introduced a discrepancy from the actual authoritative standards.

I'm not suggesting that a single source wouldn't be nice, but rather that it's impractical. Unless you can get Unicode to cede the definition of UTF-8 to your document, and the same with all the other standards, they're bound to be inconsistent or diverge.

It may be better as a pointer to the other standards, &/or documentation of quirks where other standards have been implemented differently and the pros & cons of those standards.
Leif Halvard Silli
2012-04-23 11:30:04 UTC
Permalink
Post by Anne van Kesteren
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
Feedback is much appreciated.
(1) Just an idea - take it or leave it: Section 7 is called "The
encoding" and that it only describes a single encoding - UTF-8. In
order to emphasize "UTF-8" as _the_ encoding, how about collapsing
section 8 to 13 into a single section named "Legacy encodings", with 6
sub-sections?

(2) I would suggest that you, the first time you talk about "byte order
mark", also introduce the abbreviation - "BOM". Currently, BOM occurs
in section 13 while "byte order mark" occurs in section 6.

(3) Regarding the note "the byte order mark is considered more
authoritative than anything else", then I would suggest specifying what
"anything else" means. I suppose that it includes - or at least ought
to include

HTTP,
<meta charset>,
<meta http-equiv=Content-Type>,
<?xml version="1.0" encoding="<anyvalue>" ?>
Manual encoding overriding by the user
The above is valid for both XML and HTML.

Unless this is listed/described, then I think one starts to guess
what "anything else" means.
--
Leif H Silli
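On point (3), a minimal sketch (in Python, my own illustration rather than spec text) of what "more authoritative than anything else" could mean operationally: the BOM check runs on the raw bytes before any declared label from HTTP, <meta>, the XML declaration, or a user override is even consulted.

===
def sniff_bom(data, declared=None):
    # A byte order mark, if present, wins over whatever `declared` says
    # (HTTP charset, <meta charset>, an XML declaration, or a user override).
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"
    return declared or "utf-8"  # the fallback policy belongs to the host format

# A UTF-8 BOM overrides an (incorrect) windows-1252 label from HTTP:
print(sniff_bom(b"\xef\xbb\xbfcaf\xc3\xa9", declared="windows-1252"))  # utf-8
===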
Anne van Kesteren
2012-04-26 11:44:05 UTC
Permalink
On Mon, 23 Apr 2012 13:30:04 +0200, Leif Halvard Silli
Post by Leif Halvard Silli
(1) Just an idea - take it or leave it: Section 7 is called "The
encoding" and it only describes a single encoding - UTF-8. In
order to emphasize "UTF-8" as _the_ encoding, how about collapsing
section 8 to 13 into a single section named "Legacy encodings", with 6
sub-sections?
How would you title the subsections? I could not think of something good.
Post by Leif Halvard Silli
(2) I would suggest that you, the first time you talk about "byte order
mark", also introduce the abbreviation - "BOM". Currently, BOM occurs
in section 13 while "byte order mark" occurs in section 6.
Done.
Post by Leif Halvard Silli
(3) Regarding the note "the byte order mark is considered more
authoritative than anything else", then I would suggest specifying what
"anything else" means. I suppose that it includes - or at least ought
to include
HTTP,
<meta charset>,
<meta http-equiv=Content-Type>,
<?xml version="1.0" encoding="<anyvalue>" ?>
Manual encoding overriding by the user
The above is valid for both XML and HTML.
Unless this is listed/described, then I think one starts to guess
what "anything else" means.
I think this should become clearer once this is integrated into other
specifications. I'd rather not mention too many format-specific things here.
--
Anne van Kesteren
http://annevankesteren.nl/
Leif Halvard Silli
2012-04-26 13:24:08 UTC
Permalink
Hi Anne!

First, another proposal: In the name/label table, state that "unicode"
is another label for "utf-16" and that "unicodeFFFE" is another label
for "utf-16be".

Paradoxically, at least for HTML, this is perhaps more relevant as
a label for UTF-8 encodings than as a label for UTF-16 encodings -
however, this is also the case for the other names of the UTF-16
encodings.
Post by Leif Halvard Silli
(1) Just an idea - take it or leave it: Section 7 is called "The
encoding" and that it only describes a single encoding - UTF-8. In
order to emphasize "UTF-8" as _the_ encoding, how about collapsing
section 8 to 13 into a single section named "Legacy encodings", with 6
sub-sections?
How would you title the subsections? I could not think of something good.
Roughly this was what I had in mind, starting with section 7:

7 The <ins>standard</ins> encoding
7.1 utf-8
8 The legacy encodings
8.1 The Single-byte legacy encodings
8.2 The Multi-byte legacy encodings
8.2.1 The Chinese (simplified) legacy encodings
8.2.1.1 gbk
8.2.1.2 gb18030
8.2.1.3 hz-gb-2312
8.2.2 The Chinese (traditional) legacy encodings
8.2.2.1 big5
8.2.3 The Japanese legacy encodings
8.2.3.1 euc-jp
8.2.3.2 iso-2022-jp
8.2.3.3 shift_jis
8.2.4 The Korean legacy encodings
8.2.4.1 euc-kr
8.2.4.2 iso-2022-kr
8.2.5 The utf-16 legacy encodings
8.2.5.1 utf-16
8.2.5.2 utf-16be
Post by Leif Halvard Silli
(2) I would suggest that you, the first time you talk about "byte order
mark", also introduce the abbreviation - "BOM". Currently, BOM occurs
in section 13 while "byte order mark" occurs in section 6.
Done.
Would it not be an idea to say "(BOM)" (in parentheses) also in the
note under '13 Legacy utf-16 encodings', just in case someone searches
for "BOM"?
Post by Leif Halvard Silli
(3) Regarding the note "the byte order mark is considered more
authoritative than anything else", then I would suggest specifying what
"anything else" means. I suppose that it includes - or at least ought
to include
HTTP,
<meta charset>,
<meta http-equiv=Content-Type>,
<?xml version="1.0" encoding="<anyvalue>" ?>
Manual encoding overriding by the user
The above is valid for both XML and HTML.
Unless this is listed/described, then I think one starts to guess
what "anything else" means.
I think this should become clearer once this is integrated into other
specifications. I rather not mention too much format specific things
here.
OK. But maybe it would be possible to say something quite general
which leads the thought in the right direction? For instance:

"... than anything else, such as encoding declarations in higher
protocols, format-specific encoding declarations and manual user
overriding."
--
Leif H Silli