Discussion:
Encodings and the web
Anne van Kesteren
2011-12-20 10:59:49 UTC
Permalink
Hi,

When doing research into encodings as implemented by popular user agents I
have found the current standards lacking. In particular:

* More encodings in the registry than needed for the web
* Error handling for encodings is undefined (can lead to XSS exploits,
also gives interoperability problems)
* Often encodings are implemented differently from the standard

A year ago I did some research into encodings[1], and in more detail into
single-octet encodings[2]. I have now taken that further by starting to
define a standard[3] for encodings as they are to be implemented by user
agents. The current scope is roughly defining the encodings, their labels
and names, and how you match a label.
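
To make "how you match a label" concrete, the matching boils down to
something like the following sketch (Python, purely as illustration; the
label table is a tiny made-up fragment, not the draft's actual list):

    # Trim surrounding ASCII whitespace, lowercase, and look the result up.
    LABELS = {
        "utf-8": "utf-8",
        "utf8": "utf-8",
        "latin1": "windows-1252",
        "iso-8859-1": "windows-1252",
        "windows-1252": "windows-1252",
    }

    def get_encoding(label):
        label = label.strip("\t\n\f\r ").lower()
        return LABELS.get(label)  # None means the label is unknown

    assert get_encoding("  Latin1 ") == "windows-1252"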

The goal is to unify encoding handling across user agents for the web so
legacy pages can be interpreted "correctly" (i.e. as expected by users).

If you are interested in helping out testing (and reverse engineering)
multi-octet encodings please let me know. Any other input is much
appreciated as well.

Kind regards,


[1]<http://wiki.whatwg.org/wiki/Web_Encodings>
[2]<http://annevankesteren.nl/2010/12/encodings-labels-tested>
[3]<http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html>
--
Anne van Kesteren
http://annevankesteren.nl/
Martin J. Dürst
2011-12-21 11:09:03 UTC
Permalink
Hello Anne,
Post by Anne van Kesteren
Hi,
When doing research into encodings as implemented by popular user agents I
have found the current standards lacking. In particular:
* More encodings in the registry than needed for the web
That's not a surprise. I think there are two reasons. One is that the
main initial contributor tried to be more general than necessary, and
included things that aren't actually usable, such as just the
double-byte parts of some multibyte standards, and so on. The second is
that, believe it or not, there's actually stuff outside the Web.

There is absolutely no problem for a Web-related spec to say "these are
the encodings you should support in a Web UA, and not more". The reason
we didn't do something like this in the RFC 2070 or HTML4 timeframe is
that at that time it wasn't opportune yet to say "no more new encodings,
please use Unicode".
Post by Anne van Kesteren
* Error handling for encodings is undefined (can lead to XSS exploits,
also gives interoperability problems)
We don't have security considerations for charset registrations, but we
probably should. If there are any specific issues, I'm sure we will find
a way to add them to the registry, so please send them here.
Post by Anne van Kesteren
* Often encodings are implemented differently from the standard
There are often minor differences. ICU has a huge collection of
variants. Getting rid of them may be impossible, unfortunately. As an
example, both Microsoft and Apple have long-standing traditions for some
differences, and both may not be very willing to change.

It would be good to know where and how your tables differ from "the
standard".

It would also be good to know what it is that you refer to by "the
standard". (I guess I would know if there were a "standard" for all
character encodings, but I don't know of such a thing.)
Post by Anne van Kesteren
A year ago I did some research into encodings[1] and more detailed for
single-octet encodings[2] and I have now taken that further into starting
to define a standard[3] for encodings as they are to be implemented by
user agents. The current scope is roughly defining the encodings, their
labels and name, and how you match a label.
I have quickly looked at the document. Many single-octet encodings only
show a few rows. As an expert, I have a pretty sure feel of where to
look for the rest of the table, but it would really be better if there
was an explicit pointer to the table used for completion from the table
that lacked the rows.

Also, it would be very helpful if the entries where there are
differences were marked as such directly in the table. Having to go back
and forth between table and notes is really tough.

Also, it would probably be better to use a special value instead of
U+FFFD for undefined values. Somewhere at the start, or in another spec,
it can then say what that means. The reason is that in contexts other than
final display, one wants something other than simple conversion to U+FFFD
to happen.
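
(To illustrate the two behaviours with Python's codec error handlers - not
what any spec mandates, just a convenient analogy:

    bad = b"\x41\xff\x42"                         # 0xFF is never valid UTF-8
    print(bad.decode("utf-8", errors="replace"))  # 'A\ufffdB' - fine for display
    try:
        bad.decode("utf-8", errors="strict")      # raises instead of substituting
    except UnicodeDecodeError as err:
        print("error at byte offset", err.start)  # better for validators etc.

A dedicated "undefined" value in the spec would let other contexts pick the
behaviour they need rather than always getting U+FFFD.)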

Another point is that "platform" turns up a lot. It would be much easier
to understand for outsiders if it read "Web platform".

Instead of writing "Define the finite list of encodings for the platform
and obsolete the "CHARACTER SETS" registry.", please use some wording
that makes it clear that your document does NOT obsolete the charset
registry (even if it may make it irrelevant for Web browsers).

You write "Need to define decode a byte string as UTF-8, with error
handling in a way that avoids external dependencies.". I'm really not
sure why this would be needed. The Unicode consortium went over UTF-8
decoding issues with a very fine comb many times. If you find a hair in
their soup, they have to fix it. If not, duplicating the work doesn't
help at all. What you might need is some glue text, because the Unicode
spec is worded for various situations, not only final display on Web UAs.

Last but not least, solving encoding conversion issues does not fix all
problems. On a Japanese OS, I regularly see U+005C as a Yen symbol
rather than as a backslash. I haven't looked at browser differences in
this respect, but I'm sure they exist.

Regards, Martin.
Post by Anne van Kesteren
The goal is to unify encoding handling across user agents for the web so
legacy pages can be interpreted "correctly" (i.e. as expected by users).
If you are interested in helping out testing (and reverse engineering)
multi-octet encodings please let me know. Any other input is much
appreciated as well.
Kind regards,
[1]<http://wiki.whatwg.org/wiki/Web_Encodings>
[2]<http://annevankesteren.nl/2010/12/encodings-labels-tested>
[3]<http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html>
Anne van Kesteren
2011-12-22 12:51:33 UTC
Permalink
On Wed, 21 Dec 2011 12:09:03 +0100, Martin J. Dürst wrote:
Post by Martin J. Dürst
* Error handling for encodings is undefined (can lead to XSS exploits,
also gives interoperability problems)
We don't have security considerations for charset registrations, but we
probably should. If there are any specific issues, I'm sure we will find
a way to add them to the registry, so please send them here.
See e.g. http://zaynar.co.uk/docs/charset-encoding-xss.html
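
(The classic example in this area is charset sniffing turning inert-looking
bytes into markup, e.g. when a consumer can be convinced the bytes are
UTF-7. Python illustration only; whether any given browser still sniffs
UTF-7 is a separate question:

    payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
    print(payload.decode("latin-1"))  # '+ADw-script+AD4-...' - harmless text
    print(payload.decode("utf_7"))    # '<script>alert(1)</script>' - markup

Undefined error handling and loose label matching feed directly into that
class of problem.)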
Post by Martin J. Dürst
* Often encodings are implemented differently from the standard
There are often minor differences. ICU has a huge collection of
variants. Getting rid of them may be impossible, unfortunately. As an
example, both Microsoft and Apple have long-standing traditions for some
differences, and both may not be very willing to change.
We'll see. The same was said when we tackled the HTML parser
interoperability mess.
Post by Martin J. Dürst
It would be good to know where and how your tables differ from "the
standard".
It would also be good to know what it is that you refer to by "the
standard". (I guess I would know if there were a "standard" for all
character encodings, but I don't know of such a thing.)
When my document says "standard" it refers to itself. I have thought about
including differences with respect to the IANA registry, but was hoping
someone else would do that based on the data tables available.
Post by Martin J. Dürst
I have quickly looked at the document. Many single-octet encodings only
show a few rows. As an expert, I have a pretty sure feel of where to
look for the rest of the table, but it would really be better if there
was an explicit pointer to the table used for completion from the table
that lacked the rows.
At the beginning of the section it states that for missing rows all octet
values match the code point value. So if 80-8F == U+0080-U+008F the row is
simply not there for brevity.
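
(In code terms, roughly the following, with a deliberately tiny, made-up
override table rather than any real one:

    OVERRIDES = {0xA4: 0x20AC}   # only octets that differ from identity

    def decode_single_octet(data, overrides=OVERRIDES):
        return "".join(chr(overrides.get(byte, byte)) for byte in data)

    assert decode_single_octet(b"\x41\x80\xa4") == "\u0041\u0080\u20ac"

Any octet without an explicit row falls back to the code point with the same
value, which is why those rows can be left out.)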
Post by Martin J. Dürst
Also, it would be very helpful if the entries where there are
differences were marked as such directly in the table. Having to go back
and forth between table and notes is really tough.
The plan is for the notes to go away as implementations align. Both
Mozilla and Opera are making some effort towards that.
Post by Martin J. Dürst
Also, it would probably be better to use a special value instead of
U+FFFD for undefined values. Somewhere at the start, or in another spec,
it can then say what that means.
Thank you, I have done this now.
Post by Martin J. Dürst
Another point is that "platform" turns up a lot. It would be much easier
to understand for outsiders if it read "Web platform".
Also done for now.
Post by Martin J. Dürst
Instead of writing "Define the finite list of encodings for the platform
and obsolete the "CHARACTER SETS" registry.", please use some wording
that makes it clear that your document does NOT obsolete the charset
registry (even if it may make it irrelevant for Web browsers).
Fair enough, done.
You write "Need to define decode a byte string as UTF-8, with error =
handling in a way that avoids external dependencies.". I'm really not =
=
sure why this would be needed. The Unicode consortium went over UTF-8 =
=
decoding issues with a very fine comb many times. If you find a hair i=
n =
their soup, they have to fix it. If not, duplicating the work doesn't =
=
help at all. What you might need is some glue text, because the Unicod=
e =
spec is worded for various situations, not only final display on Web U=
As.

That is what HTML currently does (as referenced from the issue). It does
not strike me as ideal for implementors.
Post by Martin J. Dürst
Last but not least, solving encoding conversion issues does not fix all
problems. On a Japanese OS, I regularly see U+005C as a Yen symbol
rather than as a backslash. I haven't looked at browser differences in
this respect, but I'm sure they exist.
That particular issue is related to fonts in most browsers. I do not
expect the Japanese fonts that display U+005C as Yen to change, but
regardless it is out of scope for this document.
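
(At least with Python's codecs the byte 0x5C maps to U+005C in both
Shift_JIS and CP932, so where that mapping is used the Yen appearance is
purely a glyph choice:

    for codec in ("shift_jis", "cp932"):
        assert b"\x5c".decode(codec) == "\u005c"

Shown only as an illustration, not as a claim about any particular browser.)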


--
Anne van Kesteren
http://annevankesteren.nl/
Leif Halvard Silli
2011-12-21 15:55:30 UTC
Permalink
Post by Anne van Kesteren
* More encodings in the registry than needed for the web
* Error handling for encodings is undefined (can lead to XSS exploits,
also gives interoperability problems)
* Often encodings are implemented differently from the standard
Comment: In the HTML5 spec, the term 'character encoding' is used.
Perhaps this document should say the same? At least once ... for
instance in the title ...

Comment: The approach of the 'old' character sets registry is to
document the encodings in use, but not necessarily to endorse them. Do
you follow a similar approach? E.g. do you intend to list all encodings
and encoding labels, including obsolete ones? And if you make things
into aliases which previously were different character sets/encodings,
do you intend to point to the original specs or registrations? I have
the feeling that you take a synchronic approach - gloss over the past.
It appears simpler to contribute if the spec tries to be complete.

For instance, I could not find ISO-IR-111 in your list ... just to name
one character encoding that stuck in my mind ... It is a superset of
KOI8-R.

...
Post by Anne van Kesteren
The goal is to unify encoding handling across user agents for the web so
legacy pages can be interpreted "correctly" (i.e. as expected by users).
As expected by users, you say. Or as UAs have created the expectations
... Users expect their pages to work. HTML5 says that UTF-32 is
explicitly not supported. And I think 'not supported anymore' should be
documented. I would suggest that the spec ought to take this approach:
w.r.t. 'dubious' encodings, UAs should be allowed to support any
legacy encoding they like unless it is explicitly listed as 'not
supported'. That way we get to quarrel about what to ban, rather than
about what to welcome.

As for 'users': I note that for IBM 864, for instance, you say
'since Presto has no support, maybe we can remove it'. Opera is of
course the dominating browser ... Though I might not understand the
impact of the mobile Web in that statement - is Opera Mini popular in
Arabic countries? But to be certain: where are the users in this line
of thought?

You thereafter say that 'Chromium only supports it because of Webkit'.
How do you know that? In my experience, Chromium appears almost biased
towards Arabic ... E.g. for unlabelled koi8-r it defaults to
Arabic ... At least on my computer and on this page - without the same
thing happening in Safari:
<http://www.malform.no/testing/utf/html/koi8/1>.

Personally, I'd like to see more robust detection of UTF-8 - and, of
course, also of UTF-16. UTF-8 really ought to be some kind of
pre-default, tried before falling back to the locale encoding. (Opera
and Chrome are perhaps closest to my wish in that regard.)
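
A minimal sketch of that 'pre-default' idea, assuming a plain validity check
and a made-up locale fallback (illustration only, not what any browser or
spec actually does):

    def guess_encoding(data, locale_default="windows-1252"):
        try:
            data.decode("utf-8", errors="strict")
            return "utf-8"
        except UnicodeDecodeError:
            return locale_default

    assert guess_encoding("Dürst".encode("utf-8")) == "utf-8"
    assert guess_encoding("Dürst".encode("windows-1252")) == "windows-1252"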

Btw, what is this spec's relation to the encoding sniffing algorithm of
HTML5 supposed to be?

And what are 'Encodings and the web'? Does XML fit in there? I think
some would like to say 'hopefully not' ...
Post by Anne van Kesteren
If you are interested in helping out testing (and reverse engineering)
multi-octet encodings please let me know. Any other input is much
appreciated as well
As part of my MS 'unicode' effort, I have created a test bed that I try
to update in my perceived spare time:
<http://www.malform.no/testing/utf/>. But it takes some time to analyze
and document it all. However, it is quite interesting ... I will find a
suitable place to post it when I'm ready.

One thing I've found, in that regard, is that browsers vary a good deal
w.r.t. what they use in order to detect encoding. For instance they
vary w.r.t. whether they use the XML prolog, both with and without the
XML encoding declaration inside - including in HTML - when sniffing the
encoding.
Chrome does use the XML prolog - at least it sniffs UTF-16LE and
UTF-16BE when the prolog is there, but not necessarily otherwise. If
you - as I think you do - want to eat into how not only HTML but also
XML handles encodings, perhaps HTML should accept being eaten into by
XML too? (I suggested for HTML5 that it should allow limited use of XML
prolog, but guess if the Editor closed that bug ...)
--
Leif H Silli
Anne van Kesteren
2012-01-06 09:02:44 UTC
Permalink
Post by Anne van Kesteren
If you are interested in helping out testing (and reverse engineering)
multi-octet encodings please let me know. Any other input is much
appreciated as well.
I made some modest progress since last time. In particular the to-Unicode
algorithms behind hz-gb-2312, euc-jp, iso-2022-jp, and shift_jis are done.

http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

I was wondering if people had ideas on how to present rather large data
tables. For single-octet encodings I think what I have now is okay, but
for multi-octet encodings it probably needs to be a separate file. Should
such a file be HTML or is a simple data file sufficient? Maybe JSON or the
Unicode.org format?
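
For concreteness, the two plain-data shapes I have in mind look roughly like
this, generated from a three-entry windows-1252-style fragment purely to
show the formats, not to propose actual contents:

    import json

    index = {0x80: 0x20AC, 0x81: 0x0081, 0x82: 0x201A}

    # Unicode.org-style mapping lines: "0xNN<TAB>0xNNNN".
    for byte, cp in sorted(index.items()):
        print("0x%02X\t0x%04X" % (byte, cp))

    # The same data as JSON; byte keys become strings, which is one argument
    # for a positional array instead.
    print(json.dumps({"0x%02X" % b: "U+%04X" % cp for b, cp in index.items()}))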

Input appreciated.

By the way, if it is inappropriate for me to discuss this here let me know.
--
Anne van Kesteren
http://annevankesteren.nl/