Discussion:
Volunteer needed to serve as IANA charset reviewer
Ned Freed
2006-09-06 20:30:44 UTC
Permalink
> (IETF list removed, since this is about to become specialized)

> --On Wednesday, 06 September, 2006 11:04 -0700 Ted Hardie
> <***@qualcomm.com> wrote:

> > The Applications Area is soliciting volunteers willing to
> > serve as the IANA charset reviewer. This position entails
> > reviewing charset registrations submitted to IANA in
> > accordance with the procedures set out in RFC 2978. It
> > requires the reviewer to monitor discussion on the
> > ietf-charsets mailing list (moderating it, if necessary); it
> > also requires that the reviewer interact with the registrants
> > and IANA on the details of the registration. There is
> > currently a small backlog, and it will be necessary to work to
> > resolve that backlog during the initial period of the
> > appointment.
> >...

> Perhaps the need for a new volunteer in this area is the time to
> ask a broader question:

> At the time 2978 (and its predecessor, 2278) were defined, there
> were a large number of charsets in heavy use and there was some
> general feeling in the implementer community that, despite the
> provisions of RFC 2277, Unicode/ISO 10646 were not quite ready.
> Although we probably still have some distance to go (the issues
> with my net-Unicode draft may be illustrative), I wonder if we
> are reaching the point at which a stronger "use Unicode on the
> wire" recommendation would be in order. The implications of
> such a recommendation would presumably include a 2978bis that
> made the requirements for registration of a new charset _much_
> tougher, e.g., requiring a demonstration that the then-current
> version of Unicode cannot do the relevant job and/or evidence
> that the newly-proposed charset is needed in deployed
> applications.

I agree that we've reached a point where "use UTF-8" is what we need to be
pushing for in new protocol development. (Note that I said UTF-8 and not
Unicode - given the existence of gb18030 [*] I don't regard a recommendation of
"use Unicode" as even close to sufficient. The last thing we want is to see the
development of specialized Unicode CESes for Korean, Japanese, Arabic, Hebrew,
Thai, and who knows what else.) And if the reason for new charset registrations
were a perceived need to have new charsets for use in new protocols, I would be
in total agreement that a change in focus for charset registration is in order.

But that's not why we're seeing new registrations. The new registrations we're
seeing are of legacy charsets used in legacy applications and protocols that
for whatever reason never got registered previously. Given that these things
are in use in various nooks and crannies around the world, it is critically
important that when they are used they are labelled accurately and
consistently.

The plain fact of the matter is that we have done a miserable job of producing
an accurate and useful charset registry, and considerable work needs to be done
both to register various missing charsets and to clean up the existing
registry, which contains many errors. I've seen no interest whatsoever in
registering new charsets for new protocols, so to my mind pushing back on, say,
the recent registration of iso-8859-11, is an overreaction to a non-problem.
[**]

> This question is motivated, not by a strong love for Unicode,
> but by the observation that RFC 2277 requires it and that the
> IETF is shifting toward it in a number of areas. More options
> and possibilities for local codings that are not generally known
> and supported do not help with interoperability; perhaps it is
> time to start pushing back.

Well, I have to say that to the extent we've pushed back on registrations, what
we've ended up with is an ad-hoc mess of unregistered usage. I am therefore
quite skeptical of any belief that pushing back on registrations is a useful
tactic.

> And that, of course, would dramatically change the work of the
> charset reviewer by reducing the volume but increasing the
> amount of evaluation to be done.

Even if we closed the registry completely there would still be a bunch of work
to do in terms of registry cleanup.

Now, having said all this, I'm willing to take on the role of charset reviewer,
but with the understanding that one of the things I will do is conduct a
complete overhaul of the existing registry. [***] Such a substantive change will
of course require some degree of oversight, which in turn means I'd like to see
some commitment from the IESG of support for the effort.

As for qualifications, I did write the charset registration specification, and
I also wrote and continue to maintain a fairly full-featured charset conversion
library. I can provide more detail if anyone cares.

Ned

[*] - For those not fully up to speed on this stuff, gb18030 can be seen as an
encoding of Unicode that is backwards compatible with the previous simplified
Chinese charsets gb2312 and gbk.
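
As a rough illustration of this (a Python sketch using the interpreter's
built-in codecs; the sample strings are made up):

    # gb18030 can represent any Unicode text, and decodes gb2312 bytes unchanged.
    text = "\u4e2d\u6587 \U0001F600"                  # two CJK characters plus an emoji
    assert text.encode("gb18030").decode("gb18030") == text

    legacy = "\u4e2d\u6587".encode("gb2312")          # bytes from the older charset
    assert legacy.decode("gb18030") == "\u4e2d\u6587" # gb18030 reads them as-is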

[**] - The less recent attempt to register ISO-2022-JP-2004 is a more
interesting case. I believe this one needed to be pushed on, but not
because of potential use in new applications or protocols.

[***] - I have the advantage of being close enough to IANA that I can drive
over there and have F2F meetings should the need arise - and I suspect
it will.
Keith Moore
2006-09-06 21:45:58 UTC
Permalink
I concur with the need to maintain the current charset registry to
support legacy apps that use it.

And I think Ned would be an excellent choice for reviewer, though it
wouldn't bother me if he could have the assistance of people with
specialized expertise in Asian writing schemes.

As for utf-8 vs. Unicode, this is a bit tricky. I agree that merely
specifying Unicode isn't sufficient given the potential for
incompatible CESs. And yet I'm sympathetic to the notion that UTF-8
pessimizes storage and transmission of text written in certain
languages. IMHO it's unreasonable to exclude the potential for a
Unicode based CES that has more-or-less equivalent information
density across a wide variety of languages. But I do think that use of
multiple CESs in a new protocol should require substantial
justification, and that UTF-8 should be presumed to be the CES of
choice for any new protocol that requires ASCII compatibility for its
character representation.

Keith
Ned Freed
2006-09-06 22:58:19 UTC
Permalink
> I concur with the need to maintain the current charset registry to
> support legacy apps that use it.

> And I think Ned would be an excellent choice for reviewer, though it
> wouldn' t bother me if he could have the assistance of people with
> specialized expertise in Asian writing schemes.

Any such assistance would be hugely welcome. As an aside, it would also be nice
if more people would post comments to the list...

> As for utf-8 vs. Unicode, this is a bit tricky. I agree that merely
> specifying Unicode isn't sufficient given the potential for
> incompatible CESs. And yet I'm sympathetic to the notion that UTF-8
> pessimizes storage and transmission of text written in certain
> languages. IMHO it's unreasonable to exclude the potential for a
> Unicode based CES that has more-or-less equivalent information
> density across a wide variety of languages. But I do think that use of
> multiple CESs in a new protocol should require substantial
> justification, and that UTF-8 should be presumed to be the CES of
> choice for any new protocol that requires ASCII compatibility for its
> character representation.

This is pretty much where I'm at as well. I have no problem with UTF-16 or
UTF-32 if there is a compelling reason to allow them, but I really want to at
least try and close the door to additional CESes to the greatest extent
possible. Of course this is really an issue for the IAB and not the charset
reviewer - thank goodness.

Ned
Bruce Lilly
2006-09-07 10:33:48 UTC
Permalink
On Wed September 6 2006 18:58, Ned Freed wrote:
> > I concur with the need to maintain the current charset registry to
> > support legacy apps that use it.
>
> > And I think Ned would be an excellent choice for reviewer, though it
> > wouldn' t bother me if he could have the assistance of people with
> > specialized expertise in Asian writing schemes.
>
> Any such assistance would be hugely welcome. As an aside, it would also be nice
> if more people would post comments to the list...

OK. I concur with most of what has already been said by others, specifically
that if a charset (i.e. something meeting the definition of charset) is in
use, it ought to be registered; using the registry as a way to force some
agenda is a very bad idea. Also that Ned would be an excellent choice for
reviewer, and I would add that I fully support his stated plan to overhaul
the existing registry, which has long been in need of such an overhaul (e.g.
the registration procedure has long said that "ASCII" is disallowed, yet it
is in fact registered as an alias).

A few differences of opinion:
Keith Moore wrote:
> > But I do think that use of
> > multiple CESs in a new protocol should require substantial
> > justification, and that UTF-8 should be presumed to be the CES of
> > choice for any new protocol that requires ASCII compatibility for its
> > character representation.

There may well be application areas for new protocols which cannot fully
support Unicode, which underlies utf-8, due to character set size, the huge
tables needed for normalization, etc. (see sections 3.1 (paying particular
attention to "memory-starved microprocessors") and 3.4 of RFC 1958). Not all
protocols need to fully support utf-8 directly; the highly successful mail
system, for example, supports only a subset of ANSI X3.4 in message header
fields, yet it allows pass-through of utf-8 and other charsets via RFC 2047
mechanisms as amended by RFC 2231 and errata.
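
As a rough sketch of that pass-through, using the Python standard library's
RFC 2047 support (the sample subject text is made up):

    from email.header import Header, decode_header

    subject = Header("Gr\u00fc\u00dfe aus K\u00f6ln", charset="utf-8")
    wire = subject.encode()                  # an ASCII-only RFC 2047 encoded-word
    assert all(ord(c) < 128 for c in wire)   # safe in a message header field

    raw, cs = decode_header(wire)[0]
    assert raw.decode(cs) == "Gr\u00fc\u00dfe aus K\u00f6ln"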

Ted Hardie wrote:
> > This question is motivated, not by a strong love for Unicode,
> > but by the observation that RFC 2277 requires it and that the
> > IETF is shifting toward it in a number of areas.

To be precise, RFC 2277 says:
" Protocols MUST be able to use the UTF-8 charset, which consists of
the ISO 10646 coded character set combined with the UTF-8 character
encoding scheme, as defined in [10646] Annex R (published in
Amendment 2), for all text.

Protocols MAY specify, in addition, how to use other charsets or
other character encoding schemes for ISO 10646, such as UTF-16, but
lack of an ability to use UTF-8 is a violation of this policy; such a
violation would need a variance procedure ([BCP9] section 9) with
clear and solid justification in the protocol specification document
before being entered into or advanced upon the standards track.

For existing protocols or protocols that move data from existing
datastores, support of other charsets, or even using a default other
than UTF-8, may be a requirement. This is acceptable, but UTF-8
support MUST be possible.

When using other charsets than UTF-8, these MUST be registered in the
IANA charset registry, if necessary by registering them when the
protocol is published.
"
Several points:
1. "MUST be able to use" is a bit different from "requires" (see the above
example of the mail system, which is able to use utf-8 by the mechanisms
noted, but which does not require and in fact cannot directly accommodate
raw utf-8).
2. The explicitly stated policy of allowing alternative charsets is important.
3. Most important, note that 2277 explicitly requires registration.

> I have no problem with UTF-16 or
> UTF-32 if there is a compelling reason to allow them,

Well, neither of those (nor their "BE" and "LE" variants) is suitable for use
with MIME text types, which precludes their use in a number of important
applications. And one thing the charset registry sorely needs is a more
explicit indication of which charsets are or are not suitable for such use
(heck, some registrations have lacked the required statement of
[un]suitability, so even groping through all of the registrations is of
no use; and don't get me started on RFC 1345 issues).
Keld Jørn Simonsen
2006-09-07 15:56:14 UTC
Permalink
On Thu, Sep 07, 2006 at 06:33:48AM -0400, Bruce Lilly wrote:
> On Wed September 6 2006 18:58, Ned Freed wrote:
> > > I concur with the need to maintain the current charset registry to
> > > support legacy apps that use it.
> >
> > > And I think Ned would be an excellent choice for reviewer, though it
> > > wouldn' t bother me if he could have the assistance of people with
> > > specialized expertise in Asian writing schemes.
> >
> > Any such assistance would be hugely welcome. As an aside, it would also be nice
> > if more people would post comments to the list...
>
> OK. I concur with most of what has already been said by others, specifically
> that if a charset (i.e. something meeting the definition of charset) is in
> use, it ought to be registered; using the registry as a way to force some
> agenda is a very bad idea. Also that Ned would be an excellent choice for
> reviewer, and I would add that I fully support his stated plan to overhaul
> the existing registry, which has long been in need of such an overhaul (e.g.
> the registration procedure has long said that "ASCII" is disallowed, yet it
> is in fact registered as an alias).

There seems to be a problem here, but maybe it is the procedures that should
be revised, as ASCII is a well-known name for a specific character set.

I can agree that ASCII should not be the recommended name for that specific
charset.

Best regards
keld
Bruce Lilly
2006-09-07 21:17:18 UTC
Permalink
[cc's trimmed]
On Thu September 7 2006 11:56, Keld Jørn Simonsen wrote:
> On Thu, Sep 07, 2006 at 06:33:48AM -0400, Bruce Lilly wrote:
> > the registration procedure has long said that "ASCII" is disallowed, yet it
> > is in fact registered as an alias).
>
> There seems to be a problem here, but maybe it whould then be the
> procedures that be revised, as ASCII is a well known name for a specific
> character set.

Quoting RFC 2046:
" The character set name "ASCII" is reserved and must not
be used for any purpose.
"

The same text appeared in RFC 1521 and in RFC 1341, dated 1992.

[corrigenda] The current registration procedures per se do not
forbid "ASCII", but MIME initially established charset registrations,
and the "for any purpose" certainly seems clear. Quoting RFC 1341:
" Several other MIME fields, notably
including character set names, are likely to have new values
defined over time. In order to ensure that the set of such
values is developed in an orderly, well-specified, and
public manner, MIME defines a registration process which
uses the Internet Assigned Numbers Authority (IANA) as a
central registry for such values.
"
And RFC 1341 Appendix F section 2 is, as far as I know, the initial
(abbreviated) character set registration procedure.

So at least as far as MIME is concerned, "ASCII" has always been
forbidden; the default and preferred MIME name for ANSI X3.4 is
"US-ASCII".

One problem is that "ASCII" has been [mis]used for things other than
one specific character set and is therefore not unambiguous.

Also, we should distinguish informal usage from registered names
used in protocols.

As with most IANA registries, it would be quite unwise to remove something
once registered. So I wouldn't want to simply remove "ASCII" leaving no
trace in case there is some archived content which used that alias in spite
of the prohibition against such use. I would support a mechanism to mark
(clearly, and in the registry) a name as deprecated, along with a
"MUST NOT generate" rule applicable to deprecated names.

------------------

Another footnote: By noting that Ned would make a fine charset reviewer,
I am not indicating any fault with Paul Hoffman, who is still listed as
charset reviewer on the IANA site (http://www.iana.org/numbers.html#C)
and who has done a fine job as evidenced by his past participation on this
mailing list.
Keld Jørn Simonsen
2006-09-08 02:41:55 UTC
Permalink
On Thu, Sep 07, 2006 at 05:17:18PM -0400, Bruce Lilly wrote:
> [cc's trimmed]
> On Thu September 7 2006 11:56, Keld Jørn Simonsen wrote:
> > On Thu, Sep 07, 2006 at 06:33:48AM -0400, Bruce Lilly wrote:
> > > the registration procedure has long said that "ASCII" is disallowed, yet it
> > > is in fact registered as an alias).
> >
> > There seems to be a problem here, but maybe it is the procedures that should
> > be revised, as ASCII is a well-known name for a specific character set.
>
> Quoting RFC 2046:
> " The character set name "ASCII" is reserved and must not
> be used for any purpose.
> "

Well, that is fine for me, we can have the name registered but not used
for any purpose in IETF specs. I think this is what we meant by this
statement, when we wrote it.

> So at least as far as MIME is concerned, "ASCII" has always been
> forbidden; the default and preferred MIME name for ANSI X3.4 is
> "US-ASCII".

Agree

> One problem is that "ASCII" has been [mis]used for things other than
> one specific character set and is therefore not unambiguous.

Agree

> Also, we should distinguish informal usage from registered names
> used in protocols.
>
> As with most IANA registries, it would be quite unwise to remove something
> once registered. So I wouldn't want to simply remove "ASCII" leaving no
> trace in case there is some archived content which used that alias in spite
> of the prohibition against such use. I would support a mechanism to mark
> (clearly, and in the registry) a name as deprecated, along with a
> "MUST NOT generate" rule applicable to deprecated names.

Also agree with you here.

Best regards
Keld
Claus Färber
2006-09-16 01:44:22 UTC
Permalink
Bruce Lilly schrieb:
> One problem is that "ASCII" has been [mis]used for things other than
> one specific character set and is therefore not unambiguous.

Maybe it should be possible to register charsets that _are_ ambiguous. Of
course, there should be a warning not to use them if at all possible.
Still, some applications which don't know the charset (converters from
other formats) might make use of them if it's not feasible to detect the
true charset.

"ASCII" could be an alias for "UNKNOWN-7BIT" (or "UNKNOWN-ISO-646",
"UNKNOWN-ASCII"?)

Other possible ambiguous charsets could be:

"UNKNOWN-8BIT" (already used by some mail transport agents when
MIMEifying messages).
"UNKNOWN-EBCDIC"
"UNKNOWN-UTF16" with alias "UNICODE".
"UNKNOWN-ISO-8859" with alias "ANSI".
"UNKNOWN-IBMPC" with alias "OEM".

Claus
Frank Ellermann
2006-09-17 12:50:22 UTC
Permalink
Claus Färber wrote:

> "UNKNOWN-8BIT" (already used by some mail transport agents

First defined in RFC 1428, used in RFC 1700 and RFC 2557, it's
already registered.

> "UNKNOWN-UTF16"

What's the difference from UTF-16 ?

> with alias "UNICODE".

Ugh, thanks, but no thanks.

> "UNKNOWN-ISO-8859" with alias "ANSI".
> "UNKNOWM-IBMPC" with alias "OEM".

One of those could do, "unknown-ascii-8bit", alias "oem".

Frank
Claus Färber
2006-10-01 18:18:25 UTC
Permalink
Frank Ellermann schrieb:
> Claus Färber wrote:
>> "UNKNOWN-8BIT" (already used by some mail transport agents
> First defined in RFC 1428, used in RFC 1700 and RFC 2557, it's
> already registered.

Oops.

>> "UNKNOWN-UTF16"
> What's the difference from UTF-16 ?

UTF-16 "SHOULD be interpreted as being big-endian" if there's no BOM
(RFC 2781, section 4.3). UNKNOWN-UTF16 would not have such a fallback.

>> with alias "UNICODE".
> Ugh, thanks, but no thanks.

The idea is to deprecate the label "UNICODE" by tying it to an
incompletely specified charset.

>> "UNKNOWN-ISO-8859" with alias "ANSI".
>> "UNKNOWN-IBMPC" with alias "OEM".
>
> One of those could do, "unknown-ascii-8bit", alias "oem".

We already have UNKNOWN-8BIT.

When you convert legacy data, you often DO know that something is in a
DOSish (IBMPC-based) or Windowsish (ANSI-based) charset. Having charset
labels to carry this information (instead of the unspecified
UNKNOWN-8BIT) is a good idea.
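
A minimal illustration, assuming Python's built-in code pages and taking
cp437 and cp1252 as typical representatives of the two families:

    sample = b"\x82"
    print(sample.decode("cp437"))    # U+00E9, e with acute, in a common DOS code page
    print(sample.decode("cp1252"))   # U+201A, a low quotation mark, in windows-1252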

Claus
Frank Ellermann
2006-10-02 00:59:31 UTC
Permalink
Martin Duerst
2006-10-02 09:38:25 UTC
Permalink
Ned and I, as the newly appointed charset reviewers, plan to first
address pending registrations and, once they are dealt with, to look
at ways to clean up the registry.

At 03:18 06/10/02, Claus Färber wrote:
>Frank Ellermann schrieb:
>> Claus Färber wrote:
>>> "UNKNOWN-8BIT" (already used by some mail transport agents
>> First defined in RFC 1428, used in RFC 1700 and RFC 2557, it's
>> already registered.
>
>Oops.

From a purely personal viewpoint, this one actually occasionally
came in handy.

>>> "UNKNOWN-UTF16"
>> What's the difference from UTF-16 ?
>
>UTF-16 "SHOULD be interpreted as being big-endian" if there's no BOM, RFC 2781, 4.3. UNKNOWN-UTF16 would not have such a fall back.

Has UNKNOWN-UTF-16 been proposed formally, or is this just an
idea floated in an email? As a reviewer, I'd prefer to deal
with "really existing charsets" first.

>>> with alias "UNICODE".
>> Ugh, thanks, but no thanks.
>
>The idea is to deprecate the label "UNICODE" by tying it to an incompletely specified charset.

Personally, I agree with the idea of deprecating "Unicode".
As a charset reviewer, I think this should be done by just
noting the entry as DEPRECATED or OBSOLETE or some such,
rather than by registering additional aliases.

>>> "UNKNOWN-ISO-8859" with alias "ANSI".
>>> "UNKNOWM-IBMPC" with alias "OEM".
>> One of those could do, "unknown-ascii-8bit", alias "oem".
>
>We already have UNKNOWN-8BIT.
>
>When you convert legacy data, you often DO know that something is in a DOSish (IBMPC-based) or Windowsish (ANSI-based) charset. Having charset labels to carry this information (instead of the unspecified UNKNOWN-8BIT) is a good idea.

To repeat, as a reviewer, I'd prefer to deal with "really existing
charsets" first. We may be able to consider ideas such as these
later, if we look at more and less precise labels for encodings
(e.g. labels to indicate various variants of Shift_JIS).


Regards, Martin.


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Mark Davis
2006-10-02 15:22:33 UTC
Permalink
I'd suggest taking a look at the ICU charset data. This was gathered by
calling APIs on different platforms, instead of going by the documentation,
which was often false.

http://icu.sourceforge.net/charts/charset/
http://icu.sourceforge.net/charts/charset/roundtripIndex.html

The other thing that needs to be done is to establish criteria for identity. If
two mappings are identical except that one adds an additional mapping from
bytes to Unicode, which gets registered? Both? The subset? The superset?

There are literally hundreds of such cases, so without clarity it doesn't
help to propose registrations.
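
A rough sketch of the question, with made-up mapping tables (byte value to
Unicode code point):

    def classify(a, b):
        # a, b: byte value -> Unicode code point mappings
        if a == b:
            return "identical"
        if any(a[k] != b[k] for k in a.keys() & b.keys()):
            return "conflicting"
        return "same where both are defined; one may be a superset"

    chart  = {0x41: 0x0041}                  # a published chart leaves 0x80 undefined
    vendor = {0x41: 0x0041, 0x80: 0x0080}    # the shipped converter maps it anyway
    print(classify(chart, vendor))           # so: which of the two gets registered?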

Mark

On 10/2/06, Martin Duerst < ***@it.aoyama.ac.jp> wrote:
>
> Ned and me, as newly appointed charset reviewers,
> plan to first address pending registrations, and once
> they are dealt with, looking at ways to clean up the
> registry.
>
> At 03:18 06/10/02, Claus Färber wrote:
> >Frank Ellermann schrieb:
> >> Claus Färber wrote:
> >>> "UNKNOWN-8BIT" (already used by some mail transport agents
> >> First defined in RFC 1428, used in RFC 1700 and RFC 2557, it's
> >> already registered.
> >
> >Oops.
>
> From a purely personal viewpoint, this one actually occasionally
> came in handy.
>
> >>> "UNKNOWN-UTF16"
> >> What's the difference from UTF-16 ?
> >
> >UTF-16 "SHOULD be interpreted as being big-endian" if there's no BOM, RFC
> 2781, 4.3. UNKNOWN-UTF16 would not have such a fall back.
>
> Has UNKNOWN-UTF-16 been proposed formally, or is this just an
> idea floated in an email? As a reviewer, I'd prefer to deal
> with "really existing charsets" first.
>
> >>> with alias "UNICODE".
> >> Ugh, thanks, but no thanks.
> >
> >The idea is to deprecate the label "UNICODE" by tying it to an
> incompletely specified charset.
>
> Personally, I agree with the idea of deprecating "Unicode".
> As a charset reviewer, I think this should be done by just
> noting the entry as DEPRECATED or OBSOLETE or some such,
> rather than by registering additional aliases.
>
> >>> "UNKNOWN-ISO-8859" with alias "ANSI".
> >>> "UNKNOWN-IBMPC" with alias "OEM".
> >> One of those could do, "unknown-ascii-8bit", alias "oem".
> >
> >We already have UNKNOWN-8BIT.
> >
> >When you convert legacy data, you often DO know that something is in a
> DOSish (IBMPC-based) or Windowsish (ANSI-based) charset. Having charset
> labels to carry this information (instead of the unspecified UNKNOWN-8BIT)
> is a good idea.
>
> To repeat, as a reviewer, I'd prefer to deal with "really existing
> charsets" first. We may be able to consider ideas such as these
> later, if we look at more and less precise labels for encodings
> (e.g. labels to indicate various variants of Shift_JIS).
>
>
> Regards, Martin.
>
>
> #-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> #-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
>
>
Tim Bray
2006-09-06 23:27:46 UTC
Permalink
On Sep 6, 2006, at 2:45 PM, Keith Moore wrote:

> As for utf-8 vs. Unicode, this is a bit tricky. I agree that merely
> specifying Unicode isn't sufficient given the potential for
> incompatible CESs. And yet I'm sympathetic to the notion that UTF-8
> pessimizes storage and transmission of text written in certain
> languages. IMHO it's unreasonable to exclude the potential for a
> Unicode based CES that has more-or-less equivalent information
> density across a wide variety of languages. But I do think that
> use of
> multiple CESs in a new protocol should require substantial
> justification, and that UTF-8 should be presumed to be the CES of
> choice for any new protocol that requires ASCII compatibility for its
> character representation.

Agreed on all counts. Section 5.1 of RFC3470 (aka BCP70) says smart
things about this, referencing 2277. Basically, if you're going to
use XML, there's probably no point trying to legislate against UTF-16
since any conformant reader is required to accept it, and in practice
all known XML software can handle 8859 and Shift-JIS and EUC. But
if you're not doing XML, compulsory UTF-8 removes a lot of failure
points without costing much.

-Tim
Martin Duerst
2006-09-08 10:02:00 UTC
Permalink
At 06:45 06/09/07, Keith Moore wrote:
>I concur with the need to maintain the current charset registry to
>support legacy apps that use it.

I concur with Keith (and it seems almost everybody else) that we
still need a charset registry.

>And I think Ned would be an excellent choice for reviewer, though it
>wouldn' t bother me if he could have the assistance of people with
>specialized expertise in Asian writing schemes.

He would certainly have my assistance, for whatever it's worth.

>As for utf-8 vs. Unicode, this is a bit tricky. I agree that merely
>specifying Unicode isn't sufficient given the potential for
>incompatible CESs. And yet I'm sympathetic to the notion that UTF-8
>pessimizes storage and transmission of text written in certain
>languages.

True. The most affected languages are not CJK (Chinese, Japanese, Korean),
but those written in scripts that have most of their characters beyond
U+0800 yet don't need two bytes per character in a dedicated encoding,
i.e. all the Indic scripts, and so on. A serious part of the
overhead is often (but not always) compensated by the fact that
protocol or markup information is usually heavily ASCII-biased.
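
A rough way to see both effects, as a Python sketch with a made-up Devanagari
sample wrapped in a little ASCII markup:

    plain  = "\u0928\u092e\u0938\u094d\u0924\u0947"    # Devanagari: 3 bytes/char in UTF-8
    marked = '<p xml:lang="hi">' + plain + "</p>"      # the markup is pure ASCII

    for label, s in (("plain", plain), ("marked up", marked)):
        print(label, len(s.encode("utf-8")), "bytes in UTF-8 vs",
              len(s.encode("utf-16-be")), "in UTF-16-BE")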


Regards, Martin.


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Jefsey_Morfin
2006-09-08 12:45:00 UTC
Permalink
At 12:02 08/09/2006, Martin Duerst wrote:
>True. The most affected languages are not CJK (Chinese, Japanese,
>Korean), but all the scripts that have most of their characters beyond
>U+0800 but don't need two bytes to encode the particular script,
>i.e. all the Indian Scripts, and so on. A serious part of the
>overhead is often (but not always) compensated by the fact that
>protocol or markup information is usually heavily ascii-biased.

Correct, this is one part of their problem; the other is the
difference between graphemes and characters. However, the first
question is: is the charset registry meant to register existing
charsets, or is the IETF to standardise the new charsets and keyboards
that languages need? Or do you mean you would suggest designing new
charsets somewhere else and having them registered by the IETF with IANA?
jfc
Terje Bless
2006-09-06 22:03:23 UTC
Permalink
[ My apologies for replying to a reply ]

Ned Freed <***@mrochek.com> wrote:

>>I wonder if we are reaching the point at which a stronger "use Unicode on
>>the wire" recommendation would be in order. The implications of such a
>>recommendation would presumably include a 2978bis that made the requirements
>>for registration of a new charset _much_ tougher, e.g., requiring a
>>demonstration that the then-current version of Unicode cannot do the
>>relevant job and/or evidence that the newly-proposed charset is needed in
>>deployed applications.

The time is, IMO, certainly ripe for pushing UTF-8 much harder, but the place
to do so is *not* at IANA — the registry of assigned names and numbers; protocol
values — but rather in the development of new specifications.

Not even the Unicode Consortium envisions a mass conversion of all legacy
content into, say, UTF-8. The IANA registry's documentary function is quite
orthogonal to the desire to avoid defining new charsets or mandating or even
just enabling legacy charsets in new specifications.

If a charset exists it should, modulo other factors, be registered with IANA.

--
Everytime I write a rhyme these people thinks its a crime
I tell `em what's on my mind. I guess I'm a CRIMINAL!
I don't gotta say a word I just flip `em the bird and keep goin,
I don't take shit from no one. I'm a CRIMINAL!
Frank Ellermann
2006-09-06 22:48:50 UTC
Permalink
Ned Freed wrote:

> one of the things I will do is conduct a complete overhaul of
> the existing registry.
[...]
> I also wrote and continue to maintain a fairly full-features
> charset conversion library.

Then there are several sources: The ICU converters, your lib,
the Unicode mappings, and standards in cases like ISO 8859-11.

For the most important charsets these sources hopefully agree,
and maybe it's possible to use CharMapML to list most of them
(not for UTF7, UTF16-LE, UTF32-LE, BOCU-1 and SCSU). If needed
I could create a CharMapML file for UTF-1. Not because anybody
uses it, let alone "over the wire", but because it's a part of
the history. Less than 9,000 lines without cheating.

A complete CharMapML file for UTF-8 takes less than 40 lines;
US-ASCII, Latin-1, Latin-9, UTF16-BE, and UTF32-BE would be
shorter. Does "provide mappings" belong to what you have in
mind for a "registry cleanup" ?

Frank
Keld Jørn Simonsen
2006-09-07 15:59:02 UTC
Permalink
On Thu, Sep 07, 2006 at 12:48:50AM +0200, Frank Ellermann wrote:
> Ned Freed wrote:
>
> > one of the things I will do is conduct a complete overhaul of
> > the existing registry.
> [...]
> > I also wrote and continue to maintain a fairly full-features
> > charset conversion library.
>
> Then there are several sources: The ICU converters, your lib,
> the Unicode mappings, and standards in cases like ISO 8859-11.
>
> For the most important charsets these sources hopefully agree,
> and maybe it's possible to use CharMapML to list most of them
> (not for UTF7, UTF16-LE, UTF32-LE, BOCU-1 and SCSU). If needed
> I could create a CharMapML file for UTF-1. Not because anybody
> uses it, let alone "over the wire", but because it's a part of
> the history. Less than 9,000 lines without cheating.

There is also a fairly complete UNIX utility called recode,
which has extensive mappings.

Best regards
keld
Jefsey_Morfin
2006-09-06 23:02:19 UTC
Permalink
At 22:30 06/09/2006, Ned Freed wrote:
>Now, having said all this, I'm willing to take on the role of
>charset reviewer,
>but with the understanding that one of the things I will do is conduct a
>complete overhaul of the existing registry. [***] Such a substantive
>change will
>of course require some degree of oversight, which in turn means I'd
>like to see
>some commitment from the IESG of support for the effort.

+1

>As for qualifications, I did write the charset registration specification, and
>I also wrote and continue to maintain a fairly full-features charset
>conversion
>library. I can provide more detail if anyone cares.

I care.
thank you.
jfc
McDonald Ira
2006-09-07 18:04:49 UTC
Permalink
Hi,

+1 for Ned as Charset Reviewer!

-1 for the idea of relaxing the IESG policy that UTF-8
specifically MUST be supported by any new IETF protocol
to some statement that UTF-anything MUST be supported.

Besides MIME incompatibility, because of their BOM
dependence (and thus broken string parse/concatenate),
UTF-16-[endian] and UTF-32-[endian] are unsuitable for
wire protocols (the proper sphere of the IETF).

Cheers,
- Ira

PS - When I was writing the IANA Charset MIB (RFC 3808), I
wrote a parser (*) for the plaintext IANA Charset Registry
to generate updates to the MIB. It warns of RFC 2978
problems (e.g., a missing "cs" alias). I'd be happy to
enhance it to do a better check for RFC 2978 compliance.

* ftp://ftp.pwg.org/pub/pwg/pmp/tools/ianachar.c
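
A rough equivalent of that check in a few lines of Python - a sketch, not
Ira's tool, assuming the registry's plain-text "Name:" / "Alias:" layout and
a made-up local filename:

    def names_missing_cs_alias(path="character-sets.txt"):
        entries, name, aliases = [], None, []
        for line in open(path, encoding="ascii", errors="replace"):
            parts = line.split()
            if line.startswith("Name:") and len(parts) > 1:
                if name is not None:
                    entries.append((name, aliases))
                name, aliases = parts[1], []
            elif line.startswith("Alias:") and len(parts) > 1:
                aliases.append(parts[1])
        if name is not None:
            entries.append((name, aliases))
        return [n for n, al in entries
                if not any(a.lower().startswith("cs") for a in al)]

    print(names_missing_cs_alias())       # names with no RFC 2978 "cs" alias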


Ira McDonald (Musician / Software Architect)
Blue Roof Music / High North Inc
PO Box 221 Grand Marais, MI 49839
phone: +1-906-494-2434
email: ***@sharplabs.com


Keith Moore
2006-09-07 18:39:38 UTC
Permalink
> Besides MIME incompatibility, because of their BOM
> dependence (and thus broken string parse/concatenate),
> UTF-16-[endian] and UTF-32-[endian] are unsuitable for
> wire protocols (the proper sphere of the IETF).

agree. but I would not want to preclude future use of a better CES than
utf-8. I just don't think it exists yet.
Kenneth Whistler
2006-09-07 19:15:37 UTC
Permalink
> > Besides MIME incompatibility,

That aside...

> > because of their BOM
> > dependence (and thus broken string parse/concatenate),
> > UTF-16-[endian] and UTF-32-[endian]

This is an incorrect characterization of those CESs.

UTF-16BE, UTF-16LE (and UTF-32BE and UTF-32LE) explicitly
*disallow* BOM, and thus are *not* broken for string parse/concatenate.

The UTF-16 encoding scheme (no "LE", no "BE") is the one that
interprets an initial U+FEFF as a BOM.

> > are unsuitable for
> > wire protocols (the proper sphere of the IETF).

By the way, I'm not arguing that any of those are particularly suitable
for a wire protocol, compared to UTF-8 -- just noting that the
dependence on BOM does *not* occur for any of the explicit BE or LE
CESs.
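
A small demonstration of the distinction, using Python's codecs (the native
"utf-16" codec's byte order is platform-dependent, which doesn't matter here):

    a, b = "foo", "bar"

    # The UTF-16 encoding scheme: each encode() emits a BOM, so naive
    # concatenation leaves a stray U+FEFF in the middle of the text.
    print(repr((a.encode("utf-16") + b.encode("utf-16")).decode("utf-16")))
    # -> 'foo\ufeffbar'

    # The explicit-BE form never emits a BOM, so concatenation is safe.
    print(repr((a.encode("utf-16-be") + b.encode("utf-16-be")).decode("utf-16-be")))
    # -> 'foobar'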

--Ken

>
> agree. but I would not want to preclude future use of a better CES than
> utf-8. I just don't think it exists yet.
Kenneth Whistler
2006-09-07 22:17:20 UTC
Permalink
Forwarding a contribution from Mark Davis.

--Ken

------------- Begin Forwarded Message -------------

From: Mark Davis <***@icu-project.org>
Date: Sep 6, 2006 4:44 PM
Subject: Re: Volunteer needed to serve as IANA charset reviewer
...

If the registry provided an unambiguous, stable definition of each charset
identifier in terms of an explicit, available mapping to Unicode/10646
(whether the UTF-8 form of Unicode or the UTF-32 code points -- that is just
a difference in format, not content), it would indeed be useful. However, I
suspect quite strongly that it is a futile task. There are a number of
problems with the current registry.

1. Poor registrations (minor)
There are some registered charset names that are not syntactically compliant
to the spec.

2. Incomplete (more important)
There are many charsets (such as some windows charsets) that are not in the
registry, but that are in *far* more widespread use than the majority of the
charsets in the registry. Attempted registrations have just been left
hanging, cf. http://mail.apps.ietf.org/ietf/charsets/msg01510.html

3. Ill-defined registrations (crucial)
a) There are registered names that have useless (inaccessible or unstable)
references; there is no practical way to figure out what the charset
definition is.
b) There are other registrations that are defined by reference to an
available chart, but when you actually test what the vendor's APIs map to,
they actually *use* a different definition: for example, the chart may say
that 0x80 is undefined, but actually map it to U+0080.
c) The RFC itself does not settle important issues of identity among
charsets. If a new mapping is added to a charset converter, is that a
different charset (and thus needs a different registration) or not? Does
that go for any superset? etc. We've raised these issues before, but with no
resolution (or even an attempt at one). Cf.
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html

As a product of the above problems, the actual results obtained by using the
IANA charset names on any given platform* may vary wildly. For example,
among the IANA-registry-named charsets, there were over a million different
mapping differences between Sun's and IBM's Java, total.

* "platform" speaking broadly -- the results may vary by OS (Mac vs Windows
vs Linux...), by programming language (Java), or by version of programming
language runtime (IBM vs Sun's Java), or even by product (database version).

In ICU, for example, our requirement was to be able to reproduce the actual,
observable character conversions in effect on any platform. With that
goal, we basically had to give up trying to use the IANA registry at all. We
compose mappings by scraping: calling the APIs on those platforms to do
conversions and collecting the results, and providing a different internal
identifier for any differing mapping. We then have a separate name mapping
that goes from each platform's name (the name according to that platform)
for each charset to the unique identifier. Cf.
http://icu.sourceforge.net/charts/charset/.

And based on work here at Google, it is pretty clear that -- at least in
terms of web pages -- little reliance can be placed on the charset
information. As imprecise as heuristic charset detection is, it is more
accurate than relying on the charset tags in the html meta element (and what
is in the html meta element is more accurate than what is communicated by
the http protocol).

So while I applaud your goal, I would suspect that it would be a huge
amount of effort for very little return.

Mark


------------- End Forwarded Message -------------
Frank Ellermann
2006-09-08 00:18:20 UTC
Permalink
Kenneth Whistler wrote:
> Forwarding a contribution from Mark Davis.

> http://mail.apps.ietf.org/ietf/charsets/msg01510.html

That belongs to what was announced as "considerable backlog":
It's not about the number of pending requests (a dozen or so),
but about their age (minus two they are from last year).

> useless (inaccessable or unstable) references; there is no
> practical way to figure out what the charset definition is.

Yes, that's bad. As a first approximation I'd try something
like http://purl.net/net/cp/874 - but that also changed; some
years ago I could ask for 1004, now I'm supposed to know what
it is. Apparently ICU had its very own "decruft" experience.

> The RFC itself does not settle important issues of identity
> among charsets. If a new mapping is added to a charset
> converter, is that a different charset (and thus needs a
> different registration) or not?

In CharMapML you've allowed version info for minor cases. Some
historic oddities like 0x1A -> 0x7F -> 0x1C -> 0x1A are IMHO
no longer relevant. Others like "modulo 190 Unicode" might be
interesting, if compared with "modulo 243 distances". Just an
example.

> it would be a huge amount of effort for very little return.

Meaning what, delete registry and RFC as hopeless ? Looking
at the list in RFC 3808 we're talking about 250 charsets.

If you want to resolve ambiguities like the "IBM rotation"
(see above), PC graphics for C0, and cases like "in theory
unassigned" vs. "in practice roundtrip" for 1252 etc, what is
it, 1000 charsets, how much do you have in ICU ?

> I'm being blocked

Maybe the list server has only your @jtcsv.com address.

Frank
Tim Bray
2006-09-06 20:04:17 UTC
Permalink
On Sep 6, 2006, at 11:45 AM, John C Klensin wrote:

> I wonder if we
> are reaching the point at which a stronger "use Unicode on the
> wire" recommendation would be in order. The implications of
> such a recommendation would presumably include a 2978bis that
> made the requirements for registration of a new charset _much_
> tougher, e.g., requiring a demonstration that the then-current
> version of Unicode cannot do the relevant job and/or evidence
> that the newly-proposed charset is needed in deployed
> applications.

Yes, please. Unicode is not perfect, but it's actually very good in
quite a few different ways, and network effects have pretty well
taken over, so if you're doing any text on the network at all, it's
almost certainly the right thing to use Unicode.

As a Certified Unicode Bigot(tm) I would volunteer to help with such
a redraft.

-Tim
John C Klensin
2006-09-06 18:45:27 UTC
Permalink
(IETF list removed, since this is about to become specialized)

--On Wednesday, 06 September, 2006 11:04 -0700 Ted Hardie
<***@qualcomm.com> wrote:

> The Applications Area is soliciting volunteers willing to
> serve as the IANA charset reviewer. This position entails
> reviewing charset registrations submitted to IANA in
> accordance with the procedures set out in RFC 2978. It
> requires the reviewer to monitor discussion on the
> ietf-charsets mailing list (moderating it, if necessary); it
> also requires that the reviewer interact with the registrants
> and IANA on the details of the registration. There is
> currently a small backlog, and it will be necessary to work to
> resolve that backlog during the initial period of the
> appointment.
>...

Perhaps the need for a new volunteer in this area is the time to
ask a broader question:

At the time 2978 (and its predecessor, 2278) were defined, there
were a large number of charsets in heavy use and there was some
general feeling in the implementer community that, despite the
provisions of RFC 2277, Unicode/ISO 10646 were not quite ready.
Although we probably still have some distance to go (the issues
with my net-Unicode draft may be illustrative), I wonder if we
are reaching the point at which a stronger "use Unicode on the
wire" recommendation would be in order. The implications of
such a recommendation would presumably include a 2978bis that
made the requirements for registration of a new charset _much_
tougher, e.g., requiring a demonstration that the then-current
version of Unicode cannot do the relevant job and/or evidence
that the newly-proposed charset is needed in deployed
applications.

This question is motivated, not by a strong love for Unicode,
but by the observation that RFC 2277 requires it and that the
IETF is shifting toward it in a number of areas. More options
and possibilities for local codings that are not generally known
and supported do not help with interoperability; perhaps it is
time to start pushing back.

And that, of course, would dramatically change the work of the
charset reviewer by reducing the volume but increasing the
amount of evaluation to be done.

john
Mark Davis
2006-09-06 23:44:42 UTC
Permalink
If the registry provided an unambiguous, stable definition of each charset
identifier in terms of an explicit, available mapping to Unicode/10646
(whether the UTF-8 form of Unicode or the UTF-32 code points -- that is just
a difference in format, not content), it would indeed be useful. However, I
suspect quite strongly that it is a futile task. There are a number of
problems with the current registry.

1. Poor registrations (minor)
There are some registered charset names that are not syntactically compliant
to the spec.

2. Incomplete (more important)
There are many charsets (such as some windows charsets) that are not in the
registry, but that are in *far* more widespread use than the majority of the
charsets in the registry. Attempted registrations have just been left
hanging, cf. http://mail.apps.ietf.org/ietf/charsets/msg01510.html

3. Ill-defined registrations (crucial)
a) There are registered names that have useless (inaccessible or unstable)
references; there is no practical way to figure out what the charset
definition is.
b) There are other registrations that are defined by reference to an
available chart, but when you actually test what the vendor's APIs map to,
they actually *use* a different definition: for example, the chart may say
that 0x80 is undefined, but actually map it to U+0080.
c) The RFC itself does not settle important issues of identity among
charsets. If a new mapping is added to a charset converter, is that a
different charset (and thus needs a different registration) or not? Does
that go for any superset? etc. We've raised these issues before, but with no
resolution (or even an attempt at one). Cf.
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html

As a product of the above problems, the actual results obtained by using the
IANA charset names on any given platform* may vary wildly. For example,
among the IANA-registry-named charsets, there were over a million different
mapping differences between Sun's and IBM's Java, total.

* "platform" speaking broadly -- the results may vary by OS (Mac vs Windows
vs Linux...), by programming language (Java), or by version of programming
language runtime (IBM vs Sun's Java), or even by product (database version).

In ICU, for example, our requirement was to be able to reproduce the actual,
observable character conversions in effect on any platform. With that
goal, we basically had to give up trying to use the IANA registry at all. We
compose mappings by scraping: calling the APIs on those platforms to do
conversions and collecting the results, and providing a different internal
identifier for any differing mapping. We then have a separate name mapping
that goes from each platform's name (the name according to that platform)
for each charset to the unique identifier. Cf.
http://icu.sourceforge.net/charts/charset/.

And based on work here at Google, it is pretty clear that -- at least in
terms of web pages -- little reliance can be placed on the charset
information. As imprecise as heuristic charset detection is, it is more
accurate than relying on the charset tags in the html meta element (and what
is in the html meta element is more accurate than what is communicated by
the http protocol).

So while I applaud your goal, I would suspect that it would be a huge
amount of effort for very little return.

Mark


> I agree that we've reached a point where "use UTF-8" is what we need to be
> pushing for in new protocol development. (Note that I said UTF-8 and not
> Unicode - given the existence of gb18030 [*] I don't regard a recommendation
> of "use Unicode" as even close to sufficient. The last thing we want is to
> see the development of specialized Unicode CESes for Korean, Japanese,
> Arabic, Hebrew, Thai, and who knows what else.) And if the reason for new
> charset registrations were a perceived need to have new charsets for use in
> new protocols, I would be in total agreement that a change in focus for
> charset registration is in order.
>
> But that's not why we're seeing new registrations. The new registrations
> we're seeing are of legacy charsets used in legacy applications and protocols
> that for whatever reason never got registered previously. Given that these
> things are in use in various nooks and crannies around the world, it is
> critically important that when they are used they are labelled accurately and
> consistently.
>
> The plain fact of the matter is that we have done a miserable job of
> producing an accurate and useful charset registry, and considerable work
> needs to be done both to register various missing charsets and to clean up
> the existing registry, which contains many errors. I've seen no interest
> whatsoever in registering new charsets for new protocols, so to my mind
> pushing back on, say, the recent registration of iso-8859-11, is an
> overreaction to a non-problem. [**]
>
> > This question is motivated, not by a strong love for Unicode,
> > but by the observation that RFC 2277 requires it and that the
> > IETF is shifting toward it in a number of areas. More options
> > and possibilities for local codings that are not generally known
> > and supported do not help with interoperability; perhaps it is
> > time to start pushing back.
>
> Well, I have to say that to the extent we've pushed back on registrations,
> what we've ended up with is an ad-hoc mess of unregistered usage. I am
> therefore quite skeptical of any belief that pushing back on registrations
> is a useful tactic.
>
> > And that, of course, would dramatically change the work of the
> > charset reviewer by reducing the volume but increasing the
> > amount of evaluation to be done.
>
> Even if we closed the registry completely there would still be a bunch of
> work to do in terms of registry cleanup.
>
> Now, having said all this, I'm willing to take on the role of charset
> reviewer, but with the understanding that one of the things I will do is
> conduct a complete overhaul of the existing registry. [***] Such a
> substantive change will of course require some degree of oversight, which in
> turn means I'd like to see some commitment from the IESG of support for the
> effort.
>
> As for qualifications, I did write the charset registration specification,
> and I also wrote and continue to maintain a fairly full-featured charset
> conversion library. I can provide more detail if anyone cares.
>
> Ned
>
> [*] - For those not fully up to speed on this stuff, gb18030 can be seen as
> an encoding of Unicode that is backwards compatible with the previous
> simplified Chinese charsets gb2312 and gbk.
>
> [**] - The less recent attempt to register ISO-2022-JP-2004 is a more
> interesting case. I believe this one needed to be pushed on, but not
> because of potential use in new applications or protocols.
>
> [***] - I have the advantage of being close enough to IANA that I can drive
> over there and have F2F meetings should the need arise - and I suspect
> it will.
Mark Davis
2006-09-07 01:47:11 UTC
Permalink
This appears to have bounced from ietf-***@iana.org on first send. --
MD

On 9/6/06, Mark Davis <***@icu-project.org> wrote:
>
> If the registry provided an unambiguous, stable definition of each charset
> identifier in terms of an explicit, available mapping to Unicode/10646
> (whether the UTF-8 form of Unicode or the UTF-32 code points -- that is just
> a difference in format, not content), it would indeed be useful. However, I
> suspect quite strongly that it is a futile task. There are a number of
> problems with the current registry.
>
> 1. Poor registrations (minor)
> There are some registered charset names that are not syntactically
> compliant to the spec.
>
> 2. Incomplete (more important)
> There are many charsets (such as some windows charsets) that are not in
> the registry, but that are in *far* more widespread use than the majority of
> the charsets in the registry. Attempted registrations have just been left
> hanging, cf http://mail.apps.ietf.org/ietf/charsets/msg01510.html
> <http://www.google.com/url?sa=D&q=http%3A%2F%2Fmail.apps.ietf.org%2Fietf%2Fcharsets%2Fmsg01510.html>
>
> 2. Ill-defined registrations (crucial)
> a) There are registered names that have useless (inaccessable or
> unstable) references; there is no practical way to figure out what the
> charset definition is.
> b) There are other registrations that are defined by reference to an
> available chart, but when you actually test what the vendor's APIs map to,
> they actually *use* a different definition: for example, the chart may say
> that 0x80 is undefined, but actually map it to U+0080.
> c) The RFC itself does not settle important issues of identity among
> charsets. If a new mapping is added to a charset converter, is that a
> different charset (and thus needs a different registration) or not? Does
> that go for any superset? etc. We've raised these issues before, but with no
> resolution (or even attempt at one) Cf.
> http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html<http://www.google.com/url?sa=D&q=http%3A%2F%2Fdev.icu-project.org%2Fcgi-bin%2Fviewcvs.cgi%2F*checkout*%2Ficuhtml%2Fdesign%2Fcharset_questions.html>
>
> As a product of the above problems, the actual results obtained by using
> the iana charset names on any given platform* may vary wildly. For example,
> among the iana-registry-named charsets, there were over a million different
> mapping differences between Sun's and IBM's Java, total.
>
> * "platform" speaking broadly -- ithe results may vary by OS (Mac vs
> Windows vs Linux...), by programming language [Java) or by version of
> programming language runtime (IBM vs Sun's Java), or even by product
> (database version).
>
> In ICU, for example, our requirement was to be able to reproduce the
> actual, observeable, character conversions in effect on any platform. With
> that goal, we basically had to give up trying to use the IANA registry at
> all. We compose mappings by scraping; calling the APIs on those platforms to
> do conversions and collecting the results, and providing a different
> internal identifier for any differing mapping. We then have a separate name
> mapping that goes from each platform's name (the name according to that
> platform) for each character to the unique identifier. Cf.
> http://icu.sourceforge.net/charts/charset/.
>
> And based on work here at Google, it is pretty clear that -- at least in
> terms of web pages -- little reliance can be placed on the charset
> information. As imprecise as heuristic charset detection is, it is more
> accurate than relying on the charset tags in the html meta element (and what
> is in the html meta element is more accurate than what is communicated by
> the http protocol).
>
> So while I applaud your goal, I would suspect that that it would be a huge
> amount of effort for very little return.
>
> Mark
>
>
>
> > I agree that we've reached a point where "use UTF-8" is what we need to
> be
> > pushing for in new protocol development. (Note that I said UTF-8 and not
> > Unicode - given the existance of gb18030 [*] I don't regard a
> recommendation of
> > "use Unicode" as even close to sufficient. The last thing we want is to
> see the
> > development of specializesd Unicode CESes for Korean, Japanese, Arabic,
> Hebrew,
> > Thai, and who knows what else.) And if the reason there are new charset
> > registrations was because of the perceived need to have new charsets for
> use in
> > new protocols, I would be in total agreement that a change in focus for
> charset
> > registration is in order.
> >
> > But that's not why we're seeing new registrations. The new registrations
> we're
> > seeing are of legacy charsets used in legacy applications and protocols
> that
> > for whatever reason never got registered previously. Given that these
> things
> > are in use in various nooks and crannies around the world, it is
> critically
> > important that when they are used they are labelled accurately and
> > consistently.
> >
> > The plain fact of the matter is that we have done a miserable job of
> producing
> > an accurate and useful charset registry, and considerable work needs to
> be done
> > both to register various missing charsets as well as to clean up the
> existing
> > registry, which contains many errors. I've seen no interest whatsoever
> in
> > registering new charsets for new protocols, so to my mind pushing back
> on, say,
> > the recent registration of iso-8859-11, is an overreaction to a
> non-problem.
> > [**]
> >
> > > This question is motivated, not by a strong love for Unicode,
> > > but by the observation that RFC 2277 requires it and that the
> > > IETF is shifting toward it in a number of areas. More options
> > > and possibilities for local codings that are not generally known
> > > and supported do not help with interoperability; perhaps it is
> > > time to start pushing back.
> >
> > Well, I have to say that to the extent we've pushed back on
> registrations, what
> > we've ended up with is ad-hoc mess of unregistered usage. I am therefore
> quite
> > skeptical of any belief that pushing back on registrations is a useful
> tactic.
> >
> > > And that, of course, would dramatically change the work of the
> > > charset reviewer by reducing the volume but increasing the
> > > amount of evaluation to be done.
> >
> > Even if we closed the registry completely there is still a bunch of work
> to do
> > in terms of registry cleanup.
> >
> > Now, having said all this, I'm willing to take on the role of charset
> reviewer,
> > but with the understanding that one of the things I will do is conduct a
> > complete overhaul of the existing registry. [***] Such a substantive
> change will
> > of course require some degree of oversight, which in turn means I'd like
> to see
> > some commitment from the IESG of support for the effort.
> >
> > As for qualifications, I did write the charset registration
> specification, and
> > I also wrote and continue to maintain a fairly full-features charset
> conversion
> > library. I can provide more detail if anyone cares.
> >
> > Ned
> >
> > [*] - For those not fully up to speed on this stuff, gb18030 can be seen
> as an
> > encoding of Unicode that is backwards compatible with the previous
> simplified
> > Chinese charsets gb2312 and gbk.
> >
> > [**] - The less recent attempt to register ISO-2022-JP-2004 is a more
> > interesting case. I believe this one needed to be pushed on, but not
> > because of potential use in new applications or protocols.
> >
> > [***] - I have the advantage of being close enough to IANA that I can
> drive
> > over there and have F2F meetings should the need arise - and I suspect
> > it will.
> >
>
Ted Hardie
2006-09-06 18:04:53 UTC
Permalink
The Applications Area is soliciting volunteers willing to serve as the
IANA charset reviewer. This position entails reviewing charset registrations
submitted to IANA in accordance with the procedures set out in
RFC 2978. It requires the reviewer to monitor discussion on the
ietf-charsets mailing list (moderating it, if necessary); it also requires
that the reviewer interact with the registrants and IANA on the
details of the registration. There is currently a small backlog, and
it will be necessary to work to resolve that backlog during the initial
period of the appointment.

If you are willing to serve in this capacity, please notify the
ADs by September 20th, 2006. A short summary of your experience
in the area should be included.
John C Klensin
2006-09-07 13:58:49 UTC
Permalink
Ned,

Several observations...

The first is that my note was intended as "is it time to review
RFC 2978 and the definition of the charset reviewer job". Just
a question. I had no expectation of discontinuing the current
registry, nor any realistic one of banning future registrations.
I think your comments, Mark's, and those of others are
consistent with my goal in asking the question. What should be
done is another matter -- see below.

Second, while I agree with your concern about GB 18030 and its
ilk, what I learned in trying to put a network-Unicode
definition together (see draft-klensin-net-utf8-01.txt) is that,
for practical use, just specifying "UTF-8" may not be good
enough either. For example, for at least most purposes other
than pure rendering, one probably wants to specify the
normalization form (ideally a "stable" one(++)) for text going
on the wire, so "Unicode, in Stable NFC, encoded in UTF-8" is
probably the level of specification we are looking for, not
"UTF-8". I deliberately said "Unicode" in my note, not because
I thought it was adequate, but because I was certain that it
would expose this issue if we got this far.
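The difference between "just UTF-8" and "NFC, then UTF-8" is visible in
a couple of lines; plain NFC via Python's unicodedata stands in here
for the "stable" form mentioned above, which no standard library
implements.

  import unicodedata

  decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
  composed = unicodedata.normalize("NFC", decomposed)  # single U+00E9

  print(decomposed.encode("utf-8"))  # b'e\xcc\x81'
  print(composed.encode("utf-8"))    # b'\xc3\xa9'
  print(decomposed == composed)      # False: same text, different code points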

If we really need to be pushing toward a specific encoding and
either the required specification of the normalization applied
or, preferably, a specific normalization, then RFC 2978 isn't
our only issue -- we need to review, and possibly reopen RFC
2277 and 3629 and might need to look at some other
specifications. Realizing this was what caused me to
temporarily put the network-Unicode draft on hold.

I am delighted that you would be willing to take this on -- I
think you have exactly the right combination of skill and
experience with both character sets and Internet application
protocols.

Your ability to do the currently-defined job, or a slightly
different one, is largely independent of whether the
specifications for new additions to the registry are what we
should have today. Clearly, the registry serves the purpose of
reducing the odds of the same name being used, inadvertently, to
describe different things and that is a benefit in itself. Mark
suggests that the definitions are not sufficiently consistent
and of high quality to be used for anything else. I think we
need to figure out what we need (does the current quality of
registrations meet your criteria for "accurately and
consistently"?) and then respecify things so that we get it on
future reservations (and maybe can ask IANA to send out requests
for clarification to relevant existing ones). Certainly your
notion of overhauling the current registry is consistent with
this... it even goes beyond what I had hoped there were energy
for.

You wrote...

> The plain fact of the matter is that we have done a miserable
> job of producing an accurate and useful charset registry, and
> considerable work needs to be done both to register various
> missing charsets as well as to clean up the existing registry,
> which contains many errors. I've seen no interest whatsoever in
> registering new charsets for new protocols, so to my mind
> pushing back on, say, the recent registration of iso-8859-11,
> is an overreaction to a non-problem. [**]

Speaking personally, we are in complete agreement.

> Well, I have to say that to the extent we've pushed back on
> registrations, what we've ended up with is ad-hoc mess of
> unregistered usage. I am therefore quite skeptical of any
> belief that pushing back on registrations is a useful tactic.

Also agree, regardless of what my note appeared to say (in the
interest of opening up exactly this discussion).

john

++ For those who have not been following that particular piece
of work, the Unicode Consortium now has a proposal for "Stable
Normalization Process" under public review (see
http://www.unicode.org/review/pr-95.html). It differs from the
existing normalization forms by applying additional prohibitions
on unassigned code points and problematic sequences and
originated from discussions about the conditions under which
IDNA and Stringprep could be migrated from Unicode 3.2 to
contemporary versions. I would encourage those in IETF who are
interested in these issues to review that proposal carefully and
comment on it as appropriate.
Martin Duerst
2006-09-09 06:23:39 UTC
Permalink
Hello Mark, others,

I think it's good to have such a collection of problems in the registry.
But I think it's also fair to say that what Mark lists as problems may
not in all cases actually be problems.

At 08:44 06/09/07, Mark Davis wrote:
>If the registry provided an unambiguous, stable definition of each charset identifier in terms of an explicit, available mapping to Unicode/10646 (whether the UTF-8 form of Unicode or the UTF-32 code points -- that is just a difference in format, not content), it would indeed be useful. However, I suspect quite strongly that it is a futile task. There are a number of problems with the current registry.

I think the request for an explicit, fixed mapping is a good one,
but in some cases, it would come at a high cost. A typical example
is Shift_JIS: We know there are many variants, on the one hand due
to additions made by makers (or even privately), on the other hand
also due to some changes in the underlying standard (which go back
to 1983).

For an example like Shift_JIS, the question becomes whether
we want to use a single label, or whether we want to carefully
label each variant.

Benefits for a single label:
- Better chance that the recipient knows the label and can do
something with it.
- Better chance to teach people about different encodings
(it's possible to teach people the difference between Shift_JIS,
EUC-JP, and UTF-8, but it would be close to impossible to
teach them about all the various variants)
- No 'overlabeling' (being more precise than necessary, for the
cases (the huge majority) where in the actual data, there is
no difference).
- Usually enough for visual decoding (reading of emails and
Web pages by humans)
- Not influenced by issues outside actual character encoding
(e.g. error treatment of APIs)

Benefits for detailed labeling:
- Accurate data transmission for all data, even fringe cases
- True round-trips for a wider range of scenarios
- May be better suited for machine-to-machine processing


>2. Incomplete (more important)
>There are many charsets (such as some windows charsets) that are not in the registry, but that are in *far* more widespread use than the majority of the charsets in the registry. Attempted registrations have just been left hanging, cf http://mail.apps.ietf.org/ietf/charsets/msg01510.html

Some of this is due to the original rather strict management of the registry.
Some of it is due to the current backlog. A lot of this is due to the fact
that nobody cares enough to work through the registration process; many
people think that 'the IETF' or 'the IANA' will just do it. The solution
is easy: Don't complain, register.


>2. Ill-defined registrations (crucial)
> a) There are registered names that have useless (inaccessable or unstable) references; there is no practical way to figure out what the charset definition is.

This is possible; some of these registrations are probably irrelevant,
while for others it's pretty clear what the charset means in general, even
though there might be implementation differences for some codepoints.


> b) There are other registrations that are defined by reference to an available chart, but when you actually test what the vendor's APIs map to, they actually *use* a different definition: for example, the chart may say that 0x80 is undefined, but actually map it to U+0080.

It's clear that for really faulty charts, the vendors should be blamed,
and not the registry.

However, the difference between the published map and the
actually used API may be due to the fact that 0x80 is indeed not
part of the encoding as formally defined, and is mapped to U+0080
just as part of error treatment. For most applications (not for
all necessarily), it would be a mistake to include error processing
in the formal definition of an encoding.
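To make that distinction concrete, here is a small sketch in which
Python's bundled codecs stand in for the vendor converters discussed
above, with 0x81 (left undefined in the published windows-1252 chart)
as the example byte:

  raw = b"\x81"

  print(repr(raw.decode("latin-1")))    # '\x81': a 1:1 mapping to U+0081
  try:
      raw.decode("cp1252")              # strict: the byte has no mapping at all
  except UnicodeDecodeError as exc:
      print("cp1252, strict:", exc.reason)
  print(repr(raw.decode("cp1252", errors="replace")))  # '\ufffd' substitution

  # All three results are observable "mappings", but only the strict one
  # reflects the formal definition; the other two are error treatment.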


> c) The RFC itself does not settle important issues of identity among charsets. If a new mapping is added to a charset converter, is that a different charset (and thus needs a different registration) or not? Does that go for any superset? etc. We've raised these issues before, but with no resolution (or even attempt at one) Cf. http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html

It seems that what you would want, for your purposes, is to use a new
label if a new character gets added to a legacy encoding, but not use
a new label e.g. for UTF-8 each time a character gets added.
So things would be somewhat case-by-case.


>As a product of the above problems, the actual results obtained by using the iana charset names on any given platform* may vary wildly. For example, among the iana-registry-named charsets, there were over a million different mapping differences between Sun's and IBM's Java, total.

It would be better to express these numbers in terms of percentages:
In the experiment made, how many codepoints were mapped, and for how many
did you get differences?

Even better would be to express these numbers in terms of percentages
of actual average data. My guess is that this would be much lower than
the percentage of code points.

This is in no way to belittle the problems, just to make sure we put
them in proportion.
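For what it's worth, reporting the difference between two converters as
a percentage of code points is only a few lines of the same scraping
idea; the two codec names below are placeholders and do not, of course,
reproduce the Sun-versus-IBM comparison.

  def single_byte_table(codec_name):
      table = {}
      for byte in range(256):
          try:
              table[byte] = bytes([byte]).decode(codec_name)
          except UnicodeDecodeError:
              table[byte] = None  # undefined in this converter
      return table

  a, b = single_byte_table("cp1252"), single_byte_table("latin-1")
  differing = sum(1 for byte in range(256) if a[byte] != b[byte])
  print("%d of 256 code points differ (%.1f%%)"
        % (differing, 100.0 * differing / 256))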


>In ICU, for example, our requirement was to be able to reproduce the actual, observeable, character conversions in effect on any platform. With that goal, we basically had to give up trying to use the IANA registry at all.

This is understandable. The start of the IANA registry was MIME, i.e.
email. The goal was to be able to (ultimately visually) decode the
characters at the other end.


>We compose mappings by scraping; calling the APIs on those platforms to do conversions and collecting the results, and providing a different internal identifier for any differing mapping. We then have a separate name mapping that goes from each platform's name (the name according to that platform) for each character to the unique identifier. Cf. http://icu.sourceforge.net/charts/charset/.

This is very thorough, and may look foolproof, but isn't.
One issue already mentioned is error behavior. If a first API
maps an undefined character to some codepoint, and a second
API maps it to another codepoint (or sequence), they just
made different decisions about error behavior (e.g. mapping
unknown codepoints to '?' or to one of the substitution
characters, or dropping them,...). This would be particularly
prominent when converting from Unicode to a legacy encoding,
because in this case there are tons of codepoints that
can't be converted. But this most probably should not
be part of the definition of a 'charset'.

Also, there are cases where there are no differences in
transcoding, but font differences. Examples are the treatment
of the backslash character on MS Windows systems (shown as a
Yen symbol because most (Unicode!) fonts on Japanese Windows
systems have it that way), or certain cases where the
traditional and (Japanese) simplified variant of a character
got exchanged when moving from the 1978 to the 1983 version of
the standard.


>And based on work here at Google, it is pretty clear that -- at least in terms of web pages -- little reliance can be placed on the charset information.

Yes, but this is a problem on a different level than the above.
Above, you are speaking about small variant differences.
Here, you are speaking about completely mislabelled pages.
The problems with small variants don't make the labels in
the registry unsuitable for labeling Web pages. Most Web
pages don't contain characters where the minor version
differences matter, because private use and corporate
characters don't work well on the Web, and because some
of the transcoding differences are between e.g. full-width
and half-width variants, which are mostly irrelevant for
viewing and in particular for search.


>As imprecise as heuristic charset detection is, it is more accurate than relying on the charset tags in the html meta element (and what is in the html meta element is more accurate than what is communicated by the http protocol).

This sounds reasonable.
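A rough sketch of comparing the three sources of charset information
for a page follows. It relies on the third-party chardet package for
the heuristic guess, and the meta regex is deliberately naive; both are
illustrative choices, not part of the work described above.

  import re
  import urllib.request
  import chardet

  def charset_report(url):
      with urllib.request.urlopen(url) as resp:
          body = resp.read()
          http_charset = resp.headers.get_content_charset()  # from Content-Type
      m = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9._-]+)', body[:4096], re.I)
      meta_charset = m.group(1).decode("ascii", "replace") if m else None
      guess = chardet.detect(body)  # heuristic detection over the raw bytes
      return {"http": http_charset, "meta": meta_charset,
              "detected": guess["encoding"], "confidence": guess["confidence"]}

  # e.g. charset_report("http://example.org/")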


>So while I applaud your goal, I would suspect that that it would be a huge amount of effort for very little return.

There are other protocols, in particular email. My understanding is
that for email, the situation is quite a bit better, because people
use a dedicated tool (their MUA) to write emails, and emails rarely
get transcoded on the character level, and there is no server involved,
whereas users use whatever they can get their hands on to create
Web pages, the server can mess things up (or help, in some cases),
and pages may get transcoded.


A reasonable conclusion from the above is that no one size fits all.
Many applications may be very well served with the granularity
of labels we have now. Some others may need more granularity.
We would either have to decide that we stay with the current
granularity, or that we maybe move to a system with multiple
levels of granularity.


Regards, Martin.




#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Frank Ellermann
2006-09-10 15:43:16 UTC
Permalink
Martin Duerst wrote:

> I think it's also fair to say that what Mark lists as
> problems may not in all cases actually be problems.

Anything he said can be done with CharMapML. I can't judge
your Shift_JIS example, but for windows-1252 it's pretty
obvious:

Probably the owners want to be free to do whatever pleases
them with the remaining five unassigned code points if
necessary. CharMapML can express that with its versions.

And "we" know that it uses 1:1 mappings in practice for these
five code points, CharMapML could express that as fallback
mappings.

If the most recent mapping can use the preferred MIME name,
that's good enough. Purists worried about persistence can
convert the document to one of the less obscure UTFs and
store it in that form.

> It's clear that for really faulty charts, the vendors should
> be blamed, and not the registry.

Nothing's wrong with the charts; there are two legitimate
aspects. One is "we'll continue to call it windows-1252 even
if we assign one of the five code points" (if that happens
anytime soon, before windows-1252 is history like all legacy
charsets).

Until then the other aspect is "just use U+0081 for 0x81",
because it's KISS. If you insist on it, throw an error; the
user can then decide if it's completely mislabelled.

> For most applications (not for all necessarily), it would
> be a mistake to include error processing in the formal
> definition of an encoding.

If I saw a perfectly legal U+0080 in Latin-1 I'd guess that
this must be an error; probably the document is windows-1252.
Today "claims to be Latin-1" is about as convincing as "claims
to be ASCII" was when RFC 1341 was written. There's nothing
for the registry to do here; it offers windows-1252 for those
who want to get it right.

Maybe that's a good reason why registering 8859-11 but not 874
isn't ideal: we're not interested in folks using 8859-11 if
what they really mean is windows-874.

> It seems that what you would want, for your purposes, is to
> use a new label if a new character gets added to a legacy
> encoding

That's a very clean solution with its own drawbacks: my box
insists on saying 1004 for windows-1252, my browser erroneously
claims that this is Latin-1, and my box says 850 when it means
858. That's my local business; I know where to fix it before
it hits others.

> not use a new label e.g. for UTF-8 each time a character gets
> added. So things would be somewhat case-by-case.

Yes. For BOCU-1, SCSU, and the two UTF-*BE we don't need a
mapping. We also don't "need" it for UTF-1, UTF-8, and the
two UTF-*LE, but it's possible, e.g. about 1100 (long) lines
for UTF-16LE, or 3200 folded lines in a CharMapML mapping.

I like its <range> element - it took me some time to understand,
but UTF-8 in 48 folded lines is really nice.
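The space saving comes from collapsing contiguous 1:1 runs into single
range entries. The sketch below shows the idea on a single-byte table;
the output format is generic, not actual CharMapML syntax, and
windows-1252 is just a convenient example.

  def contiguous_runs(table):
      # Collapse a byte -> code point table into (first, last, first_cp) runs.
      runs = []
      for byte in sorted(table):
          cp = table[byte]
          if runs and byte == runs[-1][1] + 1 and cp == runs[-1][2] + (byte - runs[-1][0]):
              runs[-1] = (runs[-1][0], byte, runs[-1][2])
          else:
              runs.append((byte, byte, cp))
      return runs

  table = {b: ord(bytes([b]).decode("cp1252"))
           for b in range(256) if b not in (0x81, 0x8D, 0x8F, 0x90, 0x9D)}
  for first, last, cp in contiguous_runs(table):
      print("bytes %02X-%02X -> U+%04X..U+%04X" % (first, last, cp, cp + last - first))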

> This would be particularly prominent when converting from
> Unicode to a legacy encoding, because in this case there are
> tons of codepoints that can't be converted. But this most
> probably should not be part of the definition of a 'charset'.

CharMapML offers to specify a legacy SUB. Applications can
offer to use another legacy character like 0x7F or '?', or
throw an error. Or <shudder> silently drop it </shudder> -
but that's IMO on the wrong side of the border to "broken".

It's mildly interesting to minimize the reported errors; the
implementation details are irrelevant for the registry. I'd
like to get "official" mappings for most registered charsets,
and "pull the *.ucm mappings out of ICU, check that they're
okay, and host them at IANA" could make sense. The registry
format would stay mostly as is, adding URLs of "official"
mappings to checked entries.

Maybe join some entries, csUnicode + UTF-16, csUCS4 + UTF-32,
the works. Or explain what the difference is supposed to be.

Frank
Ned Freed
2006-09-09 13:39:06 UTC
Permalink
> Hello Mark, others,

> I think it's good to have such a collection of problems in the registry.
> But I think it's also fair to say that what Mark lists as problems may
> not in all cases actually be problems.

I agree. I also think that there's a bunch of low-hanging fruit here: Many (but
certainly not all) of the registry problems can be fixed without a huge
investment of time and effort.

Once the obvious stuff is addressed we can discuss how far we want to go,
especially with regard to versioning, variant tagging, and so on. But let's
please not get bogged down in the hard stuff before dealing with the easy
stuff.

> > If the registry provided an unambiguous, stable definition of each charset
> > identifier in terms of an explicit, available mapping to Unicode/10646
> > (whether the UTF-8 form of Unicode or the UTF-32 code points -- that is
> > just a difference in format, not content), it would indeed be useful.
> > However, I suspect quite strongly that it is a futile task. There are a
> > number of problems with the current registry.

> I think the request for an explicit, fixed mapping is a good one,
> but in some cases, it would come at a high cost. A typical example
> is Shift_JIS: We know there are many variants, on the one hand due
> to additions made by makers (or even privately), on the other hand
> also due to some changes in the underlying standard (which go back
> to 1983).

> For an example like Shift_JIS, the question becomes whether
> we want to use a single label, or whether we want to carefully
> label each variant.

Exactly so. And I would propose that we defer worrying about such tricky issues
until the obvious stuff is done. It is always important not to let the best
be the enemy of the good.

> >2. Incomplete (more important)

> > There are many charsets (such as some windows charsets) that are not in the
> > registry, but that are in *far* more widespread use than the majority of the
> > charsets in the registry. Attempted registrations have just been left
> > hanging, cf http://mail.apps.ietf.org/ietf/charsets/msg01510.html

> Some of this is due to the original rather strict management of the registry.
> Some of it is due to the current backlog. A lot of this is due to the fact
> that nobody cares enough to work through the registration process; many
> people think that 'the IETF' or 'the IANA' will just do it. The solution
> is easy: Don't complain, register.

And reviewing registration applications in a timely way might help encourage
more registration activity.

> > 2. Ill-defined registrations (crucial)

> > a) There are registered names that have useless (inaccessable or unstable)
> > references; there is no practical way to figure out what the charset definition
> > is.

> This is possible; some of these registrations are probably irrelevant,
> for others, it's pretty clear what the charset means in general, even
> though there might be implementation differences for some codepoints.

Well said. If nobody can figure out anything about a charset in the registry,
that's a pretty good indication that it's irrelevant, at least in terms of
current usage. I suspect quite a few of the incomplete entries fall into
this category.

> > b) There are other registrations that are defined by reference to an
> > available chart, but when you actually test what the vendor's APIs map to, they
> > actually *use* a different definition: for example, the chart may say that 0x80
> > is undefined, but actually map it to U+0080.

> It's clear that for really faulty charts, the vendors should be blamed,
> and not the registry.

Well, to be fair, vendors sometimes add weird mappings in response to customer
demand. For example, I've seen codepoints specific to some Microsoft variant of
a national standard charset creep into usage of the standard charset. In such
cases what's a vendor to do when they have a bunch of customers saying "we
don't care what the standard is, either add these codepoints or we'll switch to
the competitor's product that does do this"?

> However, the difference between the published map and the
> actually used API may be due to the fact that 0x80 is indeed not
> part of the encoding as formally defined, and is mapped to U+0080
> just as part of error treatment. For most applications (not for
> all necessarily), it would be a mistake to include error processing
> in the formal definition of an encoding.

Yes, that's an issue too. I've observed wide variations in the handling
of unassigned code points by different converters.

> > c) The RFC itself does not settle important issues of identity among
> > charsets. If a new mapping is added to a charset converter, is that a different
> > charset (and thus needs a different registration) or not? Does that go for any
> > superset? etc. We've raised these issues before, but with no resolution (or
> > even attempt at one) Cf. http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html

> It seems that what you would want, for your purposes, is to use a new
> label if a new character gets added to a legacy encoding, but not use
> a new label e.g. for UTF-8 each time a character gets added.
> So things would be somewhat case-by-case.

I agree - I don't think it is possible to codify a single "best practice" to
handle every case here.

> > As a product of the above problems, the actual results obtained by using the
> > iana charset names on any given platform* may vary wildly. For example, among
> > the iana-registry-named charsets, there were over a million different mapping
> > differences between Sun's and IBM's Java, total.

> It would be better to express these numbers in terms of percentages:
> In the experiment made, how many codepoints were mapped, and for how many
> did you get differences?

> Even better would be to express these numbers in terms of percentages
> of actual average data. My guess is that this would be much lower than
> the percentage of code points.

It certainly had better be, or things must not be working very well somewhere!

> This is in no way to belittle the problems, just to make sure we put
> it in proportions.

Right.

> > In ICU, for example, our requirement was to be able to reproduce the actual,
> > observeable, character conversions in effect on any platform. With that goal,
> > we basically had to give up trying to use the IANA registry at all.

> This is understandable. The start of the IANA registry was MIME, i.e.
> email. The goal was to be able to (ultimately visually) decode the
> characters at the other end.

And that tends to drive various tradeoffs in a particular direction. For
example, it tends to argue for fewer tags for minor charset variations because
that's easier to deploy and update.

> ...

> >And based on work here at Google, it is pretty clear that -- at least in
> > terms of web pages -- little reliance can be placed on the charset information.

> Yes, but this is a problem on a diffent level than the above.
> Above, you are speaking about small variant differences.
> Here, you are speaking about completely mislabled pages.

I don't see any way we can fix the mislabelling problem (which also exists in
email but takes on a somewhat different form there) in any direct way. However,
to the extent that mislabelling has occurred due to people getting confused by
our current registry mess, cleaning our own house might help a little. But
probably not much - a site that labels every message they send as iso-8859-1 no
matter what's actually in the message (and I've seen some big ones that do
this) isn't going to be cured by our having a perfectly accurate and totally
comprehensive registry. Like it or not, we cannot fix everything here.

> The problems with small variants don't make the labels in
> the registry unsuitable for labeling Web pages. Most Web
> pages don't contain characters where the minor version
> differences matter, because private use and corporate
> characters don't work well on the Web, and because some
> of the transcoding differences are between e.g. full-width
> and half-width variants, which are mostly irrelevant for
> viewing and in particular for search.

I've observed similar behavior in email. People are pretty adaptable and tend
to figure out what works and what doesn't fairly quickly.

> >So while I applaud your goal, I would suspect that that it would be a huge amount of effort for very little return.

> There are other protocols, in particular email. My understanding is
> that for email, the situation is quite a bit better, because people
> use a dedicated tool (their MUA) to write emails, and emails rarely
> get transcoded on the character level, and there is no server involved,
> whereas users use whatever they can get their hands on to create
> Web pages, the server can mess things up (or help, in some cases),
> and pages may get transcoded.

To be honest, it is hard for me to judge whether the situation for email is
better or worse than for the web or other protocols. Since I work on email
systems a lot more than I work on other stuff, I tend to see reports of lots
more problems with email. But this is probably more the result of being on the
receiving end of bug reports for email while not being on the receiving end of
bug reports for, say, a web client.

In any case, the Internet is now such a big place with so many nooks
and crannies that I have no idea how you'd figure out where the worst problems
are. And I'm not sure it matters either.

Ned
Bruce Lilly
2006-09-09 15:34:02 UTC
Permalink
On Sat September 9 2006 09:39, Ned Freed wrote:

> I agree. I also think that there's a bunch of low-hanging fruit here: Many (but
> certainly not all) of the registry problems can be fixed without a huge
> investment of time and effort.
>
> Once the obvious stuff is addressed we can discuss how far we want to go,
> especially in regards to versioning, variant tagging, and so on. But let's
> please not get bogged down in the hard stuff before dealing with the easy
> stuff.

A possibly good starting point would be the list of issues brought up the
last time somebody offered to get problems fixed; the thread on this mailing
list beginning with http://mail.apps.ietf.org/ietf/charsets/msg01495.html

One would certainly hope that some of those issues (e.g. removing silly
"Alias: None" lines[*], assigning cs* aliases where missing) could be (or
could have been) done without huge effort. Paul's message looked promising
(Apps AD support, etc.), yet most of the problems noted remain in the
registry some 21 months later...

---------------
* here follows a one-liner fix for the silly lines:
grep -v "^Alias.*: *None" <character-sets >foo && mv -f foo character-sets
Martin Duerst
2006-10-03 05:17:22 UTC
Permalink
Hello Mark,

We should definitely start to look at such issues once we have
processed the backlog of requests and have cleaned up some of
the garbage in the current registry.

On the side, I think it would be great if
http://icu.sourceforge.net/charts/charset/roundtripIndex.html
could be split up into some smaller pages. It's really huge.

Regards, Martin.

At 00:22 06/10/03, Mark Davis wrote:
>I'd suggest taking a look at the ICU charset data. This was gathered by calling APIs on different platforms, instead of going by the documentation, which was often false.
>
>http://icu.sourceforge.net/charts/charset/
>http://icu.sourceforge.net/charts/charset/roundtripIndex.html
>
>The other thing that needs to be done is establish criteria for identity. If two mappings are identical except that one adds an additional mapping from bytes to Unicode, which gets registered? Both? The subset? The superset?
>
>There are literally hundreds of such cases, so without clarity it doesn't help to propose registrations.
>
>Mark
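One way to make the identity question concrete is to classify the
relation between two observed byte-to-Unicode tables, for instance
those produced by scraping converter APIs. The function below is an
ad-hoc sketch (its name and return strings are invented here), not a
proposal for registry policy.

  def relation(a, b):
      # a and b map byte values to Unicode strings; undefined bytes are omitted.
      if a == b:
          return "identical"
      shared = set(a) & set(b)
      if any(a[k] != b[k] for k in shared):
          return "conflicting on shared code points"
      if set(b) < set(a):
          return "a is a strict superset of b"
      if set(a) < set(b):
          return "a is a strict subset of b"
      return "each adds mappings the other lacks"

  # e.g. relation(scraped_table_platform1, scraped_table_platform2)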



#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp