Registration of some code pages

Discussion:

Shawn Steele

2010-09-01 23:48:21 UTC

I’ve been asked to register a few code pages, but some of them are problematic because the names don’t match our behavior exactly. This is similar to the problem facing HTML5 when they are trying to map from the names used to the actual behavior that IE (and therefore others) use.

Specifically, I’m wondering about registering “windows 950”, but somehow annotating it that Microsoft typically redirects “big5” to that behavior. So an alias isn’t really appropriate.

Similarly, is there something we could add to the Windows-31J registration to recognize that Microsoft uses shift_jis to point to Windows-31J instead of using the registered form?

IMO there’s not much point in me registering “new” names to help people understand existing compatibility issues since the new names won’t be recognized by the existing implementations.

Thoughts,

- Shawn

http://blogs.msdn.com/shawnste
(Selfhost 7810)

Anne van Kesteren

2010-09-03 10:18:47 UTC

On Thu, 02 Sep 2010 01:48:21 +0200, Shawn Steele

Post by Shawn Steele
I’ve been asked to register a few code pages, but some of them are
problematic because the names don’t match our behavior exactly. This is
similar to the problem facing HTML5 when they are trying to map from the
names used to the actual behavior that IE (and therefore others) use.
Specifically, I’m wondering about registering “windows 950”, but somehow
annotating it that Microsoft typically redirects “big5” to that
behavior. So an alias isn’t really appropriate.
Similarly, is there something we could add to the Windows-31J
registration to recognize that Microsoft uses shift_jis to point to
Windows-31J instead of using the registered form?
IMO there’s not much point in me registering “new” names to help people
understand existing compatibility issues since the new names won’t be
recognized by the existing implementations.

Ideally the registry (or maybe a new one specific to the Web) reflects
what implementations do. If we all agree that "big5" means "windows 950"
then "big5" should just mean that even though originally the intention
might have been different. After all, that is what the running code is
doing. That would give the clearest guidance to new implementors and
allow existing implementors to converge.

--
Anne van Kesteren
http://annevankesteren.nl/

Shawn Steele

2010-09-03 16:19:49 UTC

I don't think it can be restricted to the web. For various reasons this difference could happen in other places too (MIME).

And "Big5" doesn't always mean "windows 950". It often does on a Windows box, but it might actually mean "Big5" on other machines. (I was going to provide examples, but there's a lot of variety across vendors; Wikipedia lists a lot of variations.)

HTML5 seems to be providing "clear guidance" to web developers, so if we had this kind of annotation, then HTML5 could point to that. However, even in HTML (especially in HTML?), big5 doesn't necessarily always mean Windows 950; it could actually be a different variant from whatever authoring system originated it.

Seems to me that, in practice, these labels narrow down the behavior, but there are variations. Sometimes only a couple codepoints, sometimes bigger, but a "wise" application would allow for imperfect code page identification. (Like the user drop-down to change code pages).

Ironically, IE9 betas are getting stricter about the page declarations, and I've started seeing the opposite (Arabic sites tagged as Arabic but they're actually UTF-8, etc.), which is, I guess, why people have been doing autodetection.
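
As a rough illustration of the two points above (a label only narrows down the behavior, and pages tagged with a legacy charset are sometimes really UTF-8), a tolerant decoder might try strict UTF-8 first and only then fall back to the declared label. This is a minimal Python sketch, not a description of what IE or any other product does; the function name `decode_tolerant` and the policy of preferring UTF-8 are assumptions made for the example.

```python
def decode_tolerant(data: bytes, declared: str) -> tuple[str, str]:
    """Return (text, encoding_used) for data labelled with `declared`.

    Hypothetical policy: strict UTF-8 is tried first, because legacy-encoded
    text containing non-ASCII bytes is rarely valid UTF-8 by accident.
    """
    try:
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Fall back to the declared label; undecodable bytes become U+FFFD.
    return data.decode(declared, errors="replace"), declared


greeting = "\u0645\u0631\u062d\u0628\u0627"  # Arabic sample text

# A page tagged windows-1256 whose bytes are actually UTF-8.
mislabelled = decode_tolerant(greeting.encode("utf-8"), "windows-1256")
# A page tagged windows-1256 that really is encoded that way.
honest = decode_tolerant(greeting.encode("cp1256"), "windows-1256")

print(mislabelled[1], honest[1])
```

A real application would probably still want a manual override such as the drop-down mentioned above, since a heuristic like this cannot tell one legacy variant from another.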

Anyway, I'd like to annotate a couple records as "this is sometimes called xxx" without making it an alias, especially when the "alias" is already defined as something else.

-Shawn

 
http://blogs.msdn.com/shawnste


Anne van Kesteren

2010-09-03 17:29:19 UTC

On Fri, 03 Sep 2010 18:19:49 +0200, Shawn Steele

Post by Shawn Steele
I don't think it can be restricted to the web. For various reasons this
difference could happen in other places too (MIME).

More generic is fine with me, but from the other people commenting I get
the feeling that changing the details of how things in the registry are
today is controversial. And I'd really like to just get it done.

Post by Shawn Steele
And "Big5" doesn't always mean "windows 950". It often does on a
Windows box, but it might actually mean "Big5" on other machines. (I
was going to provide examples, but there's a lot of variety across
vendors; Wikipedia lists a lot of variations.)

Sure, that is a problem that needs to be fixed though.

Post by Shawn Steele
HTML5 seems to be providing "clear guidance" to web developers, so if we
had this kind of annotation, then HTML5 could point to that. However,
even in HTML (especially in HTML?), big5 doesn't necessarily always mean
Windows 950; it could actually be a different variant from whatever
authoring system originated it.

Sure, it could mean something else, but that seems hardly relevant. Or are
you proposing that we have some additional variable for deciding how to
decode something labeled as "big5"? I don't think that is going to happen.

Post by Shawn Steele
Seems to me that, in practice, these labels narrow down the behavior,
but there are variations. Sometimes only a couple codepoints, sometimes
bigger, but a "wise" application would allow for imperfect code page
identification. (Like the user drop-down to change code pages).

That seems way more complicated than what we have now, and since this is
all legacy (I consider non-UTF-8 legacy) anyway, I am not sure we should be
concerned with that. We should pick what works best, and most often
that is "what IE does", as IE dominated the market when all the legacy
encodings dominated.

Post by Shawn Steele
Ironically, IE9 betas are getting stricter about the page declarations,
and I've started seeing the opposite (Arabic sites tagged as Arabic but
they're actually UTF-8, etc.), which is, I guess, why people have been
doing autodetection.

Are you talking about the platform releases or internal betas that have
the IE-specific code activated?

Post by Shawn Steele
Anyway, I'd like to annotate a couple records as "this is sometimes
called xxx" without making it an alias, especially when the "alias" is
already defined as something else.

All I'm saying is that is way too vague for implementations to use. I
suppose it is an incremental step to the registry getting closer to
reality, but I would prefer something more drastic.

--
Anne van Kesteren
http://annevankesteren.nl/

Shawn Steele

2010-09-03 19:11:42 UTC

Post by Anne van Kesteren
All I'm saying is that is way too vague for implementations to use. I suppose it is an incremental step to the registry getting closer to reality, but I would prefer something more drastic.

I think the current state is too complex to quantify, and moving toward UTF-8 is the only solution that really solves the problem (though it takes forever, but Mark Davis @ Google is reporting UTF-8 on the web is now over 50%!)

I'd just ignore the problem, but people see that some of our products recycle the registry names for similar-but-different purposes, and then they want us to register our behavior. However, the names are already taken. So they want a new name, but what's the point in naming our behavior with something we can't recognize? And the requestors are ones I can't ignore.

So I'm trying to figure out a workaround that makes sense: Registering our behavior when it differs, but provide an alternative name in case we do something different.

Eg: Something like:

Charset name: Windows Codepage 932

Charset aliases: (None)

Suitability for use in MIME text:

Yes, windows-932 is suitable for use with subtypes of the "text"
Content-Type. Note that windows-932 is an 8-bit, double byte
charset. Care should be taken to choose an appropriate
Content-Transfer-Encoding.

Published specification(s):

1) http://www.microsoft.com/globaldev/reference/dbcs/932.htm

ISO 10646 equivalency table:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

Additional information:

Although not authoritative, the following references may also be of
interest:

Microsoft windows extended "best fit" behavior:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt

This charset is in use, but inconsistently named.

This charset is also known as Windows Code Page 932 or cp932 for
short; these are NOT aliases.

This charset is also defined as Windows-31J, however that name
is not recognized by many applications.

Some vendors and applications use Windows Codepage 932 instead of
shift_jis, however Windows Codepage 932 has different behavior
than that of shift_jis name. For compatibility, applications
may need to map shift_jis to Windows Codepage 932 in some cases.

This code page is a vendor specific extension of shift_jis.

Person & email address to contact for further information:

Shawn Steele
Email: ***@microsoft.com

Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
U.S.A.

Intended usage: COMMON

-Sha

Bjoern Hoehrmann

2010-09-04 00:54:37 UTC

Post by Shawn Steele
Charset name: Windows Codepage 932

(Note that spaces are not allowed in the name, currently.)
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

NARUSE, Yui

2010-09-04 18:56:22 UTC

What is the difference between Windows-31J and this?

--
NARUSE, Yui <***@airemix.jp>

Shawn Steele

2010-09-06 06:12:39 UTC

Windows-31J isn't helpful to describe Microsoft behavior because that's the IANA name, but Microsoft apps don't recognize it. Unfortunately the names we use are already assigned :( Obviously existing behavior isn't going to change.

So my thinking would be to use a name that we do recognize (though as an alias), and point out that we sometimes refer to this code page by other names, though those names may have different meanings to other applications.

It's not a great solution to the single name meaning two things problem, but I'm not sure there is a perfect solution to this issue.

This is probably an issue for a few of the code pages. I don't think a "Web specific" registry helps much because we use these names for pretty much everything, so any application on a windows/.Net box reading "shift_jis" content could get "Windows-31J" behavior. (Presumably some could also get shift_jis behavior if they used their own conversion code instead of the system APIs).

-Shawn

 
http://blogs.msdn.com/shawnste


NARUSE, Yui

2010-09-06 09:50:33 UTC

Post by Shawn Steele
Windows-31J isn't helpful to describe microsoft behavior because that's the IANA name, but Microsoft apps don't recognize it.

In my understanding, Windows-31J refers to Microsoft Windows Codepage 932.
So it should be helpful to describe Microsoft behavior.

Proposed revision of "Shift_JIS" and "Windows-31J"
http://www2.xml.gr.jp/log.html?MLID=xmlmoji&N=142

Post by Shawn Steele
Unfortunately the names we use are already assigned :( Obviously existing behavior isn't going to change.

What are "the names"?

Post by Shawn Steele
So my thinking would be to use a name that we do recognize (though as an alias), and point out that we sometimes refer to this code page by other names, though those names may have different meanings to other applications.

It's not a great solution to the single name meaning two things problem, but I'm not sure there is a perfect solution to this issue.

This is probably an issue for a few of the code pages. I don't think a "Web specific" registry helps much because we use these names for pretty much everything, so any application on a windows/.Net box reading "shift_jis" content could get "Windows-31J" behavior. (Presumably some could also get shift_jis behavior if they used their own conversion code instead of the system APIs).

If I have portable software, it should work on Unix the same as
it does on Windows.
So the expectation that "shift_jis" on Windows means "Windows-31J" seems wrong.

Such automatic overrides are needed for documents which declare their
encoding in the document or related metadata.
Such a situation is not Web-specific, but it is not universal either.

--
NARUSE, Yui <***@airemix.jp>

Shawn Steele

2010-09-07 05:02:55 UTC

Post by NARUSE, Yui
In my understanding, Windows-31J refers to Microsoft Windows Codepage 932.
So it should be helpful to describe Microsoft behavior.

One would think so :) I have no clue what the history is, but it seems like it could be cleaned up a bit.

Post by NARUSE, Yui
Proposed revision of "Shift_JIS" and "Windows-31J"
http://www2.xml.gr.jp/log.html?MLID=xmlmoji&N=142

Unfortunately that link doesn't work for me :(

Post by NARUSE, Yui

Post by Shawn Steele
Unfortunately the names we use are already assigned :( Obviously existing behavior isn't going to change.

Post by NARUSE, Yui
What are "the names"?

"shift_jis". If you ask windows/mlang/.Net for shift_jis, you get 31J behavior. Ditto for iso-2022-jp/50220. Also, if you ask mlang/.Net what the "name" is for those, it returns "shift_jis" and iso-2022-jp, not some windows-specific name.

Post by NARUSE, Yui
If I have a portable software, it should work on Unix as the same as
it does on Windows.
So the expectation that "shift_jis" on Windows means "Windows-31J" seems wrong.

That's the fundamental problem. If you have portable software and run it on Unix and on Windows, and save your file using "shift_jis" you're going to have some odd discrepancies. Obviously that's not good, but it's pretty entrenched. Clearly we cannot expect Unix boxes to pretend shift_jis is Windows-31J (but some apps do), however it's also a tad unreasonable to expect Windows boxes to suddenly be very strict when they encounter "shift_jis" as that would break a very large number of documents that currently "work."

My feeling is that this is a fairly annoying pain, and I could probably invent a number of transition schemes that might get some sort of reasonable parity and migrate documents over a decade or two. However, I think that would still be a painful process, and that everyone's energy would be better spent encouraging use of a more consistent encoding, such as UTF-8, that avoids most of the problems with code pages evolving in different directions.

Post by NARUSE, Yui
Such automatic overrides are needed for documents which declares its
encoding in the document or related metadata.

My suggestion is not to make such a replacement "automatic", but rather to note "somewhere" (like the standards or registry) that this name misrepresentation sometimes happens. Then the app developer can figure out what to do for their app and user base.

-Shawn

Anne van Kesteren

2010-09-07 06:38:38 UTC

On Tue, 07 Sep 2010 07:02:55 +0200, Shawn Steele

Post by Shawn Steele

Post by NARUSE, Yui
If I have a portable software, it should work on Unix as the same as
it does on Windows.
So the expectation that "shift_jis" on Windows means "Windows-31J"
seems wrong.

That's the fundamental problem. If you have portable software and run
it on Unix and on Windows, and save your file using "shift_jis" you're
going to have some odd discrepancies. Obviously that's not good, but
it's pretty entrenched. Clearly we cannot expect Unix boxes to pretend
shift_jis is Windows-31J (but some apps do), however it's also a tad
unreasonable to expect Windows boxes to suddenly be very strict when
they encounter "shift_jis" as that would break a very large number of
documents that currently "work."

I think we can expect all browsers to at least start "pretending" that
shift_jis is Windows-31J. And similarly for all other encodings. And maybe
changes to browsers find their way back upstream, but that is outside my
interest area.

Post by Shawn Steele
My feeling is that this is a fairly annoying pain, and I could probably
invent a number of transition schemes that might get some sort of
reasonable parity and migrate documents over a decade or two. However,
I think that would still be a painful process, and that everyone's
energy would be better spent encouraging use of a more consistent
encoding, such as UTF-8, that avoids most of the problems with code
pages evolving in different directions.

Sure, we encourage people to use UTF-8 all over the place. However, there
is a segment of the web that is not being updated as much (or at all) and
still wants to be rendered. All browsers should render that segment the
same way. And new browsers should not have to reverse engineer existing
browsers to figure out how to render that segment correctly. (Where
correctly means e.g. interpreting shift_jis as Windows-31J.)
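
One way to write down "interpret shift_jis as Windows-31J" so that nobody has to reverse engineer it is a small label override table consulted before the decoder is chosen. The following Python sketch is only an illustration: the table `WEB_OVERRIDES` and the helper names are hypothetical, the entries are limited to pairings mentioned in this thread, and Python's own `cp932`, `cp950` and `cp949` codecs stand in for the Windows code pages.

```python
import codecs

# Hypothetical override table: declared label -> codec actually used.
WEB_OVERRIDES = {
    "shift_jis": "cp932",  # Windows-31J behavior
    "big5": "cp950",
    "euc-kr": "cp949",
}


def resolve_label(label: str) -> str:
    """Map a declared charset label to the Python codec name to decode with."""
    key = label.strip().lower()
    target = WEB_OVERRIDES.get(key, key)
    # codecs.lookup raises LookupError for labels Python does not know.
    return codecs.lookup(target).name


def decode_labelled(data: bytes, label: str) -> str:
    """Decode data using the override table instead of the literal label."""
    return data.decode(resolve_label(label))


print(resolve_label("Shift_JIS"))
print(repr(decode_labelled(b"\x81\x60", "shift_jis")))
```

Keeping the overrides in data rather than in code means a non-browser application could ship an empty table and keep the registered meanings.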

Post by Shawn Steele

Post by NARUSE, Yui
Such automatic overrides are needed for documents which declares its
encoding in the document or related metadata.

My suggestion is not to make such a replacement "automatic", but rather
to note "somewhere" (like the standards or registry) that this name
misrepresentation sometimes happens. Then the app developer can figure
out what to do for their app and user base.

If you develop a new browser you do not really want to have to figure such
things out. It is also nigh-on impossible given the amount of content out
there, global market, etc. So we should document what we think is best so
others do not have to figure it out all over again.

--
Anne van Kesteren
http://annevankesteren.nl/

Ned Freed

2010-09-07 23:01:00 UTC

Post by Anne van Kesteren
On Tue, 07 Sep 2010 07:02:55 +0200, Shawn Steele

Post by Shawn Steele

Post by NARUSE, Yui
If I have a portable software, it should work on Unix as the same as
it does on Windows.
So the expectation that "shift_jis" on Windows means "Windows-31J"
seems wrong.

That's the fundamental problem. If you have portable software and run
it on Unix and on Windows, and save your file using "shift_jis" you're
going to have some odd discrepancies. Obviously that's not good, but
it's pretty entrenched. Clearly we cannot expect Unix boxes to pretend
shift_jis is Windows-31J (but some apps do), however it's also a tad
unreasonable to expect Windows boxes to suddenly be very strict when
they encounter "shift_jis" as that would break a very large number of
documents that currently "work."

Adding another data point: We're under considerable pressure from our Japanese
customers to just add the 31J stuff to our shift_jis and iso-2022-jp tables and
be done with it. They will accept nothing less than the ability to use the
additional characters and send them out labelled as iso-2022-jp, or less often,
shift_jis.

Post by Anne van Kesteren
I think we can expect all browsers to at least start "pretending" that
shift_jis is Windows-31J. And similarly for all other encodings. And maybe
changes to browsers find their way back upstream, but that is outside my
interest area.

I'll probably get chided for saying this, but it sure seems to me that this
battle is already lost and we should suck it up and move on. It's always been
permissible to add characters to a charset, even though there are always going
to be implementations that are slow to support, or may never be upgraded to
support, the new characters.

So, unless there are cases where a code point has been used in conflicting
ways, why don't we just add the additional characters to shift_jis and
iso-2022-jp? (Perhaps a revision to RFC 1468 is in order.)

Ned

Shawn Steele

2010-09-07 23:39:01 UTC

That's sort of the message I hear from different directions. (like end users & customers and developers)

I don't know enough about the standard version of shift_jis/iso-2022-jp/etc. :) (I know some of you are ROFL now :), I'll pause for a moment and let you collect yourselves ;0)....

But I gather there may be some conflicting behavior as well, and that the windows version isn't just additions?

My understanding (hearsay), is also that "ours" isn't the only variation of these code pages, though our version certainly gets a lot of attention.
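
Whether a variant is "just additions" or also conflicts can be checked mechanically by decoding every two-byte sequence with both tables and sorting the differences into three buckets. The sketch below does this in Python with its bundled `shift_jis` and `cp932` codecs. Those codecs are stand-ins and are not guaranteed to match the registry definitions or what Windows does, and the helper names are made up for the example.

```python
def decode_or_none(data: bytes, codec: str):
    """Decode strictly, returning None when the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


def compare_double_byte(codec_a: str, codec_b: str):
    """Classify every two-byte sequence in the Shift_JIS lead/trail ranges."""
    only_a, only_b, conflicts = [], [], []
    leads = [*range(0x81, 0xA0), *range(0xE0, 0xFD)]
    trails = [*range(0x40, 0x7F), *range(0x80, 0xFD)]
    for lead in leads:
        for trail in trails:
            seq = bytes([lead, trail])
            a = decode_or_none(seq, codec_a)
            b = decode_or_none(seq, codec_b)
            if a == b:
                continue  # same mapping, or unassigned in both
            if b is None:
                only_a.append(seq)
            elif a is None:
                only_b.append(seq)  # pure additions in codec_b
            else:
                conflicts.append(seq)  # both decode, but differently
    return only_a, only_b, conflicts


only_sjis, only_cp932, conflicts = compare_double_byte("shift_jis", "cp932")
print(len(only_sjis), len(only_cp932), len(conflicts))
```

Sequences that land in `conflicts` are the ones that rule out treating one table as a pure superset of the other.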

-Shawn


Bjoern Hoehrmann

2010-09-07 23:54:02 UTC

Post by Ned Freed
I'll probably get chided for saying this, but it sure seems to me that this
battle is already lost and we should suck it up and move on. It's always been
permissible to add characters to a charset, even though there are always going
to be implementations that are slow to support, or may never be upgraded to
support, the new characters.
So, unless there are cases where a code point has been used in conflicting
ways, why don't we just add the additional characters to shift_jis and
iso-2022-jp? (Perhaps a revision to RFC 1468 is in order.)

As I understand it, there are indeed implementations that support these
names but do not agree on the correspondence between octet sequences and
sequences of Unicode code points; a typical example is backslash versus
yen sign in shift_jis. http://www.w3.org/Submission/japanese-xml/ has a
few more details.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Masatoshi Kimura

2010-09-08 11:56:51 UTC

Post by Ned Freed
So, unless there are cases where a code point has been used in
conflicting ways, why don't we just add the additional characters to
shift_jis and iso-2022-jp? (Perhaps a revision to RFC 1468 is in
order.)

Because it is not enough to add the characters.
Some Windows-31J characters have Unicode mappings that are incompatible
with those of the corresponding Shift_JIS characters.
The most notable example is wave dash (Shift_JIS 0x8160). Although this
character is supposed to correspond to U+301C, Windows-31J converts 0x8160
to U+FF5E FULLWIDTH TILDE.
Some existing implementations are likely to break if the
mappings are changed.
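
The wave dash case can be reproduced with Python's bundled codecs, used here only as stand-ins for the two mapping tables (`shift_jis` for the JIS X 0208 based table, `cp932` for the Windows one). This is a small sketch; the helper `round_trips` is made up for the example.

```python
WAVE_DASH_BYTES = b"\x81\x60"

as_shift_jis = WAVE_DASH_BYTES.decode("shift_jis")
as_cp932 = WAVE_DASH_BYTES.decode("cp932")


def round_trips(text: str, codec: str) -> bool:
    """True if text can be encoded with codec and decoded back unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeEncodeError:
        return False


print(f"shift_jis: U+{ord(as_shift_jis):04X}")  # U+301C
print(f"cp932:     U+{ord(as_cp932):04X}")      # U+FF5E
print("U+301C survives cp932:", round_trips("\u301c", "cp932"))
```

Text decoded under one label and re-encoded under the other can therefore fail or change, which is why simply merging the tables is not enough.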

I wonder why your customers didn't tell you this fact.

--
***@nifty.ne.jp

Shawn Steele

2010-09-08 16:46:19 UTC

Out of curiosity, is anyone aware of differences between 31J and windows' implementation?


NARUSE, Yui

2010-09-08 18:41:58 UTC

Post by Shawn Steele
Out of curiosity, is anyone aware of differences between 31J and windows' implementation?

The definition of Windows-31J is as follows:

Name: Windows-31J
MIBenum: 2024
Source: Windows Japanese. A further extension of Shift_JIS
to include NEC special characters (Row 13), NEC
selection of IBM extensions (Rows 89 to 92), and IBM
extensions (Rows 115 to 119). The CCS's are
JIS X0201:1997, JIS X0208:1997, and these extensions.
This charset can be used for the top-level media type "text",
but it is of limited or specialized use (see RFC2278).
PCL Symbol Set id: 19K
Alias: csWindows31J

So
* it doesn't include User Defined Characters
* it's not clear about best fit chars
* Original CP932 has some odd mapping like U+0080 and U+00FF
http://icu-project.org/repos/icu/data/trunk/charset/data/ucm/windows-932-2000.ucm

--
NARUSE, Yui <***@airemix.jp>

Shawn Steele

2010-09-07 23:44:24 UTC

Post by Anne van Kesteren
If you develop a new browser you do not really want to have to figure such things out. It is also nigh-on impossible given the amount of content out there, global market, etc. So we should document what we think is best so others do not have to figure it out all over again.

Is there a "best" solution?

"Clearly" if you're an app dev talking to a windows box, you "should" probably use the windows variations. However if you're a banking app on a Unix box, that's not necessarily the right solution. Or if you have to talk to a service that's on some mainframe... or windows server...

So I think it's helpful to document what the major variations are to help developers figure those things out. As soon as someone says "web browsers should always treat shift_jis as if it were Windows-31J", then somebody's web service is going to break because they have a Unix back end.

I completely agree th

NARUSE, Yui

2010-09-04 19:53:48 UTC

Post by Shawn Steele
Specifically, I’m wondering about registering “windows 950”, but somehow
annotating it that Microsoft typically redirects “big5” to that
behavior. So an alias isn’t really appropriate.

Yes, Big5 vs. CP950 should be the problem.
This is also the problem of which encoding people mean when they call it "big5".

Another problem is EUC-KR vs. CP949
http://wiki.whatwg.org/wiki/Web_Encodings says:
* EUC-KR is CP51949
* ks_c_5601-1987 is CP949
but HTML5 says EUC-KR is CP949
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0

--
NARUSE, Yui <***@airemix.jp>

Shawn Steele

2010-09-06 06:16:31 UTC

Yes, I think there are about 4 or 5 of these that aren't handled by the windows-1252 type code pages.

I suppose an alternative would be to annotate the Big5, etc. entries to note that some applications use CP950, etc. instead. (I'm using the generic word "applications" here because it isn't specific to the web, or email, but could be many applications. Additionally it isn't only a single vendor as other vendors seeking parity also sometimes do the same thing).

-Shawn

 
http://blogs.msdn.com/shawnste


Bjoern Hoehrmann

2010-09-04 00:53:32 UTC

Post by Anne van Kesteren
Ideally the registry (or maybe a new one specific to the Web) reflects
what implementations do. If we all agree that "big5" means "windows 950"
then "big5" should just mean that even though originally the intention
might have been different. After all, that is what the running code is
doing. That would give the most clear guidance to new implementors and
allows existing implementors to converge.

These days we have legacy encodings primarily to support legacy systems
and legacy content; where systems recognize encodings by some identifier
and that identifier means different things for different systems, then
there may be little interest in convergence. Web browsers generally
consume the same content and are supposed to be interchangeable, so there is
considerable interest and willingness to converge, but that is not the
case in systems where generator and consumer are more tightly coupled;
slowly upgrading to some Unicode encoding is often a better option for
such systems than fiddling with intricate legacy encoding details.

Put simply, if my system has always turned 0x5C with some label into a
yen sign, I would not change it so it turns it into a backslash instead,
as that would have unforeseeable security and compatibility implications.
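
To make the 0x5C concern concrete, here is a toy Python sketch with two hypothetical single-byte tables that differ only at 0x5C, plus a check of the kind a filter might run on the decoded text. Nothing here models a real codec or product; the names and the check are invented for illustration.

```python
# Two hypothetical single-byte tables that differ only at 0x5C.
BACKSLASH_TABLE = {0x5C: "\\"}
YEN_TABLE = {0x5C: "\u00a5"}


def decode_ascii_like(data: bytes, overrides: dict[int, str]) -> str:
    """Decode 7-bit bytes as ASCII except where `overrides` says otherwise."""
    return "".join(overrides.get(b, chr(b)) for b in data)


def contains_backslash(text: str) -> bool:
    """Toy check that only sees the decoded characters, not the raw bytes."""
    return "\\" in text


raw = b"C:\x5cTemp\x5creport.txt"
as_backslash = decode_ascii_like(raw, BACKSLASH_TABLE)
as_yen = decode_ascii_like(raw, YEN_TABLE)

print(contains_backslash(as_backslash), contains_backslash(as_yen))
```

The same bytes pass or fail the check depending on the table, so switching tables silently changes what every downstream escaping or path-handling step sees.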

Easing the pain in dealing with such differences and aiding those who
have the opportunity and will to converge requires documentation. If
someone comes up with a good overview in some referenceable form (like
an Informational RFC detailing what popular implementations like .NET,
Sun's JDK, iconv, ICU, MLang, and so on, do) that would be useful
information to link in the registry.

Information fragments (like only "Some Microsoft products do this" or
"Some popular web browsers do that") are of little utility there, and
for labels that have existed for considerable time it should not be a
goal of the registry to aid in convergence as that is unrealistic. In
particular note that we have very few people on this list, so there'd
be little review of changes in the registry.

What I would support is giving well-defined behaviors proper names in
the registry (with the possible caveat that the assignment is for
documentational purposes only and the name should not be supported
directly) if the behavior is not expected to change, so you can document
your library by saying "label 'yyy' is treated as 'xxx'", although that
would have the problem Shawn already mentioned.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Anne van Kesteren

2010-09-04 08:29:19 UTC

Post by Bjoern Hoehrmann
These days we have legacy encodings primarily to support legacy systems
and legacy content; where systems recognize encodings by some identifier
and that identifier means different things for different systems, then
there may be little interest in convergence. Web browsers generally
consume the same content and are supposed to be interchangeable, so there is
considerable interest and willingness to converge, but that is not the
case in systems where generator and consumer are more tightly coupled;
slowly upgrading to some Unicode encoding is often a better option for
such systems than fiddling with intricate legacy encoding details.

This is why I proposed having a separate registry for the web. For web
browsers if you wish.

--
Anne van Kesteren
http://annevankesteren.nl/
