Discussion:
shift_jis / windows-31J
Shawn Steele
2010-11-10 22:52:42 UTC
A dozen years ago windows-31J was created because people noticed that there were lots of different flavors of shift_jis floating around. Uniquely identifying them may have made sense, however the windows-31J term has never really been widely adopted for the windows code page 932 behavior.

So I’d like to propose the following updates, loosely based on discussion about variants some time ago. I’d be happy to accept other suggestions that help users discover that some test is tagged with the less-specific shift_jis name rather than the more specific vendor charset name.

Name: Windows-31J
MIBenum: 2024
Source: Windows Japanese. A variant of Shift_JIS to include
NEC special characters (Row 13), NEC selection of IBM
extensions (Rows 89 to 92), and IBM extensions (Rows
115 to 119). The CCS's are JIS X0201:1997,
JIS X0208:1997, and these extensions. This charset
can be used for the top-level media type "text", but
it is of limited or specialized use (see RFC2278).
PCL Symbol Set id: 19K. Windows-31J text is commonly
declared with the shift_jis name of the parent charset.
Alias: csWindows31J
Alias: shift_jis+cp932
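
As a rough illustration of the row extensions listed in the Windows-31J record above, the sketch below uses Python's built-in codecs. It assumes that Python's `cp932` codec approximates Windows-31J and that its `shift_jis` codec approximates plain JIS X 0208 Shift_JIS; that is a property of that runtime, not something the registry defines. The sample byte pairs are taken from the CP932.TXT mapping table, and the helper names are made up for this example.

```python
# Sample double-byte sequences from the extension rows (values per CP932.TXT).
SAMPLES = {
    "NEC special characters (Row 13)": b"\x87\x40",           # U+2460
    "NEC selection of IBM extensions (Row 89)": b"\xed\x40",  # U+7E8A
    "IBM extensions (Row 115)": b"\xfa\x40",                  # U+2170
}


def try_decode(data: bytes, codec: str):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


def compare(samples=SAMPLES):
    """Map each label to (cp932 result, shift_jis result)."""
    return {
        label: (try_decode(raw, "cp932"), try_decode(raw, "shift_jis"))
        for label, raw in samples.items()
    }


if __name__ == "__main__":
    for label, (win, plain) in compare().items():
        print(f"{label}: cp932={win!r} shift_jis={plain!r}")
```

Under these assumptions the extension rows decode only with the vendor codec, which is the practical reason a bare shift_jis label on such text can mislead a strict decoder.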


Name: Shift_JIS (preferred MIME name)
MIBenum: 17
Source: This charset is an extension of csHalfWidthKatakana by
adding graphic characters in JIS X 0208. The CCS's are
JIS X0201:1997 and JIS X0208:1997. The
complete definition is shown in Appendix 1 of JIS
X0208:1997.
This charset can be used for the top-level media type "text".
Several vendor specific charsets that derive from shift_jis
often use the shift_jis name instead of a more specific
vendor charset name.
Alias: MS_Kanji
Alias: csShiftJIS




- Shawn

 
http://blogs.msdn.com/shawnste
(Selfhost 7872)
NARUSE, Yui
2010-11-11 09:57:03 UTC
Post by Shawn Steele
A dozen years ago windows-31J was created because people noticed that there were lots of different flavors of shift_jis floating around. Uniquely identifying them may have made sense, however the windows-31J term has never really been widely adopted for the windows code page 932 behavior.
So I’d like to propose the following updates, loosely based on discussion about variants some time ago. I’d be happy to accept other suggestions that help users discover that some test is tagged with the less-specific shift_jis name rather than the more specific vendor charset name.
Name: Windows-31J
MIBenum: 2024
Source: Windows Japanese. A variant of Shift_JIS to include
NEC special characters (Row 13), NEC selection of IBM
extensions (Rows 89 to 92), and IBM extensions (Rows
115 to 119). The CCS's are JIS X0201:1997,
JIS X0208:1997, and these extensions. This charset
can be used for the top-level media type "text", but
it is of limited or specialized use (see RFC2278).
PCL Symbol Set id: 19K. Windows-31J text is commonly
declared with the shift_jis name of the parent charset.
Alias: csWindows31J
Alias: shift_jis+cp932
Name: Shift_JIS (preferred MIME name)
MIBenum: 17
Source: This charset is an extension of csHalfWidthKatakana by
adding graphic characters in JIS X 0208. The CCS's are
JIS X0201:1997 and JIS X0208:1997. The
complete definition is shown in Appendix 1 of JIS
X0208:1997.
This charset can be used for the top-level media type "text".
Several vendor specific charsets that derive from shift_jis
often use the shift_jis name instead of a more specific
vendor charset name.
Alias: MS_Kanji
Alias: csShiftJIS
I object to creating a new alias name.
Moreover XML doesn't allow "+" for EncName.
http://www.w3.org/TR/REC-xml/#NT-EncName
If aliases are added to Windows-31J, they should be CP932, MS932, or Windows-932.

I agree with adding more description.
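
The EncName objection above can be checked mechanically. The sketch below transcribes the XML 1.0 EncName production, `[A-Za-z] ([A-Za-z0-9._] | '-')*`, into a regular expression; the function name is an assumption made for this example.

```python
import re

# EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*   (XML 1.0)
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_valid_enc_name(name: str) -> bool:
    """True if `name` could appear in an XML encoding declaration."""
    return ENC_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("shift_jis+cp932", "CP932", "MS932", "Windows-932"):
        print(candidate, is_valid_enc_name(candidate))
```

The proposed `shift_jis+cp932` alias fails the check because of the "+", while the three alternative names pass.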
--
NARUSE, Yui <***@airemix.jp>
Martin J. Dürst
2010-11-11 11:10:45 UTC
Hello Shawn, Yui,
Post by NARUSE, Yui
A dozen years ago windows-31J was created because people noticed that
there were lots of different flavors of shift_jis floating around.
Uniquely identifying them may have made sense, however the windows-31J
term has never really been widely adopted for the windows code page
932 behavior.
So I’d like to propose the following updates, loosely based on
discussion about variants some time ago. I’d be happy to accept other
suggestions that help users discover that some test is tagged with the

Shouldn't that be 'text' instead of 'test'?
Post by NARUSE, Yui
less-specific shift_jis name rather than the more specific vendor
charset name.
Name: Windows-31J
MIBenum: 2024
Source: Windows Japanese. A variant of Shift_JIS to include
NEC special characters (Row 13), NEC selection of IBM
extensions (Rows 89 to 92), and IBM extensions (Rows
115 to 119). The CCS's are JIS X0201:1997,
JIS X0208:1997, and these extensions. This charset
can be used for the top-level media type "text", but
it is of limited or specialized use (see RFC2278).
I think when you say 'it can be used for the top-level media type
"text"', you also need to say something about that it is not 7-bit.

Anyway, it seems that you are using an old (or no) template, I think it
would be best to use the newest template.
Post by NARUSE, Yui
PCL Symbol Set id: 19K.
I had no clue what "PCL Symbol set" was. Is this important? It doesn't
turn up in other charset registrations.
Post by NARUSE, Yui
Windows-31J text is commonly
declared with the shift_jis name of the parent charset.
I'd suggest to change "commonly" to "often". To me, "commonly" has too
much of a touch of "that's the right thing to do".
Post by NARUSE, Yui
Alias: csWindows31J
Alias: shift_jis+cp932
Name: Shift_JIS (preferred MIME name)
MIBenum: 17
Source: This charset is an extension of csHalfWidthKatakana by
adding graphic characters in JIS X 0208. The CCS's are
JIS X0201:1997 and JIS X0208:1997. The
complete definition is shown in Appendix 1 of JIS
X0208:1997.
This charset can be used for the top-level media type "text".
Several vendor specific charsets that derive from shift_jis
often use the shift_jis name instead of a more specific
vendor charset name.
Alias: MS_Kanji
Alias: csShiftJIS
I'm not sure why the registration for Shift_JIS turns up here. Are you
also updating that?
Post by NARUSE, Yui
I object to creating a new alias name.
Yui, can you say why you object?
Post by NARUSE, Yui
Moreover XML doesn't allow "+" for EncName.
http://www.w3.org/TR/REC-xml/#NT-EncName
I agree that this is a serious show-stopper.

Regards, Martin.
Post by NARUSE, Yui
If aliases are added to Windows-31J, they should be CP932, MS932, or Windows-932.
Post by NARUSE, Yui
I agree with adding more description.
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Shawn Steele
2010-11-11 17:31:10 UTC
Anyway, it seems that you are using an old (or no) template, I think it
would be best to use the newest template.
Post by Shawn Steele
PCL Symbol Set id: 19K.
I had no clue what "PCL Symbol set" was. Is this important? It doesn't
turn up in other charset registrations.
Me either. :) I was trying to make the minimal change to express the idea.
Moreover XML doesn't allow "+" for EncName.
http://www.w3.org/TR/REC-xml/#NT-EncName
I picked the syntax based on a previous thread a couple years ago, I didn't realize this was a problem.

My more general question is "how do I say 'shift_jis points to windows-31J' on some systems, despite what the charset registry has said for years?"

-Shawn
Markus Scherer
2010-11-11 18:10:36 UTC
Post by Shawn Steele
Post by NARUSE, Yui
Moreover XML doesn't allow "+" for EncName.
http://www.w3.org/TR/REC-xml/#NT-EncName
I picked the syntax based on a previous thread a couple years ago, I didn't
realize this was a problem.
My more general question is "how do I say 'shift_jis points to windows-31J'
on some systems, despite what the charset registry has said for years?"
The ICU converter alias list has the following aliases tagged with
"WINDOWS": Shift_JIS, MS_Kanji, csShiftJIS, csWindows31J, cp932, windows-932.
I don't know which of these names Windows actually recognizes. If there is
at least one name that Windows recognizes (MS_Kanji??) and that does not
collide with an IANA standard-Shift-JIS alias, you could use that.
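
As a point of comparison outside ICU and Windows, the sketch below asks CPython's codec registry which canonical codec each of these labels resolves to. It says nothing about what Windows itself recognizes; the label list is copied from the ICU aliases above and the function name is an assumption for this example.

```python
import codecs

# Labels taken from the ICU "WINDOWS" alias list quoted above.
NAMES = ["Shift_JIS", "MS_Kanji", "csShiftJIS", "csWindows31J", "cp932", "windows-932"]


def resolve(names=NAMES):
    """Map each label to the canonical codec name this runtime resolves it to."""
    resolved = {}
    for name in names:
        try:
            resolved[name] = codecs.lookup(name).name
        except LookupError:
            resolved[name] = None  # label unknown to this runtime
    return resolved


if __name__ == "__main__":
    for label, codec in resolve().items():
        print(f"{label:>14} -> {codec}")
```

Labels that a given runtime does not know come back as None, which makes it easy to see where two implementations' alias tables disagree.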

markus
Shawn Steele
2010-11-11 19:02:19 UTC
The "bigger" problem isn't finding a name that we recognize, although that's big too, but rather if I do:

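using System;
using System.Text;

class Example
{
    static void Main()
    {
        // Look up code page 932 and print its web (IANA-style) name.
        Console.WriteLine(Encoding.GetEncoding(932).WebName);
        // Look up the encoding by two candidate names.
        Encoding.GetEncoding("csWindows-31J");
        Encoding.GetEncoding("Windows-31J");
    }
}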

Then I'll get "shift_jis" as the encoding name. (WebName's effectively as close as .Net gets to the IANA charset names.) That cannot change without breaking tons of stuff. C# does happen to recognize csWindows-31J, but the next line will throw an exception. I'd have to dig more to see if MLang recognized the csWindows-31J, but that wouldn't really solve the problem.

So, IANA could decide that Microsoft's variant should have some name (say xxxx or maybe Windows-31J, or use the csWindows-31J we almost know about (not all products do)). However, we'd still return "shift_jis" when you asked for the name. We pretty much can't change that because if you tag your .Net generated document with Encoding.WebName (like maybe an asp.net server), and you upgrade, then I won't be able to read it if I haven't upgraded. Certainly that'd be a huge migration pain, and we'd much, much, much rather people migrate to UTF-8 or UTF-16 than spend any more time in old encodings.

Our partners and competitors would like to interoperate with our encodings, but the shift_jis name is a bit misleading since ours is a variant. "Everyone" knows that (or quickly discovers it), but it would be nice if that was a bit better documented in the registry.

-Shawn

Bjoern Hoehrmann
2010-11-11 19:14:26 UTC
Post by Shawn Steele
Then I'll get "shift_jis" as the encoding name. (WebName's effectively
as close as .Net gets to the IANA charset names.) That cannot change
without breaking tons of stuff. C# does happen to recognize
csWindows-31J, but the next line will throw an exception. I'd have to
dig more to see if MLang recognized the csWindows-31J, but that wouldn't
really solve the problem.
http://lists.w3.org/Archives/Public/www-archive/2008Jun/0155.html

...
| 932 | csShiftJIS |
| 932 | csWindows31J |
| 932 | ms_Kanji |
| 932 | shift-jis |
| 932 | shift_jis |
| 932 | sjis |
| 932 | x-ms-cp932 |
| 932 | x-sjis |
...
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Shawn Steele
2010-11-11 22:33:42 UTC
Well, that saves me looking at the MLang source :). But whichever name you used to get the encoding/code page, we'd still call 932 "shift_jis" when you asked for the web name, so your asp.net server would say a page was shift_jis. W3C is getting around that with the variations for HTML5, but that doesn't help other applications, MIME, random documents, etc. :(
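
The round-tripping problem described here is a many-to-one lookup followed by a one-to-one lookup. The sketch below is a simplified model, not the .NET or MLang implementation: the label list comes from the MLang table quoted above, while the two dictionaries and the function name are hypothetical.

```python
# Labels from the MLang table quoted above; all resolve to code page 932.
LABEL_TO_CODE_PAGE = {
    label.lower(): 932
    for label in (
        "csShiftJIS", "csWindows31J", "ms_Kanji", "shift-jis",
        "shift_jis", "sjis", "x-ms-cp932", "x-sjis",
    )
}

# Hypothetical: one reported name per code page, mirroring Encoding.WebName.
CODE_PAGE_TO_WEB_NAME = {932: "shift_jis"}


def web_name_for(label: str) -> str:
    """Resolve a label to its code page, then report that page's single name."""
    code_page = LABEL_TO_CODE_PAGE[label.lower()]
    return CODE_PAGE_TO_WEB_NAME[code_page]


if __name__ == "__main__":
    for label in LABEL_TO_CODE_PAGE:
        print(label, "->", web_name_for(label))
```

Whatever label goes in, the name that comes back is "shift_jis", so accepting a more specific label on input does not change how generated documents are tagged.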

-Shawn

Anne van Kesteren
2010-11-12 10:34:23 UTC
On Thu, 11 Nov 2010 23:33:42 +0100, Shawn Steele
Post by Shawn Steele
Well, that saves me looking at the MLang source :). But whichever name
you used to get the encoding/code page, we'd still call 932 "shift_jis"
when you asked for the web name, so your asp.net server would say a page
was shift_jis. W3C is getting around that with the variations for
HTML5, but that doesn't help other applications, MIME, random documents,
etc. :(
Ideally we put the HTML5 mapping in the registry instead. And have some
kind of "compatible with the web/windows" profile that will give you that
variation if people are not willing to just make it the default for
everything.
--
Anne van Kesteren
http://annevankesteren.nl/
NARUSE, Yui
2010-11-11 18:10:46 UTC
Post by Shawn Steele
Post by NARUSE, Yui
Moreover XML doesn't allow "+" for EncName.
http://www.w3.org/TR/REC-xml/#NT-EncName
I picked the syntax based on a previous thread a couple years ago, I
didn't realize this was a problem.
My more general question is "how do I say 'shift_jis points to
windows-31J' on some systems, despite what the charset registry has
said for years?"
In HTML5's case, they introduce "Character encoding overrides".
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
--
NARUSE, Yui <***@airemix.jp>
Shawn Steele
2010-11-11 18:29:03 UTC
Yup. And that probably solves the concerns for HTML.

I'm being asked to document things like a character set selector attribute that's a byte. It has entries like:
0x80 Specifies the JIS character set. (IANA name shift_jis)

We all know that "Microsoft's" shift_jis is really Windows-31J, but the on-the-surface reasonable request to replace this shift_jis with Windows-31J would mean that we'd be specifying an identifier that our software didn't recognize. That doesn't help solve the problem. Even when we do recognize windows-31J, we'd tell you that the name was shift_jis (round tripping.)

This kind of documentation shows up "everywhere", so it'd be nice if people got to shift_jis in the registry and saw "gee, Microsoft uses a variation".

At this point it's rather a mess, and the behavior's pretty stuck. If it is desirable for the registry to point people in the right direction, then doing something like what HTML did, at the registry level, would be most helpful.
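
A minimal sketch of what "doing something like what HTML did" could look like in a decoder: an override table consulted before the declared label is used. The table below holds only the one override discussed in this thread, `cp932` stands in for Windows-31J because that is the closest codec Python ships, and the names are assumptions for illustration rather than anything the registry or HTML5 defines in this form.

```python
# Hypothetical override table in the spirit of HTML5's "character encoding
# overrides": the declared label on the left is decoded as the codec on the right.
OVERRIDES = {"shift_jis": "cp932"}  # cp932: Python's closest codec to Windows-31J


def decode_declared(data: bytes, declared: str) -> str:
    """Decode `data`, applying an override to the declared label if one exists."""
    label = declared.strip().lower()
    return data.decode(OVERRIDES.get(label, label))


if __name__ == "__main__":
    # 0x8740 is in NEC Row 13, so it needs the Windows-31J interpretation.
    print(decode_declared(b"\x87\x40", "Shift_JIS"))
```

With the override in place, text labeled shift_jis that contains vendor extension characters still decodes; without it, a strict Shift_JIS decoder would reject those bytes.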

Shift-JIS currently says:

Name: Shift_JIS (preferred MIME name)
MIBenum: 17
Source: This charset is an extension of csHalfWidthKatakana by
adding graphic characters in JIS X 0208. The CCS's are
JIS X0201:1997 and JIS X0208:1997. The
complete definition is shown in Appendix 1 of JIS
X0208:1997.
This charset can be used for the top-level media type "text".
Alias: MS_Kanji
Alias: csShiftJIS

I'd be happy with some sort of "Microsoft has a variant note". Adding a sentence at the end:

Name: Shift_JIS (preferred MIME name)
MIBenum: 17
Source: This charset is an extension of csHalfWidthKatakana by
adding graphic characters in JIS X 0208. The CCS's are
JIS X0201:1997 and JIS X0208:1997. The
complete definition is shown in Appendix 1 of JIS
X0208:1997.
This charset can be used for the top-level media type "text".
Microsoft products often use the shift_jis name to describe
the Windows-31J variant.
Alias: MS_Kanji
Alias: csShiftJIS

Maybe the opposite in the windows-31j entry "Microsoft products often use shift_jis to describe this variant."

-Shawn

MURATA Makoto
2010-11-11 23:57:37 UTC
Post by Shawn Steele
0x80 Specifies the JIS character set. (IANA name shift_jis)
We all know that "Microsoft's" shift_jis is really Windows-31J, but the on-the-surface reasonable
request to replace this shift_jis with Windows-31J would mean that we'd
be specifying an identifier that our software didn't recognize. That
doesn't help solve the problem. Even when we do recognize windows-31J,
we'd tell you that the name was shift_jis (round tripping.)
I do not think so, since you specify 0x80 in data rather than
"windows-31J" or "shift_jis" in this particular case.
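
To make the distinction concrete: in a format like the one described, the selector byte picks the decoding behavior and the charset name appears only in documentation. The sketch below is hypothetical; only the 0x80 entry comes from the thread, and `cp932` is used as Python's stand-in for Windows code page 932.

```python
# Hypothetical selector tables; only the 0x80 entry comes from the thread.
SELECTOR_TO_CODEC = {0x80: "cp932"}      # behavior: Windows code page 932
SELECTOR_TO_LABEL = {0x80: "shift_jis"}  # documentation label only


def decode_with_selector(selector: int, data: bytes) -> str:
    """Decode `data` using the codec chosen by the selector byte."""
    return data.decode(SELECTOR_TO_CODEC[selector])


if __name__ == "__main__":
    print(SELECTOR_TO_LABEL[0x80], decode_with_selector(0x80, b"\x87\x40"))
```

Changing the documentation label in such a table would not alter what the software does with 0x80, which is the point being made in the reply above.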
Post by Shawn Steele
This kind of documentation shows up "everywhere", so it'd be nice if people got to shift_jis in the
registry and saw "gee, Microsoft uses a variation".
At this point it's rather a mess, and the behavior's pretty stuck. If it is desirable for the registry
to point people in the right direction, then doing something like what
HTML did, at the registry level, would be most helpful.
I agree with this idea. Except the addition of a new alias, I agree.

I reformulated your proposal using the latest registration template.
Here goes.

--------------------------------------------------------------------------------


Charset name: Windows-31J
Charset aliases: csWindows31J
MIBenum: 2024

Suitability for use in MIME text:

This charset can be used for the top-level media type "text".

Published specification(s):

http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx

ISO 10646 equivalency table:

http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

Additional information:

Windows Japanese. A variant of Shift_JIS to include NEC special
characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92),
and IBM extensions (Rows 115 to 119). The CCS's are JIS X0201:1997,
JIS X0208:1997, and these extensions. Windows-31J text is commonly
declared with the shift_jis name of the parent charset.

Person & email address to contact for further information: ??

Intended usage: LIMITED USE

--------------------------------------------------------------------------------

Charset name: Shift_JIS

MIBenum: 17

Charset aliases: MS_Kanji and csShiftJIS

Suitability for use in MIME text:
This charset can be used for the top-level media type "text".

Published specification(s): Appendix 1 of JIS X0208:1997.

ISO 10646 equivalency table:

There are no authoritative definitions and several variations
exist. An obsolete variation is available at:

http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

Additional information:

This charset is an extension of csHalfWidthKatakana by adding graphic
characters in JIS X 0208. The CCS's are JIS X0201:1997 and JIS
X0208:1997.

Several vendor specific charsets that derive from shift_jis often use
the shift_jis name instead of a more specific vendor charset name.

Person & email address to contact for further information: ?

Intended usage: LIMITED USE
--
MURATA Makoto <***@hokkaido.email.ne.jp>
Shawn Steele
2010-11-12 00:15:03 UTC
Post by MURATA Makoto
I reformulated your proposal using the latest registration template.
Here goes.
This suggestion works for me. I added "and the Windows-31J name may not be recognized." and my contact information to the Windows-31J record.

For the shift_jis record, it'd be nice to have something like "Windows-31J and ??? are examples." No clue who the contacts should be; can it point to something general, like the charsets alias, JIS, or Unicode?

-Shawn


--------------------------------------------------------------------------------


Charset name: Windows-31J
Charset aliases: csWindows31J
MIBenum: 2024

Suitability for use in MIME text:

This charset can be used for the top-level media type "text".

Published specification(s):

http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx

ISO 10646 equivalency table:

http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

Additional information:

Windows Japanese. A variant of Shift_JIS to include NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119). The CCS's are JIS X0201:1997, JIS X0208:1997, and these extensions. Windows-31J text is commonly declared with the shift_jis name of the parent charset, and the Windows-31J name may not be recognized.

Person & email address to contact for further information:

Shawn Steele
Email: ***@microsoft.com

Microsoft Corporation
One Microsoft Way,
Redmond, WA 98052
U.S.A.

Intended usage: LIMITED USE

--------------------------------------------------------------------------------

Charset name: Shift_JIS

MIBenum: 17

Charset aliases: MS_Kanji and csShiftJIS

Suitability for use in MIME text:
This charset can be used for the top-level media type "text".

Published specification(s): Appendix 1 of JIS X0208:1997.

ISO 10646 equivalency table:

There are no authoritative definitions and several variations exist. An obsolete variation is available at:

http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

Additional information:

This charset is an extension of csHalfWidthKatakana by adding graphic characters in JIS X 0208. The CCS's are JIS X0201:1997 and JIS X0208:1997.

Several vendor specific charsets that derive from shift_jis often use the shift_jis name instead of a more specific vendor charset name. Windows-31J and ??? are examples.

Person & email address to contact for further information: ?

Intended usage: LIMITED USE
MURATA Makoto
2010-11-12 01:57:42 UTC
Post by Shawn Steele
No clue who contacts should be, can it point to something general, like the charsets alias, JIS, or Unicode?
I guess that the Japanese Industrial Standards Committee is responsible.

http://www.jisc.go.jp/eng/index.html
--
MURATA Makoto <***@hokkaido.email.ne.jp>
Shawn Steele
2010-11-12 00:26:38 UTC
Post by MURATA Makoto
I agree with this idea. Except the addition of a new alias, I agree.
I was "just" using a suggestion from a couple years ago. I think I like what you suggested better. I don't see much value to the new alias, the older suggestion was just to link the two somehow.

-Shawn
Martin J. Dürst
2010-11-12 09:16:41 UTC
Hello Makoto,
Post by MURATA Makoto
I'm being asked to document things like a character set selector attribute that's a byte. It has entries like:
0x80 Specifies the JIS character set. (IANA name shift_jis)
We all know that "Microsoft's" shift_jis is really Windows-31J, but the on-the-surface reasonable
request to replace this shift_jis with Windows-31J would mean that we'd
be specifying an identifier that our software didn't recognize. That
doesn't help solve the problem. Even when we do recognize windows-31J,
we'd tell you that the name was shift_jis (round tripping.)
I do not think so, since you specify 0x80 in data rather than
"windows-31J" or "shift_jis" in this particular case.
This kind of documentation shows up "everywhere", so it'd be nice if people got to shift_jis in the
registry and saw "gee, Microsoft uses a variation".
At this point it's rather a mess, and the behavior's pretty stuck. If it is desirable for the registry
to point people in the right direction, then doing something like what
HTML did, at the registry level, would be most helpful.
I agree with this idea. Except the addition of a new alias, I agree.
I reformulated your proposal using the latest registration template.
Here goes.
--------------------------------------------------------------------------------
Charset name: Windows-31J
Charset aliases: csWindows31J
MIBenum: 2024
This charset can be used for the top-level media type "text".
The recent windows-874 registration used the following:
Suitability for use in MIME text:

Yes, windows-874 is suitable for use with subtypes of the "text"
Content-Type. Note that windows-874 is an 8-bit charset. Care should
be taken to choose an appropriate Content-Transfer-Encoding.
I suggest you use something similar, in particular also mentioning that
this is an 8-bit charset. I'm sure Ned would insist on that if he had
time.
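
The 8-bit point can be seen directly in the encoded bytes. The sketch below, which again uses Python's `cp932` codec as a stand-in for Windows-31J, checks for octets with the high bit set and then wraps the data in base64, one Content-Transfer-Encoding that survives 7-bit transport. The helper names and the sample string are assumptions for this example.

```python
import base64


def has_8bit_octets(data: bytes) -> bool:
    """True if any octet has the high bit set, i.e. the data is not 7-bit safe."""
    return any(octet >= 0x80 for octet in data)


def to_base64_body(text: str, codec: str = "cp932") -> bytes:
    """Encode text in the charset, then wrap it in base64 for 7-bit transport."""
    return base64.encodebytes(text.encode(codec))


raw = "日本語".encode("cp932")   # raw Windows-31J octets
body = to_base64_body("日本語")  # the same text, transfer-encoded

if __name__ == "__main__":
    print(has_8bit_octets(raw), has_8bit_octets(body))
```

The raw octets need an 8-bit-clean path or a transfer encoding; the base64 form is plain ASCII and decodes back to the same text.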
Post by MURATA Makoto
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
Windows Japanese. A variant of Shift_JIS to include NEC special
characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92),
and IBM extensions (Rows 115 to 119). The CCS's are JIS X0201:1997,
JIS X0208:1997, and these extensions. Windows-31J text is commonly
declared with the shift_jis name of the parent charset.
I would probably change this to "On Windows systems, Windows-31J text is
commonly declared...."
Post by MURATA Makoto
Person & email address to contact for further information: ??
Intended usage: LIMITED USE
--------------------------------------------------------------------------------
Charset name: Shift_JIS
MIBenum: 17
Charset aliases: MS_Kanji and csShiftJIS
The "MS_Kanji" alias is really quite unfortunate, but I don't think we
can remove it.
Post by MURATA Makoto
This charset can be used for the top-level media type "text".
Published specification(s): Appendix 1 of JIS X0208:1997.
There are no authoritative definitions and several variations
exist. An obsolete variation is available at:
http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

This is in my opinion somewhat too pessimistic. Essentially, the
correspondence is defined in JIS X0208:1997. All the Kanji are mapped as
described in Appendix 6. There is some variation for some of the
punctuation listed in the first column of Table 2 of Appendix 5;
otherwise, the names given in Appendix 5 are used in preference to those
given in Appendix 4, where available.

If the above is true, it might be better to write it in that way, rather
than just to imply that anything goes.

Regards, Martin.
Post by MURATA Makoto
This charset is an extension of csHalfWidthKatakana by adding graphic
characters in JIS X 0208. The CCS's are JIS X0201:1997 and JIS
X0208:1997.
Several vendor specific charsets that derive from shift_jis often use
the shift_jis name instead of a more specific vendor charset name.
Person & email address to contact for further information: ?
Intended usage: LIMITED USE
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Shawn Steele
2010-11-12 18:59:14 UTC
Updated with Martin's comment for the MIME text suitability, added link to JIS web page, and tried to incorporate Martin's comments regarding the JIS standards mapping correspondence, however I am not as familiar with that document, so I could have erred.
I would probably change this to "On Windows systems, Windows-31J text is commonly declared...."
I've hesitated about this in pretty much every email on this topic. I don't mind being specific (though I'd say more like "On Microsoft systems" since non-Windows Microsoft products would use the same mappings), but my hesitation is for two reasons:
1) I wasn't sure if the charset registry wanted to directly identify companies/products by name this way.
2) I think it is likely that some of the windows behavior has "leaked" to other systems. For example, HTML 5's new mapping recommendations would probably encourage such leaking of meaning, particularly in HTML.

I don't feel strongly about it either way, so whatever the charset group wants :)

-Shawn

--------------------------------------------------------------------------------


Charset name: Windows-31J
Charset aliases: csWindows31J
MIBenum: 2024

Suitability for use in MIME text:

Yes, Windows-31J is suitable for use with subtypes of the "text" Content-Type. Note that Windows-31J is an 8-bit charset. Care should be taken to choose an appropriate Content-Transfer-Encoding.

Published specification(s):

http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx

ISO 10646 equivalency table:

http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

Additional information:

Windows Japanese. A variant of Shift_JIS to include NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119). The CCS's are JIS X0201:1997, JIS X0208:1997, and these extensions. Windows-31J text is commonly declared with the shift_jis name of the parent charset, and the Windows-31J name may not be recognized.

Person & email address to contact for further information:

Shawn Steele
Email: ***@microsoft.com

Microsoft Corporation
One Microsoft Way,
Redmond, WA 98052
U.S.A.

Intended usage: LIMITED USE

--------------------------------------------------------------------------------

Charset name: Shift_JIS

MIBenum: 17

Charset aliases: MS_Kanji and csShiftJIS

Suitability for use in MIME text:
This charset can be used for the top-level media type "text".

Published specification(s): Appendix 1 of JIS X0208:1997.

ISO 10646 equivalency table:

The correspondence is defined in JIS X0208:1997, the Kanji mapping is described in Appendix 6. Column 1 of Table 2 of Appendix 5 lists some variation of punctuation, and the names given in Appendix 5 are preferred to those in Appendix 4, when available.

In computer readable formats several variations exist. An obsolete variation is available at:

http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT
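
One frequently cited instance of such variation in machine-readable tables, offered here as an illustration rather than something raised in this thread, is the wave dash at 0x8160. The sketch below compares how Python's `shift_jis` and `cp932` codecs map that byte pair; both the choice of codecs and the helper name are assumptions for this example.

```python
WAVE_DASH_BYTES = b"\x81\x60"  # JIS X 0208 row 1, cell 33


def code_point(data: bytes, codec: str) -> str:
    """Decode one character and return its code point in U+XXXX form."""
    return f"U+{ord(data.decode(codec)):04X}"


if __name__ == "__main__":
    for codec in ("shift_jis", "cp932"):
        print(codec, code_point(WAVE_DASH_BYTES, codec))
```

In CPython the first codec yields WAVE DASH and the second FULLWIDTH TILDE for the same two bytes, so a round trip through mismatched tables can fail even for characters inside JIS X 0208.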

Additional information:

This charset is an extension of csHalfWidthKatakana by adding graphic characters in JIS X 0208. The CCS's are JIS X0201:1997 and JIS X0208:1997.

Several vendor specific charsets that derive from shift_jis often use the shift_jis name instead of a more specific vendor charset name. Windows-31J and ??? are examples.

Person & email address to contact for further information:
Japanese Industrial Standards Committee
http://www.jisc.go.jp/eng/index.html

Intended usage: LIMITED USE
NARUSE, Yui
2010-11-16 14:39:10 UTC
Post by MURATA Makoto
Several vendor specific charsets that derive from shift_jis often use
the shift_jis name instead of a more specific vendor charset name.
Windows-31J and ??? are examples.
If you want another example of a variant of Shift-JIS,
MacJapanese is the one by Apple.
http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
--
NARUSE, Yui <***@airemix.jp>
Shawn Steele
2010-11-18 21:53:48 UTC
Permalink
There were some comments, but they didn't seem to impact the idea here very much.

I've updated the Windows-31J with a note about the 0x5c behavior.

I also added MacJapanese and Java SJIS to the variations comment for shift_jis, but they aren't registered.

Any further comments? I'd like to submit this for the 2 week review and then submit them. I am not ignoring Anne's request to "solve the problem here", but I think that's a more complex issue, and that these changes proposed below would more easily address some of the common confusion in this area. I'm not opposed to someone following up on a better/more complete fix, but I think it'll be hard to get consensus.

-Shawn

--------------------------------------------------------------------------------


Charset name: Windows-31J
Charset aliases: csWindows31J
MIBenum: 2024

Suitability for use in MIME text:

Yes, Windows-31J is suitable for use with subtypes of the "text" Content-Type. Note that Windows-31J is an 8-bit charset. Care should be taken to choose an appropriate Content-Transfer-Encoding.

Published specification(s):

http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx

ISO 10646 equivalency table:

http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

Additional information:

Windows Japanese. A variant of Shift_JIS to include NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119). The CCS's are JIS X0201:1997, JIS X0208:1997, and these extensions. Windows-31J text is commonly declared with the shift_jis name of the parent charset, and the Windows-31J name may not be recognized.

In practice 0x5C in Windows-31J is mapped to U+005C in Unicode, but usually displayed as a yen sign glyph.

Person & email address to contact for further information:

Shawn Steele
Email: ***@microsoft.com

Microsoft Corporation
One Microsoft Way,
Redmond, WA 98052
U.S.A.

Intended usage: LIMITED USE

--------------------------------------------------------------------------------

Charset name: Shift_JIS

MIBenum: 17

Charset aliases: MS_Kanji and csShiftJIS

Suitability for use in MIME text:
This charset can be used for the top-level media type "text".

Published specification(s): Appendix 1 of JIS X0208:1997.

ISO 10646 equivalency table:

The correspondence is defined in JIS X0208:1997, the Kanji mapping is described in Appendix 6. Column 1 of Table 2 of Appendix 5 lists some variation of punctuation, and the names given in Appendix 5 are preferred to those in Appendix 4, when available.

In computer readable formats several variations exist. An obsolete variation is available at:

http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

Additional information:

This charset is an extension of csHalfWidthKatakana by adding graphic characters in JIS X 0208. The CCS's are JIS X0201:1997 and JIS X0208:1997.

Several vendor-specific charsets that derive from shift_jis often use the shift_jis name instead of a more specific vendor charset name. Windows-31J is one example; MacJapanese and Java SJIS are others. A common variation is to convert shift_jis 0x5c to U+005C in Unicode, but display it as the yen sign; Windows-31J is an example of this.

Person & email address to contact for further information:
Japanese Industrial Standards Committee
http://www.jisc.go.jp/eng/index.html

Intended usage: LIMITED USE
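
As a rough illustration of the difference between the two registrations above, the sketch below decodes one byte sequence from the NEC special characters row under both charsets. It assumes that CPython's bundled "cp932" codec is a reasonable stand-in for Windows-31J and that its "shift_jis" codec is a stand-in for the stricter JIS X 0208-based charset; the helper name try_decode is hypothetical.

```python
from typing import Optional

# Assumption: CPython's "cp932" codec approximates Windows-31J, and its
# "shift_jis" codec approximates the stricter registered Shift_JIS.
NEC_ROW_13 = b"\x87\x40"  # CIRCLED DIGIT ONE in the NEC special characters row


def try_decode(data: bytes, codec: str) -> Optional[str]:
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# The vendor extension decodes under cp932 but is rejected by shift_jis.
print(try_decode(NEC_ROW_13, "cp932"))
print(try_decode(NEC_ROW_13, "shift_jis"))

# 0x5C is mapped to U+005C under cp932; how it is displayed is a font matter.
print(try_decode(b"\x5c", "cp932") == "\u005c")
```

Text that is labelled shift_jis but was produced on Windows can therefore fail on a decoder that implements only the stricter charset.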
MURATA Makoto
2010-11-21 02:30:27 UTC
Permalink
Post by Shawn Steele
I am not ignoring Anne's request to "solve the problem here", but I think that's a more complex issue,
and that these changes proposed below would more easily address some of
the common confusion in this area. I'm not opposed to someone following
up on a better/more complete fix, but I think it'll be hard to get
consensus.
I agree. We already have serious interoperability problems, and
a few paragraphs in the IANA registration will not solve them
at this stage of the game. (I gave up a long time ago.) All we can do
is to admit the reality and document it well using the right terminology.

Cheers,
--
MURATA Makoto <***@hokkaido.email.ne.jp>
Shawn Steele
2010-11-30 01:29:44 UTC
Permalink
2 weeks included a holiday in the US, so I'm just pinging. Barring objections, I'll submit these updates on Friday.

-Shawn

Shawn Steele
2010-12-03 21:43:10 UTC
Permalink
Submitted....

Shawn Steele
2011-03-07 22:57:36 UTC
Permalink
I'm uncertain what the status is on these submissions... Tickets [IANA #411717] and [IANA #411716]

- Shawn

MURATA Makoto
2011-04-06 02:11:26 UTC
Permalink
Post by Shawn Steele
I'm uncertain what the status is on these submissions... Tickets [IANA #411717] and [IANA #411716]
I also think that the submitted changes should be registered.

Regards,
--
MURATA Makoto <***@hokkaido.email.ne.jp>
Shawn Steele
2011-04-06 03:18:17 UTC
Permalink
Ned or Martin, I guess it's up to you guys to look at?

-Shawn

 
http://blogs.msdn.com/shawnste


Shawn Steele
2011-05-19 16:51:54 UTC
Permalink
Ping... It is approaching 1/2 year since this was submitted (3 December 2010). I assume Ned or Martin have to do something next. There's been no further discussion or feedback, so I assume this has been overlooked? (Or maybe the Klingon in my signature earlier triggered a spam filter? ;-0)

The required consensus was achieved in the ietf-charsets mailing list last year, so I'm uncertain if there's anything else I can do to facilitate these updates. "Last time", if I remember right, this process was fairly responsive.

Thanks,

- Shawn

Anne van Kesteren
2010-11-11 11:45:29 UTC
Permalink
Post by NARUSE, Yui
I object to creating a new alias name.
Agreed. These are legacy encodings. We should be trying to contain them
and fix the interoperability problems, not make the problems worse.
--
Anne van Kesteren
http://annevankesteren.nl/
Shawn Steele
2010-11-11 17:26:15 UTC
Permalink
My incentive is that I'm being requested to use the windows-31J name in our documentation since that can be interpreted as being "correct" per the standard. This request is being made by people that seek interoperability. However windows does not recognize the windows-31J name, and changing everything would break all of the existing uses of this code page. Eg: A "fixed" system would try to use windows-31J for their web page, and all the un-updated systems would fail to read it.

So if someone has any other suggestions on how to record that, in practice, shift_jis doesn't always give you exactly the shift_jis behavior on all systems, I'd love to hear it :)

I certainly agree that these are legacy encodings. However, for that exact reason it's nearly impossible to "fix" the behavioral discrepancies between our products and the standards definitions, because actually changing the encoding, or the encoding that the name points to, would break millions of documents for millions of users.

-Shawn

 
http://blogs.msdn.com/shawnste


Anne van Kesteren
2010-11-12 10:26:28 UTC
Permalink
On Thu, 11 Nov 2010 18:26:15 +0100, Shawn Steele
Post by Shawn Steele
I certainly agree that these are legacy encodings, however for that
exact reason it's nearly impossible to "fix" the behavioral
discrepancies between our products and the standards definitions because
actually changing the encoding, or the encoding that the name points to
would break millions of documents for millions of users.
Is that really true? shift_jis means "windows-31J" on Windows. For web
browsers it means that. "windows-31J" is also a superset, no? So what
would break?
--
Anne van Kesteren
http://annevankesteren.nl/
Shawn Steele
2010-11-12 18:42:26 UTC
Permalink
Post by Anne van Kesteren
Is that really true? shift_jis means "windows-31J" on Windows. For web browsers it means that. "windows-31J" is also a superset, no?
I think that Windows-31J is mostly a superset. (I haven't compared code points, but given the variation of encoding implementations I wouldn't be surprised if there were other differences.)
Post by Anne van Kesteren
So what would break?
?? If we restricted "shift_jis" to mean only the shift_jis and not windows-31J code points, then every document written in Windows and tagged "shift_jis" that contained those code points would rather suddenly fail to convert those characters on "fixed" systems. That's probably millions of documents.

If we tagged "new" (or updated) content as "windows-31J", then systems that had not been "fixed" would not be able to read the data because we don't recognize the name. It would probably be a tiny bit less breaking if we tagged new content as "csWindows-31J", but it'd still break other places that weren't used to expecting it. That would probably include every ASP.Net server and their clients serving shift_jis/windows-31J content.

So changing the meaning of the names to windows, mlang, and .Net is pretty much a non-starter.
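
One mapping difference of the "mostly a superset" kind can be observed with CPython's bundled codecs. This is only a sketch: it assumes the "cp932" codec approximates Windows-31J and the "shift_jis" codec approximates the stricter charset, and other implementations may well differ. The helper name code_points is hypothetical.

```python
# Assumption: CPython's "cp932" and "shift_jis" codecs stand in for
# Windows-31J and the stricter Shift_JIS respectively.
WAVE_BYTES = b"\x81\x60"  # row 1 punctuation that both charsets define


def code_points(text: str) -> list:
    """Format each character of text as a U+XXXX string."""
    return [f"U+{ord(ch):04X}" for ch in text]


as_shift_jis = WAVE_BYTES.decode("shift_jis")
as_cp932 = WAVE_BYTES.decode("cp932")

# The same two bytes decode to different Unicode characters.
print(code_points(as_shift_jis))  # ['U+301C'] WAVE DASH
print(code_points(as_cp932))      # ['U+FF5E'] FULLWIDTH TILDE
```

So even bytes that are valid in both charsets do not always round-trip to the same Unicode text, independent of the vendor extension rows.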
Anne van Kesteren
2010-11-16 13:47:13 UTC
Permalink
On Fri, 12 Nov 2010 19:42:26 +0100, Shawn Steele
Post by Shawn Steele
Post by Anne van Kesteren
Is that really true? shift_jis means "windows-31J" on Windows. For web
browsers it means that. "windows-31J" is also a superset, no?
I think that Windows-31J is mostly a superset. (I haven't compared code
points, but given the variation of encoding implementations I wouldn't
be surprised if there were other differences.)
Post by Anne van Kesteren
So what would break?
?? If we restricted "shift_jis" to mean only the shift_jis and not
windows-31J code points, then every document written in windows tagged
"shift_jis" that contained those code points would rather suddenly fail
to convert those characters on "fixed" systems. That's probably
millions of documents.
If we tagged "new" (or updated) content as "windows-31J", then systems
that had not been "fixed" would not be able to read the data because we
don't recognize the name. Probably a tiny bit less breaking if we
tagged new content as "csWindows-31J", but it'd still break other places
that weren't used to expecting it. That'd include probably every
ASP.Net server and their clients serving shift_jis/windows-31J content.
So changing the meaning of the names to windows, mlang, and .Net is
pretty much a non-starter.
Agreed. I was suggesting we change the registry to match the meaning of
the names as used on the Web/Windows/etc. Sorry for the confusion.
--
Anne van Kesteren
http://annevankesteren.nl/
Shawn Steele
2010-11-16 14:30:26 UTC
Permalink
Does the updated version sent Friday work for you?

I think completely pointing shift_jis to the windows 932 behavior would maybe break others, but I don't know for sure.

- Shawn

NARUSE, Yui
2010-11-16 14:36:59 UTC
Permalink
Post by Shawn Steele
Does the updated version sent Friday work for you?
I think completely pointing shift_jis to the windows 932 behavior
would maybe break others, but I don't know for sure.
Java's Shift_JIS (SJIS) is different from Windows-31J (MS932).
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4556882

see also "Ambiguities in conversion from Shift-JIS to Unicode"
http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen
--
NARUSE, Yui <***@airemix.jp>
Anne van Kesteren
2010-11-16 14:57:49 UTC
Permalink
On Tue, 16 Nov 2010 15:30:26 +0100, Shawn Steele
Post by Shawn Steele
Does the updated version sent Friday work for you?
I think completely pointing shift_jis to the windows 932 behavior would
maybe break others, but I don't know for sure.
I do not think the way that was phrased will ever give us good
interoperability or will give a clue to new web browsers what they need to
implement.

Though maybe to address that a separate registry is needed just for web
browsers -- as I suggested in the past -- to avoid clashing with others
who do not wish to update their code to match.
--
Anne van Kesteren
http://annevankesteren.nl/
Shawn Steele
2010-11-16 18:03:18 UTC
Permalink
Problem is that there are 4+ implementations of shift_jis in "common" use, and none of them are likely to change, since it'd break their customers. :(

So I don't see a perfect solution here. HTML5 is fairly clear about browser behavior, but in other environments, I think the best we can do is point to the variants and allow the clients to decide which version they'd like to use.

-Shawn
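
To make "allow the clients to decide" concrete, here is a minimal sketch of a label resolver with two policies. The tables, the function name resolve_label, and the web_compatible flag are all hypothetical; the only grounded parts are the registered names and aliases from this thread and the observation that Windows and web browsers treat shift_jis as the code page 932 behavior. CPython's "cp932" codec is assumed to approximate that behavior.

```python
import codecs

# Hypothetical policy tables keyed by lower-cased charset label.
# Web-compatible: every label resolves to the code page 932 behavior.
WEB_COMPATIBLE = {
    "shift_jis": "cp932",
    "ms_kanji": "cp932",
    "csshiftjis": "cp932",
    "windows-31j": "cp932",
    "cswindows31j": "cp932",
}
# Strict: the shift_jis names keep the narrower charset.
STRICT = {
    **WEB_COMPATIBLE,
    "shift_jis": "shift_jis",
    "ms_kanji": "shift_jis",
    "csshiftjis": "shift_jis",
}


def resolve_label(label: str, web_compatible: bool = True) -> str:
    """Map a declared charset label to the name of a Python codec."""
    key = label.strip().lower()
    table = WEB_COMPATIBLE if web_compatible else STRICT
    if key not in table:
        raise LookupError(f"unknown charset label: {label!r}")
    # codecs.lookup confirms the codec exists and returns its canonical name.
    return codecs.lookup(table[key]).name


print(resolve_label("Shift_JIS"))                        # cp932
print(resolve_label("Shift_JIS", web_compatible=False))  # shift_jis
```

A client that mostly reads content produced on Windows or the Web would presumably pick the web-compatible table, while one exchanging data with a strict implementation would pick the other.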

Anne van Kesteren
2010-11-16 22:14:09 UTC
Permalink
On Tue, 16 Nov 2010 19:03:18 +0100, Shawn Steele
Post by Shawn Steele
Problem is that there are 4+ implementations of shift_jis in "common"
use, and none of them are likely to change, since it'd break their
customers. :(
It would only break if the content they consumed contained code points
that mapped to invalid characters and they relied on that, though. And the
content was not relying on being mapped to the superset mapping of
Windows-31J instead, which seems far more likely given the dominance of
the Web and Windows.
Post by Shawn Steele
So I don't see a perfect solution here. HTML5 is fairly clear about
browser behavior, but in other environments, I think the best we can do
is point to the variants and allow the clients to decide which version
they'd like to use.
HTML5 is unfortunately only clear about HTML. That it also happens for
CSS, JavaScript, text/plain, etc. is not made at all clear. It's a
temporary hack to illustrate a problem that really ought to be solved here.
--
Anne van Kesteren
http://annevankesteren.nl/
Bjoern Hoehrmann
2010-11-16 23:00:04 UTC
Permalink
Post by Anne van Kesteren
It would only break if the content they consumed contained code points
that mapped to invalid characters and they relied on that, though. And the
content was not relying on being mapped to the superset mapping of
Windows-31J instead, which seems far more likely given the dominance of
the Web and Windows.
The prime example problem with shift_jis is the ambiguity of the octet
0x5C which maps to a backslash for some and to the yen sign for others.
As far as I am aware, 0x5C is not invalid and this particular problem
is not a matter of supersets and subsets, you get 0x5C and you do not
know whether you should interpret it as yen sign or backslash. And it's
not going to change, systems built around one interpretation will use
that interpretation, systems built around the other interpretation will
stick with their interpretation as well. If you have two web services
that exchange data they may be running on Windows and on the Web, but
they may not be using the Windows/Browser/whatever interpretation.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
NARUSE, Yui
2010-11-16 23:12:18 UTC
Permalink
Post by Anne van Kesteren
It would only break if the content they consumed contained code points
that mapped to invalid characters and they relied on that, though. And the
content was not relying on being mapped to the superset mapping of
Windows-31J instead, which seems far more likely given the dominance of
the Web and Windows.
The prime example problem with shift_jis is the ambiguity of the octet
0x5C which maps to a backslash for some and to the yen sign for others.
As far as I am aware, 0x5C is not invalid and this particular problem
is not a matter of supersets and subsets, you get 0x5C and you do not
know whether you should interpret it as yen sign or backslash. And it's
not going to change, systems built around one interpretation will use
that interpretation, systems built around the other interpretation will
stick with their interpretation as well. If you have two web services
that exchange data they may be running on Windows and on the Web, but
they may not be using the Windows/Browser/whatever interpretation.
In practice, 0x5C in Shift-JIS is U+005C but yen sign glyph.

see also https://bugs.webkit.org/show_bug.cgi?id=24906
--
NARUSE, Yui <***@airemix.jp>
Bjoern Hoehrmann
2010-11-16 23:31:29 UTC
Permalink
Post by NARUSE, Yui
In practice, 0x5C in Shift-JIS is U+005C but yen sign glyph.
see also https://bugs.webkit.org/show_bug.cgi?id=24906
Well, that depends on what your favourite version of reality is. The
Unicode Consortium, for instance, published a (now obsolete) mapping
table for shift_jis which mapped 0x5C to U+00A5, and there is software
out there that uses that mapping. I don't doubt there are also fonts
that simply map U+005C to a yen sign glyph, giving only an appearance
of incorrect or otherwise surprising mappings.
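The two readings of 0x5C can be illustrated with a small Python sketch. The override tables and the function name decode_low_range are made up for this illustration and do not come from any real codec; the only facts assumed are that JIS X 0201 Roman places the yen sign at 0x5C and the overline at 0x7E, where ASCII has backslash and tilde.

```python
# Hypothetical override tables for the 0x00-0x7F range (illustration only).
ASCII_READING = {}                                  # every byte maps to itself
JIS_ROMAN_READING = {0x5C: 0x00A5, 0x7E: 0x203E}    # yen sign, overline


def decode_low_range(data: bytes, overrides: dict) -> str:
    """Decode 7-bit bytes, applying the per-byte overrides of one reading."""
    return "".join(chr(overrides.get(byte, byte)) for byte in data)


raw = b"C:\\temp"  # the third octet is 0x5C

as_backslash = decode_low_range(raw, ASCII_READING)
as_yen = decode_low_range(raw, JIS_ROMAN_READING)

print(as_backslash)            # backslash reading
print(as_yen)                  # yen sign reading
print(as_backslash == as_yen)  # same octets, two different strings
```

Nothing in the octets themselves says which table the sender had in mind, which is why the receiver cannot settle the question by inspecting the data.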
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Martin J. Dürst
2010-11-17 01:17:26 UTC
Permalink
Post by NARUSE, Yui
It would only break if the content they consumed contained code points
that mapped to invalid characters and they relied on that, though. And the
content was not relying on being mapped to the superset mapping of
Windows-31J instead, which seems far more likely given the dominance of
the Web and Windows.
The prime example problem with shift_jis is the ambiguity of the octet
0x5C which maps to a backslash for some and to the yen sign for others.
As far as I am aware, 0x5C is not invalid and this particular problem
is not a matter of supersets and subsets, you get 0x5C and you do not
know whether you should interpret it as yen sign or backslash. And it's
not going to change, systems built around one interpretation will use
that interpretation, systems built around the other interpretation will
stick with their interpretation as well. If you have two web services
that exchange data they may be running on Windows and on the Web, but
they may not be using the Windows/Browser/whatever interpretation.
In practice, 0x5C in Shift-JIS is U+005C but yen sign glyph.
Well, but first, please note that it's Shift_JIS, not Shift-JIS (it's
very easy to make that mistake).

As another example, with gcc on cygwin, in Shift_JIS 0x5C is mapped to
something other than U+005C. If you want an SJIS program to compile, you
have to say something like

gcc -finput-charset=CP932

If you say gcc -finput-charset=Shift_JIS, all your \n and similar
escapes produce errors.

That's different in Ruby, where we always map the 0x00-0x7F range
straight, so that things work out on a syntax level. (I guess we
wouldn't do that for ISO 646 related encodings, of course.) But there is
still a difference between Shift_JIS (no private use characters) and
Windows-31J (Microsoft). It's actually quite strict, in that if one
string is labeled Shift_JIS and the other is labeled Windows-31J, they
don't compare as equal even if their contents are the same.
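As a rough illustration of how the two charsets differ beyond 0x5C, here is a minimal Python sketch. It uses Python's built-in "shift_jis" and "cp932" codecs as stand-ins for the JIS X 0208 based charset and the Windows variant; that pairing is an assumption made for this example, it says nothing about how Ruby or gcc behave, and the helper name try_decode is made up.

```python
# Python's "shift_jis" and "cp932" codecs are used here as stand-ins for
# the JIS X 0208 based charset and the Windows variant respectively.

def try_decode(data: bytes, codec: str):
    """Return the decoded string, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


nec_row_13 = b"\x87\x40"   # NEC special character (row 13), CIRCLED DIGIT ONE
wave_dash = b"\x81\x60"    # defined in both, but mapped differently

for sample in (nec_row_13, wave_dash):
    for codec in ("shift_jis", "cp932"):
        decoded = try_decode(sample, codec)
        shown = None if decoded is None else f"U+{ord(decoded):04X}"
        print(sample.hex(), codec, shown)
```

The first sample exists only in the extended charset, while the second is accepted by both codecs but comes out as a different code point, so the same octets labeled one way or the other are not interchangeable.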

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Shawn Steele
2010-11-16 23:05:30 UTC
Permalink
HTML5 is unfortunately only clear about HTML. That it also happens for CSS,
JavaScript, text/plain, etc. is not made at all clear. It's a temporary hack to
illustrate a problem that really ought to be solved here.
I'm guessing that solving it here will take a while. Can we define the current disparity like proposed and then do
Martin J. Dürst
2010-11-17 00:27:04 UTC
Permalink
I agree with Shawn here. I think what we discussed earlier was that if
there is an encoding label A (that would be Shift_JIS here) that in
common browser usage is actually interpreted as encoding B (that would
be Windows-31J here), then we would make sure that the charset registry
contained an entry for B. The "in a browser context, use B for A" would
then be in the HTML5 spec or somewhere close.
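The "in a browser context, use B for A" rule could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the table BROWSER_OVERRIDES and the function resolve_label are hypothetical names, and the single entry reflects just the Shift_JIS / Windows-31J case discussed here, not any published specification.

```python
# Hypothetical override table: label A -> encoding B used in a browser context.
# Only the case discussed in this thread is listed.
BROWSER_OVERRIDES = {"shift_jis": "windows-31j"}


def resolve_label(label: str, browser_context: bool) -> str:
    """Normalize a charset label and apply the browser override, if any."""
    normalized = label.strip().lower()
    if browser_context:
        return BROWSER_OVERRIDES.get(normalized, normalized)
    return normalized


print(resolve_label("Shift_JIS", browser_context=True))
print(resolve_label("Shift_JIS", browser_context=False))
print(resolve_label("UTF-8", browser_context=True))
```

Keeping the override in a separate table leaves the registry entry for the label itself untouched, which matches the idea that the registry only needs to contain an entry for B.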

In the current case, we already have B registered, and Shawn is just
working on getting the relationships cleared up. If there are other
cases where B isn't registered, then I hope Anne can help us get
these registered.

I think creating a separate registry for HTML5 doesn't make that much
sense, because there are as far as I know only very few cases with
exceptions.

In the long term, maybe introducing the concept of variants (having a
label such as Shift_JIS--windows or so) might help, but from the current
discussion, it doesn't look like there is too much support for that.

Regards, Martin.
Problem is that there are 4+ implementations of shift_jis in "common" use, and none of them are likely to change, since it'd break their customers. :(
So I don't see a perfect solution here. HTML5 is fairly clear about browser behavior, but in other environments, I think the best we can do is point to the variants and allow the clients to decide which version they'd like to use.
-Shawn
-----Original Message-----
Sent: Tuesday, November 16, 2010 6:58 AM
To: Shawn Steele
Subject: Re: shift_jis / windows-31J
Post by Shawn Steele
Does the updated version sent Friday work for you?
I think completely pointing shift_jis to the windows 932 behavior would
maybe break others, but I don't know for sure.
I do not think the way that was phrased will ever give us good
interoperability or will give a clue to new web browsers what they need to
implement.
Though maybe to address that a separate registry is needed just for web
browsers -- as I suggested in the past -- to avoid clashing with others
who do not wish to update their code to match.
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Ned Freed
2010-11-17 00:50:26 UTC
Permalink
I agree with Shawn here. I think what we discussed earlier was that if
there is an encoding label A (that would be Shift_JIS here) that in
common browser usage is actually interpreted as encoding B (that would
be Windows-31J here), then we would make sure that the charset registry
contained an entry for B. The "in a browser context, use B for A" would
then be in the HTML5 spec or somewhere close.
I think this is necessary but not sufficient. All of the charsets that
currently operate under the name "shift_jis" need to have their own
registrations, but once that's done the registration of "shift_jis" itself
needs to be updated to explain the ambiguity. I also think the "shift_jis"
registration should state that due to the ambiguities, use of the "shift_jis"
label is not recommended.
In the current case, we already have B registered, and Shawn is just
working on getting the relationships cleared up. If there are other
cases where B isn't registered, then I hope Anne can help us get
these registered.
I think creating a separate registry for HTML5 doesn't make that much
sense, because there are as far as I know only very few cases with
exceptions.
I don't see how it makes any sense either, but for different reasons - I have
to say I regard these supposed boundaries between different usages as largely
nonexistent. Stuff always leaks. Always.

Ned
Bjoern Hoehrmann
2010-11-17 01:15:07 UTC
Permalink
I think this is necessary but not sufficient. All of the charsets that
currently operate under the name "shift_jis" need to have their own
registrations, but once that's done the registration of "shift_jis" itself
needs to be updated to explain the ambiguity. I also think the "shift_jis"
registration should state that due to the ambiguities, use of the "shift_jis"
label is not recommended.
I don't see how it makes any sense either, but for different reasons - I have
to say I regard these supposed boundaries between different usages as largely
nonexistent. Stuff always leaks. Always.
I agree with this. I think each of the variants should have a label, but
the label need not necessarily be listed as "Alias" as we currently use
the field, and obviously for some it would make sense to clearly note
they exist mostly for documentary purposes (for instance, so I can put
in the documentation of my software I handle "shift_jis" as defined for
"made-up-label"). And I do not see much value in identifying exactly
where some variant is used a lot, whether that is "Microsoft products"
or "Web browsers" and so on, that's something better stated in other
places, like the documentation for web browsers or summary documents.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Shawn Steele
2010-11-17 05:03:32 UTC
Permalink
At this point I don't think it's interesting to say "don't use the name shift_jis". It would be much more productive to switch to utf-8 than to migrate all the uses of the name


- Shawn

Sent from my Windows Phone
Post by Ned Freed
I agree with Shawn here. I think what we discussed earlier was that if
there is an encoding label A (that would be Shift_JIS here) that in
common browser usage is actually interpreted as encoding B (that would
be Windows-31J here), then we would make sure that the charset registry
contained an entry for B. The "in a browser context, use B for A" would
then be in the HTML5 spec or somewhere close.
I think this is necessary but not sufficient. All of the charsets that
currently operate under the name "shift_jis" need to have their own
registrations, but once that's done the registration of "shift_jis" itself
needs to be updated to explain the ambiguity. I also think the "shift_jis"
registration should state that due to the ambiguities, use of the "shift_jis"
label is not recommended.
In the current case, we already have B registered, and Shawn is just
working on getting the relationships cleared up. If there are other
cases where B isn't registered, then I hope Anne can help us get
these registered.
I think creating a separate registry for HTML5 doesn't make that much
sense, because there are as far as I know only very few cases with
exceptions.
I don't see how it makes any sense either, but for different reasons - I have
to say I regard these supposed boundaries between different usages as largely
nonexistent. Stuff always leaks. Always.
Ned
MURATA Makoto
2010-11-16 21:29:49 UTC
Permalink
Post by Shawn Steele
I think completely pointing shift_jis to the windows 932 behavior would maybe break others, but I don't know for sure.
I remember that at least one XML parser (namely Crimson) uses "shift_jis" as
defined in JIS X 0208. Since Java programs work on more than one
platform, some of them probably stick to JIS X 0208 rather than
using Windows-31J.

Cheers,
--
MURATA Makoto <***@hokkaido.email.ne.jp>