Discussion:
Registration of new charset CP50220
NARUSE Yui
2010-04-21 12:25:47 UTC
Charset name: CP50220

Charset aliases: csCP50220

Suitability for use in MIME text:

Yes, CP50220 is suitable for use with subtypes of the "text" Content-Type.

Since CP50220 is a 7-bit encoding, no Content-Transfer-Encoding is needed.
Base64 or Quoted-Printable encoding MAY break this encoding.

Published specification(s):

CP50220 consists of the following character sets:

reg#  character set         ESC sequence   designated to
------------------------------------------------------
6     US-ASCII              ESC ( B        G0
13    JIS X 0201-Katakana   ESC ( I        G0
14    JIS X 0201-Roman      ESC ( J        G0
42    JIS X 0208-1978       ESC $ @        G0
87    JIS X 0208-1983       ESC $ B        G0
13    JIS X 0201-Katakana   ESC ) I        G1

reg#  character set         shift in with  designated to
------------------------------------------------------
6     US-ASCII              SI             G0
13    JIS X 0201-Katakana   SO             G0

* The beginning of a text is assumed to have "ESC ( B ESC ) I".
* Each line of CP50220 text MUST end with ASCII.
* There are two kinds of shifts: SI and SO. Shift functions
  specify how to interpret the subsequent bytes.
* The shift SI (one byte with hexadecimal value 0F) declares that
  subsequent bytes are interpreted in US-ASCII.
* The shift SO (one byte with hexadecimal value 0E) declares that
  subsequent bytes are interpreted in JIS X 0201 Katakana.
* On receiving, JIS X 0201-Katakana characters MAY be encoded as
  * GL with the escape sequence: ESC ( I
  * GL with the shifts: SI / SO
  * GR
* On sending, JIS X 0201-Katakana MUST be converted to the related
  characters of JIS X 0208.
* The character set of CP50220 is based on Windows Codepage 932,
  so refer to it for the meaning and Unicode mapping of each character:
  http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
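
The three receiving forms for JIS X 0201-Katakana listed above can be sketched as follows. This is only an illustration under stated assumptions, not the behavior of any particular implementation: the function name is hypothetical, only ESC ( B, ESC ( I, SO, SI and GR bytes are handled, and JIS X 0208 and JIS X 0201-Roman are left out on purpose. It relies on the fact that JIS X 0201-Katakana codes 21 to 5F (hex) correspond to the Unicode halfwidth forms U+FF61 to U+FF9F.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F


def decode_katakana_forms(data: bytes) -> str:
    """Decode ASCII plus JIS X 0201-Katakana received in GL or GR form.

    Hypothetical helper for illustration; it ignores JIS X 0208 entirely.
    """
    out = []
    kana = False  # True while GL is interpreted as JIS X 0201-Katakana
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC and data[i + 1:i + 3] == b"(I":
            kana, i = True, i + 3      # GL with the escape sequence ESC ( I
        elif b == ESC and data[i + 1:i + 3] == b"(B":
            kana, i = False, i + 3     # back to US-ASCII
        elif b == SO:
            kana, i = True, i + 1      # GL with the shift SO
        elif b == SI:
            kana, i = False, i + 1     # shift SI returns to US-ASCII
        elif 0xA1 <= b <= 0xDF:
            out.append(chr(0xFF61 + b - 0xA1))   # GR (8-bit) form
            i += 1
        elif kana and 0x21 <= b <= 0x5F:
            out.append(chr(0xFF61 + b - 0x21))   # GL form
            i += 1
        elif b < 0x80:
            out.append(chr(b))                   # plain ASCII octet
            i += 1
        else:
            raise ValueError(f"unexpected octet 0x{b:02X} at offset {i}")
    return "".join(out)
```

All three inputs `b"\x1b(I1\x1b(B"`, `b"\x0e1\x0f"` and `b"\xb1"` decode to the same halfwidth katakana letter A (U+FF71) with this sketch.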

ISO 10646 equivalency table:

This charset belongs to the ISO/IEC 2022 family.
For the conversion of each character, refer to Windows Codepage 932:
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://icu-project.org/repos/icu/data/trunk/charset/data/ucm/windows-932-2000.ucm

Additional information:

This is a request for a new registration of this charset.

CP50220 is a variant of ISO-2022-JP (like Windows-31J and Shift_JIS).
This charset differs from ISO-2022-JP in the following ways:
* CP50220 supports JIS X 0201-Katakana
* CP50220 supports characters extended by Windows Codepage 932
* The Unicode mappings of some characters are different
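
Two of these differences can be observed with the codecs that ship with Python. This is a small sketch, not part of the registration: Python's standard library has, as far as I know, no CP50220 codec, so the `cp932` codec stands in for the Windows Codepage 932 mapping and `iso2022_jp` / `shift_jis` stand in for the plain JIS X 0208 mapping. The helper name is hypothetical.

```python
def decode_or_none(data: bytes, codec: str):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# JIS X 0208 row 1, cell 33 (the wave dash position) in two byte layouts.
wave_iso2022jp = b"\x1b$B!A\x1b(B"   # 7-bit form, code 0x2141
wave_shift_form = b"\x81\x60"        # same cell in the Shift_JIS layout

jis_style = decode_or_none(wave_iso2022jp, "iso2022_jp")
cp932_style = decode_or_none(wave_shift_form, "cp932")
print(f"U+{ord(jis_style):04X} vs U+{ord(cp932_style):04X}")

# An NEC special character (row 13, CIRCLED DIGIT ONE) is part of the
# Windows Codepage 932 extensions and is absent from plain JIS X 0208.
nec_one = b"\x87\x40"
print(decode_or_none(nec_one, "cp932"), decode_or_none(nec_one, "shift_jis"))
```

The first comparison shows the same character position mapping to two different Unicode code points; the second shows an extended character that only the Codepage 932 table knows.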

Typical users of CP50220 are web browsers. When web browsers load
a page which is declared or auto-detected as "ISO-2022-JP", they
don't interpret it as the true ISO-2022-JP registered in IANA Character
Sets but as CP50220. When they post form data as "ISO-2022-JP",
the data is also encoded as CP50220. Note that although csISO2022JP
is an alias of ISO-2022-JP in IANA Character Sets, on Windows it means
neither the registered ISO-2022-JP nor CP50220 but CP50221.

Other typical users are Japanese IRC networks. They sometimes send
JIS X 0201-Katakana encoded in GR (JIS8).

The name "CP50220" is in use in the following applications:
* Citrus iconv (used by NetBSD and DragonFly)
* Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
* nkf 2.0.5
* Encode-EUCJPMS-0.06

Moreover, applications which use MLang.DLL or the .NET Framework for
converting "ISO-2022-JP" implicitly use this charset.

So this charset is widely used but doesn't have its own name.

The name is not "Windows-50220" because some of the applications which
accept the name "CP50220" don't support the name "Windows-50220".

CP50220 is intended for communicating with legacy systems.
UTF-8 is preferred to CP50220 for new systems.

Related references are:

"Remarks" of "GetEncodings Method" of "System.Text"
http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx

"Introduction to JIS X0213 Implementation based on Unicode -
A new Japanese Language Processing Environment for Information Systems",
Nikkei BP Soft Press, ISBN 978-4891006082, 2008, pp. 17-18, 20, 120-158 (in Japanese)

CP50220 - Legacy Encoding Project
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp50220

This charset is also known as Windows Codepage 50220.

Person & email address to contact for further information:

NARUSE, Yui
Email: ***@ruby-lang.org

Intended usage: LIMITED USE
--
NARUSE, Yui
***@airemix.jp
Masatoshi Kimura
2010-04-21 16:21:23 UTC
Post by NARUSE Yui
13 JIS X 0201-Katakana ESC ) I G1
Are there any implementations which actually support this sequence?
At least the Mozilla implementation doesn't. It doesn't support SI/SO either.
Post by NARUSE Yui
Another typical user is Japanese IRC network. They sometimes send
JIS X 0201-Katakana encoded in GR (JIS8).
What does it mean? You said this charset is 7-bit encoding.
Post by NARUSE Yui
Since CP50220 is a 7-bit encoding, no Content-Transfer-Encoding is needed.
Base64 or Quoted-Printable encoding MAY break this encoding.
This charset belongs to the ISO/IEC 2022 family.
For these reasons, I don't think it's appropriate to treat this charset
as a member of the ISO/IEC 2022 family.
--
***@nifty.ne.jp
NARUSE Yui
2010-04-21 17:33:51 UTC
Post by Masatoshi Kimura
Post by NARUSE Yui
13 JIS X 0201-Katakana ESC ) I G1
Are there any implementations which actually support this sequence?
At least the Mozilla implementation doesn't. It doesn't support SI/SO either.
Internet Explorer and Opera support 8-bit Katakana.
Internet Explorer supports Shift-in Katakana.
http://coq.no/character-tables/mime/iso-2022/en
Post by Masatoshi Kimura
Post by NARUSE Yui
Another typical user is Japanese IRC network. They sometimes send
JIS X 0201-Katakana encoded in GR (JIS8).
What does it mean? You said this charset is 7-bit encoding.
Post by NARUSE Yui
Since CP50220 is a 7-bit encoding, no Content-Transfer-Encoding is needed.
Base64 or Quoted-Printable encoding MAY break this encoding.
This charset belongs to the ISO/IEC 2022 family.
For these reasons, I don't think it's appropriate to treat this charset
as a member of the ISO/IEC 2022 family.
Hmm, you are right. I have removed 8-bit and Shift-in Katakana.


And I added the following line:
* CP50220 supports JIS X 0201-Katakana
* CP50220 supports characters extended by Windows Codepage 932
+ (NEC special characters and NEC selection of IBM extensions)
* The Unicode mappings of some characters are different

----------
Charset name: CP50220

Charset aliases: csCP50220

Suitability for use in MIME text:

Yes, CP50220 is suitable for use with subtypes of the "text" Content-Type.

Since CP50220 is a 7-bit encoding, no Content-Transfer-Encoding is needed.
Base64 or Quoted-Printable encoding MAY break this encoding.
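
The 7-bit property can be checked mechanically. The sketch below is an illustration only: the helper name is hypothetical, and Python's `iso2022_jp` codec is used as a stand-in because it produces the same kind of escape-sequence output for JIS X 0208 text.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every octet has its high bit clear (fits a 7bit transfer)."""
    return all(octet < 0x80 for octet in data)


# JIS X 0208 text designated with ESC $ B stays within 7 bits.
escaped = "カタカナ".encode("iso2022_jp")

# JIS8-style halfwidth katakana in GR uses octets A1..DF and is not 7-bit.
gr_katakana = b"\xb6\xc0\xb6\xc5"

print(is_seven_bit(escaped), is_seven_bit(gr_katakana))
```

This is also why the GR form of katakana does not fit a registration that claims to be 7-bit.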

Published specification(s):

CP50220 consists of the following character sets:

reg#  character set         ESC sequence   designated to
------------------------------------------------------
6     US-ASCII              ESC ( B        G0
13    JIS X 0201-Katakana   ESC ( I        G0
14    JIS X 0201-Roman      ESC ( J        G0
42    JIS X 0208-1978       ESC $ @        G0
87    JIS X 0208-1983       ESC $ B        G0

* The beginning of a text is assumed to have "ESC ( B ESC ) I".
* Each line of CP50220 text MUST end with ASCII.
* On receiving, JIS X 0201-Katakana characters MAY be encoded
  with the escape sequence: ESC ( I.
* On sending, JIS X 0201-Katakana MUST be converted to the related
  characters of JIS X 0208.
* The character set of CP50220 is based on Windows Codepage 932,
  so refer to it for the meaning and Unicode mapping of each character:
  http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
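
The sending rule above (halfwidth katakana converted to the related JIS X 0208 characters) could be approximated like this. It is a sketch under assumptions, not the conversion any named implementation performs: the function names are hypothetical, Unicode NFKC normalization is used as the folding step, and Python's `iso2022_jp` codec does the final encoding, so characters that exist only in the Windows Codepage 932 extensions would be rejected by it.

```python
import re
import unicodedata

# Halfwidth katakana and related punctuation, U+FF61..U+FF9F.
HALFWIDTH_KATAKANA = re.compile("[\uff61-\uff9f]+")


def _fold_run(match: "re.Match[str]") -> str:
    folded = unicodedata.normalize("NFKC", match.group())
    # A sound mark that could not combine with its base is left as a
    # combining character by NFKC; swap it for the spacing mark, which
    # JIS X 0208 does contain.
    return folded.replace("\u3099", "\u309b").replace("\u309a", "\u309c")


def fold_halfwidth_katakana(text: str) -> str:
    """Replace runs of halfwidth katakana with fullwidth equivalents."""
    return HALFWIDTH_KATAKANA.sub(_fold_run, text)


def encode_for_sending(text: str) -> bytes:
    """Fold halfwidth katakana, then encode with ESC-designated JIS X 0208."""
    return fold_halfwidth_katakana(text).encode("iso2022_jp")


print(encode_for_sending("ab\uff71"))
```

Folding whole runs rather than single characters lets a halfwidth letter and a following voiced sound mark become one fullwidth letter.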

ISO 10646 equivalency table:

This charset belongs to the ISO/IEC 2022 family.
For the conversion of each character, refer to Windows Codepage 932:
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://icu-project.org/repos/icu/data/trunk/charset/data/ucm/windows-932-2000.ucm

Additional information:

This is a request for a new registration of this charset.

CP50220 is a variant of ISO-2022-JP (like Windows-31J and Shift_JIS).
This charset differs from ISO-2022-JP in the following ways:
* CP50220 supports JIS X 0201-Katakana
* CP50220 supports characters extended by Windows Codepage 932
  (NEC special characters and NEC selection of IBM extensions)
* The Unicode mappings of some characters are different

Typical users of CP50220 are web browsers. When web browsers load
a page which is declared or auto-detected as "ISO-2022-JP", they
don't interpret it as the true ISO-2022-JP registered in IANA Character
Sets but as CP50220. When they post form data as "ISO-2022-JP",
the data is also encoded as CP50220. Note that although csISO2022JP
is an alias of ISO-2022-JP in IANA Character Sets, on Windows it means
neither the registered ISO-2022-JP nor CP50220 but CP50221.

The name "CP50220" is in use in the following applications:
* Citrus iconv (used by NetBSD and DragonFly)
* Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
* nkf 2.0.5
* Encode-EUCJPMS-0.06

Moreover, applications which use MLang.DLL or the .NET Framework for
converting "ISO-2022-JP" implicitly use this charset.

So this charset is widely used but doesn't have its own name.

The name is not "Windows-50220" because some of the applications which
accept the name "CP50220" don't support the name "Windows-50220".

CP50220 is intended for communicating with legacy systems.
UTF-8 is preferred to CP50220 for new systems.

Related references are:

"Remarks" of "GetEncodings Method" of "System.Text"
http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx

"Introduction to JIS X0213 Implementation based on Unicode -
A new Japanese Language Processing Environment for Information Systems",
Nikkei BP Soft Press, ISBN 978-4891006082, 2008, pp. 17-18, 20, 120-158 (in Japanese)

CP50220 - Legacy Encoding Project
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp50220

This charset is also known as Windows Codepage 50220.

Person & email address to contact for further information:

NARUSE, Yui
Email: ***@ruby-lang.org

Intended usage: LIMITED USE
--
NARUSE, Yui
***@airemix.jp
NARUSE, Yui
2010-08-28 09:07:37 UTC
Hello,

I have updated the CP50220 proposal, following Martin's comments on CP51932.

Regards,

---
Charset name: CP50220

Charset aliases: csCP50220

Suitability for use in MIME text:

Yes, CP50220 is suitable for use with subtypes of the "text" Content-Type.

Since CP50220 is a 7-bit encoding, no Content-Transfer-Encoding is needed.
Base64 or Quoted-Printable encoding MAY break this encoding.

Published specification(s):

CP50220 consists of the following character sets:

reg#  character set         ESC sequence   designated to
------------------------------------------------------
6     US-ASCII              ESC ( B        G0
13    JIS X 0201-Katakana   ESC ( I        G0
14    JIS X 0201-Roman      ESC ( J        G0
42    JIS X 0208-1978       ESC $ @        G0
87    JIS X 0208-1983       ESC $ B        G0

* The beginning of a text is assumed to have "ESC ( B ESC ) I".
* Each line of CP50220 text MUST end with ASCII.
* On receiving, JIS X 0201-Katakana characters MAY be encoded
  with the escape sequence: ESC ( I.
* On sending, JIS X 0201-Katakana MUST be converted to the related
  characters of JIS X 0208.
* The character set of CP50220 is based on Windows Codepage 932,
  so refer to it for the meaning and Unicode mapping of each character:
  http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
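
To make the designation table and the "MUST end with ASCII" rule concrete, here is a minimal sketch that splits a byte string at the five escape sequences listed above and reports which character set is designated to G0 at the end. The names `DESIGNATIONS`, `split_by_designation` and `line_ends_with_ascii` are hypothetical, and the sketch assumes the initial G0 state is US-ASCII; unknown escape sequences are simply left inside the current segment.

```python
# Escape sequences from the table above, keyed by their three octets.
DESIGNATIONS = {
    b"\x1b(B": "US-ASCII",
    b"\x1b(I": "JIS X 0201-Katakana",
    b"\x1b(J": "JIS X 0201-Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}


def split_by_designation(data: bytes):
    """Return (segments, final_designation) for a CP50220-style byte string."""
    current = "US-ASCII"  # assumed initial G0 state
    segments = []
    start = i = 0
    while i < len(data):
        name = DESIGNATIONS.get(data[i:i + 3])
        if name is None:
            i += 1
            continue
        if i > start:
            segments.append((current, data[start:i]))
        current = name
        i += 3
        start = i
    if start < len(data):
        segments.append((current, data[start:]))
    return segments, current


def line_ends_with_ascii(line: bytes) -> bool:
    """Check the rule that each line must end with ASCII designated."""
    return split_by_designation(line)[1] == "US-ASCII"


segments, final = split_by_designation(b'abc\x1b$B%"\x1b(Bxyz')
print(segments, final)
```

A line such as `b'\x1b$B%"'` that never returns to ESC ( B fails the `line_ends_with_ascii` check.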

ISO 10646 equivalency table:

This charset belongs to the ISO/IEC 2022 family.
For the conversion of each character, refer to Windows Codepage 932:
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://icu-project.org/repos/icu/data/trunk/charset/data/ucm/windows-932-2000.ucm
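
Since the tables above are indexed by Codepage 932 (Shift_JIS-style) byte values, a decoder for the 7-bit form has to convert each two-byte JIS code to that layout first. The following is a sketch of one way to do so, assuming Python's `cp932` codec is an acceptable stand-in for the referenced tables; the function names are hypothetical.

```python
def jis_to_sjis(j1: int, j2: int) -> bytes:
    """Convert a 7-bit JIS two-byte code to the Shift_JIS byte layout."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("both octets must be in the range 0x21..0x7E")
    # Two JIS rows share one Shift_JIS lead byte.
    s1 = ((j1 - 0x21) >> 1) + 0x81
    if s1 > 0x9F:
        s1 += 0x40            # skip the range used by one-byte katakana
    if j1 & 1:                # odd row: first half of the trail byte range
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1           # 0x7F is never used as a trail byte
    else:                     # even row: second half
        s2 = j2 + 0x7E
    return bytes([s1, s2])


def jis_pair_to_unicode(j1: int, j2: int) -> str:
    """Map a JIS two-byte code to Unicode through the cp932 table."""
    return jis_to_sjis(j1, j2).decode("cp932")


print(jis_to_sjis(0x21, 0x41).hex(), jis_to_sjis(0x2D, 0x21).hex())
```

For example, JIS code 0x2141 becomes the Shift_JIS bytes 81 60, and the NEC row 13 code 0x2D21 becomes 87 40, which the `cp932` table can then resolve.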

Additional information:

CP50220 is a variant of ISO-2022-JP (like Windows-31J and Shift_JIS).
This charset differs from ISO-2022-JP in the following ways:
* CP50220 supports JIS X 0201-Katakana
* CP50220 supports characters extended by Windows Codepage 932
  (NEC special characters and NEC selection of IBM extensions)
* The Unicode mappings of some characters are different

Typical users of CP50220 are web browsers. When web browsers load
a page which is declared or auto-detected as "ISO-2022-JP", they
don't interpret it as the true ISO-2022-JP registered in IANA Character
Sets but as CP50220. When they post form data as "ISO-2022-JP",
the data is also encoded as CP50220. Note that although csISO2022JP
is an alias of ISO-2022-JP in IANA Character Sets, on Windows it means
neither the registered ISO-2022-JP nor CP50220 but CP50221.

The name "CP50220" is in use in the following applications:
* Citrus iconv (used by NetBSD and DragonFly)
* Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
* nkf 2.0.5
* Encode-EUCJPMS-0.06

Moreover, applications which use MLang.DLL or the .NET Framework for
converting "ISO-2022-JP" implicitly use this charset.

The name is not "Windows-50220" because some of the applications which
accept the name "CP50220" don't support the name "Windows-50220".

CP50220 is intended for communicating with legacy systems.
UTF-8 is preferred to CP50220 for new systems.

Related references are:

"Remarks" of "GetEncodings Method" of "System.Text"
http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx

"Introduction to JIS X0213 Implementation based on Unicode -
A new Japanese Language Processing Environment for Information Systems",
Nikkei BP Soft Press, ISBN 978-4891006082, 2008, pp. 17-18, 20, 120-158 (in Japanese)

CP50220 - Legacy Encoding Project
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp50220

This charset is also known as Windows Codepage 50220.

Person & email address to contact for further information:

NARUSE, Yui
Email: ***@ruby-lang.org

Intended usage: LIMITED USE
--
NARUSE, Yui <***@airemix.jp>
Shawn Steele
2010-08-30 19:35:06 UTC
50220, 50221 & 50222 are all very closely related. The "only" difference is in how they handle (or don't) halfwidth katakana. On Windows (which I assume is the interesting case since this is a Windows code page) the difference is whether they encode halfwidth katakana, and how. Decoding is identical (which might be most interesting for users of tagged content).

-Shawn

Masatoshi Kimura
2010-08-30 23:07:07 UTC
The purpose of this registration is to "standardize" how to handle
errors when Web browsers encounter illegal ISO-2022-JP sequences.
The Mozilla encoder has changed its halfwidth katakana handling to match
that behavior.
https://bugzilla.mozilla.org/show_bug.cgi?id=563283
Post by Shawn Steele
Decoding is identical (which might be most interesting for users
of tagged content).
The first version of the registration included all decoding methods
which are supported by CP50220 (i.e. ESC ( J, SI, and 8bit).
However, the latter two methods were removed from the registration for
two reasons.

1. Some implementations (e.g. Mozilla's) don't support them.
Should the Mozilla decoder be changed to match the behavior?

2. The charset is supposed to be 7-bit. It's strange to include
8-bit character handling.
Changing the registration to 8-bit is not a solution because it would
require the Content-Transfer-Encoding MIME header field. That is not
compatible with ISO-2022-JP. Old Microsoft Internet Mail/News had this bug.
Shawn Steele
2010-08-30 23:20:04 UTC
Windows, .Net & MLang aren't going to change the behavior of these code pages, it would break people. Instead we'd encourage customers to use UTF-8, particularly if they're having problems.

I was sort of assuming that since you're using the Windows nomenclature, you're attempting to pin down the behavior for some sort of interoperability when you see the Windows names. It is, perhaps, odd for the "7 bit" form to do something when it sees 8 bit data, but I was just letting you know that's what it does :) I'm sure there are also other subtle discrepancies between the 5022x behavior and the official standards, but we're pretty much stuck with the existing behavior.

If Mozilla were to target the Windows CP50220 behavior specifically (as opposed to the more general iso-2022-jp), then how exactly they wanted to follow that behavior would be up to them. If they thought that just mapping it to iso-2022-jp was acceptable and more convenient, then that would be their choice, same way we may iso-2022-jp to 50220 even though it isn't a perfect match.

-Shawn

Shawn Steele
2010-08-30 23:30:46 UTC
Corrected typo: "may" should be "map" (i.e. "same way we map iso-2022-jp to 50220").

Martin J. Dürst
2010-08-31 02:53:55 UTC
Two comments:

- If what we need (for HTML5, as far as I understand) isn't exactly
  what Windows software is doing, then we should not use the name
  CP50220 for the registration, but should come up with some other
  name. But the origin of strange provisions such as "treat content
  labeled as iso-8859-1 as if it were windows-1252" in HTML5 is
  "because IE did so". So the browsers might as well follow IE exactly,
  not just almost, in which case we could use the name CP50220.

- The charset registry currently has no way to express "On creation
  (encoding), limited to 'foo', but on interpretation (decoding), also
  take into account 'bar'.". RFC 2978 defines a 'charset' as "a method
  of converting a sequence of octets into a sequence of characters".
  We may be able to deal with this by adding comments, but maybe in the
  long term, this could be a change needed in an update to the RFC.

Regards, Martin.
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Shawn Steele
2010-08-31 04:40:21 UTC
I really don't see how this makes sense for HTML 5. HTML 5 apps should really be UTF-8. If this is for some completeness of code pages in an HTML5 world, people should really look at how practical those code pages are. Sure, there's lots of non-Unicode stuff out there, but presumably HTML 5 is new stuff, or at least with the opportunity to be converted at the authoring side, which would reduce the chance of cross-platform decoding error greatly.

IMO: this registry is interesting for handling existing content, not for streamlining new content. It's unclear to me how adding this to the registry adds much value to the end users, but if others find it useful, then I don't mind its inclusion. This isn't going to magically make some sort of perfect error-free decoding. "My" code (.Net & Windows) isn't even necessarily consistent throughout the years, and the deviations only get worse when you consider other platforms. People end up depending on bugs, and then get broken when the "bug" is fixed.

I don't know what the intent of this registration is, and I agree that the encoding / decoding difference might not be interesting here, I just thought it was worth mentioning the behavior :)

-Shawn

http://blogs.msdn.com/shawnste


Martin J. Dürst
2010-08-31 05:51:39 UTC
Hello Shawn,

HTML 5 has basically two aspects:

1) (the one you mention) New stuff (canvas, video (maybe),...)

2) Backwards-compatibility, so that it works with older pages,
including trying to be as detailed as possible a spec to
relieve implementers from reverse-engineering IE6

This registration is for 2), not for 1).

Regards, Martin.
I really don't see how this makes sense for HTML 5. HTML 5 apps should really be UTF-8. If this is for some completeness of code pages in an HTML5 world, people should really look at how practical those code pages are. Sure, there's lots of non-Unicode stuff out there, but presumably HTML 5 is new stuff, or at least with the opportunity to be converted at the authoring side, which would reduce the chance of cross-platform decoding error greatly.
IMO: this registry is interesting for handling existing content, not for streamlining new content. It's unclear to me how adding this to the registry adds much value to the end users, but if others find it useful, then I don't mind its inclusion. This isn't going to magically make some sort of perfect error-free decoding. "My" code (.Net & Windows) isn't even necessarily consistent throughout the years, and the deviations only get worse when you consider other platforms. People end up depending on bugs, and then get broken when the "bug" is fixed.
I don't know what the intent of this registration is, and I agree that the encoding / decoding difference might not be interesting here, I just thought it was worth mentioning the behavior :)
-Shawn
http://blogs.msdn.com/shawnste
________________________________________
Sent: Monday, August 30, 2010 7:53 PM
To: Shawn Steele
Cc: Masatoshi Kimura; NARUSE, Yui; ietf-charsets
Subject: Re: Registration of new charset CP50220
- If what we need (for HTML5, as far as I understand) isn't exactly
what Windows software is doing, then we should not use the name
CP50220 for the registration, but should come up with some other
name. But the origin of strange provisions such as "treat content
labeled as iso-8859-1 as if it were windows-1252" in HTML5 are
"because IE did so". So the browsers might as well follow IE exactly,
not just almost, in which case, we could use the name CP50220.
- The charset registry currently has no way to express "On creation
(encoding), limited to 'foo', but on interpretation (decoding), also
take into account 'bar'.". RFC 2978 defines a 'charset' as "a method
of converting a sequence of octets into a sequence of characters".
We may be able to deal with this by adding comments, but maybe in the
long term, this could be a change needed in an update to the RFC.
Regards, Martin.
Windows, .Net & MLang aren't going to change the behavior of these code pages, it would break people. Instead we'd encourage customers to use UTF-8, particularly if they're having problems.
I was sort of assuming that since you're using the Windows nomenclature, you're attempting to pin down the behavior for some sort of interoperability when you see the Windows names. It is, perhaps, odd for the "7 bit" form to do something when it sees 8 bit data, but I was just letting you know that's what it does :) I'm sure there are also other subtle discrepancies between the 5022x behavior and the official standards, but we're pretty much stuck with the existing behavior.
If Mozilla were to target the Windows CP50220 behavior specifically (as opposed to the more general iso-2022-jp), then how exactly they wanted to follow that behavior would be up to them. If they thought that just mapping it to iso-2022-jp was acceptable and more convenient, then that would be their choice, in the same way that we map iso-2022-jp to 50220 even though it isn't a perfect match.
-Shawn
-----Original Message-----
Sent: Monday, August 30, 2010 4:07 PM
To: Shawn Steele
Cc: NARUSE, Yui; ietf-charsets
Subject: Re: Registration of new charset CP50220
The purpose of this registration is to "standardize" how to handle errors when Web browsers encounter illegal ISO-2022-JP sequences.
Mozilla's encoder has changed its halfwidth katakana handling to match the behavior.
https://bugzilla.mozilla.org/show_bug.cgi?id=563283
Decoding is identical (which might be most interesting for users of tagged content).
The first version of the registration included all the decoding methods that are supported by CP50220 (i.e. ESC ( J, SI, and 8-bit). However, the latter two methods were removed from the registration for two reasons.
1. Some implementations (e.g. Mozilla's) don't support them.
Should the Mozilla decoder be changed to match the behavior?
2. The charset is supposed to be 7-bit. It's strange to include 8-bit character handling.
Changing the registration to 8-bit is not a solution because it would require the Content-Transfer-Encoding MIME header field, which is not compatible with ISO-2022-JP. Old Microsoft Internet Mail/News had this bug.
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
NARUSE, Yui
2010-08-31 16:01:11 UTC
Permalink
Post by Martin J. Dürst
Hello Shawn,
1) (the one you mention) New stuff (canvas, video (maybe),...)
2) Backwards-compatibility, so that it works with older pages,
including trying to be as detailed as possible a spec to
relieve implementers from reverse-engineering IE6
This registration is for 2), not for 1).
I was thinking about 2), so the encoding I want to register should be the one
that is used in current World Wide Web pages.

But the true CP50220 supports many more features:
* JIS X 0208-1990/1997
* Shift-in Katakana
* 8-bit Katakana
* (I Shift-JIS Kanji
* Shift-in Shift-JIS Kanji
* 8-bit Shift-JIS Kanji
* (H ‘Swedish’ as JIS Roman
http://coq.no/character-tables/mime/iso-2022/en

Some of them may be included in a compatibility set (as in my first draft),
but some of them may not be.

The same situation exists with Windows-31J: CP932 has User Defined Characters,
but the IANA charset registration doesn't describe them.

--
NARUSE, Yui <***@airemix.jp>
Anne van Kesteren
2010-08-31 08:43:40 UTC
Permalink
On Tue, 31 Aug 2010 04:53:55 +0200, Martin J. Dürst
Post by Martin J. Dürst
- If what we need (for HTML5, as far as I understand) isn't exactly
what Windows software is doing, then we should not use the name
CP50220 for the registration, but should come up with some other
name. But the origin of strange provisions such as "treat content
labeled as iso-8859-1 as if it were windows-1252" in HTML5 are
"because IE did so". So the browsers might as well follow IE exactly,
not just almost, in which case, we could use the name CP50220.
To be clear. We do not need it for HTML5 specifically. Browsers need this
kind of information in general so they can do the same thing for legacy
content. New browsers entering the market also need this information so
they do not have to reverse engineer the market leader (as we are
currently doing).
Post by Martin J. Dürst
- The charset registry currently has no way to express "On creation
(encoding), limited to 'foo', but on interpretation (decoding), also
take into account 'bar'.". RFC 2978 defines a 'charset' as "a method
of converting a sequence of octets into a sequence of characters".
We may be able to deal with this by adding comments, but maybe in the
long term, this could be a change needed in an update to the RFC.
It is my understanding it can also apply to creation. Consider submitting
a form. I believe browsers treat e.g. ISO-8859-1 as Windows-1252 there.
Yui mentioned this case specifically when submitting this registration.

--
Anne van Kesteren
http://annevankesteren.nl/
Martin J. Dürst
2010-08-31 09:05:35 UTC
Permalink
Hello Anne,
Post by Anne van Kesteren
On Tue, 31 Aug 2010 04:53:55 +0200, Martin J. Dürst
- If what we need (for HTML5, as far as I understand) isn't exactly
what Windows software is doing, then we should not use the name
CP50220 for the registration, but should come up with some other
name. But the origin of strange provisions such as "treat content
labeled as iso-8859-1 as if it were windows-1252" in HTML5 are
"because IE did so". So the browsers might as well follow IE exactly,
not just almost, in which case, we could use the name CP50220.
To be clear. We do not need it for HTML5 specifically. Browsers need
this kind of information in general so they can do the same thing for
legacy content. New browsers entering the market also need this
information so they do not have to reverse engineer the market leader
(as we are currently doing).
Well, yes, browsers may need information like the correspondence between
iso-8859-1 and windows-1252 in general, but the charset registry won't
provide it, it's HTML5 that provides it. All the charset registry can do
is make sure the charset label is registered, and with it, the charset
is reasonably well defined.
Post by Anne van Kesteren
- The charset registry currently has no way to express "On creation
(encoding), limited to 'foo', but on interpretation (decoding), also
take into account 'bar'.". RFC 2978 defines a 'charset' as "a method
of converting a sequence of octets into a sequence of characters".
We may be able to deal with this by adding comments, but maybe in the
long term, this could be a change needed in an update to the RFC.
It is my understanding it can also apply to creation. Consider
submitting a form. I believe browsers treat e.g. ISO-8859-1 as
Windows-1252 there. Yui mentioned this case specifically when submitting
this registration.
Good point, for what the browser does. What I was talking about above is
not what the browser does, but what the RFC says we can/should register.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
NARUSE, Yui
2010-08-31 16:10:07 UTC
Permalink
Post by Martin J. Dürst
Post by Anne van Kesteren
On Tue, 31 Aug 2010 04:53:55 +0200, Martin J. Dürst
Post by Martin J. Dürst
- If what we need (for HTML5, as far as I understand) isn't
exactly what Windows software is doing, then we should not use
the name CP50220 for the registration, but should come up with
some other name. But the origin of strange provisions such as
"treat content labeled as iso-8859-1 as if it were windows-1252"
in HTML5 are "because IE did so". So the browsers might as well
follow IE exactly, not just almost, in which case, we could use
the name CP50220.
To be clear. We do not need it for HTML5 specifically. Browsers
need this kind of information in general so they can do the same
thing for legacy content. New browsers entering the market also
need this information so they do not have to reverse engineer the
market leader (as we are currently doing).
Well, yes, browsers may need information like the correspondence
between iso-8859-1 and windows-1252 in general, but the charset
registry won't provide it, it's HTML5 that provides it. All the
charset registry can do is make sure the charset label is registered,
and with it, the charset is reasonably well defined.
I am referring to HTML5's "Character encoding overrides".
It defines overrides as "de jure to de facto".
I want to define the EUC-JP and ISO-2022-JP versions.
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0

So I want the set that is already in use on the current web, for compatibility.
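The "de jure to de facto" override idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `OVERRIDES` table and the `decode_with_override` helper are hypothetical names, and the only pair taken from this thread is iso-8859-1 to windows-1252. Python ships no CP50220 codec, so the ISO-2022-JP case is not shown.

```python
# Hypothetical override table: declared label -> codec actually used.
# Only the iso-8859-1 -> windows-1252 pair comes from the discussion.
OVERRIDES = {"iso-8859-1": "windows-1252"}


def decode_with_override(data: bytes, label: str) -> str:
    """Decode data with the de facto codec for a declared label."""
    codec = OVERRIDES.get(label.strip().lower(), label)
    return data.decode(codec)


# 0x93 and 0x94 are curly quotes in windows-1252 but C1 controls in iso-8859-1.
print(decode_with_override(b"\x93quoted\x94", "ISO-8859-1"))
```

A label without an entry in the table is passed to the codec lookup unchanged.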

--
NARUSE, Yui <***@airemix.jp>
Shawn Steele
2010-08-31 17:29:17 UTC
Permalink
I'm adding Adrian from the IE Team to the thread :)

I own most of the encoding APIs here, but I don't know what IE's thinking is around HTML 5 and stuff.

I don't mind 50220 being registered, but I'm a bit concerned that there may be some differences between the MLang, MultiByteToWideChar, and .Net versions of encoding/decoding. (One reason I don't like non-Unicode encodings is that it is difficult to find multiple implementations that match exactly). Also, it is "like" ISO-2022-JP, but differs, as noted, and I was a tad confused about why they were being tied together. I'd consider MultiByteToWideChar behavior to be "authoritative", but your emphasis is probably on MLang behavior because IE used MLang.

-Shawn
Masatoshi Kimura
2010-09-02 11:43:58 UTC
Permalink
Hi,
Windows, .Net & MLang aren't going to change the behavior of these code
pages, it would break people. Instead we'd encourage customers to use
UTF-8, particularly if they're having problems.
Totally agree. I do not want to change the existing legacy code only to
comply with the registration. Note that sending ESC(I is illegal regardless
of "CP50220". The Mozilla ISO-2022-JP encoder should have been changed anyway.
I was sort of assuming that since you're using the Windows nomenclature,
you're attempting to pin down the behavior for some sort of
interoperability when you see the Windows names. It is, perhaps, odd
for the "7 bit" form to do something when it sees 8 bit data, but I was
just letting you know that's what it does :) I'm sure there are also
other subtle discrepancies between the 5022x behavior and the official
standards, but we're pretty much stuck with the existing behavior.
Windows does not use the name "CP50220" at all. It treats the name
"ISO-2022-JP" as Windows Codepage 50220.
Windows should behave as if it uses genuine ISO-2022-JP because it
doesn't use its own name. Therefore it should preclude
"Content-Transfer-Encoding: 8bit" on sending. MSIMN didn't, so many
users complained about this rudeness. Eventually, Outlook Express fixed
this bug.
The name "CP50220" is used by non-MS implementations to differentiate it
from genuine ISO-2022-JP. Those implementations do not share exactly the
same behavior. Even WideCharToMultiByte and MLang differ from each
other. So there is no authoritative definition of "CP50220".
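
The 7-bit property that makes "Content-Transfer-Encoding: 8bit" unnecessary can be checked with a small Python sketch. It uses Python's built-in `iso2022_jp` codec as a stand-in, since Python has no CP50220 codec; `is_seven_bit` is a hypothetical helper name.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if no octet has its high bit set."""
    return all(b < 0x80 for b in data)


text = "日本語 text"

# ISO-2022-JP switches character sets with escape sequences, so every
# octet stays below 0x80.
print(is_seven_bit(text.encode("iso2022_jp")))

# Shift_JIS uses octets above 0x7F for the same characters.
print(is_seven_bit(text.encode("shift_jis")))
```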

Consequently, I think the registration should include only the greatest
common denominator. The rest should be left undefined.
--
***@nifty.ne.jp
NARUSE, Yui
2010-09-02 12:04:57 UTC
Permalink
Post by Masatoshi Kimura
Windows, .Net & MLang aren't going to change the behavior of these
code pages, it would break people. Instead we'd encourage customers
to use UTF-8, particularly if they're having problems.
Totally agree. I do not want to change the existing legacy code only
to comply with the registration. Note that sending ESC(I is illegal
regardless of "CP50220". The Mozilla ISO-2022-JP encoder should have been
changed anyway.
This refers to the following:
Mozilla has an ISO-2022-JP ESC(I decoder, but when sending JIS X 0201-Katakana,
Mozilla encodes the characters as character references.
Post by Masatoshi Kimura
I was sort of assuming that since you're using the Windows
nomenclature, you're attempting to pin down the behavior for some
sort of interoperability when you see the Windows names. It is,
perhaps, odd for the "7 bit" form to do something when it sees 8
bit data, but I was just letting you know that's what it does :)
I'm sure there are also other subtle discrepancies between the
5022x behavior and the official standards, but we're pretty much
stuck with the existing behavior.
Windows does not use the name "CP50220" at all. It treats the name
"ISO-2022-JP" as Windows Codepage 50220. Windows should behave as if
it uses genuine ISO-2022-JP because it doesn't use its own name.
Therefore it should preclude "Content-Transfer-Encoding: 8bit" on
sending. MSIMN didn't, so many users complained about this rudeness.
Eventually, Outlook Express fixed this bug. The name "CP50220" is
used by non-MS implementations to differentiate it from genuine
ISO-2022-JP. Those implementations do not share exactly the same
behavior. Even WideCharToMultiByte and MLang differ from each
other. So there is no authoritative definition of "CP50220".
Consequently, I think the registration should include only the
greatest common denominator. The rest should be left undefined.
All I wanted was the decoder, so I can change the description to something like:

* On sending JIS X 0201-Katakana, it MUST be converted to the related
character of JIS X 0208 or to escaped characters, depending on the context.
--
NARUSE, Yui <***@airemix.jp>
Masatoshi Kimura
2010-09-02 12:19:52 UTC
Permalink
Post by NARUSE, Yui
Mozilla has an ISO-2022-JP ESC(I decoder, but when sending JIS X 0201-Katakana,
Mozilla encodes the characters as character references.
No. The old encoder sent ESC(I. The current encoder converts
hankaku kana to zenkaku kana.
Post by NARUSE, Yui
All I wanted was the decoder,
We also need the encoder. (Consider form submission.)
Post by NARUSE, Yui
* On sending JIS X 0201-Katakana, it MUST be converted to the related
character of JIS X 0208 or to escaped characters, depending on the context.
Please do not change it. Otherwise Mozilla will have to change the encoder
again. Moreover, most other browsers (at least IE, Chrome, Safari, and
Opera for Windows) already convert hankaku to zenkaku on form submission.
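
The hankaku-to-zenkaku step on sending can be approximated in Python as follows. This is a minimal sketch, not what any browser is known to implement: it assumes Unicode NFKC normalization as the conversion, applied only to runs of halfwidth katakana, and `to_zenkaku` is a hypothetical helper name.

```python
import re
import unicodedata

# Halfwidth katakana block, including halfwidth punctuation and sound marks.
HALFWIDTH_RUN = re.compile("[\uff61-\uff9f]+")


def to_zenkaku(text: str) -> str:
    """Replace halfwidth katakana with fullwidth katakana, leaving other text alone."""
    # Normalizing a whole run lets a base kana and a following voiced
    # sound mark compose into a single fullwidth character.
    return HALFWIDTH_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )


converted = to_zenkaku("ｶﾞｲﾄﾞ")
print(converted)
# The fullwidth result is covered by JIS X 0208, so it can be sent
# without the ESC ( I escape sequence.
print(converted.encode("iso2022_jp"))
```

An isolated halfwidth sound mark with no preceding kana is not handled specially in this sketch.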
--
***@nifty.ne.jp
NARUSE, Yui
2010-09-02 12:39:30 UTC
Permalink
This refers to the following: Mozilla has an ISO-2022-JP ESC(I decoder,
but when sending JIS X 0201-Katakana, Mozilla encodes the characters as
character references.
No. The old encoder sent ESC(I. The current encoder converts hankaku
kana to zenkaku kana.
All I wanted was the decoder,
We also need the encoder. (Consider form submission.)
* On sending JIS X 0201-Katakana, it MUST be converted to the related
character of JIS X 0208 or to escaped characters, depending on the context.
Please do not change it. Otherwise Mozilla will have to change the
encoder again. Moreover, most other browsers (at least IE, Chrome,
Safari, and Opera for Windows) already convert hankaku to zenkaku on
form submission.
OK, I see.
--
NARUSE, Yui <***@airemix.jp>
Shawn Steele
2010-09-03 19:50:25 UTC
Permalink
I don't think the registration needs an alias, unless someone's actually using that name already?

If I understand the intent correctly, the intent here is to match Microsoft's 50220 behavior? If so, I think that defining the "character sets" in terms of the JIS standards is a little bit odd. Instead I might consider pointing to the character set mappings, like those at http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/ ? (Not sure if that completely works)

I'm not sure the comparison to shift_jis makes much sense. They both encode Japanese, but 50220/iso-2022-jp are both stateful escape sequence based encodings, whereas shift_jis is "just" a double-byte code page.

Probably worth mentioning that not all applications that support this code page would recognize the CP50220 name.

- Shawn
Masatoshi Kimura
2010-09-04 01:54:23 UTC
Permalink
Post by Shawn Steele
I don't think the registration needs an alias, unless someone's
actually using that name already?
The "csCP50220" alias was added for MIB requirements. See
<http://mail.apps.ietf.org/ietf/charsets/msg01880.html>.
Post by Shawn Steele
If I understand the intent correctly, the intent here is to match
Microsoft's 50220 behavior? If so, I think that defining the
"character sets" in terms of the JIS standards is a little bit odd.
Instead I might consider pointing to the character set mappings, like those at
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/ ? (Not
sure if that completely works)
There are no official character set mappings for CP50220 (if they existed,
the problem would be much less complicated).
Moreover, it would be insufficient to point to a mapping table because
CP50220 is a stateful encoding. A BNF or a decoding algorithm (like the one
in the HTML5 spec) would be required. Defining it in terms of the JIS
standards is the easiest workaround.
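
To show why a mapping table alone is not enough, here is a minimal Python sketch of a stateful decoder for the escape sequences in the proposed registration. It is an illustration under assumptions, not a description of Microsoft's behavior: two-byte characters are converted to Shift_JIS octets and looked up through Python's `cp932` codec, JIS X 0201-Roman is treated like ASCII, and errors simply yield U+FFFD. The names `decode_cp50220`, `jis_to_sjis` and `ESCAPES` are hypothetical.

```python
ESC = 0x1B

# Escape sequences from the proposed registration -> decoder state.
ESCAPES = {b"(B": "ascii", b"(J": "roman", b"(I": "kana",
           b"$@": "kanji", b"$B": "kanji"}


def jis_to_sjis(j1: int, j2: int) -> bytes:
    """Convert a JIS X 0208 byte pair (0x21..0x7E each) to Shift_JIS octets."""
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
    if j1 & 1:
        s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)
    else:
        s2 = j2 + 0x7E
    return bytes([s1, s2])


def decode_cp50220(data: bytes) -> str:
    state = "ascii"  # text is assumed to start in US-ASCII
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            new_state = ESCAPES.get(data[i + 1:i + 3])
            if new_state is None:
                out.append("\ufffd")  # unknown escape sequence
                i += 1
            else:
                state = new_state
                i += 3
        elif b <= 0x20 or state in ("ascii", "roman"):
            # Controls and space pass through in every state.
            # Assumption: JIS X 0201-Roman is decoded like ASCII.
            out.append(chr(b) if b < 0x80 else "\ufffd")
            i += 1
        elif state == "kana":
            # 7-bit JIS X 0201-Katakana maps onto U+FF61..U+FF9F.
            out.append(chr(0xFF61 + b - 0x21) if 0x21 <= b <= 0x5F else "\ufffd")
            i += 1
        else:  # kanji state: two bytes per character
            pair = data[i:i + 2]
            if len(pair) == 2 and all(0x21 <= x <= 0x7E for x in pair):
                try:
                    out.append(jis_to_sjis(pair[0], pair[1]).decode("cp932"))
                except UnicodeDecodeError:
                    out.append("\ufffd")
                i += 2
            else:
                out.append("\ufffd")
                i += 1
    return "".join(out)


print(decode_cp50220(b"\x1b$B$\"\x1b(B abc \x1b(I1\x1b(B"))
```

The same bytes mean different characters depending on the state set by the last escape sequence, which is the part a flat mapping table cannot express.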
Post by Shawn Steele
I'm not sure the comparison to shift_jis makes much sense. They both
encode Japanese, but 50220/iso-2022-jp are both stateful escape
sequence based encodings, whereas shift_jis is "just" a double-byte
code page.
The comparison is not CP50220/ISO-2022-JP vs. Shift_JIS but
Shift_JIS vs. Windows-31J. It is required in order to express "characters
extended by Windows Codepage 932", because only the shift_jis variant
(Windows-31J, Windows Codepage 932, or whatever) has "official"
mappings provided by Microsoft. Again, it would not be required if
Microsoft provided official mappings for CP50220.
--
***@nifty.ne.jp
Shawn Steele
2010-09-06 05:54:34 UTC
Permalink
I'm happy if you're happy :)

I think that CP50220 could point the states for each escape sequence at appropriate tables, however some of the code may not completely match the tables (I'd have to take a close look). However, it is unlikely that the various standards referred to by these escape sequences are exactly what our behavior is; I've heard complaints. Readers would presume that the escape sequences were exactly as in the JIS standards, however I think that's not true :(

So I guess I'm a bit unclear as to your goal. If you intend to accurately document the behavior, then it seems that more detail is required. If you are just trying to register a place for CP50220 and note that it's not ISO-2022-JP, then this might work. If the higher level detail of Microsoft exact behavior is required, then this might be something I need to follow up on, which might take a little while :(

There are 2 ways to read this sentence, I read it differently: (It can be read that CP50220, Windows-31J & Shift_JIS are all variants of ISO-2022-JP.) Sorry, I didn't think of reading it the other way.
"CP50220 is a variant of ISO-2022-JP (like Windows-31J and Shift_JIS)."

Perhaps this would have helped me:
"CP50220 is a variant of ISO-2022-JP. (Similar to the way that Windows-31J is a variant of Shift_JIS)"

-Shawn

http://blogs.msdn.com/shawnste


NARUSE, Yui
2010-09-16 15:03:07 UTC
Permalink
Post by Shawn Steele
I'm happy if you're happy :)
I think that CP50220 could point the states for each escape sequence
at appropriate tables, however some of the code may not completely
match the tables (I'd have to take a close look). However it is
unlikely that the various standards referred to by these escape
sequences are exactly what our behavior is, I've heard complaints.
Readers would presume that the escape sequences were exactly as the
JIS standards, however I think that's not true :(
So I guess I'm a bit unclear as to your goal. If you intend to
accurately document the behavior, then it seems that more detail is
required. If you are just trying to register a place for CP50220 and
note that it's not ISO-2022-JP, then this might work. If the higher
level detail of Microsoft exact behavior is required, then this might
be something I need to follow up on, which might take a little while
:(
There are 2 ways to read this sentence, I read it differently: (It
can be read that CP50220, Windows-31J & Shift_JIS are all variants of
ISO-2022-JP.) Sorry, I didn't think of reading it the other way.
"CP50220 is a variant of ISO-2022-JP (like Windows-31J and
Shift_JIS)."
Perhaps this would have helped me: "CP50220 is a variant of
ISO-2022-JP. (Similar to the way that Windows-31J is a variant of
Shift_JIS)"
A registration is a standard for information exchange.
So what we need is a reasonable subset of the real implementation of
Windows Codepage 50220.

For example, CP932 has some strange mappings such as:
<U0080> \x80 |0
<UF8F0> \xA0 |0
<UF8F1> \xFD |0
<UF8F2> \xFE |0
<UF8F3> \xFF |0
Moreover, it has User Defined Characters.

I don't think they are needed for the IANA charset registry.

I believe that what I proposed for CP50220 is a reasonable one.
--
NARUSE, Yui <***@airemix.jp>
Shawn Steele
2010-09-16 17:50:44 UTC
Permalink
Ok

-Shawn

http://blogs.msdn.com/shawnste


NARUSE, Yui
2010-09-17 12:57:43 UTC
Permalink
Hello,

This is the registration of the new charset CP50220.

The following proposal should now have consensus on ietf-charsets,
so I am submitting it to IANA.

Regards,

---
Charset name: CP50220

Charset aliases: csCP50220

Suitability for use in MIME text:

Yes, CP50220 is suitable for use with subtypes of the "text" Content-Type.

Since CP50220 is a 7-bit encoding, Content-Transfer-Encoding is not needed.
Base64 or Quoted-Printable encoding MAY break this encoding.

Published specification(s):

CP50220 consists of the following character sets:

reg#  character set        ESC sequence  designated to
------------------------------------------------------
6     US-ASCII             ESC ( B       G0
13    JIS X 0201-Katakana  ESC ( I       G0
14    JIS X 0201-Roman     ESC ( J       G0
42    JIS X 0208-1978      ESC $ @       G0
87    JIS X 0208-1983      ESC $ B       G0

* The beginning of a text is assumed to have "ESC ( B ESC ) I".
* Each line of CP50220 text MUST end with ASCII.
* On receiving, JIS X 0201-Katakana characters MAY be encoded
with the escape sequence: ESC ( I.
* On sending, JIS X 0201-Katakana MUST be converted to the related
characters of JIS X 0208.
* The character set of CP50220 is based on Windows Codepage 932,
so for the meaning and the Unicode mapping of each character, refer to it:
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
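
The rule that each line MUST end with ASCII can be checked mechanically. The following Python sketch is one possible reading of that rule, under the assumption that "end with ASCII" means the last designation before each line break is ESC ( B (or that no escape sequence has been seen at all); `lines_end_in_ascii` is a hypothetical helper name.

```python
import re

# The escape sequences listed in the table above.
ESCAPE = re.compile(rb"\x1b(\(B|\(I|\(J|\$@|\$B)")


def lines_end_in_ascii(data: bytes) -> bool:
    """Return True if every line is back in US-ASCII when it ends."""
    state = b"(B"  # text is assumed to start in US-ASCII
    for line in data.splitlines():
        for match in ESCAPE.finditer(line):
            state = match.group(1)
        if state != b"(B":
            return False
    return True


# Python's iso2022_jp encoder returns to ASCII before the newline.
print(lines_end_in_ascii("日本語\nabc".encode("iso2022_jp")))
# A line left in JIS X 0208 state fails the check.
print(lines_end_in_ascii(b"\x1b$B$\"\nabc"))
```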

ISO 10646 equivalency table:

This charset belongs to the ISO/IEC 2022 family.
Conversion of each character refers to Windows Codepage 932:
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://icu-project.org/repos/icu/data/trunk/charset/data/ucm/windows-932-2000.ucm

Additional information:

CP50220 is a variant of ISO-2022-JP (in the same way that Windows-31J is
a variant of Shift_JIS). This charset differs from ISO-2022-JP as follows:
* CP50220 supports JIS X 0201-Katakana
* CP50220 supports characters extended by Windows Codepage 932
(NEC special characters and NEC selection of IBM extensions)
* The Unicode mappings of some characters are different

Typical users of CP50220 are web browsers. When web browsers load
a page which is declared or auto-detected as "ISO-2022-JP", they
don't interpret it as the true ISO-2022-JP registered in IANA Character
Sets but as CP50220. When they post form data as "ISO-2022-JP",
the data is also encoded as CP50220. Note that although csISO2022JP
is an alias of ISO-2022-JP in IANA Character Sets, on Windows it means
neither the registered ISO-2022-JP nor CP50220, but CP50221.

The name "CP50220" is in use in the following applications:
* Citrus iconv (NetBSD and DragonFly use this)
* Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
* nkf 2.0.5
* Encode-EUCJPMS-0.06

Moreover, applications which use MLang.DLL or the .NET Framework for
converting "ISO-2022-JP" implicitly use this charset.

The name is not "Windows-50220" because some of the applications which accept
the name "CP50220" don't support the name "Windows-50220".

CP50220 is intended for communicating with legacy systems.
UTF-8 is preferred to CP50220 for new systems.

Related references are:

"Remarks" of "GetEncodings Method" of "System.Text"
http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx

"Introduction to JIS X0213 Implementation based on Unicode -
A new Japanese Language Processing Environment for Information Systems",
Nikkei BP Soft Press, ISBN 978-4891006082, 2008, pp. 17-18, 20, 120-158 (in Japanese)

CP50220 - Legacy Encoding Project
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp50220

This charset is also known as Windows Codepage 50220.

Person & email address to contact for further information:

NARUSE, Yui
Email: ***@ruby-lang.org

Intended usage: LIMITED USE

-- NARUSE, Yui <***@airemix.jp>
