Discussion:
Registration of new charset CP51932
NARUSE Yui
2010-04-02 19:39:30 UTC
Permalink
Charset name: CP51932

Charset aliases: (none)

Suitability for use in MIME text:

Yes, CP51932 is suitable for use with subtypes of the "text"
Content-Type. Note that CP51932 is a multi-octet charset.
Care should be taken to choose an appropriate Content-Transfer-Encoding.

Published specification(s):

Uses ISO 2022 rules to select:
code set 0: US-ASCII (a single 7-bit byte set)
* 0x5C is U+005C : REVERSE SOLIDUS (YEN SIGN)
* 0x7E is U+007E : TILDE
code set 1: Microsoft Standard Character Set (a double 8-bit byte set)
* JIS X 0208-1983
* NEC special characters (Row 13)
* NEC selection of IBM extensions (Rows 89 to 92)
code set 2: Halfwidth Katakana (a single 7-bit byte set)
JIS X 0201-1976
requiring SS2 as the character prefix
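
To make the row numbers above concrete, here is a minimal sketch of the customary arithmetic that turns a Windows Codepage 932 double-byte code into a JIS row/cell pair and then into the two 8-bit octets of code set 1. It is not taken from any implementation named in this thread; the function names are hypothetical, and lead bytes above 0xEF are deliberately rejected because they fall outside the 94 rows.

```python
def sjis_to_row_cell(s1, s2):
    """Convert a CP932 double-byte code to a 1-based JIS (row, cell) pair."""
    if not (0x81 <= s1 <= 0x9F or 0xE0 <= s1 <= 0xEF):
        raise ValueError("lead byte outside the range handled by this sketch")
    if not 0x40 <= s2 <= 0xFC or s2 == 0x7F:
        raise ValueError("invalid trail byte")
    base = 0x70 if s1 <= 0x9F else 0xB0
    row = (s1 - base) * 2 - 0x20          # even row of the pair covered by s1
    if s2 < 0x9F:
        row -= 1                           # first half of the pair: odd row
        cell = s2 - 0x3F if s2 < 0x7F else s2 - 0x40
    else:
        cell = s2 - 0x9E
    return row, cell


def row_cell_to_euc(row, cell):
    """Encode a (row, cell) pair as two octets with the high bit set."""
    return bytes([0xA0 + row, 0xA0 + cell])


# CP932 0x8740 is the first NEC special character (row 13, cell 1).
print(sjis_to_row_cell(0x87, 0x40), row_cell_to_euc(13, 1).hex())
```

With this arithmetic, CP932 codes 0xED40 through 0xEEFC land in rows 89 to 92, which matches the "NEC selection of IBM extensions" entry in the list above.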

For the meaning and Unicode mapping of each character, refer to
Windows Codepage 932:
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx

ISO 10646 equivalency table:

http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm

Additional information:

This is a request for a new registration of this charset.

CP51932 is the de facto implementation of EUC-JP, used mostly by Web browsers.
Internet Explorer provides the reference implementation.
Firefox, Safari, Opera, and Google Chrome also support it.
They refer to this charset by the name "EUC-JP".
http://coq.no/character-tables/mime/euc/en

The name "CP51932" is in use in the following applications:
* Citrus iconv (used by NetBSD and DragonFly)
* patched GNU libiconv in FreeBSD ports
* Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
* nkf 2.0.5
* PHP 5.2.1
* Ruby 1.9.1
* Encode-EUCJPMS-0.06

Moreover, applications which use MLang.DLL or the .NET Framework for
converting "EUC-JP" implicitly use this charset.

So this charset is widely used but doesn't have its own name.
The intended use of this name is to override the implementation of EUC-JP
or for charset conversion.
http://wiki.whatwg.org/wiki/Web_Encodings
http://www.w3.org/Bugs/Public/show_bug.cgi?id=7444

The name is not "Windows-51932" because some of the applications which accept
the name "CP51932" don't support the name "Windows-51932".

CP51932 is intended for importing legacy data.
UTF-8 is preferred to CP51932 for new systems.

Related references are:

"Remarks" of "GetEncodings Method" of "System.Text"
http://msdn.microsoft.com/en-us/library/system.text.encoding.get=
encodings.aspx

"Unicode=E3=81=AB=E3=82=88=E3=82=8BJIS X0213=E5=AE=9F=E8=A3=85=
=E5=85=A5=E9=96=80=E2=80=95=E6=83=85=E5=A0=B1=E3=82=B7=E3=82=B9=E3=
=83=86=E3=83=A0=E3=81=AE=E6=96=B0=E3=81=9F=E3=81=AA=E6=97=A5=E6=9C=
=AC=E8=AA=9E=E5=87=A6=E7=90=86=E7=92=B0=E5=A2=83"
=E6=97=A5=E7=B5=8CBP=E3=82=BD=E3=83=95=E3=83=88=E3=83=97=E3=83=
=AC=E3=82=B9, ISBN 978-4891006082, 2008, p. 17-18, 20, 120-158

CP51932 - Legacy Encoding Project
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp51932

This charset is also known as Windows Codepage 51932.

Person & email address to contact for further information:

NARUSE, Yui
Email: ***@airemix.jp

Intended usage: LIMITED USE

--
NARUSE, Yui <***@airemix.jp>
Ned Freed
2010-04-04 16:11:03 UTC
Permalink
(In addition to the specific comments I've made below, I think a general
discussion of how to handle this whole CP932 mess is probably in order.)
Post by NARUSE Yui
Charset name: CP51932
Charset aliases: (none)
Yes, CP51932 is suitable for use with subtypes of the "text"
Content-Type. Note that CP51932 is an multi-octet charset.
Care should be taken to choose an appropriate Content-Transfer-Encoding.

While this is a generally true statement, it doesn't follow from the charset
being multi-octet. There are multi-octet charsets like iso-2022-jp that
are 7bit and hence require no special encoding at all, others like utf-8
are 8bit, and still others like utf-16 require binary.

I suggest changing this to "Since CP51932 is an 8bit charset, care should be
taken to choose an appropriate Content-Transfer-Encoding."
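
The distinction drawn here can be checked mechanically. The following is a rough sketch, not a complete MIME classifier: it looks only at NUL octets and high-bit octets and ignores the line-length and CR/LF rules that also apply to 7bit and 8bit data. The function name is hypothetical, and Python's euc_jp codec stands in for CP51932, which Python does not ship.

```python
def minimum_transfer_class(data: bytes) -> str:
    """Return the least demanding MIME data class the octets could fit in."""
    if b"\x00" in data:
        return "binary"   # NUL octets rule out both 7bit and 8bit
    if any(octet >= 0x80 for octet in data):
        return "8bit"     # at least one octet has the high bit set
    return "7bit"


sample = "EUC-JP \u65e5\u672c\u8a9e"   # ASCII label followed by three kanji

for codec in ("iso2022_jp", "euc_jp", "utf-8", "utf-16"):
    print(codec, minimum_transfer_class(sample.encode(codec)))
```

All four charsets are multi-octet, yet they fall into three different classes, which is why the registration text should state the class directly.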
Post by NARUSE Yui
code set 0: US-ASCII (a single 7-bit byte set)
* 0x5C is U+005C : REVERSE SOLIDUS (YEN SIGN)
This makes no sense to me. A Reverse Solidus isn't a Yen sign.
Post by NARUSE Yui
* 0x7E is U+007E : TILDE
AFAIK 0x7E is a tilde in US-ASCII, so why are you calling this out?
Post by NARUSE Yui
code set 1: Microsoft Standard Character Set (a double 8-bit byte set)

Which Microsoft standard character set? I assume it's CP932, but this needs to
be stated explicitly.
Post by NARUSE Yui
* JIS X 0208-1983
* NEC special characters (Row 13)
* NEC selection of IBM extensions (Rows 89 to 92)
code set 2: Halfwidth Katakana (a single 7-bit byte set)
JIS X 0201-1976
requiring SS2 as the character prefix
You're approaching this backwards, which is unfortunately a pretty common
problem with these ISO 2022-based charsets, including some existing
registrations.

Charsets are mappings from octets to characters and should be specified as
such. What you're doing here is using the ISO approach, which defines
things in terms of one or more coded character sets and then describes how
to encode them using a character encoding scheme.

I suggest that this be flipped around, to say something like:

Octets with the high bit clear specify single US-ASCII characters, while
octets with the high bit set encode characters from the Microsoft Coded
Character Set 932 by combining the bits from the two octets ...

The problem with the ISO 2022 approach is that once you say you're using ISO
2022 you then have to profile what parts of ISO 2022 are allowed. ISO 2022 in
full generality is extremely complex and essentially nobody supports all of
it. (And EUC charsets are some of the most limited profiles of all - they
assume fixed bindings of coded character sets to C0-1/G0-3 and only use the
"shift next character" control sequences.)

I suggest you check out the specification of iso-2022-jp or any of the other
iso-2022-* variants in order to see how to write this sort of description.
Post by NARUSE Yui
Meaning and mapping to Unicode of each character is refer to
Windows Codepage 932.
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm
Post by NARUSE Yui
This is a request for a new registration of this charset.
CP51932 is real implementation of EUC-JP mostly used by Web Browsers.

I think what you're trying to say is that these browsers interpret EUC-JP
(which is already registered and differs in some details) as the charset you
describe here rather than what's actually registered under the name EUC-JP. If
so, you need to make that much clearer.

I also think some mention needs to be made of JIS X 0212 here and the apparent
lack of a binding of it to the G3 range (which is present in EUC-JP). While I
have no problem with dropping JIS X 0212 support - support for which is
sporadic at best - the rationale for not having it needs to be called out.
Post by NARUSE Yui
Internet Explorer gives a reference implementation.
Firefox, Safari, Opera, and Google Chrome support also this.
They refers this charset by the name "EUC-JP".
http://coq.no/character-tables/mime/euc/en
I'm not sure references to incorrect definitions of other charsets are
appropriate or useful.
NARUSE Yui
2010-04-05 10:15:46 UTC
Permalink
Thank you for your comments.
(In addition to the specific comments I've made below, I think a general
discussion of how to handle this whole CP932 mess is probably in order.)

There are two strategies: override or option.
HTML5 uses overrides. It must handle existing documents which are mistakenly
labeled as Shift_JIS or EUC-JP. In those situations the label should be
overridden.

Converters, databases, and programming languages, on the other hand, have
added the new charset as an option. Their users can choose one of them.

IANA Charsets are the back end of libraries, so they should choose the option
solution.
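
The difference between the two strategies can be sketched as a label lookup. This is only an illustration: the table contents and the function name are hypothetical, the Shift_JIS to Windows-31J override is an assumption made by analogy with the EUC-JP case, and no browser or library is claimed to use these exact tables.

```python
# "Option": every charset keeps its own label and the caller picks one.
REGISTERED = {
    "euc-jp": "EUC-JP",
    "shift_jis": "Shift_JIS",
    "cp51932": "CP51932",
    "windows-31j": "Windows-31J",
}

# "Override": a label found on legacy content is silently reinterpreted.
WEB_OVERRIDES = {
    "euc-jp": "CP51932",
    "shift_jis": "Windows-31J",
}


def resolve_charset(label, strategy="option"):
    """Map a charset label to the charset actually used for decoding."""
    key = label.strip().lower()
    if strategy == "override" and key in WEB_OVERRIDES:
        return WEB_OVERRIDES[key]
    return REGISTERED[key]


print(resolve_charset("EUC-JP"))                # EUC-JP
print(resolve_charset("EUC-JP", "override"))    # CP51932
```

Under the option strategy a distinct registered name such as CP51932 is what lets a caller ask for the browser behavior explicitly.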
Post by NARUSE Yui
Yes, CP51932 is suitable for use with subtypes of the "text"
Content-Type. Note that CP51932 is an multi-octet charset.
Care should be taken to choose an appropriate Content-Transfer-Encoding.
While this is a generally true statement, it doesn't follow from the charset
being multi-octet. There are multi-octet charsets like iso-2022-jp that
are 7bit and hence require no special encoding at all, others like utf-8
are 8bit, and still others like utf-16 require binary.
I suggest changing this to "Since CP51932 is an 8bit charset, care should be
taken to choose an appropriate Content-Transfer-Encoding."
I changed it.
Post by NARUSE Yui
code set 0: US-ASCII (a single 7-bit byte set)
* 0x5C is U+005C : REVERSE SOLIDUS (YEN SIGN)
This makes no sense to me. A Reverse Solidus isn't a Yen sign.
In JIS X 0201 (the Japanese version of ISO 646), 0x5C is YEN SIGN,
and the 7-bit area of Shift_JIS is JIS X 0201.
But in the context of conversion to Unicode, for example when converting
source code written in C, BASIC, or a similar language, a 0x5C -> U+00A5
conversion breaks its meaning as the escape character.
So Microsoft treats 0x5C of Windows Codepage 932 as REVERSE SOLIDUS,
but its glyph is YEN SIGN (bundled fonts such as MS Gothic are hacked so
that the glyph of U+005C is YEN SIGN).

"When is a backslash not a backslash?"
http://blogs.msdn.com/michkap/archive/2005/09/17/469941.aspx

I mentioned it here, but it should be described in Additional
Information.
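
A small sketch of the problem described above, using hand-written mapping tables rather than any real codec (the names are hypothetical). It decodes the same 7-bit octets once as US-ASCII, which is what CP932 and CP51932 do, and once with the JIS X 0201 Roman assignments for 0x5C and 0x7E.

```python
# The two positions where JIS X 0201 Roman differs from US-ASCII.
JIS_X_0201_ROMAN = {0x5C: "\u00a5",   # YEN SIGN
                    0x7E: "\u203e"}   # OVERLINE


def decode_7bit(octet, jis_roman=False):
    """Decode one 7-bit octet to a Unicode character."""
    if not 0x00 <= octet <= 0x7F:
        raise ValueError("not a 7-bit octet")
    if jis_roman and octet in JIS_X_0201_ROMAN:
        return JIS_X_0201_ROMAN[octet]
    return chr(octet)


source = b'puts("a\\n");'   # C-like source containing a backslash escape

as_ascii = "".join(decode_7bit(b) for b in source)
as_jis = "".join(decode_7bit(b, jis_roman=True) for b in source)

print(as_ascii)   # the escape sequence survives
print(as_jis)     # the backslash became U+00A5, so the escape is lost
```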
Post by NARUSE Yui
* 0x7E is U+007E : TILDE
AFAIK 0x7E is a tilde in US-ASCII, so why are you calling this out?
JIS X 0201's 0x7E is OVERLINE, so I stated this explicitly.
Post by NARUSE Yui
code set 1: Microsoft Standard Character Set (a double 8-bit byte set)
Which Microsoft standard character set? I assume it's CP932, but this needs to
be stated explicitly.
Post by NARUSE Yui
* JIS X 0208-1983
* NEC special characters (Row 13)
* NEC selection of IBM extensions (Rows 89 to 92)
code set 2: Halfwidth Katakana (a single 7-bit byte set)
JIS X 0201-1976
requiring SS2 as the character prefix
You're approaching this backwards, which is unfortunately a pretty common
problem with these ISO 2022-based charsets, including some existing
registrations.
Charsets are mappings from octets to characters and should be specified
as such. What you're doing here is using the ISO approach, which defines
things in terms of one or more coded character sets and then describes how
to encode them using a character encoding scheme.
Octets with the high bit clear specify single US-ASCII characters, while
octets with the high bit set encode characters from the Microsoft Coded
Character Set 932 by combining the bits from the two octets ...
The problem with the ISO 2022 approach is that once you say you're using ISO
2022 you then have to profile what parts of ISO 2022 are allowed. ISO 2022 in
full generality is extremely complex and essentially nobody supports all of
it. (And EUC charsets are some of the most limited profiles of all - they
assume fixed bindings of coded character sets to C0-1/G0-3 and only use the
"shift next character" control sequences.)
I suggest you check out the specification of iso-2022-jp or any of the other
iso-2022-* variants in order to see how to write this sort of description.

I changed it to:
Octets with the high bit clear specify single US-ASCII characters, while
octets with the high bit set encode characters from Windows Codepage
932 by combining the bits from the two octets, except when the first octet is
0x8E, which represents Halfwidth Katakana.
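
The octet-level description above can be sketched as a tokenizer. This is an illustration under stated assumptions, not a codec: the function name is hypothetical, the valid ranges (0xA1 to 0xFE for double-byte octets, 0xA1 to 0xDF after 0x8E) are taken from common EUC practice rather than from this thread, and double-byte characters are returned as JIS row/cell pairs because the actual Unicode values come from the Windows Codepage 932 table.

```python
def tokenize_cp51932(data: bytes):
    """Split CP51932 octets into ("ascii" | "kana" | "double", value) tokens."""
    tokens = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                       # high bit clear: one US-ASCII octet
            tokens.append(("ascii", chr(lead)))
            i += 1
            continue
        if i + 1 >= n:
            raise ValueError("truncated sequence at offset %d" % i)
        trail = data[i + 1]
        if lead == 0x8E:                      # prefix for Halfwidth Katakana
            if not 0xA1 <= trail <= 0xDF:
                raise ValueError("invalid katakana octet at offset %d" % (i + 1))
            tokens.append(("kana", chr(0xFF61 + trail - 0xA1)))
        elif 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE:
            tokens.append(("double", (lead - 0xA0, trail - 0xA0)))
        else:                                 # includes 0x8F, which is not used here
            raise ValueError("invalid sequence at offset %d" % i)
        i += 2
    return tokens


print(tokenize_cp51932(b"A\x8e\xb1\xad\xa1"))
```

A real decoder would replace the row/cell pair with a lookup in the equivalency table referenced in the registration.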
Post by NARUSE Yui
This is a request for a new registration of this charset.
CP51932 is real implementation of EUC-JP mostly used by Web Browsers.
I think what you're trying to say is that these browsers interpret EUC-JP
(which is already registered and differs in some details) as the charset you
describe here rather than what's actually registered under the name EUC-JP. If
so, you need to make that much clearer.
I changed the description to:
Typical users of CP51932 are web browsers. When web browsers load
a page which is declared or auto-detected as "EUC-JP", they don't
interpret it as the true EUC-JP registered in IANA Character Sets but as
CP51932. When they post form data as "EUC-JP", the data is also
encoded as CP51932.
I also think some mention needs to be made of JIS X 0212 here and the apparent
lack of a binding of it to the G3 range (which is present in EUC-JP). While I
have no problem with dropping JIS X 0212 support - support for which is
sporadic at best - the rationale for not having it needs to be called out.

I added this description:
CP51932 is a variant of EUC-JP (as Windows-31J is a variant of Shift_JIS).
This charset differs from EUC-JP in that:
* CP51932 doesn't support JIS X 0212
* CP51932 supports characters extended by Windows Codepage 932
* the Unicode mappings of some characters are different
Post by NARUSE Yui
Internet Explorer gives a reference implementation.
Firefox, Safari, Opera, and Google Chrome support also this.
They refers this charset by the name "EUC-JP".
http://coq.no/character-tables/mime/euc/en
I'm not sure references to incorrect definitions of other charsets are
appropriate or useful.
I removed it.


The registration now reads as follows:
----------
Charset name: CP51932

Charset aliases: (none)

Suitability for use in MIME text:

Yes, CP51932 is suitable for use with subtypes of the "text"
Content-Type. Since CP51932 is an 8bit charset, care should be
taken to choose an appropriate Content-Transfer-Encoding.

Published specification(s):

Octets with the high bit clear specify single US-ASCII characters, while
octets with the high bit set encode characters from Windows Codepage
932 by combining the bits from the two octets, except when the first octet is
0x8E, which represents Halfwidth Katakana.

For the meaning and Unicode mapping of each character, refer to
Windows Codepage 932:
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx

ISO 10646 equivalency table:

http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm

Additional information:

This is a request for a new registration of this charset.

CP51932 is a variant of EUC-JP (as Windows-31J is a variant of Shift_JIS).
This charset differs from EUC-JP in that:
* CP51932 doesn't support JIS X 0212
* CP51932 supports characters extended by Windows Codepage 932
* the Unicode mappings of some characters are different

Typical users of CP51932 are web browsers. When web browsers load
a page which is declared or auto-detected as "EUC-JP", they don't
interpret it as the true EUC-JP registered in IANA Character Sets but as
CP51932. When they post form data as "EUC-JP", the data is also
encoded as CP51932.

The name "CP51932" is in use in the following applications:
* Citrus iconv (used by NetBSD and DragonFly)
* patched GNU libiconv in FreeBSD ports
* Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
* nkf 2.0.5
* PHP 5.2.1
* Ruby 1.9.1
* Encode-EUCJPMS-0.06

Moreover, applications which use MLang.DLL or the .NET Framework for
converting "EUC-JP" implicitly use this charset.

So this charset is widely used but doesn't have its own name.
The intended use of this name is to override the implementation of EUC-JP
or for charset conversion.
http://wiki.whatwg.org/wiki/Web_Encodings
http://www.w3.org/Bugs/Public/show_bug.cgi?id=7444

The name is not "Windows-51932" because some of the applications which accept
the name "CP51932" don't support the name "Windows-51932".

CP51932 is intended for importing legacy data.
UTF-8 is preferred to CP51932 for new systems.

Related references are:

"Remarks" of "GetEncodings Method" of "System.Text"
http://msdn.microsoft.com/en-us/library/system.text.encoding.ge=
tencodings.aspx

"Unicode=E3=81=AB=E3=82=88=E3=82=8BJIS X0213=E5=AE=9F=E8=A3=
=85=E5=85=A5=E9=96=80=E2=80=95=E6=83=85=E5=A0=B1=E3=82=B7=E3=82=B9=
=E3=83=86=E3=83=A0=E3=81=AE=E6=96=B0=E3=81=9F=E3=81=AA=E6=97=A5=E6=
=9C=AC=E8=AA=9E=E5=87=A6=E7=90=86=E7=92=B0=E5=A2=83"
=E6=97=A5=E7=B5=8CBP=E3=82=BD=E3=83=95=E3=83=88=E3=83=97=E3=
=83=AC=E3=82=B9, ISBN 978-4891006082, 2008, p. 17-18, 20, 120-158

CP51932 - Legacy Encoding Project
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp51932

This charset is also known as Windows Codepage 51932.

Person & email address to contact for further information:

NARUSE, Yui
Email: ***@airemix.jp

Intended usage: LIMITED USE

--
NARUSE, Yui <***@airemix.jp>
Ira McDonald
2010-04-05 16:49:13 UTC
Permalink
Hi,

Still missing in this registration is one charset alias of the form
"csXxxx", where "Xxxx" is usually the primary name of the charset,
e.g., "csCP51932" in this case.

Section 2.3 on page 4 of IANA Charset Registration Procedures
(RFC 2978 / BCP 19) says:

"All charsets MUST be assigned a name that provides a display string
for the associated "MIBenum" value defined below. These "MIBenum"
values are defined by and used in the Printer MIB [RFC-1759]. Such
names MUST begin with the letters "cs" and MUST contain no more than
40 characters (including the "cs" prefix) chosen from from the
printable subset of US-ASCII. Only one name beginning with "cs" may
be assigned to a single charset. If no name of this form is
explicitly defined IANA will assign an alias consisting of "cs"
prepended to the primary charset name."

In Printer MIB v2 (RFC 3805), these "csXxxx" aliases were moved out
of the Printer MIB and into the IANA Charset MIB (RFC 3808).
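
The rule quoted above is simple enough to check in code. A minimal sketch follows; the function names are hypothetical, and reading "printable subset of US-ASCII" as the range 0x21 to 0x7E (excluding space) is an assumption, not something the quoted text states.

```python
def validate_cs_alias(alias):
    """Check an alias against the "cs" rules quoted from RFC 2978."""
    if not alias.startswith("cs"):
        raise ValueError('alias must begin with "cs"')
    if len(alias) > 40:
        raise ValueError("alias must contain no more than 40 characters")
    if any(not 0x21 <= ord(ch) <= 0x7E for ch in alias):
        raise ValueError("alias must use printable US-ASCII only")


def default_cs_alias(primary_name):
    """Build the alias IANA would assign: "cs" prepended to the primary name."""
    alias = "cs" + primary_name
    validate_cs_alias(alias)
    return alias


print(default_cs_alias("CP51932"))   # csCP51932
```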

Cheers,
- Ira (editor of IANA Charset MIB, RFC 3808).

Ira McDonald (Musician / Software Architect)
Chair - Linux Foundation Open Printing WG
Co-Chair - TCG Hardcopy WG
IETF Designated Expert - IPP & Printer MIB
Blue Roof Music/High North Inc
http://sites.google.com/site/blueroofmusic
http://sites.google.com/site/highnorthinc
mailto:***@gmail.com
winter:
579 Park Place Saline, MI 48176
734-944-0094
summer:
PO Box 221 Grand Marais, MI 49839
906-494-2434
NARUSE Yui
2010-04-06 20:52:56 UTC
Permalink
Hi,
Post by Ira McDonald
Still missing in this registration is one charset alias of the form
"csXxxx", where "Xxxx" is usually the primary name of the charset,
e.g., "csCP51932" in this case.
I see, I added it.
Post by Ira McDonald
Section 2.3 on page 4 of IANA Charset Registration Procedures
"All charsets MUST be assigned a name that provides a display string
for the associated "MIBenum" value defined below. These "MIBenum"
values are defined by and used in the Printer MIB [RFC-1759]. Such
names MUST begin with the letters "cs" and MUST contain no more than
40 characters (including the "cs" prefix) chosen from from the
printable subset of US-ASCII. Only one name beginning with "cs" may
be assigned to a single charset. If no name of this form is
explicitly defined IANA will assign an alias consisting of "cs"
prepended to the primary charset name."
In Printer MIB v2 (RFC 3805), these "csXxxx" aliases were moved out
of the Printer MIB and into the IANA Charset MIB (RFC 3808).
The proposal now reads as follows.

Thanks,

----------
Charset name: CP51932

Charset aliases: csCP51932

Suitability for use in MIME text:

Yes, CP51932 is suitable for use with subtypes of the "text"
Content-Type. Since CP51932 is an 8bit charset, care should be
taken to choose an appropriate Content-Transfer-Encoding.

Published specification(s):

Octets with the high bit clear specify single US-ASCII characters, while
octets with the high bit set encode characters from Windows Codepage
932 by combining the bits from the two octets, except when the first octet is
0x8E, which represents Halfwidth Katakana.

For the meaning and Unicode mapping of each character, refer to
Windows Codepage 932:
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx

ISO 10646 equivalency table:

http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm

Additional information:

This is a request for a new registration of this charset.

CP51932 is a variant of EUC-JP (as Windows-31J is a variant of Shift_JIS).
This charset differs from EUC-JP in that:
* CP51932 doesn't support JIS X 0212
* CP51932 supports characters extended by Windows Codepage 932
* the Unicode mappings of some characters are different

Typical users of CP51932 are web browsers. When web browsers load
a page which is declared or auto-detected as "EUC-JP", they don't
interpret it as the true EUC-JP registered in IANA Character Sets but as
CP51932. When they post form data as "EUC-JP", the data is also
encoded as CP51932.

The name "CP51932" is in use in the following applications:
* Citrus iconv (used by NetBSD and DragonFly)
* patched GNU libiconv in FreeBSD ports
* Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
* nkf 2.0.5
* PHP 5.2.1
* Ruby 1.9.1
* Encode-EUCJPMS-0.06

Moreover, applications which use MLang.DLL or the .NET Framework for
converting "EUC-JP" implicitly use this charset.

So this charset is widely used but doesn't have its own name.
The intended use of this name is to override the implementation of EUC-JP
or for charset conversion.
http://wiki.whatwg.org/wiki/Web_Encodings
http://www.w3.org/Bugs/Public/show_bug.cgi?id=7444

The name is not "Windows-51932" because some of the applications which accept
the name "CP51932" don't support the name "Windows-51932".

CP51932 is intended for importing legacy data.
UTF-8 is preferred to CP51932 for new systems.

Related references are:

"Remarks" of "GetEncodings Method" of "System.Text"
http://msdn.microsoft.com/en-us/library/system.text.encoding.ge=
tencodings.aspx

"Unicode=E3=81=AB=E3=82=88=E3=82=8BJIS X0213=E5=AE=9F=E8=A3=
=85=E5=85=A5=E9=96=80=E2=80=95=E6=83=85=E5=A0=B1=E3=82=B7=E3=82=B9=
=E3=83=86=E3=83=A0=E3=81=AE=E6=96=B0=E3=81=9F=E3=81=AA=E6=97=A5=E6=
=9C=AC=E8=AA=9E=E5=87=A6=E7=90=86=E7=92=B0=E5=A2=83"
=E6=97=A5=E7=B5=8CBP=E3=82=BD=E3=83=95=E3=83=88=E3=83=97=E3=
=83=AC=E3=82=B9, ISBN 978-4891006082, 2008, p. 17-18, 20, 120-158

CP51932 - Legacy Encoding Project
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp51932

This charset is also known as Windows Codepage 51932.

Person & email address to contact for further information:

NARUSE, Yui
Email: ***@airemix.jp

Intended usage: LIMITED USE

--
NARUSE, Yui <***@airemix.jp>
NARUSE Yui
2010-05-16 11:10:19 UTC
Permalink
It has been one month since the last mail in this thread.
I think that is enough time for this topic.

Or does anyone have any further suggestions?
Post by NARUSE Yui
Hi,
Post by Ira McDonald
Still missing in this registration is one charset alias of the form
"csXxxx", where "Xxxx" is usually the primary name of the charset,
e.g., "csCP51932" in this case.
I see, I added it.
Post by Ira McDonald
Section 2.3 on page 4 of IANA Charset Registration Procedures
"All charsets MUST be assigned a name that provides a display string
for the associated "MIBenum" value defined below. These "MIBenum"
values are defined by and used in the Printer MIB [RFC-1759]. Such
names MUST begin with the letters "cs" and MUST contain no more than
40 characters (including the "cs" prefix) chosen from from the
printable subset of US-ASCII. Only one name beginning with "cs" may
be assigned to a single charset. If no name of this form is
explicitly defined IANA will assign an alias consisting of "cs"
prepended to the primary charset name."
In Printer MIB v2 (RFC 3805), these "csXxxx" aliases were moved out
of the Printer MIB and into the IANA Charset MIB (RFC 3808).
The proposal now reads as follows.
Thanks,
----------
Charset name: CP51932

Charset aliases: csCP51932

Suitability for use in MIME text:

Yes, CP51932 is suitable for use with subtypes of the "text"
Content-Type. Since CP51932 is an 8bit charset Care should be
taken to choose an appropriate Content-Transfer-Encoding.

Published specification(s):

Octets with the high bit clear specify single US-ASCII characters, while
octets with the high bit set encode characters from the Windows Codepage
932 by combining the bits from the two octets except the first octet is
0x8E which represents Halfwidth Katakana.

For the meaning and Unicode mapping of each character, refer to
Windows Codepage 932.
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx

ISO 10646 equivalency table:

http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm

Additional information:

This is a request for a new registration of this charset.

CP51932 is a variant of EUC-JP (as Windows-31J is a variant of Shift_JIS).
* CP51932 doesn't support JIS X 0212
* CP51932 supports characters extended by Windows Codepage 932
* The Unicode mappings of some characters are different

Typical users of CP51932 are web browsers. When web browsers load
a page which is declared or auto-detected as "EUC-JP", they don't
interpret it as true EUC-JP registered in IANA Character Sets but as
CP51932. When they post form data as "EUC-JP", the data is also
encoded as CP51932.

The name "CP51932" is in use in the following applications:
* Citrus iconv (NetBSD and DragonFly use this)
* patched GNU libiconv in FreeBSD ports
* Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
* nkf 2.0.5
* PHP 5.2.1
* Ruby 1.9.1
* Encode-EUCJPMS-0.06

Moreover, applications that use MLang.DLL or the .NET Framework for
converting "EUC-JP" implicitly use this charset.

So this charset is widely used, but doesn't have its own name.
The intended use of this name is to override the implementation of EUC-JP
or charset conversion.
http://wiki.whatwg.org/wiki/Web_Encodings
http://www.w3.org/Bugs/Public/show_bug.cgi?id=7444

The name is not "Windows-51932" because some of the applications which accept
the name "CP51932" don't support the name "Windows-51932".

CP51932 is intended for importing legacy data.
UTF-8 is preferred to CP51932 for new systems.

Related references are:

"Remarks" of "GetEncodings Method" of "System.Text"
http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx

"UnicodeによるJIS X0213実装入門―情報システムの新たな日本語処理環境"
日経BPソフトプレス, ISBN 978-4891006082, 2008, p. 17-18, 20, 120-158

CP51932 - Legacy Encoding Project
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp51932

This charset is also known as Windows Codepage 51932.

Person & email address to contact for further information:

NARUSE, Yui
Email: ***@airemix.jp

Intended usage: LIMITED USE
--
NARUSE, Yui <***@airemix.jp>
NARUSE, Yui
2010-08-27 06:11:08 UTC
Hi,

I think this topic has been discussed enough.
So I would like to proceed to the next stage.

Thanks,
--
NARUSE, Yui <***@airemix.jp>
Shawn Steele
2010-08-27 17:56:36 UTC
Can we say "you should really use UTF-8" as a note in new registrations?

Bjoern Hoehrmann
2010-08-27 22:31:25 UTC
Post by Shawn Steele
Can we say "you should really use UTF-8" as a note in new registrations?

If you mean this should be in the entry in the IANA charset registry
then the answer would probably be "maybe", but there would be little
point; of the few people who read through the registry, pretty much
no-one would take such a note as (the) reason to research the matter
in detail and then conclude the note is correct and abandon plans to
register a "new" encoding. A document that summarizes the benefits of
UTF-8 would be (slightly?) more effective.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
NARUSE, Yui
2010-08-28 01:01:50 UTC
Post by Shawn Steele
Can we say "you should really use UTF-8" as a note in new registrations?
Do you mean that the following description should be changed to say that?

CP51932 is intended for importing legacy data.
UTF-8 is preferred to CP51932 for new systems.
--
NARUSE, Yui <***@airemix.jp>
Shawn Steele
2010-08-29 04:13:02 UTC
Thanks for pointing that out, I must've been sleeping, sorry :)

-Shawn

 
http://blogs.msdn.com/shawnste


Martin J. Dürst
2010-08-28 02:39:35 UTC
Post by Shawn Steele
Can we say "you should really use UTF-8" as a note in new registrations?

[Expert reviewer hat off]

First, the registration says (at the very end):

Intended usage: LIMITED USE

which implies "use something else".

Second, adding "you should really use UTF-8" to some new registrations
might imply that this doesn't apply to older registrations, which in my
view would give the wrong impression.

Third, the registration ALREADY says:

UTF-8 is preferred to CP51932 for new systems.

Regards, Martin.
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Martin J. Dürst
2010-08-28 05:26:45 UTC
Hello Yui,
Post by NARUSE, Yui
Hi,
I think this topic has been discussed enough.
So I would like to proceed to the next stage.
Many thanks for the reminder. RFC 2978
(http://tools.ietf.org/html/rfc2978) says:
3.2. Charset Reviewer

When the two week period has passed and the registration proposer is
convinced that consensus has been achieved, the registration
application should be submitted to IANA and the charset reviewer.
The charset reviewer, who is appointed by the IETF Applications Area
Director(s), either approves the request for registration or rejects
it. ...
Because the section is entitled "Charset Reviewer", it's a bit difficult
to find and understand that the idea is that the *proposer* sends the
proposal to IANA. And it's also not clear what "sends the proposal to
IANA" means.

So please send your proposal to ***@iana.org, copying Ned and me (and
ietf-***@iana.org if you want). At the start of your mail, please
say that this is a proposal for registration of a charset.
Please remove the leading ">>>" (or however the quotation shows in your
mailer). And please make the changes below (mostly related to the fact
that when this gets registered, some of the current statements will no
longer be true).

Regards, Martin.
Post by NARUSE, Yui
----------
Charset name: CP51932
Charset aliases: csCP51932
Suitability for use in MIME text:
Yes, CP51932 is suitable for use with subtypes of the "text"
Content-Type. Since CP51932 is an 8bit charset Care should be
taken to choose an appropriate Content-Transfer-Encoding.
Please change "charset Care" to "charset, care".
Post by NARUSE, Yui
Published specification(s):
Octets with the high bit clear specify single US-ASCII characters, while
octets with the high bit set encode characters from the Windows Codepage
932 by combining the bits from the two octets except the first octet is
0x8E which represents Halfwidth Katakana.
Please change "two octets except the first octet is 0x8E which
represents Halfwidth Katakana" to "two octets except when the first
octet is 0x8E, in which case this represents Halfwidth Katakana".
Post by NARUSE, Yui
For the meaning and Unicode mapping of each character, refer to
Windows Codepage 932.
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
ISO 10646 equivalency table:
http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm
Additional information:
This is a request for a new registration of this charset.
This sentence can be removed.
Post by NARUSE, Yui
CP51932 is a variant of EUC-JP (as Windows-31J is a variant of Shift_JIS).
* CP51932 doesn't support JIS X 0212
* CP51932 supports characters extended by Windows Codepage 932
* The Unicode mappings of some characters are different
Typical users of CP51932 are web browsers. When web browsers load
a page which is declared or auto-detected as "EUC-JP", they don't
interpret it as true EUC-JP registered in IANA Character Sets but as
CP51932. When they post form data as "EUC-JP", the data is also
encoded as CP51932.
The name "CP51932" is in use in the following applications:
* Citrus iconv (NetBSD and DragonFly use this)
* patched GNU libiconv in FreeBSD ports
* Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
* nkf 2.0.5
* PHP 5.2.1
* Ruby 1.9.1
* Encode-EUCJPMS-0.06
Moreover, applications that use MLang.DLL or the .NET Framework for
converting "EUC-JP" implicitly use this charset.
So this charset is widely used, but doesn't have its own name.
Please remove (or reword) this sentence.
Post by NARUSE, Yui
The intended use of this name is to override the implementation of EUC-JP
or charset conversion.
http://wiki.whatwg.org/wiki/Web_Encodings
http://www.w3.org/Bugs/Public/show_bug.cgi?id=7444
The name is not "Windows-51932" because some of the applications which accept
the name "CP51932" don't support the name "Windows-51932".
CP51932 is intended for importing legacy data.
UTF-8 is preferred to CP51932 for new systems.
Related references are:
"Remarks" of "GetEncodings Method" of "System.Text"
http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx
"UnicodeによるJIS X0213実装入門―情報システムの新たな日本語処理環境"
日経BPソフトプレス, ISBN 978-4891006082, 2008, p. 17-18, 20, 120-158

I'm not sure IANA can get this up in Japanese, but let's try. I propose
to provide a translation of this reference, at least as a fallback.

"Introduction to JIS X0213 Implementation based on Unicode - A new
Japanese Language Processing Environment for Information Systems",
Nikkei BP Soft Press, ISBN 978-4891006082, 2008, pp. 17-18, 20, 120-158
(in Japanese)

[translation is my own, please feel free to improve]
Post by NARUSE, Yui
CP51932 - Legacy Encoding Project
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp51932
This charset is also known as Windows Codepage 51932.
Person & email address to contact for further information:
NARUSE, Yui
Intended usage: LIMITED USE
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
NARUSE, Yui
2010-08-28 08:39:44 UTC
Hello,

This is the registration of the new charset CP51932.

The following proposal appears to have reached consensus on ietf-charsets,
so I am submitting it to IANA.

Regards,

---
Charset name: CP51932

Charset aliases: csCP51932

Suitability for use in MIME text:

Yes, CP51932 is suitable for use with subtypes of the "text"
Content-Type. Since CP51932 is an 8bit charset, care should be
taken to choose an appropriate Content-Transfer-Encoding.
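
As a rough illustration of the Content-Transfer-Encoding point, the following Python sketch wraps already-encoded CP51932 octets in a MIME text part using base64, so that the 8-bit octets survive a 7-bit transport. The helper name `build_cp51932_part` and the sample octets are assumptions made for this example only; quoted-printable would be an equally valid choice.

```python
from email.message import EmailMessage


def build_cp51932_part(body: bytes) -> EmailMessage:
    """Wrap already-encoded CP51932 octets in a text/plain MIME part.

    Hypothetical helper for illustration. The octets are passed through
    untouched; base64 keeps the 8-bit data safe on 7-bit transports.
    """
    msg = EmailMessage()
    msg.set_content(
        body,
        maintype="text",
        subtype="plain",
        cte="base64",
        params={"charset": "CP51932"},
    )
    return msg


# Sample octets: 0xA4 0xA2 is HIRAGANA LETTER A in the EUC layout.
part = build_cp51932_part(b"\xa4\xa2\n")
print(part.as_string())
```

A receiver that does not know the label "CP51932" cannot decode the text, which is one reason the registration asks for LIMITED USE.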

Published specification(s):

Octets with the high bit clear specify single US-ASCII characters, while
octets with the high bit set encode characters from the Windows Codepage
932 by combining the bits from the two octets except when the first octet
is 0x8E, in which case this represents Halfwidth Katakana.

For the meaning and Unicode mapping of each character, refer to
Windows Codepage 932.
http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
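
The octet structure described above can be sketched in Python as a small splitter that classifies each unit without mapping it to Unicode. This is only a sketch: the function name `split_cp51932` is hypothetical, and the octet ranges (0xA1 to 0xFE for double-byte units, 0xA1 to 0xDF after 0x8E) are assumptions taken from the general EUC layout rather than from the text of this registration.

```python
from typing import Iterator, Tuple


def split_cp51932(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (kind, raw_octets) units from a CP51932 octet string.

    Hypothetical helper; the ranges below are assumed from the usual
    EUC layout and are not stated in the registration itself.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # High bit clear: a single US-ASCII octet.
            yield ("ascii", data[i:i + 1])
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        trail = data[i + 1]
        if lead == 0x8E and 0xA1 <= trail <= 0xDF:
            # 0x8E prefix: one Halfwidth Katakana character.
            yield ("halfwidth-katakana", data[i:i + 2])
        elif 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE:
            # Two high-bit octets: one double-byte character.
            yield ("double-byte", data[i:i + 2])
        else:
            # Includes the 0x8F prefix that EUC-JP uses for JIS X 0212.
            raise ValueError(f"invalid octets at offset {i}")
        i += 2


units = list(split_cp51932(b"A\x8e\xb1\xa4\xa2"))
```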

ISO 10646 equivalency table:

http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm

Additional information:

CP51932 is a variant of EUC-JP (as Windows-31J is a variant of Shift_JIS).
This charset differs from EUC-JP in the following ways:
* CP51932 doesn't support JIS X 0212
* CP51932 supports characters extended by Windows Codepage 932
* The Unicode mappings of some characters are different
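
The mapping difference in the last item can be illustrated with a short Python sketch. As far as I know, Python does not ship a codec named CP51932, so the sketch assumes the mapping can be approximated by converting each double-byte EUC unit to its Shift_JIS position and decoding it with Python's `cp932` codec. The function names are hypothetical, and the equivalency table linked above remains the authoritative mapping.

```python
def euc_pair_to_cp932(b1: int, b2: int) -> bytes:
    """Convert one double-byte EUC unit to its Shift_JIS position."""
    j1, j2 = b1 - 0x80, b2 - 0x80  # strip the high bits (JIS row/cell)
    s1 = ((j1 + 1) >> 1) + 0x70
    if s1 > 0x9F:
        s1 += 0x40  # skip the single-byte katakana lead range
    if j1 & 1:
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1  # 0x7F is never a Shift_JIS trail octet
    else:
        s2 = j2 + 0x7E
    return bytes([s1, s2])


def decode_pair_via_cp932(pair: bytes) -> str:
    """Assumed approximation of CP51932 decoding for one double-byte unit."""
    return euc_pair_to_cp932(pair[0], pair[1]).decode("cp932")


WAVE_DASH = b"\xa1\xc1"  # row 1, cell 33 of JIS X 0208
as_euc_jp = WAVE_DASH.decode("euc_jp")            # WAVE DASH
as_cp51932 = decode_pair_via_cp932(WAVE_DASH)      # FULLWIDTH TILDE
circled_one = decode_pair_via_cp932(b"\xad\xa1")   # NEC special character, row 13
print(f"U+{ord(as_euc_jp):04X} U+{ord(as_cp51932):04X} U+{ord(circled_one):04X}")
```

The same octets therefore become different code points depending on which of the two charsets a decoder assumes.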

Typical users of CP51932 are web browsers. When web browsers load
a page which is declared or auto-detected as "EUC-JP", they don't
interpret it as true EUC-JP registered in IANA Character Sets but as
CP51932. When they post form data as "EUC-JP", the data is also
encoded as CP51932.

The name "CP51932" is in use in the following applications:
* Citrus iconv (NetBSD and DragonFly use this)
* patched GNU libiconv in FreeBSD ports
* Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
* nkf 2.0.5
* PHP 5.2.1
* Ruby 1.9.1
* Encode-EUCJPMS-0.06

Moreover, applications that use MLang.DLL or the .NET Framework for
converting "EUC-JP" implicitly use this charset.

The intended use of this name is to override the implementation of EUC-JP
or charset conversion.
http://wiki.whatwg.org/wiki/Web_Encodings
http://www.w3.org/Bugs/Public/show_bug.cgi?id=7444

The name is not "Windows-51932" because some of the applications which accept
the name "CP51932" don't support the name "Windows-51932".

CP51932 is intended for importing legacy data.
UTF-8 is preferred to CP51932 for new systems.

Related references are:

"Remarks" of "GetEncodings Method" of "System.Text"
http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx

"Introduction to JIS X0213 Implementation based on Unicode -
A new Japanese Language Processing Environment for Information Systems",
Nikkei BP Soft Press, ISBN 978-4891006082, 2008, pp. 17-18, 20, 120-158 (in Japanese)

CP51932 - Legacy Encoding Project
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp51932

This charset is also known as Windows Codepage 51932.

Person & email address to contact for further information:

NARUSE, Yui
Email: ***@airemix.jp

Intended usage: LIMITED USE
--
NARUSE, Yui <***@airemix.jp>