Discussion:
windows 936
Shawn Steele
2007-05-22 00:55:41 UTC
I am looking at the registrations for the remaining 4 "system" code
pages: 932, 936, 949 & 950. This seems complicated since IE uses other
names for them.

For example, for 936, IE recognizes Chinese, CN-GB, csGB2312,
csGB231280, csISO58GB231280, GB2312, GB2312-80, GB231280, GBK,
GB_2312_80 and iso-ir-58, and of course it's known to the system as 936.

Our APIs report this code page as being "gb2312".

There is an existing registration for GBK, with aliases CP936, MS936 and
windows-936, but not the gb2312 name. The existing registration
points to broken links at Microsoft and IBM. This should probably be
updated to point to:

http://www.microsoft.com/globaldev/reference/dbcs/936.mspx
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
and
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit936.txt

I am a bit uncertain that GBK == 936, although this is what the existing
registration implies.

The alternative solution would seem to be to register a new charset as
"windows-936" with the same additional aliases as the GBK registration
and point to the above tables. This would then also lead to the
question of whether GBK and gb2312 should be listed as aliases for any
such windows-936 code page, although the interpretation of those aliases
could differ for other systems.

My goal is to clarify the Microsoft system code page mappings such as
for 932, 936, 949 & 950, and I'd appreciate any suggestions about how to
best do that :-)

Thanks,

- Shawn

Shawn Steele
SDE
Windows International
Microsoft
Erik van der Poel
2007-05-22 01:51:38 UTC
Most of the Windows code pages are "supersets" of other standard sets.
But rather than adding new charset names for these supersets, it might
be better to add comments to the existing registrations to point out
the relationships between the various sets.

For example, the windows-936 registration might refer to the gb2312
one, the windows-31j registration might refer to Windows Code Page 932
and the Shift_JIS registration, the EUC-KR registration might refer to
CP 949 and the Big5 registration to CP 950. All as informative
references, rather than normative, I think.
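
To make the "superset" relationship concrete, here is a minimal sketch
that uses Python's built-in codecs as a stand-in for the registered
charsets. It rests on two assumptions that are not from the messages
above: that Python's "gbk" codec (for which "cp936" is an alias in
CPython) is a fair approximation of code page 936, and that U+9555 is a
character present in GBK but absent from GB 2312.

```python
# Assumption: U+9555 is in GBK but not in GB 2312-80.
ch = "\u9555"

# Encodes fine under the larger set (Python's "gbk"; "cp936" is an alias).
gbk_bytes = ch.encode("gbk")

# The same character is rejected by the strict GB 2312 codec.
try:
    ch.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

# Text limited to the common subset produces identical bytes under both
# labels, which is why the shared name mostly works in practice.
common = "\u4e2d\u6587"
same_bytes = common.encode("gb2312") == common.encode("gbk")
```

Text that stays within the common subset is byte-identical under either
label; only the extension characters expose the difference.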

This promotes interoperability while avoiding the addition of more
names and "virtual" aliases.

Erik

Martin Duerst
2007-05-22 02:57:57 UTC
Dealing with minor differences and variants as notes is definitely
one possibility. However, I think sooner rather than later, we
should look at a more systematic way of indicating variants
and extensions.

Here is an extremely rough strawman:

a) Identify a character that's okay in charset tags but rarely
used (e.g. '+', don't even know whether that's okay)
b) Use this character to separate base tag and variants, e.g.
base tag: Shift_jis
tag with variant: Shift_jis+cp932

Shift_jis would only indicate that this is some kind of shift_jis.
Applications that don't care too much about variants would just
use this. Shift_jis+cp932 indicates the variant with the Microsoft
additions. Applications on the receiving end not interested in
variants would have to cut off trailing '+' and what's after.
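
As a rough illustration of the "cut off" step, a receiver could split
the tag at the first separator. The function name split_variant is
hypothetical, and '+' is only the strawman separator from this message.

```python
def split_variant(tag, sep="+"):
    """Split a charset tag into (base, variant).

    variant is None when the tag carries no separator.
    """
    base, found, variant = tag.partition(sep)
    return base, (variant if found else None)


base, variant = split_variant("Shift_jis+cp932")
plain_base, plain_variant = split_variant("Shift_jis")
```

A receiver that does not care about variants would simply use the first
element of the result and ignore the second.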

The above proposal isn't without problems, but addresses the
second most fundamental problem in the current scheme.

(The first most fundamental problem is that stuff is often
tagged wrongly. But that's a much harder problem than the variants.)

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
McDonald Ira
2007-05-22 17:07:02 UTC
Hi,

Now that's an interesting idea, Martin. And "+" _is_
legal in charset names, per the following quote from
page 4 of RFC 2978:

   Finally, charsets being registered for use with the "text" media type
   MUST have a primary name that conforms to the more restrictive syntax
   of the charset field in MIME encoded-words [RFC-2047, RFC-2184] and
   MIME extended parameter values [RFC-2184]. A combined ABNF
   definition for such names is as follows:

     mime-charset       = 1*mime-charset-chars
     mime-charset-chars = ALPHA / DIGIT /
                          "!" / "#" / "$" / "%" / "&" /
                          "'" / "+" / "-" / "^" / "_" /
                          "`" / "{" / "}" / "~"
     ALPHA              = "A".."Z"    ; Case insensitive ASCII Letter
     DIGIT              = "0".."9"    ; Numeric digit


Looking at the latest posted IANA Charset Registry
plaintext, there are a few uses of "+" (for "+euro")
in aliases (but never base names), but it's pretty
rare. See:

http://www.iana.org/assignments/character-sets

Cheers,
- Ira

Ira McDonald (Musician / Software Architect)
Chair - Linux Foundation Open Printing WG
Blue Roof Music / High North Inc
PO Box 221 Grand Marais, MI 49839
phone: +1-906-494-2434
email: ***@sharplabs.com

Martin Duerst
2007-09-21 09:46:21 UTC
Hello Ira, others,

This is a long overdue reply.

At 02:07 07/05/23, McDonald, Ira wrote:
>Hi,
>
>Now that's an interesting idea, Martin.

Thanks!

>And "+" _is_
>legal in charset names, per the following quote from
>page 4 of RFC 2978:
>
>   Finally, charsets being registered for use with the "text" media type
>   MUST have a primary name that conforms to the more restrictive syntax
>   of the charset field in MIME encoded-words [RFC-2047, RFC-2184] and
>   MIME extended parameter values [RFC-2184]. A combined ABNF
>   definition for such names is as follows:
>
>     mime-charset       = 1*mime-charset-chars
>     mime-charset-chars = ALPHA / DIGIT /
>                          "!" / "#" / "$" / "%" / "&" /
>                          "'" / "+" / "-" / "^" / "_" /
>                          "`" / "{" / "}" / "~"
>     ALPHA              = "A".."Z"    ; Case insensitive ASCII Letter
>     DIGIT              = "0".."9"    ; Numeric digit

Yes, but that may not be good enough. XML spoils things.
The relevant production in the XML Recommendation doesn't
allow '+'. From http://www.w3.org/TR/REC-xml/#charencoding:

[80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*

Now there would be three ways ahead:
- Ignore XML. I don't think we want to go there.
- Try to change XML. A few years ago, that would have been
easy with an erratum, but I don't think this will be met
with cheers these days.
- Choose a separator different from '+'. After quite a bit of
thinking, I have reached the conclusion that the obvious
thing to do would be to use something like '--'.
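
The conflict between the two grammars can be checked mechanically. The
sketch below transcribes the mime-charset ABNF and the XML EncName
production quoted above into regular expressions; the helper names and
the sample tags "Shift_JIS+cp932" and "Shift_JIS--cp932" are
illustrative only, not registered names.

```python
import re

# mime-charset from RFC 2978: one or more of ALPHA / DIGIT / listed symbols.
MIME_CHARSET = re.compile(r"[A-Za-z0-9!#$%&'+\-^_`{}~]+")

# XML production [81]: EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_mime_charset(name):
    """True if the whole name matches the RFC 2978 mime-charset syntax."""
    return MIME_CHARSET.fullmatch(name) is not None


def is_xml_encname(name):
    """True if the whole name matches the XML EncName production."""
    return ENC_NAME.fullmatch(name) is not None


plus_tag = "Shift_JIS+cp932"
dash_tag = "Shift_JIS--cp932"
results = {
    tag: (is_mime_charset(tag), is_xml_encname(tag))
    for tag in (plus_tag, dash_tag)
}
```

Under these two patterns the '+' form is accepted by the MIME grammar
but rejected by XML, while the '--' form passes both.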

What does everybody think?

Regards, Martin.



#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Erik van der Poel
2007-09-21 13:52:49 UTC
I don't think it's such a good idea. The Web has come a long way in
terms of labelling charsets. In the early days, very few people
bothered to insert the HTML <meta> with charset, and even fewer people
inserted the HTTP charset. Nowadays, around 74% of the documents in
Google's index have the meta charset.

The commonly used characters are currently being conveyed correctly
from human to human by using the common charset names on the wire.
If/when you start to introduce charset variant names that are not
understood by the clients, even the commonly used characters cannot be
viewed, let alone the rare characters supposedly enabled by these
variant names.

Of course, if we get all the clients to upgrade first, we won't have
this problem. But are these minor variants worth all that trouble?

Erik

Shawn Steele
2007-09-21 16:28:16 UTC
> I don't think it's such a good idea.

I agree, for mostly the same reason: clients (and servers, ours and others') already abuse the code page names by creating the variants. It's unlikely that all of these clients will get upgraded to support variant tags. Even if they did, that would presume there are no errors (which is how lots of these variations happened). It's possible that updated clients could even accidentally save data with the wrong tag.

Existing data would need to be upgraded to indicate its variant, and frankly if they're going to that much effort, I'd be much happier if they upgraded to UTF-8 or UTF-16, for which solutions already exist.

- Shawn
Martin Duerst
2007-09-22 07:20:35 UTC
At 22:52 07/09/21, Erik van der Poel wrote:
>I don't think it's such a good idea. The Web has come a long way in
>terms of labelling charsets. In the early days, very few people
>bothered to insert the HTML <meta> with charset, and even fewer people
>inserted the HTTP charset. Nowadays, around 74% of the documents in
>Google's index have the meta charset.

Even if some percentage of these is wrong (do you have any idea?),
that's definitely a lot of progress.

>The commonly used characters are currently being conveyed correctly
>from human to human by using the common charset names on the wire.
>If/when you start to introduce charset variant names that are not
>understood by the clients, even the commonly used characters cannot be
>viewed, let alone the rare characters supposedly enabled by these
>variant names.
>
>Of course, if we get all the clients to upgrade first, we won't have
>this problem. But are these minor variants worth all that trouble?

It's definitely a good question. For some applications, the answer
is clearly 'no'. But for others, it may easily be 'yes'.

Please note that the first step towards supporting these variant
tags would be that recipients check if they support a full variant
tag, and if not, they look for '--' in the tag, cut off the variant
part, and try again. That's the main advantage and purpose of a
special separator. Of course even that requires an update to
receivers.
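
The fallback described here might look roughly like the following
sketch. Python's codec registry stands in for a receiver's table of
supported charsets, and both the function name resolve_charset and the
tag "Shift_JIS--cp932" are hypothetical; no such variant tags are
registered.

```python
import codecs


def resolve_charset(tag, sep="--"):
    """Look up the full tag first; if unknown, retry with the base tag."""
    try:
        return codecs.lookup(tag)
    except LookupError:
        base, found, _variant = tag.partition(sep)
        if not found:
            # No variant part to cut off, so the tag is simply unknown.
            raise
        return codecs.lookup(base)


# The registry does not know the variant tag, so this falls back to the base.
info = resolve_charset("Shift_JIS--cp932")
```

A receiver that did know the full variant tag would return on the first
lookup and never reach the fallback.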

Regards, Martin.


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Erik van der Poel
2007-09-22 13:38:01 UTC
On 9/22/07, Martin Duerst <***@it.aoyama.ac.jp> wrote:
> At 22:52 07/09/21, Erik van der Poel wrote:
> >I don't think it's such a good idea. The Web has come a long way in
> >terms of labelling charsets. In the early days, very few people
> >bothered to insert the HTML <meta> with charset, and even fewer people
> >inserted the HTTP charset. Nowadays, around 74% of the documents in
> >Google's index have the meta charset.
>
> Even if some percentage of these is wrong (do you have any idea?),
> that's definitely a lot of progress.

I believe there are shades of gray between "wrong" and right, partly
because of the variant issue and partly because the major browsers are
not 100% consistent with each other, with the IANA registry or with
the official and semi-official Unicode mapping tables.

I suspect that a large percentage of the meta and http charsets is
"correct", at least to the extent of displaying the more common
characters correctly on the major browsers.

One way to estimate that percentage might be to compare the http or
meta charset with an encoding detector's result.
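
A minimal sketch of such a comparison, assuming a sample of
(declared label, raw bytes) pairs and some detector supplied by the
caller. The names same_charset and agreement_rate are made up for
illustration, and the detector is deliberately left as a parameter
because no particular one is implied here.

```python
import codecs


def same_charset(a, b):
    """Compare two labels after alias normalization where possible."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # Unknown to the local registry: fall back to a plain comparison.
        return a.strip().lower() == b.strip().lower()


def agreement_rate(samples, detect):
    """Fraction of samples whose declared label matches the detector.

    samples: iterable of (declared_label, raw_bytes)
    detect:  callable taking raw bytes and returning a charset label
    """
    total = 0
    agree = 0
    for declared, raw in samples:
        total += 1
        if same_charset(declared, detect(raw)):
            agree += 1
    return agree / total if total else 0.0
```

Normalizing through the alias table matters because a declared "utf8"
and a detected "UTF-8" should count as agreement.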

> >The commonly used characters are currently being conveyed correctly
> >from human to human by using the common charset names on the wire.
> >If/when you start to introduce charset variant names that are not
> >understood by the clients, even the commonly used characters cannot be
> >viewed, let alone the rare characters supposedly enabled by these
> >variant names.
> >
> >Of course, if we get all the clients to upgrade first, we won't have
> >this problem. But are these minor variants worth all that trouble?
>
> It's definitely a good question. For some applications, the answer
> is clearly 'no'. But for others, it may easily be 'yes'.

Which apps do you have in mind here?

> Please note that the first step towards supporting these variant
> tags would be that recipients check if they support a full variant
> tag, and if not, they look for '--' in the tag, cut off the variant
> part, and try again. That's the main advantage and purpose of a
> special separator. Of course even that requires an update to
> receivers.

Understood.

Erik
Martin Duerst
2007-09-23 12:14:40 UTC
At 22:38 07/09/22, Erik van der Poel wrote:
>On 9/22/07, Martin Duerst <***@it.aoyama.ac.jp> wrote:

>> Even if some percentage of these is wrong (do you have any idea?),
>> that's definitely a lot of progress.
>
>I believe there are shades of gray between "wrong" and right, partly
>because of the variant issue and partly because the major browsers are
>not 100% consistent with each other, with the IANA registry or with
>the official and semi-official Unicode mapping tables.
>
>I suspect that a large percentage of the meta and http charsets is
>"correct", at least to the extent of displaying the more common
>characters correctly on the major browsers.

That's all we can ask for at the moment, I guess. If we don't
provide a way to label variants, there is no way to expect that
pages are labeled with the correct variant.

>One way to estimate that percentage might be to compare the http or
>meta charset with an encoding detector's result.

Yes. Do you have any way to do that (of course that would be done
just on a careful sample)?

>> >The commonly used characters are currently being conveyed correctly
>> >from human to human by using the common charset names on the wire.
>> >If/when you start to introduce charset variant names that are not
>> >understood by the clients, even the commonly used characters cannot be
>> >viewed, let alone the rare characters supposedly enabled by these
>> >variant names.
>> >
>> >Of course, if we get all the clients to upgrade first, we won't have
>> >this problem. But are these minor variants worth all that trouble?
>>
>> It's definitely a good question. For some applications, the answer
>> is clearly 'no'. But for others, it may easily be 'yes'.
>
>Which apps do you have in mind here?

A very typical case would be XML Signature. According to the spec,
you can sign e.g. an XML document in Shift_JIS, but it's done
by conversion to UTF-8. If the conversion isn't the same when
the signature is checked, the signature won't match anymore even
if it's actually correct.

Cases like these are quite different from the usual browsing case,
where an odd wrong character may just be overlooked.

Regards, Martin.



#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
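
[Editorial sketch] The comparison suggested above (declared http/meta charset versus an encoding detector's result) could be scored along these lines once both labels are available for a sample of pages. The detector itself is outside the sketch; the pairs are assumed to come from elsewhere, and the names canonical and agreement_rate are hypothetical. Python's own alias table stands in for the IANA registry here, which is itself an assumption.

```python
import codecs


def canonical(label):
    """Map a charset label to Python's canonical codec name, or None."""
    try:
        return codecs.lookup(label.strip()).name
    except LookupError:
        return None


def agreement_rate(pairs):
    """Fraction of (declared, detected) pairs naming the same codec.

    Pairs where either label is unknown to Python are left out of the
    denominator rather than counted as disagreement.
    """
    known = [(canonical(d), canonical(g)) for d, g in pairs]
    known = [(d, g) for d, g in known if d is not None and g is not None]
    if not known:
        return 0.0
    return sum(d == g for d, g in known) / len(known)
```

Note that a page declared as Shift_JIS but detected as cp932 counts as a mismatch under this scoring, which is exactly the "shades of gray" problem: a stricter or looser notion of equality would give a different percentage.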
Erik van der Poel
2007-09-23 15:15:11 UTC
Permalink
On 9/23/07, Martin Duerst <***@it.aoyama.ac.jp> wrote:
> At 22:38 07/09/22, Erik van der Poel wrote:
> >One way to estimate that percentage might be to compare the http or
> >meta charset with an encoding detector's result.
>
> Yes. Do you have any way to do that (of course that would be done
> just on a careful sample)?

I'd like to try this at some point, but it may not be very soon...

> A very typical case would be XML Signature. According to the spec,
> you can sign e.g. an XML document in Shift_JIS, but it's done
> by conversion to UTF-8. If the conversion isn't the same when
> the signature is checked, the signature won't match anymore even
> if it's actually correct.
>
> Cases like these are quite different from the usual browsing case,
> where an odd wrong character may just be overlooked.

Interesting. I guess the charset name is just one part of the problem.
Implementations would have to agree on their Unicode mapping tables
too.

Erik
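
[Editorial sketch] A small illustration of why the signature case is sensitive to mapping tables. It is not XML Signature itself (no canonicalization, no keys); a SHA-256 digest over the UTF-8 conversion stands in for the signed value. The byte pair used is one that CPython's shift_jis and cp932 codecs are assumed to map to different Unicode code points.

```python
import hashlib

# One double-byte character that the two conversion tables treat differently.
RAW = b"\x81\x60"


def utf8_digest(raw, charset):
    """Decode raw bytes with the given charset, then hash the UTF-8 form."""
    text = raw.decode(charset)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The signer converts with one table, the verifier with the other.
signer_digest = utf8_digest(RAW, "shift_jis")
verifier_digest = utf8_digest(RAW, "cp932")
```

If the two digests differ, a check would fail even though neither side altered the document, which is the failure mode described above.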
Frank Ellermann
2007-09-21 14:57:23 UTC
Permalink
Martin Duerst wrote:

> XML spoils things.

That's bad. There are quite a lot registrations with names
in the form IBM00858 or IBM01140. I'd guess that nobody
will use these names with leading zeros, and sticks to the
whatever+euro alias, e.g. pc-multilingual-850+euro.

On the platforms where this charset is used its local name
is "codepage 850", the preferred MIME name should include
850, not the obscure 00858 or 858.

>| EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*

Ugh, that's really bad.

> Now there would be three ways ahead:
> - Ignore XML. I don't think we want to go there.

True.

> - Try to change XML. A few years ago, that would have been
> easy with an erratum, but I don't think this will be met
> with cheers these days.

Interoperability is more important. They have already fixed
xml:lang to allow empty values, they should fix SystemLiteral
to be a valid URI, and while they're at it maybe updating the
EncName is a better option than...

> - Choose a separator different from '+'. After quite a bit of
> thinking, I have reached the conclusion that the obvious
> thing to do would be to use something like '--'.

...registering new aliases as preferred MIME names for the
various existing whatever+euro entries.

> What does everybody think?

Either fix XML or the registry for cases like ...850+euro.

Frank
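
[Editorial sketch] The conflict between the two grammars quoted in this thread can be checked mechanically. The regular expressions below transcribe the XML EncName production [81] and the RFC 2978 mime-charset rule as quoted here; the helper names are hypothetical.

```python
import re

# XML 1.0 production [81]: EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
XML_ENCNAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")

# RFC 2978: mime-charset = 1*mime-charset-chars
MIME_CHARSET = re.compile(r"[A-Za-z0-9!#$%&'+\-^_`{}~]+")


def is_xml_encname(label):
    """True if the label is usable in an XML encoding declaration."""
    return XML_ENCNAME.fullmatch(label) is not None


def is_mime_charset(label):
    """True if the label fits the RFC 2978 mime-charset syntax."""
    return MIME_CHARSET.fullmatch(label) is not None
```

Under these rules a '+' label such as pc-multilingual-850+euro is a valid mime-charset but not a valid EncName, while a '--' label such as Shift_JIS--cp932 satisfies both.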
McDonald Ira
2007-09-23 13:45:42 UTC
Permalink
Hi Martin,

I agree that a separator like "--" might be best.

I'm mostly invisible this week at two printing industry
conferences in Montreal, but I'll try to catch up next
week.

Cheers,
- Ira


-----Original Message-----
From: Martin Duerst [mailto:***@it.aoyama.ac.jp]
Sent: Fri 9/21/2007 5:46 AM
To: McDonald, Ira; Erik van der Poel; Shawn Steele
Cc: ietf-***@mail.apps.ietf.org
Subject: Indicating charset variants (was: RE: windows 936)

Hello Ira, others,

This is a long overdue reply.

At 02:07 07/05/23, McDonald, Ira wrote:
>Hi,
>
>Now that's an interesting idea, Martin.

Thanks!

>And "+" _is_
>legal in charset names, per the following quote from
>page 4 of RFC 2978:
>
> Finally, charsets being registered for use with the "text" media type
> MUST have a primary name that conforms to the more restrictive syntax
> of the charset field in MIME encoded-words [RFC-2047, RFC-2184] and
> MIME extended parameter values [RFC-2184]. A combined ABNF
> definition for such names is as follows:
>
> mime-charset = 1*mime-charset-chars
> mime-charset-chars = ALPHA / DIGIT /
> "!" / "#" / "$" / "%" / "&" /
> "'" / "+" / "-" / "^" / "_" /
> "`" / "{" / "}" / "~"
> ALPHA = "A".."Z" ; Case insensitive ASCII Letter
> DIGIT = "0".."9" ; Numeric digit

Yes, but that may not be good enough. XML spoils things.
The relevant production in the XML Recommendation doesn't
allow '+'. From http://www.w3.org/TR/REC-xml/#charencoding:

[80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*

Now there would be three ways ahead:
- Ignore XML. I don't think we want to go there.
- Try to change XML. A few years ago, that would have been
easy with an erratum, but I don't think this will be met
with cheers these days.
- Choose a separator different from '+'. After quite a bit of
thinking, I have reached the conclusion that the obvious
thing to do would be to use something like '--'.

What does everybody think?

Regards, Martin.



>Looking at the latest posted IANA Charset Registry
>plaintext, there are a few uses of "+" (for "+euro")
>in aliases (but never base names), but it's pretty
>rare. See:
>
> http://www.iana.org/assignments/character-sets
>
>Cheers,
>- Ira
>
>Ira McDonald (Musician / Software Architect)
>Chair - Linux Foundation Open Printing WG
>Blue Roof Music / High North Inc
>PO Box 221 Grand Marais, MI 49839
>phone: +1-906-494-2434
>email: ***@sharplabs.com
>
>-----Original Message-----
>From: Martin Duerst [mailto:***@it.aoyama.ac.jp]
>Sent: Monday, May 21, 2007 9:58 PM
>To: Erik van der Poel; Shawn Steele
>Cc: ietf-***@mail.apps.ietf.org
>Subject: Re: windows 936
>
>
>Dealing with minor differences and variants as notes is definitely
>one possibility. However, I think sooner rather than later, we
>should look at a more systematic way of indicating variants
>and extensions.
>
>Here is an extremely rough strawman:
>
>a) Identify a character that's okay in charset tags but rarely
> used (e.g. '+', don't even know whether that's okay)
>b) Use this character to separate base tag and variants, e.g.
> base tag: Shift_jis
> tag with variant: Shift_jis+cp932
>
>Shift_jis would only indicate that this is some kind of shift_jis.
>Applications that don't care too much about variants would just
>use this. Shift_jis+cp932 indicates the variant with the Microsoft
>additions. Applications on the receiving end not interested in
>variants would have to cut off trailing '+' and what's after.
>
>The above proposal isn't without problems, but addresses the
>second most fundamental problem in the current scheme.
>
>(The first most fundamental problem is that stuff is often
>tagged wrongly. But that's a much harder problem than the variants.)
>
>Regards, Martin.
>
>At 10:51 07/05/22, Erik van der Poel wrote:
>>Most of the Windows code pages are "supersets" of other standard sets.
>>But rather than adding new charset names for these supersets, it might
>>be better to add comments to the existing registrations to point out
>>the relationships between the various sets.
>>
>>For example, the windows-936 registration might refer to the gb2312
>>one, the windows-31j registration might refer to Windows Code Page 932
>>and the Shift_JIS registration, the EUC-KR registration might refer to
>>CP 949 and the Big5 registration to CP 950. All as informative
>>references, rather than normative, I think.
>>
>>This promotes interoperability while avoiding the addition of more
>>names and "virtual" aliases.
>>
>>Erik
>>
>>On 5/21/07, Shawn Steele <***@microsoft.com> wrote:
>>>
>>>
>>>
>>>
>>> I am looking at the registrations for the remaining 4 "system" code pages:
>>> 932, 936, 949 & 950. This seems complicated since IE uses other names for
>>> them.
>>>
>>>
>>>
>>> For example, for 936, IE recognizes Chinese, CN-GB, csGB2312, csGB231280,
>>> csISO58GB231280, GB2312, GB2312-80, GB231280, GBK, GB_2312_80, iso-ir-58,
>>> and, of course its known to the system as 936.
>>>
>>>
>>>
>>> Our APIs report this code page as being "gb2312"
>>>
>>>
>>>
>>> There is an existing registration for GBK, aliases of CP936, MS936 and
>>> windows-936, but not of the gb2312 name. The existing registration points
>>> to broken links at Microsoft and IBM. This should probably be updated to
>>> point to:
>>>
>>>
>>>
>>> http://www.microsoft.com/globaldev/reference/dbcs/936.mspx
>>>
>>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
>>> and
>>>
>>>
>>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit936.txt
>>>
>>>
>>>
>>> I am a bit uncertain that GBK == 936, although this is what the existing
>>> registration implies.
>>>
>>>
>>>
>>> The alternative solution would seem to be to register a new charset as
>>> "windows-936" with the same additional aliases as the GBK registration and
>>> point to the above tables. This would then also lead to the question of
>>> whether GBK and gb2312 should be listed as aliases for any such windows-936
>>> code page although the interpretation of those aliases could differ for
>>> other systems.
>>>
>>>
>>>
>>> My goal is to clarify the Microsoft system code page mappings such as for
>>> 932, 936, 949 & 950, and I'd appreciate any suggestions about how to best do
>>> that :-)
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> - Shawn
>>>
>>>
>>>
>>> Shawn Steele
>>>
>>> SDE
>>>
>>> Windows International
>>>
>>> Microsoft
>>>
>>>
>
>
>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
>
>
>No virus found in this outgoing message.
>Checked by AVG Free Edition.
>Version: 7.5.467 / Virus Database: 269.7.6/814 - Release Date: 5/21/2007 2:01 PM
>


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Shawn Steele
2007-09-24 18:54:16 UTC
Permalink
>> I agree that a separator like "--" might be best.

I'm concerned about the addition of another layer of names that's incompatible with the current set of names. The concept of adding a variation would require changes to all client parsers, and that just isn't a realistic expectation. So the net result would be that a few applications emitting variations would create data unrecognizable by the majority of current software.

If there was to be such a huge breaking change in the naming of code pages, the effort in updating the software and legacy data sets would be much better used to migrate to Unicode than to migrate to a different form of the existing code pages.

My thinking is that for the registrations for code pages like 936, it would probably be worth stating that variations of the official code page exist and cause different interpretations on some systems.

- Shawn
Martin Duerst
2007-09-25 01:55:42 UTC
Permalink
At 03:54 07/09/25, Shawn Steele wrote:
>>> I agree that a separator like "--" might be best.
>
>I'm concerned about the addition of another layer of names that's incompatible with the current set of names. The concept of adding a variation would require changes to all client parsers, and that just isn't a realistic expectation. So the net result would be that a few applications emitting variations would create data unrecognizable by the majority of current software.

It is very clear that having every page suddenly come with a label
with some --variant information attached isn't an option.

Everybody would still be allowed to use a label without a variant.
Most people would choose that, because that's what's currently
supported. But some applications, and some data, where it really
matters, could be more precise.


>If there was to be such a huge breaking change in the naming of code pages, the effort in updating the software and legacy data sets would be much better used to migrate to Unicode than to migrate to a different form of the existing code pages.

That's an extremely valid point. However, the two issues are quite
interrelated. For people and applications where exact conversion
is important, correctly and precisely labeling an encoding variant
can be the first step to migrating to Unicode.


>My thinking is that for the registrations for code pages like 936, it would probably be worth stating that variations of the official code page exist and cause different interpretations on some systems.

That would definitely be very valuable information.

Regards, Martin.




#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Shawn Steele
2007-09-25 04:22:16 UTC
Permalink
> Everybody would still be allowed to use a label without a variant.
> Most people would choose that, because that's what's currently
> supported. But some applications, and some data, where it really
> matters, could be more precise.

Yup. Unicode offers that precision :) You lose that precision though if it ends up going to a client that doesn't recognize the extension and doesn't know how to parse it. Then it becomes very imprecise.

> For people and applications where exact conversion
> is important, correctly and precisely labeling an encoding variant
> can be the first step to migrating to Unicode.

I can see that, but then I'd suggest another tag if possible, so that it didn't conflict with existing software. For example, if I wanted to port all of my invoices I wouldn't want a legacy reporting tool to break because a different label was used. Perhaps that information could be provided in meta data though.

- Shawn
McDonald Ira
2007-10-03 21:21:19 UTC
Permalink
Hi,

Shawn's made a good point about breakage of existing software.

Adding 'variant' suffixes with a new separator to charset tags
suffers the same issues that adding 'script' subtags (as infixes,
groan...) to natural language tags yields - "simple" matching
is no longer so simple.

I think the right answer is a new piece of HTML/XML/whatever
metadata, a "charset-variant" element, which would be harmless
in any reasonably well-written existing software.

Cheers,
- Ira

Ira McDonald (Musician / Software Architect)
Chair - Linux Foundation Open Printing WG
Blue Roof Music / High North Inc
PO Box 221 Grand Marais, MI 49839
phone: +1-906-494-2434
email: ***@sharplabs.com
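
[Editorial sketch] One way to picture the separate-metadata idea above for HTML: the variant travels in its own meta element, so the charset label itself stays unchanged and a reader that ignores unknown meta elements is unaffected. The element name "charset-variant" and its name/content form are hypothetical, taken only from the suggestion in this message; nothing here is a registered or standardized mechanism.

```python
from html.parser import HTMLParser


class CharsetMetaParser(HTMLParser):
    """Collect the charset label and an optional, hypothetical variant hint."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.variant = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if "charset" in attributes:
            # Existing readers only look at this label.
            self.charset = attributes["charset"]
        elif attributes.get("name") == "charset-variant":
            # Extra hint for readers that care about exact conversion.
            self.variant = attributes.get("content")


def read_charset_metadata(html):
    """Return (charset, variant); variant is None when no hint is present."""
    parser = CharsetMetaParser()
    parser.feed(html)
    parser.close()
    return parser.charset, parser.variant
```

For example, a page carrying both a Shift_JIS charset declaration and a cp932 hint would yield both values, while a page without the hint yields the charset and None.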

-----Original Message-----
From: Shawn Steele [mailto:***@microsoft.com]
Sent: Monday, September 24, 2007 11:22 PM
To: Martin Duerst; McDonald, Ira; Erik van der Poel
Cc: ietf-***@mail.apps.ietf.org
Subject: RE: Indicating charset variants (was: RE: windows 936)


> Everybody would still be allowed to use a label without a variant.
> Most people would choose that, because that's what's currently
> supported. But some applications, and some data, where it really
> matters, could be more precise.

Yup. Unicode offers that precision :) You lose that precision though if it ends up going to a client that doesn't recognize the extension and doesn't know how to parse it. Then it becomes very imprecise.

> For people and applications where exact conversion
> is important, correctly and precisely labeling an encoding variant
> can be the first step to migrating to Unicode.

I can see that, but then I'd suggest another tag if possible, so that it didn't conflict with existing software. For example, if I wanted to port all of my invoices I wouldn't want a legacy reporting tool to break because a different label was used. Perhaps that information could be provided in meta data though.

- Shawn

Shawn Steele
2010-10-25 23:02:21 UTC
Permalink
I was asked about this again for a couple of code pages. Eg: how could I clarify cp950 as a variant of GBK? Similarly shift_jis and Windows-31J.

Thanks,
Shawn

-----Original Message-----
From: McDonald, Ira [mailto:***@sharplabs.com]
Sent: Tuesday, May 22, 2007 10:07 AM
To: Martin Duerst; Erik van der Poel; Shawn Steele
Cc: ietf-***@mail.apps.ietf.org
Subject: RE: windows 936

Hi,

Now that's an interesting idea, Martin. And "+" _is_ legal in charset names, per the following quote from page 4 of RFC 2978:

Finally, charsets being registered for use with the "text" media type
MUST have a primary name that conforms to the more restrictive syntax
of the charset field in MIME encoded-words [RFC-2047, RFC-2184] and
MIME extended parameter values [RFC-2184]. A combined ABNF
definition for such names is as follows:

mime-charset = 1*mime-charset-chars
mime-charset-chars = ALPHA / DIGIT /
"!" / "#" / "$" / "%" / "&" /
"'" / "+" / "-" / "^" / "_" /
"`" / "{" / "}" / "~"
ALPHA = "A".."Z" ; Case insensitive ASCII Letter
DIGIT = "0".."9" ; Numeric digit


Looking at the latest posted IANA Charset Registry plaintext, there are a few uses of "+" (for "+euro") in aliases (but never base names), but it's pretty rare. See:

http://www.iana.org/assignments/character-sets

Cheers,
- Ira

Ira McDonald (Musician / Software Architect) Chair - Linux Foundation Open Printing WG Blue Roof Music / High North Inc PO Box 221 Grand Marais, MI 49839
phone: +1-906-494-2434
email: ***@sharplabs.com

-----Original Message-----
From: Martin Duerst [mailto:***@it.aoyama.ac.jp]
Sent: Monday, May 21, 2007 9:58 PM
To: Erik van der Poel; Shawn Steele
Cc: ietf-***@mail.apps.ietf.org
Subject: Re: windows 936


Dealing with minor differences and variants as notes is definitely one possibility. However, I think sooner rather than later, we should look at a more systematic way of indicating variants and extensions.

Here is an extremely rough strawman:

a) Identify a character that's okay in charset tags but rarely
used (e.g. '+', don't even know whether that's okay)
b) Use this character to separate base tag and variants, e.g.
base tag: Shift_jis
tag with variant: Shift_jis+cp932

Shift_jis would only indicate that this is some kind of shift_jis.
Applications that don't care too much about variants would just use this. Shift_jis+cp932 indicates the variant with the Microsoft additions. Applications on the receiving end not interested in variants would have to cut off trailing '+' and what's after.


The above proposal isn't without problems, but addresses the second most fundamental problem in the current scheme.

(The first most fundamental problem is that stuff is often tagged wrongly. But that's a much harder problem than the variants.)

Regards, Martin.

At 10:51 07/05/22, Erik van der Poel wrote:
>Most of the Windows code pages are "supersets" of other standard sets.
>But rather than adding new charset names for these supersets, it might
>be better to add comments to the existing registrations to point out
>the relationships between the various sets.
>
>For example, the windows-936 registration might refer to the gb2312
>one, the windows-31j registration might refer to Windows Code Page 932
>and the Shift_JIS registration, the EUC-KR registration might refer to
>CP 949 and the Big5 registration to CP 950. All as informative
>references, rather than normative, I think.
>
>This promotes interoperability while avoiding the addition of more
>names and "virtual" aliases.
>
>Erik
>
>On 5/21/07, Shawn Steele <***@microsoft.com> wrote:
>>
>>
>>
>>
>> I am looking at the registrations for the remaining 4 "system" code pages:
>> 932, 936, 949 & 950. This seems complicated since IE uses other
>> names for them.
>>
>>
>>
>> For example, for 936, IE recognizes Chinese, CN-GB, csGB2312,
>> csGB231280, csISO58GB231280, GB2312, GB2312-80, GB231280, GBK,
>> GB_2312_80, iso-ir-58, and, of course its known to the system as 936.
>>
>>
>>
>> Our APIs report this code page as being "gb2312"
>>
>>
>>
>> There is an existing registration for GBK, aliases of CP936, MS936
>> and windows-936, but not of the gb2312 name. The existing
>> registration points to broken links at Microsoft and IBM. This
>> should probably be updated to point to:
>>
>>
>>
>> http://www.microsoft.com/globaldev/reference/dbcs/936.mspx
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
>> and
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit936.txt
>>
>>
>>
>> I am a bit uncertain that GBK == 936, although this is what the
>> existing registration implies.
>>
>>
>>
>> The alternative solution would seem to be to register a new charset
>> as "windows-936" with the same additional aliases as the GBK
>> registration and point to the above tables. This would then also
>> lead to the question of whether GBK and gb2312 should be listed as
>> aliases for any such windows-936 code page although the
>> interpretation of those aliases could differ for other systems.
>>
>>
>>
>> My goal is to clarify the Microsoft system code page mappings such as
>> for 932, 936, 949 & 950, and I'd appreciate any suggestions about how
>> to best do that :-)
>>
>>
>>
>> Thanks,
>>
>>
>>
>> - Shawn
>>
>>
>>
>> Shawn Steele
>>
>> SDE
>>
>> Windows International
>>
>> Microsoft
>>
>>


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp


Ventasriverstorecomar
2007-05-22 03:34:26 UTC
Permalink
Please remove from your list.

Thanks

----- Original Message -----
From: "Erik van der Poel" <***@google.com>
To: "Shawn Steele" <***@microsoft.com>
Cc: <ietf-***@mail.apps.ietf.org>
Sent: Monday, May 21, 2007 10:51 PM
Subject: Re: windows 936


> Most of the Windows code pages are "supersets" of other standard sets.
> But rather than adding new charset names for these supersets, it might
> be better to add comments to the existing registrations to point out
> the relationships between the various sets.
>
> For example, the windows-936 registration might refer to the gb2312
> one, the windows-31j registration might refer to Windows Code Page 932
> and the Shift_JIS registration, the EUC-KR registration might refer to
> CP 949 and the Big5 registration to CP 950. All as informative
> references, rather than normative, I think.
>
> This promotes interoperability while avoiding the addition of more
> names and "virtual" aliases.
>
> Erik
>
> On 5/21/07, Shawn Steele <***@microsoft.com> wrote:
>>
>>
>>
>>
>> I am looking at the registrations for the remaining 4 "system" code
>> pages:
>> 932, 936, 949 & 950. This seems complicated since IE uses other names
>> for
>> them.
>>
>>
>>
>> For example, for 936, IE recognizes Chinese, CN-GB, csGB2312, csGB231280,
>> csISO58GB231280, GB2312, GB2312-80, GB231280, GBK, GB_2312_80, iso-ir-58,
>> and, of course its known to the system as 936.
>>
>>
>>
>> Our APIs report this code page as being "gb2312"
>>
>>
>>
>> There is an existing registration for GBK, aliases of CP936, MS936 and
>> windows-936, but not of the gb2312 name. The existing registration
>> points
>> to broken links at Microsoft and IBM. This should probably be updated to
>> point to:
>>
>>
>>
>> http://www.microsoft.com/globaldev/reference/dbcs/936.mspx
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
>> and
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit936.txt
>>
>>
>>
>> I am a bit uncertain that GBK == 936, although this is what the existing
>> registration implies.
>>
>>
>>
>> The alternative solution would seem to be to register a new charset as
>> "windows-936" with the same additional aliases as the GBK registration
>> and
>> point to the above tables. This would then also lead to the question of
>> whether GBK and gb2312 should be listed as aliases for any such
>> windows-936
>> code page although the interpretation of those aliases could differ for
>> other systems.
>>
>>
>>
>> My goal is to clarify the Microsoft system code page mappings such as for
>> 932, 936, 949 & 950, and I'd appreciate any suggestions about how to best
>> do
>> that :-)
>>
>>
>>
>> Thanks,
>>
>>
>>
>> - Shawn
>>
>>
>>
>> Shawn Steele
>>
>> SDE
>>
>> Windows International
>>
>> Microsoft
>>
>>
>
>
>