Discussion:
My draft for windows 1252
Shawn Steele
2006-11-07 23:23:02 UTC
Erik has been creating drafts for these, but indicated that he'd really prefer someone from Microsoft do it :)

So here's my draft for the windows-1252 registration, based on Erik & Mike's previous draft. If this is acceptable I'll do the same to the rest of the code pages.

Thanks,

- Shawn

--------------------------

Charset name: windows-1252

Charset aliases: (None)

Suitability for use in MIME text:

Yes, windows-1252 is suitable for use with subtypes of the "text"
Content-Type. Note that windows-1252 is an 8-bit charset. Care should
be taken to choose an appropriate Content-Transfer-Encoding.

Published specification(s):

1) http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

ISO 10646 equivalency table:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Additional information:

Although not authoritative, the following references may also be of
interest:

Printed mapping table:
Dr. International "Developing International Software, Second Edition",
Microsoft Press, ISBN 0-7356-1583-7, 2003, p. 743-747

Microsoft windows extended "best fit" behavior:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

This is an update of an existing registration of this charset. This
charset name is in use.

This charset is also known as Windows Code Page 1252 or cp1252 for
short; these are NOT aliases.

The graphic (non-control) characters of Windows-1252 are a superset of
the graphic characters of the ISO-8859-1 charset. See the range 80 to
9F (hex).

Person & email address to contact for further information:

Shawn Steele
Email: ***@microsoft.com

Microsoft Corporation
One Microsoft Way,
Redmond, WA 98052
U.S.A.
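
The superset statement in the draft can be checked mechanically. The following is only a sketch, assuming Python's built-in "windows-1252" and "iso-8859-1" codecs are an acceptable stand-in for the mapping tables referenced above; the function name compare_upper_half is made up for illustration. Python's codec treats a few bytes in the 80 to 9F range as undefined and raises an error for them, so they are collected separately.

```python
def compare_upper_half():
    """Classify bytes 0x80..0xFF by how the two codecs decode them."""
    same, differ, undefined = [], [], []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        latin1_char = raw.decode("iso-8859-1")  # defined for every byte
        try:
            win_char = raw.decode("windows-1252")
        except UnicodeDecodeError:
            # This codec has no mapping for the byte at all.
            undefined.append(value)
            continue
        if win_char == latin1_char:
            same.append(value)
        else:
            differ.append(value)
    return same, differ, undefined


same, differ, undefined = compare_upper_half()
print(f"identical: {len(same)} bytes, {same[0]:#x}..{same[-1]:#x}")
print(f"different: {len(differ)} bytes, all inside 0x80..0x9f")
print("undefined in this codec:", [hex(v) for v in undefined])
```

Bytes A0 to FF decode identically in both charsets; only the 80 to 9F range, where ISO-8859-1 has control characters, carries the additional graphic characters.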
Kent Karlsson
2006-11-08 08:21:56 UTC
Post by Shawn Steele
This charset is also known as Windows Code Page 1252 or cp1252 for
short; these are NOT aliases.
Well, the first one cannot be an alias since it contains spaces. The second
one is in practice an alias, though not from MS's point of view. And the
(currently auto-generated, but not listed in the charset table) cs alias should
be listed too (I don't oppose having a separate "alias-kind" for them, if possible).

I don't like aliases either, especially not when the "preferred" (and systematic)
name is already given. But these are pre-existing aliases.

/kent k
Shawn Steele
2006-11-08 21:47:25 UTC
Post by Shawn Steele
This charset is also known as Windows Code Page 1252 or cp1252 for
short; these are NOT aliases.
Frank mentioned that 1252 could be an alias. I had considered including "1252", however it is usually used as an integer, not a string, so I didn't know if it was appropriate for inclusion as an alias. Any opinions either way?

Regarding the cswindows1252 name, I'll defer to the rest of you as to whether it should be listed as an alias or not. It sort of seems to me like it should have its own entry like the MIBenum field.

I'm fine adding any aliases used anywhere, such as on non-Microsoft platforms, but I don't have knowledge of other aliases besides the cp1252 value. The earlier discussion seemed to lean toward "cp1252" not being a true alias, but I would reconsider if th
Frank Ellermann
2006-11-08 15:04:16 UTC
Post by Kent Karlsson
I don't like aliases either, especially not when the "preferred" (and
systematic) name is already given. But these are pre-existing aliases.
The cs- name should certainly be listed as for all other charsets, it's
"obvious" (I hope) that these cs- aliases aren't what's used in MIME or
XML. For cp1252, are you sure that it's actually used anywhere ? Not
counting file names of ICU mappings or similar cases, is there a real
application talking with other applications about "cp1252" ?

Sorry for the stupid question, but on my box I'd use a command like
"chcp 1004" or the system function below it, without any "cp" prefix.

Based on that we might need an alias "1252", not "cp1252". Or we just
ignore "1252" / "cp1252" because it's obvious. I certainly don't need
a "1004" or "cp1004" alias. A comment could make sense, but in practice
the still existing OS/2 users have that already figured out - or if not
they probably won't look into the IANA charset registry for more info.

Frank
Markus Scherer
2006-11-15 17:45:25 UTC
Apologies if this was already widely discussed in the previous thread -

On names like "cp1252": These are ambiguous, and I recommend against
adding them as aliases unless they are commonly used for a particular
charset. They are ambiguous because IBM started using 16-bit integers
for "code pages" a long time ago, and Microsoft adopted this practice
and a set of such integers for DOS and Windows. As such, you will find
wide usage of "cp1252" to mean either IBM's or Microsoft's idea of
that code page, which will usually differ. One could argue that, on a
per-integer basis, the company that "invented" a given code page is
"right" (e.g., Microsoft for 1252, IBM for 850), but I think that may
just increase confusion.
Post by Frank Ellermann
For cp1252, are you sure that it's actually used anywhere ?
I believe that Java uses "cp1252" and similar names. I am not sure
whether they use the IBM or the Microsoft interpretations of such code
pages, and I believe that even differs in at least some cases between
Sun's JDK and IBM's JDK. I also don't know if Java applications
commonly use such aliases in protocols.
Post by Frank Ellermann
Not counting file names of ICU mappings or similar cases, is there a real
application talking with other applications about "cp1252" ?
As for ICU, once we discovered the ambiguous use of "cp" prefixes, we
became more consistent, for internal identification of mapping tables,
with using "ibm-" prefixes for IBM CCSID integers and "windows-"
prefixes for Microsoft code page integers. (ICU uses strings while IBM
and Microsoft use integers. This was our way of bridging that gap.)
With ICU, "cp" names are used only where someone else (like Java)
recognizes them.
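
The prefixing idea can be sketched in a few lines. This is not ICU's code; the PREFIXES table and the table_name function are hypothetical names used only to show how a vendor prefix turns one ambiguous integer into two distinct string identifiers.

```python
# Hypothetical vendor-to-prefix table, modelled on the "ibm-" and
# "windows-" prefixes described above.
PREFIXES = {"ibm": "ibm-", "microsoft": "windows-"}


def table_name(vendor, number):
    """Build an unambiguous string identifier from a vendor and an integer."""
    if not isinstance(number, int):
        raise TypeError("code page numbers are integers, not strings")
    return f"{PREFIXES[vendor.lower()]}{number}"


# The same integer yields two different names, so nothing has to guess
# which vendor's idea of "cp1252" is meant.
print(table_name("Microsoft", 1252))
print(table_name("IBM", 1252))
```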
Post by Frank Ellermann
Sorry for the stupid question, but on my box I'd use a command like
"chcp 1004" or the system function below it, without any "cp" prefix.
Based on that we might need an alias "1252", not "cp1252".
I disagree. On a DOS or Windows system, the 1004 in "chcp 1004" gets
parsed into an integer, which is a different beast from what the IANA
charsets list deals with. I don't think that decimal-digit-string
representations of such integers should be added as aliases unless
they are otherwise in common use as strings. I have not seen that as
widespread practice.
Post by Frank Ellermann
Or we just
ignore "1252" / "cp1252" because it's obvious. I certainly don't need
a "1004" or "cp1004" alias.
I don't think "obvious" is an argument one way or another. In my
opinion, the listed aliases should reflect common industry practice.
If anything, _remove_ any aliases that are controversial or ambiguous
or otherwise undesirable. Please don't _add_ aliases that are not
already in common use.

Best regards,
markus
--
Opinions expressed here may not reflect my company's positions unless
otherwise noted.
Shawn Steele
2006-11-15 23:52:55 UTC
AFAIK the overlap between "Microsoft" and "IBM" numbers are more like
variations of the same language than completely different code pages
(http://www.unicode.org/Public/MAPPINGS/VENDORS/IBM/readme.txt has a
list of differences).

I've seen similar variations of other code pages that respond to the
same alias(es) with subtly different results between vendors, so I'm not
sure that the variation in implementation invalidates the alias.

In this case it seems like cp1252 is sometimes used to describe
windows-1252 (and maybe also the IBM version), so that fits my
expectation of an alias, even if the exact target is ambiguous.
Similarly 1252 is often processed internally as an integer, however it
can also appear in text (although I wouldn't expect it in MIME or http
content-types.)

So does the IETF charset assignments list only include aliases used
with Internet protocols? It appears that some software uses cp1252 and
1252 as aliases, but no cases have been mentioned where they are used
with an Internet protocol.

Personally I don't care much either way, but it seems safer to me to err
on the side of including aliases if we think that they might be used,
which seems the opposite of Markus's position :)

- Shawn
Mark Davis
2006-11-16 03:02:36 UTC
An example of the problem is where component A thinks that cp936 means one
thing, and system B thinks it means another. That can lead to customer data
corruption. A takes some data in cp936 out of a database to Unicode, then
sends it (via perhaps very circuitous routes) to B, which converts back to
cp936 and stuffs it into another database. However, since one of the
characters doesn't convert (since the conversion tables disagree), a
character gets trashed, and the data is corrupted.
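
A minimal sketch of that failure mode follows. It does not use the real cp936 tables; it assumes Python's "cp1252" and "iso-8859-1" codecs as stand-ins for two mapping tables that disagree about one label, and the round_trip function is a hypothetical helper.

```python
def round_trip(data, table_a, table_b):
    """Decode with component A's table, re-encode with system B's table."""
    text = data.decode(table_a)
    # B cannot represent every character A produced, so it substitutes.
    return text.encode(table_b, errors="replace")


original = b"price: \x80 5"  # 0x80 is the euro sign under cp1252
stored = round_trip(original, "cp1252", "iso-8859-1")

print(original)
print(stored)
print("corrupted:", stored != original)
```

The byte that the two tables disagree about comes back as a substitution character, and nothing in the second database records that anything was lost.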

The way the IANA registry has grown up, this indeterminacy in the meaning of
iana charset names and aliases causes no end of problems. In an ideal world,
each iana charset name would be associated with one and only one mapping to
Unicode/10646. If there were two different mappings, no matter how subtle
the difference, they would perforce have two different iana names. (Of
course, in an 'idealer' world, we'd have already transitioned to Unicode for
interchange, and this wouldn't matter ;-)

Mark
Erik van der Poel
2006-11-16 03:25:06 UTC
Regarding the windows-1252 update, I think we have gotten as close to
"rough consensus" as we are going to get. Can we enter Shawn's text
into the registry now?

Erik
Markus Scherer
2006-11-16 16:08:29 UTC
Post by Erik van der Poel
Regarding the windows-1252 update, I think we have gotten as close to
"rough consensus" as we are going to get. Can we enter Shawn's text
into the registry now?
+1

markus

(Now I know why Erik resent his email: The domain name for the mailing
list got longer :-)
Frank Ellermann
2006-11-16 19:02:31 UTC
+1
+1

No compelling reasons to register more aliases for this one.
And no danger that we'd copy this decision blindly to other
cases, where listing aliases could make sense.

Frank
Shawn Steele
2006-11-16 07:52:57 UTC
An example of the problem is where component A thinks that cp936 means one thing, and system B
thinks it means another.
I agree completely. Unfortunately that already seems to be common to some extent (although more of a problem with the iso-2022-xx code pages than these). I suspect with the 932/936 CJK code pages this already happens for the existing names/aliases.

The opposite problem is that some application uses cp1252 (as apparently Java can), and another application may not recognize that alias at all, in which case all of the characters effectively break.
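
Both sides of the trade-off can be sketched with two label resolvers. The REGISTERED table below is hypothetical and deliberately lists only the registered name with no aliases; the lenient resolver assumes Python's codecs.lookup, which accepts whatever names that particular library happens to know.

```python
import codecs

# Hypothetical strict table: registered name only, no aliases.
REGISTERED = {"windows-1252": "cp1252"}


def resolve_strict(label):
    """Return a codec name only for labels in the strict table."""
    return REGISTERED.get(label.lower())


def resolve_lenient(label):
    """Return a codec name for any label the local library recognizes."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


for label in ("windows-1252", "cp1252", "no-such-charset"):
    print(label, resolve_strict(label), resolve_lenient(label))
```

A sender that labels its data "cp1252" is understood by the lenient receiver and rejected by the strict one, which is the interoperability gap described above.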

I'm amenable either way.

- Shawn
Keld Jørn Simonsen
2006-11-16 13:58:13 UTC
Post by Shawn Steele
An example of the problem is where component A thinks that cp936 means one thing, and system B
thinks it means another.
I agree completely. Unfortunately that already seems to be common to some extent (although more of a problem with the iso-2022-xx code pages than these). I suspect with the 932/936 CJK code pages this already happens for the existing names/aliases.
The opposite problem is that some application uses cp1252 (as apparently Java can), and another application may not recognize that alias at all, in which case all of the characters effectively break.
As reported earlier the name cp1252 is also in use on unix-like systems in the
program recode. The coding there has slightly different semantics than under MS
Windows for the windows-1252 charset.

One question: is it the intention that windows-1252 be applicable
generally, e.g. for email, and should this then be said in the
registration?

best regards
keld
Shawn Steele
2006-11-16 18:53:57 UTC
Post by Keld Jørn Simonsen
As reported earlier the name cp1252 is also in use on unix-like systems in the
program recode. The coding there has slightly different semantics than under MS
Windows for the windows-1252 charset.
I didn't realize that the coding was different in the cases discussed before, although I'm curious if those differences are intentional or accidental (which digresses somewhat)
Post by Keld Jørn Simonsen
One question: is it the intention that windows-1252 be applicable
generally, e.g. for email, and should this then be said in the
registration?
Many mail clients use windows-1252 in many cases, but it isn't appropriately tagged as such.
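
For comparison, here is a sketch of what an appropriately tagged message could look like, assuming Python's standard email package. The build_message helper is hypothetical, and quoted-printable is just one reasonable Content-Transfer-Encoding choice for an 8-bit charset, not the only one.

```python
from email.message import EmailMessage


def build_message(text):
    """Create a text/plain message explicitly labelled as windows-1252."""
    msg = EmailMessage()
    # The charset parameter tags the body; the cte keeps the 8-bit bytes
    # safe on transports that only guarantee 7-bit data.
    msg.set_content(text, charset="windows-1252", cte="quoted-printable")
    return msg


msg = build_message("Price: 5 \u20ac")
print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
print(msg.as_string())
```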

- Shawn
Keld Jørn Simonsen
2006-11-16 23:30:30 UTC
Post by Shawn Steele
Post by Keld Jørn Simonsen
As reported earlier the name cp1252 is also in use on unix-like systems in the
program recode. The coding there has slightly different semantics than under MS
Windows for the windows-1252 charset.
I didn't realize that the coding was different in the cases discussed before, although I'm curious if those differences are intentional or accidental (which digresses somewhat)
I have not checked the actual codepoints. What I meant was that the
cp1252 encoding in recode has semantics to use the mnemonics scheme of
RFC1345 to encode other characters, actually all of 10646 via extension
mechanisms.

best regards
keld
Shawn Steele
2006-11-16 18:56:06 UTC
I guess this needs to be resent...

-----Original Message-----
From: Shawn Steele
Sent: Thursday, November 16, 2006 10:10
To: 'Markus Scherer'; Erik van der Poel
Cc: Mark Davis; Frank Ellermann; ietf-***@ietf.org
Subject: RE: My draft for windows 1252

I would be fine with that :)

- Shawn

-----Original Message-----
From: Markus Scherer [mailto:***@gmail.com]
Sent: Thursday, November 16, 2006 8:05
To: Erik van der Poel
Cc: Mark Davis; Shawn Steele; Frank Ellermann; ietf-***@ietf.org
Subject: Re: My draft for windows 1252
Post by Erik van der Poel
Regarding the windows-1252 update, I think we have gotten as close to
"rough consensus" as we are going to get. Can we enter Shawn's text
into the registry now?
+1

markus
Erik van der Poel
2006-11-25 00:06:42 UTC
Shawn,

Upon re-reading the relevant section of RFC 2978, it appears that when
the proposer believes that consensus has been reached, the
registration should be submitted to IANA and the charset reviewer. We
now have 2 reviewers, so here are the addresses:

***@iana.org
Ned Freed <***@mrochek.com>
Martin Duerst <***@it.aoyama.ac.jp>

If you feel that it's ready, please send them the windows-1252 update.

Erik
Post by Shawn Steele
So here's my draft for the windows-1252 registration, based on Erik & Mike's previous draft. If this is acceptable I'll do the same to the rest of the code pages.
This looks fine to me. I like the rearrangement of references.
Ned
Post by Shawn Steele
--------------------------
Charset name: windows-1252
Charset aliases: (None)
Yes, windows-1252 is suitable for use with subtypes of the "text"
Content-Type. Note that windows-1252 is an 8-bit charset. Care should
be taken to choose an appropriate Content-Transfer-Encoding.
1) http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
Although not authoritative, the following references may also be of
Dr. International "Developing International Software, Second Edition",
Microsoft Press, ISBN 0-7356-1583-7, 2003, p. 743-747
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
This is an update of an existing registration of this charset. This
charset name is in use.
This charset is also known as Windows Code Page 1252 or cp1252 for
short; these are NOT aliases.
The graphic (non-control) characters of Windows-1252 are a superset of
the graphic characters of the ISO-8859-1 charset. See the range 80 to
9F (hex).
Shawn Steele
Microsoft Corporation
One Microsoft Way,
Redmond, WA 98052
U.S.A.
Intended usage: COMMON
Shawn Steele
2007-05-16 19:19:08 UTC
Sorry, this was the draft that I sent last fall. We seemed to reach "rough consensus", but I didn't formally submit it.

Please register this update to windows-1252. I will follow through with the other windows code pages shortly.

Thanks,
Shawn

--------------------------

Charset name: windows-1252

Charset aliases: (None)

Suitability for use in MIME text:

Yes, windows-1252 is suitable for use with subtypes of the "text"
Content-Type. Note that windows-1252 is an 8-bit charset. Care should
be taken to choose an appropriate Content-Transfer-Encoding.

Published specification(s):

1) http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

ISO 10646 equivalency table:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Additional information:

Although not authoritative, the following references may also be of
interest:

Printed mapping table:
Dr. International "Developing International Software, Second Edition",
Microsoft Press, ISBN 0-7356-1583-7, 2003, p. 743-747

Microsoft windows extended "best fit" behavior:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

This is an update of an existing registration of this charset. This
charset name is in use.

This charset is also known as Windows Code Page 1252 or cp1252 for
short; these are NOT aliases.

The graphic (non-control) characters of Windows-1252 are a superset of
the graphic characters of the ISO-8859-1 charset. See the range 80 to
9F (hex).

Person & email address to contact for further information:

Shawn Steele
Email: ***@microsoft.com

Microsoft Corporation
One Microsoft Way,
Redmond, WA 9805
Shawn Steele
2007-05-18 19:29:21 UTC
I sent this a couple days ago but there's been no discussion :(. I see that it made it to http://mail.apps.ietf.org/ietf/charsets/maillist.html , but the lack of a response makes me wonder if it got lost elsewhere somehow.

Thanks,
Shawn

-----Original Message-----
From: Shawn Steele
Sent: Wednesday, May 16, 2007 12:19 PM
To: 'Shawn Steele'; ietf-***@mail.apps.ietf.org
Cc: Erik van der Poel
Subject: Update of windows 1252

Sorry, this was the draft that I sent last fall. We seemed to reach "rough consensus", but I didn't formally submit it.

Please register this update to windows-1252. I will follow through with the other windows code pages shortly.

Thanks,
Shawn

Frank Ellermann
2007-05-18 21:52:08 UTC
Post by Shawn Steele
the lack of a response makes me wonder if it got lost elsewhere somehow.
Archived-At: <http://permalink.gmane.org/gmane.ietf.charsets/294>

My first thought was "didn't we agree on this months ago ?", but then I
had no time to check the registry. Doing that now...

http://www.iana.org/assignments/charset-reg/windows-1252

...no, apparently it didn't make it yet. The last change was TSCII
in ftp://ftp.iana.org/assignments/charset-reg/index.htm four days ago,
http://www.iana.org/assignments/charset-reg/TSCII

Each IANA registry has its own peculiar rules; here it is RFC 2978, section 3.2:

| When the two week period has passed and the registration proposer
| is convinced that consensus has been achieved, the registration
| application should be submitted to IANA and the charset reviewer.

As you wrote in your mail two days ago, you forgot to formally submit
it, with copies to IANA, Martin, and Ned. Martin + Ned might be
redundant, since they're supposed to read the list. But from IANA you'd
get a ticket number, which is good if you run into procedural troubles.
Ignoring these details, there are now two weeks for Martin + Ned to
decide it:

| The charset reviewer must reach a decision and post it to the ietf-
| charsets mailing list within two weeks. Decisions made by the
| reviewer may be appealed to the IESG.

This split review + registration request model here appears to be
better than the RFC 4646 procedure... Unless the proponents forget
the actual registration request ;-)

Frank
Shawn Steele
2007-05-18 23:00:11 UTC
Thanks for the clarification :)
Post by Frank Ellermann
| When the two week period has passed and the registration proposer
| is convinced that consensus has been achieved, the registration
| application should be submitted to IANA and the charset reviewer.
It is unclear to me from the RFC where the IANA submission should go.
Is that ietf-***@iana.org as it implies elsewhere in the RFC, or
is this list sufficient?

There are several more to follow so I want to make sure I get the
process right :)

Thanks,

- Shawn
Ned Freed
2007-05-19 00:43:15 UTC
Post by Shawn Steele
Thanks for the clarification :)
Post by Frank Ellermann
| When the two week period has passed and the registration proposer
| is convinced that consensus has been achieved, the registration
| application should be submitted to IANA and the charset reviewer.
It is unclear to me from the RFC where the IANA submission should go.
Is that ietf-***@iana.org as it implies elsewhere in the RFC, or
is this list sufficient?
***@iana.org is the generic submission address. Some submission processes
have specialized addresses but this one doesn't AFAIK.

Ned
K Kalyanasundaram
2007-05-19 06:12:51 UTC
Dear Shawn:

For TSCII recently we sent the formal request to the address
<iana AT iana.org> as indicated by Ned Freed. However there were
some problems.

The IANA server was supposed to generate a ticket number for each
submission automatically. We did not get any confirmation with
the ticket number. When I queried this, Michelle Cotton
of IANA was kind enough to follow it through to get a ticket
number. Only then does the internal processing start within IANA.

So if you do not get a ticket number within a few days of
submission, you may contact her at <michelle.cotton AT icann.org>

I take this opportunity to thank the two reviewers and participants
of this list for all their help in the registration of TSCII.

Kalyan
(K. Kalyanasundaram)
Frank Ellermann
2007-05-19 12:15:20 UTC
Post by Shawn Steele
It is unclear to me from the RFC where the IANA submission should go.
Is that ietf-***@iana.org as it implies elsewhere in the RFC, or
is this list sufficient?
Ned's proposal ***@iana works as far as any mail still works today,
it's documented on the IANA page below "application forms" -> "other
numbers/registrations" -> "miscellaneous" -> mailto:***@iana

I like "application forms" -> "generic requests for assignments" better;
it's a Web form that lets you create the registration request on their
site (behind spam filters): http://www.iana.org/cgi-bin/assignments.pl

I tried "EXPN ietf-***@iana.org", but as always EXPN is disabled.
Probably ietf-***@iana is just an alias of this (= Ned's) list,
i.e. a "canonical name" independent of the actual charsets list hoster.

Frank
