Discussion:
Update of charset windows-1252
Erik van der Poel
2006-10-21 04:44:19 UTC
I would like to suggest something like this for windows-1252.

Erik

---------------------------

Charset name: windows-1252

Charset aliases: (None)

Suitability for use in MIME text: Yes

Published specification(s):

1) Dr. International "Developing International Software, Second Edition",
Microsoft Press, ISBN 0-7356-1583-7, 2003, p. 743-747

2) http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

ISO 10646 equivalency table:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt

Additional information:

This is an update of an existing registration of this charset. This
charset name is in use.

Older versions of this charset have been registered as
ISO-8859-1-Windows-3.0-Latin-1
and ISO-8859-1-Windows-3.1-Latin-1.

The graphic (non-control) characters of Windows-1252 are a superset of
the graphic characters of the ISO-8859-1 charset. See the range 80 to
9F (hex).

Person & email address to contact for further information:

Mike Ksar
Email: ***@microsoft.com

Microsoft Corporation
One Microsoft Way,
Redmond, WA 98052
U.S.A.

Intended usage: COMMON
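
The remark about the range 80 to 9F (hex) can be checked with a short sketch. It assumes Python's built-in "cp1252" and "latin-1" codecs as stand-ins for the two charsets; those codec tables are not part of the registration, and Python's cp1252 leaves a few bytes in this range unassigned, which the helper (a name made up for this example) simply skips.

```python
def cp1252_high_range():
    """Map each byte in 0x80..0x9F to its cp1252 character.

    Bytes that Python's cp1252 codec leaves unassigned are skipped.
    """
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # unassigned in Python's table
    return table


graphic = cp1252_high_range()

# In ISO-8859-1 the same bytes decode to the C1 control code points U+0080..U+009F.
for byte, char in sorted(graphic.items()):
    latin1 = bytes([byte]).decode("latin-1")
    print(f"0x{byte:02X}: cp1252 U+{ord(char):04X}, latin-1 U+{ord(latin1):04X}")
```

Outside this range the two codecs agree byte for byte, which is why the difference is confined to 80 to 9F.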
Frank Ellermann
2006-10-21 09:41:27 UTC
Erik van der Poel wrote:

> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

Who created these new "best fit" tables? The old table...

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

...offers a Contact: cpxlate AT microsoft.com

The reason I ask is the line "CPINFO 1 0x3F 0x003F" in this table. It's
how I implemented it for "codepage 1004" (an OS/2 alias for windows-1252).

<http://purl.net/net/cp/1252> (ICU) proposes 0x1A for windows-1252.

What's the source of the 698 WCTABLE mappings? The sorting of 0x20ac
could be some artefact of 0x20a0 (I've no idea what U+20A0 really is):

[...]
0x2089 0x39 ;Subscript Nine
0x20ac 0x80 ;Euro Sign
0x20a1 0xa2 ;Colon Sign
0x20a4 0xa3 ;Lira Sign
0x20a7 0x50 ;Peseta Sign
0x2102 0x43 ;Double-Struck Capital C
[...]

Frank
Kent Karlsson
2006-10-21 10:27:48 UTC
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
>[...]
>0x2089 0x39 ;Subscript Nine
...
>0x2102 0x43 ;Double-Struck Capital C
>[...]

These, and much else in bestfit1252.txt, are fallback mappings. Fallbacks
are quite questionable, and when using some charset mapping API I would
(and do) turn the fallbacks off (under almost all circumstances). I don't
think fallbacks should be part of an IANA charset registration.

/kent k
Erik van der Poel
2006-10-21 15:30:17 UTC
I don't know who created the tables, but they were submitted by an
individual from Microsoft.

The windows-1252 iana charset update does offer a contact (Mike Ksar).

The "CPINFO 1 0x3F 0x003F" simply indicates how Microsoft's
implementation maps characters that are not in the destination charset
(0x3F) or illegally encoded (0x003F), depending on the direction (from
Unicode or to Unicode). See the readme. ICU may have chosen 0x1A, but
that was their own decision. There is no interoperability problem here
because the legal characters are fully specified.

The 698 WCTABLE mappings are from Microsoft's implementation. Many of
them are "best fit" mappings. I have confirmed that their
implementation does return these. They do have an option to turn off
the "best fit" mappings.

The mappings are sorted in a strange way. Maybe they will fix that,
but it shouldn't prevent this charset from being updated at IANA.

Regarding Kent's comment saying that best fit mappings should not be
part of an IANA registration: First, Martin says there's not enough
info, now you say there's too much! :-)

Should we strip the best fit mappings from the table and post it somewhere?

Erik
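
The 0x3F substitution described above can be illustrated with a small sketch. It uses Python's own "cp1252" codec, which is an assumption of this example and not Microsoft's implementation: Python's codec performs no best-fit mapping, so an unmappable character is either rejected or replaced. With best fit enabled, the table quoted earlier would instead turn U+2089 into 0x39.

```python
# U+2089 SUBSCRIPT NINE has no code point in windows-1252.
text = "x\u2089"

# Strict conversion refuses the unmappable character.
try:
    text.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors="replace" substitutes '?' (byte 0x3F), the same byte that the
# CPINFO line names as the default character.
replaced = text.encode("cp1252", errors="replace")

print(strict_ok)
print(replaced)
```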

Kent Karlsson
2006-10-21 20:51:11 UTC
Erik van der Poel wrote:

> The "CPINFO 1 0x3F 0x003F" simply indicates how Microsoft's
> implementation maps characters that are not in the destination charset
> (0x3F) or illegally encoded (0x003F), depending on the direction (from
> Unicode or to Unicode). See the readme. ICU may have chosen 0x1A, but

Not quite. The ICU API allows for quite a lot of "user" (i.e. user of
the API) control over how to do error substitutions.

[...]
> Regarding Kent's comment saying that best fit mappings should not be
> part of an IANA registration: First, Martin says there's not enough
> info, now you say there's too much! :-)
>
> Should we strip the best fit mappings from the table and post
> it somewhere?

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

will do just fine, and is actually more up to date than the "bestfit" file
w.r.t. character names ("ligature ae" vs. "letter ae"; this was changed a
decade ago, THE infamous change that led to the decision to never change
character names again). But both are in error w.r.t. some control
character names ("horizontal tabulation" -> "character tabulation",
"vertical tabulation" -> "line tabulation", and the "*separator" ones).

/kent k
Frank Ellermann
2006-10-22 01:40:18 UTC
Erik van der Poel wrote:

> I don't know who created the tables, but they were submitted by an
> individual from Microsoft.

For "surprising" mappings it's interesting to know how they could be
reproduced or verified, or if that's maybe only an observation with
API xyz version m.n by an "unknown" individual.

> ICU may have chosen 0x1A, but that was their own decision. There is
> no interoperability problem here

An u2w.icu( x ) != u2w.bestfit( x ) effect could be ugly. For some
code pages like <http://purl.net/net/cp/858> ICU tries hard to list
an "official" substitution character, in that case 0x7F, not 0x1A.

> The 698 WCTABLE mappings are from Microsoft's implementation.
[...]
> I have confirmed that their implementation does return these.

Thanks for info, "did anybody check this" was a part of my question.

> The mappings are sorted in a strange way. Maybe they will fix that,
> but it shouldn't prevent this charset from being updated at IANA.

Sure, that's why I've changed the subject. I wanted to know how the
new "best fit" tables were created. This "best fit" is unrelated to
IANA considerations.

> Should we strip the best fit mappings from the table and post it
> somewhere?

They're fine, but could be improved by adding a hint how they were
determined, and who could fix them if needed.

Frank
Kent Karlsson
2006-10-22 11:05:29 UTC
Frank Ellermann wrote:
> > ICU may have chosen 0x1A, but that was their own decision. There is
> > no interoperability problem here
>
> An u2w.icu( x ) != u2w.bestfit( x ) effect could be ugly. For some

As I said, the fallbacks do not belong in the registration. It should be
perfectly ok to use other fallbacks. E.g. generating higher level markup,
be it character escapes or more [like <sup>...</sup> for instance, or
<span class="red">...</span>], or some "this-is-even-better-fit".

The fallbacks ("bestfit") of the "bestfit" file should *NOT* be part of
the IANA charset registration!

> code pages like <http://purl.net/net/cp/858> ICU tries hard to list
> an "official" substitution character, in that case 0x7F, not 0x1A.

As I mentioned, the ICU API allows the programmer quite a lot of control
over how to handle conversion errors. One can set it up to automatically
generate XML-ish or Java-ish escapes (which I prefer, even if not targeting
XML or Java), or to use another "error" character (I would *never* choose
'?' for that). One can set up one's own callback function for conversion
errors.

> > Should we strip the best fit mappings from the table and post it
> > somewhere?

There's one already.

> They're fine, but could be improved by adding a hint how they were
> determined, and who could fix them if needed.

The "bestfit" one should NOT be used for the registration. It could be
seen as making any "better" converters (e.g. generating XML escapes)
"non-conforming" (each requiring a different charset registration;
'Windows-1252-XMLescapes', 'Windows-1252-XMLescapes-boldnredCSS',
'Windows-1252-XMLescapes-boldnredCSS-butSUPforsuperscripts',
'Windows-1252-johndoesbetterfit', ...). I hope you don't want that.

/kent k
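
The kind of caller-controlled error handling described above has a rough analogue in Python's codecs module, sketched below. This is not the ICU API; it is only an illustration of the same idea with a different library. The handler name "java_escape" is made up for this example, while "xmlcharrefreplace" is one of Python's built-in handlers.

```python
import codecs


def java_escape(exc):
    """Hypothetical handler: emit \\uXXXX escapes for unencodable text.

    Only meant for BMP characters; anything else would need surrogate pairs.
    """
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    escaped = "".join(f"\\u{ord(ch):04X}" for ch in bad)
    # Return the replacement text and the position at which to resume.
    return escaped, exc.end


codecs.register_error("java_escape", java_escape)

text = "x\u2089"  # U+2089 is not in windows-1252
xml_ish = text.encode("cp1252", errors="xmlcharrefreplace")
java_ish = text.encode("cp1252", errors="java_escape")

print(xml_ish)
print(java_ish)
```

Either output keeps the original character recoverable, which a '?' substitution or a best-fit '9' does not.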
Erik van der Poel
2006-10-22 16:57:38 UTC
I have to admit that Kent does make an important point here. The
example that really drives that point home is
Windows-1252-johndoesbetterfit. The best fit tables provided by
Microsoft are their own choices for mappings from the very large
Unicode set to smaller sets. Other implementors could and do come up
with other choices, depending on their particular product, target
market and current compatibility considerations.

The most important mapping, in my view, is the one from the charset to
Unicode/10646. RFC 2978 is actually a little bit inconsistent here, in
that it mentions mappings to 10646 twice, and to/from 10646 only once.
Just look for "10646" and you will see what I mean.

I believe my attempt to assist in the windows-1252 registration update
has revealed a lack of consensus (albeit among a very small number of
participants) regarding the "best fit" mappings. I wonder if we should
even restrict the normative/recommended 10646 mappings to the "to
10646" mappings, making any supplied "from 10646" mappings either
purely informative or maybe even unrecommended, since they appear to
be controversial.

Erik

Mark Davis
2006-10-22 17:41:40 UTC
I agree. I think it would be far more straightforward and well-defined if all
non-round-trip mappings were excluded from the registrations.

Mark
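
Excluding non-round-trip mappings can be sketched as a simple filter. The sample rows are the WCTABLE lines quoted earlier in this thread; the to-Unicode direction is taken from Python's "cp1252" codec, which is an assumption of this example and not the registered table, and the function names are made up for illustration.

```python
# (Unicode code point, windows-1252 byte) pairs quoted from bestfit1252.txt.
WCTABLE_SAMPLE = [
    (0x2089, 0x39),  # Subscript Nine
    (0x20AC, 0x80),  # Euro Sign
    (0x20A1, 0xA2),  # Colon Sign
    (0x20A4, 0xA3),  # Lira Sign
    (0x20A7, 0x50),  # Peseta Sign
    (0x2102, 0x43),  # Double-Struck Capital C
]


def to_unicode(byte):
    """Code point that a byte decodes to, or None if it is unassigned."""
    try:
        return ord(bytes([byte]).decode("cp1252"))
    except UnicodeDecodeError:
        return None


def round_trip_only(wctable):
    """Keep only rows whose byte decodes back to the same code point."""
    return [(cp, byte) for cp, byte in wctable if to_unicode(byte) == cp]


kept = round_trip_only(WCTABLE_SAMPLE)
for cp, byte in kept:
    print(f"U+{cp:04X} <-> 0x{byte:02X}")
```

Of the six sample rows only the Euro Sign survives; the other five are one-way fallbacks onto bytes that decode to different characters.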

Erik van der Poel
2006-10-22 19:58:45 UTC
I have come across an interoperability problem where one
implementation supports two mappings to a particular 10646 codepoint
and another implementation only supports one of those mappings, in the
Big5 charset:

\xA4\x51 <-> U+5341 (reverse mapping from MS best fit table)
\xA2\xCC -> U+5341

(\x introduces a Big5 byte in hex, U+ introduces a 10646 codepoint in hex)

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

So I don't think it will be sufficient to include only the
round-trippable mappings. Now, if we include non-round-trip mappings,
we will probably have to indicate which mapping to use when converting
in the other direction (from 10646). This can be done in at least 2
different ways: mark one of the mappings in the "to 10646" table as
the one to use in the other direction, or provide a full "from 10646"
table, with or without best fit mappings depending on the outcome of
this discussion.

Erik
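
The first of the two options (marking one "to 10646" row as the one to use in the other direction) might look like the sketch below. The two rows are the Big5 mappings quoted above; the three-column layout with a boolean flag and the function names are hypothetical and do not correspond to any existing table format.

```python
# (Big5 bytes, 10646 code point, use this row when converting from 10646?)
TO_10646 = [
    (b"\xA4\x51", 0x5341, True),   # round-trip mapping
    (b"\xA2\xCC", 0x5341, False),  # accepted on input only
]

# Every row is usable when converting to 10646.
decode_table = {big5: cp for big5, cp, _ in TO_10646}

# Only the marked rows are used when converting from 10646.
encode_table = {cp: big5 for big5, cp, preferred in TO_10646 if preferred}


def decode_char(big5):
    """Convert one Big5 byte pair to a 10646 code point."""
    return decode_table[big5]


def encode_char(cp):
    """Convert one 10646 code point to its preferred Big5 byte pair."""
    return encode_table[cp]


print(hex(decode_char(b"\xA2\xCC")))
print(encode_char(0x5341))
```

A decoder built this way accepts both byte sequences, while an encoder always emits the marked one, so text read from \xA2\xCC is written back as \xA4\x51.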

Martin Duerst
2006-10-23 07:11:55 UTC
At 04:58 06/10/23, Erik van der Poel wrote:
>I have come across an interoperability problem

Can you better explain what exactly the interoperability
problem is/how it will be solved by including non-round-trip
mappings?

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Erik van der Poel
2006-10-23 14:27:15 UTC
There are Web pages out there that use \xA2\xCC in Big5, and there is
at least one implementation out there that does not include this in
its Unicode mapping table. So you end up with garbled text, e.g. a '?'
question mark or missing glyph symbol, looking out of place in the
middle of Chinese text. If both mappings had been specified in the
table at the time that the implementation was created, then this
problem would not have occurred. Of course, it would have been better
if only one of the Big5 encodings of that character were in use, but
this is, in fact, not the case.

Erik

Martin Duerst
2006-10-23 07:09:27 UTC
At 01:57 06/10/23, Erik van der Poel wrote:
>I have to admit that Kent does make an important point here. The
>example that really drives that point home is
>Windows-1252-johndoesbetterfit. The best fit tables provided by
>Microsoft are their own choices for mappings from the very large
>Unicode set to smaller sets. Other implementors could and do come up
>with other choices, depending on their particular product, target
>market and current compatibility considerations.
>
>The most important mapping, in my view, is the one from the charset to
>Unicode/10646. RFC 2978 is actually a little bit inconsistent here, in
>that it mentions mappings to 10646 twice, and to/from 10646 only once.
>Just look for "10646" and you will see what I mean.
>
>I believe my attempt to assist in the windows-1252 registration update
>has revealed a lack of consensus

I have seen Kent, Mark, and me clearly against the best-fit
fallbacks (I mentioned they could go into "Additional Information",
but I'd be fine if they weren't around at all). You also seem to
agree with that position. I haven't seen any opinion for or
against these tables from Ned or Frank. So in my eyes, it looks
like a consensus is forming, although more people's opinions would
be appreciated.

Regards, Martin.

>(albeit among a very small number of
>participants) regarding the "best fit" mappings. I wonder if we should
>even restrict the normative/recommended 10646 mappings to the "to
>10646" mappings, making any supplied "from10646" mappings either
>purely informative or maybe even unrecommended, since they appear to
>be controversial.



#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Erik van der Poel
2006-10-23 15:42:10 UTC
One pair of Windows APIs is MultiByteToWideChar and
WideCharToMultiByte. You can find several other MSIE APIs by looking
for "mlang" at msdn.microsoft.com/library. In particular, look for
"mlconvchar".

I don't know how the best fit tables were created. Some parts look
like they were mechanically generated, but other parts make me suspect
that they were hand-edited, e.g. the inconsistent use of tabs and
spaces and the inconsistent use of Control-Z at EOF (end of file).

It is up to Microsoft to supply an email contact for the best fit
files. Since people sometimes leave companies, it might also be a good
idea to provide an alias or list like cpxlate or globaldev(?).

I will address the "u2w.icu( x ) != u2w.bestfit( x )" in a separate email.

Erik

On 10/21/06, Frank Ellermann <***@xyzzy.claranet.de> wrote:
> Erik van der Poel wrote:
>
> > I don't know who created the tables, but they were submitted by an
> > individual from Microsoft.
>
> For "surprising" mappings it's interesting to know how they could be
> reproduced or verified, or if that's maybe only an observation with
> API xyz version m.n by an "unknown" individual.
>
> > ICU may have chosen 0x1A, but that was their own decision. There is
> > no interoperability problem here
>
> An u2w.icu( x ) != u2w.bestfit( x ) effect could be ugly. For some
> code pages like <http://purl.net/net/cp/858> ICU tries hard to list
> an "official" substitution character, in that case 0x7F, not 0x1A.
>
> > The 698 WCTABLE mappings are from Microsoft's implementation.
> [...]
> > I have confirmed that their implementation does return these.
>
> Thanks for info, "did anybody check this" was a part of my question.
>
> > The mappings are sorted in a strange way. Maybe they will fix that,
> > but it shouldn't prevent this charset from being updated at IANA.
>
> Sure, that's why I've changed the subject. I wanted to know how the
> new "best fit" tables were created. This "best fit" is unrelated to
> IANA considerations.
>
> > Should we strip the best fit mappings from the table and post it
> > somewhere?
>
> They're fine, but could be improved by adding a hint how they were
> determined, and who could fix them if needed.
>
> Frank
>
>
>
Martin Duerst
2006-10-22 08:19:40 UTC
Regarding the 'best fit' mappings, I think it would be good
to add a pointer to these to the "Additional Information"
section, but they should not be part of the definition of
the charset proper.

First a charset is defined as a way to get from bytes to characters,
for which "best fit" mappings are only marginally relevant.
No mapping of an illegal byte (sequence) should be called "best fit".

Second, Microsoft provides a way to switch these off. A MIME
application (such as an MUA or a Web page editing application)
really SHOULD NOT use the "best fit"; in the Web case, it
should use NCRs (numeric character references), in the mail
case, it should automatically choose another encoding or
ask the user to choose another one.
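
To make the difference between these fallback policies concrete, here is
a minimal sketch in Python. It is an assumption for illustration only:
it uses Python's own "cp1252" codec and its standard error handlers, not
Microsoft's converter, and Python has no built-in "best fit" table. The
sample string is made up.

```python
# Sample text: the euro sign and superscript two exist in windows-1252,
# the CJK character U+4E2D does not.
text = "5\u20ac, x\u00b2, \u4e2d"

# Policy 1: refuse to convert, so the application can choose another
# encoding or ask the user.
try:
    text.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Policy 2: substitute '?' (0x3F) for anything unmappable.
replaced = text.encode("cp1252", errors="replace")

# Policy 3: emit numeric character references, as suggested for the Web case.
ncr = text.encode("cp1252", errors="xmlcharrefreplace")

print(strict_failed)
print(replaced)
print(ncr)
```

Only the third policy keeps the original character recoverable by the
recipient; the second silently loses it.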

In summary, charsets are first and foremost a means to correctly
exchange correct data, and only secondarily a means to label
processing behavior including error treatment and fallbacks.

Regards, Martin.

At 00:30 06/10/22, Erik van der Poel wrote:
>I don't know who created the tables, but they were submitted by an
>individual from Microsoft.
>
>The windows-1252 iana charset update does offer a contact (Mike Ksar).
>
>The "CPINFO 1 0x3F 0x003F" simply indicates how Microsoft's
>implementation maps characters that are not in the destination charset
>(0x3F) or illegally encoded (0x003F), depending on the direction (from
>Unicode or to Unicode). See the readme. ICU may have chosen 0x1A, but
>that was their own decision. There is no interoperability problem here
>because the legal characters are fully specified.
>
>The 698 WCTABLE mappings are from Microsoft's implementation. Many of
>them are "best fit" mappings. I have confirmed that their
>implementation does return these. They do have an option to turn off
>the "best fit" mappings.
>
>The mappings are sorted in a strange way. Maybe they will fix that,
>but it shouldn't prevent this charset from being updated at IANA.
>
>Regarding Kent's comment saying that best fit mappings should not be
>part of an IANA registration: First, Martin says there's not enough
>info, now you say there's too much! :-)
>
>Should we strip the best fit mappings from the table and post it somewhere?
>
>Erik
>
>On 10/21/06, Frank Ellermann <***@xyzzy.claranet.de> wrote:
>> Erik van der Poel wrote:
>>
>> > http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
>>
>> Who created these new "best fit" tables ? The old table...
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>>
>> ...offers a Contact: cpxlate AT microsoft.com
>>
>> The reason I ask is the line "CPINFO 1 0x3F 0x003F" in this table. It's
>> how I implemented it for "codepage 1004" (an OS/2 alias for windows-1252).
>>
>> <http://purl.net/net/cp/1252> (ICU) proposes 0x1A for windows-1252.
>>
>> What's the source of the 698 WCTABLE mappings ? The sorting of 0x20ac
>> could be some artefact of 0x20a0 (I've no idea what u+20A0 really is):
>>
>> [...]
>> 0x2089 0x39 ;Subscript Nine
>> 0x20ac 0x80 ;Euro Sign
>> 0x20a1 0xa2 ;Colon Sign
>> 0x20a4 0xa3 ;Lira Sign
>> 0x20a7 0x50 ;Peseta Sign
>> 0x2102 0x43 ;Double-Struck Capital C
>> [...]
>>
>> Frank
>>
>>
>>


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Ned Freed
2006-10-21 18:19:54 UTC
> I would like to suggest something like this for windows-1252.

Couple of comments inline below.

Ned

> ---------------------------

> Charset name: windows-1252

> Charset aliases: (None)

> Suitability for use in MIME text: Yes

> Published specification(s):

> 1) Dr. International "Developing International Software, Second Edition",
> Microsoft Press, ISBN 0-7356-1583-7, 2003, p. 743-747

> 2) http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

> ISO 10646 equivalency table:

I think a note about the best fit stuff that's included in these
tables might be appropriate here.

> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt

> Additional information:

> This is an update of an existing registration of this charset. This
> charset name is in use.

> Older versions of this charset have been registered as
> ISO-8859-1-Windows-3.0-Latin-1
> and ISO-8859-1-Windows-3.1-Latin-1.

Another alias I sometimes see is cp1252. I have no problem with omitting
this from the official list of aliases above, but would it make sense to
mention it here?

> The graphic (non-control) characters of Windows-1252 are a superset of
> the graphic characters of the ISO-8859-1 charset. See the range 80 to
> 9F (hex).
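
A quick way to check the superset statement and the cp1252 alias is the
sketch below. It assumes Python's built-in "cp1252" and "latin-1" codecs
as stand-ins for the two charsets; Python's table may differ from
Microsoft's implementation for the few bytes in 80 to 9F that it treats
as undefined.

```python
import codecs

# Python resolves both names to the same codec.
same_codec = codecs.lookup("windows-1252").name == codecs.lookup("cp1252").name

# Outside 80..9F, every byte decodes identically in both charsets.
same_outside = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")
    for b in list(range(0x00, 0x80)) + list(range(0xA0, 0x100))
)

# Inside 80..9F, windows-1252 assigns graphic characters where
# ISO-8859-1 has C1 controls. None marks bytes Python leaves undefined.
c1_range = {}
for b in range(0x80, 0xA0):
    try:
        c1_range[b] = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        c1_range[b] = None

print(same_codec, same_outside)
print(hex(ord(c1_range[0x80])))  # euro sign
```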

> Person & email address to contact for further information:

> Mike Ksar
> Email: ***@microsoft.com

> Microsoft Corporation
> One Microsoft Way,
> Redmond, WA 98052
> U.S.A.

> Intended usage: COMMON
Martin Duerst
2006-10-22 08:26:05 UTC
At 03:19 06/10/22, Ned Freed wrote:
>> I would like to suggest something like this for windows-1252.
>
>Couple of comments inline below.
>
> Ned
>
>> ---------------------------
>
>> Charset name: windows-1252
>
>> Charset aliases: (None)
>
>> Suitability for use in MIME text: Yes

Would it make sense here to write:
"yes, with 8bit Content-Transfer-Encoding" ?

Regards, Martin.



#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Erik van der Poel
2006-10-23 14:14:37 UTC
This might be misinterpreted as "not suitable with any
content-transfer-encoding other than 8bit", but quoted-printable is
equally suitable, particularly when it is used to keep lines short
(though format=flowed is considered by some to be a less invasive
solution to that problem).

How about saying "Note that this charset cannot be used with
Content-Transfer-Encoding: 7bit" in the Additional information
section?
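
For illustration, a small sketch of why 7bit does not work for typical
windows-1252 text while quoted-printable does. It assumes Python's
"cp1252" codec and the standard quopri module; the sample string and the
helper name is_7bit are made up for this example.

```python
import quopri

def is_7bit(data: bytes) -> bool:
    """True if every octet is below 0x80, as the 7bit CTE requires."""
    return all(b < 0x80 for b in data)

# "café" plus a euro sign: two characters outside US-ASCII.
raw = "caf\u00e9 5\u20ac".encode("cp1252")

# Quoted-printable turns each 8-bit octet into an =XX escape.
qp = quopri.encodestring(raw)

print(is_7bit(raw))   # the raw octets are not 7bit-safe
print(is_7bit(qp))    # the quoted-printable form is
print(qp)
```

Decoding the quoted-printable form gives back the original octets, so
nothing is lost in transit.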

Erik

> >> Charset name: windows-1252
> >
> >> Suitability for use in MIME text: Yes
>
> Would it make sense here to write:
> "yes, with 8bit Content-Transfer-Encoding" ?
Erik van der Poel
2006-10-23 16:24:44 UTC
Maybe I can correct myself before Ned wakes up. :-) I guess
windows-1252 _can_ be used with Content-Transfer-Encoding: 7bit if the
text does not use any of the 8-bit characters.

E.

On 10/23/06, Erik van der Poel <***@google.com> wrote:
> This might be misinterpreted as "not suitable with any
> content-transfer-encoding other than 8bit", but quoted-printable is
> equally suitable, particularly when it is used to keep lines short
> (though format=flowed is considered by some to be a less invasive
> solution to that problem).
>
> How about saying "Note that this charset cannot be used with
> Content-Transfer-Encoding: 7bit" in the Additional information
> section?
>
> Erik
>
> > >> Charset name: windows-1252
> > >
> > >> Suitability for use in MIME text: Yes
> >
> > Would it make sense here to write:
> > "yes, with 8bit Content-Transfer-Encoding" ?
>
Martin Duerst
2006-10-24 02:26:18 UTC
Hello Erik,

At 01:24 06/10/24, Erik van der Poel wrote:
>Maybe I can correct myself before Ned wakes up. :-) I guess
>windows-1252 _can_ be used with Content-Transfer-Encoding: 7bit if the
>text does not use any of the 8-bit characters.

I think this goes too far. It basically would mean something like
"You can use windows-1252 with CTE 7bit if it's just US-ASCII,
not windows-1252." In that case, it should simply be labeled as
US-ASCII.

>On 10/23/06, Erik van der Poel <***@google.com> wrote:
>> This might be misinterpreted as "not suitable with any
>> content-transfer-encoding other than 8bit", but quoted-printable is
>> equally suitable, particularly when it is used to keep lines short
>> (though format=flowed is considered by some to be a less invasive
>> solution to that problem).
>>
>> How about saying "Note that this charset cannot be used with
>> Content-Transfer-Encoding: 7bit" in the Additional information
>> section?
>>
>> Erik

You're right that base64 and QP are also okay. Your proposal
for text in the Additional Information section seems to make
sense to me. I'm waiting to hear from Ned, he knows these things
much better than me.

Regards, Martin.


>> > >> Charset name: windows-1252
>> > >
>> > >> Suitability for use in MIME text: Yes
>> >
>> > Would it make sense here to write:
>> > "yes, with 8bit Content-Transfer-Encoding" ?
>>


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Kent Karlsson
2006-10-23 15:49:34 UTC
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/glibc-BIG5-2.3.3.ucm?revision=1.1&view=markup

gives 10 similar cases. They are listed as "|3 for
the best reverse fallback Unicode scaler [sic] value".

I think such cases may well be included in IANA charset
registry (referred) mapping tables, as they represent a
character "more equivalent than canonically equivalent"
to the character it is mapped to.

They are not fallbacks in the sense I referred to previously;
the latter would be "|1 for the best fallback codepage byte
sequence", with a large question mark for the "best" part.
(None are given in the glibc-BIG5-2.3.3.ucm file.)
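
To show how the two kinds of entries could be kept apart, here is a
rough sketch that reads ucm-style lines and builds separate tables for
the two directions. The two sample lines are illustrative, built from
the Big5 mappings quoted in this thread rather than copied from the ucm
file, and the function name load_ucm is hypothetical. The flag handling
assumes |0 means round-trip, |1 fallback (from Unicode only) and |3
reverse fallback (to Unicode only).

```python
import re

# Illustrative sample, not copied from glibc-BIG5-2.3.3.ucm.
SAMPLE_UCM = r"""
<U5341> \xA4\x51 |0
<U5341> \xA2\xCC |3
"""

LINE = re.compile(r"<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|([0-3])")

def load_ucm(ucm_text):
    """Return (to_unicode, from_unicode) dictionaries from ucm-style lines."""
    to_unicode, from_unicode = {}, {}
    for line in ucm_text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip blank lines and anything that is not a mapping
        char = chr(int(m.group(1), 16))
        raw = bytes.fromhex(m.group(2).replace("\\x", ""))
        flag = int(m.group(3))
        if flag in (0, 3):   # round-trip or reverse fallback: bytes -> Unicode
            to_unicode[raw] = char
        if flag in (0, 1):   # round-trip or fallback: Unicode -> bytes
            from_unicode[char] = raw
    return to_unicode, from_unicode

to_unicode, from_unicode = load_ucm(SAMPLE_UCM)
print(to_unicode)
print(from_unicode)
```

With this split, both byte sequences decode to U+5341, while encoding
U+5341 always produces the single round-trip sequence.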

(I haven't scanned the 1003 other ucm files in
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/.)

/kent k

> -----Original Message-----
> From: Erik van der Poel [mailto:***@google.com]
> Sent: Monday, October 23, 2006 4:27 PM
> To: Martin Duerst
> Cc: Mark Davis; Kent Karlsson; Frank Ellermann;
> ietf-***@mail.apps.ietf.org
> Subject: Re: Best fit
>
>
> There are Web pages out there that use \xA2\xCC in Big5, and there is
> at least one implementation out there that does not include this in
> its Unicode mapping table. So you end up with garbled text, e.g. a '?'
> question mark or missing glyph symbol, looking out of place in the
> middle of Chinese text. If both mappings had been specified in the
> table at the time that the implementation was created, then this
> problem would not have occurred. Of course, it would have been better
> if only one of the Big5 encodings of that character were in use, but
> this is, in fact, not the case.
>
> Erik
>
> On 10/23/06, Martin Duerst <***@it.aoyama.ac.jp> wrote:
> > At 04:58 06/10/23, Erik van der Poel wrote:
> > >I have come across an interoperability problem
> >
> > Can you better explain what exactly the interoperability
> > problem is/how it will be solved by including non-round-trip
> > mappings?
> >
> > Regards, Martin.
> >
> > >where one
> > >implementation supports two mappings to a particular 10646
> codepoint
> > >and another implementation only supports one of those
> mappings, in the
> > >Big5 charset:
> > >
> > >\xA4\x51 <-> U+5341 (reverse mapping from MS best fit table)
> > >\xA2\xCC -> U+5341
> > >
> > >(\x introduces a Big5 byte in hex, U+ introduces a 10646
> codepoint in hex)
> > >
> >
> > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
> > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
> > >http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT
> > >http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
> > >
> > >So I don't think it will be sufficient to include only the
> > >round-trippable mappings. Now, if we include
> non-round-trip mappings,
> > >we will probably have to indicate which mapping to use
> when converting
> > >in the other direction (from 10646). This can be done in at least 2
> > >different ways: mark one of the mappings in the "to 10646" table as
> > >the one to use in the other direction, or provide a full
> "from 10646"
> > >table, with or without best fit mappings depending on the
> outcome of
> > >this discussion.
> > >
> > >Erik
> > >
> > >On 10/22/06, Mark Davis <***@icu-project.org> wrote:
> > >> I agree I think it would be far more straightforward and
> well-defined if all
> > >> non-roundtrip mappings were excluded from the registrations.
> > >>
> > >> Mark
> > >>
> > >>
> > >> On 10/22/06, Erik van der Poel <***@google.com> wrote:
> > >> > I have to admit that Kent does make an important point
> here. The
> > >> > example that really drives that point home is
> > >> > Windows-1252-johndoesbetterfit. The best fit tables provided by
> > >> > Microsoft are their own choices for mappings from the
> very large
> > >> > Unicode set to smaller sets. Other implementors could
> and do come up
> > >> > with other choices, depending on their particular
> product, target
> > >> > market and current compatibility considerations.
> > >> >
> > >> > The most important mapping, in my view, is the one
> from the charset to
> > >> > Unicode/10646. RFC 2978 is actually a little bit
> inconsistent here, in
> > >> > that it mentions mappings to 10646 twice, and to/from
> 10646 only once.
> > >> > Just look for "10646" and you will see what I mean.
> > >> >
> > >> > I believe my attempt to assist in the windows-1252
> registration update
> > >> > has revealed a lack of consensus (albeit among a very
> small number of
> > >> > participants) regarding the "best fit" mappings. I
> wonder if we should
> > >> > even restrict the normative/recommended 10646 mappings
> to the "to
> > >> > 10646" mappings, making any supplied "from10646"
> mappings either
> > >> > purely informative or maybe even unrecommended, since
> they appear to
> > >> > be controversial.
> > >> >
> > >> > Erik
> > >> >
> > >> > On 10/22/06, Kent Karlsson <***@comhem.se> wrote:
> > >> > >
> > >> > > Frank Ellermann wrote:
> > >> > > > > ICU may have chosen 0x1A, but that was their own
> decision. There is
> > >> > > > > no interoperability problem here
> > >> > > >
> > >> > > > An u2w.icu( x ) != u2w.bestfit( x ) effect could
> be ugly. For some
> > >> > >
> > >> > > As I said, the fallbacks do not belong in the
> registration. It should be
> > >> > > perfectly ok to use other fallbacks. E.g. generating
> higher level
> > >> > > markup,
> > >> > > be it character escapes or more [like <sup>...</sup>
> for instance, or
> > >> > > <span class="red">...</span>], or some
> "this-is-even-better-fit".
> > >> > >
> > >> > > The fallbacks ("bestfit") of the "bestfit" file
> should *NOT* be part of
> > >> > > the IANA charset registration!
> > >> > >
> > >> > > > code pages like < http://purl.net/net/cp/858> ICU
> tries hard to list
> > >> > > > an "official" substitution character, in that case
> 0x7F, not 0x1A.
> > >> > >
> > >> > > As I mentioned, the ICU API allows the programmer
> quite a lot of control
> > >> > > on how to handle conversion errors. One can set it
> up to automatically
> > >> > > generate XML-ish or Java-ish escapes (which I
> prefer, even if not
> > >> > > targeting
> > >> > > XML or Java), or to use another "error" character (I
> would *never*
> > >> > > choose '?'
> > >> > > for that). One can set up ones own callback function
> for conversion
> > >> > > errors.
> > >> > >
> > >> > > > > Should we strip the best fit mappings from the
> table and post it
> > >> > > > > somewhere?
> > >> > >
> > >> > > There's one already.
> > >> > >
> > >> > > > They're fine, but could be improved by adding a
> hint how they were
> > >> > > > determined, and who could fix them if needed.
> > >> > >
> > >> > > The "bestit" one should NOT be used for the
> registration. It could be
> > >> > > seen as making any "better" converters (e.g.
> generating XML escapes)
> > >> > > "non-conforming" (each requiring a different charset
> registration;
> > >> > > 'Windows-1252-XMLescapes',
> > >> 'Windows-1252-XMLescapes-boldnredCSS',
> > >> > >
> > >> 'Windows-1252-XMLescapes-boldnredCSS-butSUPforsuperscripts',
> > >> > > 'Windows-1252-johndoesbetterfit', ...). I hope you
> > >> don't want that.
> > >> > >
> > >> > > /kent k
> > >> > >
> > >> > >
> > >> >
> > >>
> > >>
> >
> >
> > #-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> > #-#-# http://www.sw.it.aoyama.ac.jp
> mailto:***@it.aoyama.ac.jp
> >
> >
>
Erik van der Poel
2006-10-23 17:05:25 UTC
It is quite clear that an implementor must take non-round-trippable
mappings from the charset to 10646 into account, in order to
interoperate with other implementations. It may not be as clear that
IANA should take these into account, but it would be in the spirit of
the IETF, since that organization is quite concerned about
interoperability.

Now, in the other direction, i.e. from 10646 to the charset, re:
"u2w.icu( x ) != u2w.bestfit( x )", it may not be so easy to come up
with scenarios where an implementor would have to mimic another
implementation. One contrived scenario might be a kind of gateway that
converts from a Unicode-based encoding to windows-*, using
Microsoft-chosen best fit mappings. If another implementor wanted to
replace that gateway, they would have to know the best fit mappings.

If no one can come up with realistic scenarios requiring the knowledge
of best fit mappings in the "from 10646" direction, and if there is
consensus that the non-round-trippable mappings in the other direction
(to 10646) are important, perhaps it is time to discuss updating RFC
2978 (or adding another erratum).

http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=2978&

Erik

On 10/23/06, Kent Karlsson <***@comhem.se> wrote:
>
> http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/glibc-BIG5-2.3.3.ucm?revision=1.1&view=markup
>
> gives 10 similar cases. They are listed as "|3 for
> the best reverse fallback Unicode scaler [sic] value".
>
> I think such cases may well be included in IANA charset
> registry (referred) mapping tables, as they represent a
> character "more equivalent than canonically equivalent"
> to the character it is mapped to.
>
> They are not fallbacks in the sense I referred to previously;
> the latter would be "|1 for the best fallback codepage byte
> sequence", with a large question mark for the "best" part.
> (None are given in the glibc-BIG5-2.3.3.ucm file.)
>
> (I haven't scanned the 1003 other ucm files in
> http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/.)
>
> /kent k
>
> > -----Original Message-----
> > From: Erik van der Poel [mailto:***@google.com]
> > Sent: Monday, October 23, 2006 4:27 PM
> > To: Martin Duerst
> > Cc: Mark Davis; Kent Karlsson; Frank Ellermann;
> > ietf-***@mail.apps.ietf.org
> > Subject: Re: Best fit
> >
> >
> > There are Web pages out there that use \xA2\xCC in Big5, and there is
> > at least one implementation out there that does not include this in
> > its Unicode mapping table. So you end up with garbled text, e.g. a '?'
> > question mark or missing glyph symbol, looking out of place in the
> > middle of Chinese text. If both mappings had been specified in the
> > table at the time that the implementation was created, then this
> > problem would not have occurred. Of course, it would have been better
> > if only one of the Big5 encodings of that character were in use, but
> > this is, in fact, not the case.
> >
> > Erik
> >
> > On 10/23/06, Martin Duerst <***@it.aoyama.ac.jp> wrote:
> > > At 04:58 06/10/23, Erik van der Poel wrote:
> > > >I have come across an interoperability problem
> > >
> > > Can you better explain what exactly the interoperability
> > > problem is/how it will be solved by including non-round-trip
> > > mappings?
> > >
> > > Regards, Martin.
> > >
> > > >where one
> > > >implementation supports two mappings to a particular 10646
> > codepoint
> > > >and another implementation only supports one of those
> > mappings, in the
> > > >Big5 charset:
> > > >
> > > >\xA4\x51 <-> U+5341 (reverse mapping from MS best fit table)
> > > >\xA2\xCC -> U+5341
> > > >
> > > >(\x introduces a Big5 byte in hex, U+ introduces a 10646
> > codepoint in hex)
> > > >
> > >
> > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
> > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
> > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT
> > > >http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
> > > >
> > > >So I don't think it will be sufficient to include only the
> > > >round-trippable mappings. Now, if we include
> > non-round-trip mappings,
> > > >we will probably have to indicate which mapping to use
> > when converting
> > > >in the other direction (from 10646). This can be done in at least 2
> > > >different ways: mark one of the mappings in the "to 10646" table as
> > > >the one to use in the other direction, or provide a full
> > "from 10646"
> > > >table, with or without best fit mappings depending on the
> > outcome of
> > > >this discussion.
> > > >
> > > >Erik
> > > >
> > > >On 10/22/06, Mark Davis <***@icu-project.org> wrote:
> > > >> I agree I think it would be far more straightforward and
> > well-defined if all
> > > >> non-roundtrip mappings were excluded from the registrations.
> > > >>
> > > >> Mark
> > > >>
> > > >>
> > > >> On 10/22/06, Erik van der Poel <***@google.com> wrote:
> > > >> > I have to admit that Kent does make an important point
> > here. The
> > > >> > example that really drives that point home is
> > > >> > Windows-1252-johndoesbetterfit. The best fit tables provided by
> > > >> > Microsoft are their own choices for mappings from the
> > very large
> > > >> > Unicode set to smaller sets. Other implementors could
> > and do come up
> > > >> > with other choices, depending on their particular
> > product, target
> > > >> > market and current compatibility considerations.
> > > >> >
> > > >> > The most important mapping, in my view, is the one
> > from the charset to
> > > >> > Unicode/10646. RFC 2978 is actually a little bit
> > inconsistent here, in
> > > >> > that it mentions mappings to 10646 twice, and to/from
> > 10646 only once.
> > > >> > Just look for "10646" and you will see what I mean.
> > > >> >
> > > >> > I believe my attempt to assist in the windows-1252
> > registration update
> > > >> > has revealed a lack of consensus (albeit among a very
> > small number of
> > > >> > participants) regarding the "best fit" mappings. I
> > wonder if we should
> > > >> > even restrict the normative/recommended 10646 mappings
> > to the "to
> > > >> > 10646" mappings, making any supplied "from10646"
> > mappings either
> > > >> > purely informative or maybe even unrecommended, since
> > they appear to
> > > >> > be controversial.
> > > >> >
> > > >> > Erik
> > > >> >
> > > >> > On 10/22/06, Kent Karlsson <***@comhem.se> wrote:
> > > >> > >
> > > >> > > Frank Ellermann wrote:
> > > >> > > > > ICU may have chosen 0x1A, but that was their own
> > decision. There is
> > > >> > > > > no interoperability problem here
> > > >> > > >
> > > >> > > > An u2w.icu( x ) != u2w.bestfit( x ) effect could
> > be ugly. For some
> > > >> > >
> > > >> > > As I said, the fallbacks do not belong in the
> > registration. It should be
> > > >> > > perfectly ok to use other fallbacks. E.g. generating
> > higher level
> > > >> > > markup,
> > > >> > > be it character escapes or more [like <sup>...</sup>
> > for instance, or
> > > >> > > <span class="red">...</span>], or some
> > "this-is-even-better-fit".
> > > >> > >
> > > >> > > The fallbacks ("bestfit") of the "bestfit" file
> > should *NOT* be part of
> > > >> > > the IANA charset registration!
> > > >> > >
> > > >> > > > code pages like < http://purl.net/net/cp/858> ICU
> > tries hard to list
> > > >> > > > an "official" substitution character, in that case
> > 0x7F, not 0x1A.
> > > >> > >
> > > >> > > As I mentioned, the ICU API allows the programmer
> > quite a lot of control
> > > >> > > on how to handle conversion errors. One can set it
> > up to automatically
> > > >> > > generate XML-ish or Java-ish escapes (which I
> > prefer, even if not
> > > >> > > targeting
> > > >> > > XML or Java), or to use another "error" character (I
> > would *never*
> > > >> > > choose '?'
> > > >> > > for that). One can set up ones own callback function
> > for conversion
> > > >> > > errors.
> > > >> > >
> > > >> > > > > Should we strip the best fit mappings from the
> > table and post it
> > > >> > > > > somewhere?
> > > >> > >
> > > >> > > There's one already.
> > > >> > >
> > > >> > > > They're fine, but could be improved by adding a
> > hint how they were
> > > >> > > > determined, and who could fix them if needed.
> > > >> > >
> > > >> > > The "bestit" one should NOT be used for the
> > registration. It could be
> > > >> > > seen as making any "better" converters (e.g.
> > generating XML escapes)
> > > >> > > "non-conforming" (each requiring a different charset
> > registration;
> > > >> > > 'Windows-1252-XMLescapes',
> > > >> 'Windows-1252-XMLescapes-boldnredCSS',
> > > >> > >
> > > >> 'Windows-1252-XMLescapes-boldnredCSS-butSUPforsuperscripts',
> > > >> > > 'Windows-1252-johndoesbetterfit', ...). I hope you
> > > >> don't want that.
> > > >> > >
> > > >> > > /kent k
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > >
> > >
> > > #-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> > > #-#-# http://www.sw.it.aoyama.ac.jp
> > mailto:***@it.aoyama.ac.jp
> > >
> > >
> >
>
>
Martin Duerst
2006-10-24 03:43:06 UTC
Hello Erik,

I think I agree with you.

I'm sorry I forgot about the BIG5 anomaly of a few instances of
encoding the same character twice. I think in such a case, both
mappings should be listed, and a comment mentioning the anomaly
should be added. The number of such cases, as far as I know, is
very small, both in terms of affected charsets as well as in
terms of affected characters. In most cases, it is due to an
error when designing the charset, in some cases it is due to
a design guideline that differs from Unicode.

I think this is very different from "best fit" fallback mechanisms,
which are really up to the application to invoke (or avoid).
Such mappings should definitely not be listed in equivalence
tables.

Regards, Martin.

At 02:05 06/10/24, Erik van der Poel wrote:
>It is quite clear that an implementor must take non-round-trippable
>mappings from the charset to 10646 into account, in order to
>interoperate with other implementations. It may not be as clear that
>IANA should take these into account, but it would be in the spirit of
>the IETF, since that organization is quite concerned about
>interoperability.
>
>Now, in the other direction, i.e. from 10646 to the charset, re:
>"u2w.icu( x ) != u2w.bestfit( x )", it may not be so easy to come up
>with scenarios where an implementor would have to mimic another
>implementation. One contrived scenario might be a kind of gateway that
>converts from a Unicode-based encoding to windows-*, using
>Microsoft-chosen best fit mappings. If another implementor wanted to
>replace that gateway, they would have to know the best fit mappings.
>
>If noone can come up with realistic scenarios requiring the knowledge
>of best fit mappings in the "from 10646" direction, and if there is
>consensus that the non-round-trippable mappings in the other direction
>(to 10646) are important, perhaps it is time to discuss updating RFC
>2978 (or adding another erratum).
>
>http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=2978&
>
>Erik
>
>On 10/23/06, Kent Karlsson <***@comhem.se> wrote:
>>
>> http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/glibc-BIG5-2.3.3.ucm?revision=1.1&view=markup
>>
>> gives 10 similar cases. They are listed as "|3 for
>> the best reverse fallback Unicode scaler [sic] value".
>>
>> I think such cases may well be included in IANA charset
>> registry (referred) mapping tables, as they represent a
>> character "more equivalent than canonically equivalent"
>> to the character it is mapped to.
>>
>> They are not fallbacks in the sense I referred to previously;
>> the latter would be "|1 for the best fallback codepage byte
>> sequence", with a large question mark for the "best" part.
>> (None are given in the glibc-BIG5-2.3.3.ucm file.)
>>
>> (I haven't scanned the 1003 other ucm files in
>> http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/.)
>>
>> /kent k
>>
>> > -----Original Message-----
>> > From: Erik van der Poel [mailto:***@google.com]
>> > Sent: Monday, October 23, 2006 4:27 PM
>> > To: Martin Duerst
>> > Cc: Mark Davis; Kent Karlsson; Frank Ellermann;
>> > ietf-***@mail.apps.ietf.org
>> > Subject: Re: Best fit
>> >
>> >
>> > There are Web pages out there that use \xA2\xCC in Big5, and there is
>> > at least one implementation out there that does not include this in
>> > its Unicode mapping table. So you end up with garbled text, e.g. a '?'
>> > question mark or missing glyph symbol, looking out of place in the
>> > middle of Chinese text. If both mappings had been specified in the
>> > table at the time that the implementation was created, then this
>> > problem would not have occurred. Of course, it would have been better
>> > if only one of the Big5 encodings of that character were in use, but
>> > this is, in fact, not the case.
>> >
>> > Erik
>> >
>> > On 10/23/06, Martin Duerst <***@it.aoyama.ac.jp> wrote:
>> > > At 04:58 06/10/23, Erik van der Poel wrote:
>> > > >I have come across an interoperability problem
>> > >
>> > > Can you better explain what exactly the interoperability
>> > > problem is/how it will be solved by including non-round-trip
>> > > mappings?
>> > >
>> > > Regards, Martin.
>> > >
>> > > >where one
>> > > >implementation supports two mappings to a particular 10646 codepoint
>> > > >and another implementation only supports one of those mappings, in the
>> > > >Big5 charset:
>> > > >
>> > > >\xA4\x51 <-> U+5341 (reverse mapping from MS best fit table)
>> > > >\xA2\xCC -> U+5341
>> > > >
>> > > >(\x introduces a Big5 byte in hex, U+ introduces a 10646 codepoint in hex)
>> > > >
>> > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
>> > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
>> > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT
>> > > >http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
>> > > >
>> > > >So I don't think it will be sufficient to include only the
>> > > >round-trippable mappings. Now, if we include non-round-trip mappings,
>> > > >we will probably have to indicate which mapping to use when converting
>> > > >in the other direction (from 10646). This can be done in at least 2
>> > > >different ways: mark one of the mappings in the "to 10646" table as
>> > > >the one to use in the other direction, or provide a full "from 10646"
>> > > >table, with or without best fit mappings depending on the outcome of
>> > > >this discussion.
>> > > >



#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
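
Editor's illustration of the Big5 case quoted above: a minimal Python sketch, assuming a tiny hand-written two-line excerpt in ICU .ucm style (not copied from any real file). It follows the first option Erik describes, marking which "to 10646" mapping is also used in the other direction. The flag meanings are the ones cited in this thread (|1 fallback, |3 reverse fallback) plus |0 for round-trip. The names UCM_EXCERPT, parse_ucm and decode are hypothetical, and the decoder is deliberately simplified (ASCII plus two-byte sequences only).

```python
import re

# Hypothetical excerpt in ICU .ucm style, written by hand for this example.
# |0 = round-trip, |1 = fallback (from Unicode only),
# |3 = reverse fallback (to Unicode only).
UCM_EXCERPT = r"""
<U5341> \xA4\x51 |0
<U5341> \xA2\xCC |3
"""

LINE_RE = re.compile(
    r"<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|([0-3])"
)


def parse_ucm(text):
    """Build (to_unicode, from_unicode) tables from .ucm-style lines."""
    to_unicode = {}
    from_unicode = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        char = chr(int(m.group(1), 16))
        raw = bytes.fromhex(m.group(2).replace("\\x", ""))
        flag = int(m.group(3))
        if flag in (0, 3):      # usable when converting to 10646
            to_unicode[raw] = char
        if flag in (0, 1):      # usable when converting from 10646
            from_unicode[char] = raw
    return to_unicode, from_unicode


def decode(data, to_unicode):
    """Simplified decoder: ASCII bytes pass through, everything else is
    treated as a two-byte sequence; unmapped sequences become '?'."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            out.append(chr(data[i]))
            i += 1
        else:
            out.append(to_unicode.get(data[i:i + 2], "?"))
            i += 2
    return "".join(out)


to_u, from_u = parse_ucm(UCM_EXCERPT)

# A table restricted to round-trip entries, like the implementation
# described above that lacks the second mapping.
round_trip_only = {raw: ch for raw, ch in to_u.items() if from_u.get(ch) == raw}

full_result = decode(b"\xa4\x51\xa2\xcc", to_u)
partial_result = decode(b"\xa4\x51\xa2\xcc", round_trip_only)
```

With the full table both byte sequences decode to U+5341, while the round-trip-only table turns \xA2\xCC into '?', which is the garbling described in the quoted message. Encoding U+5341 still yields only \xA4\x51, because the |3 entry is never used in the "from 10646" direction.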