Windows-1252 Best Fit tables.

Discussion:

Shawn Steele

2006-10-26 18:54:14 UTC

This is regarding the recent threads about the windows-1252 code page.

Our purpose in providing the best fit tables to Unicode was to resolve
any uncertainty about what our best-fit behavior was. These code page
tables aren't intended to replace the existing windows-1252, etc.
tables. We certainly don't expect or want these to be registered as a
separate code page. The best fit tables are merely a superset of the
existing tables on the Unicode site. For the IETF's purposes those
existing tables are preferred.

Regarding the form of the tables: the original Windows tables on the
Unicode site were apparently massaged into a normal form, which also
removed the ability to preserve the best fit behavior. Additionally the
most convenient and error-free method of creating the files was just to
copy them from the Windows Vista source tree, so these are basically our
source tables. The line endings probably got cleaned up in the copying,
but basically it's just a raw copy.

As pointed out, some of the comments (character names, etc.) aren't
accurate or use older versions. Additionally the tables appear to have
been originally created with the comments in the code page they
describe, so some of the double byte code pages that include character
examples look pretty strange when opened with a different code page.
Personally I'd ignore the comments and look at just the mappings.
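
To make the "opened with a different code page" effect concrete, here is a minimal Python sketch. It is only an illustration: the helper name view_in_code_page is made up for this example, and the sample character is an assumption rather than something taken from the actual table files.

```python
def view_in_code_page(text, authored_in, opened_as):
    """Encode text in one code page, then read the bytes back in another.

    Hypothetical helper for illustration only. Bytes that the second code
    page leaves undefined are shown as U+FFFD instead of raising.
    """
    raw = text.encode(authored_in)
    return raw.decode(opened_as, errors="replace")


# HIRAGANA LETTER A (U+3042) is the two bytes 0x82 0xA0 in Shift-JIS.
sample = "\u3042"

# Opened in the code page it was written in, the comment reads correctly.
same = view_in_code_page(sample, "shift_jis", "shift_jis")

# Opened as windows-1252, each byte becomes an unrelated character:
# 0x82 is U+201A (a low quotation mark) and 0xA0 is a no-break space.
garbled = view_in_code_page(sample, "shift_jis", "cp1252")
```

The mapping lines themselves are plain ASCII hex, which is why they survive this untouched while the comments do not.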

I'd also like to point out that the best-fit behavior itself is pretty
inconsistent, random, and sometimes funny. Mapping Infinity (U+221E) to
the digit 8 is particularly odd. We haven't updated the best-fit tables,
and don't intend to, so many logical mappings of new characters aren't
included.
These tables are also pretty old, so "new characters" in this context
could be pretty old as well. Additionally the mappings are error-prone
and could have missed obvious look-alikes or made unexpected mappings
based on an individual whim.
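
For readers who haven't seen best fit in action, the following Python sketch contrasts a strict conversion with a best-fit one. It rests on assumptions: Python's own cp1252 codec does no best fit, so the behavior is imitated with a one-entry table, and BEST_FIT, encode_best_fit and encode_strict are names invented for this example. Only the Infinity-to-8 entry comes from this thread; the real tables contain many more mappings.

```python
# A tiny stand-in for a best-fit table: Unicode code point -> replacement.
# INFINITY (U+221E) -> DIGIT EIGHT is the mapping mentioned in the thread.
BEST_FIT = {0x221E: "8"}


def encode_strict(text):
    """Convert to windows-1252, failing on anything the code page lacks."""
    return text.encode("cp1252")


def encode_best_fit(text, table=BEST_FIT):
    """Apply look-alike substitutions first, then convert.

    Characters that are neither in the code page nor in the table
    become '?' (errors="replace").
    """
    return text.translate(table).encode("cp1252", errors="replace")


original = "x < \u221e"
lossy = encode_best_fit(original)      # b"x < 8"
round_trip = lossy.decode("cp1252")    # "x < 8", the infinity sign is gone
```

The round trip shows why best fit is risky for persisted data: once the bytes are written, nothing records that the 8 used to be an infinity sign.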

Of course, as always, we prefer that applications use Unicode to persist
data, and we consider the best fit behavior to be an old idea that
hopefully people won't use any more. For those that do need this
information we hope that these tables might assist them.

I've blogged about best fit at
http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx

FWIW: Microsoft also has no intention of updating the windows code
pages. Changing them breaks people, as we discovered when adding the
Euro, and we don't want to do that again. For new locales and users not
supported by the existing code pages we recommend using Unicode.

- Shawn

Shawn Steele

***@microsoft.com

Windows International

Microsoft

Martin Duerst

2006-10-27 01:53:04 UTC

I've blogged about best fit at http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx

Very interesting, thanks for the pointer. Just one correction.
You write:
... the best plan is to use Unicode when possible, either UTF-8 or
UTF-16 is usually a good choice.
and then later in the same paragraph:
In those cases finding extensions or newer protocols that handle Unicode
are good, but some, like e-mail headers [;)], we're stuck with.

This is not true. First, with the current email standard (RFC 2822),
you can put UTF-8 into headers the same way you can put Latin-1
into headers: Using RFC 2047. The results look terrible
(e.g. =?UTF-8?Q?Martin D=C3=BCrst?= or =?iso-8859-1?Q?Martin D=FCrst?=),
but they usually work.
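
As a rough sketch of what the RFC 2047 "Q" form does, the Python below encodes a name by hand and decodes it again with the standard library. The helpers q_encode and q_decode are invented for this illustration and are deliberately simplified: they ignore the 75-character limit on encoded words and escape more bytes than strictly required. Note that "Q" encoding writes a space as "_", so the output differs slightly from the examples quoted in this thread.

```python
import string
from email.header import decode_header

# Bytes passed through unchanged; everything else becomes =XX.
_SAFE = set((string.ascii_letters + string.digits).encode("ascii"))


def q_encode(text, charset):
    """Build a single RFC 2047 encoded-word using the Q encoding (sketch)."""
    pieces = []
    for byte in text.encode(charset):
        if byte == 0x20:
            pieces.append("_")            # space is written as underscore
        elif byte in _SAFE:
            pieces.append(chr(byte))
        else:
            pieces.append("=%02X" % byte)  # hex-escape every other byte
    return "=?%s?Q?%s?=" % (charset, "".join(pieces))


def q_decode(encoded_word):
    """Decode one encoded-word back to text via email.header."""
    raw, charset = decode_header(encoded_word)[0]
    return raw.decode(charset)


name = "Martin D\u00fcrst"
as_utf8 = q_encode(name, "UTF-8")         # =?UTF-8?Q?Martin_D=C3=BCrst?=
as_latin1 = q_encode(name, "iso-8859-1")  # =?iso-8859-1?Q?Martin_D=FCrst?=
```

Both forms decode back to the same name; the difference is that the UTF-8 form can carry any Unicode character, while the iso-8859-1 form fails for anything outside Latin-1.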

Second, there is now an effort underway, the IETF EAI WG, to move
to raw UTF-8 email headers (as well as UTF-8 in SMTP, POP, IMAP,...).
The basics are very easy (much easier of course than RFC 2047),
the main problem is fallbacks for those email servers that may
not support this new emerging standard for a while.
You can find more details at
http://www.ietf.org/html.charters/eai-charter.html

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

Shawn Steele

2006-10-27 05:36:23 UTC

Of course those *should* work, and I'm aware of the effort to allow UTF-8 email headers, but in general I was thinking of those devices (like some cell phones) that only understand shift-jis or some such. I should probably clarify the post.

I'm also not certain that when I said "use Unicode" that I would've included Unicode in the form "=?UTF-8?Q?Martin D=C3=BCrst?= or =?iso-8859-1?Q?Martin D=FCrst?=" :-), although IMHO that's certainly better than choosing other non-Unicode code pages.

- Shawn

________________________________

From: Martin Duerst [mailto:***@it.aoyama.ac.jp]
Sent: Thu 10/26/2006 6:53 PM
To: Shawn Steele; ietf-***@mail.apps.ietf.org
Cc: Mike Ksar
Subject: [OT] Re: Windows-1252 Best Fit tables.

I've blogged about best fit at http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx

Martin Duerst

2006-10-27 06:19:09 UTC

Post by Shawn Steele
Of course those *should* work, and I'm aware of the effort to allow UTF-8
email headers, but in general I was thinking of those devices (like some
cell phones) that only understand shift-jis or some such. I should
probably clarify the post.
I'm also not certain that when I said "use Unicode" that I would've
included Unicode in the form "=?UTF-8?Q?Martin D=C3=BCrst?= or

I don't like that one myself either, but with regards to best
fit, I would include it.

Post by Shawn Steele
=?iso-8859-1?Q?Martin D=FCrst?=" :-),

That one of course isn't Unicode; for many characters,
some best fit mapping would be applied.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp