Shawn Steele
2006-10-26 18:54:14 UTC
This is regarding the recent threads about the windows-1252 code page.
Our purpose in providing the best fit tables to Unicode was to resolve
any uncertainty about what our best-fit behavior was. These code page
tables aren't intended to replace the existing windows-1252, etc.
tables. We certainly do not expect or want these to be registered as
separate code pages. The best fit tables are merely a superset of the
existing tables on the Unicode site. For the IETF's purposes those
existing tables are preferred.
Regarding the form of the tables: the original Windows tables on the
Unicode site were apparently massaged into a normal form, which also
removed the ability to preserve the best fit behavior. Additionally, the
most convenient and error-free method of creating the files was just to
copy them from the Windows Vista source tree, so these are basically our
source tables. The line endings probably got cleaned up in the copying,
but basically it's just a raw copy.
As pointed out, some of the comments (character names, etc.) aren't
accurate or use older versions. Additionally, the tables appear to have
been originally created with the comments in the code page they
describe, so some of the double byte code pages that include character
examples look pretty strange when opened with a different
code page. Personally I'd ignore the comments and look at just the
mappings.
I'd also like to point out that the best-fit behavior itself is pretty
inconsistent, random, and sometimes funny. Mapping Infinity (U+221E) to 8 is
particularly odd. We haven't updated the best-fit tables, and don't
intend to, so many logical mappings of new characters aren't included.
These tables are also pretty old, so "new characters" in this context
could be pretty old as well. Additionally the mappings are error-prone
and could have missed obvious look-alikes or made unexpected mappings
based on an individual whim.
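
For anyone unsure what "best fit" means in practice, here is a rough
Python sketch of the idea, not the actual Windows implementation. The
names BEST_FIT_SAMPLE, encode_best_fit and round_trips are made up for
illustration, and the sample table is an assumption apart from the
Infinity to 8 entry mentioned above; the real mappings are in the
published tables. It uses Python's built-in cp1252 codec for the
ordinary (round-trip) mappings and falls back to the sample table only
for characters the code page cannot represent.

```python
# Sketch only: a tiny stand-in for a best-fit table. Apart from the
# INFINITY -> "8" entry mentioned in this message, the entries are
# hypothetical examples, not copied from the published tables.
BEST_FIT_SAMPLE = {
    "\u221e": "8",  # INFINITY -> DIGIT EIGHT (mentioned above)
    "\u0100": "A",  # hypothetical look-alike: LATIN CAPITAL LETTER A WITH MACRON
}


def encode_best_fit(text, table=BEST_FIT_SAMPLE, codec="cp1252"):
    """Encode text, substituting a look-alike when the codec has no mapping."""
    out = bytearray()
    for ch in text:
        try:
            # Ordinary mapping: the character exists in the code page.
            out += ch.encode(codec)
        except UnicodeEncodeError:
            # Best-fit fallback, or "?" when the table has no entry either.
            out += table.get(ch, "?").encode(codec)
    return bytes(out)


def round_trips(text, table=BEST_FIT_SAMPLE, codec="cp1252"):
    """True when encoding and then decoding gives back the original text."""
    return encode_best_fit(text, table, codec).decode(codec) == text
```

The point of the sketch is that the fallback is lossy: round_trips()
returns False for any string that needed a best-fit substitution, which
is one reason to persist data as Unicode instead.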
Of course, as always, we prefer that applications use Unicode to persist
data, and we consider the best fit behavior to be an old idea that
hopefully people won't use any more. For those that do need this
information we hope that these tables might assist them.
I've blogged about best fit at
http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx
FWIW: Microsoft also has no intention of updating the Windows code
pages. Changing them breaks people, as we discovered when adding the
Euro, and we don't want to do that again. For new locales and users not
supported by the existing code pages we recommend using Unicode.
- Shawn
Shawn Steele
***@microsoft.com
Windows International
Microsoft