Leif Halvard Silli
2011-12-15 11:19:56 UTC
Charset name:
unicode
Charset aliases:
No aliases. (This is a willful violation of the spec upon
which the registration of 'unicode' is based, see the NB!)
Suitability for use in MIME text:
The 'unicode' charset labels the little-endian 'subset' of
'UTF-16' and thus shares the same issue: It does 'not encode
line endings in the way required for MIME "text" media'.
[1] http://tools.ietf.org/rfc/rfc2781.txt
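The line-ending issue can be illustrated with a minimal sketch. It uses Python's built-in 'utf-16-le' codec as a stand-in for the 'unicode' payload (without the BOM); the sample string is an assumption chosen for illustration.

```python
# Sketch: compare how a CRLF line ending is serialized.
text = "a\r\nb"

ascii_bytes = text.encode("ascii")
utf16le_bytes = text.encode("utf-16-le")  # little-endian; this codec adds no BOM

print(ascii_bytes)    # b'a\r\nb'
print(utf16le_bytes)  # b'a\x00\r\x00\n\x00b\x00'

# MIME "text" media expect the adjacent octets 0x0D 0x0A as the line break.
# In the 16-bit form each of them is followed by a 0x00 octet instead.
crlf_present = b"\r\n" in utf16le_bytes
print(crlf_present)   # False
```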
Published specification(s):
The 'unicode' charset label covers 'codepage 1200':
[2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
Codepage 1200 covers a little-endian representation of UTF-16,
including the BOM: 'Unicode UTF-16, little endian byte order
(BMP of ISO 10646);'.
[3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
The reference to 'Unicode UTF-16' is taken to mean that the
BOM MUST be present.
ISO 10646 equivalency table:
The 'unicode' charset is equivalent to the BMP.[1][2]
Additional information:
The 'unicode' charset can be understood as the little-endian
'subset' of 'UTF-16'. Thus, like 'UTF-16', it includes the BOM: If
the resource doesn't contain a BOM, then it isn't 'unicode'-encoded.
Applications that generate resources carrying the 'unicode' label
(example: <META content="text/html; charset=unicode"
http-equiv=Content-Type>) are known to insert the BOM. When parsing
e.g. media of MIME type 'text/html', Internet Explorer is known
to NOT pick 'unicode' (or any other of the 16-bit UTF variants)
as the encoding unless there is a BOM. (Minor exception for
'text/html': If the HTTP Content-Type: header contains 'unicode'
in the charset parameter, then IE renders the 'text/html' resource
fine even without a BOM - but only as long as the resource isn't
loaded from cache.)
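The BOM requirement described above could be enforced by a consumer roughly as follows. This is a sketch only: the function name decode_unicode_charset is hypothetical, and the check does not try to tell the UTF-16 little-endian BOM apart from a UTF-32 little-endian BOM, which starts with the same two octets.

```python
import codecs


def decode_unicode_charset(data: bytes) -> str:
    """Decode bytes labelled 'unicode', insisting on the little-endian BOM.

    Hypothetical helper for illustration; not taken from any specification.
    """
    bom = codecs.BOM_UTF16_LE  # the two octets FF FE
    if not data.startswith(bom):
        raise ValueError("no little-endian BOM: not 'unicode'-encoded")
    # Drop the BOM, then decode the remainder as little-endian UTF-16.
    return data[len(bom):].decode("utf-16-le")


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
decoded = decode_unicode_charset(sample)
print(decoded)  # hi
```

A resource without the BOM is rejected by this sketch rather than guessed at, which matches the reading that such a resource is not 'unicode'-encoded.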
NB! Alias: At the time of this registration, the spec upon which
the registration of the 'unicode' and 'unicodeFFFE' charsets is
based defines 'utf-16' (lowercase) as an alias for 'unicode'.[2]
This is incompatible with the registered semantics of (uppercase)
'UTF-16' (RFC2781) as it causes implementations - such as Internet
Explorer (IE) - to interpret 'utf-16' (irrespective of case) to mean
'little-endian'. Usually, because a BOM takes precedence (the BOM is
a MUST for all of 'unicode', 'unicodeFFFE' and 'UTF-16'), the problem is
solved by the BOM. But otherwise, unless implementations adhere to
the 'unicode' registration and thus reject 'utf-16' as an alias for
'unicode', big-endian MIME text resources that are labelled as
'UTF-16' risk being mis-rendered (causing 'mojibake').
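The mis-rendering risk can be reproduced with a small sketch. It assumes Python's built-in codecs as stand-ins: 'utf-16-le' plays the part of an implementation that treats 'utf-16' as an alias for 'unicode', and the generic 'utf-16' codec plays the part of an implementation that lets the BOM decide.

```python
import codecs

text = "AB"
big_endian = text.encode("utf-16-be")  # octets 00 41 00 42, no BOM

# Read as little-endian, each octet pair is swapped: 00 41 becomes U+4100.
misread = big_endian.decode("utf-16-le")
print(misread == "\u4100\u4200")  # True: two unrelated characters, not "AB"

# With a big-endian BOM in front, the byte order is taken from the BOM.
with_bom = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
print(with_bom)  # AB
```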
Intended usage:
LIMITED USE. It is used by a large community of Microsoft product
users, but is also supported, across different platforms, by products
that want to be compatible. 'Compatible' here covers e.g. tools, such
as editors, that need to determine the encoding or to advise on the
best charset label. In that regard: Any resource that can be validly
labelled as 'unicode' could also validly (and probably ought to) be
labelled as 'UTF-16'. Another example is the encoding sniffing
algorithm of HTML5, which in certain circumstances requires charset
labels that name 'a UTF-16 encoding' (such as 'unicode') to be
interpreted as if their value were 'UTF-8'.
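The HTML5 behaviour mentioned above might be sketched like this. The function name effective_meta_charset and the label set are assumptions made for illustration; the set is not meant to be exhaustive, and the real algorithm applies the override only in certain circumstances that this sketch does not model.

```python
# Illustrative (assumed, non-exhaustive) set of labels naming a UTF-16 encoding.
UTF16_LABELS = {"unicode", "unicodefffe", "utf-16", "utf-16le", "utf-16be"}


def effective_meta_charset(label: str) -> str:
    """Return the label a parser would act on for a charset value.

    Hypothetical helper: a label naming a UTF-16 encoding is treated
    as if it had been 'utf-8'; any other label is only normalized.
    """
    normalized = label.strip().lower()
    if normalized in UTF16_LABELS:
        return "utf-8"
    return normalized


print(effective_meta_charset("unicode"))     # utf-8
print(effective_meta_charset("ISO-8859-1"))  # iso-8859-1
```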
Person & email address to contact for further information:
Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no