Registration of new charset 'unicode'

Discussion:

Leif Halvard Silli

2011-12-15 11:19:56 UTC

Charset name:
unicode

Charset aliases:
No aliases. (This is a willful violation of the spec upon
which the registration of 'unicode' is based, see the NB!)

Suitability for use in MIME text:
The 'unicode' charset labels the little-endian 'subset' of
'UTF-16' and thus shares the same issue: It does 'not encode
line endings in the way required for MIME "text" media'.
[1] http://tools.ietf.org/rfc/rfc2781.txt

Published specification(s):
The 'unicode' charset label covers 'codepage 1200':
[2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
Codepage 1200 covers a little-endian representation of UTF-16,
including the BOM: 'Unicode UTF-16, little endian byte order
(BMP of ISO 10646);'.
[3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
The reference to 'Unicode UTF-16' is taken to mean that the
BOM MUST be present.

ISO 10646 equivalency table:
The 'unicode' charset is equivalent to the BMP.[1][2]

Additional information:
The 'unicode' charset can be understood as the little-endian
'subset' of 'UTF-16'. Thus, like 'UTF-16', it includes the BOM: If
the resource doesn't contain a BOM, then it isn't 'unicode'-encoded.
Applications that generate resources carrying the 'unicode' label
(example: <META content="text/html; charset=unicode"
http-equiv=Content-Type>) are known to insert the BOM. When parsing
e.g. media of MIME type 'text/html', Internet Explorer is known
to NOT pick 'unicode' (or any other of the 16-bit UTF variants)
as the encoding unless there is a BOM. (Minor exception for
'text/html': If the HTTP Content-Type: header contains 'unicode'
in the charset parameter, then IE renders the 'text/html' resource
fine even without a BOM - but only as long as the resource isn't
loaded from cache.)
NB! Alias: At the time of this registration, the spec upon which
the registration of the 'unicode' and the 'unicodeFFFE' charsets is
based defines 'utf-16' (lowercase) as an alias for 'unicode'.[2]
This is incompatible with the registered semantics of (uppercase)
'UTF-16' (RFC2781) as it causes implementations - such as Internet
Explorer (IE) - to interpret 'utf-16' (irrespective of case) to mean
'little-endian'. Usually, because a BOM takes precedence (the BOM is
a MUST for all of 'unicode', 'unicodeFFFE' and 'UTF-16'), the problem is
solved by the BOM. But otherwise, unless implementations adhere to
the 'unicode' registration and thus reject 'utf-16' as an alias for
'unicode', big-endian MIME text resources that are labelled as
'UTF-16' risk being mis-rendered (causing 'mojibake').
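
As a minimal sketch of the BOM requirement described above (an
illustration in Python, not part of the registration; the function
names are hypothetical), a check for whether a byte sequence could be
'unicode'-encoded under this draft's reading might look like this:

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None, based on the leading BOM.

    Simplification: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE;
    this sketch does not try to tell the two apart.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None


def could_be_ms_unicode(data: bytes) -> bool:
    """Under this draft, 'unicode' means little-endian UTF-16 with a BOM."""
    return sniff_utf16_bom(data) == "utf-16-le"
```

A resource without any BOM yields None from the sniffing step and is
therefore not treated as 'unicode'-encoded by this check.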

Intended usage:
LIMITED USE. It is used by a large community of Microsoft product
users, but is also supported, across different platforms, by products
that want to be compatible. By 'compatible' is meant e.g. tools, such
as editors, in need of determining the encoding or advice about the
best charset label. In that regard: Any resource that can be validly
labelled as 'unicode' could also validly (and probably ought to) be
labelled as 'UTF-16'. Another example is the encoding sniffing
algorithm of HTML5, which in certain circumstances requires charset
labels that contain 'a UTF-16 encoding' (such as 'unicode') as their
value to be interpreted as if their value instead was 'UTF-8'.

Person & email address to contact for further information:
Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no

Leif Halvard Silli

2011-12-19 23:53:16 UTC

Charset name:
'unicode'

Charset aliases:
The 'unicode' spec defines 'utf-16' as its alias, but this of
course contradicts with 'utf-16' as defined in the IANA registry.

Suitability for use in MIME text:
The 'unicode' charset has same MIME text media issue as utf-16.
[1] http://tools.ietf.org/rfc/rfc2781.txt

Published specification(s):
Microsoft's 'Character Set Recognition' document, [1]
together with the 'Code Page Identifiers' document.[2]
[2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
[3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx

ISO 10646 equivalency table:
The 'unicode' charset represents codepage 1200, whose definition
is: [3] 'Unicode UTF-16, little endian byte order (BMP of ISO
10646);'

Additional information:
* Byte order mark (BOM): The 'unicode' charset specifications do
not explain whether the BOM is required or recommended. However,
without it, products may fail to determine the encoding. And also: The
BOM allows products that do not support 'unicode' to perceive the
encoding as 'utf-16'. Hence it is not very surprising that products
that label documents with the 'unicode' label tend to include the BOM.
Hence, BOM in 'unicode'-encoded documents should be seen as strongly
recommended.
ISSUES to consider before adding support for the 'unicode'
charset in a product:
(1) Document-internal charset declarations: If the label says
'unicode' but the resource (including the BOM) is big-endian, products
(including those that support 'unicode') tend to rely on the BOM and
ignore the charset label. But note that error handling (e.g.
mislabelling, including unknown labels, is a fatal error per XML 1.0
5th edition) and encoding detection (e.g. as described in Appendix F of
XML 1.0 5th edition), could also make the charset label technically
irrelevant. Finally, internal declarations of 16-bit encodings tend to
be without encoding determinative effect in 16-bit encoded documents -
in fact, as labels, the different utf-16 labels tend to be more
effective inside 8-bit encoded documents, where they tend to be treated
like UTF-8 declarations
(2) Document-external encoding declarations:
(a) Products that implement Microsoft's 'unicode' specifications
(in particular Web browsers Internet Explorer and Webkit) in addition
tend to ignore charset info from HTTP for documents that include the
BOM. Whereas current web standards (HTTP, XML 1.0, HTML 4 and HTML 5)
tend to see the charset set by the higher protocol as authoritative.
Thus, adding support for 'unicode' will not increase the product
convergence for those cases when the problem is disagreement about the
order of priority with regard to BOM and HTTP.
(b) Little-endian default: If the BOM is lacking while the
'unicode' label is present (and detected), then products that support
the 'unicode' charset will default to little-endian. (Whereas RFC2781
requires them to default to big-endian.)
(c) The 'utf-16' label: Because the 'unicode' specs define
'utf-16' as an alias for 'unicode', a lacking BOM when the 'utf-16'
label is present (and detected), will cause a little-endian default as
well. This goes against the utf-16 specification [RFC2781], which for
'utf-16' asks for a default to big-endian. (Whereas RFC2781 requires
them to default to big-endian.)
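
The effect of the two competing defaults in (b) and (c) can be shown
with a small Python sketch (an illustration only; the variable names
are made up for this example). The same BOM-less bytes are decoded
once with the big-endian default and once with the little-endian
default:

```python
text = "Hi"

# Big-endian UTF-16 without a BOM: bytes 00 48 00 69.
big_endian_no_bom = text.encode("utf-16-be")

# Receiver that defaults to big-endian when the BOM is missing.
decoded_big_endian_default = big_endian_no_bom.decode("utf-16-be")

# Receiver that defaults to little-endian (the 'unicode' behaviour):
# the byte pairs are read as 0x4800 and 0x6900 instead of 0x0048 and 0x0069.
decoded_little_endian_default = big_endian_no_bom.decode("utf-16-le")

print(repr(decoded_big_endian_default))     # 'Hi'
print(repr(decoded_little_endian_default))  # two CJK characters (mojibake)
```

With a BOM present, both kinds of receiver would agree, which is why
the missing-BOM case is the one that matters here.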

Intended usage:
LIMITED USE. 'Unicode' is used by several Microsoft products
(.NET, Internet Explorer and more) and products that want to be
compatible, such as Webkit.

Person & email address to contact for further information:
Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no

Martin J. Dürst

2011-12-20 05:19:58 UTC

Hello Leif,

Below are some comments. Please also incorporate the proposals given in
other mail; they all look good to me.

Post by Leif Halvard Silli
'unicode'
The 'unicode' spec defines 'utf-16' as its alias, but this of
course contradicts with 'utf-16' as defined in the IANA registry.

Please reword along the following lines:

Charset aliases:
None.

Note: Documents by Microsoft mention 'utf-16' as an alias,
but this contradicts the registration of 'utf-16'.

(Main points: Put the actual information first, the explanations later,
and separate them clearly. Don't use the word 'spec' for the Microsoft
side. Don't use "of course" and similar argumentative wording.)

Post by Leif Halvard Silli
The 'unicode' charset has same MIME text media issue as utf-16.
[1] http://tools.ietf.org/rfc/rfc2781.txt

Again, put the actual information first. The reader should not have to
check another document just to get this information.

Post by Leif Halvard Silli
Microsoft's 'Character Set Recognition' document, [1]
together with the 'Code Page Identifiers' document.[2]
[2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
[3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx

There seems to be some mixup with the numbers in []. Just don't use
them, give URIs in the text itself.

Post by Leif Halvard Silli
The 'unicode' charset represents codepage 1200, whose definition
is: [3] 'Unicode UTF-16, little endian byte order (BMP of ISO
10646);'

I'd just say "see published specification".

Post by Leif Halvard Silli
* Byte order mark (BOM): The 'unicode' charset specifications do
not explain whether the BOM is required or recommended. However,
without it, products may fail to determine the encoding. And also: The
BOM allows products that do not support 'unicode' to perceive the
encoding as 'utf-16'. Hence it is not very surprising that products
that label documents with the 'unicode' label tend to include the BOM.
Hence, BOM in 'unicode'-encoded documents should be seen as strongly
recommended.

Please remove argumentation such as "it is not very surprising" or "and
also".

Also, I'd separate advice for generation and for reception. E.g. say:
- When sending content, instead of using 'unicode', use 'utf-16' and
make sure that you have a BOM.

and so on. This also applies to the text below.

Post by Leif Halvard Silli
ISSUES to consider before adding support for the 'unicode'
(1) Document-internal charset declarations: If the label says
'unicode' but the resource (including the BOM) is big-endian, products
(including those that support 'unicode') tend to rely on the BOM and
ignore the charset label. But note that error handling (e.g.
mislabelling, including unknown labels, is a fatal error per XML 1.0
5th edition) and encoding detection (e.g. as described in Appendix F of
XML 1.0 5th edition), could also make the charset label technically
irrelevant. Finally, internal declarations of 16-bit encodings tend to
be without encoding determinative effect in 16-bit encoded documents -
in fact, as labels, the different utf-16 labels tend to be more
effective inside 8-bit encoded documents, where they tend to be treated
like UTF-8 declarations
(a) Products that implement Microsoft's 'unicode' specifications
(in particular Web browsers Internet Explorer and Webkit) in addition
tend to ignore charset info from HTTP for documents that include the
BOM. Whereas current web standards (HTTP, XML 1.0, HTML 4 and HTML 5)
tend to see the charset set by the higher protocol as authoritative.
Thus, adding support for 'unicode' will not increase the product
convergence for those cases when the problem is disagreement about the
order of priority with regard to BOM and HTTP.
(b) Little-endian default: If the BOM is lacking while the
'unicode' label is present (and detected), then products that support
the 'unicode' charset will default to little-endian. (Whereas RFC2781
requires them to default to big-endian.)
(c) The 'utf-16' label: Because the 'unicode' specs define
'utf-16' as an alias for 'unicode', a lacking BOM when the 'utf-16'
label is present (and detected), will cause a little-endian default as
well. This goes against the utf-16 specification [RFC2781], which for
'utf-16' asks for a default to big-endian. (Whereas RFC2781 requires
them to default to big-endian.)
LIMITED USE. 'Unicode' is used by several Microsoft products
(.NET, Internet Explorer and more) and products that want to be
compatible, such as Webkit.

Separate "LIMITED USE" and the remaining text by a line.

Regards, Martin.

Post by Leif Halvard Silli
Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no

Leif Halvard Silli

2011-12-21 11:32:41 UTC

Martin, thanks for the comments. Here is a new try. Leif H. S.

Charset name:
'unicode'

Hereafter referred to as MS 'unicode'.

Charset aliases:
none

NOTE: The published specification mentions 'utf-16' as an alias,
but this contradicts the registration of 'utf-16'. However,
this does allow us to recommend sending as 'utf-16' rather
than sending as the LIMITED USE MS 'unicode' charset.
See Additional information.

Suitability for use in MIME text:
Not suitable.

Reason: Like 'utf-16', the MS 'unicode' charset does not encode
line endings in the way required for MIME "text" media, see
http://tools.ietf.org/rfc/rfc2781.txt

Published specification(s):
The document 'Character Set Recognition'
http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
together with the document 'Code Page Identifiers'.
http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx

ISO 10646 equivalency table:
See the published specifications.

Additional information:

* The label:
The MS 'unicode' label can be seen as representing the little-endian
'half' - or variant - of 'utf-16'. Unlike 'utf-16', which is a
suitable label for both the little-endian and the big-endian variant,
MS 'unicode' can no longer be used if the encoding is switched from
little-endian to big-endian. Thus, MS 'unicode' is complicated to use
compared with 'utf-16'.

* When to send the MS 'unicode' label:
1. First determine whether it is advisable to send any charset
label at all. For example, starting with HTML5, conforming utf-16
encoded HTML documents are not allowed to contain a document-internal
encoding declaration. Also: When the BOM is included, the charset is
self-describing to receivers, and XML then does not require any charset
label for the particular encoding that MS 'unicode' represents. Note as
well that even if there is a BOM, XML parsers are nevertheless
obligated to emit a fatal error if the charset label is unknown to the
parser. They must also emit a fatal error if the charset is known but
differs from the actual encoding. (Thus, a fatal error would be due if
MS 'unicode' labelled a big-endian file.) In other applications,
an unknown charset might trigger charset defaulting or encoding
guessing, which might make no difference from not sending any label at
all.
2. When a label is wanted, the general rule should be to not send
the MS 'unicode' label, but to instead include the BOM and send it as
'utf-16' - see note under Charset aliases above. By doing this one is
complying with both the MS 'unicode' published specification and
the 'utf-16' standard.
3. If one sends the MS 'unicode' label anyway, then one should be
sure to include the BOM, as this increases the chance that consumers
might handle it even if the label is unknown.
4. If one sends the label without the accompanying BOM, and if
the document is little-endian, as it should be according to the
published specifications of MS 'unicode', then note that per the
'utf-16' registration, products are required to default to big-endian.
I.e. this is not advisable.
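
The sending advice in points 2 and 3 could be sketched as follows in
Python. This is only an illustration under stated assumptions: the
function name is hypothetical, and the little-endian byte order is
chosen because that is what MS 'unicode' represents.

```python
import codecs


def encode_for_sending(text):
    """Return (charset label, body bytes) for a little-endian document.

    The body starts with an explicit little-endian BOM (FF FE), and the
    label sent is 'utf-16' rather than MS 'unicode'.
    """
    body = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return "utf-16", body


label, body = encode_for_sending("Hi")
# A generic 'utf-16' decoder reads the BOM and picks the byte order from it.
round_trip = body.decode("utf-16")
```

The BOM is written explicitly instead of relying on
text.encode("utf-16"), because that call uses the platform's native
byte order, which is not guaranteed to be little-endian.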

* Receiving the MS 'unicode' label:
(a) The MS 'unicode' label should be treated like 'utf-16', which
means that receivers should expect to see and interpret the BOM. In XML
applications, parsers must check that they know the MS 'unicode'
charset label and if they don't, they must emit a fatal error. Also, if
they know it but the actual encoding does not comply with the label -
e.g. because the encoding is big-endian - then they must emit a fatal
error as well.
(b) If the receiver knows the MS 'unicode' label and the label is
seen in a resource that lacks the BOM, then receivers should treat the
label as equivalent to 'utf-16le'.
(c) If the MS 'unicode' label is seen in an 8-bit encoded
resource, then products should treat it as they would have treated a
'utf-16' label in the same context. E.g. the HTML5 parser in that
context requires the 'utf-16' label to be replaced with the 'utf-8' label.
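
Rules (a) and (b) above could be sketched in Python roughly as
follows. This is an illustration only: the function name and the
'strict' flag (standing in for XML-style fatal errors) are
assumptions, and rule (c) is left out because it depends on the
surrounding format.

```python
import codecs


def decode_labelled(data, label, strict=False):
    """Decode 'data' that carries an MS 'unicode' or 'utf-16' label."""
    label = label.strip().lower()
    if label not in ("unicode", "utf-16"):
        raise LookupError("label not handled by this sketch: " + label)

    # Rule (a): a BOM, when present, decides the byte order.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        if strict and label == "unicode":
            # XML-style handling: the label contradicts the actual encoding.
            raise ValueError("'unicode' label on big-endian data")
        return data[2:].decode("utf-16-be")

    # Rule (b): no BOM, so MS 'unicode' is treated like 'utf-16le'.
    # A BOM-less 'utf-16' falls back to big-endian, as in RFC 2781.
    fallback = "utf-16-le" if label == "unicode" else "utf-16-be"
    return data.decode(fallback)
```

For example, decode_labelled(b"H\x00i\x00", "unicode") takes the
little-endian fallback, while the same call with a big-endian BOM and
strict=True raises an error instead of silently trusting the BOM.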

* Utility of MS 'unicode' in face of receivers that disagree about
priority and other encoding interpretation details:
Products that support MS 'unicode' do, at this time, tend to
disagree with competing products about the order of priority with
regard to BOM and HTTP. Simply adding MS 'unicode' support without
aligning the priorities would in these cases not increase the
convergence with the competing products. A similar example: Some
products might prefer sniffing the encoding rather than reading labels.

Intended usage:
LIMITED USE.

MS 'Unicode' is added - and interpreted - by several Microsoft
products (.NET, Internet Explorer, Microsoft Office and more) as well
as by competitors, such as Webkit.

Person & email address to contact for further information:
Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no