ban the use and implementation of UTF-7

Discussion:

Misha Wolf

2006-12-14 22:18:02 UTC

fyi

-----Original Message-----
From: www-tag-***@w3.org [mailto:www-tag-***@w3.org] On Behalf
Of Roy T. Fielding
Sent: 14 December 2006 22:13
To: W3C TAG
Subject: ban the use and implementation of UTF-7

Over the years I have seen a number of security exploits that make
use of broken browsers that sniff character encodings in combination
with UTF-7 encoded tags or javascript commands. I have never actually
seen anyone use UTF-7 for anything legitimate (other than testing).

Is there some reason why WWW clients need to support UTF-7?

It seems completely unnecessary given the now ubiquitous use of 8-bit
clean transports and the presence of UTF-8, which IIRC was defined
long after UTF-7. However, the wider community may be aware of
some reason why browsers should support it, so I'd like to hear
your comments.

If there is no need for UTF-7, I'd like the TAG to consider it an
issue for the sake of asking browsers to remove its implementation
and banning its use by servers.

I know this won't solve any problems for deployed clients, and
wouldn't be an issue at all if servers used the same algorithm for
escaping characters that clients used to interpret them, but in the
long term it will simplify some checks for XSS attacks and I don't
think it will harm the Web. That is, unless there is some significant
body of content out there that is encoded as UTF-7.

Cheers,

Roy T. Fielding <http://roy.gbiv.com/>
Chief Scientist, Day Software <http://www.day.com/>

This email was sent to you by Reuters, the global news and information company.
To find out more about Reuters visit www.about.reuters.com

Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd.

Mark Davis

2006-12-15 00:39:00 UTC

Speaking as one of the authors, I think it is clear that UTF-7 should only
be supported where really necessary; only in environments that are not 8-bit
clean. It was originally designed for email, but in this day and age, 8-bit
clean email transport is really not much of an issue.

Mark


Kari Hurtta

2006-12-15 14:49:16 UTC

Well, UTF-7 is attractive on mail header fields, because 8-bit is not yet allowed.

/ Kari Hurtta

Keith Moore

2006-12-15 23:00:02 UTC

Post by Kari Hurtta
Well, UTF-7 is attractive on mail header fields, because 8-bit is not yet allowed.

for better or worse, RFC 2047 is well-established for mail header
fields. UTF-7 would not be worth the disruption it would cause.

Frank Ellermann

2006-12-16 09:19:28 UTC

Post by Keith Moore
for better or worse, RFC 2047 is well-established for mail header
fields. UTF-7 would not be worth the disruption it would cause.

Sometimes I got "misdirected bounces" (spam in the style of a DSN)
using the variant of UTF-7 registered as UNICODE-1-1-UTF-7. It's
not really worse than SCSU.

Frank

Kari Hurtta

2006-12-16 05:45:25 UTC

Post by Keith Moore
Post by Kari Hurtta
Well, UTF-7 is attractive on mail header fields, because 8-bit is not yet allowed.
for better or worse, RFC 2047 is well-established for mail header
fields. UTF-7 would not be worth the disruption it would cause.

Martin Duerst

2006-12-16 02:40:37 UTC

Post by Kari Hurtta
Well, UTF-7 is attractive on mail header fields, because 8-bit is not yet allowed.

I don't understand this. Raw 8-bit isn't allowed, but base64 or qp
can be used (including the identification of the encoding).
You may be able to put UTF-7 into email header fields, even at
some positions where you can't put RFC 2047 stuff, but no
MUA (or MTA, for that matter) will recognize it and use the actual
characters, which means it's pretty useless.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

Kari Hurtta

2006-12-17 13:02:59 UTC

Post by Martin Duerst

Post by Kari Hurtta
Well, UTF-7 is attractive on mail header fields, because 8-bit is not yet allowed.

Of course, when UTF-7 is used in mail headers, RFC 2047 is used for labeling.

UTF-7 with RFC 2047 simply produces a much shorter encoding than
UTF-8 with RFC 2047.

UTF-7 is used here ... :-)

When UTF-7 is used, RFC 2047's quoted-printable ("Q") encoding does not
encode any characters in practice. It only produces the labeling.

Compare

Subject: =?UTF-7?Q?T+AOQ-ss+AOQ-_esimerkki?=

versus

Subject: =?UTF-8?Q?T=C3=A4ss=C3=A4_esimerkki?=

Here there were only two non-ASCII characters in the input.

With more non-ASCII characters the difference is clearer:

Subject: =?UTF-8?Q?=C3=84iti_k=C3=A4vi_t=C3=A4=C3=A4ll=C3=A4?=

versus

Subject: =?UTF-7?Q?+AMQ-iti_k+AOQ-vi_t+AOQA5A-ll+AOQ-?=

There were 5 non-ASCII characters.
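
The size difference in the examples above can be checked with a short
script. This is only a sketch: q_encode is a hypothetical helper written
for this illustration (a simplified RFC 2047 "Q" encoder with no line
folding or length limits), and Python's built-in utf-7 codec may place the
closing "-" slightly differently from the hand-written examples, so only
the lengths are meant to be compared.

```python
# Bytes that an RFC 2047 "Q" encoded-word may carry unescaped in any header
# context: letters, digits and "!*+-/". A space is written as "_".
SAFE = set(b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!*+-/")


def q_encode(text: str, charset: str) -> str:
    """Hypothetical helper: encode text in charset, wrap as =?charset?Q?...?=."""
    parts = []
    for byte in text.encode(charset):
        if byte == 0x20:
            parts.append("_")
        elif byte in SAFE:
            parts.append(chr(byte))
        else:
            parts.append(f"={byte:02X}")
    return f"=?{charset}?Q?{''.join(parts)}?="


# Print the length of each encoded-word next to the word itself.
for subject in ("Tässä esimerkki", "Äiti kävi täällä"):
    for charset in ("UTF-7", "UTF-8"):
        word = q_encode(subject, charset)
        print(f"{len(word):3d}  {word}")
```

With UTF-7 the "Q" step has nothing left to escape, because the UTF-7
output already consists of letters, digits, "+" and "-"; with UTF-8 every
non-ASCII character costs six characters.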

/ Kari Hurtta

Deborah Goldsmith

2006-12-15 03:29:28 UTC

Speaking as the other author, I agree. :-)

Deborah

Post by Mark Davis
Speaking as one of the authors, I think it is clear that UTF-7
should only be supported where really necessary; only in
environments that are not 8-bit clean. It was originally designed
for email, but in this day and age, 8-bit clean email transport is
really not much of an issue.
Mark

Martin Duerst

2006-12-15 07:25:33 UTC

Hello Roy,

As you can see at
http://lists.w3.org/Archives/Public/www-international/2006OctDec/0144,
Mark Davis, one of the authors, essentially agrees with you.
In a followup on the ietf-charsets mailing list, Deborah Goldsmith,
the other author of the UTF-7 spec, also agrees.

The only place I'm aware of where (a variant of!) UTF-7 is used is
for IMAP folder name internationalization. See e.g.
http://www.ietf.org/rfc/rfc2192.txt for details.
In hindsight, using a UTF-7 variant in the protocol seems
unnecessary, but the original idea (mostly by Mark Crispin,
as far as I understand it) was that it could be used as is
on the server side, even on totally un-internationalized
operating systems.
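
For readers unfamiliar with the IMAP variant, here is a minimal sketch of
the mailbox-name encoding usually called Modified UTF-7, as I understand it
from the IMAP4rev1 specification: "&" replaces "+" as the shift character,
"," replaces "/" in the base64 alphabet, padding is dropped, and a literal
"&" is written "&-". The function name imap_utf7_encode is made up for this
illustration and is not part of any library.

```python
import base64


def imap_utf7_encode(name: str) -> str:
    """Sketch of Modified UTF-7 encoding for an IMAP mailbox name."""
    out = []
    pending = []  # run of characters waiting to be base64-encoded

    def flush() -> None:
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
            out.append("&" + b64.replace("/", ",") + "-")
            pending.clear()

    for ch in name:
        if ch == "&":
            flush()
            out.append("&-")      # literal ampersand
        elif 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append(ch)        # printable ASCII represents itself
        else:
            pending.append(ch)    # everything else goes through base64
    flush()
    return "".join(out)


print(imap_utf7_encode("Entwürfe"))
```

The result is always printable ASCII, which is what lets a server store it
as an ordinary file or directory name without knowing anything about
character sets.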

As for the browsers, I think they just added UTF-7 at one time
because the name looked similar to UTF-8 and UTF-16, and it was
difficult to predict exactly how these encodings would deploy.
And as in any software, it's difficult to get rid of something,
but security reasons are about the best you can come up with
for cleaning up.

As for the IANA charset registry
(http://www.iana.org/assignments/character-sets), Ned and
I (who are currently the expert reviewers) as well as the
other list participants have been talking about cleaning it
up. We don't yet have an exact idea of what needs
to be done, but being able to attach security warnings or
similar comments to an entry might be one possible way to
proceed. The problem might be that RFC 2152
(http://www.ietf.org/rfc/rfc2152.txt) might have to be updated.

But as far as the browsers are concerned, if the TAG can come
up with a finding that e.g. also gives some more details and
examples about the security issues you mention, then we might
also be able to point to this document from anything on the
IETF or IANA side.

Regards, Martin.


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

Mark Crispin

2006-12-15 20:06:40 UTC

Post by Martin Duerst
The only place I'm aware of where (a variant of!) UTF-7 is used is
for IMAP folder name internationalization. See e.g.
http://www.ietf.org/rfc/rfc2192.txt for details.
In hindsight, using a UTF-7 variant in the protocol seems
unnecessary, but the original idea (mostly by Mark Crispin,
as far as I understand it) was that it could be used as is
on the server side, even on totally un-internationalized
operating systems.

Had punycode existed at the time, it most certainly would have been used
in IMAP instead of Modified UTF-7.

It would have been impossible to use UTF-8 at the time. First, there were
still active 7-bit servers. Second, there were certain foreign servers
which followed the "just send 8-bits" Pied Piper and transmitted strings
in ISO-8859-1, Shift-JIS, etc. Spread of the latter had to be checked,
and existing deployments exterminated; and Modified UTF-7 was the best of
a set of bad choices at the time.

It took quite some time to exterminate the servers which did "just send
8-bits"; and that's what held up the usage of UTF-8. There's some reason
to believe that the extermination isn't complete yet, and that there are
still a few cockroaches scurrying around. But the situation is much
better than it was a decade ago.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.

Jon Hanna

2006-12-19 18:31:06 UTC

Post by Misha Wolf
It seems completely unnecessary given the now ubiquitous use of 8-bit
clean transports and the presence of UTF-8, which IIRC was defined
long after UTF-7. However, the wider community may be aware of
some reason why browsers should support it, so I'd like to hear
your comments.

I somewhat suspect that browsers support UTF-7 due to either:
1. The necessary transcoding is done by a library or OS service and
UTF-7 is one of the charsets the library gives you, so coding for one
codes for all.
2. The necessary transcoding is done by a charset nerd. S/he really
likes charsets. UTF-7 is yet another they can add to their work and
hence feel that bit more complete.

Addison Phillips

2006-12-19 18:51:10 UTC

The problem, as I see it, is not that the browsers support conversion
using UTF-7. It is that they auto-detect UTF-7. Since UTF-7 uses plain
ASCII characters to form escape sequences, it turns out to be trivial to
fool a browser into detecting UTF-7, causing an XSS security hole.

Some of us have spent more than ample time building anti-UTF-7 code
(such as judiciously replacing '+' in UTF-7 spoof sequences with
a character reference such as '&#43;'). It is nutty.

I agree with the basic premise of Roy's that UTF-7 ought to be banned.
But it would be simpler to remove it from the list of things
auto-detected by user agents. A page that actually uses UTF-7 really
REALLY ought to declare that encoding (in which case no security flaw is
present). Otherwise it should display as mojibake.
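
To make the problem concrete, here is a small sketch of why such a payload
slips past ordinary HTML escaping. It uses Python's built-in utf-7 codec to
stand in for a user agent that has been fooled into treating the page as
UTF-7. The name neutralize_plus is hypothetical, and the "&#43;" character
reference is my assumption about what a '+' replacement might look like; it
is not anyone's production code.

```python
import html

# Pure ASCII: there is no "<" or ">" here for an HTML escaper to catch.
payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
escaped = html.escape(payload)  # comes back unchanged


def as_utf7(page_text: str) -> str:
    """Roughly what a UTF-7-sniffing user agent would end up parsing."""
    return page_text.encode("ascii").decode("utf-7")


def neutralize_plus(page_text: str) -> str:
    """Hypothetical mitigation: break UTF-7 shift sequences by escaping '+'."""
    return page_text.replace("+", "&#43;")


print(as_utf7(escaped))                   # the script tags reappear
print(as_utf7(neutralize_plus(escaped)))  # no shift sequences, so no tags
```

Declaring the encoding explicitly avoids the whole exercise, since the user
agent then has no reason to guess.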

Addison

--
Addison Phillips
Globalization Architect -- Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.

Markus Scherer

2007-01-04 22:19:01 UTC

Speaking as a library implementer and charset nerd, I agree that UTF-7
is unnecessary for the web (HTTP), and given the discussion I think
it's a good idea to not auto-detect it there.

However, for _email_ (SMTP), there is still some portion of MTAs and
MUAs that need 7-bit content. In a 7BIT mail body, for which UTF-7 was
invented, it is usually more compact than UTF-8+base64 or UTF-8+qp.
Thus, for email I would argue that UTF-7 has its place (although not
necessarily _auto-detection_ of UTF-7).

Of course, once 8BIT SMTP is universally implemented, straight UTF-8
is preferable.

Best regards,
markus

--
Opinions expressed here may not reflect my company's positions unless
otherwise noted.