GSM 03.38 substitution character?
Markus Scherer
2006-11-02 21:32:07 UTC
I am formatting the GSM 03.38 conversion table [1] into ICU .ucm file
format. The one piece of information that I am missing is the
substitution character. Is there an official one defined?

0x1A is not available because it's mapped to a graphic character. I
have 0x3F ('?') in my draft .ucm table.

(I don't see any charset registration with "gsm" from which I could
glean this information. And this is not a registration request.)

Thanks,
markus

[1] http://www.unicode.org/Public/MAPPINGS/ETSI/GSM0338.TXT
Kent Karlsson
2006-11-02 22:04:08 UTC
Post by Markus Scherer
0x1A is not available because it's mapped to a graphic character.
Note also that 0x00 is mapped to a graphic character (@) only *if* it is
followed by a non-0x00 code. If there are no more printable characters
before the end of the (normally fixed-length) string, a carriage return
is used as that following code. Otherwise, a run of 0x00 up to the end
of the string ((0x00)*<eos>) is mapped to nothing.
(Not really suitable for an ICU ucnv-conversion, IIUC.)
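
The 0x00 rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the input is already unpacked to one septet per byte, `basic_char` is a hypothetical helper that covers just a small part of the default alphabet, and the function name is made up for this example.

```python
def basic_char(code: int) -> str:
    """Hypothetical partial lookup; a real table covers all 128 codes."""
    if code == 0x00:
        return "@"
    if code == 0x0D:
        return "\r"
    if code == 0x20:
        return " "
    # Digits and basic Latin letters sit at their ASCII positions.
    if 0x30 <= code <= 0x39 or 0x41 <= code <= 0x5A or 0x61 <= code <= 0x7A:
        return chr(code)
    return "\ufffd"  # not covered by this sketch


def decode_with_padding(septets: bytes) -> str:
    """Treat a trailing run of 0x00 as padding; any other 0x00 is '@'."""
    body = septets.rstrip(b"\x00")
    return "".join(basic_char(code) for code in body)
```

A sender that really wants a final @ would follow it with 0x0D, so `decode_with_padding(b"A\x00\x0d\x00\x00")` keeps the @ and the carriage return and drops only the padding.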

Note also that the fallback for (most) Greek letters is the same-looking
uppercase Latin letter.

/kent k
Frank Ellermann
2006-11-02 22:36:56 UTC
Post by Markus Scherer
Is there an official one defined?
http://www.3gpp.org/ftp/Specs/archive/03_series/03.38/0338-720.zip
I can't really read it, it's a *.doc, but it might be what you want.

Frank
Markus Scherer
2006-11-03 00:01:24 UTC
Post by Frank Ellermann
http://www.3gpp.org/ftp/Specs/archive/03_series/03.38/0338-720.zip
I can't really read it, it's a *.doc, but it might be what you want.
Thanks, this is great!

I can't find any reference to a substitution character (searching for
"subs" and "repl" and reading portions of the text). I guess the
intention is for a sender to switch to UCS-2 if the text can't be
represented in the "default alphabet". In which case any value should
do for a conversion table because the conversion should really stop on
unmappable characters. If anyone can find something that I am not
seeing, please let me know.
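
The two behaviours discussed here (stop on an unmappable character, or write a substitution value) can be compared in a small sketch. Everything in it is an assumption for illustration: `to_gsm` is a hypothetical partial mapping, `Unmappable` and `encode` are made-up names, and 0x3F is only the value from the draft table, not an official substitution character.

```python
from typing import Optional

SUBSTITUTE = 0x3F  # '?', taken from the draft .ucm table; not an official value


class Unmappable(ValueError):
    """Raised when a character has no code in the (partial) table."""


def to_gsm(char: str) -> Optional[int]:
    """Hypothetical partial from-Unicode lookup for the default alphabet."""
    if char == "@":
        return 0x00
    if char == " " or (char.isascii() and char.isalnum()):
        return ord(char)  # digits and basic Latin letters keep their ASCII codes
    return None


def encode(text: str, substitute: Optional[int] = None) -> bytes:
    """Encode text; stop on unmappable input unless a substitute is given."""
    out = bytearray()
    for char in text:
        code = to_gsm(char)
        if code is None:
            if substitute is None:
                raise Unmappable(f"no default-alphabet code for {char!r}")
            code = substitute
        out.append(code)
    return bytes(out)
```

Calling `encode(text)` without a substitute models a sender that should switch to UCS-2 when it hits an unmappable character; passing `substitute=SUBSTITUTE` models a converter that writes a placeholder instead.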

There are a couple of places where the unicode.org table seems to
differ from this text (0338-720.doc): (In addition to the ç mentioned
in the unicode.org table.)

1. A 0x1B alone should be a space (one-way to-Unicode) but the Unicode
GSM table has it round-trip to U+00A0 NBSP. (The text says "This code
is an escape to an extension of the 7 bit default alphabet table. A
receiving entity which does not understand the meaning of this escape
mechanism shall display it as a space character.")

2. The pair 0x1B+0x1B should also map to a space (one-way to-Unicode)
but the Unicode GSM table does not include this combination. (The text
says "This code value is reserved for the extension to another
extension table. On receipt of this code, a receiving entity shall
display a space until another extension table is defined.")
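
The two points above can be sketched as a to-Unicode decoder. This is a rough illustration, not a definitive implementation: it assumes one septet per byte, uses only a subset of the extension table, relies on a hypothetical `basic_char` helper for the default alphabet, and assumes that an escape followed by a code it does not know is shown as a space with the next septet decoded normally.

```python
ESC = 0x1B

# Illustrative subset of the extension table reached via 0x1B.
EXTENSION = {
    0x14: "^", 0x28: "{", 0x29: "}", 0x2F: "\\",
    0x3C: "[", 0x3D: "~", 0x3E: "]", 0x40: "|", 0x65: "\u20ac",
}


def basic_char(code: int) -> str:
    """Hypothetical partial lookup; a real table covers all 128 codes."""
    if code == 0x20:
        return " "
    if 0x30 <= code <= 0x39 or 0x41 <= code <= 0x5A or 0x61 <= code <= 0x7A:
        return chr(code)
    return "\ufffd"  # not covered by this sketch


def decode_with_escape(septets: bytes) -> str:
    out = []
    i = 0
    while i < len(septets):
        code = septets[i]
        if code != ESC:
            out.append(basic_char(code))
            i += 1
            continue
        following = septets[i + 1] if i + 1 < len(septets) else None
        if following == ESC:
            out.append(" ")  # point 2: the pair 0x1B+0x1B displays as one space
            i += 2
        elif following in EXTENSION:
            out.append(EXTENSION[following])
            i += 2
        else:
            out.append(" ")  # point 1: a 0x1B that is not understood is a space
            i += 1
    return "".join(out)
```

Both escape cases are to-Unicode only in this sketch: nothing here would ever produce 0x1B or 0x1B+0x1B when encoding a space.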

Thanks,
markus
Kent Karlsson
2006-11-03 01:09:00 UTC
Post by Markus Scherer
There are a couple of places where the unicode.org table seems to
differ from this text (0338-720.doc): (In addition to the ç mentioned
in the unicode.org table.)
1. A 0x1B alone should be a space (one-way to-Unicode) but the Unicode
GSM table has it round-trip to U+00A0 NBSP. (The text says "This code
is an escape to an extension of the 7 bit default alphabet table. A
receiving entity which does not understand the meaning of this escape
mechanism shall display it as a space character.")
An NBSP *displays* as a space... And does not cause spurious line
breaks.
2. The pair 0x1B+0x1B should also map to a space (one-way to-Unicode)
but the Unicode GSM table does not include this combination. (The text
says "This code value is reserved for the extension to another
extension table. On receipt of this code, a receiving entity shall
display a space until another extension table is defined.")
In my "original" table I had (as comments):

#0x1B # DBCS LEAD BYTE (may be two in a row)
#0x1B1B # Double lead byte (which may lead to a secondary 7-bit extension)

I think the first one was turned into a mapping to NBSP and the second
one just removed in order to maintain roundtripability for the mapping
(apart from the non-linebreakness of NBSP).

/kent k
Markus Scherer
2006-11-03 17:53:08 UTC
Post by Kent Karlsson
Post by Markus Scherer
There are a couple of places where the unicode.org table seems to
differ from this text (0338-720.doc): (In addition to the ç mentioned
in the unicode.org table.)
1. A 0x1B alone should be a space (one-way to-Unicode) but the Unicode
GSM table has it round-trip to U+00A0 NBSP. (The text says "This code
is an escape to an extension of the 7 bit default alphabet table. A
receiving entity which does not understand the meaning of this escape
mechanism shall display it as a space character.")
An NBSP *displays* as a space... And does not cause spurious line
breaks.
True, but the standard only _says_ "space". Even if we choose NBSP for
the mapping, I don't think it's warranted to list 0x1B=U+00A0 as a
regular (round-trip) mapping. From my reading of the text, it should
be a one-way, to-Unicode mapping. Right?
Post by Kent Karlsson
Post by Markus Scherer
2. The pair 0x1B+0x1B should also map to a space (one-way to-Unicode)
but the Unicode GSM table does not include this combination. (The text
says "This code value is reserved for the extension to another
extension table. On receipt of this code, a receiving entity shall
display a space until another extension table is defined.")
#0x1B # DBCS LEAD BYTE (may be two in a row)
#0x1B1B # Double lead byte (which may lead to a secondary 7-bit extension)
I think the first one was turned into a mapping to NBSP and the second
one just removed in order to maintain roundtripability for the mapping
(apart from the non-linebreakness of NBSP).
Yes, that's what is in the table on unicode.org. I don't think
roundtrippability applies here because the standard just talks about
clients interpreting byte streams with 0x1B+0x1B if they don't know
about a sub-extension -- that is, only in conversion to Unicode as a
fallback. It seems like 0x1B+0x1B would want a one-way mapping to the
same code point as 0x1B alone (be that SP or NBSP).
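
The difference between a round-trip and a one-way, to-Unicode mapping can be made concrete with a small sketch. The entry list, the two kind labels, and `build_tables` are all hypothetical names for this illustration; they are not ICU's .ucm syntax, and SP is picked here only as one of the two candidate targets.

```python
ROUNDTRIP = "roundtrip"              # used in both directions
TO_UNICODE_ONLY = "to-unicode-only"  # used only when decoding

# Tiny illustrative table: (byte sequence, character, kind).
ENTRIES = [
    (b"\x20", " ", ROUNDTRIP),
    (b"\x41", "A", ROUNDTRIP),
    (b"\x1b", " ", TO_UNICODE_ONLY),      # lone escape shown as a space
    (b"\x1b\x1b", " ", TO_UNICODE_ONLY),  # escape pair shown as a space
]


def build_tables(entries):
    """Split entries into a to-Unicode map and a from-Unicode map."""
    to_unicode = {}
    from_unicode = {}
    for code, char, kind in entries:
        to_unicode[code] = char
        if kind == ROUNDTRIP:
            from_unicode[char] = code  # one-way entries never encode
    return to_unicode, from_unicode


TO_UNICODE, FROM_UNICODE = build_tables(ENTRIES)
```

With this split, a space always encodes to 0x20, while 0x1B and 0x1B+0x1B still decode to the same character.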

markus
--
Opinions expressed here may not reflect my company's positions unless
otherwise noted.