UTF-8 in strings

comp.lang.ada

* UTF-8 in strings - a bug?
@ 2004-05-05 22:12 Björn Persson
  2004-05-05 23:31 ` Robert I. Eachus
  2004-05-06  9:06 ` David Starner
  0 siblings, 2 replies; 16+ messages in thread
From: Björn Persson @ 2004-05-05 22:12 UTC (permalink / raw)


The reference manual says:

3.5.2(2): The predefined type Character is a character type whose values 
correspond to the 256 code positions of Row 00 (also known as Latin-1) 
of the ISO 10646 Basic Multilingual Plane (BMP).

3.6.3(4): type String is array(Positive range <>) of Character;

It seems clear to me: Strings are Latin-1 (except for programs compiled 
in nonstandard modes). But when I set my Fedora system to use UTF-8, the 
strings I get from Ada.Command_Line.Argument contain UTF-8. This means 
that some of the elements in the string aren't characters, only byte 
values that are parts of multi-byte characters. And of course 'Length 
returns the number of bytes, not the number of characters. This looks 
like a violation of the standard. Should I consider this a bug in the 
library? Or in the compiler (Gnat (GCC) 3.3.2 and 3.4.0)?
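
The difference between bytes and characters can be made concrete with a minimal sketch. It assumes a compiler that provides Ada.Strings.UTF_Encoding, a package added in Ada 2012 and therefore not available with the GNAT versions named above. The procedure name Length_Demo is made up for illustration.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Length_Demo is
   use Ada.Strings.UTF_Encoding;

   --  "Bjoern" with o-diaeresis (U+00F6) written as UTF-8 bytes:
   --  the one character occupies two String elements, 6 elements in all.
   Raw : constant UTF_8_String :=
     "Bj" & Character'Val (16#C3#) & Character'Val (16#B6#) & "rn";

   --  Decoding yields one Wide_Wide_Character per code point, 5 in all.
   Decoded : constant Wide_Wide_String := Wide_Wide_Strings.Decode (Raw);
begin
   Ada.Text_IO.Put_Line ("Elements:  " & Integer'Image (Raw'Length));
   Ada.Text_IO.Put_Line ("Characters:" & Integer'Image (Decoded'Length));
end Length_Demo;
```

The source text is pure ASCII on purpose, so the result does not depend on how the compiler reads the source file.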

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu





* Re: UTF-8 in strings - a bug?
  2004-05-05 22:12 UTF-8 in strings - a bug? Björn Persson
@ 2004-05-05 23:31 ` Robert I. Eachus
  2004-05-06  8:34   ` Björn Persson
  2004-05-06  9:06 ` David Starner
  1 sibling, 1 reply; 16+ messages in thread
From: Robert I. Eachus @ 2004-05-05 23:31 UTC (permalink / raw)

Björn Persson wrote:

> The reference manual says:
> 
> 3.5.2(2): The predefined type Character is a character type whose values 
> correspond to the 256 code positions of Row 00 (also known as Latin-1) 
> of the ISO 10646 Basic Multilingual Plane (BMP).
> 
> 3.6.3(4): type String is array(Positive range <>) of Character;
> 
> It seems clear to me: Strings are Latin-1 (except for programs compiled 
> in nonstandard modes). But when I set my Fedora system to use UTF-8, the 
> strings I get from Ada.Command_Line.Argument contain UTF-8. This means 
> that some of the elements in the string aren't characters, only byte 
> values that are parts of multi-byte characters. And of course 'Length 
> returns the number of bytes, not the number of characters. This looks 
> like a violation of the standard. Should I consider this a bug in the 
> library? Or in the compiler (Gnat (GCC) 3.3.2 and 3.4.0)?

Hmmmm...  The technical answer is that GNAT is not validated on Fedora 
with UTF-8.  The practical answer is that with GNAT, you should compile 
using the UTF-8 non-standard mode, if you are using UTF-8.

But what if you want to validate on Fedora in UTF-8 mode?  Then you will 
have to modify the libraries to get this "right."

-- 

                                           Robert I. Eachus

"The terrorist enemy holds no territory, defends no population, is 
unconstrained by rules of warfare, and respects no law of morality. Such 
an enemy cannot be deterred, contained, appeased or negotiated with. It 
can only be destroyed--and that, ladies and gentlemen, is the business 
at hand."  -- Dick Cheney


* Re: UTF-8 in strings - a bug?
  2004-05-05 23:31 ` Robert I. Eachus
@ 2004-05-06  8:34   ` Björn Persson
  2004-05-06  9:25     ` Ludovic Brenta
  0 siblings, 1 reply; 16+ messages in thread
From: Björn Persson @ 2004-05-06  8:34 UTC (permalink / raw)

Robert I. Eachus wrote:

> Hmmmm...  The technical answer is that GNAT is not validated on Fedora 
> with UTF-8.  The practical answer is that with GNAT, you should compile 
> using the UTF-8 non-standard mode, if you are using UTF-8.
> 
> But what if you want to validate on Fedora in UTF-8 mode?  Then you will 
> have to modify the libraries to get this "right."

A library bug it is then. I don't necessarily want to *validate* in 
UTF-8 mode, but now that Mr. Krischik has been so kind as to invite my 
parameter handler to AdaCL, I want it to *work* in a multilingual world. 
(It's not just Fedora of course. I expect this to happen in all modern 
Unixoid OSes, and maybe Windows too.)

Recompiling is not a workable solution. The encoding isn't known until 
run time. Software is frequently distributed in precompiled form you 
know, and the users may use many different encodings. It might even be 
that different users on the same system use different encodings. So I 
guess a transcoding library will have to be wrapped around 
Ada.Command_Line, and probably around Ada.Command_Line.Environment and 
the standard input, output and error files too.

Or could it be possible to get a function Argument(Number : in Positive) 
return Wide_Wide_String into Ada 2005?
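
Such a function can be sketched as a thin wrapper. This is only an illustration under stated assumptions: it assumes every argument is valid UTF-8 (which the run-time encoding problem described above does not guarantee), and it relies on Ada.Strings.UTF_Encoding from Ada 2012. The names Show_Arguments and the local Argument are hypothetical.

```ada
with Ada.Command_Line;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Wide_Wide_Text_IO;

procedure Show_Arguments is

   --  Hypothetical wrapper: ASSUMES the argument is valid UTF-8.
   --  Decode propagates Encoding_Error if it is not.
   function Argument (Number : in Positive) return Wide_Wide_String is
   begin
      return Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode
        (Ada.Command_Line.Argument (Number));
   end Argument;

begin
   for I in 1 .. Ada.Command_Line.Argument_Count loop
      declare
         Arg : constant Wide_Wide_String := Argument (I);
      begin
         --  Only the character count is printed, so the output itself
         --  stays ASCII whatever the terminal encoding is.
         Ada.Wide_Wide_Text_IO.Put_Line
           (Positive'Wide_Wide_Image (I) & ":" &
            Natural'Wide_Wide_Image (Arg'Length) & " character(s)");
      end;
   end loop;
end Show_Arguments;
```

A complete solution would pick the decoder from the locale at run time instead of hard-wiring UTF-8.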

(Besides, I couldn't see that "-gnatiw -gnatW8" made any difference. 
Perhaps they're only for ACT-Gnat? But it doesn't really matter to me.)

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu


* Re: UTF-8 in strings - a bug?
  2004-05-05 22:12 UTF-8 in strings - a bug? Björn Persson
  2004-05-05 23:31 ` Robert I. Eachus
@ 2004-05-06  9:06 ` David Starner
  2004-05-06 17:36   ` Björn Persson
  1 sibling, 1 reply; 16+ messages in thread
From: David Starner @ 2004-05-06  9:06 UTC (permalink / raw)

On Wed, 05 May 2004 22:12:03 +0000, Björn Persson wrote:

> It seems clear to me: Strings are Latin-1 (except for programs compiled
> in nonstandard modes). But when I set my Fedora system to use UTF-8, the
> strings I get from Ada.Command_Line.Argument contain UTF-8. 

Strings you get from anywhere, not just Ada.Command_Line.Argument, contain
UTF-8. Try setting up an i/o loop with Ada.Text_IO and watch it pass
Cyrillic right through despite being stored as Strings. Ada.Text_IO
doesn't recode on output, so if you do include Latin-1 characters in a
string, you won't get acceptable output. I suspect that all current Ada
compilers on Unix systems work this way.
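
A minimal sketch of such an i/o loop follows. The procedure name Echo is made up, and the function form of Get_Line assumes Ada 2005 or later. Whether bytes really pass through unchanged is implementation behaviour, as described above, not something the standard promises.

```ada
with Ada.Text_IO;

procedure Echo is
   use Ada.Text_IO;
begin
   while not End_Of_File loop
      --  Get_Line returns the line as a String; each element holds
      --  whatever byte the runtime read, and Put_Line writes it back.
      Put_Line (Get_Line);
   end loop;
end Echo;
```

Feeding it UTF-8 encoded Cyrillic on a UTF-8 terminal is the experiment suggested here.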

> This means
> that some of the elements in the string aren't characters, only byte
> values that are parts of multi-byte characters. And of course 'Length
> returns the number of bytes, not the number of characters. This looks
> like a violation of the standard. Should I consider this a bug in the
> library? Or in the compiler (Gnat (GCC) 3.3.2 and 3.4.0)?

I've always considered it a "feature". It may be suboptimal, but it's not
easy to fix and fixing it brings a large number of problems with it.

Consider this: what happens when the command line contains a filename?
That filename may not be in UTF-8; in fact, under Unix, filenames are
merely byte strings that don't contain 16#00# and 16#2F#. (On my
UTF-8 system, I've had several filenames copied from other systems in
Latin-1.) If you do recode it and ignore those files, you must turn it
back to its original encoding before passing the name to any system
function. Any string that may have to be exactly matched may have this
problem; far from all of my text files are UTF-8, but I may want to search
for a byte string in one of them.
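
One way to live with this is to test whether a name is valid UTF-8 and otherwise leave the bytes alone. The sketch below assumes Ada.Strings.UTF_Encoding (Ada 2012); Check_Name and Is_Valid_UTF_8 are hypothetical names, and the two sample names are invented.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Check_Name is
   use Ada.Strings.UTF_Encoding;

   --  True if Item decodes cleanly as UTF-8. An invalid sequence makes
   --  Decode propagate Encoding_Error, which is handled below.
   function Is_Valid_UTF_8 (Item : String) return Boolean is
   begin
      declare
         Decoded : constant Wide_Wide_String :=
           Wide_Wide_Strings.Decode (Item);
      begin
         return True;
      end;
   exception
      when Encoding_Error =>
         return False;
   end Is_Valid_UTF_8;

   --  "cafe" with e-acute as a single Latin-1 byte: not valid UTF-8.
   Latin_1_Name : constant String := "caf" & Character'Val (16#E9#);

   --  The same name with e-acute as the UTF-8 pair C3 A9.
   UTF_8_Name : constant String :=
     "caf" & Character'Val (16#C3#) & Character'Val (16#A9#);
begin
   Ada.Text_IO.Put_Line (Boolean'Image (Is_Valid_UTF_8 (Latin_1_Name)));
   Ada.Text_IO.Put_Line (Boolean'Image (Is_Valid_UTF_8 (UTF_8_Name)));
end Check_Name;
```

A name that fails the test should be passed to system functions byte for byte, exactly as it was received.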

Whether or not this is right depends on where Ada should be on the C-Java
spectrum. (Yes, I realize some will complain, but in geometry I may draw a
line through any two points and define that as an axis.) C doesn't worry
about it at all; it's harder to deal with, but it simplifies viewing the
world as a stream of bytes and interfacing with Unix filenames and files.
Java gets it "right" in converting everything to Unicode, but this is
often inefficient, and can make it harder to deal with the real world. I
doubt that Ada is going to get it "right" without too many incompatible
changes, so I'd rather it stayed "C-ish" here.


* Re: UTF-8 in strings - a bug?
  2004-05-06  8:34   ` Björn Persson
@ 2004-05-06  9:25     ` Ludovic Brenta
  2004-05-06 17:13       ` Björn Persson
  2004-05-06 18:24       ` Martin Krischik
  0 siblings, 2 replies; 16+ messages in thread
From: Ludovic Brenta @ 2004-05-06  9:25 UTC (permalink / raw)

Bjorn Persson wrote:
> Recompiling is not a workable solution. The encoding isn't known
> until run time. Software is frequently distributed in precompiled
> form you know, and the users may use many different encodings. It
> might even be that different users on the same system use different
> encodings. So I guess a transcoding library will have to be wrapped
> around Ada.Command_Line, and probably around
> Ada.Command_Line.Environment and the standard input, output and
> error files too.

You are correct: the encoding depends not only on the operating system
but also on the particular user who runs the software.  You can learn
about which encoding is currently in effect using the getlocale(3)
library call.  glibc also has transcoding facilities, which you can
import into your Ada program; the most powerful and general one is
iconv.

I am not aware of a thick binding to either getlocale or iconv (both
are in glibc).  If you write such a binding, it would be nice to make
it GMGPL.

In the general case, though, you do not necessarily have to transcode
unless you want to manipulate the string data with algorithms that
depend on the internal encoding.

Whenever your program interacts with GTK+, it must use UTF-8 as the
internal encoding.  Even if you don't use GTK+, I'd recommend you use
gettext for all user-visible strings and store them in UTF-8 in .po
file(s).  There is a thick binding to Gettext as part of GtkAda, FWIW.

So, I would personally depart from the Ada standard in this respect,
and declare that all Strings are in UTF-8, both internally and
externally.  GtkAda does this explicitly with a separate type,
UTF8_String.

-- 
Ludovic Brenta.



* Re: UTF-8 in strings - a bug?
  2004-05-06  9:25     ` Ludovic Brenta
@ 2004-05-06 17:13       ` Björn Persson
  2004-05-06 18:24       ` Martin Krischik
  1 sibling, 0 replies; 16+ messages in thread
From: Björn Persson @ 2004-05-06 17:13 UTC (permalink / raw)

Ludovic Brenta wrote:

> You can learn 
> about which encoding is currently in effect using the getlocale(3) 
> library call.

My understanding from the manpages is that you must first call 
setlocale(LC_ALL, "") to import the locale settings from the environment 
into the program, and then you call either nl_langinfo or localeconv to 
get information about the locale. I don't seem to have a manpage for 
getlocale.
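
That sequence can be sketched as a thin binding. The numeric values of LC_ALL and CODESET below are assumptions that match glibc on Linux; they are defined in the C headers and differ on other systems, so a real binding should take them from there. The procedure name Show_Codeset is made up.

```ada
with Ada.Text_IO;
with Interfaces.C;
with Interfaces.C.Strings;

procedure Show_Codeset is
   use Interfaces.C;
   use Interfaces.C.Strings;

   --  ASSUMPTION: values as found in <locale.h> and <langinfo.h> of
   --  glibc on Linux. Check the headers of the target system.
   LC_ALL  : constant int := 6;
   CODESET : constant int := 14;

   function setlocale (Category : int; Locale : chars_ptr) return chars_ptr;
   pragma Import (C, setlocale, "setlocale");

   function nl_langinfo (Item : int) return chars_ptr;
   pragma Import (C, nl_langinfo, "nl_langinfo");

   Empty  : chars_ptr := New_String ("");
   Result : chars_ptr;
begin
   --  Import the locale settings from the environment into the program.
   Result := setlocale (LC_ALL, Empty);
   Free (Empty);

   if Result = Null_Ptr then
      Ada.Text_IO.Put_Line ("setlocale failed");
   else
      --  The string returned by nl_langinfo belongs to the C library
      --  and must not be freed.
      Ada.Text_IO.Put_Line ("Codeset: " & Value (nl_langinfo (CODESET)));
   end if;
end Show_Codeset;
```

The codeset name obtained this way is what one would hand to iconv_open when transcoding.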

> I am not aware of a thick binding to either getlocale or iconv (both 
> are in glibc).  If you write such a binding, it would be nice to make 
> it GMGPL.

There are lots of things I'd want to write. And now I can't stop 
thinking about how such a binding might be written ... :-/

> In the general case, though, you do not necessarily have to transcode 
> unless you want to manipulate the string data with algorithms that 
> depend on the internal encoding.

Of course. I just wish the OS interface wouldn't use String when the 
encoding is undefined. Better define a type System_String or something, 
and state explicitly that this type contains strings in whatever 
encoding is used in the environment.

> GtkAda does this explicitly with a separate type, UTF8_String.

That's good. What bothers me is when String is used for anything so you 
don't know what you really have in your strings. The C programmers can 
keep that kind of confusion to themselves. Separate types is clearly the 
way to go.
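
A minimal sketch of the separate-types idea, with hypothetical type and subprogram names throughout (System_String is the name floated above; this UTF8_String is a local stand-in, not GtkAda's declaration):

```ada
with Ada.Command_Line;
with Ada.Text_IO;

procedure Typed_Strings is

   --  Same representation as String, but distinct types, so the
   --  compiler rejects accidental mixing.
   type System_String is new String;  --  environment's encoding
   type UTF8_String   is new String;  --  known to be UTF-8

   function Argument (Number : Positive) return System_String is
   begin
      return System_String (Ada.Command_Line.Argument (Number));
   end Argument;

   procedure Put_UTF8 (Item : UTF8_String) is
   begin
      Ada.Text_IO.Put_Line (String (Item));
   end Put_UTF8;

begin
   if Ada.Command_Line.Argument_Count >= 1 then
      --  Put_UTF8 (Argument (1));  --  would not compile: type mismatch
      --  The explicit conversion documents the assumption being made.
      Put_UTF8 (UTF8_String (Argument (1)));
   end if;
end Typed_Strings;
```

In a real library the conversion would be replaced by a transcoding function, so that the only way from one type to the other goes through the transcoder.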

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu


* Re: UTF-8 in strings - a bug?
  2004-05-06  9:06 ` David Starner
@ 2004-05-06 17:36   ` Björn Persson
  0 siblings, 0 replies; 16+ messages in thread
From: Björn Persson @ 2004-05-06 17:36 UTC (permalink / raw)


David Starner wrote:

> C doesn't worry about it at all;

C worries enough to define the type wchar_t, the macro 
__STDC_ISO_10646__ and several library functions for transcoding.

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu





* Re: UTF-8 in strings - a bug?
  2004-05-06  9:25     ` Ludovic Brenta
  2004-05-06 17:13       ` Björn Persson
@ 2004-05-06 18:24       ` Martin Krischik
  2004-05-07 23:32         ` Björn Persson
  1 sibling, 1 reply; 16+ messages in thread
From: Martin Krischik @ 2004-05-06 18:24 UTC (permalink / raw)


Ludovic Brenta wrote:

> I am not aware of a thick binding to either getlocale or iconv (both
> are in glibc).  If you write such a binding, it would be nice to make
> it GMGPL.

XMLAda comes with a Unicode library which can do some transcoding.

 
With Regards

Martin

-- 
mailto://krischik@users.sourceforge.net
http://www.ada.krischik.com





* Re: UTF-8 in strings - a bug?
  2004-05-06 18:24       ` Martin Krischik
@ 2004-05-07 23:32         ` Björn Persson
  2004-05-08  6:38           ` Martin Krischik
                             ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Björn Persson @ 2004-05-07 23:32 UTC (permalink / raw)


Martin Krischik wrote:

> XMLAda comes with a Unicode library which can do some transcoding.

Well, I suppose the existence of that library is a good thing, but after 
reading the introduction in unicode.ads I have to wonder whether it's 
them or me who have misunderstood Unicode. It mentions "Utf32 Latin1" 
and "Utf8 Latin2" strings. This looks really weird to me. You don't 
encode Latin-1 in UTF-32 or Latin-2 in UTF-8. You encode Unicode in 
UTF-8 or UTF-32, or you encode a subset of Unicode in Latin-1, or 
another subset in Latin-2.
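
The two steps can be sketched with the standard library alone. This is not XML/Ada's API: it assumes Ada.Characters.Conversions (Ada 2005) and Ada.Strings.UTF_Encoding (Ada 2012), and the procedure name is made up.

```ada
with Ada.Characters.Conversions;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Latin_1_To_UTF_8 is
   use Ada.Strings.UTF_Encoding;

   --  "cafe" with e-acute: in Latin-1 that is the single code 16#E9#.
   Latin_1 : constant String := "caf" & Character'Val (16#E9#);

   --  Step 1: character set. Latin-1 code positions map one-to-one
   --  onto the first 256 Unicode code points.
   Unicode : constant Wide_Wide_String :=
     Ada.Characters.Conversions.To_Wide_Wide_String (Latin_1);

   --  Step 2: encoding. The Unicode code points are written as UTF-8.
   UTF_8 : constant UTF_8_String := Wide_Wide_Strings.Encode (Unicode);
begin
   Ada.Text_IO.Put_Line ("Latin-1 elements:" & Integer'Image (Latin_1'Length));
   Ada.Text_IO.Put_Line ("Code points:     " & Integer'Image (Unicode'Length));
   Ada.Text_IO.Put_Line ("UTF-8 elements:  " & Integer'Image (UTF_8'Length));
end Latin_1_To_UTF_8;
```

The character set changes only in step 1 and the encoding only in step 2, which is why a combination such as "Utf8 Latin2" has no obvious meaning.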

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu





* Re: UTF-8 in strings - a bug?
  2004-05-07 23:32         ` Björn Persson
@ 2004-05-08  6:38           ` Martin Krischik
  2004-05-08  7:44           ` Jacob Sparre Andersen
  2004-05-08 12:10           ` Georg Bauhaus
  2 siblings, 0 replies; 16+ messages in thread
From: Martin Krischik @ 2004-05-08  6:38 UTC (permalink / raw)


Björn Persson wrote:

> Martin Krischik wrote:
> 
>> XMLAda comes with a Unicode library which can do some transcoding.
> 
> Well, I suppose the existence of that library is a good thing, but after
> reading the introduction in unicode.ads I have to wonder whether it's
> them or me who have misunderstood Unicode. It mentions "Utf32 Latin1"
> and "Utf8 Latin2" strings. This looks really weird to me. You don't
> encode Latin-1 in UTF-32 or Latin-2 in UTF-8. You encode Unicode in
> UTF-8 or UTF-32, or you encode a subset of Unicode in Latin-1, or
> another subset in Latin-2.

Well, I have worked a bit more with that library and it seems that there are
special versions of UTF-8 and that you can place some info block at the
beginning of the UTF-8 string for fine tuning.

UTF-16 and UTF-32 are variable length encodings as well. Just in case
extraterrestrials finally drop in and we need 64 bit character sets. 

So XMLAda seems more complete than the average Unicode implementation.

With Regards

Martin

-- 
mailto://krischik@users.sourceforge.net
http://www.ada.krischik.com





* Re: UTF-8 in strings - a bug?
  2004-05-07 23:32         ` Björn Persson
  2004-05-08  6:38           ` Martin Krischik
@ 2004-05-08  7:44           ` Jacob Sparre Andersen
  2004-05-08 11:06             ` Björn Persson
  2004-05-08 12:10           ` Georg Bauhaus
  2 siblings, 1 reply; 16+ messages in thread
From: Jacob Sparre Andersen @ 2004-05-08  7:44 UTC (permalink / raw)


Björn Persson wrote:

> Well, I suppose the existence of that library is a good thing, but
> after reading the introduction in unicode.ads I have to wonder
> whether it's them or me who have misunderstood Unicode. It mentions
> "Utf32 Latin1" and "Utf8 Latin2" strings. This looks really weird to
> me. You don't encode Latin-1 in UTF-32 or Latin-2 in UTF-8. You
> encode Unicode in UTF-8 or UTF-32, or you encode a subset of Unicode
> in Latin-1, or another subset in Latin-2.

Your quotes (which may be unfair :-) look like it isn't you who has
misunderstood the subject of character encodings.

Jacob
-- 
No trees were killed in the sending of this message.
However a large number of electrons were terribly inconvenienced.




* Re: UTF-8 in strings - a bug?
  2004-05-08  7:44           ` Jacob Sparre Andersen
@ 2004-05-08 11:06             ` Björn Persson
  2004-05-08 16:25               ` Martin Krischik
  0 siblings, 1 reply; 16+ messages in thread
From: Björn Persson @ 2004-05-08 11:06 UTC (permalink / raw)


Jacob Sparre Andersen wrote:

> Your quotes (which may be unfair :-)

Sorry, I should have provided more context. Here's the relevant part of 
unicode/unicode.ads in XML/Ada version 1.0 from ACT-Europe, so you don't 
have to download the library just to see what I'm talking about:


--  Coded character sets  (packages Unicode.CCS.*)
--  ====================
--  Mapping from a set of abstract characters to the set of non-negative
--  integers
--  The integer associated with a character is called "code point", and the
--  character is called "encoded character"
--  Examples of these are:  ISO/8859-1, JIS X 0208, ...
--
--  Character naming (packages Unicode.Names.*)
--  ================
--  A unique name is assigned to each abstract character, so that it is
--  possible to get the same character no matter what repertoire is used.
--
--  Character Encoding Forms
--  ========================
--  Mapping from the set of integers used in a Coded Character Set to the set
--  of sequences of code units.
--  A "code unit" is integer occupying a specified binary width in a computer
--  architecture
--  Examples of fixed-width encoding forms:  7-bit, 8-bit, EBCDIC
--  Examples of variable-width encoding forms:  Utf-8, Utf-16,...
--
--  Character Encoding Scheme (packages Unicode.CES.*)
--  =========================
--  Mapping of code units into serialized byte sequences. It also takes into
--  account the byte-order serialization.

--  As a summary, converting a file containing latin-1 characters coded on
--  8 bits to a Utf8 latin2 file, the following steps are involved:
--
--     Latin1 string  (contains bytes associated with code points in Latin1)
--       |    "use Unicode.CES.Basic_8bit.To_Utf32"
--       v
--     Utf32 latin1 string (contains code points in Latin1)
--       |    "Convert argument to To_Utf32 should be
--       v         Unicode.CCS.Iso_8859_1.Convert"
--     Utf32 Unicode string (contains code points in Unicode)
--       |    "use Unicode.CES.Utf8.From_Utf32"
--       v
--     Utf8 Unicode string (contains code points in Unicode)
--       |    "Convert argument to From_Utf32 should be
--       v         Unicode.CCS.Iso_8859_2.Convert"
--     Utf8 Latin2 string (contains code points in Latin2)


Investigating further, I see that docs/xml_2.html shows the exact same 
example of converting Latin-1 to "Utf8 Latin2".

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu





* Re: UTF-8 in strings - a bug?
  2004-05-07 23:32         ` Björn Persson
  2004-05-08  6:38           ` Martin Krischik
  2004-05-08  7:44           ` Jacob Sparre Andersen
@ 2004-05-08 12:10           ` Georg Bauhaus
  2 siblings, 0 replies; 16+ messages in thread
From: Georg Bauhaus @ 2004-05-08 12:10 UTC (permalink / raw)


Björn Persson <spam-away@nowhere.nil> wrote:
: Well, I suppose the existence of that library is a good thing, but after 
: reading the introduction in unicode.ads I have to wonder whether it's 
: them or me who have misunderstood Unicode. It mentions "Utf32 Latin1" 
: and "Utf8 Latin2" strings. This looks really weird to me. You don't 
: encode Latin-1 in UTF-32 or Latin-2 in UTF-8. You encode Unicode in 
: UTF-8 or UTF-32, or you encode a subset of Unicode in Latin-1, or 
: another subset in Latin-2.

What is meant, I think, is that there are Latin-1 characters that
as abstract characters have a code point in Unicode that corresponds
to some UTF32 encoded character.
They could as well be encoded using UTF8 or UTF16.
Latin_Capital_Letter_E_With_Acute is present in ISO 8859-1 as well
as in Unicode, and in Unicode, various bit combinations may be
used to encode it for a computer.
Unicode does have various Latin blocks, but I'm not sure about
the Latin-2 line either.

-- Georg




* Re: UTF-8 in strings - a bug?
  2004-05-08 11:06             ` Björn Persson
@ 2004-05-08 16:25               ` Martin Krischik
  2004-05-09 12:16                 ` Georg Bauhaus
  0 siblings, 1 reply; 16+ messages in thread
From: Martin Krischik @ 2004-05-08 16:25 UTC (permalink / raw)


Björn Persson wrote:


> --     Latin1 string  (contains bytes associated with code points in Latin1)
> --       |    "use Unicode.CES.Basic_8bit.To_Utf32"
> --       v

Basic_8bit.To_Utf32 only makes an 8bit -> 32bit expansion, that is, 16#xx#
becomes 16#000000xx#. The result is not really Unicode but is needed for
further conversions.

> --     Utf32 latin1 string (contains code points in Latin1)
> --       |    "Convert argument to To_Utf32 should be
> --       v         Unicode.CCS.Iso_8859_1.Convert"

This does the actual conversion. The result is now Unicode.

> --     Utf32 Unicode string (contains code points in Unicode)
> --       |    "use Unicode.CES.Utf8.From_Utf32"
> --       v

Now we have standard UTF-8.

> --     Utf8 Unicode string (contains code points in Unicode)
> --       |    "Convert argument to From_Utf32 should be
> --       v         Unicode.CCS.Iso_8859_2.Convert"

Now this is some Latin-2 optimized UTF-8. Whether this is truly useful I don't
know.

> --     Utf8 Latin2 string (contains code points in Latin2)
> 
> 
> Investigating further, I see that docs/xml_2.html shows the exact same
> example of converting Latin-1 to "Utf8 Latin2".

The UTF-X encodings can start with a BOM "Byte-order mark". This changes the
behaviour of the encoding:

   ------------------------------
   -- Byte-order mark handling --
   ------------------------------

   type Bom_Type is
     (Utf8_All,  --  Utf8-encoding
      Utf16_LE,  --  Utf16 little-endian encoding
      Utf16_BE,  --  Utf16 big-endian encoding
      Utf32_LE,  --  Utf32 little-endian encoding
      Utf32_BE,  --  Utf32 big-endian encoding
      Ucs4_BE,   --  UCS-4, big endian machine (1234 order)
      Ucs4_LE,   --  UCS-4, little endian machine (4321 order)
      Ucs4_2143, --  UCS-4, unusual byte order (2143 order)
      Ucs4_3412, --  UCS-4, unusual byte order (3412 order)
      Unknown);  --  Unknown, assumed to be ASCII compatible

BTW: I am currently adding Wide_Character support to the XMLAda/Unicode
package.

With Regards

Martin

-- 
mailto://krischik@users.sourceforge.net
http://www.ada.krischik.com





* Re: UTF-8 in strings - a bug?
  2004-05-08 16:25               ` Martin Krischik
@ 2004-05-09 12:16                 ` Georg Bauhaus
  2004-05-10  6:29                   ` Martin Krischik
  0 siblings, 1 reply; 16+ messages in thread
From: Georg Bauhaus @ 2004-05-09 12:16 UTC (permalink / raw)


Martin Krischik <krischik@users.sourceforge.net> wrote:
 
: The UTF-X encodings can start with a BOM "Byte-order mark".

However, systems are allowed to define protocols which may
restrict the use of a BOM in case of UTF-8 (require/forbid).
A #!/shell script is an example.

A BOM is said to be useful to distinguish a UTF-8 Unicode file
from a file using another 8bit encoding. Though I wonder how by
the absence of the Unicode BOM they think a program can find
out which of the other encodings has been used...
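
Detecting the UTF-8 BOM itself is simple; a minimal sketch follows. It assumes the constant BOM_8 from Ada.Strings.UTF_Encoding (Ada 2012); the names BOM_Check and Starts_With_BOM_8 are hypothetical, and the shell-script line is only sample data.

```ada
with Ada.Strings.UTF_Encoding;
with Ada.Text_IO;

procedure BOM_Check is
   use Ada.Strings.UTF_Encoding;

   --  True if Data begins with the three-byte UTF-8 byte-order mark
   --  (16#EF# 16#BB# 16#BF#).
   function Starts_With_BOM_8 (Data : String) return Boolean is
   begin
      return Data'Length >= BOM_8'Length
        and then Data (Data'First .. Data'First + BOM_8'Length - 1) = BOM_8;
   end Starts_With_BOM_8;

   With_BOM    : constant String := BOM_8 & "#!/bin/sh";
   Without_BOM : constant String := "#!/bin/sh";
begin
   Ada.Text_IO.Put_Line (Boolean'Image (Starts_With_BOM_8 (With_BOM)));
   Ada.Text_IO.Put_Line (Boolean'Image (Starts_With_BOM_8 (Without_BOM)));
end BOM_Check;
```

A False result says only that no BOM is present; it does not reveal which 8-bit encoding is in use, which is exactly the gap pointed out above.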


-- Georg




* Re: UTF-8 in strings - a bug?
  2004-05-09 12:16                 ` Georg Bauhaus
@ 2004-05-10  6:29                   ` Martin Krischik
  0 siblings, 0 replies; 16+ messages in thread
From: Martin Krischik @ 2004-05-10  6:29 UTC (permalink / raw)


Georg Bauhaus wrote:

> Martin Krischik <krischik@users.sourceforge.net> wrote:
>  
> : The UTF-X encodings can start with a BOM "Byte-order mark".
> 
> However, systems are allowed to define protocols which may
> restrict the use of a BOM in case of UTF-8 (require/forbid).
> A #!/shell script is an example.
> 
> A BOM is said to be useful to distinguish a UTF-8 Unicode file
> from a file using another 8bit encoding. Though I wonder how by
> the absence of the Unicode BOM they think a program can find
> out which of the other encodings has been used...

XML/Ada does some guessing based on the usual beginning of an XML file.

Apart from that, I guess they can't.

With Regards

Martin
-- 
mailto://krischik@users.sourceforge.net
http://www.ada.krischik.com





Thread overview: 16+ messages
2004-05-05 22:12 UTF-8 in strings - a bug? Björn Persson
2004-05-05 23:31 ` Robert I. Eachus
2004-05-06  8:34   ` Björn Persson
2004-05-06  9:25     ` Ludovic Brenta
2004-05-06 17:13       ` Björn Persson
2004-05-06 18:24       ` Martin Krischik
2004-05-07 23:32         ` Björn Persson
2004-05-08  6:38           ` Martin Krischik
2004-05-08  7:44           ` Jacob Sparre Andersen
2004-05-08 11:06             ` Björn Persson
2004-05-08 16:25               ` Martin Krischik
2004-05-09 12:16                 ` Georg Bauhaus
2004-05-10  6:29                   ` Martin Krischik
2004-05-08 12:10           ` Georg Bauhaus
2004-05-06  9:06 ` David Starner
2004-05-06 17:36   ` Björn Persson
