From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM
	autolearn=ham autolearn_force=no version=3.4.4
X-Google-Thread: 103376,e4abd14106db0029,start
X-Google-NewGroupId: yes
X-Google-Attributes: gida07f3367d7,domainid0,public,usenet
X-Google-Language: ENGLISH,UTF8
Path: 
 g2news1.google.com!news4.google.com!proxad.net!feeder1-2.proxad.net!usenet-fr.net!gegeweb.org!aioe.org!not-for-mail
From: =?utf-8?Q?Yannick_Duch=C3=AAne_=28Hibou57?=
 =?utf-8?Q?=29?=
 <yannick_duchene@yahoo.fr>
Newsgroups: comp.lang.ada
Subject: Ada 2012 and Unicode package (UTF-nn encodings handling)
Date: Fri, 20 Aug 2010 23:38:20 +0200
Organization: Ada At Home
Message-ID: <op.vhrad6mjule2fv@garhos>
NNTP-Posting-Host: C0dyV9+bbkFboHRKTUMzpg.user.speranza.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
Content-Transfer-Encoding: Quoted-Printable
X-Complaints-To: abuse@aioe.org
X-Notice: Filtered by postfilter v. 0.8.2
User-Agent: Opera Mail/10.61 (Win32)
Xref: g2news1.google.com comp.lang.ada:13546
Date: 2010-08-20T23:38:20+02:00
List-Id: <comp.lang.ada>

Extract from the thread “S-expression I/O in Ada”. Subtopic moved to a
separate thread for clarity.

On Wed, 18 Aug 2010 15:16:50 +0200, J-P. Rosen <rosen@adalog.fr> wrote:
> Slightly OT, but you (and others) might be interested to know that Ada
> 2012 will include string encoding packages to the various UTF-X
> encodings. These will be (are?) provided very soon by GNAT.
>
> See AI05-137-2
> (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0137-2.txt?rev=1.2)

Time for my stupid question of the day :)

I noticed this addition in the latest amendment, because Unicode has
always been a concern of mine (I actually use my own implementation).

I could not avoid two questions. Why no UTF-32? (It would not be an
implementation nightmare.) And why is the BOM handled for each string,
when a BOM is meant to be used at the stream/file level? (See XML or
HTML files, for example.) Or are these strings supposed to hold the
whole content of a file/stream?

Quote:
http://www.unicode.org/faq/utf_bom.html
> A: A byte order mark (BOM) consists of the character code U+FEFF at the
> beginning of a data stream

This is a FAQ at Unicode.org, but the other references (the Unicode PDF
files, the XML reference, the HTML reference) all say the same.

This matters because the code point U+FEFF can stand for two different
things: Zero Width No-Break Space or an encoding Byte Order Mark. The
only way to distinguish the two usages is where it appears.

If it appears as the first code point of a stream, it is a BOM
(heuristics may be applied to switch encoding automatically by analysing
the first bytes of a stream; this is what I do); if it appears anywhere
else in a stream, it is a character code point.
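
A minimal sketch of the stream-level rule described above, written in
Python rather than Ada purely for brevity. The names sniff_bom and
decode_stream are hypothetical; they are not part of AI05-0137-2 or of
any Ada package, and the fallback to UTF-8 when no BOM is present is an
assumption of this sketch. Only a leading U+FEFF is treated as a BOM and
dropped; a U+FEFF anywhere else is kept as an ordinary character.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the order of the checks matters.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Guess (encoding, bom_length) from the first bytes of a stream.

    'default' is an assumption used when no BOM is found.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return default, 0


def decode_stream(data: bytes, default: str = "utf-8") -> str:
    """Decode a whole stream, skipping only a BOM at the very start."""
    encoding, skip = sniff_bom(data, default)
    # The explicit-endian codecs do not strip U+FEFF themselves, so an
    # inner U+FEFF survives as Zero Width No-Break Space.
    return data[skip:].decode(encoding)
```

For example, decode_stream(codecs.BOM_UTF16_BE +
"a\ufeffb".encode("utf-16-be")) yields the three code points 'a',
U+FEFF, 'b': the leading BOM is consumed, the inner U+FEFF is not. Note
that a UTF-16 LE stream whose first character after the BOM is U+0000
would be mistaken for UTF-32 LE by this heuristic.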