From: Yannick Duchêne (Hibou57)
Newsgroups: comp.lang.ada
Subject: Ada 2012 and Unicode package (UTF-nn encodings handling)
Date: Fri, 20 Aug 2010 23:38:20 +0200
Organization: Ada At Home

Extract from the thread “S-expression I/O in Ada”. Subtopic moved to a separate thread for clarity.

On Wed, 18 Aug 2010 15:16:50 +0200, J-P. Rosen wrote:

> Slightly OT, but you (and others) might be interested to know that Ada
> 2012 will include string encoding packages to the various UTF-X
> encodings. These will be (are?) provided very soon by GNAT.
>
> See AI05-137-2
> (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0137-2.txt?rev=1.2)

Time for my stupid question of the day :)

I noticed this addition in the latest amendment because Unicode has always been a concern of mine (I actually use my own). I could not avoid two questions: why no UTF-32?
(this would not be an implementation nightmare), and why is the BOM handled for each string, when a BOM is meant to be used at the stream/file level (see XML or HTML files, for example)? Or are these strings supposed to hold the whole content of a file/stream?

Quote from http://www.unicode.org/faq/utf_bom.html:

> A: A byte order mark (BOM) consists of the character code U+FEFF at the
> beginning of a data stream

This is from a FAQ at Unicode.org, but all the references (Unicode PDF files, XML reference, HTML reference) say the same.

This matters because the code point U+FEFF can stand for two different things: Zero Width No-Break Space or the encoding Byte Order Mark. The only way to distinguish the two usages is where it appears. If it appears as the first code point of a stream, it is a BOM (heuristics may be applied to switch encoding automatically by analysing the first bytes of a stream; this is what I do); if it appears anywhere else in a stream, it is a character code point.
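
To make the "where it appears" rule concrete, here is a minimal sketch in Python. It is not taken from AI05-0137-2 or from any Ada package, and it is not necessarily the heuristic described above; the names sniff_bom and decode_stream are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption. It inspects only the first bytes of the stream, strips a BOM found there, and leaves any later U+FEFF in place as an ordinary character.

```python
import codecs

# Known BOM signatures, longest first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE), so UTF-32 must be tested first.
# This stays a heuristic: a UTF-16 LE stream whose first character after
# the BOM is U+0000 would be misread as UTF-32 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head: bytes, default: str = "utf-8"):
    """Return (encoding name, BOM length) for the start of a stream.

    `default` is an assumption: the encoding used when no BOM is found.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return default, 0


def decode_stream(data: bytes, default: str = "utf-8") -> str:
    """Decode a whole stream, treating U+FEFF as a BOM only at offset 0."""
    encoding, skip = sniff_bom(data, default)
    # The explicit-endian codecs ("utf-16-le", "utf-8", ...) do not strip
    # U+FEFF themselves, so an occurrence later in the stream is kept as
    # the character Zero Width No-Break Space.
    return data[skip:].decode(encoding)


if __name__ == "__main__":
    stream = codecs.BOM_UTF16_LE + "a\ufeffb".encode("utf-16-le")
    text = decode_stream(stream)
    print([hex(ord(c)) for c in text])
```

The detection happens once per stream, which is the point of the question above: a string cut from the middle of a file has no "first code point of the stream", so a leading U+FEFF there cannot be told apart from a character by position alone.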