From: David Starner
Subject: Re: UTF-8 (was: AI-285 - Comment from Unicode list)
Newsgroups: comp.lang.ada
Date: Sun, 15 Feb 2004 22:31:08 GMT

On Sun, 15 Feb 2004 09:45:02 -0500, Wes Groleau wrote:

> I'd like to see a package (or built-in) to support UTF-8.
> But that's just me. I do a little bit of Polish and Japanese
> and might do a little Burmese, so I need Unicode. But since
> I'm mostly English and Spanish and French, if I used UTF-16
> my files would be 49.x% zero bytes.

But the internal character set has nothing to do with the external.
We could output UTF-8 and use UTF-16 or UTF-32 internally. In fact,
if you set the character set of the source code to UTF-8 with GNAT,
it will input and output UTF-8. (This is not a great design, IMO.)

> I have often been tempted to write such a package. Has it already been
> done?

http://sourceforge.net/projects/ngeadal/ will do it, among a few other
Unicode-related things. I never really completed it, and it doesn't
have any sort of stream I/O (instead dumping files as a whole), but it
should work, and I'm willing to answer questions.

> I admit it--I don't even know what UCS-2 is. :-)

Unicode is broken down into 17 planes, 4 of which are used in any way.
All but one were empty until a couple of years ago. UCS-2 is like
UTF-16, but doesn't support the surrogate code points needed to access
planes besides the first. That means that Gothic, Linear-A, Cuneiform
(in the future) won't be supported; but it also means that the
mathematical alphanumerics and Cantonese won't be supported, as well
as a lot of older literary Chinese, Japanese, Korean and Vietnamese,
and other minor Chinese languages.
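
To make the surrogate mechanism and the file-size point concrete, here
is a minimal Python sketch. It is not taken from the ngeadal package or
from GNAT; the function names to_utf16_units and encoded_sizes are
hypothetical, made up for this illustration. It shows how a code point
outside the first plane is split into two UTF-16 code units (which
UCS-2 cannot do), and how the same text differs in size between UTF-8
and UTF-16 on output.

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units for a single Unicode code point."""
    # Reject values outside Unicode and the surrogate range itself.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # First plane: one 16-bit unit, identical in UCS-2 and UTF-16.
        return [code_point]
    # Other planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for the text."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    # U+10330 GOTHIC LETTER AHSA needs a surrogate pair.
    print([hex(u) for u in to_utf16_units(0x10330)])  # ['0xd800', '0xdf30']
    # U+1D400 MATHEMATICAL BOLD CAPITAL A, likewise.
    print([hex(u) for u in to_utf16_units(0x1D400)])  # ['0xd835', '0xdc00']
    # Latin-script text doubles in size in UTF-16; half the bytes are zero.
    print(encoded_sizes("Hola"))                      # (4, 8)
    print("Hola".encode("utf-16-le").count(0))        # 4
```

A UCS-2-only implementation corresponds to the first branch alone: it
has no way to represent anything that reaches the surrogate branch.
The size comparison is independent of that choice, since a program can
hold text in one form internally and encode to another on output.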