From: Adam Beneschan
Newsgroups: comp.lang.ada
Subject: Re: Wide_[Wide_]Character
Date: Tue, 22 Jul 2008 12:18:41 -0700 (PDT)
Message-ID: <072713a0-7c1f-4e29-a2e7-4f43a89f6ebf@m45g2000hsb.googlegroups.com>

On Jul 12, 12:44 am, Dale Stanbrough wrote:
> Unicode can be represented using UTF-8, UTF-16 and UTF-32 (amongst
> others).
>
> I gather that Character is simply ISO-8859-1 (Latin-1).
>
> I suspect that Wide_Character is UCS-2 (simple 2 byte values, no escapes
> like UTF-16).
>
> Is Wide_Wide_Character
>
> * UTF-16
> * UTF-32 (i.e. UCS-4)
> * System dependent
> * Something else
>
> Thanks,
>
> Dale

I'm not convinced that the question makes sense.
Wide_Character refers to an enumeration type with 2**16 literals, where Wide_Character'Val(N) denotes the corresponding character in the ISO 10646 Basic Multilingual Plane, i.e. Unicode. Unicode is a *character* *set*, i.e. a definition of what character corresponds to each integer; it says nothing about how characters are represented. Wide_Wide_Character is similarly an enumeration type, with 2**31 literals.

How a sequence of characters is represented in memory is up to the implementation. But in most cases, it makes no sense to represent it as anything other than a flat array. Thus, a Wide_String would be, in essence, an array of 16-bit integers, and a Wide_Wide_String would be an array of 32-bit integers. If it were represented otherwise, how could a program access, say, S(1000) where S is declared as a Wide_Wide_String(1..2000)? If it were represented as, say, UTF-8 or UTF-16, the program would have to start at the beginning of the string and scan forward every time it wanted to access one particular character, because the characters before it can each occupy a different number of bytes. This would not make sense. So I think that any implementation would implement those character (and string) types as an integer (or array of integers), with whatever endianness is most convenient for that processor.

When a sequence of characters is represented in a file (or is communicated some other way, e.g. over a socket), the characters may well be encoded as UTF-8 or UTF-16 or something else. The language doesn't define how different encodings are handled. I believe GNAT uses the Form parameter when a file is opened or created to specify the encoding; it supports a number of different possible encodings, because files that come from different places may be encoded in different ways. When a line is read from one of those files into memory, though, I'm sure that the runtime will convert it to an internal representation that is a flat array.
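To make the difference between the in-memory form and an external encoding concrete, here is a rough sketch. It assumes an Ada 2012 compiler, because it relies on the package Ada.Strings.UTF_Encoding.Wide_Wide_Strings, which is not part of Ada 2005. The procedure name Flat_Vs_Encoded and the three sample characters are made up for illustration.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Flat_Vs_Encoded is
   use Ada.Strings.UTF_Encoding;

   --  In memory: one array element per character, so S (N) is a plain
   --  index operation no matter which characters precede position N.
   S : constant Wide_Wide_String (1 .. 3) :=
     (1 => 'A',
      2 => Wide_Wide_Character'Val (16#03A9#),    --  Greek capital omega
      3 => Wide_Wide_Character'Val (16#1D11E#));  --  musical G clef

   --  External form: UTF-8 needs 1, 2 and 4 bytes for these three
   --  characters, so the byte offset of character N depends on all the
   --  characters before it.
   Encoded : constant UTF_8_String := Wide_Wide_Strings.Encode (S);

   --  Getting back to something indexable means decoding into the
   --  flat form again.
   Decoded : constant Wide_Wide_String :=
     Wide_Wide_Strings.Decode (Encoded);
begin
   Ada.Text_IO.Put_Line
     ("Characters in S:" & Integer'Image (S'Length));
   Ada.Text_IO.Put_Line
     ("Bytes in UTF-8: " & Integer'Image (Encoded'Length));
   Ada.Text_IO.Put_Line
     ("Code of S (3):  "
      & Integer'Image (Wide_Wide_Character'Pos (S (3))));
   Ada.Text_IO.Put_Line
     ("Round trip OK:   " & Boolean'Image (Decoded = S));
end Flat_Vs_Encoded;
```

The point of the sketch is only that S (3) is a direct array access, while finding the third character in Encoded requires walking the bytes from the start. Whether a given runtime does its file conversions this way internally is a separate matter that I haven't checked.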
I'm not sure whether this tells you what you need to know. If not, tell us why you're asking the question (i.e. what you want to accomplish), and that will give us a better idea of what we need to tell you. If you're trying to do some sort of overlay, where you read in raw bytes from a file and then use Unchecked_Conversion or something of that nature to convert them to a Wide_Wide_String, my advice is: just don't do that.

P.S. I know I'm coming in late to this thread; I just got back from vacation. If your question has already been answered, my apologies.

-- Adam