From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Thread: 103376,421baaa91aa096a7
X-Google-Attributes: gid103376,domainid0,public,usenet
X-Google-Language: ENGLISH,ASCII-7-bit
Path: 
 g2news2.google.com!postnews.google.com!m45g2000hsb.googlegroups.com!not-for-mail
From: Adam Beneschan <adam@irvine.com>
Newsgroups: comp.lang.ada
Subject: Re: Wide_[Wide_]Character
Date: Tue, 22 Jul 2008 12:18:41 -0700 (PDT)
Organization: http://groups.google.com
Message-ID: 
 <072713a0-7c1f-4e29-a2e7-4f43a89f6ebf@m45g2000hsb.googlegroups.com>
References: <MrNoSpam-A54511.17443812072008@news-server.bigpond.net.au>
NNTP-Posting-Host: 66.126.103.122
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Trace: posting.google.com 1216754322 12567 127.0.0.1 (22 Jul 2008 19:18:42
 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Tue, 22 Jul 2008 19:18:42 +0000 (UTC)
Complaints-To: groups-abuse@google.com
Injection-Info: m45g2000hsb.googlegroups.com; posting-host=66.126.103.122;
	posting-account=duW0ogkAAABjRdnxgLGXDfna0Gc6XqmQ
User-Agent: G2/1.0
X-HTTP-UserAgent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.12)
	Gecko/20050922 Fedora/1.7.12-1.3.1,gzip(gfe),gzip(gfe)
Xref: g2news2.google.com comp.lang.ada:6979
Date: 2008-07-22T12:18:41-07:00
List-Id: <comp.lang.ada>

On Jul 12, 12:44 am, Dale Stanbrough <MrNoS...@bigpoop.net.au> wrote:
> Unicode can be represented using UTF-8, UTF-16 and UTF-32 (amongst
> others).
>
> I gather that Character is simply ISO-8859-1 (Latin-1).
>
> I suspect that Wide_Character is UCS-2 (simple 2 byte values, no escapes
> like UTF-16).
>
> Is Wide_Wide_Character
>
>    * UTF-16
>    * UTF-32 (i.e. UCS-4)
>    * System dependent
>    * Something else
>
> Thanks,
>
> Dale

I'm not convinced that the question makes sense.  Wide_Character
refers to an enumeration type with 2**16 literals, where
Wide_Character'Val(N) denotes the corresponding character in the ISO
10646 Basic Multilingual Plane, i.e. the first 2**16 code points of
Unicode.  Unicode is a *character* *set*, i.e. a definition of what
character corresponds to each integer; in that sense it says nothing
about how characters are represented.  Wide_Wide_Character is
similarly an enumeration type, with 2**31 literals covering the full
ISO 10646 code space.
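
To make the 'Val/'Pos relationship concrete, here is a minimal sketch.
The procedure name and the two sample code points are arbitrary choices
of mine for illustration, and it assumes an Ada 2005 or later compiler,
since Wide_Wide_Character does not exist before that.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Code_Points is
   --  U+03A9 GREEK CAPITAL LETTER OMEGA lies inside the BMP.
   Omega : constant Wide_Character := Wide_Character'Val (16#03A9#);

   --  U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so it only
   --  fits in Wide_Wide_Character.
   Clef : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#1D11E#);
begin
   --  'Pos recovers the code point; no encoding is involved anywhere.
   Put_Line (Integer'Image (Wide_Character'Pos (Omega)));
   Put_Line (Integer'Image (Wide_Wide_Character'Pos (Clef)));
end Show_Code_Points;
```

This prints 937 and 119070, i.e. the plain integer values 16#03A9# and
16#1D11E#.  Nothing here says whether those values would be written to
a file as UTF-8, UTF-16 or anything else.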

When a sequence of characters is represented in internal memory, it's
up to an implementation to decide how to represent each character in
memory.  But in most cases, it makes no sense to represent it as
anything other than a flat array.  Thus, a Wide_String would be, in
essence, an array of 16-bit integers, and a Wide_Wide_String would be
an array of 32-bit integers.  If it were represented otherwise, how
could a program access, say, S(1000) where S is declared as a
Wide_Wide_String(1..2000)?  If it were represented as, say, UTF-8 or
UTF-16, the program would have to start at the beginning of the string
and do an expensive search every time it wanted to access one
particular character of the string.  This would not make sense.  So I
think that any implementation would implement those character (and
string) types as an integer (or array of integers), with whatever
endianness is most convenient for that processor.
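
The indexing argument can be sketched as follows.  UTF8_Offset is a
hypothetical helper written only for this illustration (it is not a
library routine), and it assumes well-formed UTF-8 held in a plain
String.  The point is the contrast: S (1000) is a direct array access,
while finding the Nth character in the encoded form means walking the
bytes from the start.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Index_Demo is

   --  Hypothetical helper: index of the first byte of the Nth code point
   --  in a well-formed UTF-8 byte sequence, or 0 if there are fewer than N.
   function UTF8_Offset (Bytes : String; N : Positive) return Natural is
      Count : Natural := 0;
   begin
      for I in Bytes'Range loop
         --  Bytes in 16#80# .. 16#BF# are continuation bytes; anything
         --  else starts a new code point.
         if Character'Pos (Bytes (I)) not in 16#80# .. 16#BF# then
            Count := Count + 1;
            if Count = N then
               return I;
            end if;
         end if;
      end loop;
      return 0;
   end UTF8_Offset;

   S : Wide_Wide_String (1 .. 2000) := (others => ' ');

   --  "a", U+03A9 (two bytes in UTF-8: CE A9), "b"
   Encoded : constant String :=
     'a' & Character'Val (16#CE#) & Character'Val (16#A9#) & 'b';

begin
   --  Flat array: element 1000 is reached directly, no scan needed.
   S (1000) := Wide_Wide_Character'Val (16#1D11E#);
   Put_Line (Integer'Image (Wide_Wide_Character'Pos (S (1000))));

   --  Encoded form: the third character starts at byte 4, and the only
   --  way to find that out is to scan from the beginning.
   Put_Line (Integer'Image (UTF8_Offset (Encoded, 3)));
end Index_Demo;
```

This prints 119070 and then 4.  The loop in UTF8_Offset is the
"expensive search" I mean: its cost grows with N, whereas the array
access does not.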

When a sequence of characters is represented in a file (or is
communicated some other way e.g. over a socket), the characters may
well be encoded as UTF-8 or UTF-16 or something.  The language doesn't
define how different encodings are handled.  I believe GNAT uses the
Form parameter of Open and Create to specify the encoding; it supports
a number of different possible encodings, because files that come from
different places may be encoded in different ways.  When a line is
read from one of those files into memory, though, I'm sure that the
runtime will convert it to an internal representation that is a flat
array.
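
As a rough sketch of what that looks like with GNAT: the Form string
"WCEM=8" is, as far as I know, GNAT's way of selecting UTF-8 for the
wide text I/O packages.  It is implementation-specific, not standard
Ada, so check the GNAT Reference Manual before relying on it.  The file
name and procedure name are arbitrary.

```ada
with Ada.Text_IO;
with Ada.Wide_Wide_Text_IO;

procedure Form_Demo is
   package WWIO renames Ada.Wide_Wide_Text_IO;

   Name : constant String := "wcem_demo.txt";  --  arbitrary scratch file
   F    : WWIO.File_Type;
   Line : Wide_Wide_String (1 .. 80);
   Last : Natural;
begin
   --  Write "a", U+03A9, "b"; the Form string asks GNAT for UTF-8 on disk.
   WWIO.Create (F, WWIO.Out_File, Name, Form => "WCEM=8");
   WWIO.Put_Line (F, "a" & Wide_Wide_Character'Val (16#03A9#) & "b");
   WWIO.Close (F);

   --  Read it back with the same Form; the runtime does the decoding.
   WWIO.Open (F, WWIO.In_File, Name, Form => "WCEM=8");
   WWIO.Get_Line (F, Line, Last);
   WWIO.Delete (F);  --  closes the file and removes it

   --  Number of characters read, and the code point of the second one.
   Ada.Text_IO.Put_Line (Integer'Image (Last));
   Ada.Text_IO.Put_Line
     (Integer'Image (Wide_Wide_Character'Pos (Line (2))));
end Form_Demo;
```

If the runtime decodes as I describe, Last counts characters rather
than bytes on disk, and Line (2) holds the code point itself, so the
program never sees the UTF-8 bytes.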

I'm not sure if this tells you what you need to know or not; if not,
then if you tell us why you're asking the question (i.e. what you want
to accomplish), this will give us a better idea of what we need to
tell you.  If you're trying to do some sort of overlay, where you read
in raw bytes from a file and then use Unchecked_Conversion or
something to convert it to a Wide_Wide_String, or something of that
nature, my advice is: Just don't do that.

P.S. I know I'm coming in late to this thread---I just got back from
vacation.  If your question has already been answered, my apologies.

                                -- Adam