Wide_[Wide_]Character

comp.lang.ada

* Wide_[Wide_]Character
@ 2008-07-12  7:44 Dale Stanbrough
  2008-07-12  8:11 ` Wide_[Wide_]Character Dmitry A. Kazakov
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Dale Stanbrough @ 2008-07-12  7:44 UTC (permalink / raw)


Unicode can be represented using UTF-8, UTF-16 and UTF-32 (amongst 
others).

I gather that Character is simply ISO-8859-1 (Latin-1).

I suspect that Wide_Character is UCS-2 (simple 2 byte values, no escapes 
like UTF-16).

Is Wide_Wide_Character

   * UTF-16
   * UTF-32 (i.e. UCS-4)
   * System dependent
   * Something else


Thanks,

Dale

-- 
dstanbro@spam.o.matic.bigpond.net.au




* Re: Wide_[Wide_]Character
  2008-07-12  7:44 Wide_[Wide_]Character Dale Stanbrough
@ 2008-07-12  8:11 ` Dmitry A. Kazakov
  2008-07-12 11:00   ` Wide_[Wide_]Character Dale Stanbrough
  2008-07-12 10:11 ` Wide_[Wide_]Character anon
  2008-07-22 19:18 ` Wide_[Wide_]Character Adam Beneschan
  2 siblings, 1 reply; 12+ messages in thread
From: Dmitry A. Kazakov @ 2008-07-12  8:11 UTC (permalink / raw)


On Sat, 12 Jul 2008 07:44:38 GMT, Dale Stanbrough wrote:

> Unicode can be represented using UTF-8, UTF-16 and UTF-32 (amongst 
> others).
> 
> I gather that Character is simply ISO-8859-1 (Latin-1).
> 
> I suspect that Wide_Character is UCS-2 (simple 2 byte values, no escapes 
> like UTF-16).
> 
> Is Wide_Wide_Character
> 
>    * UTF-16
>    * UTF-32 (i.e. UCS-4)
>    * System dependent
>    * Something else

RM 3.5.2 talks about "code positions" (=code points, I guess), represented
by Wide_Wide_Character. From this I conclude that it shall be UCS-4 with
some implementation-defined endianness.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de




* Re: Wide_[Wide_]Character
  2008-07-12  8:11 ` Wide_[Wide_]Character Dmitry A. Kazakov
@ 2008-07-12 11:00   ` Dale Stanbrough
  2008-07-12 11:27     ` Wide_[Wide_]Character Peter C. Chapin
  2008-07-12 20:56     ` Wide_[Wide_]Character Dmitry A. Kazakov
  0 siblings, 2 replies; 12+ messages in thread
From: Dale Stanbrough @ 2008-07-12 11:00 UTC (permalink / raw)

Dmitry A. Kazakov wrote:

> RM 3.5.2 talks about "code positions" (=code points, I guess), represented
> by Wide_Wide_Character. From this I conclude that it shall be UCS-4 with
> some implementation-defined endianness.

Code points can be represented by any set of encodings. Wide_Character 
seems to deliberately confine itself to the BMP, so UCS-2 would suffice 
(and seems implied).

I can't see any implication that would cause me to think 
Wide_Wide_Character is definitely UCS-4 (and not UTF-16).

Dale

-- 
dstanbro@spam.o.matic.bigpond.net.au


* Re: Wide_[Wide_]Character
  2008-07-12 11:00   ` Wide_[Wide_]Character Dale Stanbrough
@ 2008-07-12 11:27     ` Peter C. Chapin
  2008-07-12 12:25       ` Wide_[Wide_]Character Georg Bauhaus
  2008-07-12 20:56     ` Wide_[Wide_]Character Dmitry A. Kazakov
  1 sibling, 1 reply; 12+ messages in thread
From: Peter C. Chapin @ 2008-07-12 11:27 UTC (permalink / raw)

Dale Stanbrough wrote:

> I can't see any implication that would cause me to think 
> Wide_Wide_Character is definitely UCS-4 (and not UTF-16).

Well, section 3.5.2 (Character Types) in the Ada 2005 reference manual says:

   "The predefined type Wide_Wide_Character is a character type whose 
values correspond to the 2147483648 code positions of the ISO/IEC 
10646:2003 character set. Each of the graphic_characters has a 
corresponding character_literal in Wide_Wide_Character. The first 65536 
values of Wide_Wide_Character have the same character_literal or 
language-defined name as defined for Wide_Character."

I understand that this doesn't speak to the issue of encoding, but 
perhaps that is intended to be left unspecified. In any event it seems 
fairly clear that you should be able to store any of 2147483648 values 
in a single Wide_Wide_Character variable. Doesn't that mean 
Wide_Wide_Character needs to be (at least) 32 bits?
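
As a rough check of the arithmetic (a sketch only; Show_Ranges and the
helper type Code are made-up names, not anything from the RM), one can
print the position of the last value of each predefined character type.
2147483648 is 2**31, so the positions run from 0 to 2**31 - 1 and would
fit in 31 bits; 32 is the next convenient size. The 'Size printed on the
last line is whatever the compiler at hand reports.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Ranges is
   --  Helper integer type wide enough for every code position
   --  (hypothetical name, chosen for this sketch).
   type Code is range 0 .. 2**31 - 1;
begin
   Put_Line ("Character'Last           at"
             & Code'Image (Character'Pos (Character'Last)));
   Put_Line ("Wide_Character'Last      at"
             & Code'Image (Wide_Character'Pos (Wide_Character'Last)));
   Put_Line ("Wide_Wide_Character'Last at"
             & Code'Image
                 (Wide_Wide_Character'Pos (Wide_Wide_Character'Last)));
   Put_Line ("Wide_Wide_Character'Size is"
             & Code'Image (Wide_Wide_Character'Size));
end Show_Ranges;
```

The first three lines should report 255, 65535 and 2147483647, since
those positions follow directly from the type definitions in the RM.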

Peter


* Re: Wide_[Wide_]Character
  2008-07-12 11:27     ` Wide_[Wide_]Character Peter C. Chapin
@ 2008-07-12 12:25       ` Georg Bauhaus
  2008-07-15 12:37         ` Wide_[Wide_]Character Dale Stanbrough
  0 siblings, 1 reply; 12+ messages in thread
From: Georg Bauhaus @ 2008-07-12 12:25 UTC (permalink / raw)

Peter C. Chapin wrote:

> I understand that this doesn't speak to the issue of encoding, but
> perhaps that is intended to be left unspecified. In any event it seems
> fairly clear that you should be able to store any of 2147483648 values
> in a single Wide_Wide_Character variable. Doesn't that mean
> Wide_Wide_Character needs to be (at least) 32 bits?

package Standard specifies the 'Size of Wide_Wide_Character:

        type Wide_Wide_Character is
            (nul, soh ... Hex_7FFFFFFE, Hex_7FFFFFFF);
        for Wide_Wide_Character'Size use 32;

Annex B has some hints as to the internal representation:

43.a/2 Discussion: The C types wchar_t and char16_t seem to be the same.
   However, wchar_t has an implementation-defined size, whereas
   char16_t is guaranteed to be an unsigned type of at least 16 bits.
   Also, char16_t and char32_t are encouraged to have UTF-16 and UTF-32
   representations; that means that they are not directly the same as
   the Ada types, which most likely don't use any UTF encoding.

Isn't this just like the RM not specifying the bit layout of
numeric objects?

-- 
Georg Bauhaus
Y A Time Drain  http://www.9toX.de


* Re: Wide_[Wide_]Character
  2008-07-12 12:25       ` Wide_[Wide_]Character Georg Bauhaus
@ 2008-07-15 12:37         ` Dale Stanbrough
  2008-07-15 14:06           ` Wide_[Wide_]Character Georg Bauhaus
  0 siblings, 1 reply; 12+ messages in thread
From: Dale Stanbrough @ 2008-07-15 12:37 UTC (permalink / raw)


Georg Bauhaus wrote:

> package Standard specifies the 'Size of Wide_Wide_Character:
> 
>         type Wide_Wide_Character is
>             (nul, soh ... Hex_7FFFFFFE, Hex_7FFFFFFF);
>         for Wide_Wide_Character'Size use 32;

Thanks, I hadn't seen that.

> Annex B has some hints as to the internal representation:
> 
> 43.a/2 Discussion: The C types wchar_t and char16_t seem to be the same.
>    However, wchar_t has an implementation-defined size, whereas
>    char16_t is guaranteed to be an unsigned type of at least 16 bits.
>    Also, char16_t and char32_t are encouraged to have UTF-16 and UTF-32
>    representations; that means that they are not directly the same as
>    the Ada types, which most likely don't use any UTF encoding.

This seems to refer to the Interfaces.C types, not
Wide_Wide_Character.


> Isn't this just like the RM not specifying the bit layout of
> numeric objects?

I'm not sure what the point of Wide_Wide_Character is if not to deal 
with Unicode (or ISO-10646:2003).

You could invent your own 32 bit Character code (or use the one the 
vendor gives you), but playing in your own backyard doesn't seem very 
productive. To me the only point is if it implements the code.


Dale

-- 
dstanbro@spam.o.matic.bigpond.net.au




* Re: Wide_[Wide_]Character
  2008-07-15 12:37         ` Wide_[Wide_]Character Dale Stanbrough
@ 2008-07-15 14:06           ` Georg Bauhaus
  0 siblings, 0 replies; 12+ messages in thread
From: Georg Bauhaus @ 2008-07-15 14:06 UTC (permalink / raw)


Dale Stanbrough wrote:

> 
>> Isn't this just like the RM not specifying the bit layout of
>> numeric objects?
> 
> I'm not sure what the point of Wide_Wide_Character is if not to deal 
> with Unicode (or ISO-10646:2003).

Sure, Wide_Wide_Character deals with ISO-10646:2003, the normative
reference is listed in the LRM; you get I/O of those characters,
and compilers will document the external encodings you can use.

I also got to know how to pass Wide_Wide_Character objects into
and out of my program in case I must (that's the Interfaces[.C] part).
But why and when should I wonder what the internal bit layout of
Wide_Wide_Character objects actually is?


> You could invent your own 32 bit Character code (or use the one the 
> vendor gives you), but playing in your own backyard doesn't seem very 
> productive.

Why not? If it is faster to use 64 bit words for Wide_Wide_Character
operations, if this does not waste too much first level cache,
then it seems like a good idea for a compiler to use 64 bits for
Wide_Wide_Character.



> To me the only point is if it implements the code.

Why?

--
Georg Bauhaus
Y A Time Drain  http://www.9toX.de




* Re: Wide_[Wide_]Character
  2008-07-12 11:00   ` Wide_[Wide_]Character Dale Stanbrough
  2008-07-12 11:27     ` Wide_[Wide_]Character Peter C. Chapin
@ 2008-07-12 20:56     ` Dmitry A. Kazakov
  1 sibling, 0 replies; 12+ messages in thread
From: Dmitry A. Kazakov @ 2008-07-12 20:56 UTC (permalink / raw)

On Sat, 12 Jul 2008 11:00:05 GMT, Dale Stanbrough wrote:

> Dmitry A. Kazakov wrote:
> 
>> RM 3.5.2 talks about "code positions" (=code points, I guess), represented
>> by Wide_Wide_Character. From this I conclude that it shall be UCS-4 with
>> some implementation-defined endianness.
> 
> Code points can be represented by any set of encodings. Wide_Character 
> seems to deliberately confine itself to the BMP, so UCS-2 would suffice 
> (and seems implied).
> 
> I can't see any implication that would cause me to think 
> Wide_Wide_Character is definitely UCS-4 (and not UTF-16).

How about this: Wide_Wide_Character may obviously use only the encodings
which would make any Wide_Wide_String composed out of Wide_Wide_Characters
a properly encoded string in the same encoding. This automatically excludes
UTF-8 and UTF-16.
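
To illustrate the point (a sketch only: it relies on the package
Ada.Strings.UTF_Encoding, which was added in Ada 2012 and so postdates
this thread, and Encode_Demo is a made-up name): a code point outside
the BMP, such as U+1D11E, is one element of a Wide_Wide_String, but it
needs two 16-bit code units in UTF-16 and four bytes in UTF-8. A single
UTF-16 or UTF-8 element therefore cannot stand for that character on
its own.

```ada
with Ada.Text_IO;              use Ada.Text_IO;
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Encode_Demo is
   package WW renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
   S : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#1D11E#));

   --  The two Encode overloads are selected by the result type.
   U8  : constant UTF_8_String       := WW.Encode (S);
   U16 : constant UTF_16_Wide_String := WW.Encode (S);
begin
   Put_Line ("Wide_Wide_String elements:" & Integer'Image (S'Length));
   Put_Line ("UTF-16 code units:        " & Integer'Image (U16'Length));
   Put_Line ("UTF-8 bytes:              " & Integer'Image (U8'Length));
end Encode_Demo;
```

The three counts come out as 1, 2 and 4 respectively.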

BTW, why do you care? (:-)) I wonder if there is any use of
Wide_[Wide_]Strings. IMO, anything one could wish from Unicode is provided
by UTF-8 and plain Strings...

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Wide_[Wide_]Character
  2008-07-12  7:44 Wide_[Wide_]Character Dale Stanbrough
  2008-07-12  8:11 ` Wide_[Wide_]Character Dmitry A. Kazakov
@ 2008-07-12 10:11 ` anon
  2008-07-12 10:58   ` Wide_[Wide_]Character Dale Stanbrough
  2008-07-22 19:18 ` Wide_[Wide_]Character Adam Beneschan
  2 siblings, 1 reply; 12+ messages in thread
From: anon @ 2008-07-12 10:11 UTC (permalink / raw)


Ada Wide_Character is defined as ISO-10646:2003 (32-bit) (RM 3.2.2 (3/2)). 
The Unicode version is 4.0.
Verified at http://www.unicode.org/versions/Unicode4.0.0/


In <MrNoSpam-A54511.17443812072008@news-server.bigpond.net.au>, Dale Stanbrough <MrNoSpam@bigpoop.net.au> writes:
>Unicode can be represented using UTF-8, UTF-16 and UTF-32 (amongst 
>others).
>
>I gather that Character is simply ISO-8859-1 (Latin-1).
>
>I suspect that Wide_Character is UCS-2 (simple 2 byte values, no escapes 
>like UTF-16).
>
>Is Wide_Wide_Character
>
>   * UTF-16
>   * UTF-32 (i.e. UCS-4)
>   * System dependent
>   * Something else
>
>
>Thanks,
>
>Dale
>
>-- 
>dstanbro@spam.o.matic.bigpond.net.au





* Re: Wide_[Wide_]Character
  2008-07-12 10:11 ` Wide_[Wide_]Character anon
@ 2008-07-12 10:58   ` Dale Stanbrough
  2008-07-13  1:38     ` Wide_[Wide_]Character anon
  0 siblings, 1 reply; 12+ messages in thread
From: Dale Stanbrough @ 2008-07-12 10:58 UTC (permalink / raw)

In article <Dr%dk.113840$102.42319@bgtnsc05-news.ops.worldnet.att.net>,
 anon@anon.org (anon) wrote:

> Ada Wide_Character is defined as ISO-10646:2003 (32-bit) (RM 3.2.2 (3/2)). 
> The Unicode version is 4.0.
> Verified at http://www.unicode.org/versions/Unicode4.0.0/

I think you mean 3.5.2.

It only says that it follows ISO-10646, but says nothing about it being 
a 32 bit version (see http://unicode.org/faq/unicode_iso.html#3).

The Wikipedia entry also mentions that UTF-16 was an early extension to 
UCS-2 (and by implication also supported by ISO-10646).

The character codes are the same as those supported by Unicode (in fact 
10646 seems to be the Unicode character code point values but without 
all of the sorting, script, locale etc support).

The encodings are independent of the code set.

Dale

-- 
dstanbro@spam.o.matic.bigpond.net.au


* Re: Wide_[Wide_]Character
  2008-07-12 10:58   ` Wide_[Wide_]Character Dale Stanbrough
@ 2008-07-13  1:38     ` anon
  0 siblings, 0 replies; 12+ messages in thread
From: anon @ 2008-07-13  1:38 UTC (permalink / raw)

It is RM 3.5.2 (3/2). But the RM just defines that Ada uses 
ISO-10646:2003 (32-bit).  The 32-bit figure came from package Standard and 
other places which also define ISO-10646:2003 as 32 bits. The Unicode 
version, 4.0, came from the web page, which states "The character 
repertoire corresponds to ISO/IEC 10646:2003."  That page is owned by an 
agency that deals with the Unicode standard. Elsewhere on that site it 
states that for every "ISO/IEC" edition there is one corresponding Unicode 
version, not several.

Also, Unicode version 4.0 supports all previous versions, with some changes 
listed on the web page, just as Unicode version 5.0 supports versions 4.0, 
4.0.1, 4.1.0, etc., with some other changes.

Now, Ada does not define how to use all of its character set. 
That's up to the programmers who are using Ada. 

In <MrNoSpam-D2E6B0.20581212072008@news-server.bigpond.net.au>, Dale Stanbrough <MrNoSpam@bigpoop.net.au> writes:
>In article <Dr%dk.113840$102.42319@bgtnsc05-news.ops.worldnet.att.net>,
> anon@anon.org (anon) wrote:
>
>> Ada Wide_Character is defined as ISO-10646:2003 (32-bit) (RM 3.2.2 (3/2)). 
>> The Unicode version is 4.0.  
>> Verified at http://www.unicode.org/versions/Unicode4.0.0/
>
>I think you mean 3.5.2.
>
>It only says that it follows ISO-10646, but says nothing about it being 
>a 32 bit version (see http://unicode.org/faq/unicode_iso.html#3).
>
>The Wikipedia entry also mentions that UTF-16 was an early extension to 
>UCS-2 (and by implication also supported by ISO-10646).
>
>
>The character codes are the same as those supported by Unicode (in fact 
>10646 seems to be the Unicode character code point values but without 
>all of the sorting, script, locale etc support).
>
>The encodings are independent of the code set.
>
>Dale
>
>-- 
>dstanbro@spam.o.matic.bigpond.net.au


* Re: Wide_[Wide_]Character
  2008-07-12  7:44 Wide_[Wide_]Character Dale Stanbrough
  2008-07-12  8:11 ` Wide_[Wide_]Character Dmitry A. Kazakov
  2008-07-12 10:11 ` Wide_[Wide_]Character anon
@ 2008-07-22 19:18 ` Adam Beneschan
  2 siblings, 0 replies; 12+ messages in thread
From: Adam Beneschan @ 2008-07-22 19:18 UTC (permalink / raw)

On Jul 12, 12:44 am, Dale Stanbrough <MrNoS...@bigpoop.net.au> wrote:
> Unicode can be represented using UTF-8, UTF-16 and UTF-32 (amongst
> others).
>
> I gather that Character is simply ISO-8859-1 (Latin-1).
>
> I suspect that Wide_Character is UCS-2 (simple 2 byte values, no escapes
> like UTF-16).
>
> Is Wide_Wide_Character
>
>    * UTF-16
>    * UTF-32 (i.e. UCS-4)
>    * System dependent
>    * Something else
>
> Thanks,
>
> Dale

I'm not convinced that the question makes sense.  Wide_Character
refers to an enumeration type with 2**16 literals, where
Wide_Character'Val(N) denotes the corresponding character in the ISO
10646 Basic Multilingual Plane, i.e. Unicode.  Unicode is a
*character* *set*, i.e. a definition of what character corresponds to
each integer; it says nothing about how characters are represented.
Wide_Wide_Character is similarly an enumeration type with 2**31
literals (the 2147483648 code positions mentioned in RM 3.5.2).

When a sequence of characters is represented in internal memory, it's
up to an implementation to decide how to represent each character in
memory.  But in most cases, it makes no sense to represent it as
anything other than a flat array.  Thus, a Wide_String would be, in
essence, an array of 16-bit integers, and a Wide_Wide_String would be
an array of 32-bit integers.  If it were represented otherwise, how
could a program access, say, S(1000) where S is declared as a
Wide_Wide_String(1..2000)?  If it were represented as, say, UTF-8 or
UTF-16, the program would have to start at the beginning of the string
and do an expensive search every time it wanted to access one
particular character of the string.  This would not make sense.  So I
think that any implementation would implement those character (and
string) types as an integer (or array of integers), with whatever
endianness is most convenient for that processor.
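
A small sketch of the indexing argument (Index_Demo, Continuation and
Start_Of are hypothetical names invented for this example, not part of
any library): with a flat Wide_Wide_String the fourth character is just
W (4), while with UTF-8 bytes held in a String the program has to scan
from the front, skipping continuation bytes, to find where the fourth
character starts.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Index_Demo is
   --  UTF-8 continuation bytes have the bit pattern 10xxxxxx.
   subtype Continuation is Character
     range Character'Val (16#80#) .. Character'Val (16#BF#);

   --  Index of the byte that starts the N-th code point of a UTF-8
   --  encoded String, or 0 if there are fewer than N code points.
   function Start_Of (Item : String; N : Positive) return Natural is
      Count : Natural := 0;
   begin
      for I in Item'Range loop
         if Item (I) not in Continuation then  --  ASCII or lead byte
            Count := Count + 1;
            if Count = N then
               return I;
            end if;
         end if;
      end loop;
      return 0;
   end Start_Of;

   --  'a', U+00E9, U+1D11E, 'b' written out as UTF-8 bytes.
   U8 : constant String :=
     'a' & Character'Val (16#C3#) & Character'Val (16#A9#)
     & Character'Val (16#F0#) & Character'Val (16#9D#)
     & Character'Val (16#84#) & Character'Val (16#9E#) & 'b';

   --  The same four characters as a flat array.
   W : constant Wide_Wide_String :=
     ('a', Wide_Wide_Character'Val (16#E9#),
      Wide_Wide_Character'Val (16#1D11E#), 'b');
begin
   --  Linear scan: the fourth character starts at byte 8.
   Put_Line (Integer'Image (Start_Of (U8, 4)));
   --  Direct indexing: W (4) is 'b', position 98.
   Put_Line (Integer'Image (Wide_Wide_Character'Pos (W (4))));
end Index_Demo;
```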

When a sequence of characters is represented in a file (or is
communicated some other way e.g. over a socket), the characters may
well be encoded as UTF-8 or UTF-16 or something.  The language doesn't
define how different encodings are handled.  I believe GNAT uses the
"form" parameter when a file is opened or created to specify the
encoding; it supports a number of different possible encodings,
because different files that come from different places may be encoded
in different ways.  When a line is read from one of those files into
memory, though, I'm sure that the runtime will convert it to an
internal representation that is a flat array.
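
For what it's worth, a minimal sketch of that with GNAT (Write_UTF8 and
the file name clef.txt are made up for the example; the Form string
"WCEM=8" is GNAT-specific, as far as I know, and other compilers may
ignore or reject it):

```ada
with Ada.Wide_Wide_Text_IO; use Ada.Wide_Wide_Text_IO;

procedure Write_UTF8 is
   F : File_Type;

   --  One character outside the BMP: U+1D11E.
   Clef : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#1D11E#));
begin
   --  Ask the GNAT runtime to encode wide characters as UTF-8
   --  when writing to this file.
   Create (F, Out_File, "clef.txt", Form => "WCEM=8");
   Put_Line (F, Clef);
   Close (F);
end Write_UTF8;
```

The program only ever sees a flat Wide_Wide_String; the encoding is
applied by the runtime at the file boundary.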

I'm not sure if this tells you what you need to know or not; if not,
then if you tell us why you're asking the question (i.e. what you want
to accomplish), this will give us a better idea of what we need to
tell you.  If you're trying to do some sort of overlay, where you read
in raw bytes from a file and then use Unchecked_Conversion or
something to convert it to a Wide_Wide_String, or something of that
nature, my advice is: Just don't do that.
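
If the bytes really are UTF-8, an explicit decoding step is the safer
route. A sketch only, assuming a compiler that provides the Ada 2012
package Ada.Strings.UTF_Encoding.Wide_Wide_Strings (which did not exist
when this thread was written); Decode_Demo is a made-up name:

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Decode_Demo is
   package WW renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Raw UTF-8 bytes for U+1D11E, as they might arrive from a file.
   Raw : constant String :=
     Character'Val (16#F0#) & Character'Val (16#9D#)
     & Character'Val (16#84#) & Character'Val (16#9E#);

   --  Decode converts the four bytes into one Wide_Wide_Character.
   Text : constant Wide_Wide_String := WW.Decode (Raw);
begin
   Put_Line ("Length:" & Integer'Image (Text'Length));
   Put_Line ("Code point:"
             & Integer'Image
                 (Wide_Wide_Character'Pos (Text (Text'First))));
end Decode_Demo;
```

Here the length is 1 and the code point is 119070 (16#1D11E#), whereas
an overlay of the same four bytes would yield a value that depends on
the byte order of the machine.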

P.S. I know I'm coming in late to this thread---I just got back from
vacation.  If your question has already been answered, my apologies.

                                -- Adam

