* Wide_[Wide_]Character
From: Dale Stanbrough @ 2008-07-12 7:44 UTC

Unicode can be represented using UTF-8, UTF-16 and UTF-32 (amongst
others).

I gather that Character is simply ISO-8859-1 (Latin-1).

I suspect that Wide_Character is UCS-2 (simple 2 byte values, no escapes
like UTF-16).

Is Wide_Wide_Character

  * UTF-16
  * UTF-32 (i.e. UCS-4)
  * System dependent
  * Something else

Thanks,

Dale

--
dstanbro@spam.o.matic.bigpond.net.au
* Re: Wide_[Wide_]Character
From: Dmitry A. Kazakov @ 2008-07-12 8:11 UTC

On Sat, 12 Jul 2008 07:44:38 GMT, Dale Stanbrough wrote:

> Unicode can be represented using UTF-8, UTF-16 and UTF-32 (amongst
> others).
>
> I gather that Character is simply ISO-8859-1 (Latin-1).
>
> I suspect that Wide_Character is UCS-2 (simple 2 byte values, no escapes
> like UTF-16).
>
> Is Wide_Wide_Character
>
>   * UTF-16
>   * UTF-32 (i.e. UCS-4)
>   * System dependent
>   * Something else

RM 3.5.2 talks about "code positions" (= code points, I guess), represented
by Wide_Wide_Character. From this I conclude that it shall be UCS-4 with
some implementation-defined endianness.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: Wide_[Wide_]Character
From: Dale Stanbrough @ 2008-07-12 11:00 UTC

Dmitry A. Kazakov wrote:

> RM 3.5.2 talks about "code positions" (=code points, I guess), represented
> by Wide_Wide_Character. From this I conclude that it shall be UCS-4 with
> some implementation-defined endianness.

Code points can be represented by any set of encodings. Wide_Character
seems to deliberately confine itself to the BMP, so UCS-2 would suffice
(and seems implied).

I can't see any implication that would cause me to think
Wide_Wide_Character is definitely UCS-4 (and not UTF-16).

Dale

--
dstanbro@spam.o.matic.bigpond.net.au
* Re: Wide_[Wide_]Character
From: Peter C. Chapin @ 2008-07-12 11:27 UTC

Dale Stanbrough wrote:

> I can't see any implication that would cause me to think
> Wide_Wide_Character is definitely UCS-4 (and not UTF-16).

Well, section 3.5.2 (Character Types) in the Ada 2005 reference manual says:

  "The predefined type Wide_Wide_Character is a character type whose
  values correspond to the 2147483648 code positions of the ISO/IEC
  10646:2003 character set. Each of the graphic_characters has a
  corresponding character_literal in Wide_Wide_Character. The first
  65536 values of Wide_Wide_Character have the same character_literal
  or language-defined name as defined for Wide_Character."

I understand that this doesn't speak to the issue of encoding, but
perhaps that is intended to be left unspecified. In any event it seems
fairly clear that you should be able to store any of 2147483648 values
in a single Wide_Wide_Character variable. Doesn't that mean
Wide_Wide_Character needs to be (at least) 32 bits?

Peter
* Re: Wide_[Wide_]Character
From: Georg Bauhaus @ 2008-07-12 12:25 UTC

Peter C. Chapin wrote:

> I understand that this doesn't speak to the issue of encoding, but
> perhaps that is intended to be left unspecified. In any event it seems
> fairly clear that you should be able to store any of 2147483648 values
> in a single Wide_Wide_Character variable. Doesn't that mean
> Wide_Wide_Character needs to be (at least) 32 bits?

Package Standard specifies 'Size of Wide_Wide_Character:

    type Wide_Wide_Character is
       (nul, soh ... Hex_7FFFFFFE, Hex_7FFFFFFF);
    for Wide_Wide_Character'Size use 32;

Annex B has some hints as to the internal representation:

  43.a/2 Discussion: The C types wchar_t and char16_t seem to be the same.
  However, wchar_t has an implementation-defined size, whereas
  char16_t is guaranteed to be an unsigned type of at least 16 bits.
  Also, char16_t and char32_t are encouraged to have UTF-16 and UTF-32
  representations; that means that they are not directly the same as
  the Ada types, which most likely don't use any UTF encoding.

Isn't this just like the RM not specifying the bit layout of numeric
objects?

--
Georg Bauhaus
Y A Time Drain http://www.9toX.de
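The size clause quoted from package Standard can be checked directly on a given compiler. The following is only a minimal sketch: the procedure name Show_Character_Sizes and the helper type Code_Position are made up for illustration and appear nowhere in the RM.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Character_Sizes is
   --  Hypothetical helper type, wide enough to print the largest
   --  code position without relying on the range of Integer.
   type Code_Position is range 0 .. 2**31 - 1;

   --  Position number of the last value of the type (a static,
   --  universal-integer named number).
   Last_Pos : constant :=
     Wide_Wide_Character'Pos (Wide_Wide_Character'Last);
begin
   Put_Line ("Character'Size           ="
             & Integer'Image (Character'Size));
   Put_Line ("Wide_Character'Size      ="
             & Integer'Image (Wide_Character'Size));
   Put_Line ("Wide_Wide_Character'Size ="
             & Integer'Image (Wide_Wide_Character'Size));
   Put_Line ("Last code position       ="
             & Code_Position'Image (Last_Pos));
end Show_Character_Sizes;
```

The 'Size values reported are those of the subtypes; as the rest of the thread discusses, this says nothing about how an external file encodes the same characters.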
* Re: Wide_[Wide_]Character
From: Dale Stanbrough @ 2008-07-15 12:37 UTC

Georg Bauhaus wrote:

> Package Standard specifies 'Size of Wide_Wide_Character:
>
>     type Wide_Wide_Character is
>        (nul, soh ... Hex_7FFFFFFE, Hex_7FFFFFFF);
>     for Wide_Wide_Character'Size use 32;

Thanks, I hadn't seen that.

> Annex B has some hints as to the internal representation:
>
>   43.a/2 Discussion: The C types wchar_t and char16_t seem to be the same.
>   However, wchar_t has an implementation-defined size, whereas
>   char16_t is guaranteed to be an unsigned type of at least 16 bits.
>   Also, char16_t and char32_t are encouraged to have UTF-16 and UTF-32
>   representations; that means that they are not directly the same as
>   the Ada types, which most likely don't use any UTF encoding.

This seems to be in reference to the Ada Interfaces.C types, not
Wide_Wide_Character.

> Isn't this just like the RM not specifying the bit layout of
> numeric objects?

I'm not sure what the point of Wide_Wide_Character is if not to deal
with Unicode (or ISO-10646:2003).

You could invent your own 32 bit Character code (or use the one the
vendor gives you), but playing in your own backyard doesn't seem very
productive.

To me the only point is if it implements the code.

Dale

--
dstanbro@spam.o.matic.bigpond.net.au
* Re: Wide_[Wide_]Character
From: Georg Bauhaus @ 2008-07-15 14:06 UTC

Dale Stanbrough wrote:

>> Isn't this just like the RM not specifying the bit layout of
>> numeric objects?
>
> I'm not sure what the point of Wide_Wide_Character is if not to deal
> with Unicode (or ISO-10646:2003).

Sure, Wide_Wide_Character deals with ISO-10646:2003; the normative
reference is listed in the LRM. You get I/O of those characters, and
compilers will document the external encodings you can use. I also got
to know how to pass Wide_Wide_Character objects into and out of my
program in case I must (that's the Interfaces[.C] part).

But why and when should I wonder what the internal bit layout of
Wide_Wide_Character objects actually is?

> You could invent your own 32 bit Character code (or use the one the
> vendor gives you), but playing in your own backyard doesn't seem very
> productive.

Why not? If it is faster to use 64 bit words for Wide_Wide_Character
operations, and if this does not waste too much first level cache, then
it seems like a good idea for a compiler to use 64 bits for
Wide_Wide_Character.

> To me the only point is if it implements the code.

Why?

--
Georg Bauhaus
Y A Time Drain http://www.9toX.de
* Re: Wide_[Wide_]Character
From: Dmitry A. Kazakov @ 2008-07-12 20:56 UTC

On Sat, 12 Jul 2008 11:00:05 GMT, Dale Stanbrough wrote:

> Dmitry A. Kazakov wrote:
>
>> RM 3.5.2 talks about "code positions" (=code points, I guess), represented
>> by Wide_Wide_Character. From this I conclude that it shall be UCS-4 with
>> some implementation-defined endianness.
>
> Code points can be represented by any set of encodings. Wide_Character
> seems to deliberately confine itself to the BMP, so UCS-2 would suffice
> (and seems implied).
>
> I can't see any implication that would cause me to think
> Wide_Wide_Character is definitely UCS-4 (and not UTF-16).

How about this: Wide_Wide_Character may obviously use only the encodings
which would make any Wide_Wide_String composed out of
Wide_Wide_Characters a properly encoded string in the same encoding.
This automatically excludes UTF-8 and UTF-16.

BTW, why do you care? (:-)) I wonder if there is any use of
Wide_[Wide_]Strings. IMO, anything one could wish from Unicode is
provided by UTF-8 and plain Strings...

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
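To make the UTF-16 point concrete: a code point outside the BMP needs two 16-bit code units (a surrogate pair), so no single 16-bit array element can stand for it. The sketch below works through the standard UTF-16 arithmetic for one such code point. The procedure name Show_Surrogates and the type Code_Unit are invented for this illustration.

```ada
with Ada.Text_IO;

procedure Show_Surrogates is
   --  Hypothetical type for one 16-bit UTF-16 code unit.
   type Code_Unit is range 0 .. 16#FFFF#;
   package Unit_IO is new Ada.Text_IO.Integer_IO (Code_Unit);

   --  U+1D11E MUSICAL SYMBOL G CLEF, a code point outside the BMP.
   Code_Point : constant := 16#1D11E#;

   --  UTF-16 subtracts 16#10000# and splits the remaining 20 bits
   --  into two 10-bit halves.
   Offset : constant := Code_Point - 16#1_0000#;

   High : constant Code_Unit := 16#D800# + Offset / 16#400#;
   Low  : constant Code_Unit := 16#DC00# + Offset mod 16#400#;
begin
   --  High is 16#D834# and Low is 16#DD1E#: two elements, not one.
   Unit_IO.Put (High, Base => 16);
   Ada.Text_IO.New_Line;
   Unit_IO.Put (Low, Base => 16);
   Ada.Text_IO.New_Line;
end Show_Surrogates;
```

An array whose elements are such code units would no longer have one element per character, which is the property the argument above relies on.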
* Re: Wide_[Wide_]Character
From: anon @ 2008-07-12 10:11 UTC

Ada Wide_Character is defined as ISO-10646:2003 (32-bit) (RM 3.2.2 (3/2)).
The Unicode version is 4.0.
Verified at http://www.unicode.org/versions/Unicode4.0.0/

In <MrNoSpam-A54511.17443812072008@news-server.bigpond.net.au>, Dale Stanbrough <MrNoSpam@bigpoop.net.au> writes:
>Unicode can be represented using UTF-8, UTF-16 and UTF-32 (amongst
>others).
>
>I gather that Character is simply ISO-8859-1 (Latin-1).
>
>I suspect that Wide_Character is UCS-2 (simple 2 byte values, no escapes
>like UTF-16).
>
>Is Wide_Wide_Character
>
>  * UTF-16
>  * UTF-32 (i.e. UCS-4)
>  * System dependent
>  * Something else
>
>
>Thanks,
>
>Dale
>
>--
>dstanbro@spam.o.matic.bigpond.net.au
* Re: Wide_[Wide_]Character
From: Dale Stanbrough @ 2008-07-12 10:58 UTC

In article <Dr%dk.113840$102.42319@bgtnsc05-news.ops.worldnet.att.net>,
anon@anon.org (anon) wrote:

> Ada Wide_Character is defined as ISO-10646:2003 (32-bit) (RM 3.2.2 (3/2)).
> The unicode version is 4.0.
> Verified at http://www.unicode.org/versions/Unicode4.0.0/

I think you mean 3.5.2.

It only says that it follows ISO-10646, but says nothing about it being
a 32 bit version (see http://unicode.org/faq/unicode_iso.html#3).

The Wikipedia entry also mentions that UTF-16 was an early extension to
UCS-2 (and by implication also supported by ISO-10646).

The character codes are the same as those supported by Unicode (in fact
10646 seems to be the Unicode character code point values but without
all of the sorting, script, locale etc. support).

The encodings are independent of the code set.

Dale

--
dstanbro@spam.o.matic.bigpond.net.au
* Re: Wide_[Wide_]Character
From: anon @ 2008-07-13 1:38 UTC

It is RM 3.5.2 (3/2).

But the RM just defines that Ada uses ISO-10646:2003 (32-bit). The
32-bit came from the Standard package and other places, which also
define ISO-10646:2003 as 32 bits.

The Unicode version 4.0 came from the web page, which states "The
character repertoire corresponds to ISO/IEC 10646:2003." That page is
owned by an agency that deals with the Unicode standard. At that site
and in other locations it states that for every "ISO/IEC" there is one
corresponding Unicode version, not multiple.

Also, Unicode version 4.0 supports all previous versions, with some
changes listed on the web page, just like Unicode version 5.0 supports
versions 4.0, 4.0.1, 4.1.0 etc. with some other changes.

Now, Ada does not define how to use all of its character set. That's up
to the programmers who are using Ada.

In <MrNoSpam-D2E6B0.20581212072008@news-server.bigpond.net.au>, Dale Stanbrough <MrNoSpam@bigpoop.net.au> writes:
>In article <Dr%dk.113840$102.42319@bgtnsc05-news.ops.worldnet.att.net>,
> anon@anon.org (anon) wrote:
>
>> Ada Wide_Character is defined as ISO-10646:2003 (32-bit) (RM 3.2.2 (3/2)).
>> The unicode version is 4.0.
>> Verified at http://www.unicode.org/versions/Unicode4.0.0/
>
>I think you mean 3.5.2.
>
>It only says that it follows ISO-10646, but says nothing about it being
>a 32 bit version (see http://unicode.org/faq/unicode_iso.html#3).
>
>The wikipedia entry also mentions that UTF-16 was an early extension to
>UCS-2 (and by implication also supported by ISO-10646).
>
>
>The character codes are the same as those supported by Unicode (in fact
>10646 seems to be the Unicode character code point values but without
>all of the sorting, script, locale etc support).
>
>The encodings are independent of the code set.
>
>Dale
>
>--
>dstanbro@spam.o.matic.bigpond.net.au
* Re: Wide_[Wide_]Character
From: Adam Beneschan @ 2008-07-22 19:18 UTC

On Jul 12, 12:44 am, Dale Stanbrough <MrNoS...@bigpoop.net.au> wrote:

> Unicode can be represented using UTF-8, UTF-16 and UTF-32 (amongst
> others).
>
> I gather that Character is simply ISO-8859-1 (Latin-1).
>
> I suspect that Wide_Character is UCS-2 (simple 2 byte values, no escapes
> like UTF-16).
>
> Is Wide_Wide_Character
>
>   * UTF-16
>   * UTF-32 (i.e. UCS-4)
>   * System dependent
>   * Something else
>
> Thanks,
>
> Dale

I'm not convinced that the question makes sense. Wide_Character refers
to an enumeration type with 2**16 literals, where Wide_Character'Val(N)
denotes the corresponding character in the ISO 10646 Basic Multilingual
Plane, i.e. Unicode. Unicode is a *character* *set*, i.e. a definition
of what character corresponds to each integer; it says nothing about
how characters are represented. Wide_Wide_Character is similarly an
enumeration type with 2**31 literals.

When a sequence of characters is represented in internal memory, it's
up to an implementation to decide how to represent each character in
memory. But in most cases, it makes no sense to represent it as
anything other than a flat array. Thus, a Wide_String would be, in
essence, an array of 16-bit integers, and a Wide_Wide_String would be
an array of 32-bit integers. If it were represented otherwise, how
could a program access, say, S(1000) where S is declared as a
Wide_Wide_String(1..2000)? If it were represented as, say, UTF-8 or
UTF-16, the program would have to start at the beginning of the string
and do an expensive search every time it wanted to access one
particular character of the string. This would not make sense.

So I think that any implementation would implement those character (and
string) types as an integer (or array of integers), with whatever
endianness is most convenient for that processor.

When a sequence of characters is represented in a file (or is
communicated some other way, e.g. over a socket), the characters may
well be encoded as UTF-8 or UTF-16 or something. The language doesn't
define how different encodings are handled. I believe GNAT uses the
"form" parameter when a file is opened or created to specify the
encoding; it supports a number of different possible encodings, because
different files that come from different places may be encoded in
different ways. When a line is read from one of those files into
memory, though, I'm sure that the runtime will convert it to an
internal representation that is a flat array.

I'm not sure if this tells you what you need to know or not; if not,
then if you tell us why you're asking the question (i.e. what you want
to accomplish), this will give us a better idea of what we need to tell
you. If you're trying to do some sort of overlay, where you read in raw
bytes from a file and then use Unchecked_Conversion or something to
convert it to a Wide_Wide_String, or something of that nature, my
advice is: Just don't do that.

P.S. I know I'm coming in late to this thread---I just got back from
vacation. If your question has already been answered, my apologies.

-- Adam
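The flat-array view described above can be illustrated with a short sketch. It assumes nothing beyond the predefined types; the procedure name Show_Indexing and the choice of U+1D11E as the sample character are illustrative only.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Indexing is
   --  U+1D11E lies outside the BMP, yet it is a single value of
   --  Wide_Wide_Character and therefore a single array element.
   G_Clef : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#1D11E#);

   S : constant Wide_Wide_String (1 .. 3) := ('a', G_Clef, 'b');
begin
   --  'Length counts characters, not bytes or 16-bit code units.
   Put_Line ("Length          =" & Integer'Image (S'Length));

   --  S (3) is reached by plain indexing; nothing before it has to
   --  be decoded first.
   Put_Line ("Position of S(3) ="
             & Integer'Image (Wide_Wide_Character'Pos (S (3))));
end Show_Indexing;
```

Here S (3) is the character 'b' regardless of what precedes it, which is exactly what a variable-width encoding such as UTF-8 or UTF-16 could not offer without scanning from the start of the string.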