Character Sets

comp.lang.ada

* Character Sets
@ 2002-11-26 21:41 Robert C. Leif
  0 siblings, 0 replies; 13+ messages in thread
From: Robert C. Leif @ 2002-11-26 21:41 UTC (permalink / raw)


From: Bob Leif
I am trying to test if a character is not in the Latin_1 character set.
I chose the Euro because it is in Latin_9 and not in Latin_1. I tested
the function Ada.Strings.Maps.Is_In. It returns that the Euro_Sign is in
the Latin_1 character set. What have I done wrong?
My test program, which compiled and executed under GNAT 3.15p under
Windows XP, produced:
------------------------Starting Test-----------------------
Is_In_Character_Set is TRUE
------------------------Ending Test-----------------------
 The test program is as follows:
---------------------------------------------------------
with Ada.Text_Io;
with Ada.Io_Exceptions;
with Ada.Exceptions;
with Ada.Strings;
with Ada.Strings.Maps;
with  Ada.Characters.Latin_1;
with  Ada.Characters.Latin_9;
procedure Char_Sets_Test is 
   ------------------Table of Contents------------- 
   package T_Io renames Ada.Text_Io;
   package Str_Maps renames Ada.Strings.Maps;
   package Latin_1 renames Ada.Characters.Latin_1;
   package Latin_9 renames Ada.Characters.Latin_9;
   subtype Character_Set_Type is Str_Maps.Character_Set;
   -----------------End Table of Contents-------------
   Latin_1_Range    : constant Str_Maps.Character_Range
      := (Low => Latin_1.Nul, High => Latin_1.Lc_Y_Diaeresis);
   Latin_1_Char_Set :          Character_Set_Type
      := Str_Maps.To_Set (Span => Latin_1_Range);
   --Standard for Ada '95
   Is_In_Character_Set : Boolean := False;  
   ---------------------------------------------
begin--Bd_W_Char_Sets_Test
   T_Io.Put_Line("-----------------------Starting Test--------------------");
   ---------------------------------------------
   --Test Character_Sets
   Is_In_Character_Set:=Ada.Strings.Maps.Is_In (
      Element => Latin_9.Euro_Sign, 
      Set     => Latin_1_Char_Set);
   T_Io.Put_Line("Is_In_Character_Set is " & Boolean'Image (Is_In_Character_Set));
   ---------------------------------------------   
   ---------------------------------------------
   T_Io.Put_Line("-----------------------Ending Test----------------------");

exception
   when A: Ada.Io_Exceptions.Status_Error =>
      T_io.Put_Line("Status_Error in Char_Sets_Test.");
      T_Io.Put_Line(Ada.Exceptions.Exception_Information(A));
   when O: others =>
      T_Io.Put_Line("Others_Error in Char_Sets_Test.");
      T_Io.Put_Line(Ada.Exceptions.Exception_Information(O));
end Char_Sets_Test;





* Re: Character Sets
@ 2002-11-27  9:00 Grein, Christoph
  0 siblings, 0 replies; 13+ messages in thread
From: Grein, Christoph @ 2002-11-27  9:00 UTC (permalink / raw)


> From: Bob Leif
> I am trying to test if a character is not in the Latin_1 character set.
> I chose the Euro because it is in Latin_9 and not in Latin_1. I tested
> the function Ada.Strings.Maps.Is_In. It returns that the Euro_Sign is in
> the Latin_1 character set. What have I done wrong?
> My test program, which compiled and executed under GNAT 3.15p under
> Windows XP, produced:
> ------------------------Starting Test-----------------------
> Is_In_Character_Set is TRUE
> ------------------------Ending Test-----------------------
>  The test program is as follows:
> ---------------------------------------------------------
> with Ada.Text_Io;
> with Ada.Io_Exceptions;
> with Ada.Exceptions;
> with Ada.Strings;
> with Ada.Strings.Maps;
> with  Ada.Characters.Latin_1;
> with  Ada.Characters.Latin_9;
> procedure Char_Sets_Test is 
>    ------------------Table of Contents------------- 
>    package T_Io renames Ada.Text_Io;
>    package Str_Maps renames Ada.Strings.Maps;
>    package Latin_1 renames Ada.Characters.Latin_1;
>    package Latin_9 renames Ada.Characters.Latin_9;
>    subtype Character_Set_Type is Str_Maps.Character_Set;
>    -----------------End Table of Contents-------------
>    Latin_1_Range    : constant Str_Maps.Character_Range
> 	 := (Low => Latin_1.Nul, High => Latin_1.Lc_Y_Diaeresis);  

This is the full range of type Character, isn't it?

>    Latin_1_Char_Set :          Character_Set_Type
>       := Str_Maps.To_Set (Span => Latin_1_Range);

So this is the set of all characters.

>    --Standard for Ada '95
>    Is_In_Character_Set : Boolean := False;  
>    ---------------------------------------------
> begin--Bd_W_Char_Sets_Test
>    T_Io.Put_Line("-----------------------Starting Test--------------------");
>    ---------------------------------------------
>    --Test Character_Sets
>    Is_In_Character_Set:=Ada.Strings.Maps.Is_In (
>       Element => Latin_9.Euro_Sign, 
>       Set     => Latin_1_Char_Set);

Latin_9.Euro_Sign is a name for a character. The same character in Latin_1 has a
different name: it is the Currency_Sign.

So why do you expect this character not to be in the set only because you use a 
different name for it?
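
The point about names can be checked with a minimal sketch. It assumes GNAT's Ada.Characters.Latin_9 package (the one used in the test program above); the procedure name Same_Code_Demo is made up for illustration.

```ada
with Ada.Text_IO;
with Ada.Characters.Latin_1;
with Ada.Characters.Latin_9;  --  GNAT-provided package, as in the original test

procedure Same_Code_Demo is
   package T_IO renames Ada.Text_IO;
   package L1   renames Ada.Characters.Latin_1;
   package L9   renames Ada.Characters.Latin_9;
begin
   --  Both constants are ordinary values of type Character.
   T_IO.Put_Line ("Currency_Sign position:"
                  & Integer'Image (Character'Pos (L1.Currency_Sign)));
   T_IO.Put_Line ("Euro_Sign position:    "
                  & Integer'Image (Character'Pos (L9.Euro_Sign)));

   --  Equality compares the values, not the names of the constants.
   T_IO.Put_Line ("Equal: "
                  & Boolean'Image (L1.Currency_Sign = L9.Euro_Sign));
end Same_Code_Demo;
```

With GNAT's definitions both positions print as 164 and the comparison is TRUE, which is why Is_In cannot tell the two names apart.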

>    T_Io.Put_Line("Is_In_Character_Set is " & Boolean'Image (Is_In_Character_Set));
>    ---------------------------------------------   
>    ---------------------------------------------
>    T_Io.Put_Line("-----------------------Ending Test----------------------");
> 
> exception
>    when A: Ada.Io_Exceptions.Status_Error =>
>       T_io.Put_Line("Status_Error in Char_Sets_Test.");
>       T_Io.Put_Line(Ada.Exceptions.Exception_Information(A));
>    when O: others =>
>       T_Io.Put_Line("Others_Error in Char_Sets_Test.");
>       T_Io.Put_Line(Ada.Exceptions.Exception_Information(O));
> end Char_Sets_Test;




* Re: Character Sets
@ 2002-11-28 17:53 Robert C. Leif
  2002-11-29 12:28 ` Georg Bauhaus
  2002-12-02 18:28 ` Stephen Leake
  0 siblings, 2 replies; 13+ messages in thread
From: Robert C. Leif @ 2002-11-28 17:53 UTC (permalink / raw)


Christoph Grein responded to my inquiry by stating that,
" Latin_9.Euro_Sign is a name for a character. The same character in Latin_1 has a different name, it is the Currency_Sign."
"So why do you expect this character not to be in the set only because you use a different name for it?"
The Euro_Sign and the Currency_Sign have different representations according to The ISO 8859 Alphabet Soup http://czyborra.com/charsets/iso8859.html
------------------------------------------------
GNAT Latin_9 (ISO-8859-15) includes the following:
   -- Summary of Changes from Latin-1 => Latin-9 --
   ------------------------------------------------

   --   164     Currency                => Euro_Sign
   --   166     Broken_Bar              => UC_S_Caron
   --   168     Diaeresis               => LC_S_Caron
   --   180     Acute                   => UC_Z_Caron
   --   184     Cedilla                 => LC_Z_Caron
   --   188     Fraction_One_Quarter    => UC_Ligature_OE
   --   189     Fraction_One_Half       => LC_Ligature_OE
   --   190     Fraction_Three_Quarters => UC_Y_Diaeresis
Since these are changes, they should not be the same character.
Below are the results of an extension of my original program that now tests the characters of Latin_9 from character number 164 through 190 and prints them out. I understand that the choice of the Windows font will change their representation. The correct glyphs can be found at The ISO 8859 Alphabet Soup. For anyone interested, I have put my program at the end of this note.
I suspect that the best solution would be to introduce Unicode, ISO/IEC 10646, into the Ada standard. The arguments for this are contained in W3C Character Model for the World Wide Web 1.0, W3C Working Draft 30 April 2002
http://www.w3.org/TR/charmod/
"The choice of Unicode was motivated by the fact that Unicode: is the only universal character repertoire available, covers the widest possible range, provides a way of referencing characters independent of the encoding of a resource, is being updated/completed carefully, is widely accepted and implemented by industry."
"W3C adopted Unicode as the document character set for HTML in [HTML 4.0]. The same approach was later used for specifications such as XML 1.0 [XML 1.0] and CSS2 [CSS2]. Unicode now serves as a common reference for W3C specifications and applications."
"The IETF has adopted some policies on the use of character sets on the Internet (see [RFC 2277])."
Bob Leif
------------------------Starting Test-----------------------
Latin_9_Diff is ñÑªº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛

The Character ñ is in Latin_1 is TRUE. Its position is  164
The Character Ñ is in Latin_1 is TRUE. Its position is  165
The Character ª is in Latin_1 is TRUE. Its position is  166
The Character º is in Latin_1 is TRUE. Its position is  167
The Character ¿ is in Latin_1 is TRUE. Its position is  168
The Character ⌐ is in Latin_1 is TRUE. Its position is  169
The Character ¬ is in Latin_1 is TRUE. Its position is  170
The Character ½ is in Latin_1 is TRUE. Its position is  171
The Character ¼ is in Latin_1 is TRUE. Its position is  172
The Character ¡ is in Latin_1 is TRUE. Its position is  173
The Character « is in Latin_1 is TRUE. Its position is  174
The Character » is in Latin_1 is TRUE. Its position is  175
The Character ░ is in Latin_1 is TRUE. Its position is  176
The Character ▒ is in Latin_1 is TRUE. Its position is  177
The Character ▓ is in Latin_1 is TRUE. Its position is  178
The Character │ is in Latin_1 is TRUE. Its position is  179
The Character ┤ is in Latin_1 is TRUE. Its position is  180
The Character ╡ is in Latin_1 is TRUE. Its position is  181
The Character ╢ is in Latin_1 is TRUE. Its position is  182
The Character ╖ is in Latin_1 is TRUE. Its position is  183
The Character ╕ is in Latin_1 is TRUE. Its position is  184
The Character ╣ is in Latin_1 is TRUE. Its position is  185
The Character ║ is in Latin_1 is TRUE. Its position is  186
The Character ╗ is in Latin_1 is TRUE. Its position is  187
The Character ╝ is in Latin_1 is TRUE. Its position is  188
The Character ╜ is in Latin_1 is TRUE. Its position is  189
The Character ╛ is in Latin_1 is TRUE. Its position is  190
------------------------Ending Test-----------------------
--Robert C. Leif, Ph.D & Ada_Med Copyright all rights reserved.
--Main Procedure 
--Created 27 November 2002
with Ada.Text_Io;
with Ada.Io_Exceptions;
with Ada.Exceptions;
with Ada.Strings;
with Ada.Strings.Maps;
with  Ada.Characters.Latin_1;
with  Ada.Characters.Latin_9;
procedure Char_Sets_Test is 
   ------------------Table of Contents------------- 
   package T_Io renames Ada.Text_Io;
   package Str_Maps renames Ada.Strings.Maps;
   package Latin_1 renames Ada.Characters.Latin_1;
   package Latin_9 renames Ada.Characters.Latin_9;
   subtype Character_Set_Type is Str_Maps.Character_Set;
   subtype Character_Sequence_Type is Str_Maps.Character_Sequence;

   -----------------End Table of Contents-------------
   Latin_1_Range    : constant Str_Maps.Character_Range
      := (Low => Latin_1.Nul, High => Latin_1.Lc_Y_Diaeresis);  
   Latin_1_Char_Set :          Character_Set_Type      
      := Str_Maps.To_Set (Span => Latin_1_Range);  
   --Standard for Ada '95
   -- Latin_9 Differences: Euro_Sign, Uc_S_Caron, Lc_S_Caron, Uc_Z_Caron, 
   -- Lc_Z_Caron, Uc_Ligature_Oe, Lc_Ligature_Oe, Uc_Y_Diaeresis.
   Latin_9_Diff_Latin_1_Super_Range  : constant Str_Maps.Character_Range
      := (Low => Latin_9.Euro_Sign, High => Latin_9.Uc_Y_Diaeresis);  
   Latin_9_Diff_Latin_1_Super_Set    :          Character_Set_Type      
      := Str_Maps.To_Set (Span => Latin_9_Diff_Latin_1_Super_Range);  
   Latin_9_Diff_Latin_1_Super_String :          Character_Sequence_Type 
      := Str_Maps.To_Sequence (Latin_9_Diff_Latin_1_Super_Set);  
   Character_Set_Name                :          String                 
      := "Latin_1";  
   ---------------------------------------------   
   procedure Test_Character_Sets (
         Character_Sequence_Var : in     Character_Sequence_Type; 
         Set                    : in     Character_Set_Type       ) is 
      Is_In_Character_Set : Boolean   := False;  
      Char                : Character := 'X';  
      Character_Set_Position : Positive := 164; -- Euro_Sign   
   begin--Test_Character_Sets
      T_Io.Put_Line("Latin_9_Diff is " & Latin_9_Diff_Latin_1_Super_String);
      T_Io.Put_Line("");
      Test_Chars:
         for I in Character_Sequence_Var'range loop
         Char:= Character_Sequence_Var(I);
         Is_In_Character_Set:= Str_Maps.Is_In(
            Element => Char,            
            Set     => Set);
         T_Io.Put_Line("The Character " & Char & " is in " & Character_Set_Name
            &  " is " & Boolean'Image (
               Is_In_Character_Set) & ". Its position is "
                  & Positive'Image(Character_Set_Position));
         Character_Set_Position:= Character_Set_Position + 1;
      end loop Test_Chars;
   end Test_Character_Sets;
   ---------------------------------------------     
begin--Bd_W_Char_Sets_Test
   T_Io.Put_Line("----------------------Starting Test---------------------");
   Test_Character_Sets (
      Character_Sequence_Var => Latin_9_Diff_Latin_1_Super_String, 
      Set                    => Latin_1_Char_Set);
   ---------------------------------------------
   T_Io.Put_Line("------------------------Ending Test---------------------");

exception
   when A: Ada.Io_Exceptions.Status_Error =>
      T_Io.Put_Line("Status_Error in Char_Sets_Test.");
      T_Io.Put_Line(Ada.Exceptions.Exception_Information(A));
   when O: others =>
      T_Io.Put_Line("Others_Error in Char_Sets_Test.");
      T_Io.Put_Line(Ada.Exceptions.Exception_Information(O));

end Char_Sets_Test;





* Re: Character Sets
  2002-11-28 17:53 Robert C. Leif
@ 2002-11-29 12:28 ` Georg Bauhaus
  2002-12-02 18:28 ` Stephen Leake
  1 sibling, 0 replies; 13+ messages in thread
From: Georg Bauhaus @ 2002-11-29 12:28 UTC (permalink / raw)


Robert C. Leif <rleif@rleif.com> wrote:
: I suspect that the best solution would be to introduce Unicode,
: ISO/IEC 10646, into the Ada standard. The arguments for this are
: contained in W3C Character Model for the World Wide Web 1.0, W3C
: Working Draft 30 April 2002

Yes, and with Wide_String you can have the Basic Multilingual Plane,
as per ISO 10646.  There is at least one compiler with support for
different wide character encodings.

-- georg




* Re: Character Sets
  2002-11-28 17:53 Robert C. Leif
  2002-11-29 12:28 ` Georg Bauhaus
@ 2002-12-02 18:28 ` Stephen Leake
  2002-12-03  2:45   ` Robert C. Leif
  1 sibling, 1 reply; 13+ messages in thread
From: Stephen Leake @ 2002-12-02 18:28 UTC (permalink / raw)

"Robert C. Leif" <rleif@rleif.com> writes:

> Christoph Grein responded to my inquiry by stating that, "
> Latin_9.Euro_Sign is a name for a character. The same character in
> Latin_1 has a different name, it is the Currency_Sign." "So why do
> you expect this character not to be in the set only because you use
> a different name for it?" The Euro_Sign and the Currency_Sign have a
> different representation according to The ISO 8859 Alphabet Soup
> http://czyborra.com/charsets/iso8859.html
> ------------------------------------------------ GNAT Latin_9
> (ISO-8859-15)includes the following: -- Summary of Changes from
> Latin-1 => Latin-9 --
> ------------------------------------------------
> 
>    --   164     Currency                => Euro_Sign
>    --   166     Broken_Bar              => UC_S_Caron
>    --   168     Diaeresis               => LC_S_Caron
>    --   180     Acute                   => UC_Z_Caron
>    --   184     Cedilla                 => LC_Z_Caron
>    --   188     Fraction_One_Quarter    => UC_Ligature_OE
>    --   189     Fraction_One_Half       => LC_Ligature_OE
>    --   190     Fraction_Three_Quarters => UC_Y_Diaeresis

Hmm. This says to me:

"In the Latin-1 character set, the character with internal value 164
is called 'Currency'. In the Latin-9 character set, the character with
internal value 164 is called 'Euro_Sign'".

Presumably, elsewhere in the Latin-1 and Latin-9 standards, they
specify the "glyph" used to display those characters on a screen or
paper, and the glyph for character 164 is different between Latin-1
and Latin-9.

> Since these are changes, they should not be the same character.

By "same character", we (and Ada) mean "same internal value", ie
"164". However, I suspect you mean "same glyph", in which case they
are not the "same character"; they do not have the same glyph.
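
If the goal is to find the positions whose meaning changes between the two standards, one option is to build the set explicitly instead of from a range. The sketch below is only an illustration: the eight positions come from the "Summary of Changes" table quoted above, and the names Changed_Positions_Demo, Changed_Codes, and Changed_Set are invented for this example.

```ada
with Ada.Text_IO;
with Ada.Strings.Maps;

procedure Changed_Positions_Demo is
   package T_IO renames Ada.Text_IO;
   package Maps renames Ada.Strings.Maps;

   --  Positions taken from the "Summary of Changes" table in this thread.
   Changed_Codes : constant String :=
     (Character'Val (164), Character'Val (166), Character'Val (168),
      Character'Val (180), Character'Val (184), Character'Val (188),
      Character'Val (189), Character'Val (190));

   --  A set holding only the eight positions that differ.
   Changed_Set : constant Maps.Character_Set :=
     Maps.To_Set (Sequence => Changed_Codes);
begin
   for Code in 164 .. 190 loop
      if Maps.Is_In (Element => Character'Val (Code),
                     Set     => Changed_Set)
      then
         T_IO.Put_Line (Integer'Image (Code)
                        & " is read differently in Latin-1 and Latin-9");
      end if;
   end loop;
end Changed_Positions_Demo;
```

This still tests internal values only; it reports the eight listed positions and says nothing about which glyph a given font will draw.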

> Below are the results of an extension of my original program that
> now tests the characters of Latin_9 from character number 164
> through 190 and prints them out. 

What results would you like from this program?

> I understand that choice of the Windows font will change their
> representation.

Yes, because the choice of font determines the glyph.

> anyone interested, I have put my program at the end of this note. I
> suspect that the best solution would be to introduce UniCode,

I'm not clear what the "problem" is, so I can't tell if this is a
"solution". 

-- 
-- Stephe


* RE: Character Sets
  2002-12-02 18:28 ` Stephen Leake
@ 2002-12-03  2:45   ` Robert C. Leif
  2002-12-03 13:33     ` Robert A Duff
  0 siblings, 1 reply; 13+ messages in thread
From: Robert C. Leif @ 2002-12-03  2:45 UTC (permalink / raw)

Since XML documents specify a character set, it would be useful to have
the equivalent in Ada. The common meaning of character refers to what
one sees on a screen or paper. If one only considers position, then
Latin-1 and Latin-9 are identical. 
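
One way to make the character set explicit in Ada 95 is to convert an 8-bit code to a Wide_Character (ISO 10646) according to the set the data is declared to be in. The following is a sketch under that assumption, not a standard facility: From_Latin_9 is a hypothetical helper, and the code points in it are the ISO 10646 values for the eight positions that differ from Latin-1.

```ada
with Ada.Text_IO;

procedure Latin_9_To_Unicode_Demo is
   package T_IO renames Ada.Text_IO;

   --  Hypothetical helper: treat Item as ISO-8859-15 (Latin-9) and
   --  return the ISO 10646 character it stands for.
   function From_Latin_9 (Item : Character) return Wide_Character is
   begin
      case Item is
         when Character'Val (164) => return Wide_Character'Val (16#20AC#);
         when Character'Val (166) => return Wide_Character'Val (16#0160#);
         when Character'Val (168) => return Wide_Character'Val (16#0161#);
         when Character'Val (180) => return Wide_Character'Val (16#017D#);
         when Character'Val (184) => return Wide_Character'Val (16#017E#);
         when Character'Val (188) => return Wide_Character'Val (16#0152#);
         when Character'Val (189) => return Wide_Character'Val (16#0153#);
         when Character'Val (190) => return Wide_Character'Val (16#0178#);
         when others =>
            --  Every other position means the same as in Latin-1.
            return Wide_Character'Val (Character'Pos (Item));
      end case;
   end From_Latin_9;

   Code : constant Character := Character'Val (164);
begin
   T_IO.Put_Line ("Read as Latin-1:"
                  & Integer'Image (Character'Pos (Code)));
   T_IO.Put_Line ("Read as Latin-9:"
                  & Integer'Image (Wide_Character'Pos (From_Latin_9 (Code))));
end Latin_9_To_Unicode_Demo;
```

The same byte gives 164 when read as Latin-1 and 8364 (16#20AC#, the euro sign) when read as Latin-9, so the distinction lives in the conversion rather than in type Character.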
I might note that I do not see how with Ada 95 one could directly create
a bounded string or unbounded string of wide characters?
Bob Leif


* Re: Character Sets
  2002-12-03  2:45   ` Robert C. Leif
@ 2002-12-03 13:33     ` Robert A Duff
  2002-12-03 15:32       ` Juanma Barranquero
  2002-12-04  0:49       ` Robert C. Leif
  0 siblings, 2 replies; 13+ messages in thread
From: Robert A Duff @ 2002-12-03 13:33 UTC (permalink / raw)

"Robert C. Leif" <rleif@rleif.com> writes:

> I might note that I do not see how with Ada 95 one could directly create
> a bounded string or unbounded string of wide characters?

Umm, you could use the Strings.Wide_Bounded and Strings.Wide_Unbounded
packages.  ;-)

These are documented in RM-A.4.7.
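
A minimal sketch of both packages follows. The procedure name, the bound of 32, and the sample text are arbitrary choices for illustration; lengths are printed with Ada.Text_IO so the output does not depend on the console's wide character encoding.

```ada
with Ada.Text_IO;
with Ada.Strings.Wide_Bounded;
with Ada.Strings.Wide_Unbounded;

procedure Wide_Strings_Demo is
   package T_IO renames Ada.Text_IO;
   package WU   renames Ada.Strings.Wide_Unbounded;

   --  Bounded wide strings need an instantiation with a maximum length.
   package WB is
     new Ada.Strings.Wide_Bounded.Generic_Bounded_Length (Max => 32);

   --  The euro sign as an ISO 10646 code point.
   Euro : constant Wide_Character := Wide_Character'Val (16#20AC#);

   U : WU.Unbounded_Wide_String := WU.To_Unbounded_Wide_String ("Price: ");
   B : WB.Bounded_Wide_String   := WB.To_Bounded_Wide_String ("Price: ");
begin
   WU.Append (U, Euro);
   WB.Append (B, Euro);

   T_IO.Put_Line ("Unbounded length:" & Natural'Image (WU.Length (U)));
   T_IO.Put_Line ("Bounded length:  " & Natural'Image (WB.Length (B)));
end Wide_Strings_Demo;
```

Both lengths come out as 8: seven characters of text plus the appended euro sign.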

There is also an AI in the works, having something to do with 32-bit
characters.  I don't remember the AI number.

- Bob


* Re: Character Sets
  2002-12-03 13:33     ` Robert A Duff
@ 2002-12-03 15:32       ` Juanma Barranquero
  2002-12-04  0:49       ` Robert C. Leif
  1 sibling, 0 replies; 13+ messages in thread
From: Juanma Barranquero @ 2002-12-03 15:32 UTC (permalink / raw)


On Tue, 3 Dec 2002 13:33:24 GMT, Robert A Duff
<bobduff@shell01.TheWorld.com> wrote:

>There is also an AI in the works, having something to do with 32-bit
>characters.  I don't remember the AI number.

AI-00285, perhaps:

!subject Latin-9, Ada.Characters.Handling, and 32-bit characters


                                                      /L/e/k/t/u





* RE: Character Sets
  2002-12-03 13:33     ` Robert A Duff
  2002-12-03 15:32       ` Juanma Barranquero
@ 2002-12-04  0:49       ` Robert C. Leif
  2002-12-14  3:27         ` David Starner
  1 sibling, 1 reply; 13+ messages in thread
From: Robert C. Leif @ 2002-12-04  0:49 UTC (permalink / raw)


Many thanks,
Bob Leif


* Re: Character Sets
  2002-12-04  0:49       ` Robert C. Leif
@ 2002-12-14  3:27         ` David Starner
  2002-12-14 22:53           ` Vadim Godunko
  0 siblings, 1 reply; 13+ messages in thread
From: David Starner @ 2002-12-14  3:27 UTC (permalink / raw)


> There is also an AI in the works, having something to do with 32-bit
> characters.  I don't remember the AI number.

In response to AI-00285:

Why is Latin-9's introduction such a big deal? Latin-1 is still the
"standard" 8-bit character set, and so immortalized in HTML and
other places. Latin-9 is just another character set, no more
important than any other 8-bit set. Sure, people in Western
Europe are using it; but I bet more people still use Latin-1
than Latin-9, and more people probably use KOI8-R than Latin-9.
There are many character sets out there; adding support for just
one more doesn't help things. Especially as anyone writing for
international systems needs at the very least to set the character
set at startup rather than at compile time.


From: Pascal Leroy
> I still think
> that we want to retain the capacity of using 16-bit blobs to represent
> characters in the BMP, as 99.5% of practical applications will only need the
> BMP.

I sort of feel like this is saying that 99.5% of practical
applications will never need a "q". For any program that handles text,
there shouldn't be arbitrary restrictions on what comes in and out; a
program that handles Unicode should handle Unicode, instead of the
subset the programmer thought people would use. That's half the use of
Unicode; being able to use Latin letter Kra, and knowing that you
aren't limited to the systems that handle ISO-6937, or Ogham and
NSAI-434.

> Anyway, I don't think it is reasonable to force applications to go to the
> full 32-bit overhead just because they use, say, the french OE ligature.

Applications don't use the French OE ligature; users do. And
arbitrarily limiting users does not make your system a pleasure to
use.

In any case, how much overhead are we talking? In worst case
scenarios, we're talking a doubling of the memory the program uses.
But embedded systems are rarely heavy text users, and can probably
stay with Latin-1. I don't work with text files much larger than a
megabyte, and don't know of anyone who does. And if you're working
with large amounts of data and need to reduce size, compression, both
standard (e.g. LZW) and Unicode-specific (e.g. SCSU or BOCU-1), works
better than just using 16 bits.

> We certainly don't want to get into that business.  The designers of Ada 95
> wisely decided to lump all of the characters in the range 16#0100# ..
> 16#FFFD# into the category special_character, so that they don't have to
> decide which is a letter, a number, etc.  Similarly they didn't provide
> classification functions or upper/lower conversions for wide characters.

So it's left for a dozen implementations to do.

> This seems reasonable if we don't want to have to amend Ada each time a
> bunch of characters are added to 10646.

Why would you have to amend Ada? Add a Unicode version constant, and
define the data in terms of its Unicode properties. Then the
recentness of the characters is just a quality of implementation
issue.

From: Robert Dewar
> We certainly
> put in a lot of work in GNAT in implementing wide character with many
> different representation schemes,

GNAT supports input files in a dozen mostly bizarre or archaic
formats. It doesn't strike me as very useful, especially considering
that it supports Latin-1, Latin-2 (both useful), but also Latin-4
(completely unused) and Latin-3 (good for Maltese and Esperanto, and
most Esperanto users don't use it). It doesn't support ISO-8859-5 or
KOI8-R (Russian), or ISO-8859-7 (Greek). It doesn't support changing
formats on the fly - many users have multiple encodings around,
besides the fact that having to compile a different binary for each
user is a pain. Oh, and last time I submitted a bug on it, it got
ignored, until I brought it up on the gcc list, when it was pointed
out that the feature I was using (style checking on source files)
wasn't supported with UTF-8.

From: Pascal Leroy
> Remember, we are talking Ada applications here.  There are probably many
> applications out there that deal with mathematical symbols or with Tengwar, 
> but I doubt that they are written in Ada.

Mathematical symbols and Tengwar are text. Any text handling system
that supports Unicode should handle them like any other text, because
sooner or later users will expect it to handle them. (If you're
unlucky, it will be the day that you're showing your system off in
Hong Kong, and the potential buyer decides to put in his name that
isn't in the BMP.) If people don't want Ada to be a general-purpose
programming language, then that's fine; but it's not acceptable for a
general-purpose programming language not to be able to handle text,
and for a modern language, that means Unicode.




* Re: Character Sets
  2002-12-14  3:27         ` David Starner
@ 2002-12-14 22:53           ` Vadim Godunko
  2002-12-15  3:46             ` David Starner
  2002-12-15 23:26             ` Robert C. Leif
  0 siblings, 2 replies; 13+ messages in thread
From: Vadim Godunko @ 2002-12-14 22:53 UTC (permalink / raw)


starner@okstate.edu (David Starner) wrote in message news:<81f70ac6.0212131927.4fa6b642@posting.google.com>...
> 
> > This seems reasonable if we don't want to have to amend Ada each time a
> > bunch of characters are added to 10646.
> 
> Why would you have to amend Ada? Add a Unicode version constant, and
> define the data in terms of its Unicode properties. Then the
> recentness of the characters is just a quality of implementation
> issue.
> 
How much memory is required to store all the data from the Unicode
Character Database? What do you do if this constant changes? Retest all
existing applications?

> From: Robert Dewar
> > We certainly
> > put in a lot of work in GNAT in implementing wide character with many
> > different representation schemes,
> 
> GNAT supports input files in a dozen mostly bizarre or archaic
> formats. It doesn't strike me as very useful, especially considering
> that it supports Latin-1, Latin-2 (both useful), but also Latin-4
> (completely unused) and Latin-3 (good for Maltese and Esperanto, and
> most Esperanto users don't use it). It doesn't support ISO-8859-5 or
> KOI8-R (Russian), or ISO-8859-7 (Greek).
The latest public GNAT version and GCC3/GNAT both support the ISO-8859-5
encoding in identifiers. And I don't know of any GNAT users who use
KOI8-R/U/B encodings outside comments, character literals, and string literals.

> It doesn't support changing
> formats on the fly - many users have multiple encodings around,
> besides the fact that having to compile a different binary for each
> user is a pain. 
> 
Can you propose any method for detecting the encoding of an Ada source
file "on the fly"?

> From: Pascal Leroy
> > Remember, we are talking Ada applications here.  There are probably many
> > applications out there that deal with mathematical symbols or with Tengwar, 
> > but I doubt that they are written in Ada.
> 
> Mathematical symbols and Tengwar are text. Any text handling system
> that supports Unicode should handle them like any other text, because
> sooner or later users will expect it to handle them. (If you're
> unlucky, it will be the day that you're showing your system off in
> Hong Kong, and the potential buyer decides to put in his name that
> isn't in the BMP.) If people don't want Ada to be a general-purpose
> programming language, then that's fine; but it's not acceptable for a
> general-purpose programming language not to be able to handle text,
> and for a modern language, that means Unicode.

The main problem with encodings in Ada is history.

Many programs assume that Character is Latin-1. If we change the
semantics of Ada.Characters.Handling, what results do we get?

Ada 83 defines type Character as an enumeration. The order of symbols is
defined by their order in this enumeration, not by the actual code. This
allows simple porting of programs from, for example, ASCII to EBCDIC
encodings. Ada 95 simply extends 7-bit ASCII to 8-bit ISO-8859-1.

The difference between the logical code order in an encoding and the
collation order of the current user's language environment is another
problem. Neither Ada 9X nor AI-00285 solves this.

The best way to implement localization/internationalization support
in Ada is to define a special needs annex rather than change existing
interfaces, because (1) this does not affect portability and (2) it
allows new applications (if internationalization is critical) to use
the new interfaces.


Vadim Godunko




* Re: Character Sets
  2002-12-14 22:53           ` Vadim Godunko
@ 2002-12-15  3:46             ` David Starner
  2002-12-15 23:26             ` Robert C. Leif
  1 sibling, 0 replies; 13+ messages in thread
From: David Starner @ 2002-12-15  3:46 UTC (permalink / raw)


vgodunko@vipmail.ru (Vadim Godunko) wrote in message news:<665e587a.0212141453.42386f5d@posting.google.com>...
>
> How much memory is required to store all the data from the Unicode
> Character Database? 

After stripping the converters, ICU takes up 3 MB
<http://oss.software.ibm.com/icu/userguide/icudata.html>. But that
includes a lot of locale data, and it could probably be compressed
further with some work.
There's no reason it would need to be paged into memory.

> What do you do if this constant changes? Retest all existing
> applications?

If the constant changed, then your version of the compiler changed,
and it's certainly possible that it broke your program, constant or
not. Given a stable API, a program should not break from a change in
the Unicode data, especially as they try not to make major changes to
the data between versions.
 
> The latest public GNAT version and GCC3/GNAT both support ISO-8859-5
> encoding in identifiers. 

Which may explain why people weren't using it in earlier versions. 

> And I don't know of any GNAT users who use
> KOI8-R/U/B encodings outside comments, character and string literals.

The problem is, source encoding is tied into the encoding that I/O
uses.

> The best way to implement localization/internationalization support
> in Ada is to define a specialized needs annex, 

Non-BMP Unicode is not l10n/i18n; it's basic text handling, just
like the rest of Unicode. As for the character data and encodings:
sure, whatever. Just so long as it's supported in some way.
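
As a rough illustration of what "non-BMP" means for Ada's character
types, here is a sketch that assumes an Ada 2005 or later compiler,
where Wide_Wide_Character is available (it is not part of Ada95). The
procedure name Non_BMP_Demo and the choice of U+20000 as the sample code
point are mine:

```ada
with Ada.Text_IO;

procedure Non_BMP_Demo is
   --  U+20000, the first code point of CJK Unified Ideographs
   --  Extension B, lies outside the Basic Multilingual Plane.
   Code_Point : constant := 16#2_0000#;

   --  Wide_Wide_Character (Ada 2005 and later) covers the full
   --  ISO/IEC 10646 range, so this conversion is legal.
   Ideograph : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (Code_Point);
begin
   Ada.Text_IO.Put_Line
     ("Wide_Character'Last position:"
      & Integer'Image (Wide_Character'Pos (Wide_Character'Last)));
   Ada.Text_IO.Put_Line
     ("Code point fits in Wide_Character: "
      & Boolean'Image
          (Code_Point <= Wide_Character'Pos (Wide_Character'Last)));
   Ada.Text_IO.Put_Line
     ("Position as Wide_Wide_Character:"
      & Integer'Image (Wide_Wide_Character'Pos (Ideograph)));
end Non_BMP_Demo;
```

The same value cannot be written as Wide_Character'Val (16#2_0000#),
since Wide_Character stops at position 16#FFFF#.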




* RE: Character Sets
  2002-12-14 22:53           ` Vadim Godunko
  2002-12-15  3:46             ` David Starner
@ 2002-12-15 23:26             ` Robert C. Leif
  1 sibling, 0 replies; 13+ messages in thread
From: Robert C. Leif @ 2002-12-15 23:26 UTC (permalink / raw)


I believe that we need to change to Latin_9. The European Economic
Community needs to have a Euro character. In the long run, an XML_Io or
Unicode_Io package will have to be created. However, it should be an
Application Program Interface rather than part of the core
language or an annex.
Bob Leif
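
For reference, a small sketch of why a Latin_9 constant still tests as
a member of a set built from the full Character range. It assumes GNAT,
which supplies Ada.Characters.Latin_9 as an implementation-defined
package; the procedure name Euro_Position_Demo is hypothetical. In
ISO 8859-15 the Euro sign occupies the code that Latin-1 assigns to the
currency sign, and both packages declare constants of the same type
Character:

```ada
with Ada.Text_IO;
with Ada.Characters.Latin_1;
with Ada.Characters.Latin_9;  --  GNAT-provided, not in the Ada95 standard

procedure Euro_Position_Demo is
   package Latin_1 renames Ada.Characters.Latin_1;
   package Latin_9 renames Ada.Characters.Latin_9;
begin
   --  Both constants are of type Character, so they are compared by
   --  position in the same 256-value enumeration.
   Ada.Text_IO.Put_Line
     ("Euro_Sign position:    "
      & Integer'Image (Character'Pos (Latin_9.Euro_Sign)));
   Ada.Text_IO.Put_Line
     ("Currency_Sign position:"
      & Integer'Image (Character'Pos (Latin_1.Currency_Sign)));
   Ada.Text_IO.Put_Line
     ("Same Character value: "
      & Boolean'Image (Latin_9.Euro_Sign = Latin_1.Currency_Sign));
end Euro_Position_Demo;
```

Because the type is the same, a Character_Set alone cannot tell the two
interpretations apart; the distinction lives in how the bytes are
displayed or converted, not in the Character value.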




end of thread

Thread overview: 13+ messages
2002-11-26 21:41 Character Sets Robert C. Leif
  -- strict thread matches above, loose matches on Subject: below --
2002-11-27  9:00 Grein, Christoph
2002-11-28 17:53 Robert C. Leif
2002-11-29 12:28 ` Georg Bauhaus
2002-12-02 18:28 ` Stephen Leake
2002-12-03  2:45   ` Robert C. Leif
2002-12-03 13:33     ` Robert A Duff
2002-12-03 15:32       ` Juanma Barranquero
2002-12-04  0:49       ` Robert C. Leif
2002-12-14  3:27         ` David Starner
2002-12-14 22:53           ` Vadim Godunko
2002-12-15  3:46             ` David Starner
2002-12-15 23:26             ` Robert C. Leif
