Ada.Strings.UTF_Encoding - search results

comp.lang.ada

* Re: Ada and Unicode
  2022-04-03 16:51  8%   ` Thomas
@ 2023-04-04  0:02 14%     ` Thomas
  0 siblings, 0 replies; 44+ results
From: Thomas @ 2023-04-04  0:02 UTC (permalink / raw)


In article 
<fantome.forums.tDeContes-079FD6.18515603042022@news.free.fr>,
 Thomas <fantome.forums.tDeContes@free.fr.invalid> wrote:

> In article <f9d91cb0-c9bb-4d42-a1a9-0cd546da436cn@googlegroups.com>,
>  Vadim Godunko <vgodunko@gmail.com> wrote:
> 
> > On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:
> 
> > > What's the way to manage Unicode correctly ? 


> > Ada doesn't have good Unicode support. :( So, you need to find a suitable set 
> > of "workarounds".
> > 
> > There are a few different aspects of Unicode support that need to be considered:
> > 
> > 1. Representation of string literals. If you want to use non-ASCII 
> > characters 
> > in source code, you need to use -gnatW8 switch and it will require use of 
> > Wide_Wide_String everywhere.
> > 2. Internal representation during application execution. You are forced to 
> > use Wide_Wide_String at previous step, so it will be UCS4/UTF32.
> 
> > It is hard to say that it is a reasonable set of features for the modern world.
> 
> I don't think Ada lacks that much to have good UTF-8 
> support.
> 
> the cardinal point is to be able to fill an 
> Ada.Strings.UTF_Encoding.UTF_8_String with a literal.
> (once you have that, trying to fill a Standard.String with a 
> non-Latin-1 character will give an error, which i think is fine :-) )
> 
> does Ada 202x allow it ?


hi !

I think I found a quite nice solution!
(reading <t3lj44$fh5$1@dont-email.me> again)
(not tested yet)


it's not perfect by the book,
but it is:

- Ada 2012 compatible
- better than writing UTF-8 Ada code and then telling GNAT it is Latin-1
  (that way the compiler takes UTF_8_String for what it is,
  an array of octets, but it cannot detect an invalid UTF-8 string,
  and if someone later tells it the source really is UTF-8, everything goes wrong)
- better than being limited to ASCII in string literals
- no need to ever declare a Wide_Wide_String explicitly:
  it's always implicit, for a very short time,
  and AFAIK eligible for optimization



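with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

package UTF_Encoding is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   --  Encode has a second parameter (Output_BOM), so it cannot be renamed
   --  as a unary operator; an Ada 2012 expression function wraps it instead.
   function "+" (A : in Wide_Wide_String) return UTF_8_String is
     (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (A));

end UTF_Encoding;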


then we can do:


with UTF_Encoding;

package User is

   use UTF_Encoding;

   My_String : UTF_8_String := + "Greek characters + smileys";

end User;


if you want to avoid "use UTF_Encoding;",
i think "use type UTF_Encoding.UTF_8_String;" doesn't work
(UTF_8_String is only a subtype of String, and this "+" is not a
primitive operator of String),
but this should work:


with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

package UTF_Encoding is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   type Literals_For_UTF_8_String is new Wide_Wide_String;

   --  Primitive operator of the new type, so "use type" makes it visible.
   function "+" (A : in Literals_For_UTF_8_String) return UTF_8_String is
     (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
        (Wide_Wide_String (A)));

end UTF_Encoding;


with UTF_Encoding;

package User is

   use type UTF_Encoding.Literals_For_UTF_8_String;

   My_String : UTF_Encoding.UTF_8_String
               := + "Greek characters + smileys";

end User;
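
A single-file sketch for trying the "use type" variant with a compiler. It is only an illustration under assumptions: the file is saved as UTF-8 and compiled by GNAT with -gnatW8, the package is nested in a procedure for convenience, and the names Literal_Demo and Alpha are made up here. "+" is written as an expression function that calls Encode.

```ada
--  Sketch only. Assumes GNAT, a UTF-8 source file, and the -gnatW8 switch.
--  Literal_Demo and Alpha are hypothetical names chosen for this example.
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Literal_Demo is

   package UTF_Encoding is
      subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

      type Literals_For_UTF_8_String is new Wide_Wide_String;

      --  Primitive operator of the derived type: visible through "use type".
      function "+" (A : Literals_For_UTF_8_String) return UTF_8_String is
        (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
           (Wide_Wide_String (A)));
   end UTF_Encoding;

   use type UTF_Encoding.Literals_For_UTF_8_String;

   --  The literal is resolved as Literals_For_UTF_8_String, then encoded.
   Alpha : constant UTF_Encoding.UTF_8_String := +"α";

begin
   --  U+03B1 takes two octets in UTF-8, so the length should be 2 when the
   --  compiler really reads the source as UTF-8.
   Ada.Text_IO.Put_Line ("octets:" & Natural'Image (Alpha'Length));
end Literal_Demo;
```

If the source is read as Latin-1 instead, the same literal is seen as two characters and the octet count doubles, which makes the program a quick check of the compiler switches.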



what do you think about that ? good idea or not ? :-)

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

^ permalink raw reply	[relevance 14%]

* Re: Ada and Unicode
  @ 2022-04-03 19:20 12%     ` Thomas
  0 siblings, 0 replies; 44+ results
From: Thomas @ 2022-04-03 19:20 UTC (permalink / raw)

In article <lyfszm5xv2.fsf@pushface.org>,
 Simon Wright <simon@pushface.org> wrote:

> But don't use unit names containing international characters, at any
> rate if you're (interested in compiling on) Windows or macOS:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

if i understand,  Eric Botcazou is a gnu admin who decided to reject your bug?
i find him very "low portability thinking"!

it is the responsibility of compilers and other underlying tools to manage the various underlying OSes and file systems,
not of the user to avoid those that the compiler devs find too bad!
(or to use the right encoding. i heard that Windows uses UTF-16, do you know about it?)

clearly, To_Lower takes Latin-1,
and this kind of problem would be easier to avoid if string types were stronger ...

after:

package Ada.Strings.UTF_Encoding
    ...
    type UTF_8_String is new String;
    ...
end Ada.Strings.UTF_Encoding;

i would have also made:

package Ada.Directories
    ...
    type File_Name_String is new Ada.Strings.UTF_Encoding.UTF_8_String;
    ...
end Ada.Directories;

with probably a validity check and a Dynamic_Predicate which allows "".

then, i would use File_Name_String in all Ada.Directories and Ada.*_IO.

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

^ permalink raw reply	[relevance 12%]

* Re: Ada and Unicode
  @ 2022-04-03 18:37 10%           ` Thomas
  0 siblings, 0 replies; 44+ results
From: Thomas @ 2022-04-03 18:37 UTC (permalink / raw)

In article <s5n8nj$cec$1@franka.jacob-sparre.dk>,
 "Randy Brukardt" <randy@rrsoftware.com> wrote:

> "Luke A. Guest" <laguest@archeia.com> wrote in message 
> news:s5jute$1s08$1@gioia.aioe.org...
> >
> >
> > On 19/04/2021 13:52, Dmitry A. Kazakov wrote:
> >
> > > It is practical solution. Ada type system cannot express differently
> > represented/constrained string/array/vector subtypes. Ignoring Latin-1 and 
> > using String as if it were an array of octets is the best available 
> > solution.
> > >
> >
> > They're different types and should be incompatible, because, well, they 
> > are. What does Ada have that allows for this that other languages doesn't? 
> > Oh yeah! Types!
> 
> If they're incompatible, you need an automatic way to convert between 
> representations, since these are all views of the same thing (an abstract 
> string type). You really don't want 35 versions of Open each taking a 
> different string type.

i don't need 35 versions of Open.
i need one version of Open with a Unicode string type (not Latin-1 - 
preferably UTF-8), which will use Ada.Strings.UTF_Encoding.Conversions 
as far as needed, depending on the underlying API.

> 
> It's the fact that Ada can't do this that makes Unbounded_Strings unusable 
> (well, barely usable).

knowing Ada, i find it acceptable.
i don't say the same about Ada.Strings.UTF_Encoding.UTF_8_String.

> Ada 202x fixes the literal problem at least, but we'd 
> have to completely abandon Unbounded_Strings and use a different library 
> design in order for it to allow literals. And if you're going to do 
> that, you might as well do something about UTF-8 as well -- but now you're 
> going to need even more conversions. Yuck.

as i said to Vadim Godunko, i need to fill a string type with a UTF-8 
literal.
but i don't think this string type has to manage various conversions.

from my point of view, each library has to accept 1 kind of string type 
(preferably UTF-8 everywhere),
and then this library, not the user, has to make the needed conversions 
for the underlying API.

> 
> I think the only true solution here would be based on a proper abstract 
> Root_String type. But that wouldn't work in Ada, since it would be 
> incompatible with all of the existing code out there. Probably would have to 
> wait for a follow-on language.

of course, it would be very nice to have a thicker language with a 
garbage collector, only 1 String type which allows all that we need, etc.

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

^ permalink raw reply	[relevance 10%]

* Re: Ada and Unicode
  @ 2022-04-03 18:04  8%           ` Thomas
  0 siblings, 0 replies; 44+ results
From: Thomas @ 2022-04-03 18:04 UTC (permalink / raw)

In article <s5k0ai$bb5$1@dont-email.me>, "J-P. Rosen" <rosen@adalog.fr> 
wrote:

> Le 19/04/2021 à 15:00, Luke A. Guest a écrit :
> > They're different types and should be incompatible, because, well, they 
> > are. What does Ada have that allows for this that other languages 
> > doesn't? Oh yeah! Types!
> 
> They are not so different. For example, you may read the first line of a 
> file in a string, then discover that it starts with a BOM, and thus 
> decide it is UTF-8.

could you give me an example of something that you can do now, and could 
not do if UTF_8_String were private, please?
(to discover that it starts with a BOM, you must look at it.)

> 
> BTW, the very first version of this AI had different types, but the ARG 
> felt that it would just complicate the interface for the sake of abusive 
> "purity".

could you explain "abusive purity" please?

i guess it is because of ASCII.
i guess a lot of developers use only ASCII in a lot of situations, and 
they would find it annoying to need Ada.Strings.UTF_Encoding.Strings every 
time.

but I think a simple explicit conversion is acceptable, for a not fully 
compatible type which requires some attention.

the best would be to be required to use ASCII_String as intermediate, 
but i don't know how it could be designed at language level:

UTF_8_Var := UTF_8_String (ASCII_String (Latin_1_Var));
Latin_1_Var:= String (ASCII_String (UTF_8_Var));

and this would be forbidden :
UTF_8_Var := UTF_8_String (Latin_1_Var);

this would ensure that Constraint_Error is raised when there are some 
non-ASCII characters.
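
The ASCII-only rule can be approximated in today's Ada with an ordinary function. This is only a sketch of the check described above, not a language feature: To_UTF_8 and ASCII_Gate are hypothetical names, and since UTF_8_String is currently a subtype of String the compiler does not enforce anything by itself.

```ada
--  Sketch only: To_UTF_8 is a hypothetical helper, not a standard function.
with Ada.Strings.UTF_Encoding;
with Ada.Text_IO;

procedure ASCII_Gate is
   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   --  A String whose characters are all 7-bit ASCII has the same octets
   --  in Latin-1 and in UTF-8, so it can be passed through unchanged.
   function To_UTF_8 (Latin_1 : String) return UTF_8_String is
   begin
      for C of Latin_1 loop
         if Character'Pos (C) > 127 then
            raise Constraint_Error with "non-ASCII character";
         end if;
      end loop;
      return Latin_1;
   end To_UTF_8;

   --  LATIN SMALL LETTER E WITH ACUTE, code 233 in Latin-1.
   E_Acute : constant String := (1 => Character'Val (233));
begin
   Ada.Text_IO.Put_Line (To_UTF_8 ("plain ASCII"));
   Ada.Text_IO.Put_Line (To_UTF_8 ("caf" & E_Acute));  --  raises
exception
   when Constraint_Error =>
      Ada.Text_IO.Put_Line ("rejected: not pure ASCII");
end ASCII_Gate;
```

The first call passes, the second is rejected at run time. For strings that do contain upper-half Latin-1 characters, Ada.Strings.UTF_Encoding.Strings.Encode remains the way to get valid UTF-8.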

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

^ permalink raw reply	[relevance 8%]

* Re: Ada and Unicode
  @ 2022-04-03 16:51  8%   ` Thomas
  2023-04-04  0:02 14%     ` Thomas
  0 siblings, 1 reply; 44+ results
From: Thomas @ 2022-04-03 16:51 UTC (permalink / raw)

In article <f9d91cb0-c9bb-4d42-a1a9-0cd546da436cn@googlegroups.com>,
 Vadim Godunko <vgodunko@gmail.com> wrote:

> On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:

> > What's the way to manage Unicode correctly ? 
> > 
> 
> Ada doesn't have good Unicode support. :( So, you need to find a suitable set 
> of "workarounds".
> 
> There are a few different aspects of Unicode support that need to be considered:
> 
> 1. Representation of string literals. If you want to use non-ASCII characters 
> in source code, you need to use -gnatW8 switch and it will require use of 
> Wide_Wide_String everywhere.
> 2. Internal representation during application execution. You are forced to 
> use Wide_Wide_String at previous step, so it will be UCS4/UTF32.

> It is hard to say that it is a reasonable set of features for the modern world.

I don't think Ada lacks that much to have good UTF-8 
support.

the cardinal point is to be able to fill an 
Ada.Strings.UTF_Encoding.UTF_8_String with a literal.
(once you have that, trying to fill a Standard.String with a 
non-Latin-1 character will give an error, which i think is fine :-) )

does Ada 202x allow it ?

if not, it would probably be easier if it was
    type UTF_8_String is new String;
instead of
    subtype UTF_8_String is String;

for all subprograms it's quite easy:
we just have to duplicate them with the new type, and to mark the old 
one as Obsolescent.

but, now that "subtype UTF_8_String" exists, i don't know what we can do 
for types.
(is the only way to choose a new name?)

> To 
> fix some of the drawbacks of the current situation we are developing a new text 
> processing library, known as VSS. 
> 
> https://github.com/AdaCore/VSS

(are you working at AdaCore ?)

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

^ permalink raw reply	[relevance 8%]

* [ANN] UXStrings package available (UXS_20220226).
@ 2022-03-01 20:47 10% Blady
  0 siblings, 0 replies; 44+ results
From: Blady @ 2022-03-01 20:47 UTC (permalink / raw)


Hello,

The objective of UXStrings is Unicode and dynamic length support for 
strings in Ada.

The UXStrings API is inspired by Ada.Strings.Unbounded in order to 
minimize the work of adapting existing Ada source code.

Changes from last publication:
- Ada.Strings.UTF_Encoding.Conversions fix is no longer needed with GNAT 
CE 2021
- A few fixes

Available on GitHub (https://github.com/Blady-Com/UXStrings) and also on 
Alire (https://alire.ada.dev/crates/uxstrings.html).

Feedback is welcome on actual use cases.

Regards, Pascal.

^ permalink raw reply	[relevance 10%]

* Re: XMLAda & unicode symbols
  @ 2021-06-21 15:26  7%     ` Simon Wright
  0 siblings, 0 replies; 44+ results
From: Simon Wright @ 2021-06-21 15:26 UTC (permalink / raw)


"196...@googlemail.com" <1963bib@googlemail.com> writes:

> Asking for the degree sign, was probably a slight mistake. There is
> Degree_Celsius and also Degree_Fahrenheit for those who have not yet
> embraced metric. These are the "correct" symbols.

You might equally have meant angular degrees.

> Both of these exist in Unicode.Names.Letterlike_Symbols, and probably
> elsewhere,but trying to shoehorn these in seems impossible.

A scan through XML/Ada shows that the only uses of Unicode_Char are in
the SAX subset. I don't see any way in the DOM subset of XML/Ada of
using them - someone please prove me wrong!

You could build a Unicode_Char to UTF_8_String converter using
Ada.Strings.UTF_Encoding.Wide_Wide_Strings, ARM A.4.11(30)
http://www.ada-auth.org/standards/rm12_w_tc1/html/RM-A-4-11.html#p30
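
A minimal sketch of such a converter, using only the standard library. It assumes the code point is available as a Wide_Wide_Character; mapping XML/Ada's Unicode_Char onto that type is left out here, and the names Char_To_UTF8 and To_UTF_8 are hypothetical.

```ada
--  Sketch only: To_UTF_8 is a hypothetical helper built on the standard
--  Encode function; the step from Unicode_Char to Wide_Wide_Character is
--  an assumption and is not shown.
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Char_To_UTF8 is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   --  Encode a single code point as a UTF-8 octet sequence.
   function To_UTF_8 (C : Wide_Wide_Character) return UTF_8_String is
     (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
        (Wide_Wide_String'(1 => C)));

   --  U+2103 DEGREE CELSIUS, given by code point so that the source file
   --  itself stays plain ASCII.
   Degree_Celsius : constant UTF_8_String :=
     To_UTF_8 (Wide_Wide_Character'Val (16#2103#));

begin
   --  U+2103 lies in the three-octet range of UTF-8 (E2 84 83).
   Ada.Text_IO.Put_Line
     ("octets:" & Natural'Image (Degree_Celsius'Length));
end Char_To_UTF8;
```

The resulting UTF_8_String can then be concatenated with other UTF-8 text before it is handed to the DOM calls.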

> I just wish XMLAda could just accept whatever we throw at it, and if
> we need to convert it, then let us do so outside of it.

That is *exactly* what you have to do (convert outside, not throw any
old sequence of octets and 32-bit values somehow mashed together at
it). It wants a utf-8-encoded string (though XML/Ada doesn't seem to say
so - RFC 3076 implies it, 7303 (8.1) recommends it).

OK, Text_IO might not prove the point to you, but what about this?

   with Ada.Characters.Latin_1;
   with DOM.Core.Documents;
   with DOM.Core.Elements;
   with DOM.Core.Nodes;
   with DOM.Core;
   with Unicode.CES;
   with Unicode.Encodings;

   procedure Utf is
      Impl : DOM.Core.DOM_Implementation;
      Doc : DOM.Core.Document;
      Dummy, Element : DOM.Core.Node;
      Fifty_Degrees_Latin1 : constant String
        := "50" & Ada.Characters.Latin_1.Degree_Sign;
      Fifty_Degrees_UTF8 : constant Unicode.CES.Byte_Sequence
        := Unicode.Encodings.Convert
          (Fifty_Degrees_Latin1,
           From => Unicode.Encodings.Get_By_Name ("iso-8859-15"),
           To => Unicode.Encodings.Get_By_Name ("utf-8"));
   begin
      Doc := DOM.Core.Create_Document (Impl);

      Element := DOM.Core.Documents.Create_Element (Doc, "utf");
      DOM.Core.Elements.Set_Attribute (Element, "temp", Fifty_Degrees_UTF8);
      Dummy := DOM.Core.Nodes.Append_Child (Doc, Element);

      DOM.Core.Nodes.Print (Doc);
   end Utf;

^ permalink raw reply	[relevance 7%]

* Re: Unable to use "Find all references" with GPS CE 2020 !?
  2020-08-28 10:35  8%     ` Jérôme Haguet
@ 2021-06-18 11:02  0%       ` Jérôme Haguet
  0 siblings, 0 replies; 44+ results
From: Jérôme Haguet @ 2021-06-18 11:02 UTC (permalink / raw)


Le vendredi 28 août 2020 à 12:35:39 UTC+2, Jérôme Haguet a écrit :
> Le mercredi 26 août 2020 à 16:47:08 UTC+2, Jérôme Haguet a écrit : 
> > > > These days, I tried to use GPS from Gnat CE 2020, Windows x64. 
> > > > Unfortunately, 'Find all references' does not seem to work. 
> > > > It happens on 2 different PCs, with any project I have tried, including the one used in the tutorial. 
> > > I have probably an identical setup: 2 PCs, Windows 10, x64, GNAT CE 2020. On both, 'Find all references' works perfectly, even on a relatively low-powered laptop (2 cores, 2 logical processors, 1.1 GHz). I'd say it works better than previously: in earlier versions, the reference finder gave up due to (perhaps) a time-out, or sometimes didn't want to work at all for the whole session. 
> > > Perhaps your problem is with the path? "C:\GNAT\2020\bin;" or equivalent should come first. It's important if you have multiple GNAT installations. 
> > Thanks Gauthier for the suggestion, I have simplified the PATH. But it did not make it work. 
> > I have made some additional tests with a new computer, and it works successfully. 
> > 
> > But I am still facing the problem with my own computer. 
> > I have found one difference using ProcessExplorer : subprocess ada_language_server.exe is not running after my .gpr project is opened. 
> > It can be successfully started from command line "C:\GNAT\2020\libexec\gnatstudio\als\ada_language_server.exe ...", but it is not started from GPS. 
> > 
> > Any idea where to check ? Any log option to activate ? 
> > 
> > Jérôme
> Seen in http://docs.adacore.com/live/wave/gps/html/gps_ug/environment.html#the-ada-language-server 
> "One known limitation of this server is that it doesn’t support file paths that are not valid UTF-8." 
> 
> And in %USERPROFILE%\home\.gnatstudio\log, I found : 
> ... 
> [GPS.KERNEL.XREF] Set up xref database: :memory: (12:24:25.635) 
> [SQL.INSPECT] Loading data from data into database (12:24:25.651) 
> [ENTITIES.TIMING] Created database: 0.017679200 s (12:24:25.653) 
> [GPS.KERNEL.GPS_KERNEL] Refresh_Context: no child focused (12:24:25.668) 
> [GPS.KERNEL.MSG] Not loading C:\GNAT\2020\share\examples\gnatstudio\tutorial\obj\sdc-msg.xml (12:24:25.668) 
> [GPS.LSP_CLIENT] Starting 'C:\GNAT\2020\libexec\gnatstudio\als\ada_language_server.exe' (12:24:25.834) 
> [HOOKS.EXCEPTIONS] While running project_view_changed:GPS.LSP_MODULE.ON_PROJECT_VIEW_CHANGED 
> _HOOKS.EXCEPTIONS_ raised ADA.STRINGS.UTF_ENCODING.ENCODING_ERROR : bad input at Item (34) 
> ... 
> 
> But I do not understand which file path is not valid UTF-8 : I have used standard installation, and I am testing sdc.gpr, which is provided as a tutorial 
> 
> Regards 
> Jérôme

FYI : 
This specific problem was due to an environment variable whose value contained accented French characters.
For example : 
C> set ONE_VAR=Déjà
C> gnatstudio.exe 
....
and ada_language_server.exe process will fail to start when opening the 1st gpr project.

Same problem with gnatstudio from GNAT CE 2021

^ permalink raw reply	[relevance 0%]

* Re: Ada and Unicode
  2021-04-19  9:08  9% ` Stephen Leake
  2021-04-19 11:56 11%   ` Luke A. Guest
@ 2021-04-19 16:14  0%   ` DrPi
  1 sibling, 0 replies; 44+ results
From: DrPi @ 2021-04-19 16:14 UTC (permalink / raw)


Le 19/04/2021 à 11:08, Stephen Leake a écrit :
> DrPi <314@drpi.fr> writes:
> 
>> Any way to use source code encoded in UTF-8 ?
> 
>        for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");
> 
That's interesting.
Using these switches at project level is not OK. Project source files 
do not always use the same encoding, especially when using libraries.
Using these switches at source-file level is better: a little complicated 
to use, but better.

> from the gnat user guide, 4.3.1 Alphabetical List of All Switches:
> 
> `-gnati`c''
>       Identifier character set (`c' = 1/2/3/4/8/9/p/f/n/w).  For details
>       of the possible selections for `c', see *note Character Set
>       Control: 4e.
> 
> This applies to identifiers in the source code
> 
> `-gnatW`e''
>       Wide character encoding method (`e'=n/h/u/s/e/8).
> 
> This applies to string and character literals.
> 
>> What's the way to manage Unicode correctly ?
> 
> There are two issues: Unicode in source code, that the compiler must
> understand, and Unicode in strings, that your program must understand.
> 
> (I've never written a program that dealt with utf strings other than
> file names).
>   
> -gnatiw tells the compiler that identifiers may contain wide characters
> (-gnati8 would select IBM PC code page 850, not UTF-8).
> 
> -gnatW8 tells the compiler that string literals use utf-8 encoding.
> 
> package Ada.Strings.UTF_Encoding provides some facilities for dealing
> with utf. It does _not_ provide walking a string by code point, which
> would seem necessary.
> 
> We could be more helpful if you show what you are trying to do, you've
> tried, and what errors you got.
> 

^ permalink raw reply	[relevance 0%]

* Re: Ada and Unicode
  2021-04-19 12:13  0%     ` Luke A. Guest
@ 2021-04-19 15:48  0%       ` DrPi
  0 siblings, 0 replies; 44+ results
From: DrPi @ 2021-04-19 15:48 UTC (permalink / raw)


Le 19/04/2021 à 14:13, Luke A. Guest a écrit :
> 
> On 19/04/2021 12:56, Luke A. Guest wrote:
> 
>>
>> package Ada.Strings.UTF_Encoding
>>    ...
>>    subtype UTF_8_String is String;
>>    ...
>> end Ada.Strings.UTF_Encoding;
>>
>> Was absolutely and totally wrong.
> 
> ...and, before someone comes back with "but all the upper half of latin 
> 1" are represented and have the same values." Yes, they do, in Code 
> points which is a 32 bit number. In UTF-8 they are encoded as 2 octets!
A code point has no size. Like universal integers in Ada.

^ permalink raw reply	[relevance 0%]

* Re: Ada and Unicode
  2021-04-19 11:56 11%   ` Luke A. Guest
  2021-04-19 12:13  0%     ` Luke A. Guest
@ 2021-04-19 12:52  0%     ` Dmitry A. Kazakov
    1 sibling, 1 reply; 44+ results
From: Dmitry A. Kazakov @ 2021-04-19 12:52 UTC (permalink / raw)


On 2021-04-19 13:56, Luke A. Guest wrote:
> On 19/04/2021 10:08, Stephen Leake wrote:
>>> What's the way to manage Unicode correctly ?
>>
>> There are two issues: Unicode in source code, that the compiler must
>> understand, and Unicode in strings, that your program must understand.
> 
> And this is where the Ada standard gets it wrong, in the encodings 
> package re utf-8.
> 
> Unicode is a superset of 7-bit ASCII not Latin 1. The high bit in the 
> leading octet indicates whether there are trailing octets. See 
> https://github.com/Lucretia/uca/blob/master/src/uca.ads#L70 for the data 
> layout. The first 128 "characters" in Unicode match that of 7-bit ASCII, 
> not 8-bit ASCII, and certainly not Latin 1. Therefore this:
> 
> package Ada.Strings.UTF_Encoding
>    ...
>    subtype UTF_8_String is String;
>    ...
> end Ada.Strings.UTF_Encoding;
> 
> Was absolutely and totally wrong.

It is practical solution. Ada type system cannot express differently 
represented/constrained string/array/vector subtypes. Ignoring Latin-1 
and using String as if it were an array of octets is the best available 
solution.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[relevance 0%]

* Re: Ada and Unicode
  2021-04-19 11:56 11%   ` Luke A. Guest
@ 2021-04-19 12:13  0%     ` Luke A. Guest
  2021-04-19 15:48  0%       ` DrPi
  2021-04-19 12:52  0%     ` Dmitry A. Kazakov
  1 sibling, 1 reply; 44+ results
From: Luke A. Guest @ 2021-04-19 12:13 UTC (permalink / raw)



On 19/04/2021 12:56, Luke A. Guest wrote:

> 
> package Ada.Strings.UTF_Encoding
>    ...
>    subtype UTF_8_String is String;
>    ...
> end Ada.Strings.UTF_Encoding;
> 
> Was absolutely and totally wrong.

...and, before someone comes back with "but all the upper half of latin 
1" are represented and have the same values." Yes, they do, in Code 
points which is a 32 bit number. In UTF-8 they are encoded as 2 octets!

^ permalink raw reply	[relevance 0%]

* Re: Ada and Unicode
  2021-04-19  9:08  9% ` Stephen Leake
@ 2021-04-19 11:56 11%   ` Luke A. Guest
  2021-04-19 12:13  0%     ` Luke A. Guest
  2021-04-19 12:52  0%     ` Dmitry A. Kazakov
  2021-04-19 16:14  0%   ` DrPi
  1 sibling, 2 replies; 44+ results
From: Luke A. Guest @ 2021-04-19 11:56 UTC (permalink / raw)

On 19/04/2021 10:08, Stephen Leake wrote:
>> What's the way to manage Unicode correctly ?
> 
> There are two issues: Unicode in source code, that the compiler must
> understand, and Unicode in strings, that your program must understand.

And this is where the Ada standard gets it wrong, in the encodings 
package re utf-8.

Unicode is a superset of 7-bit ASCII not Latin 1. The high bit in the 
leading octet indicates whether there are trailing octets. See 
https://github.com/Lucretia/uca/blob/master/src/uca.ads#L70 for the data 
layout. The first 128 "characters" in Unicode match that of 7-bit ASCII, 
not 8-bit ASCII, and certainly not Latin 1. Therefore this:

package Ada.Strings.UTF_Encoding
    ...
    subtype UTF_8_String is String;
    ...
end Ada.Strings.UTF_Encoding;

Was absolutely and totally wrong.

^ permalink raw reply	[relevance 11%]

* Re: Ada and Unicode
    @ 2021-04-19  9:08  9% ` Stephen Leake
  2021-04-19 11:56 11%   ` Luke A. Guest
  2021-04-19 16:14  0%   ` DrPi
    2 siblings, 2 replies; 44+ results
From: Stephen Leake @ 2021-04-19  9:08 UTC (permalink / raw)

DrPi <314@drpi.fr> writes:

> Any way to use source code encoded in UTF-8 ?

      for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");

from the gnat user guide, 4.3.1 Alphabetical List of All Switches:

`-gnati`c''
     Identifier character set (`c' = 1/2/3/4/8/9/p/f/n/w).  For details
     of the possible selections for `c', see *note Character Set
     Control: 4e.

This applies to identifiers in the source code

`-gnatW`e''
     Wide character encoding method (`e'=n/h/u/s/e/8).

This applies to string and character literals.

> What's the way to manage Unicode correctly ?

There are two issues: Unicode in source code, that the compiler must
understand, and Unicode in strings, that your program must understand.

(I've never written a program that dealt with utf strings other than
file names).

-gnatiw tells the compiler that identifiers may contain wide characters
(-gnati8 would select IBM PC code page 850, not UTF-8).

-gnatW8 tells the compiler that string literals use utf-8 encoding.

package Ada.Strings.UTF_Encoding provides some facilities for dealing
with utf. It does _not_ provide walking a string by code point, which
would seem necessary.
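
One workaround for the missing code-point walk, sketched under the assumption that decoding the whole string first is acceptable: decode to Wide_Wide_String, where each element is one code point, and loop over that. Walk_Code_Points and the sample text are made up for the example.

```ada
--  Sketch only: decodes the whole string up front, which costs memory but
--  needs nothing beyond the standard library.
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Walk_Code_Points is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   --  'n' followed by U+00E9, written as raw UTF-8 octets (C3 A9) so that
   --  the example does not depend on the source file encoding.
   Text : constant UTF_8_String :=
     'n' & Character'Val (16#C3#) & Character'Val (16#A9#);

   --  Decode raises Ada.Strings.UTF_Encoding.Encoding_Error on bad input.
   Decoded : constant Wide_Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (Text);

begin
   --  Three octets in, two code points out: 110 and 233.
   for C of Decoded loop
      Ada.Text_IO.Put_Line (Natural'Image (Wide_Wide_Character'Pos (C)));
   end loop;
end Walk_Code_Points;
```

Walking the octets in place, without the intermediate Wide_Wide_String, needs hand-written decoding or a third-party library.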

We could be more helpful if you show what you are trying to do, you've
tried, and what errors you got.

-- 
-- Stephe

^ permalink raw reply	[relevance 9%]

* Re: Unable to use "Find all references" with GPS CE 2020 !?
  @ 2020-08-28 10:35  8%     ` Jérôme Haguet
  2021-06-18 11:02  0%       ` Jérôme Haguet
  0 siblings, 1 reply; 44+ results
From: Jérôme Haguet @ 2020-08-28 10:35 UTC (permalink / raw)


Le mercredi 26 août 2020 à 16:47:08 UTC+2, Jérôme Haguet a écrit :
> > > These days, I tried to use GPS from Gnat CE 2020, Windows x64. 
> > > Unfortunately, 'Find all references' does not seem to work. 
> > > It happens on 2 different PCs, with any project I have tried, including the one used in the tutorial. 
> > I have probably an identical setup: 2 PCs, Windows 10, x64, GNAT CE 2020. On both, 'Find all references' works perfectly, even on a relatively low-powered laptop (2 cores, 2 logical processors, 1.1 GHz). I'd say it works better than previously: in earlier versions, the reference finder gave up due to (perhaps) a time-out, or sometimes didn't want to work at all for the whole session. 
> > Perhaps your problem is with the path? "C:\GNAT\2020\bin;" or equivalent should come first. It's important if you have multiple GNAT installations.
> Thanks Gauthier for the suggestion, I have simplified the PATH. But it did not make it work. 
> I have made some additional tests with a new computer, and it works successfully. 
> 
> But I am still facing the problem with my own computer. 
> I have found one difference using ProcessExplorer : subprocess ada_language_server.exe is not running after my .gpr project is opened. 
> It can be successfully started from command line "C:\GNAT\2020\libexec\gnatstudio\als\ada_language_server.exe ...", but it is not started from GPS. 
> 
> Any idea where to check ? Any log option to activate ? 
> 
> Jérôme

Seen in http://docs.adacore.com/live/wave/gps/html/gps_ug/environment.html#the-ada-language-server
"One known limitation of this server is that it doesn’t support file paths that are not valid UTF-8."

And in %USERPROFILE%\home\.gnatstudio\log, I found : 
...
   [GPS.KERNEL.XREF] Set up xref database: :memory: (12:24:25.635)
   [SQL.INSPECT] Loading data from data into database (12:24:25.651)
   [ENTITIES.TIMING] Created database: 0.017679200 s (12:24:25.653)
   [GPS.KERNEL.GPS_KERNEL] Refresh_Context: no child focused (12:24:25.668)
   [GPS.KERNEL.MSG] Not loading C:\GNAT\2020\share\examples\gnatstudio\tutorial\obj\sdc-msg.xml (12:24:25.668)
   [GPS.LSP_CLIENT] Starting 'C:\GNAT\2020\libexec\gnatstudio\als\ada_language_server.exe' (12:24:25.834)
   [HOOKS.EXCEPTIONS] While running project_view_changed:GPS.LSP_MODULE.ON_PROJECT_VIEW_CHANGED
   _HOOKS.EXCEPTIONS_ raised ADA.STRINGS.UTF_ENCODING.ENCODING_ERROR : bad input at Item (34)
...

But I do not understand which file path is not valid UTF-8 : I have used standard installation, and I am testing sdc.gpr, which is provided as a tutorial 

Regards
Jérôme

^ permalink raw reply	[relevance 8%]

* Re: I need to show extended Ascii codes in GtkAda environment
  2019-11-22 21:22 13%   ` Randy Brukardt
@ 2019-11-22 21:36 10%     ` Dmitry A. Kazakov
  0 siblings, 0 replies; 44+ results
From: Dmitry A. Kazakov @ 2019-11-22 21:36 UTC (permalink / raw)


On 2019-11-22 22:22, Randy Brukardt wrote:

>      Slash_Null : constant String :=
>            Ada.Strings.UTF_Encoding.Strings.Encode (Character'Val(216) & "");

Or using array aggregate

    Ada.Strings.UTF_Encoding.Strings.Encode ((1=>Character'Val(216)));

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[relevance 10%]

* Re: I need to show extended Ascii codes in GtkAda environment
  @ 2019-11-22 21:22 13%   ` Randy Brukardt
  2019-11-22 21:36 10%     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 44+ results
From: Randy Brukardt @ 2019-11-22 21:22 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:qr8qbp$15gp$1@gioia.aioe.org...
> On 2019-11-22 14:09, L Dries wrote:
>> In a GtkAda environment I have the need to present some characters or 
>> strings with characters in the Extended character Ascii range for 
>> instance O with slash (216). I create a constant character with
>>
>> Slash_null : constant string(1 .. 1) := "" & character'Val(216);
>
> GTK is UTF-8. You must encode and decode anything that is not ASCII 7-bit. 
> [A pragmatic approach is to ignore the reference manual and treat all 
> strings UTF-8 rather than mandated Latin 1.]
>
> Anyway, assuming you need the LATIN CAPITAL LETTER O WITH STROKE character 
> (code point 216), encoded in UTF-8 it is C3 98. Note, two octets. Thus:
>
>    Slash_Null : constant String :=
>                    Character'Val (16#C3#) & Character'Val (16#98#);
>
> With Strings Edit for Ada you could also write:
>
>    Slash_Null : constant String := Strings_Edit.UTF8.Image (216);

You could of course use Ada.Strings.UTF_Encoding (a standard part of Ada 
2012) to do this as well, by converting the "standard" Latin-1 String to 
UTF-8:

    with Ada.Strings.UTF_Encoding.Strings;
    ...
    Slash_Null : constant String :=
          Ada.Strings.UTF_Encoding.Strings.Encode (Character'Val(216) & "");

Note that concatenating the character with the null string effectively 
creates a string with one character. (You could of course use the literal 
directly if your editor/implementation supports that - tough to do that on 
the Internet, though.)
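
A small sketch that checks the standard Encode call against the two octets written by hand earlier in the thread. Check_O_Stroke and the object names are hypothetical; the calls are the standard ones from Ada.Strings.UTF_Encoding.Strings.

```ada
--  Sketch only: compares the library encoding of Latin-1 code 216 with the
--  hand-written UTF-8 octets C3 98.
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;

procedure Check_O_Stroke is

   --  One-character Latin-1 String, encoded by the standard library.
   Encoded : constant String :=
     Ada.Strings.UTF_Encoding.Strings.Encode ((1 => Character'Val (216)));

   --  The same character as raw UTF-8 octets.
   By_Hand : constant String :=
     Character'Val (16#C3#) & Character'Val (16#98#);

begin
   if Encoded = By_Hand then
      Ada.Text_IO.Put_Line ("same octets");
   else
      Ada.Text_IO.Put_Line ("different octets");
   end if;
end Check_O_Stroke;
```

Using Encode keeps the source readable as Latin-1 and leaves the octet arithmetic to the library.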

                                         Randy.





^ permalink raw reply	[relevance 13%]

* Re: Latest suggestion for 202x
  2019-06-15 23:59  7% Latest suggestion for 202x Micah Waddoups
  2019-06-16  7:17  0% ` Dmitry A. Kazakov
  2019-06-16 19:34  0% ` Optikos
@ 2019-06-23 20:17  0% ` Per Sandberg
  2 siblings, 0 replies; 44+ results
From: Per Sandberg @ 2019-06-23 20:17 UTC (permalink / raw)


Just a humble comment concerning using square brackets for indexing 
arrays.
Why care if it is an array or function ?
That's just some implementation details that is of interest when using 
assembler.
/P



On 6/16/19 1:59 AM, Micah Waddoups wrote:
> Following is my comment of appreciation AND my suggestion that is a very basic and important level of support for Unicode.
> 
> Frankly, the former rules for directly specifying the contents of an array were perfect, when you can only use parentheses.  Being able to use square brackets to improve the readability of an array is brilliant because it is familiar to those who use other languages and it does very little to change what is already part of the language definition (square brackets are already used in a way that does not conflict).  Therefore, it is not confusing, just new.
> 
> As for the other uses, I can't fully wrap my head around it, because I don't have time to study that part yet.
> 
> I have a suggestion, and I am sorry I haven't searched thoroughly to see if someone has suggested this already, though I don't believe they have.  Unicode and UTF are supported very well, however, the support is simply limited to the packages starting at Ada.Strings.UTF_Encoding.  There is no connection to Character_Set found in Maps.  Since many lines of code are designed around the traditional Character_Set in Ada.Strings, the categories of UTF should be conveyable as a (Wide_+)Character_Set so that the much existing code does not have to be fundamentally or completely redesigned in order to use the support to Unicode.  Please consider making Unicode categories available as character_sets (obviously omitting any characters from a category that are out of the range of the string, such as String, sans block drawing, vs. Wide_String, with block drawing).
> 
> If the character_set support for Unicode is not included in the pre-built standard libraries, then it will be much elaboration and unnecessary code to try to implement it correctly.
> 
> Does anybody agree or disagree with this very simple, but very impactful suggestion?
> 
> 
> 


^ permalink raw reply	[relevance 0%]

* Re: Latest suggestion for 202x
  2019-06-15 23:59  7% Latest suggestion for 202x Micah Waddoups
  2019-06-16  7:17  0% ` Dmitry A. Kazakov
@ 2019-06-16 19:34  0% ` Optikos
  2019-06-23 20:17  0% ` Per Sandberg
  2 siblings, 0 replies; 44+ results
From: Optikos @ 2019-06-16 19:34 UTC (permalink / raw)


On Saturday, June 15, 2019 at 6:59:41 PM UTC-5, Micah Waddoups wrote:
> Following is my comment of appreciation AND my suggestion that is a very basic and important level of support for Unicode.
> 
> Frankly, the former rules for directly specifying the contents of an array were perfect, when you can only use parentheses.  Being able to use square brackets to improve the readability of an array is brilliant because it is familiar to those who use other languages

1) Fortran and Ada are the mainstream languages that utilize parentheses () for array indexing, following the reason #3 below.
2) Algol-family languages (especially the widely-influential Algol60 and the otherwise-influential-on-Ada Algol68) utilize brackets [] for array indexing.  Nearly all other programming languages (which often are little more than Algol60 rejiggered a little bit) have followed Algol's lead on array indexing via bracket [] syntax.
3) Mathematics utilizes subscripts on variables for the customary equivalent of array indexing, although mathematics could be said to also permit modeling array indexing as a narrower application of the generalized function-parenthesis notation f(i), just like Ada emulates.  Hence, Fortran & Ada are more true to mathematics' f(i) notation for representing array indexing as a function-invocation syntax.

> and it does very little to change what is already part of the language definition (square brackets are already used in a way that does not conflict).  Therefore, it is not confusing, just new.
> 
> As for the other uses, I can't fully wrap my head around it, because I don't have time to study that part yet.
> 
> I have a suggestion, and I am sorry I haven't searched thoroughly to see if someone has suggested this already, though I don't believe they have.  Unicode and UTF are supported very well, however, the support is simply limited to the packages starting at Ada.Strings.UTF_Encoding.  There is no connection to Character_Set found in Maps.  Since many lines of code are designed around the traditional Character_Set in Ada.Strings, the categories of UTF should be conveyable as a (Wide_+)Character_Set so that much existing code does not have to be fundamentally or completely redesigned in order to use the support for Unicode.  Please consider making Unicode categories available as character_sets (obviously omitting any characters from a category that are out of the range of the string, such as String, sans block drawing, vs. Wide_String, with block drawing).
> 
> If the character_set support for Unicode is not included in the pre-built standard libraries, then implementing it correctly will take a great deal of elaborate and unnecessary code.
> 
> Does anybody agree or disagree with this very simple, but very impactful suggestion?

^ permalink raw reply	[relevance 0%]

* Re: Latest suggestion for 202x
  2019-06-15 23:59  7% Latest suggestion for 202x Micah Waddoups
@ 2019-06-16  7:17  0% ` Dmitry A. Kazakov
  2019-06-16 19:34  0% ` Optikos
  2019-06-23 20:17  0% ` Per Sandberg
  2 siblings, 0 replies; 44+ results
From: Dmitry A. Kazakov @ 2019-06-16  7:17 UTC (permalink / raw)


On 2019-06-16 01:59, Micah Waddoups wrote:

> I have a suggestion, and I am sorry I haven't searched thoroughly to see if someone has suggested this already, though I don't believe they have.  Unicode and UTF are supported very well, however, the support is simply limited to the packages starting at Ada.Strings.UTF_Encoding.  There is no connection to Character_Set found in Maps.  Since many lines of code are designed around the traditional Character_Set in Ada.Strings, the categories of UTF should be conveyable as a (Wide_+)Character_Set so that much existing code does not have to be fundamentally or completely redesigned in order to use the support for Unicode.  Please consider making Unicode categories available as character_sets (obviously omitting any characters from a category that are out of the range of the string, such as String, sans block drawing, vs. Wide_String, with block drawing).
> 
> If the character_set support for Unicode is not included in the pre-built standard libraries, then implementing it correctly will take a great deal of elaborate and unnecessary code.

You may find an implementation of Unicode sets, maps, categorization here:

http://www.dmitry-kazakov.de/ada/strings_edit.htm#7.6

> Does anybody agree or disagree with this very simple, but very impactful suggestion?

Sets and maps are used very infrequently. With Unicode they require a sparse 
representation and are thus less efficient than the Latin-1 variants.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[relevance 0%]

* Latest suggestion for 202x
@ 2019-06-15 23:59  7% Micah Waddoups
  2019-06-16  7:17  0% ` Dmitry A. Kazakov
                   ` (2 more replies)
  0 siblings, 3 replies; 44+ results
From: Micah Waddoups @ 2019-06-15 23:59 UTC (permalink / raw)


Following is my comment of appreciation AND my suggestion that is a very basic and important level of support for Unicode.

Frankly, the former rules for directly specifying the contents of an array were perfect, when you can only use parentheses.  Being able to use square brackets to improve the readability of an array is brilliant because it is familiar to those who use other languages and it does very little to change what is already part of the language definition (square brackets are already used in a way that does not conflict).  Therefore, it is not confusing, just new.

As for the other uses, I can't fully wrap my head around it, because I don't have time to study that part yet.

I have a suggestion, and I am sorry I haven't searched thoroughly to see if someone has suggested this already, though I don't believe they have.  Unicode and UTF are supported very well, however, the support is simply limited to the packages starting at Ada.Strings.UTF_Encoding.  There is no connection to Character_Set found in Maps.  Since many lines of code are designed around the traditional Character_Set in Ada.Strings, the categories of UTF should be conveyable as a (Wide_+)Character_Set so that much existing code does not have to be fundamentally or completely redesigned in order to use the support for Unicode.  Please consider making Unicode categories available as character_sets (obviously omitting any characters from a category that are out of the range of the string, such as String, sans block drawing, vs. Wide_String, with block drawing).

If the character_set support for Unicode is not included in the pre-built standard libraries, then implementing it correctly will take a great deal of elaborate and unnecessary code.

Does anybody agree or disagree with this very simple, but very impactful suggestion?



^ permalink raw reply	[relevance 7%]

* Re: Chess game in character over MS Windows
  @ 2019-03-01 13:17  9% ` manueledensenster
  0 siblings, 0 replies; 44+ results
From: manueledensenster @ 2019-03-01 13:17 UTC (permalink / raw)

On Friday, March 1, 2019 at 2:07:21 PM UTC+1, manueled...@gmail.com wrote:
> Hello,
> 
> I have inserted the 12 Wide_Character values into a string of length 12, but at execution the program built with gnatmake reports: length check failed.
> 
> Help me please.
> 
> 
> I have tested with UTF-8-auto-dos encoding with Emacs.
> 
> With GPS in UTF-8 is the same problem.
> 
> Thank you.

Sorry, once more.
OK, on MS Windows the problem is the same whether I build in GPS or simply with gnatmake and GtkAda.

My program is a text-mode chess game in a GtkAda project.
The encoding of the 12 Wide_Character values appears to be OK.

But the chess characters are not printed.

I convert the literal Wide_Character with this code:

[code]

Locale_From_Utf8

		  (Ada.Strings.UTF_Encoding.Conversions.Convert
		     ("" & Images(Piece_Type'Pos(Piece)),
		      UTF_8)
		  )
[/code]

To be inserted in gtk_text_buffer.

Thank you for your help.

^ permalink raw reply	[relevance 9%]

* Re: windows-1251 to utf-8
  @ 2018-10-31 20:58  9%     ` Randy Brukardt
  0 siblings, 0 replies; 44+ results
From: Randy Brukardt @ 2018-10-31 20:58 UTC (permalink / raw)

>"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
>news:prcn4v$d30$1@gioia.aioe.org...
> On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
>> Let's make it easier. For example:
>>
>> ------------------------------------------------------------------
>>
>> with Ada.Strings.Unbounded;     use Ada.Strings.Unbounded;
>> with Ada.Text_IO.Unbounded_IO;  use Ada.Text_IO.Unbounded_IO;
>>
>> with AWS.Client;            use AWS.Client;
>> with AWS.Messages;          use AWS.Messages;
>> with AWS.Response;          use AWS.Response;
>>
>> procedure Main is
>>
>>     HTML_Result   : Unbounded_String;
>>     Request_Header_List : Header_List;
>>
>> begin
>>
>>     Request_Header_List.Add(Name => "User-Agent", Value => "Mozilla/5.0 
>> (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");
>>
>>     HTML_Result := Message_Body(Get(URL => "http://www.sql.ru/", Headers 
>> => Request_Header_List));
>>
>>     Put_Line(HTML_Result);
>>
>> end Main;
>>
>> ------------------------------------------------------------------
>>
>> My linux terminal (default UTF-8) show: 
>> https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA
>>
>> If set encoding in terminal Windows-1251 - all is well: 
>> https://photos.app.goo.gl/goN5g7uofD8rYLP79
>>
>> Are there standard ways to solve this problem?
>
> What problem? The page uses the content charset=windows-1251. It is legal.
>
> Your program is illegal as it prints the body using Put_Line. Ada standard 
> requires Character be Latin-1. The only case when your program would be 
> correct is when charset=ISO-8859-1.
>
> You must convert the page body according to the encoding specified by the 
> charset key into a string containing UTF-8 octets and use 
> Streams.Stream_IO to write these octets as-is. The conversion for the case 
> of windows-1251 I described earlier. Create a table Character'Pos 
> 0..255 -> Code_Point and use it for each "character" of HTML_Result.
>
> P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the 
> underlying OS.
>
> P.P.S. Technically AWS also ignores Ada standard. But that is an 
> established practice. Since there is no better way.

Right. Probably the easiest way to do this (using just Ada functions) would 
be to:

 (A)  Use Ada.Characters.Conversions to convert the To_String of the unbounded 
string to a Wide_String, and then store that in an Unbounded_Wide_String 
(from Ada.Strings.Wide_Unbounded);
 (B) Use Ada.Strings.Wide_Maps to create a character conversion map (the 
conversions were described by another reply);
 (C) Use Ada.Strings.Wide_Unbounded.Translate to apply the mapping from (B) 
to your Unbounded_Wide_String.
 (D) Use Ada.Strings.UTF_Encoding.Wide_Strings.Encode to convert the 
To_Wide_String of your translated Unbounded_Wide_String to UTF-8, presumably 
storing the result into an Unbounded_String.

You potentially could skip (D) if Wide_Text_IO works when sent to 
Standard_Output (I'd expect that on Windows, no idea on Linux). In that 
case, use Wide_Text_IO.Put to send your result.
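
Steps (A) to (D) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name CP1251_To_UTF_8 is made up, it works on a fixed String and uses Ada.Strings.Wide_Fixed.Translate instead of the unbounded variant to keep it short, and the mapping covers only the octets 16#C0# .. 16#FF# (U+0410 .. U+044F) plus the two letters at 16#A8# and 16#B8#. The remaining Windows-1251 positions in 16#80# .. 16#BF# are not handled and would need to be added to the map.

```ada
with Ada.Characters.Conversions;
with Ada.Strings.Wide_Fixed;
with Ada.Strings.Wide_Maps;
with Ada.Strings.UTF_Encoding.Wide_Strings;

--  Hypothetical helper: treats Item as Windows-1251 octets and returns
--  the same text encoded as UTF-8. Partial mapping only (see above).
function CP1251_To_UTF_8
  (Item : String) return Ada.Strings.UTF_Encoding.UTF_8_String
is
   use Ada.Strings.Wide_Maps;

   --  64 consecutive Wide_Characters starting at code position First.
   function Run (First : Natural) return Wide_String is
      Result : Wide_String (1 .. 64);
   begin
      for I in Result'Range loop
         Result (I) := Wide_Character'Val (First + I - 1);
      end loop;
      return Result;
   end Run;

   --  Step (B): octet value (seen as a Latin-1 code position) -> Cyrillic.
   Mapping : constant Wide_Character_Mapping :=
     To_Mapping
       (From => Run (16#C0#)
                & Wide_Character'Val (16#A8#)
                & Wide_Character'Val (16#B8#),
        To   => Run (16#0410#)
                & Wide_Character'Val (16#0401#)
                & Wide_Character'Val (16#0451#));

   --  Step (A): widen each octet to a Wide_Character.
   Wide : constant Wide_String :=
     Ada.Characters.Conversions.To_Wide_String (Item);

   --  Step (C): apply the mapping.
   Translated : constant Wide_String :=
     Ada.Strings.Wide_Fixed.Translate (Wide, Mapping);
begin
   --  Step (D): encode the result as UTF-8.
   return Ada.Strings.UTF_Encoding.Wide_Strings.Encode (Translated);
end CP1251_To_UTF_8;
```

For example, the octet 16#CF# should come out as the two octets 16#D0# 16#9F# (U+041F), while plain ASCII text passes through unchanged.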

In any case, this shows why Unicode exists, and why anything these days that 
uses non-standard encodings is evil. There's really no short-cut to recoding 
such things, and that makes them maddening.

                                  Randy.

^ permalink raw reply	[relevance 9%]

* Re: Strange crash on custom iterator
  2018-07-03 14:17  9%                       ` J-P. Rosen
@ 2018-07-03 15:06  0%                         ` Lucretia
  0 siblings, 0 replies; 44+ results
From: Lucretia @ 2018-07-03 15:06 UTC (permalink / raw)

On Tuesday, 3 July 2018 15:17:14 UTC+1, J-P. Rosen  wrote:
> Le 03/07/2018 à 16:08, Lucretia a écrit :
> >    type Unicode_String is array (Positive range <>) of Integer;
> Array of Integer???? For a Unicode_String...

Firstly, read the rest of this thread; secondly, I should've renamed that in that simple test, because IT'S A TEST to show an error in the compiler. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86391

> Btw, do you know the package Ada.Strings.UTF_Encoding ?

Yes, I'm well aware of this completely useless type.

1) It's a subtype of String, which is incorrect as UTF-8 is not a superset of Latin 1, this should never have been allowed.

2) Ada needs a decent Unicode library not this half-arsed crap we have now.

^ permalink raw reply	[relevance 0%]

* Re: Strange crash on custom iterator
  @ 2018-07-03 14:17  9%                       ` J-P. Rosen
  2018-07-03 15:06  0%                         ` Lucretia
  0 siblings, 1 reply; 44+ results
From: J-P. Rosen @ 2018-07-03 14:17 UTC (permalink / raw)


Le 03/07/2018 à 16:08, Lucretia a écrit :
>    type Unicode_String is array (Positive range <>) of Integer;
Array of Integer???? For a Unicode_String...

Btw, do you know the package Ada.Strings.UTF_Encoding ?

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr

^ permalink raw reply	[relevance 9%]

* Re: Strange crash on custom iterator
  @ 2018-07-02 19:42  7%                   ` Simon Wright
    0 siblings, 1 reply; 44+ results
From: Simon Wright @ 2018-07-02 19:42 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 450 bytes --]

Luke A. Guest <laguest@archeia.com> writes:

> Simon Wright <> wrote:
>
>>> You don't need to make it tagged, to pass it by reference.  It is enough
>>> to make the formal parameter aliased.
>> 
>> Yes, that works (except you have to make the container you're iterating
>> over aliased too).
>
> I had to make the Iterate function take “aliased in out” and make the
> array aliased, but it still dies in the same place.

This worked for me ..


[-- Attachment #2: gnatchop-me --]
[-- Type: text/plain, Size: 9880 bytes --]

--  Copyright 2018, Luke A. Guest
--  License TBD.

with Ada.Characters.Latin_1;
with Ada.Text_IO; use Ada.Text_IO;
with UCA.Encoding;
with UCA.Iterators;

procedure Test is
   package L1 renames Ada.Characters.Latin_1;

   package Octet_IO is new Ada.Text_IO.Modular_IO (UCA.Octets);
   use Octet_IO;

   --  D  : UCA.Octets         := Character'Pos ('Q');
   --  A  : UCA.Unicode_String := UCA.To_Array (D);
   --  A2 : UCA.Unicode_String := UCA.Unicode_String'(1, 0, 0, 0, 0, 0, 1, 0);
   --  D2 : UCA.Octets         := UCA.To_Octet (A2);

   --  package OA_IO is new Ada.Text_IO.Integer_IO (Num => UCA.Bits);

   use UCA.Encoding;
   A : aliased UCA.Unicode_String :=
     +("ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ" & L1.LF &
         "Hello, world" & L1.LF &
         "Sîne klâwen durh die wolken sint geslagen," & L1.LF &
         "Τη γλώσσα μου έδωσαν ελληνική" & L1.LF &
         "मैं काँच खा सकता हूँ और मुझे उससे कोई चोट नहीं पहुंचती." & L1.LF &
         "میں کانچ کھا سکتا ہوں اور مجھے تکلیف نہیں ہوتی");
   B : aliased UCA.Unicode_String :=
     (225, 154, 160, 225, 155, 135, 225, 154, 187, 225, 155, 171, 225, 155, 146, 225, 155, 166, 225,
      154, 166, 225, 155, 171, 225, 154, 160, 225, 154, 177, 225, 154, 169, 225, 154, 160, 225, 154,
      162, 225, 154, 177, 225, 155, 171, 225, 154, 160, 225, 155, 129, 225, 154, 177, 225, 154, 170,
      225, 155, 171, 225, 154, 183, 225, 155, 150, 225, 154, 187, 225, 154, 185, 225, 155, 166, 225,
      155, 154, 225, 154, 179, 225, 154, 162, 225, 155, 151,
      10,
      72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100,
      10,
      83, 195, 174, 110, 101, 32, 107, 108, 195, 162, 119, 101, 110, 32, 100, 117, 114, 104, 32, 100,
      105, 101, 32, 119, 111, 108, 107, 101, 110, 32, 115, 105, 110, 116, 32, 103, 101, 115, 108, 97,
      103, 101, 110, 44,
      10,
      206, 164, 206, 183, 32, 206, 179, 206, 187, 207, 142, 207, 131, 207, 131, 206, 177, 32, 206, 188,
      206, 191, 207, 133, 32, 206, 173, 206, 180, 207, 137, 207, 131, 206, 177, 206, 189, 32, 206, 181,
      206, 187, 206, 187, 206, 183, 206, 189, 206, 185, 206, 186, 206, 174,
      10,
      224, 164, 174, 224, 165, 136, 224, 164, 130, 32, 224, 164, 149, 224, 164, 190, 224, 164, 129, 224,
      164, 154, 32, 224, 164, 150, 224, 164, 190, 32, 224, 164, 184, 224, 164, 149, 224, 164, 164, 224,
      164, 190, 32, 224, 164, 185, 224, 165, 130, 224, 164, 129, 32, 224, 164, 148, 224, 164, 176, 32,
      224, 164, 174, 224, 165, 129, 224, 164, 157, 224, 165, 135, 32, 224, 164, 137, 224, 164, 184, 224,
      164, 184, 224, 165, 135, 32, 224, 164, 149, 224, 165, 139, 224, 164, 136, 32, 224, 164, 154, 224,
      165, 139, 224, 164, 159, 32, 224, 164, 168, 224, 164, 185, 224, 165, 128, 224, 164, 130, 32, 224,
      164, 170, 224, 164, 185, 224, 165, 129, 224, 164, 130, 224, 164, 154, 224, 164, 164, 224, 165, 128, 46,
      10,
      217, 133, 219, 140, 218, 186, 32, 218, 169, 216, 167, 217, 134, 218, 134, 32, 218, 169, 218, 190,
      216, 167, 32, 216, 179, 218, 169, 216, 170, 216, 167, 32, 219, 129, 217, 136, 218, 186, 32, 216,
      167, 217, 136, 216, 177, 32, 217, 133, 216, 172, 218, 190, 219, 146, 32, 216, 170, 218, 169, 217,
      132, 219, 140, 217, 129, 32, 217, 134, 219, 129, 219, 140, 218, 186, 32, 219, 129, 217, 136, 216,
      170, 219, 140);
begin
--   Put_Line ("A => " & To_UTF_8_String (A));
   Put_Line ("A => " & L1.LF & String (+A));

   Put_Line ("A => ");
   Put ('(');

   for E of A loop
      Put (Item => E, Base => 2);
      Put (", ");
   end loop;

   Put (')');
   New_Line;

   Put_Line ("B => " & L1.LF & String (+B));

   Put_Line ("A (Iterated) => ");

   for I in UCA.Iterators.Iterate (A) loop
      Put (UCA.Iterators.Element (I));       --  ERROR! Dies in Element, Data has nothing gdb => p position - $1 = (data => (), index => 1)
   end loop;

   New_Line;
end Test;

with Ada.Strings.UTF_Encoding;
with Ada.Unchecked_Conversion;

package UCA is
   use Ada.Strings.UTF_Encoding;

   type Octets is mod 2 ** 8 with
     Size => 8;

   type Unicode_String is array (Positive range <>) of Octets with
     Pack => True;

   type Unicode_String_Access is access all Unicode_String;

   --  This should match Wide_Wide_Character in size.
   type Code_Points is mod 2 ** 32 with
     Static_Predicate => Code_Points in 0 .. 16#0000_D7FF# or Code_Points in 16#0000_E000# .. 16#0010_FFFF#,
     Size             => 32;

private
   type Bits is range 0 .. 1 with
     Size => 1;

   type Bit_Range is range 0 .. Octets'Size - 1;
end UCA;

with Ada.Finalization;
with Ada.Iterator_Interfaces;
private with System.Address_To_Access_Conversions;

package UCA.Iterators is
   ---------------------------------------------------------------------------------------------------------------------
   --  Iteration over code points.
   ---------------------------------------------------------------------------------------------------------------------
   type Cursor is private;
   pragma Preelaborable_Initialization (Cursor);

   function Has_Element (Position : in Cursor) return Boolean;

   function Element (Position : in Cursor) return Octets;

   package Code_Point_Iterators is new Ada.Iterator_Interfaces (Cursor, Has_Element);

   function Iterate (Container : aliased in Unicode_String) return Code_Point_Iterators.Forward_Iterator'Class;
   function Iterate (Container : aliased in Unicode_String; Start : in Cursor) return
     Code_Point_Iterators.Forward_Iterator'Class;

   ---------------------------------------------------------------------------------------------------------------------
   --  Iteration over grapheme clusters.
   ---------------------------------------------------------------------------------------------------------------------
private
   use Ada.Finalization;

   package Convert is new System.Address_To_Access_Conversions (Unicode_String);

   type Cursor is
      record
         Data  : Convert.Object_Pointer := null;
         Index : Positive               := Positive'Last;
      end record;

   type Code_Point_Iterator is new Limited_Controlled and Code_Point_Iterators.Forward_Iterator with
      record
         Data  : Convert.Object_Pointer := null;
      end record;

   overriding
   function First (Object : in Code_Point_Iterator) return Cursor;

   overriding
   function Next  (Object : in Code_Point_Iterator; Position : Cursor) return Cursor;

end UCA.Iterators;

with Ada.Text_IO; use Ada.Text_IO;

package body UCA.Iterators is
   package Octet_IO is new Ada.Text_IO.Modular_IO (UCA.Octets);
   use Octet_IO;

   use type Convert.Object_Pointer;

   function Has_Element (Position : in Cursor) return Boolean is
   begin
      return Position.Index in Position.Data'Range;
   end Has_Element;

   function Element (Position : in Cursor) return Octets is
   begin
      if Position.Data = null then
         raise Constraint_Error with "Fuck!";
      end if;
      Put ("<< Element - " & Positive'Image (Position.Index) & " - ");
      Put (Position.Data (Position.Index));
      Put_Line (" >>");

      return Position.Data (Position.Index);
   end Element;

   function Iterate (Container : aliased in Unicode_String) return Code_Point_Iterators.Forward_Iterator'Class is
   begin
      Put_Line ("<< iterate >>");
      return I : Code_Point_Iterator := (Limited_Controlled with
        Data => Convert.To_Pointer (Container'Address)) do
         if I.Data = null then
            Put_Line ("Data => null");
         else
            Put_Line ("Data => not null - Length: " & Positive'Image (I.Data'Length));
         end if;
         null;
      end return;
   end Iterate;

   function Iterate (Container : aliased in Unicode_String; Start : in Cursor) return
     Code_Point_Iterators.Forward_Iterator'Class is
   begin
      Put_Line ("<< iterate >>");
      return I : Code_Point_Iterator := (Limited_Controlled with
        Data => Convert.To_Pointer (Container'Address)) do
         if I.Data = null then
            Put_Line ("Data => null");
         else
            Put_Line ("Data => not null");
         end if;
         null;
      end return;
   end Iterate;

   ---------------------------------------------------------------------------------------------------------------------
   --  Iteration over grapheme clusters.
   ---------------------------------------------------------------------------------------------------------------------
   overriding
   function First (Object : in Code_Point_Iterator) return Cursor is
   begin
      return (Data => Object.Data, Index => Positive'First);
   end First;

   overriding
   function Next  (Object : in Code_Point_Iterator; Position : Cursor) return Cursor is
   begin
      return (Data => Object.Data, Index => Position.Index + 1);
   end Next;
end UCA.Iterators;

--  Copyright © 2018, Luke A. Guest
with Ada.Unchecked_Conversion;

package body UCA.Encoding is
   function To_Unicode_String (Str : in String) return Unicode_String is
      Result : Unicode_String (1 .. Str'Length) with
        Address => Str'Address;
   begin
      return Result;
   end To_Unicode_String;

   function To_String (Str : in Unicode_String) return String is
      Result : String (1 .. Str'Length) with
        Address => Str'Address;
   begin
      return Result;
   end To_String;
end UCA.Encoding;

package UCA.Encoding is
   use Ada.Strings.UTF_Encoding;

   function To_Unicode_String (Str : in String) return Unicode_String;
   function To_String (Str : in Unicode_String) return String;

   function "+" (Str : in String) return Unicode_String renames To_Unicode_String;
   function "+" (Str : in Unicode_String) return String renames To_String;
end UCA.Encoding;

^ permalink raw reply	[relevance 7%]

* Re: unicode and wide_text_io
  @ 2017-12-28 22:35 11%           ` G.B.
  0 siblings, 0 replies; 44+ results
From: G.B. @ 2017-12-28 22:35 UTC (permalink / raw)


On 28.12.17 16:47, 00120260b@gmail.com wrote:
> Then, how come the standard hasn't made it a bit easier to input/output post-Latin-1 characters? Why aren't other standards/character sets/encodings more like special cases?
> 

Actually, output of non-7-bit, unambiguously encoded text
has been made reasonably easy, I'd say, also defaulting
to what should be expected:

with Ada.Wide_Text_IO.Text_Streams;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure UTF is
    --  USD/EUR, i.e. "$/€"
    Ratio : constant Wide_String := "$/" & Wide_Character'Val (16#20AC#);

    use Ada.Wide_Text_Io, Ada.Strings;
begin
    Put_Line (Ratio); --  use defaults, traditional
    String'Write --  stream output, force UTF-8
      (Text_Streams.Stream (Current_Output),
       UTF_Encoding.Wide_Strings.Encode (Ratio));
end UTF;

The above source text uses only 7 bit encoding for post-
latin-1 strings. Only comment text is using a wide_character.

If, instead, source text is encoded by "more" bits, and using
post-latin-1 literals or identifiers, then the compiler
may need to be told. I think that BOMs may be of use, and
in any case, there are compiler switches or some other
vendor specific vocabulary describing source text.


^ permalink raw reply	[relevance 11%]

* Re: win32 interfacing check (SetClipboardData)
  @ 2017-09-02  9:38  9%           ` Xavier Petit
  0 siblings, 0 replies; 44+ results
From: Xavier Petit @ 2017-09-02  9:38 UTC (permalink / raw)


Le 01/09/2017 à 15:10, Dmitry A. Kazakov a écrit :
> On 01/09/2017 14:51, Xavier Petit wrote:
>> Thanks but even with Set_Clipboard (Ada.[Wide_]Wide_Text_IO.Get_Line); 
>> I was getting weird clipboard text without -gnatW8 flag.
> 
> But these are not UTF-8! They are UCS-2 and UCS-4.
Yes but having a look at :
https://gcc.gnu.org/onlinedocs/gnat_ugn/Character-Set-Control.html
https://gcc.gnu.org/onlinedocs/gnat_ugn/Wide_005fCharacter-Encodings.html
https://gcc.gnu.org/onlinedocs/gnat_ugn/Wide_005fWide_005fCharacter-Encodings.html

It appears that without the -gnatW8 flag, "Brackets Coding" is the default :
- “In this encoding, a wide character is represented by the following 
eight character sequence: [...]”
- “In this encoding, a wide wide character is represented by the 
following ten or twelve byte character sequence”

...and with the flag, "UTF-8 Coding" is used : “A wide character is 
represented using UCS Transformation Format 8 (UTF-8)”

I think I'm still missing something because one thing is sure :
Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode 
(Ada.Wide_Wide_Text_IO.Get_Line) doesn't work without the UTF-8 flag...

 From UTF_Encoding.Wide_Wide_Strings package :
“
The encoding routines take a Wide_Wide_String as input and encode the
result using the specified UTF encoding method.
Encode Wide_Wide_String using UTF-8 encoding
Encode Wide_Wide_String using UTF_16 encoding
”
So it means Get_Line returns a Wide_Wide_String without the UCS-4 
encoding ? because Encode doesn't return UTF-(8/16) encoding without the 
flag.

> ([Wide_]Wide_Text_IO should never be used, there is no single case one 
> would need these.)
yeah I'm gonna try not to use the Wide_Wide packages, one thing I liked 
with Wide_Wide_String is the correct 'Length attribute.

> If you have a UTF-8 encoded file (e.g. created using Notepad++, saved 
> without BOM), you should use Ada.Streams.Stream_IO, best in binary mode 
> if you are using GNAT.
> 
> You will have to detect line ends manually, but at least there will be a 
> guarantee that the run-time does not mangle anything.
Ok, “binary mode”, do you mean using Stream_Element_Array ?

> If you are using Windows calls with the "W" suffix, then all strings 
> there are already UTF-16 and you don't need to convert anything.
Ok, if I stay with win32 functions, I'll get only UTF-16; if I mess with 
external text sources (like files) or the Ada standard, I'll deal with 
other text encoding formats like UTF-8, UCS-2, etc...

^ permalink raw reply	[relevance 9%]

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-17 20:57 10% ` Jacob Sparre Andersen
@ 2016-10-18  5:44  0%   ` J-P. Rosen
  0 siblings, 0 replies; 44+ results
From: J-P. Rosen @ 2016-10-18  5:44 UTC (permalink / raw)


Le 17/10/2016 à 22:57, Jacob Sparre Andersen a écrit :
>> UTF_String should be implemented as an array like String and then
>> UTF_8_String should be a subtype of UTF_String or a renaming, if that
>> is the intent.
> I think the best you can do is to ignore the subtypes declared in
> Ada.Strings.UTF_Encoding (as they are just plain wrong), and declare
> your own type for storing UTF-8 encoded strings.

FWIW, the issue of whether to make UTF-8 a different type or a subtype
of String was discussed at the ARG. It was decided to make a subtype
basically on the grounds that:
1) In most cases, you need to read the beginning of a file (presumably
with Text_IO) before you decide whether it is UTF-8 or not
2) We feared that with a separate type, people would complain that "once
again, Ada does it differently than other languages", and that it would
involve many type conversions for no real benefit.
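
To illustrate point 1, here is a small sketch using the Encoding function and the BOM constants that Ada.Strings.UTF_Encoding declares. The procedure name and the three sample constants are invented for illustration and stand in for the first octets read from a file. Note that Encoding only looks for a BOM; without one it simply returns the Default it was given, so it does not validate the text.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;

procedure BOM_Demo is
   use Ada.Strings.UTF_Encoding;

   --  Pretend these are the first octets read from three different files.
   UTF_8_File  : constant String := BOM_8 & "text";
   UTF_16_File : constant String := BOM_16LE & 't' & Character'Val (0);
   No_BOM_File : constant String := "text";
begin
   --  Because UTF_8_String is a subtype of String, the octets read as
   --  String can be passed on without any type conversion.
   pragma Assert (Encoding (UTF_8_File) = UTF_8);
   pragma Assert (Encoding (UTF_16_File) = UTF_16LE);

   --  No BOM: the result is whatever Default says.
   pragma Assert (Encoding (No_BOM_File, Default => UTF_16BE) = UTF_16BE);

   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Encoding (UTF_16_File)));
end BOM_Demo;
```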

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr


^ permalink raw reply	[relevance 0%]

* Re: Bug in Ada - Latin 1 is not a subset of  UTF-8
  @ 2016-10-17 20:57 10% ` Jacob Sparre Andersen
  2016-10-18  5:44  0%   ` J-P. Rosen
  0 siblings, 1 reply; 44+ results
From: Jacob Sparre Andersen @ 2016-10-17 20:57 UTC (permalink / raw)


Lucretia wrote:

> Whilst binding SDL_TTF function, I was going to Overload the TTF_Size*
> functions, but I couldn't do that because UTF_8_String is a subtype of
> String; String is Latin 1 and Latin 1 is not a subset of UTF-8, ASCII
> is.
>
> UTF_String should be implemented as an array like String and then
> UTF_8_String should be a subtype of UTF_String or a renaming, if that
> is the intent.

I think the best you can do is to ignore the subtypes declared in
Ada.Strings.UTF_Encoding (as they are just plain wrong), and declare
your own type for storing UTF-8 encoded strings.

Greetings,

Jacob
-- 
"There are only two types of data:
                         Data which has been backed up
                         Data which has not been lost - yet"

^ permalink raw reply	[relevance 10%]

* Re: A few questions on parsing, sockets, UTF-8 strings
  @ 2016-08-11 19:09  9%         ` gautier_niouzes
  0 siblings, 0 replies; 44+ results
From: gautier_niouzes @ 2016-08-11 19:09 UTC (permalink / raw)


> In that case, as long as I don't need to access single characters ever, could I stick with fixed strings?

Exactly. String is just an array of (8-bit) Character, so you can store UTF-8 strings there (or ASCII, or other things...), but a single "Unicode character" will take one *or more* Characters in a String.
As a reminder, you can define "subtype UTF_8_String is String;", just to stay aware that taking a single Character out of your String can be meaningless.
But wait, the package Ada.Strings.UTF_Encoding does it for you, and also provides conversion functions.
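
A minimal sketch of the "one or more Characters" point, assuming an Ada 2012 compiler; the procedure name and the sample bytes are only illustrative. It builds a UTF-8 string byte by byte and compares its length with the number of code points after decoding:

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Count_Code_Points is
   --  UTF-8 bytes for e-acute (U+00E9, encoded as C3 A9) followed by '!'.
   --  The bytes are spelled out so the source file encoding does not matter.
   Text : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#A9#) & '!';

   --  One Wide_Wide_Character per Unicode code point.
   Decoded : constant Wide_Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (Text);
begin
   Ada.Text_IO.Put_Line ("Characters (bytes):" & Natural'Image (Text'Length));
   Ada.Text_IO.Put_Line ("Code points:" & Natural'Image (Decoded'Length));
end Count_Code_Points;
```

Here Text'Length counts three storage Characters while Decoded'Length counts two code points, which is why indexing a single Character of a UTF-8 String can land in the middle of a character.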
_________________________ 
Gautier's Ada programming 
http://sf.net/users/gdemont/

^ permalink raw reply	[relevance 9%]

* Re: Exclusive file access
  2015-08-31 23:34  9%             ` Randy Brukardt
@ 2015-09-01  7:33  0%               ` Dmitry A. Kazakov
  0 siblings, 0 replies; 44+ results
From: Dmitry A. Kazakov @ 2015-09-01  7:33 UTC (permalink / raw)


On Mon, 31 Aug 2015 18:34:12 -0500, Randy Brukardt wrote:

> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
> news:mvil865iebyb$.1of2shk5faacq$.dlg@40tude.net...
> ...
>>> As long as you don't use Wide_String—if you do use that, things get 
>>> rather messy.
>>
>> Both are messy. Character and Ada.Text_IO was designed prior to Unicode.
>> Later amendments were futile attempts to repair what needed no repair.
> 
> I would have said: "Later amendments were futile attempts to repair what 
> needed replacement." since we have no choice but to support modern character 
> sets in Ada. But the way to do that isn't by abandoning strong typing 
> (Ada.Strings.UTF_Encoding) or by duplicating everything many times 
> (Wide_Wide_String, which naturally leads to Wide_Wide_Text_IO which leads to 
> Wide_Wide_Open which leads to Wide_Wide_Madness).

Agreed

> Since a sensible solution can't be done compatibly, we needed to start 
> over -- but that's never had appropriate traction.

And that is good, because without reworking the type system no reasonable
solution is possible.

Either:

a. Encoding must be an interface, a view, of the string type decoupled from
its internal representation.

b. If the representation is the encoding, then the interface of an array of
characters must still be available.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[relevance 0%]

* Re: Exclusive file access
  @ 2015-08-31 23:34  9%             ` Randy Brukardt
  2015-09-01  7:33  0%               ` Dmitry A. Kazakov
  0 siblings, 1 reply; 44+ results
From: Randy Brukardt @ 2015-08-31 23:34 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:mvil865iebyb$.1of2shk5faacq$.dlg@40tude.net...
...
>> As long as you don't use Wide_String—if you do use that, things get 
>> rather messy.
>
> Both are messy. Character and Ada.Text_IO was designed prior to Unicode.
> Later amendments were futile attempts to repair what needed no repair.

I would have said: "Later amendments were futile attempts to repair what 
needed replacement." since we have no choice but to support modern character 
sets in Ada. But the way to do that isn't by abandoning strong typing 
(Ada.Strings.UTF_Encoding) or by duplicating everything many times 
(Wide_Wide_String, which naturally leads to Wide_Wide_Text_IO which leads to 
Wide_Wide_Open which leads to Wide_Wide_Madness).

Since a sensible solution can't be done compatibly, we needed to start 
over -- but that's never had appropriate traction.

                                Randy.


^ permalink raw reply	[relevance 9%]

* Re: gtkada: CAIRO_STATUS_INVALID_STRING
  @ 2015-03-19  9:08  9% ` J-P. Rosen
  0 siblings, 0 replies; 44+ results
From: J-P. Rosen @ 2015-03-19  9:08 UTC (permalink / raw)


Le 19/03/2015 01:50, hreba a écrit :
> So what can I do to make the String UTF-8?

package Ada.Strings.UTF_Encoding
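
A minimal sketch of what that could look like, assuming the String holds Latin-1 text (as the language defines String) and an Ada 2012 compiler. The procedure name and sample text are hypothetical; the child package Ada.Strings.UTF_Encoding.Strings does the conversion:

```ada
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;

procedure Latin_1_To_UTF_8 is
   --  "cafe" with an e-acute, written with an explicit Latin-1 code
   --  so the example does not depend on the source file encoding.
   Latin_1 : constant String := "caf" & Character'Val (16#E9#);

   --  Encode returns a UTF_8_String, which is a subtype of String,
   --  so the result can be passed wherever a String is expected.
   UTF_8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Latin_1);
begin
   Ada.Text_IO.Put_Line (UTF_8);
end Latin_1_To_UTF_8;
```

The encoded value is what a binding expecting UTF-8 would be given in place of the original String.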

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr

^ permalink raw reply	[relevance 9%]

* Re: string and wide string usage
  2013-03-07 14:20  9% ` ytomino
@ 2013-03-07 17:14  0%   ` Dmitry A. Kazakov
  0 siblings, 0 replies; 44+ results
From: Dmitry A. Kazakov @ 2013-03-07 17:14 UTC (permalink / raw)


On Thu, 7 Mar 2013 06:20:05 -0800 (PST), ytomino wrote:

> On Thursday, March 7, 2013 8:12:01 PM UTC+9, Ali Bendriss wrote:
>> I've got a problem with some strings, for example
>> a base 64 encoded string
>> V2luZG93c8KgNyBQcm9mZXNzaW9ubmVsIE4=
>> which decodes to 'Windows\xa07 Professionnel N' in UTF-8.
>> Everything works if I feed the database directly, but if I want to 
>> apply Ada.Characters.Handling.To_Lower to the string before feeding the 
>> database, postgres is not happy:
>> 'ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0xa0 0x37'
>> It's not really a big deal, but I would like to understand where the 
>> problem is. Do I have to use a wide string?
> 
> Because the functions in Ada.Characters.Handling take Latin-1, not UTF-8.
> You have to either
> 1. convert the UTF-8 String to Wide_Wide_String, process it as UTF-32, and convert it back to UTF-8.
>   (Ada.Characters.Conversions also takes Latin-1. You have to use GNAT.Encode_String/Decode_String or Ada.Strings.UTF_Encoding for the conversion.)
> 2. find an external library that processes UTF-8 directly.

Provided the base 64 data encodes a UTF-8 string that you want to convert
to a lower-case UTF-8 string using the Unicode lower-case mapping, you
can use

   function To_Lowercase (Value : String) return String;

from

http://www.dmitry-kazakov.de/ada/strings_edit.htm#7.6

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



^ permalink raw reply	[relevance 0%]

* Re: string and wide string usage
  @ 2013-03-07 14:20  9% ` ytomino
  2013-03-07 17:14  0%   ` Dmitry A. Kazakov
  0 siblings, 1 reply; 44+ results
From: ytomino @ 2013-03-07 14:20 UTC (permalink / raw)


On Thursday, March 7, 2013 8:12:01 PM UTC+9, Ali Bendriss wrote:
> I've got a problem with some strings, for example
> a base 64 encoded string
> V2luZG93c8KgNyBQcm9mZXNzaW9ubmVsIE4=
> which decodes to 'Windows\xa07 Professionnel N' in UTF-8.
> Everything works if I feed the database directly, but if I want to 
> apply Ada.Characters.Handling.To_Lower to the string before feeding the 
> database, postgres is not happy:
> 'ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0xa0 0x37'
> It's not really a big deal, but I would like to understand where the 
> problem is. Do I have to use a wide string?

Because the functions in Ada.Characters.Handling take Latin-1, not UTF-8.
You have to either
1. convert the UTF-8 String to Wide_Wide_String, process it as UTF-32, and convert it back to UTF-8.
  (Ada.Characters.Conversions also takes Latin-1. You have to use GNAT.Encode_String/Decode_String or Ada.Strings.UTF_Encoding for the conversion.)
2. find an external library that processes UTF-8 directly.
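
For what it's worth, the error message fits this explanation: the no-break space is encoded as the bytes C2 A0, and a Latin-1 To_Lower turns byte C2 (capital A circumflex) into E2 (small a circumflex), giving the invalid sequence E2 A0 37. A minimal sketch of option 1 using only the Ada 2012 standard library follows; the function name UTF_8_To_Lower is hypothetical, and how completely To_Lower covers Unicode case mapping depends on the implementation:

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Wide_Wide_Characters.Handling;

function UTF_8_To_Lower
  (Item : Ada.Strings.UTF_Encoding.UTF_8_String)
   return Ada.Strings.UTF_Encoding.UTF_8_String
is
   use Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Decode the UTF-8 bytes into one Wide_Wide_Character per code point.
   --  Decode raises Encoding_Error if Item is not valid UTF-8.
   Decoded : constant Wide_Wide_String := Decode (Item);
begin
   --  Lower-case whole code points rather than single bytes,
   --  then encode the result back to UTF-8.
   return Encode (Ada.Wide_Wide_Characters.Handling.To_Lower (Decoded));
end UTF_8_To_Lower;
```

The result can then be sent to the database in place of the output of Ada.Characters.Handling.To_Lower.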



^ permalink raw reply	[relevance 9%]

* Re: Convert wide_string to string (as the same byte array)
  2012-03-06 15:54  0%   ` Adam Beneschan
@ 2012-03-07  1:04  0%     ` Randy Brukardt
  0 siblings, 0 replies; 44+ results
From: Randy Brukardt @ 2012-03-07  1:04 UTC (permalink / raw)

"Adam Beneschan" <adam@irvine.com> wrote in message 
news:5368448.8.1331049289886.JavaMail.geo-discussion-forums@pbbpr1...
> On Monday, March 5, 2012 5:58:48 PM UTC-8, Randy Brukardt wrote:
>>
>> An alternative to Adam's solution would be to use the Ada2012 encoding
>> functions (A.4.11), specifically Ada.Strings.UTF_Encoding.Wide_Strings, 
>> and
>> use a UTF-8 encoding. That would be shorter, but not fixed length, so
>> whether that would work for you depends on the API you are feeding these
>> into.
>
> This may seem like a dumb question, but does that preserve order?

My understanding was that UTF-8 was designed so that ordinary byte 
comparison operations would work "properly" on UTF-8 strings (presuming no 
"overlong encodings" are used; there is no point in such things, it's like 
including NOPs in your generated instructions). That's surely true if only 
equality is involved; I believe it is also true for ordering, but as I've 
never tried it I don't want to say for absolutely certain.
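
A quick way to try it, as a sketch under stated assumptions: the two sample strings are arbitrary, contain no surrogates, and the procedure name is hypothetical. Ada's predefined "<" on String compares Character by Character, so on UTF-8 data it is a plain byte comparison:

```ada
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Text_IO;

procedure Check_UTF_8_Order is
   --  Two strings that differ only in the last character:
   --  U+00E9 (e-acute) and U+20AC (euro sign).
   A : constant Wide_String := "abc" & Wide_Character'Val (16#00E9#);
   B : constant Wide_String := "abc" & Wide_Character'Val (16#20AC#);

   A_8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Strings.Encode (A);
   B_8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Strings.Encode (B);
begin
   --  The ordering of the Wide_Strings and of their UTF-8 encodings
   --  should agree.
   Ada.Text_IO.Put_Line ("Wide_String order:" & Boolean'Image (A < B));
   Ada.Text_IO.Put_Line ("UTF-8 byte order :" & Boolean'Image (A_8 < B_8));
end Check_UTF_8_Order;
```

For this pair both comparisons should give the same answer, since the encodings end in C3 A9 and E2 82 AC respectively. One pair proves nothing in general, of course; it only shows how to test the claim.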

                                            Randy.

^ permalink raw reply	[relevance 0%]

* Re: Convert wide_string to string (as the same byte array)
  2012-03-06  1:58  9% ` Randy Brukardt
@ 2012-03-06 15:54  0%   ` Adam Beneschan
  2012-03-07  1:04  0%     ` Randy Brukardt
  0 siblings, 1 reply; 44+ results
From: Adam Beneschan @ 2012-03-06 15:54 UTC (permalink / raw)


On Monday, March 5, 2012 5:58:48 PM UTC-8, Randy Brukardt wrote:
> 
> An alternative to Adam's solution would be to use the Ada2012 encoding 
> functions (A.4.11), specifically Ada.Strings.UTF_Encoding.Wide_Strings, and 
> use a UTF-8 encoding. That would be shorter, but not fixed length, so 
> whether that would work for you depends on the API you are feeding these 
> into.

This may seem like a dumb question, but does that preserve order?

                        -- Adam



^ permalink raw reply	[relevance 0%]

* Re: Convert wide_string to string (as the same byte array)
  @ 2012-03-06  1:58  9% ` Randy Brukardt
  2012-03-06 15:54  0%   ` Adam Beneschan
  0 siblings, 1 reply; 44+ results
From: Randy Brukardt @ 2012-03-06  1:58 UTC (permalink / raw)


"Erich" <john@peppermind.com> wrote in message 
news:f88cc8ca-183a-40c7-a01c-2adc1137d845@b18g2000vbz.googlegroups.com...
>A newbie question: I need to convert a wide_string to a (platform/
> endian independent) string that represents all the bytes of the
> wide_string. How do you do that?

An alternative to Adam's solution would be to use the Ada2012 encoding 
functions (A.4.11), specifically Ada.Strings.UTF_Encoding.Wide_Strings, and 
use a UTF-8 encoding. That would be shorter, but not fixed length, so 
whether that would work for you depends on the API you are feeding these 
into.

                                           Randy.





^ permalink raw reply	[relevance 9%]

* Re: Why no Ada.Wide_Directories?
  2011-10-18  1:10  8%       ` Adam Beneschan
@ 2011-10-18  2:32  0%         ` ytomino
  0 siblings, 0 replies; 44+ results
From: ytomino @ 2011-10-18  2:32 UTC (permalink / raw)


On Oct 18, 10:10 am, Adam Beneschan <a...@irvine.com> wrote:
> On Oct 17, 4:47 pm, ytomino <aghi...@gmail.com> wrote:
>
> > On Oct 18, 6:33 am, "Randy Brukardt" <ra...@rrsoftware.com> wrote:
>
> > > Say what?
>
> > > Ada.Strings.UTF_Encoding (new in Ada 2012) uses a subtype of String to store
> > > UTF-8 encoded strings. As such, I'd find it pretty surprising if doing so
> > > was "a violation of the standard".
>
> > > The intent has always been that Open, Ada.Directories, etc. take UTF-8
> > > strings as an option. Presumably the implementation would use a Form to
> > > specify that the file names are in UTF-8 form rather than Latin-1. (I wasn't
> > > able to find a reference for this in a quick search, but I know it has been
> > > talked about on several occasions.)
>
> > > One of the primary reasons that Ada.Strings.UTF_Encoding uses a subtype of
> > > String rather than a separate type is so that it can be passed to Open and
> > > the like.
>
> > > It's probably true that we should standardize on the Form needed to use
> > > UTF-8 strings in these contexts, or at least come up with Implementation
> > > Advice on that point.
>
> > >                                        Randy.
>
> > Good news. Thanks for letting me know.
> > My worry has decreased a little.
>
> > However, even if that is right, Form parameters are missing from many
> > subprograms.
> > Probably all subprograms in Ada.Directories,
> > Ada.Directories.Hierarchical_File_Names, Ada.Command_Line,
> > Ada.Environment_Variables, and other subprograms that have a Name parameter
> > or return a file name should have a Form parameter.
> > (For example, I do Open (X, Form => "UTF-8"). Does Name (X)
> > return UTF-8 or Latin-1?)
>
> > Moreover, in the future, we will always use I/O subprograms in UTF-8
> > mode if what you say is realized.
> > But other libraries in the standard are explicitly defined as Latin-1.
> > It's certain that Ada.Characters.Handling.To_Upper breaks UTF-8.
>
> I have a feeling you're fundamentally confused about what UTF-8 is, as
> compared to "Latin-1".  Latin-1 is a character mapping.  It defines,
> for all integers in the range 0..255, what character that integer
> represents (e.g. 77 represents 'M', etc.).  Unicode is a character
> mapping that defines characters for a much larger integer range.  For
> integers in the range 0..255, the character represented in Unicode is
> the same as that in Latin-1; higher integers represent characters in
> other alphabets, other symbols, etc.  Those mappings just tell you
> what symbols go with what numbers, and they don't say anything about
> how the numbers are supposed to be stored.
>
> UTF-8 is an encoding (representation).  It defines, for each non-
> negative integer up to a certain point, what bits are used to
> represent that integer.  The number of bits is not fixed.  So even if
> you're working with characters all in the 0..255 range, some of those
> characters will be represented in 8 bits (one byte) and some will take
> 16 bits (two bytes).
>
> Because of this, it is not feasible to work with strings or characters
> in UTF-8 encoding.  Suppose you declare a string
>
>    S : String (1 .. 100);
>
> but you want it to be a UTF-8 string.  How would that work?  If you
> want to look at S(50), the computer would have to start at the
> beginning of the string and figure out whether each character is
> represented as 1 or 2 bytes.  Nobody wants that.
>
> The only sane way to work with strings in memory is to use a format
> where every character is the same size (String if all your characters
> are in the 0..255 range, Wide_String for 0..65535, Wide_Wide_String
> for 0..2**31-1).  Then, if you have a string of bytes in UTF-8 format,
> you convert it to a regular (Wide_)(Wide_)String with routines in
> Ada.Strings.UTF_Encoding; and it also has routines for converting
> regular strings to UTF-8 format.  But you don't want to *keep* strings
> in memory and work with them in UTF-8 format.  That's why it doesn't
> make sense to have string routines (like
> Ada.Strings.Equal_Case_Insensitive or Ada.Characters.Handling.To_Upper)
> that work with UTF-8.
>
> Hope this solves your problem.
>
>                              -- Adam

I'm not confused. You're misreading.

Of course, if applications always hold file names as Wide_Wide_String
and encode to UTF-8 only when calling I/O subprograms, as you say,
then it's very simple, and that is perhaps the intended method. I
understand that.

But where do these file names come from?
They are usually given on the command line or in a configuration file
(written by the user).
They are probably encoded in UTF-8 if the locale setting of the OS is UTF-8.
So Form parameters for the subprograms in Ada.Command_Line are necessary,
and it's natural to keep the names in UTF-8.

(Some file systems, like those on Linux, accept broken code as a correct
file name. Applications must not (cannot?) decode/encode file names in
this case.
A broken file name may be the right file name if the user sets the LANG
variable.
The same holds for NTFS/NFS+. These file systems can accept broken
UTF-16. Strictly speaking, an application should never encode/decode
file names. But Ada has decided that file names are stored in String
(according to Randy), so we have to give up on UTF-16 file
systems.)

And it's common in many other libraries and languages for text
processing functions to keep strings encoded. I do not necessarily want
to reject the Ada way, but I feel your opinion is prejudiced. In fact,
it is not as difficult as you say.



^ permalink raw reply	[relevance 0%]

* Re: Why no Ada.Wide_Directories?
  @ 2011-10-18  1:10  8%       ` Adam Beneschan
  2011-10-18  2:32  0%         ` ytomino
  0 siblings, 1 reply; 44+ results
From: Adam Beneschan @ 2011-10-18  1:10 UTC (permalink / raw)

On Oct 17, 4:47 pm, ytomino <aghi...@gmail.com> wrote:
> On Oct 18, 6:33 am, "Randy Brukardt" <ra...@rrsoftware.com> wrote:
>
> > Say what?
>
> > Ada.Strings.UTF_Encoding (new in Ada 2012) uses a subtype of String to store
> > UTF-8 encoded strings. As such, I'd find it pretty surprising if doing so
> > was "a violation of the standard".
>
> > The intent has always been that Open, Ada.Directories, etc. take UTF-8
> > strings as an option. Presumably the implementation would use a Form to
> > specify that the file names are in UTF-8 form rather than Latin-1. (I wasn't
> > able to find a reference for this in a quick search, but I know it has been
> > talked about on several occasions.)
>
> > One of the primary reasons that Ada.Strings.UTF_Encoding uses a subtype of
> > String rather than a separate type is so that it can be passed to Open and
> > the like.
>
> > It's probably true that we should standardize on the Form needed to use
> > UTF-8 strings in these contexts, or at least come up with Implementation
> > Advice on that point.
>
> >                                        Randy.
>
> Good news. Thanks for letting me know.
> My worry has decreased a little.
>
> However, even if that is right, Form parameters are missing from many
> subprograms.
> Probably all subprograms in Ada.Directories,
> Ada.Directories.Hierarchical_File_Names, Ada.Command_Line,
> Ada.Environment_Variables, and other subprograms that have a Name parameter
> or return a file name should have a Form parameter.
> (For example, I do Open (X, Form => "UTF-8"). Does Name (X)
> return UTF-8 or Latin-1?)
>
> Moreover, in the future, we will always use I/O subprograms in UTF-8
> mode if what you say is realized.
> But other libraries in the standard are explicitly defined as Latin-1.
> It's certain that Ada.Characters.Handling.To_Upper breaks UTF-8.

I have a feeling you're fundamentally confused about what UTF-8 is, as
compared to "Latin-1".  Latin-1 is a character mapping.  It defines,
for all integers in the range 0..255, what character that integer
represents (e.g. 77 represents 'M', etc.).  Unicode is a character
mapping that defines characters for a much larger integer range.  For
integers in the range 0..255, the character represented in Unicode is
the same as that in Latin-1; higher integers represent characters in
other alphabets, other symbols, etc.  Those mappings just tell you
what symbols go with what numbers, and they don't say anything about
how the numbers are supposed to be stored.

UTF-8 is an encoding (representation).  It defines, for each non-
negative integer up to a certain point, what bits are used to
represent that integer.  The number of bits is not fixed.  So even if
you're working with characters all in the 0..255 range, some of those
characters will be represented in 8 bits (one byte) and some will take
16 bits (two bytes).
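
To make the variable width concrete, here is a small sketch (the procedure name and the two sample characters are only illustrative, and an Ada 2012 compiler is assumed). It encodes two one-character Latin-1 strings to UTF-8 and reports how many bytes each takes:

```ada
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;

procedure Show_UTF_8_Lengths is
   --  Two one-character Latin-1 strings: 'M' (77) and e-acute (233).
   Plain    : constant String := "M";
   Accented : constant String := (1 => Character'Val (233));

   --  The typed constants select the UTF-8 overload of Encode.
   Plain_8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Plain);
   Accented_8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Accented);
begin
   Ada.Text_IO.Put_Line ("'M' in UTF-8:" & Natural'Image (Plain_8'Length));
   Ada.Text_IO.Put_Line
     ("e-acute in UTF-8:" & Natural'Image (Accented_8'Length));
end Show_UTF_8_Lengths;
```

The first length should be 1 and the second 2, even though both inputs are single Characters in the 0..255 range.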

Because of this, it is not feasible to work with strings or characters
in UTF-8 encoding.  Suppose you declare a string

   S : String (1 .. 100);

but you want it to be a UTF-8 string.  How would that work?  If you
want to look at S(50), the computer would have to start at the
beginning of the string and figure out whether each character is
represented as 1 or 2 bytes.  Nobody wants that.

The only sane way to work with strings in memory is to use a format
where every character is the same size (String if all your characters
are in the 0..255 range, Wide_String for 0..65535, Wide_Wide_String
for 0..2**31-1).  Then, if you have a string of bytes in UTF-8 format,
you convert it to a regular (Wide_)(Wide_)String with routines in
Ada.Strings.UTF_Encoding; and it also has routines for converting
regular strings to UTF-8 format.  But you don't want to *keep* strings
in memory and work with them in UTF-8 format.  That's why it doesn't
make sense to have string routines (like
Ada.Strings.Equal_Case_Insensitive or Ada.Characters.Handling.To_Upper)
that work with UTF-8.

Hope this solves your problem.

                             -- Adam

^ permalink raw reply	[relevance 8%]

* Re: Why no Ada.Wide_Directories?
  2011-10-15  1:06  8% ` ytomino
  2011-10-15  8:38  0%   ` Dmitry A. Kazakov
@ 2011-10-17 21:33  0%   ` Randy Brukardt
    1 sibling, 1 reply; 44+ results
From: Randy Brukardt @ 2011-10-17 21:33 UTC (permalink / raw)


"ytomino" <aghia05@gmail.com> wrote in message 
news:418b8140-fafb-442f-b91c-e22cc47f8adb@y22g2000pri.googlegroups.com...
> Hello.
> In RM 3.5.2, Ada's Character/String types are not UTF-8 but Latin-1
> (except Ada.Strings.UTF_Encoding).
> I'm afraid that is violation of the standard even if the
> implementation accepts UTF-8.

Say what?

Ada.Strings.UTF_Encoding (new in Ada 2012) uses a subtype of String to store 
UTF-8 encoded strings. As such, I'd find it pretty surprising if doing so 
was "a violation of the standard".

The intent has always been that Open, Ada.Directories, etc. take UTF-8 
strings as an option. Presumably the implementation would use a Form to 
specify that the file names are in UTF-8 form rather than Latin-1. (I wasn't 
able to find a reference for this in a quick search, but I know it has been 
talked about on several occasions.)

One of the primary reasons that Ada.Strings.UTF_Encoding uses a subtype of 
String rather than a separate type is so that it can be passed to Open and 
the like.

It's probably true that we should standardize on the Form needed to use 
UTF-8 strings in these contexts, or at least come up with Implementation 
Advice on that point.

                                       Randy.





^ permalink raw reply	[relevance 0%]

* Re: Why no Ada.Wide_Directories?
  2011-10-15  1:06  8% ` ytomino
@ 2011-10-15  8:38  0%   ` Dmitry A. Kazakov
  2011-10-17 21:33  0%   ` Randy Brukardt
  1 sibling, 0 replies; 44+ results
From: Dmitry A. Kazakov @ 2011-10-15  8:38 UTC (permalink / raw)

On Fri, 14 Oct 2011 18:06:05 -0700 (PDT), ytomino wrote:

> In RM 3.5.2, Ada's Character/String types are not UTF-8 but Latin-1
> (except Ada.Strings.UTF_Encoding).
> I'm afraid that is violation of the standard even if the
> implementation accepts UTF-8.

The same applies to Wide_String, which is UCS-2 not UTF-16. Implementations
pretending otherwise are wrong. For that matter Windows xW calls are
UTF-16. Passing Wide_String there is wrong.

> Of course, I think that the standard is impractical, too.

There are two problems with the standard:

1. It does not define strings and characters in terms of a code point type
to be consistent with Unicode;

2. It does not provide automatic conversions between character/string
types, because of the problem #1, and because the Ada type system is too
weak for that.

Clearly file operations, directory operations, and character maps should be
defined using code points rather than characters. There should be only one
instance of each operation/package, independent of the encoding and of
combinations of encodings.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[relevance 0%]

* Re: Why no Ada.Wide_Directories?
  @ 2011-10-15  1:06  8% ` ytomino
  2011-10-15  8:38  0%   ` Dmitry A. Kazakov
  2011-10-17 21:33  0%   ` Randy Brukardt
  0 siblings, 2 replies; 44+ results
From: ytomino @ 2011-10-15  1:06 UTC (permalink / raw)


Hello.
In RM 3.5.2, Ada's Character/String types are not UTF-8 but Latin-1
(except in Ada.Strings.UTF_Encoding).
I'm afraid storing UTF-8 in them is a violation of the standard, even if the
implementation accepts UTF-8.

Of course, I think that the standard is impractical, too.
If we must keep to the standard, there is no way at all to access a file (or
other environment features) with a non-ASCII name.
I can hardly bear that... But that's another problem.

I do not know why the standard does not have Wide_Directories,
Text_IO.Wide_Open, Wide_Command_Line, Wide_Environment_Variables,
and so on.
Still, I hope for these (or for the standard to allow Character/String
to represent UTF-8).



^ permalink raw reply	[relevance 8%]

Results 1-44 of 44 | reverse | options above

-- pct% links below jump to the message on this page, permalinks otherwise --
2011-10-14  6:58     Why no Ada.Wide_Directories? Michael Rohan
2011-10-15  1:06  8% ` ytomino
2011-10-15  8:38  0%   ` Dmitry A. Kazakov
2011-10-17 21:33  0%   ` Randy Brukardt
2011-10-17 23:47         ` ytomino
2011-10-18  1:10  8%       ` Adam Beneschan
2011-10-18  2:32  0%         ` ytomino
2012-02-24 22:01     Convert wide_string to string (as the same byte array) Erich
2012-03-06  1:58  9% ` Randy Brukardt
2012-03-06 15:54  0%   ` Adam Beneschan
2012-03-07  1:04  0%     ` Randy Brukardt
2013-03-07 11:12     string and wide string usage Ali Bendriss
2013-03-07 14:20  9% ` ytomino
2013-03-07 17:14  0%   ` Dmitry A. Kazakov
2015-03-19  0:50     gtkada: CAIRO_STATUS_INVALID_STRING hreba
2015-03-19  9:08  9% ` J-P. Rosen
2015-08-27 13:52     Exclusive file access ahlan
2015-08-28 17:40     ` ahlan
2015-08-29  7:05       ` Dmitry A. Kazakov
2015-08-29  8:31         ` Pascal Obry
2015-08-29 12:02           ` Dmitry A. Kazakov
2015-08-30 11:35             ` Florian Weimer
2015-08-30 12:44               ` Dmitry A. Kazakov
2015-08-31 23:34  9%             ` Randy Brukardt
2015-09-01  7:33  0%               ` Dmitry A. Kazakov
2016-08-11 14:39     A few questions on parsing, sockets, UTF-8 strings john
2016-08-11 16:23     ` Dmitry A. Kazakov
2016-08-11 17:40       ` john
2016-08-11 17:49         ` Dmitry A. Kazakov
2016-08-11 18:22           ` john
2016-08-11 19:09  9%         ` gautier_niouzes
2016-10-17 20:18     Bug in Ada - Latin 1 is not a subset of UTF-8 Lucretia
2016-10-17 20:57 10% ` Jacob Sparre Andersen
2016-10-18  5:44  0%   ` J-P. Rosen
2017-08-29 20:28     win32 interfacing check (SetClipboardData) Xavier Petit
2017-08-30 16:04     ` Dmitry A. Kazakov
2017-08-30 18:41       ` Xavier Petit
2017-08-30 21:17         ` Dmitry A. Kazakov
2017-09-01 12:51           ` Xavier Petit
2017-09-01 13:10             ` Dmitry A. Kazakov
2017-09-02  9:38  9%           ` Xavier Petit
2017-12-27 18:08     unicode and wide_text_io Mehdi Saada
2017-12-28 13:15     ` Mehdi Saada
2017-12-28 14:25       ` Dmitry A. Kazakov
2017-12-28 14:32         ` Simon Wright
2017-12-28 15:28           ` Niklas Holsti
2017-12-28 15:47             ` 00120260b
2017-12-28 22:35 11%           ` G.B.
2018-06-30 10:48     Strange crash on custom iterator Lucretia
2018-06-30 11:32     ` Simon Wright
2018-06-30 12:02       ` Lucretia
2018-06-30 14:25         ` Simon Wright
2018-06-30 14:33           ` Lucretia
2018-06-30 19:25             ` Simon Wright
2018-06-30 19:36               ` Luke A. Guest
2018-07-01 18:06                 ` Jacob Sparre Andersen
2018-07-01 19:59                   ` Simon Wright
2018-07-02 17:43                     ` Luke A. Guest
2018-07-02 19:42  7%                   ` Simon Wright
2018-07-03 14:08                         ` Lucretia
2018-07-03 14:17  9%                       ` J-P. Rosen
2018-07-03 15:06  0%                         ` Lucretia
2018-10-31  2:57     windows-1251 to utf-8 eduardsapotski
2018-10-31 15:28     ` eduardsapotski
2018-10-31 17:01       ` Dmitry A. Kazakov
2018-10-31 20:58  9%     ` Randy Brukardt
2019-03-01 13:07     Chess game in character over MS Windows manueledensenster
2019-03-01 13:17  9% ` manueledensenster
2019-06-15 23:59  7% Latest suggestion for 202x Micah Waddoups
2019-06-16  7:17  0% ` Dmitry A. Kazakov
2019-06-16 19:34  0% ` Optikos
2019-06-23 20:17  0% ` Per Sandberg
2019-11-22 13:09     I need to show extended Ascii codes in GtkAda environment L Dries
2019-11-22 14:12     ` Dmitry A. Kazakov
2019-11-22 21:22 13%   ` Randy Brukardt
2019-11-22 21:36 10%     ` Dmitry A. Kazakov
2020-07-23 15:48     Unable to use "Find all references" with GPS CE 2020 !? Jérôme Haguet
2020-07-23 17:33     ` gautier_niouzes
2020-08-26 14:47       ` Jérôme Haguet
2020-08-28 10:35  8%     ` Jérôme Haguet
2021-06-18 11:02  0%       ` Jérôme Haguet
2021-04-17 22:03     Ada and Unicode DrPi
2021-04-19  8:29     ` Maxim Reznik
2021-04-19 11:15       ` Simon Wright
2022-04-03 19:20 12%     ` Thomas
2021-04-19  9:08  9% ` Stephen Leake
2021-04-19 11:56 11%   ` Luke A. Guest
2021-04-19 12:13  0%     ` Luke A. Guest
2021-04-19 15:48  0%       ` DrPi
2021-04-19 12:52  0%     ` Dmitry A. Kazakov
2021-04-19 13:00           ` Luke A. Guest
2021-04-19 13:24             ` J-P. Rosen
2022-04-03 18:04  8%           ` Thomas
2021-04-20 19:06             ` Randy Brukardt
2022-04-03 18:37 10%           ` Thomas
2021-04-19 16:14  0%   ` DrPi
2021-04-19 13:18     ` Vadim Godunko
2022-04-03 16:51  8%   ` Thomas
2023-04-04  0:02 14%     ` Thomas
2021-06-19 18:28     XMLAda & unicode symbols 196...@googlemail.com
2021-06-19 21:24     ` Simon Wright
2021-06-20 17:10       ` 196...@googlemail.com
2021-06-21 15:26  7%     ` Simon Wright
2022-03-01 20:47 10% [ANN] UXStrings package available (UXS_20220226) Blady
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox