comp.lang.ada
* gnat_regpat and unexpected handling of alnum and unicode needed
@ 2019-02-17 11:24 19.krause.70
  2019-02-17 12:50 ` Simon Wright
  2019-02-17 13:09 ` 19.krause.70
  0 siblings, 2 replies; 5+ messages in thread
From: 19.krause.70 @ 2019-02-17 11:24 UTC (permalink / raw)


Hello All,

I have been struggling with a behavior of gnat_regpat that differs both from the behavior of egrep, for example, and from the behavior I expect.

The expression [[:alnum:]] matches the underscore in gnat_regpat but not in egrep. Not matching the underscore, as egrep does, feels much more natural, and I think it is more POSIX compliant.

The question is: why?

I could simply use [[:alpha:][0-9]]+ instead, but that leads to my second question: how do I handle Unicode strings with gnat_regpat, given that [[:alpha:]] seems to match only ASCII a-zA-Z? Some sample code (saved as UTF-8 text, compiled with -gnatW8):

with Ada.Text_IO;
with GNAT.Regpat;

--  Source saved as UTF-8 and compiled with -gnatW8.
procedure Gnat_Regpat_Test is
   Test1   : constant String := "foo_bar";
   Test2   : constant String := "fööbär";
   Regexp1 : constant String := "^[[:alnum:]]+$";
   Regexp2 : constant String := "^[[:alpha:][0-9]]+$";

   --  Report whether Data matches Pattern; Name labels the pattern.
   procedure Check (Data, Name, Pattern : String) is
   begin
      if GNAT.Regpat.Match (Expression => Pattern, Data => Data) then
         Ada.Text_IO.Put_Line
           (Data & " Matched " & Name & " " & Pattern & "!");
      else
         Ada.Text_IO.Put_Line
           (Data & " doesn't Match " & Name & " " & Pattern);
      end if;
   end Check;
begin
   Check (Test1, "regexp1", Regexp1);
   Check (Test1, "regexp2", Regexp2);
   Check (Test2, "regexp1", Regexp1);
   Check (Test2, "regexp2", Regexp2);
end Gnat_Regpat_Test;

Best Regards,

Hubert


* Re: gnat_regpat and unexpected handling of alnum and unicode needed
  2019-02-17 11:24 gnat_regpat and unexpected handling of alnum and unicode needed 19.krause.70
@ 2019-02-17 12:50 ` Simon Wright
  2019-02-17 13:15   ` 19.krause.70
  2019-02-17 13:21   ` 19.krause.70
  2019-02-17 13:09 ` 19.krause.70
  1 sibling, 2 replies; 5+ messages in thread
From: Simon Wright @ 2019-02-17 12:50 UTC (permalink / raw)


19.krause.70@googlemail.com writes:

> The expression [[:alnum:]] matches the underscore in gnat_regpat but
> not in egrep. Not matching the underscore, as egrep does, feels much
> more natural, and I think it is more POSIX compliant.
>
> Question is why?

Because, at s-regpat.adb:2325, we find

   function Is_Alnum (C : Character) return Boolean is
   begin
      return Is_Alphanumeric (C) or else C = '_';
   end Is_Alnum;

(Is_Alphanumeric is in Ada.Characters.Handling), presumably because the
author liked using underscores in identifiers.
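
For a class without the underscore, one option is to spell it out from the POSIX classes that GNAT.Regpat already parses. This is a minimal sketch, assuming [:alpha:] and [:digit:] follow Is_Letter and Is_Digit; the procedure name Alnum_Without_Underscore and the test strings are made up for illustration:

```ada
with Ada.Text_IO;
with GNAT.Regpat;

--  Hypothetical demo procedure, not part of GNAT.
procedure Alnum_Without_Underscore is
   --  Letters and digits only. This deliberately avoids [:alnum:],
   --  which in GNAT.Regpat also accepts '_'.
   Pattern : constant String := "^[[:alpha:][:digit:]]+$";

   procedure Check (Data : String) is
      Ok : constant Boolean :=
        GNAT.Regpat.Match (Expression => Pattern, Data => Data);
   begin
      Ada.Text_IO.Put_Line (Data & " => " & Boolean'Image (Ok));
   end Check;
begin
   Check ("foobar42");  --  letters and digits only
   Check ("foo_bar");   --  contains an underscore
end Alnum_Without_Underscore;
```

Both classes sit directly inside one pair of outer brackets, so no nested bracket expression is needed.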

> How do I handle unicode strings with gnat_regpat, because [[:alpha:]]
> seems to match only ascii a-zA-Z.

What GNAT does with -gnatW8 is to read UTF-8 from the source file and,
in the case of characters, convert them to the internal Latin-1
(approximately) character. So your 'ö' is converted to the single
character with value 246, LC_O_Diaeresis.
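
The Latin-1 mapping can be checked without any regular expression. The sketch below uses the named constant from Ada.Characters.Latin_1, so it does not depend on the source encoding or on -gnatW8; the procedure name Latin_1_Check is illustrative only:

```ada
with Ada.Text_IO;
with Ada.Characters.Handling;
with Ada.Characters.Latin_1;

--  Hypothetical demo procedure.
procedure Latin_1_Check is
   --  Lower-case o with diaeresis as a single Latin-1 Character.
   C : constant Character := Ada.Characters.Latin_1.LC_O_Diaeresis;
begin
   --  Position of the character within type Character.
   Ada.Text_IO.Put_Line
     ("Pos       =" & Integer'Image (Character'Pos (C)));

   --  Is_Letter covers the Latin-1 letters, not only ASCII a-z and A-Z.
   Ada.Text_IO.Put_Line
     ("Is_Letter = "
      & Boolean'Image (Ada.Characters.Handling.Is_Letter (C)));
end Latin_1_Check;
```

This prints a position of 246 and Is_Letter = TRUE, which is consistent with [[:alpha:]] accepting such a character once it is held as a single Latin-1 Character.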

I tried just letters, and got

   fööbär Matched regexp3 ^[[:alpha:]]+$!

No idea what's going on here!
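
One possible explanation, offered as a guess rather than a confirmed reading of s-regpat.adb: inside a bracket expression, [0-9] is not a nested class. In POSIX syntax the inner '[' is an ordinary member, the first ']' after 0-9 closes the class, and the remaining ']' becomes a literal that '+' applies to. Under that reading regexp2 requires the data to end in ']', which would explain why neither test string matched it. The sketch below compares the posted pattern with a flattened one; the procedure name Bracket_Check and the test strings are invented for illustration:

```ada
with Ada.Text_IO;
with GNAT.Regpat;

--  Hypothetical demo procedure.
procedure Bracket_Check is
   Posted    : constant String := "^[[:alpha:][0-9]]+$";  --  as in the thread
   Flattened : constant String := "^[[:alpha:]0-9]+$";    --  range moved inside

   --  Print whether Data matches Pattern; Name labels the pattern.
   procedure Check (Name, Pattern, Data : String) is
      Ok : constant Boolean :=
        GNAT.Regpat.Match (Expression => Pattern, Data => Data);
   begin
      Ada.Text_IO.Put_Line
        (Name & " on " & Data & " => " & Boolean'Image (Ok));
   end Check;
begin
   Check ("posted", Posted, "foobar42");
   Check ("posted", Posted, "a]]");
   Check ("flattened", Flattened, "foobar42");
   Check ("flattened", Flattened, "foo_bar");
end Bracket_Check;
```

If the guess holds, the posted pattern should accept only data of the form "one class member followed by one or more ']'", while the flattened pattern should accept plain letters and digits and still reject the underscore.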



* Re: gnat_regpat and unexpected handling of alnum and unicode needed
  2019-02-17 11:24 gnat_regpat and unexpected handling of alnum and unicode needed 19.krause.70
  2019-02-17 12:50 ` Simon Wright
@ 2019-02-17 13:09 ` 19.krause.70
  1 sibling, 0 replies; 5+ messages in thread
From: 19.krause.70 @ 2019-02-17 13:09 UTC (permalink / raw)


Hello,

I forgot to post the output of the program:

foo_bar Matched regexp1 ^[[:alnum:]]+$!
foo_bar doesn't Match regexp2 ^[[:alpha:][0-9]]+$
fööbär Matched regexp1 ^[[:alnum:]]+$!
fööbär doesn't Match regexp2 ^[[:alpha:][0-9]]+$

Best Regards,

Hubert


* Re: gnat_regpat and unexpected handling of alnum and unicode needed
  2019-02-17 12:50 ` Simon Wright
@ 2019-02-17 13:15   ` 19.krause.70
  2019-02-17 13:21   ` 19.krause.70
  1 sibling, 0 replies; 5+ messages in thread
From: 19.krause.70 @ 2019-02-17 13:15 UTC (permalink / raw)


Hello Simon,

So gnat_regpat is not Unicode-ready. Does anyone know of Ada libraries that can handle regular expressions with Unicode? Or am I doing something wrong?

regards,

Hubert



* Re: gnat_regpat and unexpected handling of alnum and unicode needed
  2019-02-17 12:50 ` Simon Wright
  2019-02-17 13:15   ` 19.krause.70
@ 2019-02-17 13:21   ` 19.krause.70
  1 sibling, 0 replies; 5+ messages in thread
From: 19.krause.70 @ 2019-02-17 13:21 UTC (permalink / raw)


On Sunday, 17 February 2019 13:50:24 UTC+1, Simon Wright wrote:
> I tried just letters, and got
> 
>    fööbär Matched regexp3 ^[[:alpha:]]+$!
> 
> No idea what's going on here!

Strange. But the conclusion seems to be that gnat_regpat is not Unicode-aware (GNAT v. 6). Does anyone know of a Unicode-aware, POSIX extended regexp compatible library for Ada?

regards,

Hubert

