Lexical Conundrum

comp.lang.ada
 help / color / mirror / Atom feed

* Lexical Conundrum
@ 1998-02-19  0:00 Nick Roberts
       [not found] ` <EotBMK.MnK@world.std.com>
  1998-02-23  0:00 ` Jean-Pierre Rosen
  0 siblings, 2 replies; 15+ messages in thread
From: Nick Roberts @ 1998-02-19  0:00 UTC (permalink / raw)



I have run the following program

   1  with Ada.Text_IO; use Ada.Text_IO;
   2  procedure Test_1 is
   3     subtype UC is Character range'A'..'Z';
   4  begin
   5     Put_Line("Start of Test_1");
   6     if 'a'='a' or'a'in UC then
   7        Put_Line("True branch executed");
   8     end if;
   9     Put_Line("End of Test_1");
  10  end;

through GNAT 3.10, and it compiles and runs fine.  I haven't tried, but I
imagine it would probably work on any Ada compiler, in all likelihood.

But, if you look closely at line 6, you will see the sequence

   or'a'in

in the middle of an expression.

Now, from chapter 2 of the RM, one might get the impression that this could
be parsed as five lexical elements (three identifiers and two apostrophes).
 Of course, if a compiler were to parse it that way, the result would be a
syntax error.

For comparison, I have put

   range'A'..'Z';

into line 3.  I don't think there is an ambiguity in parsing this, although
it is very close to the previous example.  Similarly, if the space were to
be removed from just before the example in line 6, the ambiguity would be
resolved there also.

One immediate conclusion, I think, is that the introduction of a one-letter
attribute (in an implementation of the language) could cause difficulties! 
Of course, it's hard to imagine a motive for such an attribute, in
practice.

Question: would a compiler be in contravention of the RM by rejecting the
above program (with a syntax error in line 6)?  I admit that such a
compiler may be considered to be poorly designed!


== Nick Roberts ================================================
== Croydon, UK                       ===========================
==                                              ================
== Proprietor, ThoughtWing Software                   ==========
== Independent Software Development Consultant            ======
== Nick.Roberts@dial.pipex.com                              ====
== Voicemail & Fax +44 181-405 1124                          ===
==                                                            ==
==           I live not in myself, but I become               ==
===          Portion of that around me; and to me             ==
====         High mountains are a feeling, but the hum        ==
=======      Of human cities torture.
===========                             -- Byron [Childe Harold]





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
       [not found] ` <EotBMK.MnK@world.std.com>
@ 1998-02-22  0:00   ` Robert Dewar
  1998-02-23  0:00     ` Keith Thompson
                       ` (2 more replies)
  1998-02-23  0:00   ` Mark A Biggar
  1 sibling, 3 replies; 15+ messages in thread
From: Robert Dewar @ 1998-02-22  0:00 UTC (permalink / raw)



Robert Duff says

<<>Now, from chapter 2 of the RM, one might get the impression that this could
>be parsed as five lexical elements (three identifiers and two apostrophes).

I think 2.2(7) makes it clear that the above is three lexical elements,
not five.
>>

I am puzzled. 2.2(7) says

7   One or more separators are allowed between any two adjacent lexical                    
elements, before the first of each compilation, or after the last.  At least               
one separator is required between an identifier, a reserved word, or a                     
numeric_literal and an adjacent identifier, reserved word, or numeric_                     
literal.                                                                                   


But neither the 3 token nor the 5 token interpretations have or need
separators, so how is this paragraph relevant?

Not that there is a problem anyway!!!!





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Lexical Conundrum
@ 1998-02-22  0:00 Nick Roberts
  1998-02-22  0:00 ` Robert Dewar
  0 siblings, 1 reply; 15+ messages in thread
From: Nick Roberts @ 1998-02-22  0:00 UTC (permalink / raw)



I have run the following program

   1  with Ada.Text_IO; use Ada.Text_IO;
   2  procedure Test_1 is
   3     subtype UC is Character range'A'..'Z';
   4  begin
   5     Put_Line("Start of Test_1");
   6     if 'a'='a' or'a'in UC then
   7        Put_Line("True branch executed");
   8     end if;
   9     Put_Line("End of Test_1");
  10  end;

through GNAT 3.10, and it compiles and runs fine.  I haven't tried, but I
imagine it would probably work on any Ada compiler, in all likelihood.
But, if you look closely at line 6, you will see the sequence

   or'a'in

in the middle of an expression.
Now, from chapter 2 of the RM, one might get the impression that this could
be parsed as five lexical elements (three identifiers and two apostrophes).
Of course, if a compiler were to parse it that way, the result would be a
syntax error.

For comparison, I have put

   range'A'..'Z';

into line 3.  I don't think there is an ambiguity in parsing this, although
it is very close to the previous example.  Similarly, if the space were to
be removed from just before the example in line 6, the ambiguity would be
resolved there also.

One immediate conclusion, I think, is that the introduction of a one-letter
attribute (in an implementation of the language) could cause difficulties!
Of course, it's hard to imagine a motive for such an attribute, in practice.

Question: would a compiler be in contravention of the RM by rejecting the
above program (with a syntax error in line 6)?  I admit that such a
compiler may be considered to be poorly designed!


== Nick Roberts ================================================
== Croydon, UK                       ===========================
==                                              ================
== Proprietor, ThoughtWing Software                   ==========
== Independent Software Development Consultant            ======
== Nick.Roberts@dial.pipex.com                              ====
== Voicemail & Fax +44 181-405 1124                          ===
==                                                            ==
==           I live not in myself, but I become               ==
===          Portion of that around me; and to me             ==
====         High mountains are a feeling, but the hum        ==
=======      Of human cities torture.
===========                             -- Byron [Childe Harold]







^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
  1998-02-22  0:00 Nick Roberts
@ 1998-02-22  0:00 ` Robert Dewar
  1998-02-24  0:00   ` John Roberts-Jones
  0 siblings, 1 reply; 15+ messages in thread
From: Robert Dewar @ 1998-02-22  0:00 UTC (permalink / raw)



Nick Roberts said

  I have run the following program
  
     1  with Ada.Text_IO; use Ada.Text_IO;
     2  procedure Test_1 is
     3     subtype UC is Character range'A'..'Z';
     4  begin
     5     Put_Line("Start of Test_1");
     6     if 'a'='a' or'a'in UC then
     7        Put_Line("True branch executed");
     8     end if;
     9     Put_Line("End of Test_1");
    10  end;
  
  through GNAT 3.10, and it compiles and runs fine.  I haven't tried, but I
  imagine it would probably work on any Ada compiler, in all likelihood.
  But, if you look closely at line 6, you will see the sequence
  
     or'a'in
  
  in the middle of an expression.
  Now, from chapter 2 of the RM, one might get the impression that this could
  be parsed as five lexical elements (three identifiers and two apostrophes).
  Of course, if a compiler were to parse it that way, the result would be a
  syntax error.
  
  For comparison, I have put
  
     range'A'..'Z';
  
  into line 3.  I don't think there is an ambiguity in parsing this, although
  it is very close to the previous example.  Similarly, if the space were to
  be removed from just before the example in line 6, the ambiguity would be
  resolved there also.
  
  One immediate conclusion, I think, is that the introduction of a one-letter
  attribute (in an implementation of the language) could cause difficulties!
  Of course, it's hard to imagine a motive for such an attribute, in practice.
  
  Question: would a compiler be in contravention of the RM by rejecting the
  above program (with a syntax error in line 6)?  I admit that such a
  compiler may be considered to be poorly designed!

You entirely misunderstand the RM. It is not a recipe for constructing
a compiler, it is a set of rules about what programs are valid and what
they mean. The rules make it clear that

  or'a'in ..

is valid, where three tokens are involved

  or  'a'   in

so this is valid. End of story. If the compiler parses this as five tokens
  or  ' a ' in

it will get confused, and will be wrong! End of story.

Furthermore, introducing a one character attribute does not in anyway change
things. An ambiguity at the language level would be a problem in the language
design.

An apparent ambiguity at the lexical level is merely a problem for the
compiler writer. Not a very difficult one. It is in practice easy to tell
wheher a quote is the start of a character literal or an attribute character.

I recommend reading the GNAT sources to get more knowledge in this area. Here
is a quote from the comments of the Scn package:

  --  Here is where we make the test to distinguish the cases. Treat
  --  as apostrophe if previous token is an identifier, right paren
  --  or the reserved word "all" (latter case as in A.all'Address)
  --  Also treat it as apostrophe after a literal (wrong anyway, but
  --  that's probably the better choice).

The RM is completely silent on this issue, since there is no issue from a
language point of view!











^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
  1998-02-23  0:00     ` Keith Thompson
@ 1998-02-23  0:00       ` Robert Dewar
  0 siblings, 0 replies; 15+ messages in thread
From: Robert Dewar @ 1998-02-23  0:00 UTC (permalink / raw)



Keith says

<<The C language definition resolves this kind of thing by mandating that
lexical analysis is "greedy", i.e., that the lexical analyzer always
forms the longest lexical token that it can, regardless of whether it
will cause problems later.  This is sometimes referred to as "maximal
munch".  (This makes the lexer somewhat easier to write -- and, far more
importantly, IMHO, makes the lexical legality rules easier to describe.)
>>

I don't see this (lexical legality rules easier to describe). There is
no problem in the lexical legality rules in Ada. There are rules for
each lexical token that are clear, and rules for how they are written
down in a sequence.

I suppose it might be nice if one can immediately tell how tokens are
broken up with no additional information, but in practice this is a
problem neither for the Ada programmer, nor for the Ada compiler writer,
since there are never any ambiguities.

C needs the greedy rule because indeed in its absence the C language
would be ambiguous. No such ambiguities exist in Ada, so there is no
need for a disambiguating rule.





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
       [not found]     ` <Eou91J.Es9@world.std.com>
@ 1998-02-23  0:00       ` Robert Dewar
  0 siblings, 0 replies; 15+ messages in thread
From: Robert Dewar @ 1998-02-23  0:00 UTC (permalink / raw)



Bob Duff said

<<I was just pointing out that the RM doesn't require a separator between
a reserved word and a character_lit, not between a reserved word and a
tick mark.  This is different from the rule in some other languages, as
kst points out in another message.
>>

I assume not in the second line here should be nor ...

True, but this does not *resolve* the issue that the original post raised,
it *causes* this issue to appear -- a bit of a difference!!!!





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
       [not found] ` <EotBMK.MnK@world.std.com>
  1998-02-22  0:00   ` Robert Dewar
@ 1998-02-23  0:00   ` Mark A Biggar
  1998-02-24  0:00     ` Mats Weber
  1 sibling, 1 reply; 15+ messages in thread
From: Mark A Biggar @ 1998-02-23  0:00 UTC (permalink / raw)

Robert A Duff wrote:
> 
> In article <01bd3d80$101287c0$LocalHost@xhv46.dial.pipex.com>,
> Nick Roberts <Nick.Roberts@dial.pipex.com> wrote:
> >But, if you look closely at line 6, you will see the sequence
> >
> >   or'a'in
> >
> >in the middle of an expression.
> >
> >Now, from chapter 2 of the RM, one might get the impression that this could
> >be parsed as five lexical elements (three identifiers and two apostrophes).
> 
> I think 2.2(7) makes it clear that the above is three lexical elements,
> not five.

Not really, as section 2.2(7) only talks about cases that require white space
between tokens to be legal.  The real point in this example is the "or" is
a keyword not an identifier, so the following "'" must be the start of
a character literal, not the start of an attribute.  If you look at the 
LM grammar, you will find that there are NO cases where an attribute follows
a key word and also NO cases where a character literal follows an identifier,
so you can always determine the meaning of "'" just by remembering the
classification of the previous token.  This whole problem is one of the
reasons why the Ada95 LM makes such a big deal of the fact
that keywords tokens are NOT a subset of identifier tokens, which was a
problem with the Ada83 LM.  Note that a lexical analyser can fully determine
this on its own with out help from the associated parser.  So any compiler
that doesn't tokenize the above as the 3 tokens "or", "'a'" and "in" is
in violation of the LM.

--
Mark Biggar
mark.a.biggar@lmco.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
  1998-02-22  0:00   ` Robert Dewar
@ 1998-02-23  0:00     ` Keith Thompson
  1998-02-23  0:00       ` Robert Dewar
       [not found]     ` <Eou91J.Es9@world.std.com>
  1998-02-26  0:00     ` Dr Steve Sangwine
  2 siblings, 1 reply; 15+ messages in thread
From: Keith Thompson @ 1998-02-23  0:00 UTC (permalink / raw)



The C language definition resolves this kind of thing by mandating that
lexical analysis is "greedy", i.e., that the lexical analyzer always
forms the longest lexical token that it can, regardless of whether it
will cause problems later.  This is sometimes referred to as "maximal
munch".  (This makes the lexer somewhat easier to write -- and, far more
importantly, IMHO, makes the lexical legality rules easier to describe.)

A common C puzzle is to parse the expression x+++++y.  If it could be
tokenized as x ++ + ++ y, it would be legal, but "the maximal munch"
rule requires it to be tokenized as x ++ ++ + y, which is an error when
it's processed by later compilation phases.  To stray even further from
relevancy, this can be a legal expression in C++ if the "++" operator
is overloaded (I think).

Getting back to the topic of this newsgroup, I *think* the only constructs
in the Ada grammar to which this doesn't apply are Type_Name'('z')
and Prefix'A'B, where A is a (hypothetical) one-letter attribute name.
I suspect this is because the Ada grammar was thought out more carefully
than the C grammar when the language was being designed, avoiding most
such problems in the first place.  In other words, although there are no
actual ambiguities in the grammar of either language, Ada's grammar has
fewer constructs that look like ambiguities at first glance.  The if-else
problem is another example of this.

-- 
Keith Thompson (The_Other_Keith) kst@cts.com <*>
^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H
San Diego, California, USA
Trying to keep my daily caffeine intake between the RDA and the LD50.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
  1998-02-19  0:00 Lexical Conundrum Nick Roberts
       [not found] ` <EotBMK.MnK@world.std.com>
@ 1998-02-23  0:00 ` Jean-Pierre Rosen
  1998-02-23  0:00   ` Robert Dewar
  1 sibling, 1 reply; 15+ messages in thread
From: Jean-Pierre Rosen @ 1998-02-23  0:00 UTC (permalink / raw)



Nick Roberts a ï¿½crit dans le message
> [...]
>But, if you look closely at line 6, you will see the sequence
>
>   or'a'in
>
>in the middle of an expression.
>
>Now, from chapter 2 of the RM, one might get the impression that this could
>be parsed as five lexical elements (three identifiers and two apostrophes).
> Of course, if a compiler were to parse it that way, the result would be a
>syntax error.
> [..]
>One immediate conclusion, I think, is that the introduction of a one-letter
>attribute (in an implementation of the language) could cause difficulties!
>Of course, it's hard to imagine a motive for such an attribute, in
>practice.
>
>Question: would a compiler be in contravention of the RM by rejecting the
>above program (with a syntax error in line 6)?  I admit that such a
>compiler may be considered to be poorly designed!
>
This a known (and presumably only) small difficulty in parsing Ada. And yes,
compilers may take advantage that there is no one letter attribute to make
it easier... although it's not really hard to solve. For example, have a
look at the very simplistic parser used in "normalize" (a utility available
from Adalog's Web site at http://perso.wanadoo.fr/adalog).

BTW, you can make things more difficult. Consider:
if Boolean'('a'='b') then ...






^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
  1998-02-23  0:00 ` Jean-Pierre Rosen
@ 1998-02-23  0:00   ` Robert Dewar
  0 siblings, 0 replies; 15+ messages in thread
From: Robert Dewar @ 1998-02-23  0:00 UTC (permalink / raw)



JPR says

<<This a known (and presumably only) small difficulty in parsing Ada. And yes,
compilers may take advantage that there is no one letter attribute to make
it easier... although it's not really hard to solve. For example, have a
look at the very simplistic parser used in "normalize" (a utility available
from Adalog's Web site at http://perso.wanadoo.fr/adalog).
>>

It is a difficulty in lexical analysis I would say, rather than parsing.
If you absorb the lexical rules into the grammar accepted by the parser,
there are no ambiguities, and no need for look ahead.

The only "difficulty" arises in doing separate lexical analysis, and as
I showed from the GNAT code, this difficulty is so trivial to resolve
that I see no reason to even describe it as a difficulty, hence my
quotes.

As for one character attributes, not so fast, when you have wide characters,
the test for whether the thing following a quote is or is not a valid
character literal is not that simple, and indeed that test would probably
be more work than the simple test done in the GNAT lexical analyzer.

There are no one character attributes in GNAT currently, but if we ever
found a case where a one character name was the right choice, we could
add this without any difficulties or special processing being required.





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
  1998-02-23  0:00   ` Mark A Biggar
@ 1998-02-24  0:00     ` Mats Weber
  1998-02-24  0:00       ` Robert Dewar
  0 siblings, 1 reply; 15+ messages in thread
From: Mats Weber @ 1998-02-24  0:00 UTC (permalink / raw)



Mark A Biggar wrote:

> Not really, as section 2.2(7) only talks about cases that require white space
> between tokens to be legal.  The real point in this example is the "or" is
> a keyword not an identifier, so the following "'" must be the start of
> a character literal, not the start of an attribute.  If you look at the
> LM grammar, you will find that there are NO cases where an attribute follows
> a key word and also NO cases where a character literal follows an identifier,

Not true: e.g. X.all'Address is legal.

> so you can always determine the meaning of "'" just by remembering the
> classification of the previous token.

True, but you have to treat all as a special case.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
  1998-02-24  0:00     ` Mats Weber
@ 1998-02-24  0:00       ` Robert Dewar
  1998-03-05  0:00         ` Robert I. Eachus
  0 siblings, 1 reply; 15+ messages in thread
From: Robert Dewar @ 1998-02-24  0:00 UTC (permalink / raw)



Mats said

<<True, but you have to treat all as a special case.
>>

Here once again is the correct check:

            --  Here is where we make the test to distinguish the cases. Treat
            --  as apostrophe if previous token is an identifier, right paren
            --  or the reserved word "all" (latter case as in A.all'Address)
            --  Also treat it as apostrophe after a literal (wrong anyway, but
            --  that's probably the better choice).

            if Prev_Token = Tok_Identifier
               or else Prev_Token = Tok_Right_Paren
               or else Prev_Token = Tok_All
               or else Prev_Token in Token_Class_Literal
            then
               Token := Tok_Apostrophe;
               return;





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
  1998-02-22  0:00 ` Robert Dewar
@ 1998-02-24  0:00   ` John Roberts-Jones
  0 siblings, 0 replies; 15+ messages in thread
From: John Roberts-Jones @ 1998-02-24  0:00 UTC (permalink / raw)




Robert Dewar wrote in message ...
>I recommend reading the GNAT sources to get more knowledge in this area.
>Here is a quote from the comments of the Scn package:
>
>  --  Here is where we make the test to distinguish the cases. Treat
>  --  as apostrophe if previous token is an identifier, right paren
>  --  or the reserved word "all" (latter case as in A.all'Address)
>  --  Also treat it as apostrophe after a literal (wrong anyway, but
>  --  that's probably the better choice).
>
Does this cover the case that the prefix of an attribute is a subprogram
overloading a standard operator such as
      rational."+" ' access ?

It would, on a similar subject to the original post, be interesting to hear
an authoritative view as to any probems associated with disambiguating a
character string such as "+".










^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
  1998-02-22  0:00   ` Robert Dewar
  1998-02-23  0:00     ` Keith Thompson
       [not found]     ` <Eou91J.Es9@world.std.com>
@ 1998-02-26  0:00     ` Dr Steve Sangwine
  2 siblings, 0 replies; 15+ messages in thread
From: Dr Steve Sangwine @ 1998-02-26  0:00 UTC (permalink / raw)



Robert Dewar wrote:

>I am puzzled. 2.2(7) says

>7   One or more separators are allowed between any two adjacent lexical
>elements, before the first of each compilation, or after the last.  At least
>one separator is required between an identifier, a reserved word, or a
>numeric_literal and an adjacent identifier, reserved word, or numeric_
>literal.

Robert Duff's point is that the separators are allowed but not
required. 2.2(7) explains where the separators are needed such as

      or X in

where X is an identifier and

     orXin

would be interpreted as something else. But it doesn't say
(explicitly) that you don't need separators for

     or 'A' in

As an Ada user since 1986 (not involved with compilers) it
had never occurred to me to leave the separators out. Thanks
for an interesting discussion.

-----------------
Dr S. J. Sangwine
Department of Engineering
The University of Reading
Whiteknights, Reading RG6 6AY, UK
Email: S.J.Sangwine@Reading.ac.uk
Tel: 0118 931 8584 (+44 118 931 8584 outside UK)
Fax: 0118 931 3327
Web: http://www.elec.rdg.ac.uk/~stssjs/sjs.html







^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Lexical Conundrum
  1998-02-24  0:00       ` Robert Dewar
@ 1998-03-05  0:00         ` Robert I. Eachus
  0 siblings, 0 replies; 15+ messages in thread
From: Robert I. Eachus @ 1998-03-05  0:00 UTC (permalink / raw)



In article <dewar.888331888@merv> dewar@merv.cs.nyu.edu (Robert Dewar) writes:

   Mats said

   <<True, but you have to treat all as a special case.
   >>


   Here once again is the correct check...

   Hmmm.  How does GNAT treat the attributes 'ACCESS, 'DELTA, and 'RANGE?
(Attributes whose identifiers match reserved words.)  If you fold the
reserved word into the tic to create an attribute name, no problem.
And as long as GNAT never provides implementation defined attributes
of values or ranges, again no problem. Both Ada 83 and Ada 95 avoid
these potentialy nasty cases, but there were some present in earlier
versions of Ada.  (In particular, the 'RANGE attribute could be a
subtype mark.)

   The right solution is probably to avoid defining attributes of
values.  Just as the cure for the problem of single character attribute
names (again, in Ada 80) was to refrain from defining any.
--

					Robert I. Eachus

with Standard_Disclaimer;
use  Standard_Disclaimer;
function Message (Text: in Clever_Ideas) return Better_Ideas is...




^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~1998-03-05  0:00 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-02-19  0:00 Lexical Conundrum Nick Roberts
     [not found] ` <EotBMK.MnK@world.std.com>
1998-02-22  0:00   ` Robert Dewar
1998-02-23  0:00     ` Keith Thompson
1998-02-23  0:00       ` Robert Dewar
     [not found]     ` <Eou91J.Es9@world.std.com>
1998-02-23  0:00       ` Robert Dewar
1998-02-26  0:00     ` Dr Steve Sangwine
1998-02-23  0:00   ` Mark A Biggar
1998-02-24  0:00     ` Mats Weber
1998-02-24  0:00       ` Robert Dewar
1998-03-05  0:00         ` Robert I. Eachus
1998-02-23  0:00 ` Jean-Pierre Rosen
1998-02-23  0:00   ` Robert Dewar
  -- strict thread matches above, loose matches on Subject: below --
1998-02-22  0:00 Nick Roberts
1998-02-22  0:00 ` Robert Dewar
1998-02-24  0:00   ` John Roberts-Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox