From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Thread: 103376,1f96acbbf1e7e66a
X-Google-Attributes: gid103376,public
X-Google-Language: ENGLISH,ASCII-7-bit
Path: 
 g2news2.google.com!news3.google.com!border1.nntp.dca.giganews.com!nntp.giganews.com!newscon06.news.prodigy.com!prodigy.net!newsfeed-00.mathworks.com!nntp.TheWorld.com!not-for-mail
From: Robert A Duff <bobduff@shell01.TheWorld.com>
Newsgroups: comp.lang.ada
Subject: Re: lexical ambiguity
Date: 08 Jun 2006 17:30:53 -0400
Organization: The World Public Access UNIX, Brookline, MA
Message-ID: <wccmzcntjw2.fsf@shell01.TheWorld.com>
References: <1nozvv83n7lhc.1b3qf0olmyllp$.dlg@40tude.net>
 <n-6dnQKIUdPzIB3Z4p2dnA@rcn.net> <w_8gg.760526$084.110855@attbi_s22>
 <z6ydnZgK5-jrhB7ZnZ2dneKdnZydnZ2d@rcn.net>
 <nULgg.1005626$xm3.320354@attbi_s21>
 <9M_gg.1598$O5.554@llslave.llan.ll.mit.edu> <lnirnfnyao.fsf@nuthaus.mib.org>
 <t_1hg.764258$084.649755@attbi_s22> <1149590366.8521.5.camel@localhost>
 <tz1wu2tsyp.fsf@hod.lan.m-e-leypold.de>
 <wccd5dl6mx0.fsf@shell01.TheWorld.com>
 <euirncvq7y.fsf@hod.lan.m-e-leypold.de>
NNTP-Posting-Host: shell01.theworld.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: pcls4.std.com 1149802255 5772 192.74.137.71 (8 Jun 2006 21:30:55 GMT)
X-Complaints-To: abuse@TheWorld.com
NNTP-Posting-Date: Thu, 8 Jun 2006 21:30:55 +0000 (UTC)
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2
Xref: g2news2.google.com comp.lang.ada:4715
Date: 2006-06-08T17:30:53-04:00
List-Id: <comp.lang.ada>

M E Leypold <development-2006-8ecbb5cc8a-REMOVETHIS@m-e-leypold.de> writes:

> Robert A Duff <bobduff@shell01.TheWorld.com> writes:
> 
> > M E Leypold <development-2006-8ecbb5cc8a-REMOVETHIS@m-e-leypold.de> writes:
> > 
> > > So now (question to all): Is the following rule enough?
> > > 
> > >    - "'" is the beginning of a character literal if the token before
> > >      "'" has not been an identifier (reserved words not counted as
> > >      identifier in this case).
> > 
> > Not quite:
> > 
> >     function F(X: Integer) return String;
> > 
> >     Length: constant Natural := F(123)'Length;
> 
> Ouch. 

It's not a BIG ouch.  To determine whether a single quote begins a
character literal versus a tick, it is sufficient to look back one
token.  Some tokens can be followed by a tick, some by a char_lit,
and some by neither.  None can be followed by both.  It's fairly
straightforward to study the grammar and determine which are which.
Or look at the GNAT sources.

It might be wise to include a sentinel token at the start of the token
stream (Begin_File_Token or whatever), just in case ' comes first
(that would be illegal, but you don't want to crash on it).

It can all be done in the lexer, with no feedback from the parser -- the
lexer just needs to keep track of the previous token, and check it when
it sees a single quote.  Lookahead will get you in trouble; look-back
is the better answer here.
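
A minimal sketch of the look-back idea, written in Python rather than Ada
for brevity. Everything in it is an assumption made for illustration: the
names (tokenize, TICK_AFTER, RESERVED), the tiny reserved-word list, and
above all the contents of TICK_AFTER, which lists only the cases mentioned
in this thread (identifier, right parenthesis, "all"). The complete set has
to be worked out from the grammar.

```python
import re

# Tiny illustrative subset of Ada's reserved words (assumption: not the full list).
RESERVED = {"if", "then", "all", "access", "constant", "function", "return"}

# Look-back keys after which a single quote is read as a tick.
# Assumption: only the cases discussed in this thread; the complete set
# must be derived from the grammar.
TICK_AFTER = {"identifier", "right_paren", "all"}

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
NUMBER = re.compile(r"[0-9][0-9_]*")
STRING = re.compile(r'"(?:[^"]|"")*"')
DELIMS = "()=,;:.+-*/<>&|"


def tokenize(src):
    """Return a list of (kind, text) pairs for a fragment of Ada-like source."""
    tokens = []
    prev = "begin_file"  # sentinel, so a leading quote has something to look back at
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
            continue
        ident = IDENT.match(src, i)
        number = NUMBER.match(src, i)
        string = STRING.match(src, i)
        if ch == "'":
            # The only decision that needs state: check the previous token.
            if prev in TICK_AFTER:
                kind, text = "tick", "'"
            elif i + 2 < len(src) and src[i + 2] == "'":
                kind, text = "char_literal", src[i:i + 3]
            else:
                raise SyntaxError("stray single quote at offset %d" % i)
        elif ident:
            text = ident.group()
            kind = "reserved" if text.lower() in RESERVED else "identifier"
        elif number:
            kind, text = "number", number.group()
        elif string:
            kind, text = "string_literal", string.group()
        elif src.startswith(":=", i):
            kind, text = "delim", ":="
        elif ch in DELIMS:
            kind, text = ("right_paren" if ch == ")" else "delim"), ch
        else:
            raise SyntaxError("unexpected character %r at offset %d" % (ch, i))
        tokens.append((kind, text))
        # Reserved words are looked up individually; everything else by kind.
        prev = text.lower() if kind == "reserved" else kind
        i += len(text)
    return tokens
```

With this sketch, tokenize("F(123)'Length") reports the quote as a tick
because the previous token is a right parenthesis, while in
tokenize("if'('=x") the quote follows the reserved word "if" and so starts
the character literal '('. The parser is never consulted.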

> OK. First a message to Dmitry A. Kazakov and Georg Bauhaus: Sorry, I
> understood neither all of what you said nor the exact
> implications. But thanks!

I didn't entirely understand that, either.

> Then: The original poster asked a question about 'lexical
> ambiguity'. The ensuing discussion leaves me more and more doubtful:
> Can lexical analysis (grouping characters into tokens) and grammatical
> analysis (building a parse tree from a token sequence) be separated
> cleanly in Ada?

Yes.  The look-back is localized to the lexer (which is not "clean", but
at least it's localized (separated from the parser)).

> My first approach would have been (no I'm not implementing an Ada
> parser, but since compiler construction has been a favorite subject of
> mine for a number of years, I'm a bit curious about the position of Ada
> in all this) -- now: My first approach would have been to write a
> lexer with a minimal amount of state. It would shift into
> collect-string state when encountering a '"' (I mean a double quote
> :-) and, in particular, into maybe-now-comes-a-character-literal state
> at certain points. My first take was that the "certain points" are
> always after identifiers. In view of the case quoted above
> (F(123)'Length) I could amend this rule by adding ')' to the certain
> points.

Right.  But you have to study the grammar to know which tokens have this
property.  It's not that big of a deal.

> But now things become rather ad-hoc. Well -- as I said, it's just
> curiosity driving me, so I'm not going to examine the RM now, nor am I
> going to reverse engineer GNAT to find out how it is done in reality.
> 
> But if anyone in c.l.a. has the answer to the following questions, I'd
> be eternally grateful. Well, grateful, anyway. :-)
> 
>   - Is it possible (for Ada parsers) to separate lexical analysis and
>     grammatical analysis into separate phases without tricky feedback
>     from parser to lexer, possibly by using a lexer with a finite
>     number of states?

Yes.  Just a tiny bit of state -- the previous token.  The lexer writer
needs to understand the grammar, but the lexer does not need to
understand the parser.

>   - What is the complete rule for deciding when the next token might
>     be a character literal? Or is that undecidable by just looking at
>     past input (i.e. using lexer state)?

It is decidable by looking at the previous token.  I forget the exact
rule, but it can be deduced easily from the grammar.

> BTW: The "evil" case 
> 
>     if'('="-"("="('='=',',','=','))
> 
> is not parsed correctly by syntax highlighting in emacs ada-mode (I
> wouldn't have expected it to be, actually). The rule there seems to be
> my incomplete rule without the reserved words exception. Everything
> falls magically into place if a " " is inserted immediately after "if".

I'm not surprised.  Emacs ada-mode uses some ad-hoc technique that
doesn't always work properly.  Anyway, Emacs is trying to parse bits and
pieces of things without seeing the whole file, and that's a whole
'nother thing.  It is certainly easy to parse the above "evil" thing
properly, but not necessarily if you start in the middle of it.

> >     Y: access T'Class := ...;
> >     Z: access T2'Class := Y.all'Access;
> > 
> > For reserved words, I think you have to study the grammar, and determine
> > which ones can precede a tick mark.
> 
> OK. That I understand now. 
> 
> Regards -- Markus

- Bob