From: Stephen Leake
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Wed, 17 Jun 2015 12:58:03 -0500

"Randy Brukardt" writes:

> "Stephen Leake" wrote in message
> news:85k2v3aeyv.fsf@stephe-leake.org...
>> One way to handle this is to provide for feedback from the parser to
>> the lexer; if a parse fails, push back the character literal, tell the
>> lexer to treat the first single quote as a TICK, and proceed. I'll
>> work on implementing that in FastToken with the Aflex lexer; it will
>> be a good example.
>>
>> Another way is to treat this particular sequence of tokens as a valid
>> expression, but rewrite it before handing off to the rest of the
>> parser. That requires identifying all such special cases; not too
>> hard.
>>
>> A third choice is to not define a CHARACTER_LITERAL token; then the
>> sequence of tokens is always
>>
>> IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN
>>
>> and the parser must identify the character literal, or the grammar
>> must be rewritten in the same manner. That may be the simplest
>> solution.
>>
>> If I recall correctly, this issue has been discussed here before, and
>> the proposed solutions were similar. I don't know how GNAT handles
>> this.
>
> I don't think you identified the solution that is typically used:
> remember the previous token identified by the lexer. Then, when
> encountering an apostrophe, the token is unconditionally an apostrophe
> if the preceding token is "all", an identifier, a character or string
> literal, or an rparen; else it might be a character literal.

That's the third choice above; the lexer returns TICK (= apostrophe) for
all cases, and the parser deals with further classification.

Hmm. Unless you are saying that logic is in the lexer; I don't see that
it matters much. Aflex does have a provision for adding some logic in a
lexer, although I'm not sure it supports "remember the previous token".

> No "feedback from the parser" needed (that seems like a nightmare to
> me). The method was originally proposed by Tischler in Ada Letters in
> July 1983, pg 36. (I got this out of the comments of Janus/Ada, of
> course.)
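For concreteness, here is roughly what I take "remember the previous
token" to mean. This is an untested sketch; all the names in it are
mine, not from Aflex, FastToken, or Janus/Ada.

procedure Tick_Demo is

   type Token_ID is
     (Identifier, String_Lit, Char_Lit, Tick, Left_Paren, Right_Paren,
      All_Keyword, Other);

   --  True if, after Prev, an apostrophe can only be the attribute /
   --  qualified-expression tick, never the start of a character
   --  literal (the rule quoted above).
   function Apostrophe_Is_Tick (Prev : Token_ID) return Boolean is
     (Prev in
        All_Keyword | Identifier | Char_Lit | String_Lit | Right_Paren);

   --  Classify the apostrophe at Input (First), given the token lexed
   --  just before it.
   function Classify
     (Input : String;
      First : Positive;
      Prev  : Token_ID)
     return Token_ID
   is
   begin
      if Apostrophe_Is_Tick (Prev) then
         return Tick;

      elsif First + 2 <= Input'Last and then Input (First + 2) = ''' then
         --  'x' : the lexer consumes all three characters.
         return Char_Lit;

      else
         return Tick;
      end if;
   end Classify;

begin
   --  Character'('a') : the first apostrophe follows an identifier, so
   --  it is a tick; the second follows a left paren, and the lookahead
   --  finds the closing apostrophe two characters on.
   pragma Assert (Classify ("Character'('a')", 10, Identifier) = Tick);
   pragma Assert (Classify ("Character'('a')", 12, Left_Paren) = Char_Lit);
end Tick_Demo;

If Aflex can be persuaded to carry that one Token_ID of state between
calls, this seems simple enough; if not, that is one more argument for
leaving the classification to the parser.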
> I tend to agree with Dmitry; for lexing Ada, regular expressions are
> just not going to work; you'll need too many fixups to make them worth
> the trouble. Just write the thing in Ada; it won't take you any longer
> than figuring out the correct regular expression for an identifier.
> And that makes it easy to handle the weird special cases of Ada.
>
> Other approaches are going to lex some programs incorrectly; how
> important that is will vary depending on what kind of tool you are
> writing, but since the effort is similar, it's hard to see the
> advantage of a regular expression or other "automatic" lexer. (It
> makes much more sense for a parser, where the effort can be orders of
> magnitude different.)

Ok. I guess I'd like to see some actual examples of hand-written
lexers. The one in OpenToken is not inspiring to me; that's why I got
rid of it for FastToken (it's definitely easier for me to write regexps
than to write another OpenToken recognizer (= lexer module)).

I have looked briefly at the GNAT lexer. It is highly optimized, and is
apparently generated from some SNOBOL sources (i.e., _not_ "hand
written"). For example, it uses nested if-then-else on each character of
each keyword; not something you want to do by hand.

--
-- Stephe
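To make "nested if-then-else on each character" concrete, this is the
shape I have in mind; a toy fragment of my own for a few of the keywords
starting with 'a', not GNAT's actual code, which I have only skimmed.

procedure Keyword_Demo is

   type Keyword_ID is (Kw_Abort, Kw_Abs, Kw_All, Kw_And, Not_A_Keyword);

   --  Assumes Word is already lower case, Word'First = 1, and the
   --  caller has seen that Word (1) = 'a'.
   function Match_A_Keyword (Word : String) return Keyword_ID is
   begin
      if Word'Length < 2 then
         return Not_A_Keyword;
      elsif Word (2) = 'b' then
         if Word'Length = 3 and then Word (3) = 's' then
            return Kw_Abs;
         elsif Word'Length = 5 and then Word (3 .. 5) = "ort" then
            return Kw_Abort;
         end if;
      elsif Word (2) = 'l' then
         if Word'Length = 3 and then Word (3) = 'l' then
            return Kw_All;
         end if;
      elsif Word (2) = 'n' then
         if Word'Length = 3 and then Word (3) = 'd' then
            return Kw_And;
         end if;
      end if;
      return Not_A_Keyword;
   end Match_A_Keyword;

begin
   pragma Assert (Match_A_Keyword ("abort") = Kw_Abort);
   pragma Assert (Match_A_Keyword ("also") = Not_A_Keyword);
end Keyword_Demo;

Multiply that by the full set of reserved words and it is easy to
believe it was machine-generated; scanning the whole identifier and then
doing a table lookup is what I would actually want to write by hand.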