From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!feeder.eternal-september.org!gandalf.srv.welterde.de!news.jacob-sparre.dk!loke.jacob-sparre.dk!pnx.dk!.POSTED!not-for-mail
From: "Randy Brukardt" <randy@rrsoftware.com>
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Tue, 16 Jun 2015 16:34:21 -0500
Organization: Jacob Sparre Andersen Research & Innovation
Message-ID: <mlq4ou$btc$1@loke.gir.dk>
References: <878uc3r2y6.fsf@adaheads.sparre-andersen.dk>
 	<85twupvjxo.fsf@stephe-leake.org>
 	<81ceb070-16fe-4578-a09a-eb11a2bbb664@googlegroups.com>
 	<162zj7c2l0ykp$.1rxias18vby83.dlg@40tude.net>
 	<856172bk80.fsf@stephe-leake.org>
 	<26ccc147-7a15-48d7-8808-3248edfbf433@googlegroups.com>
 <85k2v3aeyv.fsf@stephe-leake.org>
NNTP-Posting-Host: rrsoftware.com
X-Trace: loke.gir.dk 1434490462 12204 24.196.82.226 (16 Jun 2015 21:34:22 GMT)
X-Complaints-To: news@jacob-sparre.dk
NNTP-Posting-Date: Tue, 16 Jun 2015 21:34:22 +0000 (UTC)
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-RFC2646: Format=Flowed; Original
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Xref: news.eternal-september.org comp.lang.ada:26351
Date: 2015-06-16T16:34:21-05:00
List-Id: <comp.lang.ada>

"Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message 
news:85k2v3aeyv.fsf@stephe-leake.org...
> Shark8 <onewingedshark@gmail.com> writes:
>
>> On Friday, June 5, 2015 at 3:03:30 AM UTC-6, Stephen Leake wrote:
>>>
>>> If you trust that the regexp engine is [1]well written and maintained,
>>> [2]the expressive power is adequate for your language, and [3]the speed 
>>> is adequate for your application, then why waste resources reimplementing
>>> the tools? Use them and get on with the interesting work.
>>
>> While #3 seems to not come up often, #1 and #2 seem to be far more
>> "question-able" -- #2 is especially misevaluated a LOT. (If it wasn't
>> we wouldn't see people trying to parse HTML or CSV with regex.)
>
> Yes.
>
>>> regexp are perfectly adequate for Ada.
>>
>> Even something like Character'('C')?
>
> Hmm. I've never had a problem with code like that, but it does seem like
> the lexer could treat '(' as a character literal, which would produce a
> parse error.
>
> Testing ...
>
> The Emacs lexer handles these properly:
>
> Character'('C')
> Character'( 'C' )
> Character ' ( 'C' )
>
> but not:
>
> Character '('C')
>
> (Which may be one reason I never write code the latter way :)
>
> The Emacs lexer regular expression for character literal is:
>
> "[^a-zA-Z0-9)]'[^'\n]'"
>
> which says if the first tick is preceded by identifier characters or
> right paren, it's not a character literal; that explains the above
> behavior, and works for typical Ada code in Emacs.
>
> But that regular expression doesn't work in a normal lexer, since it
> references text before the beginning of the desired lexeme. Emacs is
> _not_ a "normal lexer".
>
> Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL
> (handling the special case ''''), the Aflex lexer handles the
> above cases as follows:
>
> Character'('C')
> IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN
>
> Character'( 'C' )
> IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN
>
> Character '('C')
> IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN
>
> Character ' ( 'C' )
> IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN
>
> This is as expected (but not desired).
>
> One way to handle this is to provide for feedback from the parser to the
> lexer; if a parse fails, push back the character literal, tell the lexer
> to treat the first single quote as a TICK, and proceed. I'll work on
> implementing that in FastToken with the Aflex lexer; it will be a good
> example.
>
> Another way is to treat this particular sequence of tokens as a valid
> expression, but rewrite it before handing off to the rest of the parser.
> That requires identifying all such special cases; not too hard.
>
> A third choice is to not define a CHARACTER_LITERAL token; then the
> sequence of tokens is always
>
> IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN
>
> and the parser must identify the character literal, or the grammar must
> be re-written in the same manner. That may be the simplest solution.
>
> If I recall correctly, this issue has been discussed here before, and
> the proposed solutions were similar. I don't know how GNAT handles this.

I don't think you identified the solution that is typically used: remember 
the previous token identified by the lexer. Then, when encountering an 
apostrophe, the token is unconditionally an apostrophe if the preceding 
token is "all", an identifier, a character or string literal, or a right 
paren; otherwise it might be a character literal. No "feedback from the 
parser" is needed (that seems like a nightmare to me). The method was 
originally proposed by 
Tischler in Ada Letters in July 1983, pg 36. (I got this out of the comments 
of Janus/Ada, of course.)

I tend to agree with Dmitry; for lexing Ada, regular expressions are just 
not going to work; you'll need too many fixups to make them worth the 
trouble. Just write the thing in Ada; it won't take you any longer than 
figuring out the correct regular expression for an identifier. And that 
makes it easy to handle the weird special cases of Ada.

Other approaches are going to lex some programs incorrectly; how important 
that is will vary depending on what kind of tool you are writing, but since 
the effort is similar, it's hard to see the advantage of a regular 
expression or other "automatic" lexer. (It makes much more sense for a 
parser, where the effort can be orders of magnitude different.)

                                        Randy.