From mboxrd@z Thu Jan  1 00:00:00 1970
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!feeder.eternal-september.org!news.glorb.com!peer02.iad.highwinds-media.com!news.highwinds-media.com!feed-me.highwinds-media.com!post02.iad.highwinds-media.com!news.flashnewsgroups.com-b7.4zTQh5tI3A!not-for-mail
From: Stephen Leake <stephen_leake@stephe-leake.org>
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
References: <878uc3r2y6.fsf@adaheads.sparre-andersen.dk>
 	<85twupvjxo.fsf@stephe-leake.org>
 	<81ceb070-16fe-4578-a09a-eb11a2bbb664@googlegroups.com>
 	<162zj7c2l0ykp$.1rxias18vby83.dlg@40tude.net>
 	<856172bk80.fsf@stephe-leake.org>
 	<26ccc147-7a15-48d7-8808-3248edfbf433@googlegroups.com>
Date: Tue, 16 Jun 2015 09:46:16 -0500
Message-ID: <85k2v3aeyv.fsf@stephe-leake.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4 (windows-nt)
Cancel-Lock: sha1:XdNy9u3iloNJ6rURiujnECKmC1w=
MIME-Version: 1.0
Content-Type: text/plain
X-Complaints-To: abuse@flashnewsgroups.com
Organization: FlashNewsgroups.com
X-Trace: 5ad81558036bae97f808425457
X-Received-Bytes: 4417
X-Received-Body-CRC: 1459860047
Xref: news.eternal-september.org comp.lang.ada:26347
Date: 2015-06-16T09:46:16-05:00
List-Id: <comp.lang.ada>

Shark8 <onewingedshark@gmail.com> writes:

> On Friday, June 5, 2015 at 3:03:30 AM UTC-6, Stephen Leake wrote:
>>
>> If you trust that the regexp engine is [1]well written and maintained,
>> [2]the expressive power is adequate for your language, and [3]the speed is
>> adequate for your application, then why waste resources reimplementing
>> the tools? Use them and get on with the interesting work.
>
> While #3 seems to not come up often, #1 and #2 seem to be far more
> "question-able" -- #2 is especially misevaluated a LOT. (If it wasn't
> we wouldn't see people trying to parse HTML or CSV with regex.)

Yes.

>> regexp are perfectly adequate for Ada.
>
> Even something like Character'('C')?

Hmm. I've never had a problem with code like that, but it does seem like
the lexer could treat '(' as a character literal, which would produce a
parse error.

Testing ...

The Emacs lexer handles these properly:

Character'('C')
Character'( 'C' )
Character ' ( 'C' )

but not:

Character '('C')

(Which may be one reason I never write code the latter way :)

The Emacs lexer regular expression for character literal is:

"[^a-zA-Z0-9)]'[^'\n]'"

which says if the first tick is preceded by identifier characters or
right paren, it's not a character literal; that explains the above
behavior, and works for typical Ada code in Emacs.

But that regular expression doesn't work in a normal lexer, since it
references text before the beginning of the desired lexeme. Emacs is
_not_ a "normal lexer".
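
The effect of that leading context character can be reproduced outside
Emacs. The following is a minimal Python sketch, not the Emacs code
itself: it applies the quoted pattern with Python's `re` module to the
four test lines. The function name `emacs_char_literals` is made up for
this illustration.

```python
import re

# The pattern quoted above: one context character that is not an
# identifier character or right paren, then the three-character literal.
EMACS_CHAR_LITERAL = re.compile(r"[^a-zA-Z0-9)]'[^'\n]'")


def emacs_char_literals(text):
    """Return the character literals found, with the context character stripped."""
    return [m.group()[1:] for m in EMACS_CHAR_LITERAL.finditer(text)]


if __name__ == "__main__":
    for src in (
        "Character'('C')",
        "Character'( 'C' )",
        "Character ' ( 'C' )",
        "Character '('C')",
    ):
        print(src, "->", emacs_char_literals(src))
```

The first three lines yield 'C'; the last yields '(' because the tick is
preceded by a space. The pattern consumes one character before the
literal, which is exactly what a conventional lexer rule cannot do.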

Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL
(handling the special case ''''), the Aflex lexer handles the
above cases as follows:

Character'('C')
IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN

Character'( 'C' )
IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN

Character '('C')
IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN

Character ' ( 'C' )
IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN

This is as expected (but not desired).
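
These token sequences can be reproduced with a small longest-match
lexer. The sketch below is Python, not Aflex, and assumes the usual
lex-style rule (longest match wins, first rule wins on ties). The
IDENTIFIER pattern and the function name `lex` are assumptions; only the
CHARACTER_LITERAL pattern comes from the text above.

```python
import re

# Rule order matters only for ties; the longest match wins otherwise.
RULES = [
    ("CHARACTER_LITERAL", re.compile(r"'[^']'|''''")),
    ("IDENTIFIER", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),  # simplified
    ("TICK", re.compile(r"'")),
    ("LEFT_PAREN", re.compile(r"\(")),
    ("RIGHT_PAREN", re.compile(r"\)")),
]


def lex(text):
    """Return the list of token names for text, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(best_name)
        pos = best_end
    return tokens


if __name__ == "__main__":
    for src in ("Character'('C')", "Character'( 'C' )",
                "Character '('C')", "Character ' ( 'C' )"):
        print(src)
        print(" ".join(lex(src)))
```

Because the lexer sees no context before the tick, "Character'('C')" and
"Character '('C')" are indistinguishable to it; both start a three
character match at the first tick.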

One way to handle this is to provide for feedback from the parser to the
lexer; if a parse fails, push back the character literal, tell the lexer
to treat the first single quote as a TICK, and proceed. I'll work on
implementing that in FastToken with the Aflex lexer; it will be a good
example.
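
A toy version of that feedback loop, as a Python sketch. This is not
FastToken code and says nothing about how FastToken will do it; the
names `lex`, `parse_ok`, `lex_with_feedback` and the `force_tick`
parameter are hypothetical, and the "parser" accepts exactly one
sentence form (a qualified expression with a character literal operand).

```python
import re

CHAR_LIT = re.compile(r"'[^']'|''''")
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")  # simplified identifier
SINGLE = {"'": "TICK", "(": "LEFT_PAREN", ")": "RIGHT_PAREN"}

# Toy grammar: the only accepted sentence (an assumption for this sketch).
QUALIFIED = ["IDENTIFIER", "TICK", "LEFT_PAREN",
             "CHARACTER_LITERAL", "RIGHT_PAREN"]


def lex(text, force_tick=frozenset()):
    """Return (kind, start) pairs; a quote at an offset in force_tick is a TICK."""
    tokens, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
            continue
        m = None if pos in force_tick else CHAR_LIT.match(text, pos)
        if m:
            tokens.append(("CHARACTER_LITERAL", pos))
            pos = m.end()
            continue
        m = IDENT.match(text, pos)
        if m:
            tokens.append(("IDENTIFIER", pos))
            pos = m.end()
        elif ch in SINGLE:
            tokens.append((SINGLE[ch], pos))
            pos += 1
        else:
            raise ValueError(f"unexpected character at offset {pos}")
    return tokens


def parse_ok(tokens):
    """Stand-in for a real parser: accept only the QUALIFIED sequence."""
    return [kind for kind, _ in tokens] == QUALIFIED


def lex_with_feedback(text):
    """Re-lex after each parse failure, demoting the first character literal."""
    force_tick = set()
    while True:
        tokens = lex(text, frozenset(force_tick))
        if parse_ok(tokens):
            return [kind for kind, _ in tokens]
        starts = [s for kind, s in tokens if kind == "CHARACTER_LITERAL"]
        if not starts:
            raise SyntaxError("no character literal left to reinterpret")
        force_tick.add(starts[0])  # push back: this quote becomes a TICK


if __name__ == "__main__":
    print(lex_with_feedback("Character '('C')"))
```

A real parser would push back only the token at the point of failure and
resume from there, rather than re-lexing the whole input as this sketch
does.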

Another way is to treat this particular sequence of tokens as a valid
expression, but rewrite it before handing off to the rest of the parser.
That requires identifying all such special cases; not too hard.

A third choice is to not define a CHARACTER_LITERAL token; then the
sequence of tokens is always

IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN

and the parser must identify the character literal, or the grammar must
be re-written in the same manner. That may be the simplest solution.
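
A sketch of this third choice in Python, under loud assumptions: the
lexer has no CHARACTER_LITERAL rule, and a post-pass (standing in for
the parser) folds TICK x TICK into a character literal unless the first
tick follows an identifier or right paren. That rule mirrors the Emacs
regexp above; it is not taken from any existing parser, and the names
`lex` and `fold_character_literals` are hypothetical.

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")  # simplified identifier
SINGLE = {"'": "TICK", "(": "LEFT_PAREN", ")": "RIGHT_PAREN"}


def lex(text):
    """Return (kind, text) pairs; there is no CHARACTER_LITERAL token."""
    tokens, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
            continue
        m = IDENT.match(text, pos)
        if m:
            tokens.append(("IDENTIFIER", m.group()))
            pos = m.end()
        elif ch in SINGLE:
            tokens.append((SINGLE[ch], ch))
            pos += 1
        else:
            raise ValueError(f"unexpected character at offset {pos}")
    return tokens


def fold_character_literals(tokens):
    """Combine TICK x TICK into CHARACTER_LITERAL where the context allows it."""
    out = []
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        prev = out[-1][0] if out else None
        is_literal = (
            kind == "TICK"
            and prev not in ("IDENTIFIER", "RIGHT_PAREN")
            and i + 2 < len(tokens)
            and len(tokens[i + 1][1]) == 1      # exactly one character inside
            and tokens[i + 2][0] == "TICK"
        )
        if is_literal:
            out.append(("CHARACTER_LITERAL", "'" + tokens[i + 1][1] + "'"))
            i += 3
        else:
            out.append((kind, text))
            i += 1
    return out


if __name__ == "__main__":
    raw = lex("Character '('C')")
    print(" ".join(kind for kind, _ in raw))
    print(" ".join(kind for kind, _ in fold_character_literals(raw)))
```

The sketch ignores adjacency, so it cannot see ' ' (the lexer drops the
space) and it does not handle ''''; a real implementation would need
token positions to cover those cases.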

If I recall correctly, this issue has been discussed here before, and
the proposed solutions were similar. I don't know how GNAT handles this.

I think the statement "regular expressions are perfectly adequate for
Ada" stands; this case just shows that the parser must be complicated if
the lexer is not.

This case is a good example of the possible trade-offs between lexer
and parser complexity; the Emacs lexer handles all typical cases without
feedback from the parser, but is more complex than an Aflex lexer. The
Aflex lexer can handle the same cases, but only with feedback from the
parser or some other added complexity.

--
-- Stephe