From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Wed, 17 Jun 2015 15:44:58 -0500
Organization: Jacob Sparre Andersen Research & Innovation

"Stephen Leake" wrote in message news:85h9q68bf8.fsf@stephe-leake.org...
> "Randy Brukardt" writes:
>
>> "Stephen Leake" wrote in message
>> news:85k2v3aeyv.fsf@stephe-leake.org...
>
>>> One way to handle this is to provide for feedback from the parser to
>>> the lexer; if a parse fails, push back the character literal, tell
>>> the lexer to treat the first single quote as a TICK, and proceed.
>>> I'll work on implementing that in FastToken with the Aflex lexer; it
>>> will be a good example.
>>>
>>> Another way is to treat this particular sequence of tokens as a valid
>>> expression, but rewrite it before handing off to the rest of the
>>> parser. That requires identifying all such special cases; not too
>>> hard.
>>>
>>> A third choice is to not define a CHARACTER_LITERAL token; then the
>>> sequence of tokens is always
>>>
>>>    IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN
>>>
>>> and the parser must identify the character literal, or the grammar
>>> must be re-written in the same manner. That may be the simplest
>>> solution.
>>>
>>> If I recall correctly, this issue has been discussed here before, and
>>> the proposed solutions were similar. I don't know how GNAT handles
>>> this.
>>
>> I don't think you identified the solution that is typically used:
>> remember the previous token identified by the lexer. Then, when
>> encountering an apostrophe, the token is unconditionally an apostrophe
>> if the preceding token is "all", an identifier, a character or string
>> literal, or an rparen; else it might be a character literal.
>
> That's the third choice above; the lexer returns TICK (= apostrophe)
> for all cases, and the parser deals with further classification.
>
> Hmm. Unless you are saying that logic is in the lexer; I don't see that
> it matters much.

Of course it's in the lexer; I don't see how it could be in the parser
(especially if the parser is table-driven). The character following the
Tick in a character literal could be anything, including characters that
the parser doesn't even see (white space, characters like $, %, and #
that aren't legal tokens, etc.). Trying to handle that in the parser
would greatly complicate the grammar and most likely would introduce
ambiguities into it.

I could see it working in a hand-written parser, but that doesn't make
much sense (why use a table-driven lexer and a hand-written parser?
-- the parser is far more work). I also could see it working if one
introduced a filter stage between the lexer and parser -- but then
you've just introduced a lot more complexity.

...

> I have looked briefly at the GNAT lexer. It is highly optimized, and is
> apparently generated from some SNOBOL sources (i.e., _not_ "hand
> written"). For example, it uses nested if-then-else on each character
> of each keyword; not something you want to do by hand.

The one in Janus/Ada started out life as a table-driven lexer that we
were required to use in our original CS 701 compiler. But that couldn't
handle the little glitches of Ada, and it was huge; we needed something
much smaller for our original CP/M compiler (which had to fit in 48K
RAM). So it was replaced by hand-written code in the same style
(essentially replacing the table transitions with case statements).
(Actually, I think it was originally generated by a program from the
transition table, and then hand-pruned because much of it was clearly
redundant.)

It uses a character classification table to reduce the number of
transitions needed. For most classes, there are very few transitions (if
you see "=", there isn't much that can follow it). It's fairly large
(the entire unit is 2400 lines), but a large part of that is error
handling code, code to create token records (which for efficiency is
repeated everywhere rather than put into a subprogram -- this code is
very "hot" in a profile of the compiler, and anything we can save
matters), and code to manage identifiers and reserved words. I suspect I
could do it in much less space if I were starting from scratch (which I
probably will have to do to support UTF-8).

                    Randy.
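[Editorial sketch] The "remember the previous token" rule discussed above can be illustrated in a few lines of Python. This is a hypothetical sketch, not the Janus/Ada or GNAT implementation; the token-kind names are invented for illustration:

```python
# Sketch of the previous-token rule: an apostrophe is unconditionally the
# TICK token when the token just before it is the reserved word "all", an
# identifier, a character or string literal, or a right parenthesis;
# otherwise it may begin a character literal. Token-kind names are made up.

TICK_FORCING = {"ALL", "IDENTIFIER", "CHAR_LITERAL", "STRING_LITERAL", "RPAREN"}

def classify_apostrophe(prev_kind, src, i):
    """Classify the apostrophe at src[i], given the previous token's kind.

    Returns "TICK" or ("CHAR_LITERAL", literal_text).
    """
    if prev_kind in TICK_FORCING:
        return "TICK"                       # e.g. the tick in X'First or S'(...)
    # Otherwise it is a character literal exactly when the next-but-one
    # character is the closing apostrophe: 'x'
    if i + 2 < len(src) and src[i + 2] == "'":
        return ("CHAR_LITERAL", src[i:i + 3])
    return "TICK"
```

In the troublesome `Character'('a')` case, the first apostrophe follows an identifier and so is forced to TICK, while the second follows a left parenthesis and is free to start the character literal `'a'` -- no parser feedback needed.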
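[Editorial sketch] The classification-table-plus-case-statement structure Randy describes can also be sketched. This is a toy illustration with invented names, not Janus/Ada code; it shows how collapsing characters into classes leaves each class with only a handful of hand-written transitions (after "=", for instance, only ">" can extend the token):

```python
# Toy sketch: a character classification table collapses the input
# alphabet into a few classes; each class then needs only a few
# follow-on transitions, written out by hand instead of table lookups.

def char_class(ch):
    """Tiny classification table; a real lexer would use a 256-entry array."""
    if ch.isalpha() or ch == "_":
        return "LETTER"
    if ch.isdigit():
        return "DIGIT"
    if ch in " \t\n":
        return "SPACE"
    if ch in "=<>:.*":
        return "DELIM"
    return "OTHER"

def lex_delimiter(src, i):
    """Return (token_name, next_index) for the delimiter at src[i].

    Mirrors the case-statement-per-class style: for each starting
    character there are very few legal two-character continuations.
    """
    ch = src[i]
    nxt = src[i + 1] if i + 1 < len(src) else ""
    two_char = {("=", ">"): "ARROW", (":", "="): "ASSIGN",
                ("<", "="): "LE", (">", "="): "GE", (".", "."): "DOT_DOT"}
    if (ch, nxt) in two_char:
        return (two_char[(ch, nxt)], i + 2)
    return (ch, i + 1)
```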