From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Wed, 17 Jun 2015 15:44:58 -0500
Organization: Jacob Sparre Andersen Research & Innovation

"Stephen Leake" wrote in message news:85h9q68bf8.fsf@stephe-leake.org...
> "Randy Brukardt" writes:
>
>> "Stephen Leake" wrote in message
>> news:85k2v3aeyv.fsf@stephe-leake.org...
>
>>> One way to handle this is to provide for feedback from the parser to
>>> the lexer; if a parse fails, push back the character literal, tell
>>> the lexer to treat the first single quote as a TICK, and proceed.
>>> I'll work on implementing that in FastToken with the Aflex lexer; it
>>> will be a good example.
>>>
>>> Another way is to treat this particular sequence of tokens as a valid
>>> expression, but rewrite it before handing off to the rest of the
>>> parser. That requires identifying all such special cases; not too
>>> hard.
>>>
>>> A third choice is to not define a CHARACTER_LITERAL token; then the
>>> sequence of tokens is always
>>>
>>>    IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN
>>>
>>> and the parser must identify the character literal, or the grammar
>>> must be re-written in the same manner. That may be the simplest
>>> solution.
>>>
>>> If I recall correctly, this issue has been discussed here before, and
>>> the proposed solutions were similar. I don't know how GNAT handles
>>> this.
>>
>> I don't think you identified the solution that is typically used:
>> remember the previous token identified by the lexer. Then, when
>> encountering an apostrophe, the token is unconditionally an apostrophe
>> if the preceding token is "all", an identifier, a character or string
>> literal, or an rparen; else it might be a character literal.
>
> That's the third choice above; the lexer returns TICK (= apostrophe)
> for all cases, and the parser deals with further classification.
>
> Hmm. Unless you are saying that logic is in the lexer; I don't see that
> it matters much.

Of course it's in the lexer; I don't see how it could be in the parser
(especially if the parser is table-driven). The character following the
Tick in a character literal could be anything, including characters that
the parser doesn't even see (white space, characters like $, %, and #
that aren't legal tokens, etc.). Trying to handle that in the parser
would greatly complicate the grammar and most likely would introduce
ambiguities into it.

I could see it working in a hand-written parser, but that doesn't make
much sense (why use a table-driven lexer and a hand-written parser?
-- the parser is far more work). I also could see it working if one
introduced a filter stage between the lexer and parser -- but then
you've just introduced a lot more complexity.

...

> I have looked briefly at the GNAT lexer. It is highly optimized, and is
> apparently generated from some SNOBOL sources (i.e., _not_ "hand
> written"). For example, it uses nested if-then-else on each character
> of each keyword; not something you want to do by hand.

The one in Janus/Ada started out life as a table-driven lexer that we
were required to use in our original CS 701 compiler. But that couldn't
handle the little glitches of Ada, and it was huge; we needed something
much smaller for our original CP/M compiler (which had to fit in 48K
RAM). So it was replaced by hand-written code in the same style
(essentially replacing the table transitions with case statements).
(Actually, I think it was originally generated by a program from the
transition table, and then hand-pruned because much of it was clearly
redundant.)

It uses a character classification table to reduce the number of
transitions needed. For most classes, there are very few transitions (if
you see "=", there isn't much that can follow it). It's fairly large
(the entire unit is 2400 lines), but a large part of that is error
handling code, code to create token records (which for efficiency is
repeated everywhere rather than put into a subprogram -- this code is
very "hot" in a profile of the compiler, and anything we can save
matters), and code to manage identifiers and reserved words. I suspect I
could do it in much less space if I were starting from scratch (which I
probably will have to do to support UTF-8).

                    Randy.
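[Editorial sketch] The "remember the previous token" rule discussed above can be illustrated in a few lines of Python. This is a hypothetical sketch, not the Janus/Ada or GNAT implementation; the token-kind names are invented for illustration:

```python
# Sketch of the previous-token rule: an apostrophe is unconditionally the
# TICK token when the token just before it is the reserved word "all", an
# identifier, a character or string literal, or a right parenthesis;
# otherwise it may begin a character literal. Token-kind names are made up.

TICK_FORCING = {"ALL", "IDENTIFIER", "CHAR_LITERAL", "STRING_LITERAL", "RPAREN"}

def classify_apostrophe(prev_kind, src, i):
    """Classify the apostrophe at src[i], given the previous token's kind.

    Returns "TICK" or ("CHAR_LITERAL", literal_text).
    """
    if prev_kind in TICK_FORCING:
        return "TICK"                       # e.g. the tick in X'First or S'(...)
    # Otherwise it is a character literal exactly when the next-but-one
    # character is the closing apostrophe: 'x'
    if i + 2 < len(src) and src[i + 2] == "'":
        return ("CHAR_LITERAL", src[i:i + 3])
    return "TICK"
```

In the troublesome `Character'('a')` case, the first apostrophe follows an identifier and so is forced to TICK, while the second follows a left parenthesis and is free to start the character literal `'a'` -- no parser feedback needed.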
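[Editorial sketch] The classification-table-plus-case-statement structure Randy describes can also be sketched. This is a toy illustration with invented names, not Janus/Ada code; it shows how collapsing characters into classes leaves each class with only a handful of hand-written transitions (after "=", for instance, only ">" can extend the token):

```python
# Toy sketch: a character classification table collapses the input
# alphabet into a few classes; each class then needs only a few
# follow-on transitions, written out by hand instead of table lookups.

def char_class(ch):
    """Tiny classification table; a real lexer would use a 256-entry array."""
    if ch.isalpha() or ch == "_":
        return "LETTER"
    if ch.isdigit():
        return "DIGIT"
    if ch in " \t\n":
        return "SPACE"
    if ch in "=<>:.*":
        return "DELIM"
    return "OTHER"

def lex_delimiter(src, i):
    """Return (token_name, next_index) for the delimiter at src[i].

    Mirrors the case-statement-per-class style: for each starting
    character there are very few legal two-character continuations.
    """
    ch = src[i]
    nxt = src[i + 1] if i + 1 < len(src) else ""
    two_char = {("=", ">"): "ARROW", (":", "="): "ASSIGN",
                ("<", "="): "LE", (">", "="): "GE", (".", "."): "DOT_DOT"}
    if (ch, nxt) in two_char:
        return (two_char[(ch, nxt)], i + 2)
    return (ch, i + 1)
```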