From: Stephen Leake
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Tue, 16 Jun 2015 09:46:16 -0500
Message-ID: <85k2v3aeyv.fsf@stephe-leake.org>

Shark8 writes:

> On Friday, June 5, 2015 at 3:03:30 AM UTC-6, Stephen Leake wrote:
>>
>> If you trust that the regexp engine is [1]well written and maintained,
>> [2]the expressive power is adequate for your language, and [3]the speed is
>> adequate for your application, then why waste resources reimplementing
>> the tools? Use them and get on with the interesting work.
>
> While #3 seems to not come up often, #1 and #2 seem to be far more
> "question-able" -- #2 is especially misevaluated a LOT.
> (If it wasn't we wouldn't see people trying to parse HTML or CSV
> with regex.)

Yes.

>> regexp are perfectly adequate for Ada.
>
> Even something like Character'('C')?

Hmm. I've never had a problem with code like that, but it does seem like
the lexer could treat '(' as a character literal, which would produce a
parse error.

Testing ...

The Emacs lexer handles these properly:

    Character'('C')
    Character'( 'C' )
    Character ' ( 'C' )

but not:

    Character '('C')

(Which may be one reason I never write code the latter way :)

The Emacs lexer regular expression for character literal is:

    "[^a-zA-Z0-9)]'[^'\n]'"

which says if the first tick is preceded by an identifier character or a
right paren, it's not a character literal; that explains the above
behavior, and works for typical Ada code in Emacs.

But that regular expression doesn't work in a normal lexer, since it
references text before the beginning of the desired lexeme. Emacs is
_not_ a "normal lexer".

Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL
(handling the special case ''''), the Aflex lexer handles the above
cases as follows:

    Character'('C')
    IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN

    Character'( 'C' )
    IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN

    Character '('C')
    IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN

    Character ' ( 'C' )
    IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN

This is as expected (but not desired): in the first and third cases the
lexer takes '(' as a character literal.

One way to handle this is to provide for feedback from the parser to the
lexer; if a parse fails, push back the character literal, tell the lexer
to treat the first single quote as a TICK, and proceed. I'll work on
implementing that in FastToken with the Aflex lexer; it will be a good
example.

Another way is to treat this particular sequence of tokens as a valid
expression, but rewrite it before handing off to the rest of the parser.
That requires identifying all such special cases; not too hard.
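To make the Aflex behavior and the "rewrite the token sequence" option
concrete, here is a minimal sketch in Python. It is not Aflex or
FastToken code: the names lex, rewrite_qualified and kinds are
hypothetical, the token set is cut down to what the four test cases
need, and Python's ordered alternation stands in for a longest-match
lexer (the two agree on these inputs). Only the CHARACTER_LITERAL
regular expression is taken from the text above.

```python
import re

# Token patterns, tried in order at each position (first match wins).
# CHARACTER_LITERAL is the regular expression quoted in the post; the
# other patterns are simplified assumptions, not the real Ada lexical rules.
TOKEN_SPEC = [
    ("CHARACTER_LITERAL", r"'[^']'|''''"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9_]*"),
    ("TICK", r"'"),
    ("LEFT_PAREN", r"\("),
    ("RIGHT_PAREN", r"\)"),
    ("WHITESPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def lex(text):
    """Return a list of (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WHITESPACE":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def rewrite_qualified(tokens):
    """Rewrite IDENTIFIER '(' IDENTIFIER TICK into
    IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL.

    Covers only the one special case from the post, where the inner
    literal is a single letter; other cases would need their own rules.
    """
    out = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 4]
        if ([k for k, _ in window]
                == ["IDENTIFIER", "CHARACTER_LITERAL", "IDENTIFIER", "TICK"]
                and window[1][1] == "'('"
                and len(window[2][1]) == 1):
            out += [window[0], ("TICK", "'"), ("LEFT_PAREN", "("),
                    ("CHARACTER_LITERAL", "'" + window[2][1] + "'")]
            i += 4
        else:
            out.append(tokens[i])
            i += 1
    return out


def kinds(tokens):
    """Drop the lexemes and keep only the token kinds."""
    return [k for k, _ in tokens]


if __name__ == "__main__":
    for src in ["Character'('C')", "Character'( 'C' )",
                "Character '('C')", "Character ' ( 'C' )"]:
        raw = lex(src)
        print(src)
        print("  lexer:    ", " ".join(kinds(raw)))
        print("  rewritten:", " ".join(kinds(rewrite_qualified(raw))))
```

The rewrite pass leaves the two spaced-out forms untouched and turns the
other two into the same token sequence, so the parser sees one shape for
all four. The cost is the one noted above: every such special case has
to be listed by hand.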
A third choice is to not define a CHARACTER_LITERAL token; then the
sequence of tokens is always

    IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN

and the parser must identify the character literal, or the grammar must
be re-written in the same manner. That may be the simplest solution.

If I recall correctly, this issue has been discussed here before, and
the proposed solutions were similar. I don't know how GNAT handles this.

I think the statement "regular expressions are perfectly adequate for
Ada" stands; this case just shows that the parser must be complicated if
the lexer is not.

This case is a good example of the possible trade-offs between the lexer
and parser complexity; the Emacs lexer handles all typical cases without
feedback from the parser, but is more complex than an Aflex lexer. The
Aflex lexer handles the same cases, but requires feedback from the
parser or other complexity.

--
-- Stephe