From: Stephen Leake
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Fri, 05 Jun 2015 04:03:27 -0500
Message-ID: <856172bk80.fsf@stephe-leake.org>

"Dmitry A. Kazakov" writes:

> On Tue, 2 Jun 2015 18:43:50 -0700 (PDT), Shark8 wrote:
>
>> On Tuesday, June 2, 2015 at 4:12:37 PM UTC-6, Stephen Leake wrote:
>>>
>>> Obvious to me, but I've been messing with the lexer code in FastToken
>>> recently; I switched to using regular expressions for the token
>>> recognizers (to be closer to Aflex, which is also now supported).
>
>> From my experience [mainly maintenance] RegEx is almost always a bad
>> solution (though I will grant that most of my encounters w/ it involved
>> applying it as a formatting/parsing tool for items that generally weren't
>> amiable to such breakdowns [street addresses, for example, are good at
>> containing info/formatting that kills a simple regex]).
>
> Yes. Maintenance is one problem, another is that the family of languages
> recognized by regular expressions is far too weak. More powerful languages,
> e.g. SNOBOL patterns are slower.

The simple lexer in FastToken is intended only to support the FastToken
unit tests, which focus on testing the parser generator and executor.
Simplicity is the prime driver here; no need for expressive power or
speed.

Aflex compiles all the regular expressions for all of the tokens into
one state machine that visits each character in the input stream once.
You can't get faster than that.

> In the end it is always worth of efforts writing a manual token scanner by
> hand.

"Always" is way too strong a statement here.

If you trust that the regexp engine is well written and maintained, the
expressive power is adequate for your language, and the speed is
adequate for your application, then why waste resources reimplementing
the tools? Use them and get on with the interesting work.

Regexps are perfectly adequate for Ada.

> Firstly, there are not so many things you would have to recognize that
> way.

I guess you are saying that implementing a lexer for a restricted set of
tokens is easier than implementing a general regular expression engine.
True, but that's not the choice at hand; the choice is between
implementing a new lexer for a restricted set of tokens, or reusing an
existing regular expression engine (supported and maintained externally)
and specifying a small set of regular expressions (most of which are
simple strings for the reserved words).

> Secondly it is much more efficient than pattern matching.
Not if you use the Aflex approach; the hand-written OpenToken lexer is
far less efficient than the compiled state machine that Aflex produces.

> Thirdly it would allow sane error messaging, because usually it is
> more outcomes than matched vs. not matched, e.g. malformed identifier
> or missing quotation mark.

This is a valid but minor point. For Ada strings, since new line is
excluded, a missing quotation mark does not produce a very confusing
error message (which is precisely why new line is excluded).

Other languages are worse for string errors, but the parser stage can
provide a better error message if desired.

Syntax highlighting in the typical IDE is the best way to address this
particular problem; when the entire rest of the file changes color, you
know you are missing a quote.

It's not an issue for the FastToken unit tests, and the syntax error
messages from OpenToken in other contexts have not bothered me yet
(certainly not as much as some Microsoft error messages: "cannot load
the specified module" indeed; tell me _which_ module!). Nor have I
gotten any complaints in that area from ada-mode users.

-- 
-- Stephe
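
To illustrate two of the points above (specifying tokens as a small set
of regular expressions, and a missing quotation mark staying confined to
one line because new line is excluded from Ada string literals), here is
a minimal sketch in Python. It is not taken from FastToken, OpenToken, or
Aflex; the token names, the tiny reserved-word list, and the tokenize
function are all hypothetical. Note also that Python's re module is a
backtracking matcher, not a compiled state machine of the kind Aflex
generates, so this shows only the specification style, not the speed.

```python
import re

# Hypothetical token table for a tiny Ada-like subset (illustrative only).
# Order matters: Python's alternation takes the first alternative that matches.
TOKEN_SPEC = [
    ("STRING",     r'"(?:[^"\n]|"")*"'),  # string literal; "" is an embedded quote
    ("BAD_STRING", r'"(?:[^"\n]|"")*'),   # opening quote never closed on this line
    ("RESERVED",   r'\b(?:begin|end|is|procedure)\b'),
    ("IDENTIFIER", r'[A-Za-z](?:_?[A-Za-z0-9])*'),
    ("SYMBOL",     r':=|[;():=]'),
    ("NEWLINE",    r'\n'),
    ("SPACE",      r'[ \t]+'),
]

# One combined pattern with a named group per token kind.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)


def tokenize(text):
    """Return a list of (kind, lexeme) pairs, skipping white space."""
    tokens = []
    line = 1
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"line {line}: unexpected character {text[pos]!r}")
        kind = m.lastgroup
        if kind == "BAD_STRING":
            # Because the string pattern excludes new line, the error is
            # reported on the line where the quote was opened.
            raise SyntaxError(f"line {line}: missing closing quotation mark")
        if kind == "NEWLINE":
            line += 1
        elif kind != "SPACE":
            tokens.append((kind, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize('procedure Main is begin Put ("hi"); end Main;'):
        print(token)
```

The BAD_STRING entry is one way to get more outcomes than "matched vs.
not matched" while still using regular expressions; a real lexer would
need the full Ada token set and a more careful treatment of edge cases
such as doubled quotes at the end of a line.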