From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!feeder.eternal-september.org!news.glorb.com!peer02.iad.highwinds-media.com!news.highwinds-media.com!feed-me.highwinds-media.com!post02.iad.highwinds-media.com!news.flashnewsgroups.com-b7.4zTQh5tI3A!not-for-mail
From: Stephen Leake <stephen_leake@stephe-leake.org>
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
References: <878uc3r2y6.fsf@adaheads.sparre-andersen.dk>
 	<85twupvjxo.fsf@stephe-leake.org>
 	<81ceb070-16fe-4578-a09a-eb11a2bbb664@googlegroups.com>
 	<162zj7c2l0ykp$.1rxias18vby83.dlg@40tude.net>
Date: Fri, 05 Jun 2015 04:03:27 -0500
Message-ID: <856172bk80.fsf@stephe-leake.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4 (windows-nt)
Cancel-Lock: sha1:wHb0CtexGBdnxj221LJpa4tFjME=
MIME-Version: 1.0
Content-Type: text/plain
X-Complaints-To: abuse@flashnewsgroups.com
Organization: FlashNewsgroups.com
X-Trace: 60d06557165e1e97f808403109
X-Received-Bytes: 4412
X-Received-Body-CRC: 1424736922
Xref: news.eternal-september.org comp.lang.ada:26189
Date: 2015-06-05T04:03:27-05:00
List-Id: <comp.lang.ada>

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On Tue, 2 Jun 2015 18:43:50 -0700 (PDT), Shark8 wrote:
>
>> On Tuesday, June 2, 2015 at 4:12:37 PM UTC-6, Stephen Leake wrote:
>>> 
>>> Obvious to me, but I've been messing with the lexer code in FastToken
>>> recently; I switched to using regular expressions for the token
>>> recognizers (to be closer to Aflex, which is also now supported). 
>
>> From my experience [mainly maintenance] RegEx is almost always a bad
>> solution (though I will grant that most of my encounters w/ it involved
>> applying it as a formatting/parsing tool for items that generally weren't
>> amiable to such breakdowns [street addresses, for example, are good at
>> containing info/formatting that kills a simple regex]).
>
> Yes. Maintenance is one problem, another is that the family of languages
> recognized by regular expressions is far too weak. More powerful languages,
> e.g. SNOBOL patterns are slower.

The simple lexer in FastToken is intended only to support the FastToken
unit tests, which focus on testing the parser generator and executor.
Simplicity is the prime driver here; no need for expressive power or
speed.

Aflex compiles the regular expressions for all of the tokens into one
state machine that visits each character in the input stream once.
You can't get faster than that.
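
As a rough illustration of the "one recognizer for all tokens" idea,
here is a minimal Python sketch. It is not Aflex input and not FastToken
code; the token names and patterns are made up for a tiny Ada-like
subset. Note that Python's re module is a backtracking matcher, not a
compiled state machine, so this shows only how the per-token regexps
merge into a single recognizer, not the speed Aflex gets.

```python
import re

# Hypothetical token set for a tiny Ada-like subset (illustrative only).
# Order matters: reserved words are tried before identifiers.
TOKEN_SPEC = [
    ("WHITESPACE", r"[ \t\r\n]+"),
    ("COMMENT",    r"--[^\n]*"),
    ("RESERVED",   r"(?:begin|end|is|procedure)\b"),
    ("IDENTIFIER", r"[A-Za-z](?:_?[A-Za-z0-9])*"),
    ("INTEGER",    r"[0-9](?:_?[0-9])*"),
    ("SEMICOLON",  r";"),
]

# Merge every token regexp into one pattern with a named group per token.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,  # Ada reserved words and identifiers are case-insensitive
)

def tokenize(text):
    """Return a list of (token_name, lexeme), skipping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(
                f"unrecognized character {text[pos]!r} at offset {pos}")
        if m.lastgroup not in ("WHITESPACE", "COMMENT"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling tokenize("procedure Foo is begin end Foo; -- done") yields the
reserved words, the two identifiers and the semicolon, in order.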

> In the end it is always worth of efforts writing a manual token scanner by
> hand. 

"always" is way too strong a statement here.

If you trust that the regexp engine is well written and maintained, 
the expressive power is adequate for your language, and the speed is
adequate for your application, then why waste resources reimplementing
the tools? Use them and get on with the interesting work.

Regular expressions are perfectly adequate for lexing Ada.

> Firstly, there are not so many things you would have to recognize that
> way. 

I guess you are saying that implementing a lexer for a restricted set of
tokens is easier than implementing a general regular expression engine.
True, but that's not the choice at hand; the choice is between
implementing a new lexer for a restricted set of tokens, or reusing an
existing regular expression engine (supported and maintained
externally) and specifying a small set of regular expressions (most of
which are simple strings for the reserved words).

> Secondly it is much more efficient than pattern matching. 

Not if you use the Aflex approach; the hand-written OpenToken lexer is
far less efficient than the compiled state machine that Aflex produces.

> Thirdly it would allow sane error messaging, because usually it is
> more outcomes than matched vs. not matched, e.g. malformed identifier
> or missing quotation mark.

This is a valid but minor point.

For Ada strings, since new line is excluded, a missing quotation mark
does not produce a very confusing error message (which is precisely why
new line is excluded). Other languages are worse for string errors, but
the parser stage can provide a better error message if desired. Syntax
highlighting in the typical IDE is the best way to address this
particular problem; when the entire rest of the file changes color, you
know you are missing a quote.
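
To make the point about new line concrete, here is a minimal Python
sketch (again not FastToken or OpenToken code; the function name and
message text are made up). Because the string literal regexp excludes
new line, a missing quote is reported on the line where it occurs and
does not swallow the rest of the file. As a simplification, the sketch
ignores comments and character literals such as '"'.

```python
import re

# Ada string literal: a quote, then any characters except a quote or
# new line (a doubled quote stands for one quote), then a closing quote.
STRING_LITERAL = re.compile(r'"(?:[^"\n]|"")*"')

def check_strings(source):
    """Return (line_number, message) for each line with an unterminated string."""
    errors = []
    for number, line in enumerate(source.split("\n"), start=1):
        pos = 0
        while True:
            start = line.find('"', pos)
            if start < 0:
                break  # no more string literals on this line
            m = STRING_LITERAL.match(line, start)
            if m is None:
                # The opening quote has no closing quote before end of line.
                errors.append((number, "missing closing quotation mark"))
                break
            pos = m.end()
    return errors
```

Only the line with the unbalanced quote is flagged; the lines after it
still lex normally.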

It's not an issue for the FastToken unit tests, and the syntax error
messages from OpenToken in other contexts have not bothered me yet
(certainly not as much as some Microsoft error messages: "cannot load
the specified module" indeed; tell me _which_ module!). Nor have I
gotten any complaints in that area from ada-mode users.

-- 
-- Stephe