From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-0.3 required=5.0 tests=BAYES_00,
	REPLYTO_WITHOUT_TO_CC autolearn=no autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!feeder.eternal-september.org!aioe.org!.POSTED!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Fri, 5 Jun 2015 14:20:22 +0200
Organization: cbb software GmbH
Message-ID: <1ljiyuuchbxvp.wrtbilkw3rdb.dlg@40tude.net>
References: <878uc3r2y6.fsf@adaheads.sparre-andersen.dk>
 <85twupvjxo.fsf@stephe-leake.org>
 <81ceb070-16fe-4578-a09a-eb11a2bbb664@googlegroups.com>
 <162zj7c2l0ykp$.1rxias18vby83.dlg@40tude.net>
 <856172bk80.fsf@stephe-leake.org>
Reply-To: mailbox@dmitry-kazakov.de
NNTP-Posting-Host: enOx0b+nfqkc2k+TNpOejg.user.speranza.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: 40tude_Dialog/2.0.15.1
X-Notice: Filtered by postfilter v. 0.8.2
Xref: news.eternal-september.org comp.lang.ada:26193
Date: 2015-06-05T14:20:22+02:00
List-Id: <comp.lang.ada>

On Fri, 05 Jun 2015 04:03:27 -0500, Stephen Leake wrote:

> Aflex compiles all the regular expressions for all of the tokens into
> one state machine, that visits each character in the input stream once.
> You can't get faster than that.

The trade-off is visiting source characters vs. visiting internal states of
the machine and computing transitions. You cannot say in advance which is
more expensive. A transition computation may be much more expensive than
accessing a character stored in an array. In any case, hand-written scanners
usually do not roll back either.

>> In the end it is always worth of efforts writing a manual token scanner by
>> hand. 
> 
> "always" is way too strong a statement here.
> 
> If you trust that the regexp engine is well written and maintained, 
> the expressive power is adequate for your language, and the speed is
> adequate for your application, then why waste resources reimplementing
> the tools? Use them and get on with the interesting work.
> 
> regexp are perfectly adequate for Ada. 

The examples I had in mind were Ada identifiers and Ada string literals, e.g.
in UTF-8 encoding. I don't think regular expressions for these would be
shorter than a scanner hand-written in Ada.
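
To give an idea of the size, here is a minimal sketch of a hand-written
scanner for the string literal case. All names (Scan_Demo, Outcome,
Get_String_Literal, Try) are made up for illustration and come from no
existing library. It assumes the source line is already in a String, and it
does not check that the body consists of graphic characters only. It works
on UTF-8 input unchanged, because every octet of a multi-octet UTF-8
sequence is 16#80# or above and so cannot be mistaken for a quotation mark.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Scan_Demo is

   --  More outcomes than just "matched" vs. "not matched"
   type Outcome is (Matched, Not_A_String, Missing_Quotation_Mark);

   Q : constant Character := '"';

   --  Scan an Ada string literal in Line starting at Pointer. On Matched,
   --  Pointer is advanced past the closing quotation mark; otherwise it
   --  is left unchanged. Each character is visited once, no roll back.
   procedure Get_String_Literal
     (Line    : String;
      Pointer : in out Positive;
      Result  : out Outcome)
   is
      Index : Positive;
   begin
      if Pointer > Line'Last or else Line (Pointer) /= Q then
         Result := Not_A_String;
         return;
      end if;
      Index := Pointer + 1;
      while Index <= Line'Last loop
         if Line (Index) /= Q then
            Index := Index + 1;
         elsif Index < Line'Last and then Line (Index + 1) = Q then
            Index := Index + 2;  --  Doubled quotation mark inside the body
         else
            Pointer := Index + 1;  --  Closing quotation mark
            Result  := Matched;
            return;
         end if;
      end loop;
      --  End of line reached before the closing quotation mark
      Result := Missing_Quotation_Mark;
   end Get_String_Literal;

   procedure Try (Line : String) is
      Pointer : Positive := Line'First;
      Result  : Outcome;
   begin
      Get_String_Literal (Line, Pointer, Result);
      Put_Line (Line & " => " & Outcome'Image (Result));
   end Try;

begin
   Try (Q & "abc" & Q);                --  Matched
   Try (Q & "a" & Q & Q & "b" & Q);    --  Matched, doubled quotation mark
   Try (Q & "abc");                    --  Missing_Quotation_Mark
   Try ("abc");                        --  Not_A_String
end Scan_Demo;
```

The third outcome is what lets the caller report "missing quotation mark"
at the right source position instead of a plain mismatch.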

>> Firstly, there are not so many things you would have to recognize that
>> way. 
> 
> I guess you are saying that implementing a lexer for a restricted set of
> tokens is easier than implementing a general regular expression engine.
> True, but that's not the choice at hand; the choice is between
> implementing a new lexer for a restricted set of tokens, or reusing an
> existing regular expression engine (supported and maintained
> externally) and specifying a small set of regular expressions (most of
> which are simple strings for the reserved words).

Yes, it is writing regular expressions vs. writing an Ada program. I prefer
the Ada program.

>> Thirdly it would allow sane error messaging, because usually it is
>> more outcomes than matched vs. not matched, e.g. malformed identifier
>> or missing quotation mark.
> 
> This is a valid but minor point.
> 
> For Ada strings, since new line is excluded, a missing quotation mark
> does not produce a very confusing error message (which is precisely why
> new line is excluded).

Which is tricky in the latest standard, because the RM refers to
line-terminating characters, whereas the OS file system may have its own
definition of line end.
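
As a rough sketch of what the scanner side of this looks like: the test
below treats as line-terminating the format effectors other than CHARACTER
TABULATION, which is my reading of RM 2.1 and 2.2 and should be checked
against the text of the standard. The names Line_End_Demo and
Is_Line_Terminator are invented for the example, and the input is assumed
to be an already decoded code point. Whatever the OS or the I/O layer
considers a line end is a separate question that this does not answer.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Line_End_Demo is

   --  True for code points ASSUMED to end a line: the format effectors
   --  except CHARACTER TABULATION (16#09#).
   function Is_Line_Terminator (Code : Natural) return Boolean is
   begin
      case Code is
         when 16#0A# .. 16#0D#   --  LF, VT, FF, CR
            | 16#85#             --  NEXT LINE
            | 16#2028#           --  LINE SEPARATOR
            | 16#2029# =>        --  PARAGRAPH SEPARATOR
            return True;
         when others =>
            return False;
      end case;
   end Is_Line_Terminator;

begin
   Put_Line ("CR:  " & Boolean'Image (Is_Line_Terminator (16#0D#)));
   Put_Line ("TAB: " & Boolean'Image (Is_Line_Terminator (16#09#)));
end Line_End_Demo;
```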

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de