From: Stephen Leake
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Tue, 16 Jun 2015 07:43:49 -0500
Message-ID: <85pp4vakmy.fsf@stephe-leake.org>

"Dmitry A. Kazakov" writes:

> On Fri, 05 Jun 2015 04:03:27 -0500, Stephen Leake wrote:
>
>> Aflex compiles all the regular expressions for all of the tokens
>> into one state machine, that visits each character in the input
>> stream once. You can't get faster than that.
>
> It is visiting source character vs. visiting internal states of the
> machine and transition computations. You cannot say what is more
> expensive in advance.
True; measuring is always required.

> A transition computation may be much more expensive than accessing a
> character stored in an array. However hand-written scanners do not
> roll back either, usually.

Ok.

>>> In the end it is always worth of efforts writing a manual token
>>> scanner by hand.
>>
>> "always" is way too strong a statement here.
>>
>> If you trust that the regexp engine is well written and maintained,
>> the expressive power is adequate for your language, and the speed is
>> adequate for your application, then why waste resources
>> reimplementing the tools? Use them and get on with the interesting
>> work.
>>
>> regexp are perfectly adequate for Ada.
>
> Examples I had in mind were Ada identifier and Ada string literal,
> e.g. in UTF-8 encoding. I don't think regular expression for these
> would be shorter than a hand-written in Ada scanner.

Yes, but also not longer!

regular expression for identifier: "foo"
code for identifier:               lexeme = "foo"

>>> Firstly, there are not so many things you would have to recognize
>>> that way.
>>
>> I guess you are saying that implementing a lexer for a restricted
>> set of tokens is easier than implementing a general regular
>> expression engine. True, but that's not the choice at hand; the
>> choice is between implementing a new lexer for a restricted set of
>> tokens, or reusing an existing regular expression engine (supported
>> and maintained externally) and specifying a small set of regular
>> expressions (most of which are simple strings for the reserved
>> words).
>
> Yes, it is writing regular expression vs. writing Ada program. I
> prefer Ada program.

I'm confused; you start by agreeing with my statement, but then say
something else.

I said "implement a general regular expression engine", meaning the Ada
code that interprets regular expressions at run time. I hope you agree
that writing that is harder than writing an Ada scanner.
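The "small set of regular expressions" approach can be sketched concretely. The following is a toy illustration in Python, not OpenToken's actual API; the token names and regexps are invented for the example. It combines a few reserved-word patterns and an identifier pattern into one alternation, so the engine visits the input in a single pass, much as an aflex-generated state machine does. One caveat: Python alternation takes the first matching branch rather than the longest match, so the reserved words are listed before IDENTIFIER and anchored with word boundaries.

```python
import re

# A toy scanner over a tiny Ada subset: a few reserved words plus
# identifiers and whitespace. Illustrative only, not OpenToken's tables.
TOKENS = [
    ("IF",         r"\bif\b"),
    ("THEN",       r"\bthen\b"),
    ("END",        r"\bend\b"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9_]*"),
    ("WHITESPACE", r"\s+"),
]

# Combine all token regexps into one pattern; the engine then scans
# the input in a single pass.
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKENS))

def scan(text):
    """Yield (token-name, lexeme) pairs, skipping whitespace."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "WHITESPACE":
            yield (m.lastgroup, m.group())

# list(scan("if x then end"))
# -> [('IF', 'if'), ('IDENTIFIER', 'x'), ('THEN', 'then'), ('END', 'end')]
```

Note that the word-boundary anchors keep "iffy" from being split into a reserved word plus a remainder; it scans as a single IDENTIFIER.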
To address your point: let's compare writing a regular expression with
writing Ada code to do the same thing.

Here's the regular expression I use for Ada numeric literals:

"\([0-9]+#\)?[-+0-9a-fA-F._]+\(#\)?"

Given that you are at least a little familiar with regular expressions,
there's nothing hard about that.

It does not enforce all the lexical rules for numbers; it allows
repeated, leading, and trailing underscores; it doesn't enforce pairs
of '#'. But there is no legal Ada text that will be misinterpreted by
this expression. It is always a trade-off between enforcing that sort
of rule in the lexer or in the parser.

The OpenToken Ada code that does the same job (while enforcing some of
the rules) is in opentoken-recognizer-based_integer_real_ada.ads, .adb,
and opentoken-recognizer-extended_digits.ads, .adb. That's a total of
469 lines of code. It is a state-based scanner; a recursive descent one
would be simpler.

It took several hours to write the Ada code, and a few more for the
tests. Later I found a bug that required a new test and a code change.
It took a few minutes to write the regular expression, and I reused the
tests; no bugs have been found yet.

Obviously, if you are not familiar with regular expressions, they will
be harder to write. But a software engineer should be willing to learn
the appropriate language for the job at hand.

To be fair, if you are used to writing Ada code that does scanning, and
have a good library for it, that job will be easier as well. So it may
just come down to experience and available tools.

I would guess that the average good programmer, starting with no
knowledge of either, can learn enough about regular expressions to
write the above faster than they can learn enough about writing
scanners in Ada to do that job well.

-- 
-- Stephe
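The numeric-literal expression can be exercised directly. Below is a minimal sketch in Python syntax (an assumption: the Emacs-style \(...\) groups become (...), and '_' is included in the character class so that underscored literals pass, per the description of what the expression allows). re.fullmatch here only checks that a whole lexeme is accepted; it is not a lexer.

```python
import re

# Python translation of the Emacs-style regexp for Ada numeric
# literals; '_' included in the class so underscored literals pass.
NUMERIC = re.compile(r"([0-9]+#)?[-+0-9a-fA-F._]+(#)?")

def accepts(lexeme: str) -> bool:
    # True if the whole lexeme is matched by the expression.
    return NUMERIC.fullmatch(lexeme) is not None
```

As noted above, the expression is deliberately loose: lexemes such as "1__0" or "16#FF" (unpaired '#') are accepted here, with enforcement of those rules left to the parser.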