Re: OpenToken: Parsing Ada (subset)?

comp.lang.ada
 help / color / mirror / Atom feed

From: Stephen Leake <stephen_leake@stephe-leake.org>
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Tue, 16 Jun 2015 07:43:49 -0500
Date: 2015-06-16T07:43:49-05:00	[thread overview]
Message-ID: <85pp4vakmy.fsf@stephe-leake.org> (raw)
In-Reply-To: 1ljiyuuchbxvp.wrtbilkw3rdb.dlg@40tude.net

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On Fri, 05 Jun 2015 04:03:27 -0500, Stephen Leake wrote:
>
>> Aflex compiles all the regular expressions for all of the tokens into
>> one state machine, that visits each character in the input stream once.
>> You can't get faster than that.
>
> It is visiting source character vs. visiting internal states of the machine
> and transition computations. You cannot say what is more expensive in
> advance. 

True; measuring is always required.

> A transition computation may be much more expensive than accessing
> a character stored in an array. However hand-written scanners do not roll
> back either, usually.

Ok.

>>> In the end it is always worth of efforts writing a manual token scanner by
>>> hand. 
>> 
>> "always" is way too strong a statement here.
>> 
>> If you trust that the regexp engine is well written and maintained, 
>> the expressive power is adequate for your language, and the speed is
>> adequate for your application, then why waste resources reimplementing
>> the tools? Use them and get on with the interesting work.
>> 
>> regexp are perfectly adequate for Ada. 
>
> Examples I had in mind were Ada identifier and Ada string literal, e.g. in
> UTF-8 encoding. I don't think regular expression for these would be shorter
> than a hand-written in Ada scanner.

Yes, but also not longer!

regular expression for identifer: "foo"

code for identifier: lexeme = "foo"

>>> Firstly, there are not so many things you would have to recognize that
>>> way. 
>> 
>> I guess you are saying that implementing a lexer for a restricted set of
>> tokens is easier than implementing a general regular expression engine.
>> True, but that's not the choice at hand; the choice is between
>> implementing a new lexer for a restricted set of tokens, or reusing an
>> existing regular expression engine (supported and maintained
>> externally) and specifying a small set of regular expressions (most of
>> which are simple strings for the reserved words).
>
> Yes, it is writing regular expression vs. writing Ada program. I prefer Ada
> program.

I'm confused; you start by agreeing with my statement, but then say
something else.

I said "implement a general regular expression engine", meaning the Ada
code that interprets regular expressions at run time. I hope you agree
that writing that is harder than writing an Ada scanner.

To address your point; let's compare writing a regular expression with
writing Ada code to do the same thing.

Here's the regular expression I use for Ada numeric literals:

"\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?"

Given that you are at least a little familiar with regular expressions,
there's nothing hard about that.

It does not enforce all the lexical rules for numbers; it allows
repeated, leading, and trailing underscores; it doesn't enforce pairs of
'#'. But there is no legal Ada text that will be misinterpreted by this
expression. It is always a trade off between enforcing that sort of rule
in the lexer or the parser.

The OpenToken Ada code that does the same job (while enforcing some of
the rules) is in opentoken-recognizer-based_integer_real_ada.ads, .adb,
opentoken-recognizer-extended_digits.ads, .adb. That's a total of 469
lines of code. That's a state-based scanner; a recursive descent one
would be simpler.

It took several hours to write the Ada code, and a few more for the tests.
But later I found a bug that required a new test and code change.

It took a few minutes to write the regular expression, and I reused the
tests; no bugs found yet.

Obviously, if you are not familiar with regular expressions, they will
be harder to write. But a software engineer should be willing to learn
the appropriate language for the job at hand.

To be fair, if you are used to writing Ada code that does scanning, and
have a good library for it, that job will be easier as well. So it may
just come down to experience and available tools.

I would guess that the average good programmer, starting with no
knowledge of either, can learn enough about regular expressions to write
the above faster than they can learn enough about writing scanners in
Ada to do that job well.

-- 
-- Stephe

next prev parent reply	other threads:[~2015-06-16 12:43 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-01 13:08 OpenToken: Parsing Ada (subset)? Jacob Sparre Andersen
2015-06-02 22:12 ` Stephen Leake
2015-06-03  1:43   ` Shark8
2015-06-03  7:36     ` Dmitry A. Kazakov
2015-06-05  9:03       ` Stephen Leake
2015-06-05  9:23         ` Georg Bauhaus
2015-06-05 20:49           ` Shark8
2015-06-05 23:52             ` Dennis Lee Bieber
2015-06-05 12:20         ` Dmitry A. Kazakov
2015-06-16 12:43           ` Stephen Leake [this message]
2015-06-16 13:24             ` Dmitry A. Kazakov
2015-06-16 14:13               ` G.B.
2015-06-17 17:38                 ` Stephen Leake
2015-06-17 17:29               ` Stephen Leake
2015-06-17 17:42                 ` Shark8
2015-06-17 19:03                 ` Dmitry A. Kazakov
2015-06-05 20:53         ` Shark8
2015-06-16 14:46           ` Stephen Leake
2015-06-16 15:31             ` G.B.
2015-06-17 17:44               ` Stephen Leake
2015-06-16 21:34             ` Randy Brukardt
2015-06-17 17:58               ` Stephen Leake
2015-06-17 20:44                 ` Randy Brukardt
2015-06-18  7:51                 ` AdaMagica
2015-06-18  9:12                 ` Georg Bauhaus
2015-06-17 17:50 ` AdaMagica

replies disabled

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox