From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-0.3 required=5.0 tests=BAYES_00,
	REPLYTO_WITHOUT_TO_CC autolearn=no autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!feeder.eternal-september.org!aioe.org!.POSTED!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Wed, 17 Jun 2015 21:03:59 +0200
Organization: cbb software GmbH
Message-ID: <1ucuzb8jv2ibe.4awaxtp8eab6.dlg@40tude.net>
References: <878uc3r2y6.fsf@adaheads.sparre-andersen.dk>
 <85twupvjxo.fsf@stephe-leake.org>
 <81ceb070-16fe-4578-a09a-eb11a2bbb664@googlegroups.com>
 <162zj7c2l0ykp$.1rxias18vby83.dlg@40tude.net>
 <856172bk80.fsf@stephe-leake.org> <1ljiyuuchbxvp.wrtbilkw3rdb.dlg@40tude.net>
 <85pp4vakmy.fsf@stephe-leake.org>
 <1a08qrccls0bi$.16y7q3hosklae.dlg@40tude.net>
 <85twu68cqb.fsf@stephe-leake.org>
Reply-To: mailbox@dmitry-kazakov.de
NNTP-Posting-Host: evoS9sCOdnHjo0GRLLMU1Q.user.speranza.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: 40tude_Dialog/2.0.15.1
X-Notice: Filtered by postfilter v. 0.8.2
Xref: news.eternal-september.org comp.lang.ada:26364
Date: 2015-06-17T21:03:59+02:00
List-Id: <comp.lang.ada>

On Wed, 17 Jun 2015 12:29:48 -0500, Stephen Leake wrote:

> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
> 
>> On Tue, 16 Jun 2015 07:43:49 -0500, Stephen Leake wrote:
>>
>>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>> 
>>> Here's the regular expression I use for Ada numeric literals:
>>> 
>>> "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?"
>>> 
>>> Given that you are at least a little familiar with regular expressions,
>>> there's nothing hard about that.
>>
>> It is hard.
> 
> Ok, I gather you are not "at least a little familiar with regular
> expressions".

*Nobody* is familiar enough with regular expressions to be sure that the
language generated by a pattern like the one above is that of Ada numeric
literals. Note "like", because your pattern obviously does not generate the
Ada numeric literals.
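
A minimal sketch of the mismatch, in Python. The assumptions are mine: the
quoted Emacs-style pattern is transliterated to Python syntax (\( \) become
( )), and "accepted" means the pattern matches the whole candidate string.
The helper name and the sample strings are made up for illustration.

```python
import re

# Assumed transliteration of the quoted Emacs-style pattern.
PATTERN = re.compile(r"([0-9]+#)?[-+0-9a-fA-F.]+(#)?")

def accepted(text):
    """True when the pattern matches the whole candidate string."""
    return PATTERN.fullmatch(text) is not None

# Not legal Ada numeric literals: unmatched '#', doubled '.', signs only,
# digits outside base 2, no leading digit.
false_positives = ["16#FF", "1..2", "+-+", "2#777#", "ABC"]

# Legal Ada numeric literals: underscore separator, exponent after the
# closing '#'.
false_negatives = ["1_000", "16#FF#E2"]

for s in false_positives + false_negatives:
    print(s, accepted(s))
```

Under these assumptions every string in the first list is accepted and every
string in the second list is rejected.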

Things are actually much worse than mere complexity. It is a combination of
complexity and weakness. Regular expressions cannot do stuff like Ada
literals. Thus the patterns actually used are only approximations of what is
required. The designer must know how the generated language differs from the
required one. And the reader must read not only the program but also the
mind of the pattern designer.

>>> It does not enforce all the lexical rules for numbers; it allows
>>> repeated, leading, and trailing underscores; it doesn't enforce pairs of
>>> '#'.
>>
>> That is exactly the point. It does not parse the literal right
> 
> It's _not_ a "parser"; it's a "lexer".
> 
> Define "right".

def Right:

No false positives, no false negatives <=> Rejects only illegal literals,
accepts only legal literals.

> The line between lexer and parser is a design decision,
> not set in stone. 

True, but we are not talking about higher-level things like the maximum
supported fraction length. Simple lexical stuff like:

- matching '#'s
- non-repeating '_'s
- valid base number
- the set of digits corresponding to the base
etc

all are beyond the power of regular expressions. (Unlike SNOBOL patterns)
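
For comparison, a rough sketch of a hand-written check covering the items
listed above, following my reading of Ada RM 2.4. It is written in Python
only for brevity. The function names are hypothetical, and the obsolescent
':' replacement for '#' is not handled.

```python
DIGITS = "0123456789ABCDEF"

def scan_numeral(text, pos, base):
    """Scan digit {[underline] digit} in the given base.

    Returns the index just past the numeral, or -1 if no digit is found.
    """
    start = pos
    while pos < len(text):
        ch = text[pos].upper()
        if ch in DIGITS[:base]:
            pos += 1
        elif (ch == "_" and pos > start and pos + 1 < len(text)
              and text[pos + 1].upper() in DIGITS[:base]):
            pos += 1  # single underscore between two digits
        else:
            break
    return pos if pos > start else -1

def is_ada_numeric_literal(text):
    """True when the whole string is a legal decimal or based literal."""
    pos = scan_numeral(text, 0, 10)
    if pos < 0:
        return False
    base = 10
    based = pos < len(text) and text[pos] == "#"
    if based:
        base = int(text[:pos].replace("_", ""))
        if not 2 <= base <= 16:          # valid base number
            return False
        pos = scan_numeral(text, pos + 1, base)
        if pos < 0:                      # digits must fit the base
            return False
    real = pos < len(text) and text[pos] == "."
    if real:
        pos = scan_numeral(text, pos + 1, base)
        if pos < 0:
            return False
    if based:
        if pos >= len(text) or text[pos] != "#":   # matching '#'
            return False
        pos += 1
    if pos < len(text) and text[pos] in "eE":
        pos += 1
        if pos < len(text) and text[pos] in "+-":
            if text[pos] == "-" and not real:      # integer literal
                return False
            pos += 1
        pos = scan_numeral(text, pos, 10)
        if pos < 0:
            return False
    return pos == len(text)
```

For instance, is_ada_numeric_literal("16#FF#E2") holds, while "16#FF",
"2#777#", "17#1#" and "1__000" are all rejected.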

>> and you have to
>> reparse the matched chunk of text once again. What was the gain? 
> 
> Doing it this way allows reusing a regexp engine, which is easier than
> writing a lexer from scratch.

You still have to parse it again. With or without regular expressions, you
have to do that work. The only difference is in detecting the end of the
lexeme, which is not a problem at all for a manually written scanner.
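
To illustrate, here is a sketch of a manual scanner that finds the end of the
lexeme and computes the value in the same pass. It is limited to integer
literals to keep it short; real literals are left out. The names are
hypothetical and Python is used only for brevity.

```python
DIGITS = "0123456789ABCDEF"

def scan_digits(text, pos, base):
    """Consume digit {[_] digit}; return (value, end), or None if no digit."""
    value, start = 0, pos
    while pos < len(text):
        ch = text[pos].upper()
        if ch in DIGITS[:base]:
            value = value * base + DIGITS.index(ch)
            pos += 1
        elif (ch == "_" and pos > start and pos + 1 < len(text)
              and text[pos + 1].upper() in DIGITS[:base]):
            pos += 1  # single underscore between two digits
        else:
            break
    return (value, pos) if pos > start else None

def scan_integer_literal(text, pos=0):
    """Scan an Ada integer literal at pos; return (value, end) or None."""
    head = scan_digits(text, pos, 10)
    if head is None:
        return None
    value, pos = head
    base = 10
    if pos < len(text) and text[pos] == "#":
        base = value
        if not 2 <= base <= 16:
            return None
        body = scan_digits(text, pos + 1, base)
        if body is None or body[1] >= len(text) or text[body[1]] != "#":
            return None
        value, pos = body[0], body[1] + 1
    if pos < len(text) and text[pos] in "eE":
        sign = pos + 1
        if sign < len(text) and text[sign] == "+":
            sign += 1
        exp = scan_digits(text, sign, 10)
        if exp is None:
            return None
        value, pos = value * base ** exp[0], exp[1]
    return value, pos

print(scan_integer_literal("X := 16#FF# + 1;", 5))
```

The end index falls out of the scan, so no separate matching step is needed
before the text is converted.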

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de