* OpenToken: Parsing Ada (subset)?
@ 2015-06-01 13:08 Jacob Sparre Andersen
  2015-06-02 22:12 ` Stephen Leake
  2015-06-17 17:50 ` AdaMagica
  0 siblings, 2 replies; 26+ messages in thread

From: Jacob Sparre Andersen @ 2015-06-01 13:08 UTC (permalink / raw)

I'm attempting to use OpenToken to parse an Ada subset without much
success.  I've reduced my problem all the way down to attempting to
parse a "dotted" identifier:

   Example           -> Dotted_Identifier EOF
   Dotted_Identifier -> Identifier
                      | Dotted_Identifier "." Identifier

I declare the syntax like this (I've removed the reserved words for now):

   Syntax : constant Tokenizer.Syntax :=
     (Dot_T         => Tokenizer.Get (OpenToken.Recognizer.Separator.Get (".")),
      Identifier_T  => Tokenizer.Get
        (OpenToken.Recognizer.Identifier.Get
           (Start_Chars => Ada.Strings.Maps.Constants.Letter_Set,
            Body_Chars  => Ada.Strings.Maps.Constants.Alphanumeric_Set)),
      Comment_T     => Tokenizer.Get (OpenToken.Recognizer.Line_Comment.Get ("--")),
      Whitespace_T  => Tokenizer.Get
        (OpenToken.Recognizer.Character_Set.Get
           (OpenToken.Recognizer.Character_Set.Standard_Whitespace)),
      End_Of_File_T => Tokenizer.Get (OpenToken.Recognizer.End_Of_File.Get));

I declare the token objects (for writing the grammar) like this:

   Dot               : constant Master_Token.Class := Master_Token.Get (Dot_T);
   EOF               : constant Master_Token.Class := Master_Token.Get (End_Of_File_T);
   Identifier        : constant Identifiers.Instance'Class := Identifiers.Get (Identifier_T);
   Compilation_Unit  : constant Nonterminal.Class := Nonterminal.Get (Compilation_Unit_T);
   Dotted_Identifier : constant Dotted_Identifiers.Class := Dotted_Identifiers.Get (Dotted_Identifier_T);

I declare the grammar as:

   Grammar : constant Production_List.Instance :=
     Compilation_Unit  <= Dotted_Identifier & EOF and
     Dotted_Identifier <= Dotted_Identifier & Dot & Identifier
                            + Dotted_Identifiers.Join_Dotted_Identifiers and
     Dotted_Identifier <= Identifier;

When I attempt to use this grammar to parse a "dotted" identifier, I run
into problems when OpenToken attempts to convert an identifier token to
a Dotted_Identifier.Instance object.  The problem is that the identifier
is represented as a plain Master_Token.Instance object, which doesn't
contain the actual identifier text - which means that I can't copy it
into the "dotted" identifier.

I'm using OpenToken 6.0b.  The full compilable source (including the
used OpenToken packages) can be downloaded from
<http://jacob.sparre-andersen.dk/temp/ada_parsing_with_opentoken-2015-06-01-a.zip>.

To build the example:  gnatmake parse_dotted_identifier
To test it:            echo Ada.Text_IO.Integer_IO | ./parse_dotted_identifier

I'm probably doing something obviously wrong, but what?

Greetings,

Jacob
--
"It is very easy to get ridiculously confused about the tenses of time
 travel, but most things can be resolved by a sufficiently large ego."

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-01 13:08 OpenToken: Parsing Ada (subset)? Jacob Sparre Andersen
@ 2015-06-02 22:12 ` Stephen Leake
  2015-06-03  1:43   ` Shark8
  2015-06-17 17:50 ` AdaMagica
  1 sibling, 1 reply; 26+ messages in thread

From: Stephen Leake @ 2015-06-02 22:12 UTC (permalink / raw)

Jacob Sparre Andersen <sparre@nbi.dk> writes:

> I'm attempting to use OpenToken to parse an Ada subset without much
> success.

(Shameless plug) An alternative approach is to start from the full Ada
grammar in Emacs ada-mode (ada-grammar.wy), and reduce it to what you
want.  You'd also have to change the actions; they are focused on
supporting indentation and navigation in Emacs.

That grammar is in a .wy file, which must be processed by wisi-generate
to produce Ada code.  The code produced by wisi-generate does not
support Ada actions at the moment (it only supports Emacs elisp, or a
process or dll communicating with Emacs), but you could either add that
to wisi-generate or treat the generated code as a template and add your
own actions.  wisi-generate also assumes a single token type, so that
changes how you write the actions (more code in the actions, less in
the tokens).

If you want to try adding actions to wisi-generate, I suggest you start
from FastToken (in monotone branch org.fasttoken, no release yet).  It
is passing all its tests.  I am still working on it, so the code will
be changing quite a bit.  I plan to completely eliminate the token type
hierarchy, to avoid the run-time dynamic memory management it requires.
But that should be mostly orthogonal to actions.

I'd be happy to help with this; it is on my list of things to do
someday, and having an actual user is fun :).

>    Identifier_T => Tokenizer.Get
>      (OpenToken.Recognizer.Identifier.Get
>         (Start_Chars => Ada.Strings.Maps.Constants.Letter_Set,
>          Body_Chars  => Ada.Strings.Maps.Constants.Alphanumeric_Set)),

This says to use the default 'Get' routine to specify the token type to
use for Identifier_T; that is Master_Token.Instance, as you complain
later (see opentoken-token-enumerated-analyzer.ads:88).  Change it to:

   Identifier_T => Tokenizer.Get
     (OpenToken.Recognizer.Identifier.Get
        (Start_Chars => Ada.Strings.Maps.Constants.Letter_Set,
         Body_Chars  => Ada.Strings.Maps.Constants.Alphanumeric_Set),
      Identifiers.Get (Identifier_T)),

I hate default parameters in general; this one is sort of convenient
(most terminals are happy being Master_Token.Instance, and you don't
have to repeat Identifier_T), but it easily leads to this kind of bug
(most default parameters lead to hard-to-find bugs!).  Note that the ID
parameter to Identifiers.Get could be defaulted; it is overwritten in
Analyzer.Initialize.

To see this in the code: terminal token objects are created on the
parser stack at opentoken-production-parser-lalr-parser.adb:318, as a
copy of a token returned by Analyzer.Get.  Analyzer.Get returns the
token stored in the syntax; see
opentoken-token-enumerated-analyzer.adb:772.  Nonterminal token objects
are created at opentoken-production-parser-lalr-parser.adb:177, as a
copy of the production left hand side (i.e. Dotted_Identifier below).

> Identifier : constant Identifiers.Instance'Class := Identifiers.Get (Identifier_T);
>
> I declare the grammar as:
>
>    Grammar : constant Production_List.Instance :=
>      Compilation_Unit  <= Dotted_Identifier & EOF and
>      Dotted_Identifier <= Dotted_Identifier & Dot & Identifier
>                             + Dotted_Identifiers.Join_Dotted_Identifiers and
>      Dotted_Identifier <= Identifier;

You might think that the use of Identifiers.Instance here does what you
want, but it doesn't; the grammar determines the types only of the
nonterminals, the syntax determines the types of the terminals.

I suppose I should add that to the OpenToken user manual, but I'd
rather work on the FastToken user manual, which won't have this problem
(only one type for tokens :).

> I'm probably doing something obviously wrong, but what?

Obvious to me, but I've been messing with the lexer code in FastToken
recently; I switched to using regular expressions for the token
recognizers (to be closer to Aflex, which is also now supported).
While doing that, I deleted the default Get above.

Using regular expressions instead of the old OpenToken recognizers is a
big change, but wisi-generate takes care of most of the drudge work for
you (you just have to specify the actual regular expressions).

Glad to hear someone is using OpenToken, and it's fun to actually talk
about this stuff once in a while :).

--
-- Stephe
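[Editor's illustration]  The mechanism Stephen describes (the analyzer's syntax table stores one prototype token per terminal ID, and the parser works with copies of that prototype, so its type decides what survives) can be sketched outside Ada.  This is a hypothetical Python model, not OpenToken's actual API; the names `Token`, `IdentifierToken`, and `lex_one` are invented for illustration only.

```python
# Hypothetical model of the behavior described above: the lexer returns a
# copy of the prototype registered in the syntax table, so registering a
# plain token prototype for Identifier_T silently discards the lexeme.

class Token:
    """Plain prototype (like Master_Token.Instance): keeps no payload."""
    def recognize(self, lexeme):
        return Token()                      # copy carries no text

class IdentifierToken(Token):
    """Identifier-aware prototype (like Identifiers.Instance)."""
    def __init__(self, lexeme=""):
        self.lexeme = lexeme
    def recognize(self, lexeme):
        return IdentifierToken(lexeme)      # copy keeps the matched text

def lex_one(syntax, token_id, lexeme):
    # The analyzer hands back a copy of whatever prototype was stored.
    return syntax[token_id].recognize(lexeme)

buggy_syntax = {"Identifier_T": Token()}            # the default Get
fixed_syntax = {"Identifier_T": IdentifierToken()}  # the suggested change

lost = lex_one(buggy_syntax, "Identifier_T", "Text_IO")
kept = lex_one(fixed_syntax, "Identifier_T", "Text_IO")
print(hasattr(lost, "lexeme"), kept.lexeme)  # prints: False Text_IO
```

With the plain prototype the identifier text is gone by the time the parser tries to build the Dotted_Identifier, which is exactly the symptom in the original post.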
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-02 22:12 ` Stephen Leake
@ 2015-06-03  1:43   ` Shark8
  2015-06-03  7:36     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 26+ messages in thread

From: Shark8 @ 2015-06-03 1:43 UTC (permalink / raw)

On Tuesday, June 2, 2015 at 4:12:37 PM UTC-6, Stephen Leake wrote:
>
> Obvious to me, but I've been messing with the lexer code in FastToken
> recently; I switched to using regular expressions for the token
> recognizers (to be closer to Aflex, which is also now supported). While
> doing that, I deleted the default Get above.
>
> Using regular expressions instead of the old OpenToken recognizers is a
> big change, but wisi-generate takes care of most of the drudge work for
> you (you just have to specify the actual regular expressions).

Is that a good idea?

From my experience [mainly maintenance] RegEx is almost always a bad
solution (though I will grant that most of my encounters w/ it involved
applying it as a formatting/parsing tool for items that generally
weren't amenable to such breakdowns [street addresses, for example, are
good at containing info/formatting that kills a simple regex]).
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-03  1:43 ` Shark8
@ 2015-06-03  7:36   ` Dmitry A. Kazakov
  2015-06-05  9:03     ` Stephen Leake
  0 siblings, 1 reply; 26+ messages in thread

From: Dmitry A. Kazakov @ 2015-06-03 7:36 UTC (permalink / raw)

On Tue, 2 Jun 2015 18:43:50 -0700 (PDT), Shark8 wrote:

> On Tuesday, June 2, 2015 at 4:12:37 PM UTC-6, Stephen Leake wrote:
>>
>> Obvious to me, but I've been messing with the lexer code in FastToken
>> recently; I switched to using regular expressions for the token
>> recognizers (to be closer to Aflex, which is also now supported). While
>> doing that, I deleted the default Get above.
>>
>> Using regular expressions instead of the old OpenToken recognizers is a
>> big change, but wisi-generate takes care of most of the drudge work for
>> you (you just have to specify the actual regular expressions).
>
> Is that a good idea?

No.

> From my experience [mainly maintenance] RegEx is almost always a bad
> solution (though I will grant that most of my encounters w/ it involved
> applying it as a formatting/parsing tool for items that generally
> weren't amenable to such breakdowns [street addresses, for example, are
> good at containing info/formatting that kills a simple regex]).

Yes.  Maintenance is one problem; another is that the family of
languages recognized by regular expressions is far too weak.  More
powerful languages, e.g. SNOBOL patterns, are slower.

In the end it is always worth the effort to write a token scanner by
hand.  Firstly, there are not so many things you would have to
recognize that way.  Secondly, it is much more efficient than pattern
matching.  Thirdly, it would allow sane error messaging, because
usually there are more outcomes than just matched vs. not matched, e.g.
malformed identifier or missing quotation mark.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
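[Editor's illustration]  Dmitry's third point (a hand-written scanner can report more outcomes than matched / not matched) can be made concrete with a small sketch.  This is illustrative Python, not code from any of the libraries discussed; the three status labels are invented, and only the Ada rules "strings exclude new line" and "a doubled quote stands for itself" are assumed from the language.

```python
# A hand-written scanner for a double-quoted string literal can return
# three outcomes where a plain regex match gives only two.

def scan_string(text, pos):
    """Return (status, end_pos).  Status is one of:
    'no_match'      - text at pos is not a string literal at all,
    'ok'            - a complete literal was scanned,
    'missing_quote' - the literal started but never closed."""
    if pos >= len(text) or text[pos] != '"':
        return ("no_match", pos)
    i = pos + 1
    while i < len(text):
        if text[i] == '"':
            if i + 1 < len(text) and text[i + 1] == '"':
                i += 2          # doubled quote inside an Ada string
                continue
            return ("ok", i + 1)
        if text[i] == '\n':     # Ada excludes new line from strings
            break
        i += 1
    return ("missing_quote", i)

print(scan_string('"hello"', 0)[0])   # ok
print(scan_string('123', 0)[0])       # no_match
print(scan_string('"oops\n', 0)[0])   # missing_quote
```

The third outcome is what lets the lexer say "missing closing quote" instead of silently failing to match, which is the error-messaging advantage being argued here.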
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-03  7:36 ` Dmitry A. Kazakov
@ 2015-06-05  9:03   ` Stephen Leake
  2015-06-05  9:23     ` Georg Bauhaus
                       ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread

From: Stephen Leake @ 2015-06-05 9:03 UTC (permalink / raw)

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On Tue, 2 Jun 2015 18:43:50 -0700 (PDT), Shark8 wrote:
>
>> On Tuesday, June 2, 2015 at 4:12:37 PM UTC-6, Stephen Leake wrote:
>>>
>>> Obvious to me, but I've been messing with the lexer code in FastToken
>>> recently; I switched to using regular expressions for the token
>>> recognizers (to be closer to Aflex, which is also now supported).
>
>> From my experience [mainly maintenance] RegEx is almost always a bad
>> solution (though I will grant that most of my encounters w/ it involved
>> applying it as a formatting/parsing tool for items that generally
>> weren't amenable to such breakdowns [street addresses, for example, are
>> good at containing info/formatting that kills a simple regex]).
>
> Yes.  Maintenance is one problem; another is that the family of
> languages recognized by regular expressions is far too weak.  More
> powerful languages, e.g. SNOBOL patterns, are slower.

The simple lexer in FastToken is intended only to support the FastToken
unit tests, which focus on testing the parser generator and executor.
Simplicity is the prime driver here; no need for expressive power or
speed.

Aflex compiles all the regular expressions for all of the tokens into
one state machine, that visits each character in the input stream once.
You can't get faster than that.

> In the end it is always worth the effort to write a token scanner by
> hand.

"always" is way too strong a statement here.

If you trust that the regexp engine is well written and maintained, the
expressive power is adequate for your language, and the speed is
adequate for your application, then why waste resources reimplementing
the tools?  Use them and get on with the interesting work.

regexps are perfectly adequate for Ada.

> Firstly, there are not so many things you would have to recognize that
> way.

I guess you are saying that implementing a lexer for a restricted set
of tokens is easier than implementing a general regular expression
engine.

True, but that's not the choice at hand; the choice is between
implementing a new lexer for a restricted set of tokens, or reusing an
existing regular expression engine (supported and maintained
externally) and specifying a small set of regular expressions (most of
which are simple strings for the reserved words).

> Secondly, it is much more efficient than pattern matching.

Not if you use the Aflex approach; the hand-written OpenToken lexer is
far less efficient than the compiled state machine that Aflex produces.

> Thirdly, it would allow sane error messaging, because usually there
> are more outcomes than just matched vs. not matched, e.g. malformed
> identifier or missing quotation mark.

This is a valid but minor point.

For Ada strings, since new line is excluded, a missing quotation mark
does not produce a very confusing error message (which is precisely why
new line is excluded).  Other languages are worse for string errors,
but the parser stage can provide a better error message if desired.

Syntax highlighting in the typical IDE is the best way to address this
particular problem; when the entire rest of the file changes color, you
know you are missing a quote.

It's not an issue for the FastToken unit tests, and the syntax error
messages from OpenToken in other contexts have not bothered me yet
(certainly not as much as some Microsoft error messages: "cannot load
the specified module" indeed; tell me _which_ module!).  Nor have I
gotten any complaints in that area from ada-mode users.

--
-- Stephe
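[Editor's illustration]  The Aflex idea Stephen describes (combine the per-token regular expressions into one machine and scan the input once) can be approximated in Python with a single alternation of named groups; the `re` engine compiles the master pattern once and the input is visited in a single left-to-right pass.  The token names and patterns below are invented to mirror the OpenToken syntax from the original post, not taken from Aflex or FastToken.

```python
import re

# One master pattern built from per-token regexes.  The alternation is
# tried in order at each position, so COMMENT must precede anything that
# could match a leading '-'.
TOKEN_SPECS = [
    ("COMMENT",    r"--[^\n]*"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9_]*"),
    ("DOT",        r"\."),
    ("WHITESPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))

def tokenize(text):
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "WHITESPACE":     # drop whitespace, keep the rest
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("Ada.Text_IO  -- a comment"))
```

Note that `finditer` silently skips characters no pattern matches; a real lexer would raise an error there, which is one place the hand-written-scanner argument below gets its traction.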
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-05  9:03 ` Stephen Leake
@ 2015-06-05  9:23   ` Georg Bauhaus
  2015-06-05 20:49     ` Shark8
  2015-06-05 12:20 ` Dmitry A. Kazakov
  2015-06-05 20:53 ` Shark8
  2 siblings, 1 reply; 26+ messages in thread

From: Georg Bauhaus @ 2015-06-05 9:23 UTC (permalink / raw)

On 05.06.15 11:03, Stephen Leake wrote:
> Aflex compiles all the regular expressions for all of the tokens into
> one state machine, that visits each character in the input stream once.
> You can't get faster than that.

You can get faster than that if the OS supports line-oriented I/O of
text:

   NotInString -> '-' ... -> '-' -> SKIPThisLine

:)
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-05  9:23 ` Georg Bauhaus
@ 2015-06-05 20:49   ` Shark8
  2015-06-05 23:52     ` Dennis Lee Bieber
  0 siblings, 1 reply; 26+ messages in thread

From: Shark8 @ 2015-06-05 20:49 UTC (permalink / raw)

On Friday, June 5, 2015 at 3:23:40 AM UTC-6, Georg Bauhaus wrote:
> On 05.06.15 11:03, Stephen Leake wrote:
>> Aflex compiles all the regular expressions for all of the tokens into
>> one state machine, that visits each character in the input stream once.
>> You can't get faster than that.
>
> You can get faster than that if the OS supports
> line-oriented I/O of text:
>
>    NotInString -> '-' ... -> '-' -> SKIPThisLine :)

Hm, I seem to recall something about OpenVMS text-file formats that
might qualify there.
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-05 20:49 ` Shark8
@ 2015-06-05 23:52   ` Dennis Lee Bieber
  0 siblings, 0 replies; 26+ messages in thread

From: Dennis Lee Bieber @ 2015-06-05 23:52 UTC (permalink / raw)

On Fri, 5 Jun 2015 13:49:02 -0700 (PDT), Shark8
<onewingedshark@gmail.com> declaimed the following:

> On Friday, June 5, 2015 at 3:23:40 AM UTC-6, Georg Bauhaus wrote:
>> On 05.06.15 11:03, Stephen Leake wrote:
>>> Aflex compiles all the regular expressions for all of the tokens into
>>> one state machine, that visits each character in the input stream once.
>>> You can't get faster than that.
>>
>> You can get faster than that if the OS supports
>> line-oriented I/O of text:
>>
>>    NotInString -> '-' ... -> '-' -> SKIPThisLine :)
>
> Hm, I seem to recall something about OpenVMS text-file formats that
> might qualify there.

VMS had an abundance of RMS (Record Management Services) formats even
for text files -- one of which was "carriage control" (<lf> delimited
lines).

And don't get me started on the FORTRAN "native" "segmented record"
type (a logical record is split across one or more physical records,
wherein each physical record had marker bits for first, continuation,
and last record, and a record length value; a continuation set neither
flag, and a short record would have both the first and last bits set).

--
Wulfraed                 Dennis Lee Bieber         AF6VN
wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-05  9:03 ` Stephen Leake
  2015-06-05  9:23   ` Georg Bauhaus
@ 2015-06-05 12:20 ` Dmitry A. Kazakov
  2015-06-16 12:43   ` Stephen Leake
  0 siblings, 1 reply; 26+ messages in thread

From: Dmitry A. Kazakov @ 2015-06-05 12:20 UTC (permalink / raw)

On Fri, 05 Jun 2015 04:03:27 -0500, Stephen Leake wrote:

> Aflex compiles all the regular expressions for all of the tokens into
> one state machine, that visits each character in the input stream once.
> You can't get faster than that.

It is visiting source characters vs. visiting internal states of the
machine and transition computations.  You cannot say in advance which
is more expensive.  A transition computation may be much more expensive
than accessing a character stored in an array.  However, hand-written
scanners usually do not roll back either.

>> In the end it is always worth the effort to write a token scanner by
>> hand.
>
> "always" is way too strong a statement here.
>
> If you trust that the regexp engine is well written and maintained, the
> expressive power is adequate for your language, and the speed is
> adequate for your application, then why waste resources reimplementing
> the tools?  Use them and get on with the interesting work.
>
> regexps are perfectly adequate for Ada.

Examples I had in mind were the Ada identifier and the Ada string
literal, e.g. in UTF-8 encoding.  I don't think a regular expression
for these would be shorter than a scanner hand-written in Ada.

>> Firstly, there are not so many things you would have to recognize that
>> way.
>
> I guess you are saying that implementing a lexer for a restricted set
> of tokens is easier than implementing a general regular expression
> engine.
>
> True, but that's not the choice at hand; the choice is between
> implementing a new lexer for a restricted set of tokens, or reusing an
> existing regular expression engine (supported and maintained
> externally) and specifying a small set of regular expressions (most of
> which are simple strings for the reserved words).

Yes, it is writing a regular expression vs. writing an Ada program.  I
prefer the Ada program.

>> Thirdly, it would allow sane error messaging, because usually there
>> are more outcomes than just matched vs. not matched, e.g. malformed
>> identifier or missing quotation mark.
>
> This is a valid but minor point.
>
> For Ada strings, since new line is excluded, a missing quotation mark
> does not produce a very confusing error message (which is precisely why
> new line is excluded).

Which is tricky in the latest standard, because the RM refers to line
terminating characters, whereas the OS file system may have its own
definition of line end.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-05 12:20 ` Dmitry A. Kazakov
@ 2015-06-16 12:43   ` Stephen Leake
  2015-06-16 13:24     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 26+ messages in thread

From: Stephen Leake @ 2015-06-16 12:43 UTC (permalink / raw)

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On Fri, 05 Jun 2015 04:03:27 -0500, Stephen Leake wrote:
>
>> Aflex compiles all the regular expressions for all of the tokens into
>> one state machine, that visits each character in the input stream once.
>> You can't get faster than that.
>
> It is visiting source characters vs. visiting internal states of the
> machine and transition computations.  You cannot say in advance which
> is more expensive.

True; measuring is always required.

> A transition computation may be much more expensive than accessing a
> character stored in an array.  However, hand-written scanners usually
> do not roll back either.

Ok.

>>> In the end it is always worth the effort to write a token scanner by
>>> hand.
>>
>> "always" is way too strong a statement here.
>>
>> If you trust that the regexp engine is well written and maintained, the
>> expressive power is adequate for your language, and the speed is
>> adequate for your application, then why waste resources reimplementing
>> the tools?  Use them and get on with the interesting work.
>>
>> regexps are perfectly adequate for Ada.
>
> Examples I had in mind were the Ada identifier and the Ada string
> literal, e.g. in UTF-8 encoding.  I don't think a regular expression
> for these would be shorter than a scanner hand-written in Ada.

Yes, but also not longer!

   regular expression for identifier:  "foo"
   code for identifier:                lexeme = "foo"

>>> Firstly, there are not so many things you would have to recognize that
>>> way.
>>
>> I guess you are saying that implementing a lexer for a restricted set
>> of tokens is easier than implementing a general regular expression
>> engine.
>>
>> True, but that's not the choice at hand; the choice is between
>> implementing a new lexer for a restricted set of tokens, or reusing an
>> existing regular expression engine (supported and maintained
>> externally) and specifying a small set of regular expressions (most of
>> which are simple strings for the reserved words).
>
> Yes, it is writing a regular expression vs. writing an Ada program.
> I prefer the Ada program.

I'm confused; you start by agreeing with my statement, but then say
something else.

I said "implement a general regular expression engine", meaning the Ada
code that interprets regular expressions at run time.  I hope you agree
that writing that is harder than writing an Ada scanner.

To address your point, let's compare writing a regular expression with
writing Ada code to do the same thing.

Here's the regular expression I use for Ada numeric literals:

   "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?"

Given that you are at least a little familiar with regular expressions,
there's nothing hard about that.

It does not enforce all the lexical rules for numbers; it allows
repeated, leading, and trailing underscores; it doesn't enforce pairs
of '#'.  But there is no legal Ada text that will be misinterpreted by
this expression.  It is always a trade-off between enforcing that sort
of rule in the lexer or the parser.

The OpenToken Ada code that does the same job (while enforcing some of
the rules) is in opentoken-recognizer-based_integer_real_ada.ads, .adb,
opentoken-recognizer-extended_digits.ads, .adb.  That's a total of 469
lines of code.  That's a state-based scanner; a recursive descent one
would be simpler.

It took several hours to write the Ada code, and a few more for the
tests.  But later I found a bug that required a new test and code
change.

It took a few minutes to write the regular expression, and I reused the
tests; no bugs found yet.

Obviously, if you are not familiar with regular expressions, they will
be harder to write.  But a software engineer should be willing to learn
the appropriate language for the job at hand.

To be fair, if you are used to writing Ada code that does scanning, and
have a good library for it, that job will be easier as well.  So it may
just come down to experience and available tools.

I would guess that the average good programmer, starting with no
knowledge of either, can learn enough about regular expressions to
write the above faster than they can learn enough about writing
scanners in Ada to do that job well.

--
-- Stephe
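[Editor's illustration]  Stephen's regex is written in Emacs syntax, where `\(` `\)` delimit groups.  Transliterated to the more common syntax (plain parentheses), its behavior can be checked directly; the sketch below is an editor's translation, not code from FastToken, and it deliberately reproduces the caveats Stephen concedes (unpaired '#' accepted) as well as the missing '_' that G.B. points out later in the thread.

```python
import re

# Stephen's Emacs-syntax regex "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?",
# transliterated: \( \) grouping becomes ( ).
ada_number = re.compile(r"([0-9]+#)?[-+0-9a-fA-F.]+(#)?")

def lexes(text):
    """True if the whole text matches the numeric-literal pattern."""
    return ada_number.fullmatch(text) is not None

print(lexes("16#FF#"), lexes("3.14"), lexes("1.0E+6"))  # legal literals
print(lexes("16#FF"))   # unpaired '#' is accepted anyway, as conceded
print(lexes("1_000"))   # '_' is not in the character class, so rejected
```

So the pattern accepts every legal literal it was tested against here except those containing underscores, and also accepts some illegal forms; that asymmetry is exactly the lexer-vs-parser trade-off being discussed.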
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-16 12:43 ` Stephen Leake
@ 2015-06-16 13:24   ` Dmitry A. Kazakov
  2015-06-16 14:13     ` G.B.
  2015-06-17 17:29     ` Stephen Leake
  0 siblings, 2 replies; 26+ messages in thread

From: Dmitry A. Kazakov @ 2015-06-16 13:24 UTC (permalink / raw)

On Tue, 16 Jun 2015 07:43:49 -0500, Stephen Leake wrote:

> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>
>> Yes, it is writing a regular expression vs. writing an Ada program.
>> I prefer the Ada program.
>
> I'm confused; you start by agreeing with my statement, but then say
> something else.
>
> I said "implement a general regular expression engine", meaning the Ada
> code that interprets regular expressions at run time.  I hope you agree
> that writing that is harder than writing an Ada scanner.
>
> To address your point, let's compare writing a regular expression with
> writing Ada code to do the same thing.
>
> Here's the regular expression I use for Ada numeric literals:
>
>    "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?"
>
> Given that you are at least a little familiar with regular expressions,
> there's nothing hard about that.

It is hard.

> It does not enforce all the lexical rules for numbers; it allows
> repeated, leading, and trailing underscores; it doesn't enforce pairs
> of '#'.

That is exactly the point.  It does not parse the literal right, and
you have to reparse the matched chunk of text once again.  What was the
gain?  Why not do it right in a single step?

> It took a few minutes to write the regular expression, and I reused the
> tests; no bugs found yet.
>
> Obviously, if you are not familiar with regular expressions, they will
> be harder to write.  But a software engineer should be willing to learn
> the appropriate language for the job at hand.

That is not the point.  I am familiar with C, but I avoid writing
anything in C.  Regular expressions are a far worse language than C
and are, additionally, incapable of parsing Ada literals.  Why bother
with that mess?

> I would guess that the average good programmer, starting with no
> knowledge of either, can learn enough about regular expressions to
> write the above faster than they can learn enough about writing
> scanners in Ada to do that job well.

The problem is that writing a correct pattern for anything more complex
than trivial is hard even for people doing this on a daily basis.  For
an average programmer it is patently impossible.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-16 13:24 ` Dmitry A. Kazakov
@ 2015-06-16 14:13   ` G.B.
  2015-06-17 17:38     ` Stephen Leake
  0 siblings, 1 reply; 26+ messages in thread

From: G.B. @ 2015-06-16 14:13 UTC (permalink / raw)

On 16.06.15 15:24, Dmitry A. Kazakov wrote:
>> It does not enforce all the lexical rules for numbers; it allows
>> repeated, leading, and trailing underscores; it doesn't enforce pairs
>> of '#'.
>
> That is exactly the point.  It does not parse the literal right, and
> you have to reparse the matched chunk of text once again.  What was the
> gain?  Why not do it right in a single step?

(I believe the use case here permits simplifications, meaning that REs
are not being used for meticulous, final parsing of Ada.)

But '_' seems missing from "[-+0-9a-fA-F.]+".

(And obsolete Ada syntax, i.e. substitutes for '#'.  Which makes a CFG
parser more desirable if '#' or ':' should have matching occurrences.
;-)
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-16 14:13 ` G.B.
@ 2015-06-17 17:38   ` Stephen Leake
  0 siblings, 0 replies; 26+ messages in thread

From: Stephen Leake @ 2015-06-17 17:38 UTC (permalink / raw)

"G.B." <bauhaus@futureapps.invalid> writes:

> On 16.06.15 15:24, Dmitry A. Kazakov wrote:
>>> It does not enforce all the lexical rules for numbers; it allows
>>> repeated, leading, and trailing underscores; it doesn't enforce pairs
>>> of '#'.
>>
>> That is exactly the point.  It does not parse the literal right, and
>> you have to reparse the matched chunk of text once again.  What was
>> the gain?  Why not do it right in a single step?
>
> (I believe the use case here permits simplifications,
> meaning that REs are not being used for meticulous, final
> parsing of Ada.)

Yes, my examples are drawn from a parser for the indentation/navigation
engine in Emacs Ada mode, which is actually required to be looser about
Ada syntax rules than an Ada compiler.  Adding an "enforce strict
rules" requirement could change the lexer/parser design, especially if
you place an emphasis on good/helpful/useful error messages.

In addition, OpenToken (and FastToken) are intended for writing
grammar-based parsers quickly, not for getting the best possible
performance, or meeting other project-specific requirements.  So the
design prioritizes minimizing the amount of new code that must be
written for a new language.

> But '_' seems missing from "[-+0-9a-fA-F.]+".

Oops; good catch.  I would have found that in an ada-mode test, but I'm
not actually planning on using the Aflex lexer in Emacs.

> (And obsolete Ada syntax, i.e. substitutes for '#'.  Which makes a CFG
> parser more desirable if '#' or ':' should have matching occurrences.
> ;-)

Emacs ada-mode explicitly ignores such things (at least until someone
asks for it).

--
-- Stephe
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-16 13:24 ` Dmitry A. Kazakov
  2015-06-16 14:13   ` G.B.
@ 2015-06-17 17:29 ` Stephen Leake
  2015-06-17 17:42   ` Shark8
  2015-06-17 19:03   ` Dmitry A. Kazakov
  1 sibling, 2 replies; 26+ messages in thread

From: Stephen Leake @ 2015-06-17 17:29 UTC (permalink / raw)

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On Tue, 16 Jun 2015 07:43:49 -0500, Stephen Leake wrote:
>
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>
>> Here's the regular expression I use for Ada numeric literals:
>>
>>    "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?"
>>
>> Given that you are at least a little familiar with regular expressions,
>> there's nothing hard about that.
>
> It is hard.

Ok, I gather you are not "at least a little familiar with regular
expressions".  Too bad.

>> It does not enforce all the lexical rules for numbers; it allows
>> repeated, leading, and trailing underscores; it doesn't enforce pairs
>> of '#'.
>
> That is exactly the point.  It does not parse the literal right

It's _not_ a "parser"; it's a "lexer".

Define "right".  The line between lexer and parser is a design
decision, not set in stone.

> and you have to reparse the matched chunk of text once again.  What
> was the gain?

Doing it this way allows reusing a regexp engine, which is easier than
writing a lexer from scratch.

>> It took a few minutes to write the regular expression, and I reused
>> the tests; no bugs found yet.
>>
>> Obviously, if you are not familiar with regular expressions, they will
>> be harder to write.  But a software engineer should be willing to learn
>> the appropriate language for the job at hand.
>
> That is not the point.  I am familiar with C, but I avoid writing
> anything in C.  Regular expressions are a far worse language than C
> and are, additionally, incapable of parsing Ada literals.  Why bother
> with that mess?

As I have explained several times, I believe this approach is easier
than writing a lexer by hand; why bother with _that_ mess?

It's your choice.  It would be nice if you could admit that other
people can make other choices, and still write good code.

Perhaps you could post lexer code that does this "right" by your
definition, so we could judge for ourselves?

>> I would guess that the average good programmer, starting with no
>> knowledge of either, can learn enough about regular expressions to
>> write the above faster than they can learn enough about writing
>> scanners in Ada to do that job well.
>
> The problem is that writing a correct pattern for anything more
> complex than trivial is hard even for people doing this on a daily
> basis.  For an average programmer it is patently impossible.

"patently" is overkill here; this is simply not my experience.

--
-- Stephe
* Re: OpenToken: Parsing Ada (subset)? 2015-06-17 17:29 ` Stephen Leake @ 2015-06-17 17:42 ` Shark8 2015-06-17 19:03 ` Dmitry A. Kazakov 1 sibling, 0 replies; 26+ messages in thread From: Shark8 @ 2015-06-17 17:42 UTC (permalink / raw) On Wednesday, June 17, 2015 at 11:29:51 AM UTC-6, Stephen Leake wrote: > > It's _not_ a "parser"; it's a "lexer". > > Define "right". The line between lexer and parser is a design decision, > not set in stone. > > > and you have to > > reparse the matched chunk of text once again. What was the gain? > > Doing it this way allows reusing a regexp engine, which is easier than > writing a lexer from scratch. I just implemented a parser for BB-code (a subset, anyway) -- lexing is essentially finding '[', then the ']' after that occurrence, and checking whether '/' is the character immediately after '['. -- Even though that, in itself, isn't beyond the capabilities of regex, I wouldn't recommend using regex: the "simple" usually turns out to be deceptive, esp. WRT regex. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-17 17:29 ` Stephen Leake 2015-06-17 17:42 ` Shark8 @ 2015-06-17 19:03 ` Dmitry A. Kazakov 1 sibling, 0 replies; 26+ messages in thread From: Dmitry A. Kazakov @ 2015-06-17 19:03 UTC (permalink / raw) On Wed, 17 Jun 2015 12:29:48 -0500, Stephen Leake wrote: > "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: > >> On Tue, 16 Jun 2015 07:43:49 -0500, Stephen Leake wrote: >> >>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: >>> >>> Here's the regular expression I use for Ada numeric literals: >>> >>> "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?" >>> >>> Given that you are at least a little familiar with regular expressions, >>> there's nothing hard about that. >> >> It is hard. > > Ok, I gather you are not "at least a little familiar with regular > expressions". *Nobody* is familiar enough with them to be sure that the language generated by a pattern like the one above is that of the Ada numeric literals. Note "like", because your pattern obviously does not generate the Ada numeric literals. Things are actually much worse than mere complexity. It is a combination of complexity and weakness. Regular expressions cannot do stuff like Ada literals. Thus the patterns actually used are only approximations to what is required. The designer must know how the generated language differs from the required one. And the reader must read not only the program but also the mind of the pattern designer. >>> It does not enforce all the lexical rules for numbers; it allows >>> repeated, leading, and trailing underscores; it doesn't enforce pairs of >>> '#'. >> >> That is exactly the point. It does not parse the literal right > > It's _not_ a "parser"; it's a "lexer". > > Define "right". def Right: No false positives, no false negatives <=> Rejects only illegal literals, accepts only legal literals. >The line between lexer and parser is a design decision, > not set in stone. True, but we are not talking about higher-level things like maximum fraction length supported. 
Simple lexical stuff like: - matching '#'s - non-repeating '_'s - valid base number - the set of digits corresponding to the base etc., are all beyond the power of regular expressions. (Unlike SNOBOL patterns.) >> and you have to >> reparse the matched chunk of text once again. What was the gain? > Doing it this way allows reusing a regexp engine, which is easier than > writing a lexer from scratch. You still have to parse it again. With or without regular expressions, that work must be done; the only difference is in detecting the end of the lexeme, which is not a problem for a manually written scanner at all. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 26+ messages in thread
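[Editorial sketch] For contrast, here is the kind of hand-written check Dmitry is alluding to, sketched in Python for brevity. It enforces exactly the points he lists: matched '#'s, non-repeating '_'s, a valid base, and digits valid for that base. It is deliberately simplified (based integer literals only, no fractional part or exponent), and the function name is illustrative:

```python
import re

def is_valid_based_integer(text: str) -> bool:
    """Check an Ada based integer literal such as 16#FF# or 2#1010_1011#.
    Simplified sketch: no exponent, no fractional part."""
    # '#'s must come in a matching pair
    m = re.fullmatch(r'([0-9_]+)#([0-9a-fA-F_]+)#', text)
    if m is None:
        return False
    base_text, digits = m.groups()
    for part in (base_text, digits):
        # '_' may only appear singly, between digits
        if part.startswith('_') or part.endswith('_') or '__' in part:
            return False
    base = int(base_text.replace('_', ''))
    if not 2 <= base <= 16:
        return False          # Ada allows bases 2 .. 16
    # every extended digit must be valid for the stated base
    return all(int(d, 16) < base for d in digits.replace('_', ''))
```

Note that the regexp here is only used to split the lexeme; the actual rules live in ordinary code, which is the crux of the disagreement in this thread.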
* Re: OpenToken: Parsing Ada (subset)? 2015-06-05 9:03 ` Stephen Leake 2015-06-05 9:23 ` Georg Bauhaus 2015-06-05 12:20 ` Dmitry A. Kazakov @ 2015-06-05 20:53 ` Shark8 2015-06-16 14:46 ` Stephen Leake 2 siblings, 1 reply; 26+ messages in thread From: Shark8 @ 2015-06-05 20:53 UTC (permalink / raw) On Friday, June 5, 2015 at 3:03:30 AM UTC-6, Stephen Leake wrote: > > If you trust that the regexp engine is [1]well written and maintained, > [2]the expressive power is adequate for your language, and [3]the speed is > adequate for your application, then why waste resources reimplementing > the tools? Use them and get on with the interesting work. While #3 seems to not come up often, #1 and #2 seem to be far more "question-able" -- #2 is especially misevaluated a LOT. (If it wasn't we wouldn't see people trying to parse HTML or CSV with regex.) > regexp are perfectly adequate for Ada. Even something like Character'('C')? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-05 20:53 ` Shark8 @ 2015-06-16 14:46 ` Stephen Leake 2015-06-16 15:31 ` G.B. 2015-06-16 21:34 ` Randy Brukardt 0 siblings, 2 replies; 26+ messages in thread From: Stephen Leake @ 2015-06-16 14:46 UTC (permalink / raw) Shark8 <onewingedshark@gmail.com> writes: > On Friday, June 5, 2015 at 3:03:30 AM UTC-6, Stephen Leake wrote: >> >> If you trust that the regexp engine is [1]well written and maintained, >> [2]the expressive power is adequate for your language, and [3]the speed is >> adequate for your application, then why waste resources reimplementing >> the tools? Use them and get on with the interesting work. > > While #3 seems to not come up often, #1 and #2 seem to be far more > "question-able" -- #2 is especially misevaluated a LOT. (If it wasn't > we wouldn't see people trying to parse HTML or CSV with regex.) Yes. >> regexp are perfectly adequate for Ada. > > Even something like Character'('C')? Hmm. I've never had a problem with code like that, but it does seem like the lexer could treat '(' as a character literal, which would produce a parse error. Testing ... The Emacs lexer handles these properly: Character'('C') Character'( 'C' ) Character ' ( 'C' ) but not: Character '('C') (Which may be one reason I never write code the latter way :) The Emacs lexer regular expression for character literal is: "[^a-zA-Z0-9)]'[^'\n]'" which says if the first tick is preceded by identifier characters or right paren, it's not a character literal; that explains the above behavior, and works for typical Ada code in Emacs. But that regular expression doesn't work in a normal lexer, since it references text before the beginning of the desired lexeme. Emacs is _not_ a "normal lexer". 
Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL (handling the special case ''''), the Aflex lexer handles the above cases as follows: Character'('C') IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN Character'( 'C' ) IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN Character '('C') IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN Character ' ( 'C' ) IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN This is as expected (but not desired). One way to handle this is to provide for feedback from the parser to the lexer; if a parse fails, push back the character literal, tell the lexer to treat the first single quote as a TICK, and proceed. I'll work on implementing that in FastToken with the Aflex lexer; it will be a good example. Another way is to treat this particular sequence of tokens as a valid expression, but rewrite it before handing off to the rest of the parser. That requires identifying all such special cases; not too hard. A third choice is to not define a CHARACTER_LITERAL token; then the sequence of tokens is always IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN and the parser must identify the character literal, or the grammar must be re-written in the same manner. That may be the simplest solution. If I recall correctly, this issue has been discussed here before, and the proposed solutions were similar. I don't know how GNAT handles this. I think the statement "regular expressions are perfectly adequate for Ada" stands; this case just shows that the parser must be complicated if the lexer is not. This case is a good example of the possible trade-offs between lexer and parser complexity; the Emacs lexer handles all typical cases without feedback from the parser, but is more complex than an Aflex lexer. The Aflex lexer handles the same cases, but requires feedback from the parser or other complexity. -- -- Stephe ^ permalink raw reply [flat|nested] 26+ messages in thread
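[Editorial sketch] The Emacs look-back trick has a close analogue in any regexp engine that supports lookbehind. This Python transliteration (an illustration, not the Emacs engine) reproduces both the handled cases and the one documented failure:

```python
import re

# Transliteration of the Emacs pattern "[^a-zA-Z0-9)]'[^'\n]'" using a
# lookbehind, so the preceding character is inspected but not consumed.
CHAR_LIT = re.compile(r"(?<![A-Za-z0-9)])'[^'\n]'")

print(CHAR_LIT.findall("Character'('C')"))    # ["'C'"]  - handled
print(CHAR_LIT.findall("Character'( 'C' )"))  # ["'C'"]  - handled
print(CHAR_LIT.findall("Character '('C')"))   # ["'('"]  - the mislexed case
```

In the last case the first quote is preceded only by a space, so the pattern greedily takes "'('" as a character literal, which is exactly the failure Stephen reports for `Character '('C')`.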
* Re: OpenToken: Parsing Ada (subset)? 2015-06-16 14:46 ` Stephen Leake @ 2015-06-16 15:31 ` G.B. 2015-06-17 17:44 ` Stephen Leake 2015-06-16 21:34 ` Randy Brukardt 1 sibling, 1 reply; 26+ messages in thread From: G.B. @ 2015-06-16 15:31 UTC (permalink / raw) On 16.06.15 16:46, Stephen Leake wrote: > Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL ISO/IEC 10646 question: Is [^'] matching a single wide character? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-16 15:31 ` G.B. @ 2015-06-17 17:44 ` Stephen Leake 0 siblings, 0 replies; 26+ messages in thread From: Stephen Leake @ 2015-06-17 17:44 UTC (permalink / raw) "G.B." <bauhaus@futureapps.invalid> writes: > On 16.06.15 16:46, Stephen Leake wrote: >> Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL > > ISO/IEC 10646 question: Is [^'] matching a single wide character? I don't think this is an ISO question; the Emacs regexp engine does not claim to match an ISO regexp spec (at least, not that I'm aware of). In practice, this does work for non-ASCII characters in Emacs, so the answer is "yes" for Emacs. I haven't tested this with Aflex yet; I don't think Aflex will work with non-ASCII text (one of the reasons I'm not planning to use it with Emacs). I suspect it would require a major rewrite to get Aflex to work with UTF-8 or some other non-ASCII representation. -- -- Stephe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-16 14:46 ` Stephen Leake 2015-06-16 15:31 ` G.B. @ 2015-06-16 21:34 ` Randy Brukardt 2015-06-17 17:58 ` Stephen Leake 1 sibling, 1 reply; 26+ messages in thread From: Randy Brukardt @ 2015-06-16 21:34 UTC (permalink / raw) "Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message news:85k2v3aeyv.fsf@stephe-leake.org... > Shark8 <onewingedshark@gmail.com> writes: > >> On Friday, June 5, 2015 at 3:03:30 AM UTC-6, Stephen Leake wrote: >>> >>> If you trust that the regexp engine is [1]well written and maintained, >>> [2]the expressive power is adequate for your language, and [3]the speed >>> is >>> adequate for your application, then why waste resources reimplementing >>> the tools? Use them and get on with the interesting work. >> >> While #3 seems to not come up often, #1 and #2 seem to be far more >> "question-able" -- #2 is especially misevaluated a LOT. (If it wasn't >> we wouldn't see people trying to parse HTML or CSV with regex.) > > Yes. > >>> regexp are perfectly adequate for Ada. >> >> Even something like Character'('C')? > > Hmm. I've never had problem with code like that, but it does seem like > the lexer could treat '(' as a character literal, which would produce a > parse error. > > Testing ... > > The Emacs lexer handles these properly: > > Character'('C') > Character'( 'C' ) > Character ' ( 'C' ) > > but not: > > Character '('C') > > (Which may be one reason I never write code the latter way :) > > The Emacs lexer regular expression for character literal is: > > "[^a-zA-Z0-9)]'[^'\n]'" > > which says if the first tick is preceded by identifier characters or > right paren, it's not a character literal; that explains the above > behavior, and works for typical Ada code in Emacs. > > But that regular expression doesn't work in a normal lexer, since it > references text before the beginning of the desired lexeme. Emacs is > _not_ a "normal lexer". 
> > Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL > (handling the special case ''''), the Aflex lexer handles the > above cases as follows: > > Character'('C') > IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN > > Character'( 'C' ) > IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN > > Character '('C') > IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN > > Character ' ( 'C' ) > IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN > > This is as expected (but not desired). > > One way to handle this is to provide for feedback from the parser to the > lexer; if a parse fails, push back the character literal, tell the lexer > to treat the first single quote as a TICK, and procede. I'll work on > implementing that in FastToken with the Aflex lexer; it will be a good > example. > > Another way is to treat this particular sequence of tokens as a valid > expression, but rewrite it before handing off to the rest of the parser. > That requires identifying all such special cases; not too hard. > > A third choice is to not define a CHARACTER_LITERAL token; then the > sequence of tokens is always > > IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN > > and the parser must identify the character literal, or the grammar must > be re-written in the same manner. That may be the simplest solution. > > If I recall correctly, this issue has been discussed here before, and > the proposed solutions were similar. I don't know how GNAT handles this. I don't think you identified the solution that is typically used: remember the previous token identified by the lexer. Then, when encountering an apostrophe, the token is unconditionally an apostrophe if the preceding token is "all", an identifier, a character or string literal, or an rparen; else it might be a character literal. No "feedback from the parser" needed (that seems like a nightmare to me). The method was originally proposed by Tischler in Ada Letters in July 1983, pg 36. 
(I got this out of the comments of Janus/Ada, of course.) I tend to agree with Dmitry; for lexing Ada, regular expressions are just not going to work; you'll need too many fixups to make them worth the trouble. Just write the thing in Ada, it won't take you any longer than figuring out the correct regular expression for an identifier. And that makes it easy to handle the weird special cases of Ada. Other approaches are going to lex some programs incorrectly; how important that is will vary depending on what kind of tool you are writing but since the effort is similar, it's hard to see the advantage of a regular expression or other "automatic" lexer. (It makes much more sense for a parser, where the effort can be orders of magnitude different.) Randy. ^ permalink raw reply [flat|nested] 26+ messages in thread
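[Editorial sketch] The previous-token rule Randy describes is straightforward to sketch. The toy Python lexer below (identifiers, parens, strings, character literals, and ticks only; all names are mine, not Janus/Ada's) applies it: a quote following an identifier, a literal, or a right paren is unconditionally a tick, otherwise the lexer tries to read a full character literal. (The reserved word "all" falls under IDENT here; the '''' special case is omitted for brevity.)

```python
import re

# Token kinds that force a following ' to be an apostrophe (tick)
# rather than the start of a character literal (Tischler's rule).
TICK_BEFORE = {'IDENT', 'CHAR', 'STRING', 'RPAREN'}

TOKEN_RE = re.compile(r"""
      (?P<IDENT>[A-Za-z][A-Za-z0-9_]*)
    | (?P<STRING>"(?:[^"]|"")*")
    | (?P<LPAREN>\() | (?P<RPAREN>\))
    | (?P<QUOTE>')
    | (?P<WS>\s+)
""", re.VERBOSE)

def lex(text):
    tokens, prev, i = [], None, 0
    while i < len(text):
        m = TOKEN_RE.match(text, i)
        if not m:
            raise ValueError(f"unexpected character at {i}")
        kind = m.lastgroup
        if kind == 'WS':                      # skip, but keep prev token
            i = m.end()
            continue
        if kind == 'QUOTE':
            if prev in TICK_BEFORE:           # unconditionally a tick
                kind, lexeme = 'TICK', "'"
                i = m.end()
            elif i + 2 < len(text) + 1 and text[i + 2:i + 3] == "'":
                kind, lexeme = 'CHAR', text[i:i + 3]   # 'x'
                i += 3
            else:
                kind, lexeme = 'TICK', "'"
                i = m.end()
        else:
            lexeme = m.group()
            i = m.end()
        tokens.append((kind, lexeme))
        prev = kind
    return tokens
```

With this rule, `Character'('C')` lexes as IDENT TICK LPAREN CHAR RPAREN with no feedback from the parser, which is Randy's point.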
* Re: OpenToken: Parsing Ada (subset)? 2015-06-16 21:34 ` Randy Brukardt @ 2015-06-17 17:58 ` Stephen Leake 2015-06-17 20:44 ` Randy Brukardt ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Stephen Leake @ 2015-06-17 17:58 UTC (permalink / raw) "Randy Brukardt" <randy@rrsoftware.com> writes: > "Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message > news:85k2v3aeyv.fsf@stephe-leake.org... >> One way to handle this is to provide for feedback from the parser to the >> lexer; if a parse fails, push back the character literal, tell the lexer >> to treat the first single quote as a TICK, and procede. I'll work on >> implementing that in FastToken with the Aflex lexer; it will be a good >> example. >> >> Another way is to treat this particular sequence of tokens as a valid >> expression, but rewrite it before handing off to the rest of the parser. >> That requires identifying all such special cases; not too hard. >> >> A third choice is to not define a CHARACTER_LITERAL token; then the >> sequence of tokens is always >> >> IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN >> >> and the parser must identify the character literal, or the grammar must >> be re-written in the same manner. That may be the simplest solution. >> >> If I recall correctly, this issue has been discussed here before, and >> the proposed solutions were similar. I don't know how GNAT handles this. > > I don't think you identified the solution that is typically used: remember > the previous token identified by the lexer. Then, when encountering an > apostrophe, the token is unconditionally an apostrophe if the preceding > token is "all", an identifier, a character or string literal, or an rparen; > else it might be a character literal. That's the third choice above; the lexer returns TICK (= apostrophe) for all cases, and the parser deals with further classification. Hmm. Unless you are saying that logic is in the lexer; I don't see that it matters much. 
Aflex does have a provision for adding some logic in a lexer, although I'm not sure it supports "remember the previous token". > No "feedback from the parser" needed > (that seems like a nightmare to me). The method was originally proposed by > Tischler in Ada Letters in July 1983, pg 36. (I got this out of the comments > of Janus/Ada, of course.) > > I tend to agree with Dmitry; for lexing Ada, regular expressions are just > not going to work; you'll need too many fixups to make them worth the > trouble. Just write the thing in Ada, it won't take you any longer than > figuring out the correct regular expression for an identifier. And that > makes it easy to handle the weird special cases of Ada. > > Other approaches are going to lex some programs incorrectly; how important > that is will vary depending on what kind of tool you are writing but since > the effort is similar, it's hard to see the advantage of a regular > expression or other "automatic" lexer. (It makes much more sense for a > parser, where the effort can be orders of magnitude different.) Ok. I guess I'd like to see some actual examples of hand-written lexers. The one in OpenToken is not inspiring to me; that's why I got rid of it for FastToken (it's definitely easier for me to write regexps than to write another OpenToken recognizer (= lexer module)). I have looked briefly at the GNAT lexer. It is highly optimized, and is apparently generated from some SNOBOL sources (ie, _not_ "hand written"). For example, it uses nested if-then-else on each character of each keyword; not something you want to do by hand. -- -- Stephe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-17 17:58 ` Stephen Leake @ 2015-06-17 20:44 ` Randy Brukardt 2015-06-18 7:51 ` AdaMagica 2015-06-18 9:12 ` Georg Bauhaus 2 siblings, 0 replies; 26+ messages in thread From: Randy Brukardt @ 2015-06-17 20:44 UTC (permalink / raw) "Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message news:85h9q68bf8.fsf@stephe-leake.org... > "Randy Brukardt" <randy@rrsoftware.com> writes: > >> "Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message >> news:85k2v3aeyv.fsf@stephe-leake.org... > >>> One way to handle this is to provide for feedback from the parser to the >>> lexer; if a parse fails, push back the character literal, tell the lexer >>> to treat the first single quote as a TICK, and procede. I'll work on >>> implementing that in FastToken with the Aflex lexer; it will be a good >>> example. >>> >>> Another way is to treat this particular sequence of tokens as a valid >>> expression, but rewrite it before handing off to the rest of the parser. >>> That requires identifying all such special cases; not too hard. >>> >>> A third choice is to not define a CHARACTER_LITERAL token; then the >>> sequence of tokens is always >>> >>> IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN >>> >>> and the parser must identify the character literal, or the grammar must >>> be re-written in the same manner. That may be the simplest solution. >>> >>> If I recall correctly, this issue has been discussed here before, and >>> the proposed solutions were similar. I don't know how GNAT handles this. >> >> I don't think you identified the solution that is typically used: >> remember >> the previous token identified by the lexer. Then, when encountering an >> apostrophe, the token is unconditionally an apostrophe if the preceding >> token is "all", an identifier, a character or string literal, or an >> rparen; >> else it might be a character literal. 
> > That's the third choice above; the lexer returns TICK (= apostrophe) for > all cases, and the parser deals with further classification. > > Hmm. Unless you are saying that logic is in the lexer; I don't see that > it matters much. Of course it's in the lexer, I don't see how it could be in the parser (especially if the parser is table-driven). The character following the Tick in a character literal could be anything, including characters that the parser doesn't even see (white space, characters like $, %, and # that aren't legal tokens, etc.). Trying to handle that in the parser would greatly complicate the grammar and most likely would introduce ambiguities into it. I could see it working in a hand-written parser, but that doesn't make much sense (why use a table-driven lexer and a hand-written parser? -- the parser is far more work). I also could see it working if one introduced a filter stage in between the lexer and parser -- but then you've just introduced a lot more complexity. ... > I have looked briefly at the GNAT lexer. It is highly optimized, and is > apparently generated from some SNOBOL sources (ie, _not_ "hand > written"). For example, it uses nested if-then-else on each character of > each keyword; not something you want to do by hand. The one in Janus/Ada started out life as a table-driven lexer that we were required to use in our original CS 701 compiler. But that couldn't handle the little glitches of Ada and it was huge, we needed something much smaller for our original CP/M compiler (which had to fit in 48K RAM). So it was replaced by hand-written code in the same style (essentially replacing the table transitions with case statements). (Actually, I think it was originally generated by a program from the transition table, and then hand-pruned because much of it was clearly redundant.) It uses a character classification table to reduce the number of transitions needed. 
For most classes, there are very few transitions (if you see "=", there isn't much that can follow it). It's fairly large (the entire unit is 2400 lines), but a large part of that is error handling code, code to create token records (which for efficiency is repeated everywhere rather than put into a subprogram - this code is very "hot" in a profile of the compiler, and anything that we can save matters) and code to manage identifiers and reserved words. I suspect I could do it in much less space if I was starting from scratch (which I probably will have to do to support UTF-8). Randy. ^ permalink raw reply [flat|nested] 26+ messages in thread
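[Editorial sketch] The structure Randy describes (a character classification table feeding a dispatch with few transitions per class) can be illustrated as follows; everything here is a toy of my own, not Janus/Ada's actual code:

```python
def classify(ch):
    """Character classification step (a function here; Janus/Ada uses a table)."""
    if ch.isalpha() or ch == '_':
        return 'letter'
    if ch.isdigit():
        return 'digit'
    if ch.isspace():
        return 'space'
    return 'delim'

def scan(text):
    """One branch per character class; each class has few transitions."""
    tokens, i = [], 0
    while i < len(text):
        cls = classify(text[i])
        if cls == 'space':
            i += 1
        elif cls == 'letter':              # identifier: letters, digits, '_'
            j = i + 1
            while j < len(text) and classify(text[j]) in ('letter', 'digit'):
                j += 1
            tokens.append(('IDENT', text[i:j]))
            i = j
        elif cls == 'digit':               # grossly simplified numeric literal
            j = i + 1
            while j < len(text) and classify(text[j]) == 'digit':
                j += 1
            tokens.append(('NUM', text[i:j]))
            i = j
        else:                              # delimiter: e.g. "=" has few followers
            if text[i] == '=' and text[i + 1:i + 2] == '>':
                tokens.append(('ARROW', '=>'))
                i += 2
            else:
                tokens.append(('DELIM', text[i]))
                i += 1
    return tokens
```

The point of the classification step is that each branch only has to consider a handful of successor characters, as with "=" above.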
* Re: OpenToken: Parsing Ada (subset)? 2015-06-17 17:58 ` Stephen Leake 2015-06-17 20:44 ` Randy Brukardt @ 2015-06-18 7:51 ` AdaMagica 2015-06-18 9:12 ` Georg Bauhaus 2 siblings, 0 replies; 26+ messages in thread From: AdaMagica @ 2015-06-18 7:51 UTC (permalink / raw) On Wednesday, June 17, 2015 at 19:58:06 UTC+2, Stephen Leake wrote: > Ok. I guess I'd like to see some actual examples of hand-written lexers. > The one in OpenToken is not inspiring to me; that's why I got rid of it > for FastToken (it's definitely easier for me to write regexps than to > write another OpenToken recognizer (= lexer module)). I found it very simple to write the multitude of recognizers needed for the Ada, Java, and HTML lexers I wrote for OpenToken. Maybe not inspiring, but simple. OK, forget the HTML lexer - it's unsatisfying. HTML syntax is incomparable to Ada's. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-17 17:58 ` Stephen Leake 2015-06-17 20:44 ` Randy Brukardt 2015-06-18 7:51 ` AdaMagica @ 2015-06-18 9:12 ` Georg Bauhaus 2 siblings, 0 replies; 26+ messages in thread From: Georg Bauhaus @ 2015-06-18 9:12 UTC (permalink / raw) On 17.06.15 19:58, Stephen Leake wrote: > Ok. I guess I'd like to see some actual examples of hand-written lexers. My silly Ada source highlighter effectively used look-back, too, as a state in the tokenizer, IIRC. Bigger language tokens would be built from smaller token pieces; there are definitions of Delimiter_1 and Delimiter_2, a constraint expressing a relation between Operator_1 and Delimiter_1 (of inclusion); a number of Character_Set-s and Is_* functions for classifying; but still not all tokens are classified correctly (bugs or leniency, depending on POV of this program). Incidentally, the ubiquitous highlite program has similar shortcomings when parsing Ada or Perl (they share the '''), last time I looked. Some more input for testing, omitting spaces and combinations: Character'('''); Character('''); Character'('('); Character('('); Character'(-'''); Character'('-'-'-'); Character'(-'''-'-'); Character'(''')'Alignment; Name(''')'Has_Same_Storage; Character'Pos('''); Character'Base'('''); Character'Base'Pos('''); ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-01 13:08 OpenToken: Parsing Ada (subset)? Jacob Sparre Andersen 2015-06-02 22:12 ` Stephen Leake @ 2015-06-17 17:50 ` AdaMagica 1 sibling, 0 replies; 26+ messages in thread From: AdaMagica @ 2015-06-17 17:50 UTC (permalink / raw) Just to add for completeness: The Ada lexer written by me in OpenToken handles Character'('C') correctly. The solution I found is not ideal. ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2015-06-18 9:12 UTC | newest] Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-06-01 13:08 OpenToken: Parsing Ada (subset)? Jacob Sparre Andersen 2015-06-02 22:12 ` Stephen Leake 2015-06-03 1:43 ` Shark8 2015-06-03 7:36 ` Dmitry A. Kazakov 2015-06-05 9:03 ` Stephen Leake 2015-06-05 9:23 ` Georg Bauhaus 2015-06-05 20:49 ` Shark8 2015-06-05 23:52 ` Dennis Lee Bieber 2015-06-05 12:20 ` Dmitry A. Kazakov 2015-06-16 12:43 ` Stephen Leake 2015-06-16 13:24 ` Dmitry A. Kazakov 2015-06-16 14:13 ` G.B. 2015-06-17 17:38 ` Stephen Leake 2015-06-17 17:29 ` Stephen Leake 2015-06-17 17:42 ` Shark8 2015-06-17 19:03 ` Dmitry A. Kazakov 2015-06-05 20:53 ` Shark8 2015-06-16 14:46 ` Stephen Leake 2015-06-16 15:31 ` G.B. 2015-06-17 17:44 ` Stephen Leake 2015-06-16 21:34 ` Randy Brukardt 2015-06-17 17:58 ` Stephen Leake 2015-06-17 20:44 ` Randy Brukardt 2015-06-18 7:51 ` AdaMagica 2015-06-18 9:12 ` Georg Bauhaus 2015-06-17 17:50 ` AdaMagica