OpenToken: Handling the empty word token

comp.lang.ada
 help / color / mirror / Atom feed

* OpenToken: Handling the empty word token
@ 2012-01-27 16:22 mtrenkmann
  2012-01-27 16:48 ` Dmitry A. Kazakov
  2012-01-28 10:46 ` Stephen Leake
  0 siblings, 2 replies; 12+ messages in thread
From: mtrenkmann @ 2012-01-27 16:22 UTC (permalink / raw)


Hello all.

Very often grammars have so called epsilon-productions where one
alternative for a non-terminal symbol points to the empty word
(epsilon).

For example: Optional -> Something | epsilon

In OpenToken I modeled the epsilon token as an
OpenToken.Recognizer.Nothing.Instance and defined the production like
this:

Optional <= Something and
Optional <= epsilon

Now I realized that the lexer would actually never emit the epsilon
token, because of it's pure formal meaning, and thus the second
production would never be detected.

Is there a way to instrument the parser to silently accept the epsilon
token whenever it expects it without consuming a token from the lexer,
or is it a common convention to translate each grammar into a epsilon-
free representation?

Thanks in advance.

-- Martin



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-27 16:22 OpenToken: Handling the empty word token mtrenkmann
@ 2012-01-27 16:48 ` Dmitry A. Kazakov
  2012-01-28  3:42   ` Randy Brukardt
  2012-01-28 10:46 ` Stephen Leake
  1 sibling, 1 reply; 12+ messages in thread
From: Dmitry A. Kazakov @ 2012-01-27 16:48 UTC (permalink / raw)

On Fri, 27 Jan 2012 08:22:12 -0800 (PST), mtrenkmann wrote:

> Is there a way to instrument the parser to silently accept the epsilon
> token whenever it expects it without consuming a token from the lexer,
> or is it a common convention to translate each grammar into a epsilon-
> free representation?

I use neither explicit grammars nor OpenToken, so it is possible that I
didn't really understand the problem you have.

For a table-driven parser, which I am using, there are two ways to handle
implied tokens, which, I suppose, is what you wanted:

1. You could put an empty string into the table and let that be recognized.
It is a straightforward approach which has the disadvantage that empty
string is always matched, so when you have a recursive grammar you might
run into an endless loop of matching the empty string over and over again
if the parser's state is not really changed.

2. A better way is to connect to the semantic callback on "something
expected." For example, to handle for example assumed multiplication in
infix expressions, e.g. A B + C => A*B + C, I connect to "on missing
operation" and push the multiplication onto the operation stack. That is.
This is not really different from how you would handle a missing right
bracket. Once the parser detected that, you would push the bracket onto the
stack and continue (if you allow recovery on such errors).

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-27 16:48 ` Dmitry A. Kazakov
@ 2012-01-28  3:42   ` Randy Brukardt
  2012-01-29 17:45     ` Stephen Leake
  0 siblings, 1 reply; 12+ messages in thread
From: Randy Brukardt @ 2012-01-28  3:42 UTC (permalink / raw)

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:1jvlv7i0tn14u.b5d2cwsqhl2h$.dlg@40tude.net...
> On Fri, 27 Jan 2012 08:22:12 -0800 (PST), mtrenkmann wrote:
>
>> Is there a way to instrument the parser to silently accept the epsilon
>> token whenever it expects it without consuming a token from the lexer,
>> or is it a common convention to translate each grammar into a epsilon-
>> free representation?
>
> I use neither explicit grammars nor OpenToken, so it is possible that I
> didn't really understand the problem you have.

Like Dmitry, I don't use OpenToken, but I do use a LALR(1) parser generator 
(ours originates in a University of Wisconsin research project from the late 
1970s).

In all of the grammars I've seen, you don't write anything for an epsilon 
production; that's because you are matching nothing. But there is no problem 
in matching nothing, so long as your grammar generator is powerful enough 
(uses at least LALR(1) parsing, or perhaps LR(1) parsing). In that case, 
matching nothing works so long as the follow sets are disjoint (something 
that fails to be true periodically in our Ada grammar).

For instance, here's the grammar for parameter modes from the Janus/Ada 
compiler grammar:

mode ::= IN ## 93
    | OUT ## 94
    | IN OUT ## 95
    |   ## 198

Note that the last production is an epsilon production. The ## part gives an 
action number associated with the matching of that particular alternative of 
this production. The ## part also marks the end of the production (it's 
optional, and | also ends a production -- but it's required on the last 
alternative as the grammar of our grammar uses insignificant line endings 
like Ada does).

I'd be surprised if OpenToken didn't have something similar; and if it 
doesn't, you probably need to upgrade to a better grammar generator.

                                                           Randy.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-27 16:22 OpenToken: Handling the empty word token mtrenkmann
  2012-01-27 16:48 ` Dmitry A. Kazakov
@ 2012-01-28 10:46 ` Stephen Leake
  2012-01-30 16:28   ` mtrenkmann
  1 sibling, 1 reply; 12+ messages in thread
From: Stephen Leake @ 2012-01-28 10:46 UTC (permalink / raw)

mtrenkmann <martin.trenkmann@googlemail.com> writes:

> Very often grammars have so called epsilon-productions where one
> alternative for a non-terminal symbol points to the empty word
> (epsilon).

I've never seen this, but I haven't seen many grammars.

> For example: Optional -> Something | epsilon
>
> In OpenToken I modeled the epsilon token as an
> OpenToken.Recognizer.Nothing.Instance and defined the production like
> this:
>
> Optional <= Something and
> Optional <= epsilon

I've always coded this at a higher level. I'm guessing you want to have
something like this:

Settings <= required_settings & optional + action

I would code that as:

Settings <= required_settings + action and
Settings <= required_settings & optional + action

Which I assume you are doing as a workaround for what you want.

Supporting epsilon is an interesting idea. It does reduce the amount of
user-written code, especially when there is more than one optional part.

> Now I realized that the lexer would actually never emit the epsilon
> token, because of it's pure formal meaning, and thus the second
> production would never be detected.

Right.

> Is there a way to instrument the parser to silently accept the epsilon
> token whenever it expects it without consuming a token from the lexer,

It's been a while since I messed with OpenToken at that level, but it
looks like you could make this happen by changing
opentoken-recognizer-nothing.adb Analyze to report Verdict := Matches.
You might give that a try.

I suspect this would have the problem that Dmitry pointed out; the
epsilon token would always match, so you'd get lots of ambiguities.

To get rid of the ambiguities, you would need to feed information about
what the parser expects into the lexer, so Epsilon would only match if
the parser expects it. That is possible, but there's no formal mechanism
for it in OpenToken. 

You could define a new token type Epsilon, modeled on Nothing, and add a
Set_Expected procedure. Then call that from somewhere in
opentoken-production-parser-lalr.adb Parse; probably just before each
call to Find_Next. The list of expected tokens is given by the function
opentoken-production-parser-lalr.adb Expecting (currently used in error
messages).

If that works, it might be nice to define Set_Expected as a dispatching
operation on OpenToken.Recognizer.Instance, so other tokens could use
it. In that case, it might be worth it to cache the result of Expecting.
Or pass Action_List directly to Set_Expecting, instead of reformatting it.

> or is it a common convention to translate each grammar into a epsilon-
> free representation?

That's certainly how I do it.

-- 
-- Stephe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-28  3:42   ` Randy Brukardt
@ 2012-01-29 17:45     ` Stephen Leake
  2012-01-31  0:56       ` Randy Brukardt
  0 siblings, 1 reply; 12+ messages in thread
From: Stephen Leake @ 2012-01-29 17:45 UTC (permalink / raw)


"Randy Brukardt" <randy@rrsoftware.com> writes:

> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
> news:1jvlv7i0tn14u.b5d2cwsqhl2h$.dlg@40tude.net...
>> On Fri, 27 Jan 2012 08:22:12 -0800 (PST), mtrenkmann wrote:
>>
>>> Is there a way to instrument the parser to silently accept the epsilon
>>> token whenever it expects it without consuming a token from the lexer,
>>> or is it a common convention to translate each grammar into a epsilon-
>>> free representation?
>>
>> I use neither explicit grammars nor OpenToken, so it is possible that I
>> didn't really understand the problem you have.
>
> Like Dmitry, I don't use OpenToken, but I do use a LALR(1) parser generator 
> (ours originates in a University of Wisconsin research project from the late 
> 1970s).
>
> In all of the grammars I've seen, you don't write anything for an epsilon 
> production; that's because you are matching nothing. But there is no problem 
> in matching nothing, so long as your grammar generator is powerful enough 
> (uses at least LALR(1) parsing, or perhaps LR(1) parsing). In that case, 
> matching nothing works so long as the follow sets are disjoint (something 
> that fails to be true periodically in our Ada grammar).
>
> For instance, here's the grammar for parameter modes from the Janus/Ada 
> compiler grammar:
>
> mode ::= IN ## 93
>     | OUT ## 94
>     | IN OUT ## 95
>     |   ## 198
>
> Note that the last production is an epsilon production. The ## part gives an 
> action number associated with the matching of that particular alternative of 
> this production. The ## part also marks the end of the production (it's 
> optional, and | also ends a production -- but it's required on the last 
> alternative as the grammar of our grammar uses insignificant line endings 
> like Ada does).
>
> I'd be surprised if OpenToken didn't have something similar; 

Not quite. Because OpenToken uses Ada types to build the grammar, we
need an explicit Epsilon token (full code below):

   Grammar : constant Production_List.Instance :=
     Tokens.Parse_Sequence <= Tokens.Paren_Left & Tokens.Mode & Tokens.Paren_Right + Arg_Action'Access and
     Tokens.Mode <= Tokens.In_Tok + Mode_Action'Access and
     Tokens.Mode <= Tokens.Out_Tok + Mode_Action'Access and
     Tokens.Mode <= Tokens.In_Tok & Tokens.Out_Tok + Mode_Action'Access and
     Tokens.Mode <= Tokens.Epsilon + Mode_Action'Access;

> and if it doesn't, you probably need to upgrade to a better grammar
> generator.

One way to do that is to improve OpenToken :).

In this case, we might be able to provide a monadic "+" that would do
the right thing, but I didn't try that.

pragma License (GPL);

with Ada.Text_IO;
with OpenToken.Production.List;
with OpenToken.Production.Parser.LALR;
with OpenToken.Production.Parser;
with OpenToken.Recognizer.Character_Set;
with OpenToken.Recognizer.End_Of_File;
with OpenToken.Recognizer.Keyword;
with OpenToken.Recognizer.Nothing;
with OpenToken.Text_Feeder.String;
with OpenToken.Token.Enumerated.Analyzer;
with OpenToken.Token.Enumerated.List;
with OpenToken.Token.Enumerated.Nonterminal;
procedure Debug is

   type Token_ID_Type is
     (EOF_ID,
      Epsilon_ID,
      In_ID,
      Out_ID,
      Paren_Left_ID,
      Paren_Right_ID,
      Whitespace_ID,

      --  non-terminals
      Mode_ID,
      Parse_Sequence_ID);

   package Master_Token is new OpenToken.Token.Enumerated (Token_ID_Type);
   package Token_List is new Master_Token.List;
   package Nonterminal is new Master_Token.Nonterminal (Token_List);

   package Production is new OpenToken.Production (Master_Token, Token_List, Nonterminal);
   package Production_List is new Production.List;

   use type Production.Instance;        --  "<="
   use type Production_List.Instance;   --  "and"
   use type Production.Right_Hand_Side; --  "+"
   use type Token_List.Instance;        --  "&"

   package Tokens is
      EOF         : constant Master_Token.Class := Master_Token.Get (EOF_ID);
      Epsilon     : constant Master_Token.Class := Master_Token.Get (Epsilon_ID);
      In_Tok      : constant Master_Token.Class := Master_Token.Get (In_ID);
      Out_Tok     : constant Master_Token.Class := Master_Token.Get (Out_ID);
      Paren_Left  : constant Master_Token.Class := Master_Token.Get (Paren_Left_ID);
      Paren_Right : constant Master_Token.Class := Master_Token.Get (Paren_Right_ID);

      --  Nonterminals
      Mode           : constant Nonterminal.Class := Nonterminal.Get (Mode_ID);
      Parse_Sequence : constant Nonterminal.Class := Nonterminal.Get (Parse_Sequence_ID);
   end Tokens;

   package Tokenizer is new Master_Token.Analyzer (Last_Terminal => Whitespace_ID);

   Syntax : constant Tokenizer.Syntax :=
     (EOF_ID         => Tokenizer.Get (OpenToken.Recognizer.End_Of_File.Get, Tokens.EOF),
      Epsilon_ID     => Tokenizer.Get (OpenToken.Recognizer.Nothing.Get),
      In_ID          => Tokenizer.Get (OpenToken.Recognizer.Keyword.Get ("in")),
      Out_ID         => Tokenizer.Get (OpenToken.Recognizer.Keyword.Get ("out")),
      Paren_Left_ID  => Tokenizer.Get (OpenToken.Recognizer.Keyword.Get ("(")),
      Paren_Right_ID => Tokenizer.Get (OpenToken.Recognizer.Keyword.Get (")")),

      Whitespace_ID => Tokenizer.Get
        (OpenToken.Recognizer.Character_Set.Get (OpenToken.Recognizer.Character_Set.Standard_Whitespace))
     );

   procedure Arg_Action
     (New_Token : out Nonterminal.Class;
      Source    : in  Token_List.Instance'Class;
      To_ID     : in  Token_ID_Type)
   is begin
      Nonterminal.Synthesize_Self (New_Token, Source, To_ID);
      Ada.Text_IO.Put_Line ("arg action");
   end Arg_Action;

   procedure Mode_Action
     (New_Token : out Nonterminal.Class;
      Source    : in  Token_List.Instance'Class;
      To_ID     : in  Token_ID_Type)
   is begin
      Nonterminal.Synthesize_Self (New_Token, Source, To_ID);
      Ada.Text_IO.Put_Line ("mode action");
   end Mode_Action;

   Grammar : constant Production_List.Instance :=
     Tokens.Parse_Sequence <= Tokens.Paren_Left & Tokens.Mode & Tokens.Paren_Right + Arg_Action'Access and
     Tokens.Mode <= Tokens.In_Tok + Mode_Action'Access and
     Tokens.Mode <= Tokens.Out_Tok + Mode_Action'Access and
     Tokens.Mode <= Tokens.In_Tok & Tokens.Out_Tok + Mode_Action'Access and
     Tokens.Mode <= Tokens.Epsilon + Mode_Action'Access;

   package OpenToken_Parser is new Production.Parser (Production_List, Tokenizer);
   package LALR_Parser is new OpenToken_Parser.LALR;
   String_Feeder : aliased OpenToken.Text_Feeder.String.Instance;
   Analyzer : constant Tokenizer.Instance := Tokenizer.Initialize (Syntax);
   Command_Parser : LALR_Parser.Instance := LALR_Parser.Generate (Grammar, Analyzer, OpenToken.Trace_Parse);

   use LALR_Parser;
begin
   OpenToken.Text_Feeder.String.Set (String_Feeder, "( in out )");

   Set_Text_Feeder (Command_Parser, String_Feeder'Unchecked_Access);

   --  Read and parse statements from the string until end of string
   loop
      exit when End_Of_Text (Command_Parser);
         Parse (Command_Parser);
   end loop;

end Debug;

-- 
-- Stephe




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-28 10:46 ` Stephen Leake
@ 2012-01-30 16:28   ` mtrenkmann
  2012-01-30 18:34     ` Dmitry A. Kazakov
  2012-01-31 12:58     ` Stephen Leake
  0 siblings, 2 replies; 12+ messages in thread
From: mtrenkmann @ 2012-01-30 16:28 UTC (permalink / raw)


On Jan 28, 11:46 am, Stephen Leake <stephen_le...@stephe-leake.org>
wrote:
> mtrenkmann <martin.trenkm...@googlemail.com> writes:
> > Very often grammars have so called epsilon-productions where one
> > alternative for a non-terminal symbol points to theemptyword
> > (epsilon).
>
> I've never seen this, but I haven't seen many grammars.
>
Here is the ASN.1 grammar I have to deal with (see Annex L):
http://www.itu.int/rec/T-REC-X.680-200811-I/en

> I would code that as:
>
> Settings <= required_settings + action and
> Settings <= required_settings & optional + action
>
> Which I assume you are doing as a workaround for what you want.
>
> Supporting epsilon is an interesting idea. It does reduce the amount of
> user-written code, especially when there is more than one optional part.
>
This is exactly why I need epsilon token support in the RHS of a
production. If you look at the top-level production in ASN.1, namely
"ModuleDefinition", there are already 4 optional non-terminals (the
epsilon token is called "empty" here). Modeling the epsilon token at
that higher level means to provide 2^4 RHS combinations for
"ModuleDefinition". No good!

> > Is there a way to instrument the parser to silently accept the epsilon
> > token whenever it expects it without consuming a token from the lexer,
>
> It's been a while since I messed withOpenTokenat that level, but it
> looks like you could make this happen by changingopentoken-recognizer-nothing.adb Analyze to report Verdict := Matches.
> You might give that a try.
>
> I suspect this would have the problem that Dmitry pointed out; the
> epsilon token would always match, so you'd get lots of ambiguities.
>
I tried it and yes I got lots of ambiguities.

> To get rid of the ambiguities, you would need to feed information about
> what the parser expects into the lexer, so Epsilon would only match if
> the parser expects it. That is possible, but there's no formal mechanism
> for it inOpenToken.
>
> You could define a new token type Epsilon, modeled on Nothing, and add a
> Set_Expected procedure. Then call that from somewhere inopentoken-production-parser-lalr.adb Parse; probably just before each
> call to Find_Next. The list of expected tokens is given by the functionopentoken-production-parser-lalr.adb Expecting (currently used in error
> messages).
>
Sounds feasible to me. I will implement that and let you know.

> -- Stephe

Ok, thanks for all your helpful hints. In the meanwhile I came up with
the following workaround, that actually works fine.
The idea is to perform some kind of recovery in the Parse procedure if
Action.Verb = Error due to the missing of the epsilon token. In that
case I check if epsilon is one of the expected tokens, and if so I set
Current_State.Seen_Token to epsilon and prevent the parser from
reading a new token from the lexer in the next Shift step. Here is the
code with the important parts:

-- The epsilon token becomes a new generic constant of the lexer.
generic
   Last_Terminal : in Token_ID := Token_ID'Last;
   Epsilon_Token : in Token_ID;
package OpenToken.Token.Enumerated.Analyzer is
   -- new function needed by the parser
   function Get_Syntax (Analyzer : in Instance) return Syntax;
   function Epsilon_ID (Analyzer : in Instance) return Terminal_ID;
end OpenToken.Token.Enumerated.Analyzer;

-- This is no complete code! Just the important parts.
package body OpenToken.Production.Parser.LALR is
   overriding procedure Parse (Parser : in out Instance) is
      -- a flag to prevent the parser from reading a new token in
Shift step
      Get_Next_From_Analyzer : Boolean := True;
   begin
      loop
         case Action.Verb is
         when Shift =>
            if Get_Next_From_Analyzer then
               --  Get the next token
               begin
                  Tokenizer.Find_Next (Parser.Analyzer);
               exception
                  when E : Syntax_Error =>
                     raise Syntax_Error with
                     Integer_Image (Line (Parser)) &
                     ":" &
                     Integer_Image (Column (Parser) - 1) &
                     " " &
                     Ada.Exceptions.Exception_Message (E);
               end;
            else
               -- Use the current token again in the next call of
Action_For
               Get_Next_From_Analyzer := True;
            end if;
         when Error =>
            -- BEGIN epsilon token hack
            -- Try to recover if epsilon is one of the expected tokens
            declare
               use Token;
               Expected_Tokens : constant Token_Array
                 := Expecting (Parser.Table, Current_State.State);
               Is_Epsilon_Expected : Boolean := False;
            begin
               for I in Expected_Tokens'Range loop
                  if Expected_Tokens (I) = Tokenizer.Epsilon_Token
then
                     Is_Epsilon_Expected := True;
                     exit;
                  end if;
               end loop;
               if Is_Epsilon_Expected then
                  -- Create an on-demand epsilon token
                  Current_State.Seen_Token := new
Token.Class'(Parser.Analyzer.Get_Syntax
(Parser.Analyzer.Epsilon_ID).Token_Handle.all);
                  Get_Next_From_Analyzer := False;
            -- END epsilon token hack
               else
                  -- Clean up
                  -- Raise Sytax_Error
               end if;
            end;
         end case;
      end loop;
end OpenToken.Production.Parser.LALR;

I am wondering if that behavior cannot be implemented in the table-
lookup procedure. That is when I call Action_For with an unexpected
token, while there is the epsilon as one of the expected tokens, then
this epsilon could be accepted automatically, the parser changes its
state accordingly, and the unexpected token is checked again. Don't
know if that is possible, since I am not yet familiar with the
internals of table-driven parsing.

-- Martin



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-30 16:28   ` mtrenkmann
@ 2012-01-30 18:34     ` Dmitry A. Kazakov
  2012-01-31 12:58     ` Stephen Leake
  1 sibling, 0 replies; 12+ messages in thread
From: Dmitry A. Kazakov @ 2012-01-30 18:34 UTC (permalink / raw)

On Mon, 30 Jan 2012 08:28:57 -0800 (PST), mtrenkmann wrote:

> I am wondering if that behavior cannot be implemented in the table-
> lookup procedure. That is when I call Action_For with an unexpected
> token, while there is the epsilon as one of the expected tokens, then
> this epsilon could be accepted automatically, the parser changes its
> state accordingly, and the unexpected token is checked again. Don't
> know if that is possible, since I am not yet familiar with the
> internals of table-driven parsing.

You have tables of token for states where there are alternatives. You match
the source against the table:

    Get (Source, Statements_Table, Token, Got_It);
    if Got_It then -- One of the tokens has been matched
       case Token is
           when If_Token =>
               Do_If;
           when For_Token =>
               Do_For;
            when ... =>
               ...
       end case;
    else -- Nothing matched, this is the "epsilon" choice
        ...
    end if;

This is basically a recursive descent parser. You would also have to define
a pair useful subprograms like Skip_Blanks, Get_Identifier, Get_Literal,
Get_Keyword.

The only difficult part is infix expressions. For these there are
ready-to-use means, which match the expression evaluating it or producing a
tree as byproduct.

The problem as you described it just does not exist. You get the keyword,
if it is not there, you do something in reaction. That is.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-29 17:45     ` Stephen Leake
@ 2012-01-31  0:56       ` Randy Brukardt
  2012-01-31  9:09         ` Georg Bauhaus
  2012-01-31 12:16         ` Stephen Leake
  0 siblings, 2 replies; 12+ messages in thread
From: Randy Brukardt @ 2012-01-31  0:56 UTC (permalink / raw)

"Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message 
news:82ehuibdwt.fsf@stephe-leake.org...
> "Randy Brukardt" <randy@rrsoftware.com> writes:
...
>> I'd be surprised if OpenToken didn't have something similar;
>
> Not quite. Because OpenToken uses Ada types to build the grammar, we
> need an explicit Epsilon token (full code below):

I see. It appears that OpenToken uses an LL-derived parsing scheme. (It 
appears to be a version of recursive descent, which is an LL scheme.] And 
it's important to note that LL parsing cannot include epsilon tokens and 
have problems with left-recursion. LR-derived parsing does not have either 
of these limitations. And note that there is no real workaround to these 
limitations (short of rewriting your grammar); they're inherent in the 
parsing technique.

[Note: I learned all about parsing back in the early 1980's, so (1) I may be 
remembering something wrong, or (2) the state of the art may have changed 
some.]

I would suggest that the OP use a LR-based grammar generator. That won't be 
quite so "pretty" as using "Ada types", but it will work well with almost 
any original grammar that you can throw at it. (I believe that modern parser 
generators use LR(1) tables rather than the LALR(1) tables used in the past; 
LR(1) tables are a lot bigger to generate and we simply didn't have enough 
memory to use that back in the day.)

Hacking an LL-based scheme (such as OpenToken's) to work is pretty much 
doomed to failure. You might get it to work for limited circumstances, but I 
don't think it would work in any general sort of way.

                                         Randy.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-31  0:56       ` Randy Brukardt
@ 2012-01-31  9:09         ` Georg Bauhaus
  2012-01-31 12:16         ` Stephen Leake
  1 sibling, 0 replies; 12+ messages in thread
From: Georg Bauhaus @ 2012-01-31  9:09 UTC (permalink / raw)


On 31.01.12 01:56, Randy Brukardt wrote:
> "Stephen Leake"<stephen_leake@stephe-leake.org>  wrote in message
> news:82ehuibdwt.fsf@stephe-leake.org...
>> "Randy Brukardt"<randy@rrsoftware.com>  writes:
> ...
>>> I'd be surprised if OpenToken didn't have something similar;
>>
>> Not quite. Because OpenToken uses Ada types to build the grammar, we
>> need an explicit Epsilon token (full code below):
>
> I see. It appears that OpenToken uses an LL-derived parsing scheme. (It
> appears to be a version of recursive descent, which is an LL scheme.] And
> it's important to note that LL parsing cannot include epsilon tokens and
> have problems with left-recursion.

LL parsing can include empty words. But then, the grammar might
be such that predicting the next rule via FIRST and FOLLOW sets
does not lead to a single result. (In which case one could say
"cannot parse" or boldly move on and pick a production, try it,
and if it fails, try the next, I should think.) No guarantees.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-31  0:56       ` Randy Brukardt
  2012-01-31  9:09         ` Georg Bauhaus
@ 2012-01-31 12:16         ` Stephen Leake
  2012-02-02  1:39           ` Randy Brukardt
  1 sibling, 1 reply; 12+ messages in thread
From: Stephen Leake @ 2012-01-31 12:16 UTC (permalink / raw)

"Randy Brukardt" <randy@rrsoftware.com> writes:

> "Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message 
> news:82ehuibdwt.fsf@stephe-leake.org...
>> "Randy Brukardt" <randy@rrsoftware.com> writes:
> ...
>>> I'd be surprised if OpenToken didn't have something similar;
>>
>> Not quite. Because OpenToken uses Ada types to build the grammar, we
>> need an explicit Epsilon token (full code below):
>
> I see. It appears that OpenToken uses an LL-derived parsing scheme. (It 
> appears to be a version of recursive descent, which is an LL scheme.] And 
> it's important to note that LL parsing cannot include epsilon tokens and 
> have problems with left-recursion. LR-derived parsing does not have either 
> of these limitations. And note that there is no real workaround to these 
> limitations (short of rewriting your grammar); they're inherent in the 
> parsing technique.

I'm also on somewhat shakey ground on parser theory, but I did read (and
understand :) "the red dragon book"
(http://en.wikipedia.org/wiki/Principles_of_Compiler_Design), when I
took over OpenToken and fixed some major bugs.

The parser is an LALR parser.

But I think the implementation is not complete. There are no examples of
grammars containing epsilon in the OpenToken tests. I don't remember
what the dragon book has to say about epsilon handling.

So the best approach here is to go back and read the theory about how to
handle epsilon, add some OpenToken tests that use epsilon, and see what
needs to be done.

More than I have the energy for at the moment, but I'm happy to provide
insight into the current code.

-- 
-- Stephe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-30 16:28   ` mtrenkmann
  2012-01-30 18:34     ` Dmitry A. Kazakov
@ 2012-01-31 12:58     ` Stephen Leake
  1 sibling, 0 replies; 12+ messages in thread
From: Stephen Leake @ 2012-01-31 12:58 UTC (permalink / raw)

mtrenkmann <martin.trenkmann@googlemail.com> writes:

> On Jan 28, 11:46 am, Stephen Leake <stephen_le...@stephe-leake.org>
> wrote:
>> mtrenkmann <martin.trenkm...@googlemail.com> writes:
>> > Very often grammars have so called epsilon-productions where one
>> > alternative for a non-terminal symbol points to theemptyword
>> > (epsilon).
>>
>> I've never seen this, but I haven't seen many grammars.
>>
> Here is the ASN.1 grammar I have to deal with (see Annex L):
> http://www.itu.int/rec/T-REC-X.680-200811-I/en

When OpenToken was first designed, the intent was to process small
languages for system configuration. And that is how I'm using it at
work. So I'm not surprised that it has a hard time with large languages.

> Ok, thanks for all your helpful hints. In the meanwhile I came up with
> the following workaround, that actually works fine.

<snip>

That looks like a reasonable workaround and proof of principle, but not
a long-term solution.

> I am wondering if that behavior cannot be implemented in the table-
> lookup procedure. That is when I call Action_For with an unexpected
> token, while there is the epsilon as one of the expected tokens, then
> this epsilon could be accepted automatically, the parser changes its
> state accordingly, and the unexpected token is checked again. Don't
> know if that is possible, since I am not yet familiar with the
> internals of table-driven parsing.

It would certainly be more efficient than your current code.

I suspect making epsilon an explicit token in the parser but not the
lexer is the right way to go.

I suggest you get "the red dragon book"
(en.wikipedia.org/wiki/Principles_of_Compiler_Design [1]), and read
chapter 5 (you probably have to read the earlier chapters also to
understand it).

Skimming thru chapter 5, I don't see any discussion on how to handle
epsilon, but it must be in there somewhere.

Example 5.36 has an example grammar with an epsilon. Implement that in
an OpenToken test (or pick a small subset of your real language), and
then see if you can figure out how to integrate it into the state
machine. 

I find small tests immensely helpful in understanding the code. See
test_lr0_kernels.adb for helpful techniques; setting Debug => True turns
on lots of useful output.

Another thing to be aware of; the lexer in OpenToken is horribly
inefficient. So if you are planning to parse large amounts of text, you
should consider replacing (or improving) the lexer. It would be
interesting to see how hard that is in OpenToken, or you may want to
switch to another generator (although I am not aware of any others
written in and generating Ada).

[1] The purple dragon book may be better; it's a lot more recent, but
the blurb indicates it's less focussed on actually implementing parsers.
And some of the reviews on Amazon imply there are better books with
better techniques out there. So you may want to look for a better book;
there are certainly plenty of compiler books on Amazon.

-- 
-- Stephe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: OpenToken: Handling the empty word token
  2012-01-31 12:16         ` Stephen Leake
@ 2012-02-02  1:39           ` Randy Brukardt
  0 siblings, 0 replies; 12+ messages in thread
From: Randy Brukardt @ 2012-02-02  1:39 UTC (permalink / raw)

"Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message 
news:82obtkgj81.fsf@stephe-leake.org...
> "Randy Brukardt" <randy@rrsoftware.com> writes:
>
>> "Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message
>> news:82ehuibdwt.fsf@stephe-leake.org...
...
> I'm also on somewhat shakey ground on parser theory, but I did read (and
> understand :) "the red dragon book"
> (http://en.wikipedia.org/wiki/Principles_of_Compiler_Design), when I
> took over OpenToken and fixed some major bugs.
>
> The parser is an LALR parser.
>
> But I think the implementation is not complete. There are no examples of
> grammars containing epsilon in the OpenToken tests. I don't remember
> what the dragon book has to say about epsilon handling.

Well, you don't need to do anything special to handle "epsilon". Unless you 
actually want to write "epsilon", which is rather silly IMHO, as it 
represents nothing at all, and it is easier to write nothing at all.

As I recall, the only problem I had with our generator was with an ambiguity 
in recognizing the ends of productions in the input; in our case that was 
solved by requiring an action number at the end even if no action was 
needed. An alternative way to handle that would be to change the input 
grammar to have "epsilon" stand for absoluty nothing (that might be easier 
if there are already some sort of reserved words in the grammar source).

                                       Randy.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2012-02-02  1:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-27 16:22 OpenToken: Handling the empty word token mtrenkmann
2012-01-27 16:48 ` Dmitry A. Kazakov
2012-01-28  3:42   ` Randy Brukardt
2012-01-29 17:45     ` Stephen Leake
2012-01-31  0:56       ` Randy Brukardt
2012-01-31  9:09         ` Georg Bauhaus
2012-01-31 12:16         ` Stephen Leake
2012-02-02  1:39           ` Randy Brukardt
2012-01-28 10:46 ` Stephen Leake
2012-01-30 16:28   ` mtrenkmann
2012-01-30 18:34     ` Dmitry A. Kazakov
2012-01-31 12:58     ` Stephen Leake

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox