From: "Dmitry A. Kazakov"
Newsgroups: comp.lang.ada
Subject: Re: [Slightly OT] How to process lightweight text markup languages?
Date: Thu, 22 Jan 2015 19:38:52 +0100
Organization: cbb software GmbH
References: <1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net>
Reply-To: mailbox@dmitry-kazakov.de

On Thu, 22 Jan 2015 13:41:12 +0000 (UTC), Natasha Kerensikova wrote:

> On 2015-01-20, Randy Brukardt wrote:
>> "Dmitry A. Kazakov" wrote in message
>> news:1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net...
>>> On Tue, 20 Jan 2015 18:47:13 +0000 (UTC), Natasha Kerensikova wrote:
>>>> So if you remember my example,
>>>>
>>>> [alpha`](http://example.com/)`](http://other.com)
>>>
>>> Since `` are quotation marks, the above should be:
>>>
>>> +
>>> |_ []
>>> |  |_ +
>>> |     |_ alpha
>>> |     |_ ](http://example.com/)
>>> |_ ()
>>>    |_ http://other.com
>>>
>>> + is an assumed infix catenation operation. No backtracking needed.
>
> Maybe I got the vocabulary wrong, but the tree above is exactly what I
> expect as an output for the whole parser. Except I don't need it to be
> stored as a tree; if it's only an iterator or an event driver with a
> callback collection, that's fine for me too.

The tree is merely an unambiguous representation of the result of
parsing. Whether the parser's callback actually generates a tree node or
does something else is up to you.

> However I'm still completely lost on how to reach that point.
>
> Would you have a link that explains that? Or keywords to feed a search
> engine?

http://www.dmitry-kazakov.de/ada/components.htm#Parsers_etc

The method is described in "Compiler Construction for Digital Computers"
by D. Gries, if I remember correctly.

> Dmitry's answer position led me to believe that he was claiming an
> online algorithm (i.e. operating on stream input and output) is
> possible, provided there are internal stacks. So I followed that idea,
> because I like online algorithms, for the same reason I like
> implementations without explicit memory allocation: it's easier to
> understand and reason about, but it's too limited for some situations.

You can allocate the memory for the stacks statically. The point is that
a certain class of formal languages requires memory in order to be
recognized. You can convert memory into parser states and states into
memory, but you must have it.

> So if there has to be two stages, that's fine, I'll go with two stages.

Note that this does not mean that the second stage must start at the end
of the source. You can resume that stage right after each lexeme is
recognized. Only early compilers used to produce a file after each stage.
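To illustrate resuming the second stage after each lexeme, here is a
minimal sketch. It is not my actual components; Lexer_Sketch, Emit and
Scan are names invented for this post, and escaping inside `...` is left
out. The lexer calls Emit as soon as a lexeme is matched instead of
writing it to an intermediate file:

   --  Sketch only: the names below are invented for this post
   with Ada.Text_IO;  use Ada.Text_IO;

   procedure Lexer_Sketch is

      type Lexeme_Kind is
         (Left_Bracket, Right_Bracket, Left_Paren, Right_Paren,
          Literal, Text);

      --  The second stage resumes here after each lexeme.  A real
      --  parser would push onto its stacks instead of printing.
      procedure Emit (Kind : Lexeme_Kind; Value : String) is
      begin
         Put_Line (Lexeme_Kind'Image (Kind) & " " & Value);
      end Emit;

      procedure Scan (Source : String) is
         I : Positive := Source'First;
         J : Positive;
      begin
         while I <= Source'Last loop
            case Source (I) is
               when '[' => Emit (Left_Bracket,  "["); I := I + 1;
               when ']' => Emit (Right_Bracket, "]"); I := I + 1;
               when '(' => Emit (Left_Paren,    "("); I := I + 1;
               when ')' => Emit (Right_Paren,   ")"); I := I + 1;
               when '`' =>
                  --  Accumulate the literal body (no escaping here)
                  J := I + 1;
                  while J <= Source'Last and then Source (J) /= '`' loop
                     J := J + 1;
                  end loop;
                  Emit (Literal, Source (I + 1 .. J - 1));
                  I := J + 1;
               when others =>
                  --  A sequence of other characters
                  J := I;
                  while J <= Source'Last
                    and then Source (J) not in
                      '[' | ']' | '(' | ')' | '`'
                  loop
                     J := J + 1;
                  end loop;
                  Emit (Text, Source (I .. J - 1));
                  I := J;
            end case;
         end loop;
      end Scan;

   begin
      Scan ("[alpha`](http://example.com/)`](http://other.com)");
   end Lexer_Sketch;

Here Emit merely prints what it gets; a real second stage would push the
lexeme onto its stacks instead.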
> The next problem is that I still have no clue about what these two
> stages are. The markers that begin or end elements depend on the
> context, and knowing both what the markers are and what the context is
> amounts to having the full parser. So I can't imagine what data is
> produced by the first stage to be fed into the second.

The division into syntax elements and lexemes is somewhat arbitrary; the
motivation behind it is that lexemes can be recognized independently of
the context. Syntax elements interact with each other, e.g. in a+b*c the
operands are bound by precedence. The next stage, semantic analysis, is
even more "ambiguous".

> For example, a slight variation on my example above yields a completely
> different result:
>
> [alpha`](http://example.com/)'](http://other.com)
>
> +
> |_ +
> |  |_ []
> |  |  |_ alpha`
> |  |_ ()
> |     |_ http://example.com/
> |_ '](http://other.com)
>
> This means that the parser can't know how to parse the stuff around
> "example" unless it knows whether the backtick after "alpha" is matched
> or not.

Why should it care? The lexer simply goes to the first delimiter where it
would stop anyway. E.g. in Ada, if you write

   X := "hello;

it would be the line end. Similarly in C, unless it hits \ before the
line end. At that point you decide how to react. You can stop at once, or
assume that the literal ends there and continue.

> So determining whether the "](" in there are a lexical element or part
> of a normal string means having completely parsed that backtick.

It is part of the literal's body, provided that `xxx` is a literal. If
you are not satisfied with that choice, make it a bracket; that would
make sense, e.g., for Linux bash.

> Unless, again, I'm missing something. But I'm starting to wonder whether
> I'm too ignorant and/or you too knowledgeable to bridge the conceptual
> gap between us.

I think you are overdesigning it. The lexer need not know anything. It
simply matches a lexeme and moves to the next one. A lexeme in your case
could be

   [
   ]
   (
   )
   `anything-in-between`
   a-sequence-of-other-characters

When the lexer sees ` it moves forward, accumulating the literal (which
may have some escaping), until the first non-escaped `. The accumulated
body of the literal feeds the second stage. The lexer can accumulate it
straight on the stack of arguments.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de