From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!.POSTED!not-for-mail
From: Natasha Kerensikova <lithiumcat@instinctive.eu>
Newsgroups: comp.lang.ada
Subject: Re: [Slightly OT] How to process lightweight text markup languages?
Date: Thu, 22 Jan 2015 13:41:12 +0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <slrnmc1vgn.19vl.lithiumcat@nat.rebma.instinctive.eu>
References: <slrnmbntdm.19vl.lithiumcat@nat.rebma.instinctive.eu>
 <ynm6coktfevl.1esu61g1n9477.dlg@40tude.net>
 <slrnmbt8mf.19vl.lithiumcat@nat.rebma.instinctive.eu>
 <1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net> <m9mj5b$92m$1@loke.gir.dk>
Injection-Date: Thu, 22 Jan 2015 13:41:12 +0000 (UTC)
Injection-Info: mx02.eternal-september.org;
 posting-host="eab84d932a0f4c9d4606240766f0f5e7";
	logging-data="30367"; mail-complaints-to="abuse@eternal-september.org";
	posting-account="U2FsdGVkX1+lR0HL26CCJ8pdwujgExvx"
User-Agent: slrn/1.0.2 (FreeBSD)
Cancel-Lock: sha1:vWvSqvHVNojSgk56gueq6Ljl8XA=
Xref: news.eternal-september.org comp.lang.ada:24695
Date: 2015-01-22T13:41:12+00:00
List-Id: <comp.lang.ada>

Hello,

On 2015-01-20, Randy Brukardt <randy@rrsoftware.com> wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
> news:1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net...
>> On Tue, 20 Jan 2015 18:47:13 +0000 (UTC), Natasha Kerensikova wrote:
>>> So if you remember my example,
>>>
>>>    [alpha`](http://example.com/)`](http://other.com)
>>
>> Since `` are quotation marks, the above should be:
>>
>>  +
>>  |_ []
>>  |   |_ +
>>  |       |_ alpha
>>  |       |_ ](http://example.com/)
>>  |_ ()
>>       |_ http://other.com
>>
>> + is an assumed infix catenation operation. No backtracking needed.

Maybe I got the vocabulary wrong, but the tree above is exactly what I
expect as an output for the whole parser. Except I don't need it to be
stored as a tree, if it's only an iterator or an event driver with a
callback collection, that's fine for me too.

However I'm still completely lost on how to reach that point.

Would you have a link that explains that? Or keywords to feed a search
engine?

>> [...]
>>> Am I right so far? Am I missing something?
>>
>> Distinguishing lexical and syntactical elements? You don't bother with
>> operators until expression terms (lexemes) matched. Once you matched them
>> you never return back. They are all on the stack if not already bound by 
>> an
>> operation. If `` is declared literal, it is a term of the expression,
>> atomically matched. It naturally takes precedence over anything else.
>
> I agree with Dmitry; "standard" parsing has two stages, lexing (converting 
> into a token stream) and parsing. You're trying to make due with only one, 
> which complicates the problem a lot for no reason.

Dmitry's answer position led me to believe that he was claiming an
online algorithm (i.e. operating on stream input and output) is possible,
provided there are internal stacks. So I followed that idea, because I
like online algorithms, for the same reason I like implementations
without explicit memory allocation: it's easier to understand and reason
about, but it's too limited for some situations.

So if there has to be two stages, that's fine, I'll go with two stages.

The next problem is that I still have no clue about what these two
stages are. The markers that begin or end elements depends on the
context, and knowing both what are markers and in what context is having
the full parser. So I can't imagine what data is produced by the first
stage to be fed into the second.

For example, a slight variation on my example above yields a completely
different result:

   [alpha`](http://example.com/)'](http://other.com)

+
|_ +
|   |_ []
|   |   |_ alpha`
|   |_ ()
|       |_ http://example.com/
|_ '](http://other.com)

This means that the parser can't know how to parse the stuff around
"example" unless it knows whether the backtick after "alpha" is matched
or not. So determining whether the "](" in there are a lexical element
or part a normal string means having parsed completely that backtick.

Unless, again, I'm missing something. But I'm starting to wonder whether
I'm too ignorant and/or you too knowledgeable to bridge the conceptual
gap between us.

> Also note that an LR parser acts similarly to your "parsing all 
> possibilities at once".

Thanks a lot. The keyword here is duly noted for future research.

>             As Dmitry says, such parsers (like the one we use in Janus/Ada) 
> make it more challenging to deal with error correction

The thing is that in these languages, there is really no error. Like the
unmatched tick in my second example, or the last bracket in the first
one. If it doesn't fit the rule, then it's literal. I guess that would
look like a catch-all rule at the end of a list of alternatives. That's
also the reason why I'm not completely sure these languages can be
described by a (mathematical) grammar, but I'll admit my ignorance in
the practical uses of such algebra constructs.


Thanks for your tentative help,
Natasha