From mboxrd@z Thu Jan  1 00:00:00 1970
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!feeder.eternal-september.org!aioe.org!.POSTED!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups: comp.lang.ada
Subject: Re: [Slightly OT] How to process lightweight text markup languages?
Date: Mon, 19 Jan 2015 18:58:57 +0100
Organization: cbb software GmbH
Message-ID: <x7neog5b4tss.jow36zcbc1k$.dlg@40tude.net>
References: <slrnmbntdm.19vl.lithiumcat@nat.rebma.instinctive.eu>
 <ynm6coktfevl.1esu61g1n9477.dlg@40tude.net> <m9iokj$upl$1@dont-email.me>
 <c9058n5dlu56.608mrt8042o0$.dlg@40tude.net> <m9jd2u$j08$1@dont-email.me>
Reply-To: mailbox@dmitry-kazakov.de
NNTP-Posting-Host: 0MSBVPcE8EdvhPFyEbPM4g.user.speranza.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: 40tude_Dialog/2.0.15.1
X-Notice: Filtered by postfilter v. 0.8.2
Xref: news.eternal-september.org comp.lang.ada:24607
Date: 2015-01-19T18:58:57+01:00
List-Id: <comp.lang.ada>

On Mon, 19 Jan 2015 17:58:38 +0100, G.B. wrote:

> On 19.01.15 14:21, Dmitry A. Kazakov wrote:
>> On Mon, 19 Jan 2015 12:09:40 +0100, G.B. wrote:
>>
>>> On 18.01.15 21:21, Dmitry A. Kazakov wrote:
>>>> This is a pretty straightforward and simple technique.
>>>
>>> The trouble is with expectations:
>>>
>>> Input:
>>>
>>>    ((){)([()[[]])]
>>>
>>> Typical parsers will respond with such useless results
>>> as "error at EOF". Not something that a (close to)
>>> natural language processor can afford, I think.
>>
>> Not with the technique I described. In your example, the operator stack
>> will contain:
>>
>>    (  at pos. 2   <--- stack top
>>    (  at pos. 1
>>
>> when } will try to wind it up by popping the last unmatched (. Since } does
>> not match ( you will easily generate "the closing curly bracket at pos. 3
>> does not match the opening round bracket at pos. 2"
> 
> That's a possible answer, but may not be what should
> have happened next if the brackets weren't tied together
> properly and something is in need of recovery.

Of course you can do recovery of any kind: you could simply throw } away
and treat this as () instead, or alternatively you could treat it as {}.
You have all the necessary information to make a choice.
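
A minimal sketch of that, in Python rather than Ada for brevity. Everything
here is an assumption made for illustration: the function name, the two
recovery policies ("drop" throws the stray closing bracket away, "pair"
treats it as if it had matched) and the wording of the messages. The input
"((}" is likewise assumed; it is the shortest input that produces the stack
shown above.

```python
# Hypothetical sketch: names, policies and message wording are illustrative.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())
NAMES = {"(": "round", ")": "round", "[": "square", "]": "square",
         "{": "curly", "}": "curly"}


def check_brackets(text, recovery="drop"):
    """Return a list of diagnostics; positions are 1-based.

    recovery="drop": discard a mismatched closing bracket, keep the opener.
    recovery="pair": pop the opener as if the closing bracket had matched.
    """
    stack = []   # (opening bracket, position); the stack top is the last item
    errors = []
    for pos, ch in enumerate(text, start=1):
        if ch in OPENERS:
            stack.append((ch, pos))
        elif ch in PAIRS:
            if not stack:
                errors.append(f"the closing {NAMES[ch]} bracket at pos. {pos} "
                              f"has no opening bracket")
            elif stack[-1][0] == PAIRS[ch]:
                stack.pop()
            else:
                opener, opos = stack[-1]
                errors.append(f"the closing {NAMES[ch]} bracket at pos. {pos} "
                              f"does not match the opening {NAMES[opener]} "
                              f"bracket at pos. {opos}")
                if recovery == "pair":
                    stack.pop()
    # Whatever is left on the stack was never closed.
    for opener, opos in stack:
        errors.append(f"the opening {NAMES[opener]} bracket at pos. {opos} "
                      f"is never closed")
    return errors


for message in check_brackets("((}"):
    print(message)
```

The stack keeps the position of every opening bracket, so both positions
are at hand when the mismatch is detected, whichever policy is chosen.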

I am not a fan of recovery; what GNAT does, for example, is quite annoying
to me. I prefer a full stop after the first error. But it is a matter of
taste.

>> Your experience probably comes from grammar-generated parsers. The
>> straightforward technique is so much better for all practical purposes, and
>> for error messages generation especially.
> 
> Leaving some issues aside such as right brackets being far away,
> or missing altogether, or superfluous due to having been placed
> twice as in Natasha's example, or structured and misspelled, this
> setup falls a little short of what is to be achieved.

This is why there are lexical and syntactical elements. You don't deal with
syntax if keywords are mangled. Quite simple.

> In
> particular in a live system where there is no human involved,
> something must be produced: If
> 
>   [alpha`]beta`
> 
> is a legitimate input, although possibly ungrammatical,
> then what is to be produced?

The ` characters are either paired brackets (syntax) or else quotation marks
(a lexical element). In the first case you get a bracket mismatch; in the
second case you get a missing quotation mark of a literal starting at pos. 7.
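
A rough sketch of the first case only, again in Python. It assumes that a
backquote opens a pair unless a backquote is already on the stack top, in
which case it closes it; that rule, the function name and the message
wording are my assumptions, not part of any markup specification. It stops
at the first error.

```python
# Hypothetical sketch: the backquote is treated as a symmetric bracket.
CLOSERS = {"]": "[", ")": "("}
OPENERS = set(CLOSERS.values())
NAMES = {"[": "square bracket", "]": "square bracket",
         "(": "round bracket", ")": "round bracket",
         "`": "backquote"}


def first_mismatch(text):
    """Return the first bracket error as a string, or None if balanced."""
    stack = []   # (opening delimiter, 1-based position)
    for pos, ch in enumerate(text, start=1):
        if ch == "`":
            # Same character opens and closes: close if one is on top.
            if stack and stack[-1][0] == "`":
                stack.pop()
            else:
                stack.append((ch, pos))
        elif ch in OPENERS:
            stack.append((ch, pos))
        elif ch in CLOSERS:
            if not stack:
                return f"the closing {NAMES[ch]} at pos. {pos} has no opener"
            opener, opos = stack[-1]
            if opener != CLOSERS[ch]:
                return (f"the closing {NAMES[ch]} at pos. {pos} does not "
                        f"match the opening {NAMES[opener]} at pos. {opos}")
            stack.pop()
    if stack:
        opener, opos = stack[-1]
        return f"the opening {NAMES[opener]} at pos. {opos} is never closed"
    return None


print(first_mismatch("[alpha`]beta`"))
```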

> If it was the writer's intention to write "`]", then the parser
> must not touch the input and a non-translation is the best
> solution. If not, then maybe error correction could switch
> the positions of "`" and "]", maybe when looking ahead reveals
> a likely match for "`".

A parser does not guess anything. It simply parses. The failure of
generators lies in the wrong assumption that there is one language to parse
and anything not recognized is a fault. In reality there is a large number
of languages the parser must parse. Only one of these languages is legal,
e.g., Ada. All the other languages the parser understands are variants of
broken Ada. The parser does not reject them; it translates them into a set
of error messages rather than into an AST.
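
To illustrate with a toy language of nested brackets (Python again; the
function name, the tuple-based result and the message texts are assumptions
made up for this sketch): every input is translated into something, either
a tree or a list of messages.

```python
# Hypothetical sketch: one parser, two kinds of output.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def parse(text):
    """Return ("ast", tree) for legal input, else ("errors", messages).

    The tree is a list of (opening bracket, children) tuples.
    Characters other than brackets are ignored.
    """
    root = []
    nodes = [root]   # child lists currently under construction
    opens = []       # (opening bracket, 1-based position)
    errors = []
    for pos, ch in enumerate(text, start=1):
        if ch in OPENERS:
            children = []
            nodes[-1].append((ch, children))
            nodes.append(children)
            opens.append((ch, pos))
        elif ch in PAIRS:
            if not opens:
                errors.append(f"unexpected {ch!r} at pos. {pos}")
                continue
            # A mismatched closer still closes the innermost opener here;
            # that is one possible choice, not the only one.
            opener, opos = opens.pop()
            nodes.pop()
            if opener != PAIRS[ch]:
                errors.append(f"{ch!r} at pos. {pos} does not match "
                              f"{opener!r} at pos. {opos}")
    for opener, opos in opens:
        errors.append(f"{opener!r} at pos. {opos} is never closed")
    if errors:
        return ("errors", errors)
    return ("ast", root)


print(parse("([])"))
print(parse("(]"))
```

Broken input is not a special path bolted on afterwards; it goes through
the same loop and simply comes out as messages instead of a tree.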

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de