From: "G.B."
Newsgroups: comp.lang.ada
Subject: Re: [Slightly OT] How to process lightweight text markup languages?
Date: Mon, 19 Jan 2015 17:58:38 +0100

On 19.01.15 14:21, Dmitry A. Kazakov wrote:
> On Mon, 19 Jan 2015 12:09:40 +0100, G.B. wrote:
>
>> On 18.01.15 21:21, Dmitry A. Kazakov wrote:
>>> This is a pretty straightforward and simple technique.
>>
>> The trouble is with expectations:
>>
>> Input:
>>
>> ((){)([()[[]])]
>>
>> Typical parsers will respond with such useless results
>> as "error at EOF". Not something that a (close to)
>> natural language processor can afford, I think.
>
> Not with the technique I described. In your example, the operator stack
> will contain:
>
> ( at pos. 2 <--- stack top
> ( at pos. 1
>
> when } will try to wind it up by popping the last unmatched (.
> Since } does not match ( you will easily generate "the closing curly
> bracket at pos. 3 does not match the opening round bracket at pos. 2"

That's a possible answer, but may not be what should have
happened next if the brackets weren't tied together properly
and something is in need of recovery. See also
http://www.youtube.com/watch?v=cog2a3YeDMM

> Your experience probably come from grammar-generated parsers. The
> straightforward technique is so much better for all practical purposes, and
> for error messages generation especially.

Leaving aside some issues, such as right brackets being far
away, or missing altogether, or superfluous due to having been
placed twice as in Natasha's example, or structured and misspelled,
this setup falls a little short of what is to be achieved.
In particular, in a live system where there is no human
involved, something must be produced: If

  [alpha`]beta`

is a legitimate input, although possibly ungrammatical, then what
is to be produced?

A good translator needs to make the best of it. The output
should reflect the intention. That's only possible when there
is a likely or legitimate interpretation, as judged after the
fact by readers of the output. What they will recognize should
be what the author had wanted them to recognize.

If it was the writer's intention to write "`]", then the parser
must not touch the input, and a non-translation is the best
solution. If not, then maybe error correction could switch the
positions of "`" and "]", maybe when looking ahead reveals
a likely match for "`". In any case, the input could be shown
alongside the translation, or at least be available for checking.

I think the best solution is to come to terms with computers
and use them for text editing! Do not again start an even more
ad-hoc markup business than the one against which they drew
up GML in 1969. I guess :-)
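
For reference, the stack-based check described in the quoted text can be
sketched roughly as follows. This is only an illustration in Python, not
code from the thread: the names check_brackets, PAIRS and NAMES are
made up here, and the recovery policy (report a stray closer, keep the
opener on the stack, carry on) is an assumption, which is exactly the
part that remains open above.

```python
from typing import List, Tuple

# Hypothetical tables for this sketch: closing bracket -> opening bracket,
# and opening bracket -> the name used in diagnostics.
PAIRS = {")": "(", "]": "[", "}": "{"}
NAMES = {"(": "round", "[": "square", "{": "curly"}


def check_brackets(text: str) -> List[str]:
    """Return diagnostics with 1-based positions for unbalanced brackets."""
    stack: List[Tuple[str, int]] = []  # unmatched openers with positions
    messages: List[str] = []
    for pos, ch in enumerate(text, start=1):
        if ch in NAMES:
            stack.append((ch, pos))
        elif ch in PAIRS:
            kind = NAMES[PAIRS[ch]]
            if not stack:
                messages.append(
                    f"the closing {kind} bracket at pos. {pos} "
                    "has no opening bracket"
                )
                continue
            open_ch, open_pos = stack[-1]
            if open_ch == PAIRS[ch]:
                stack.pop()  # proper match: wind up the opener
            else:
                # Assumed recovery: report, drop the closer, keep the opener.
                messages.append(
                    f"the closing {kind} bracket at pos. {pos} does not "
                    f"match the opening {NAMES[open_ch]} bracket "
                    f"at pos. {open_pos}"
                )
    # Whatever is left on the stack was never closed; report top first.
    for open_ch, open_pos in reversed(stack):
        messages.append(
            f"the opening {NAMES[open_ch]} bracket at pos. {open_pos} "
            "is never closed"
        )
    return messages


if __name__ == "__main__":
    for line in check_brackets("((}"):
        print(line)
```

For the short made-up input "((}" the first diagnostic has the quoted
shape, followed by two "never closed" reports. Note that the sketch only
diagnoses; it makes no attempt to guess what the author meant.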