comp.lang.ada
* [Slightly OT] How to process lightweight text markup languages?
@ 2015-01-18 18:04 Natasha Kerensikova
  2015-01-18 20:21 ` Dmitry A. Kazakov
  2015-01-21 14:54 ` Stephen Leake
  0 siblings, 2 replies; 27+ messages in thread
From: Natasha Kerensikova @ 2015-01-18 18:04 UTC (permalink / raw)


Hello,

I hope you'll forgive my straying slightly beyond the group topic,
since I'm asking about proper design that could be used regardless of
the target language (though I will implement it in Ada); I have the
feeling some regulars here can provide me with useful insights on the
matter.

I have been dealing with so-called "lightweight text markup languages"
for a while now, and despite having written numerous implementations in
various languages, I've recently come to the realization that I've only
ever used one design, which feels inadequate for the latest such
languages I'm trying to use.

So I will briefly describe what these languages look like, what my
current design looks like, why I feel it's not adequate for some of
these languages, how I believe others are dealing with them, and
finally why I don't like them.


The primary motivation behind "lightweight text markup languages" is to
define a language over raw ASCII or Unicode text to describe richly
formatted text (usually equivalent to a subset of HTML), while being
easy to learn, read and write in its raw or source form.

Most wikis use such languages, and other examples include
reStructuredText, mostly used in automatically-generated Python
documentation, and AsciiDoc, which aims to be equivalent to DocBook.

For some reason I still have trouble understanding, this easiness is
conflated with acceptable sloppiness. In effect, most of these languages
accept absolutely any input text, and if for example an opening symbol
doesn't have a matching closing symbol, instead of having the text
rejected with a syntax error, the opening symbol is merely considered as
implicitly escaped and treated as raw text. That point isn't bad in
itself, but it's the source of one of the main unresolved difficulties.
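
As a toy illustration of that implicit escaping, here is a stream-style
sketch in Ada for a single marker kind (backtick-delimited code spans,
as in Markdown); the HTML output and all names are just assumptions for
illustration:

   with Ada.Text_IO; use Ada.Text_IO;

   procedure Code_Spans is
      Input : constant String := "[alpha`]beta` and `unclosed";
      I     : Positive := Input'First;
   begin
      while I <= Input'Last loop
         if Input (I) = '`' then
            declare
               Closer : Natural := 0;
            begin
               for K in I + 1 .. Input'Last loop  --  unbounded look-ahead
                  if Input (K) = '`' then
                     Closer := K;
                     exit;
                  end if;
               end loop;
               if Closer /= 0 then
                  Put ("<code>" & Input (I + 1 .. Closer - 1) & "</code>");
                  I := Closer + 1;
               else
                  Put ('`');  --  no closer anywhere ahead: implicitly escaped
                  I := I + 1;
               end if;
            end;
         else
            Put (Input (I));
            I := I + 1;
         end if;
      end loop;
      New_Line;  --  prints "[alpha<code>]beta</code> and `unclosed"
   end Code_Spans;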


When writing code to read such input formats and produce HTML or some
other similarly-powerful format, I always instinctively fell back on the
same design, based on the following axioms (which I considered obvious
until I realized how seldom they are employed):

1. Input text and output text are conceptually of different,
incompatible types. So the only way some piece of input ends up in the
output is by going through a conversion function, which has to take care
at least of escaping. Even when the target language does not support
such typing, input and output are at least clearly separated in the
design. It helps prevent injection vulnerabilities, but I came to this
axiom mainly for clarity.

By separating the code dealing with input text further and further from
the code dealing with output, I got them so loosely coupled that
different output formats can easily be plugged behind the same input
code, which I consider a good thing.
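
As a minimal sketch of this axiom, assuming an HTML-like target (the
names are illustrative, not from any real library):

   with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;

   package Markup_Types is
      type Input_Text is new String;    --  raw source text
      type Output_Text is new String;   --  already-escaped target text

      function To_Output (Source : Input_Text) return Output_Text;
      --  The only way a piece of input can reach the output.
   end Markup_Types;

   package body Markup_Types is
      function To_Output (Source : Input_Text) return Output_Text is
         Result : Unbounded_String;
      begin
         for C of Source loop
            case C is
               when '&'    => Append (Result, "&amp;");
               when '<'    => Append (Result, "&lt;");
               when '>'    => Append (Result, "&gt;");
               when others => Append (Result, C);
            end case;
         end loop;
         return Output_Text (To_String (Result));
      end To_Output;
   end Markup_Types;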

2. Input and output are treated as streams, in that output is
append-only and input position only moves forward, at least one
character per cycle, though I often needed unbounded look-ahead. Again
this reduces the power of the code dealing with them, helping clarity,
and making termination proofs and memory bounds much easier.
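
In Ada terms, the whole processor would only ever see something like
these two operations (a sketch; the names are invented):

   generic
      with procedure Next (C : out Character; End_Of_Input : out Boolean);
      --  forward-only input cursor
      with procedure Emit (Chunk : in String);
      --  append-only output sink
   procedure Render;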

3. The lexer, parser and semantic analyzers are not well separated or
well defined, mostly due to ignorance on my part, but that might also be
one of the goals of the "lightweight" part. As I said, any text is
syntactically valid, and considering the input as a stream of code
points felt enough of a lexical analysis.


These axioms, while helpful for developing, are quite limiting on what
processing can easily be performed, and that usually shows in how
malformed input is handled. Most lightweight text markup languages
don't specify what output is expected on malformed input, so whatever is
easiest is usually fine, and that's how my code has been for years.

Then there was that CommonMark project, aiming at standardizing
Markdown, including how malformed text is expected to be interpreted, to
reduce surprises for users faced with ambiguous constructs.

As a simple example, let's consider Markdown inline code fragments,
which are delimited by backtick characters, like `this`, and a type of
links, where the link text is surrounded by square brackets, followed by
its target between parentheses, like [this](http://example.com/).
Now consider mixing them in the following sequence:
   [alpha`](http://example.com/)`](http://other.com)

It can be interpreted either as a link to example.com whose text
is "alpha`" with an implicitly escaped opening code span marker,
followed by some normal text, or as a link to other.com whose text
contains the code fragment "](http://example.com/)".

In my unbounded-look-ahead online parser, I have to make the decision
about which closing bracket ends the link text when encountering the
opening bracket, based only on the raw input text. So if I don't want to
mix code-fragment logic into the link logic, I will have to choose the
first closing bracket, and interpret the example the first way, without
a code fragment.

Unfortunately, CommonMark came up with the idea of "element precedence",
and code fragments somehow have precedence over link constructions, so
only the second interpretation of the example, with a code fragment and
a link to other.com, is valid.

So, lacking ideas on language processing, I turn to you for inspiration:
a design as clean as my original one, but able to deal with such rules.


After much thought, I came to the realization that lightweight text
markup processors don't actually follow my axioms 1 and 2. Instead, they
take input text as a mutable blob, and repeatedly apply transformations
on it.

To be honest, I had to study AsciiDoc, where such a model is almost
explicit in the language description, to realize it.

In that sense, it only means that code fragments are identified before
links, so at some point there is the following string in memory:
   [alpha<code>](http://example.com/)</code>](http://other.com)
or something like that.
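
For instance, a complete toy version of that first pass over the blob,
using the standard unbounded strings (the later passes for links and the
rest would be similar):

   with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
   with Ada.Text_IO;           use Ada.Text_IO;

   procedure Blob_Passes is
      Blob : Unbounded_String := To_Unbounded_String
        ("[alpha`](http://example.com/)`](http://other.com)");
      From        : Positive := 1;
      Open, Close : Natural;
   begin
      --  pass 1 of N: rewrite `...` spans in place
      loop
         Open := Index (Blob, "`", From);
         exit when Open = 0;
         Close := Index (Blob, "`", Open + 1);
         exit when Close = 0;
         Replace_Slice (Blob, Close, Close, "</code>");
         Replace_Slice (Blob, Open, Open, "<code>");
         From := Close + 12;   --  the two tags added 11 characters
      end loop;
      Put_Line (To_String (Blob));
      --  [alpha<code>](http://example.com/)</code>](http://other.com)
   end Blob_Passes;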

I don't like it, because I don't like blind text replacement, especially
when escaping is supposed to be a replacement like any other. Moreover,
it couples the input format and the output format much more tightly:
at what point in the transformation list should the escaping occur, when
you don't want to depend on what symbols need escaping? Escaping too
early risks mangling the symbols of your language, while escaping too
late risks escaping constructs built in previous rounds.

It feels like a sloppy process, the kind I find acceptable in one-shot
grep/sed sequences, but not as durable engineering.

Is there a way to implement these languages with proper engineering?


My latest attempts involve keeping the online architecture with separate
input and output types and streams, and keeping a stack of currently
opened constructs, with a dynamically dispatching ending test on each
character for each construct on the stack. It feels horribly
inefficient and complicated.

Or maybe giving up the online capability, making the assumption that any
reasonable input can fit in memory, building a serious
semantically-loaded token sequence, and building a tree from it. However,
it also feels horribly complicated compared to my earlier design, only
to resolve malformed input according to some "precedence" rather than
in text order.



Thanks in advance for your patience and advice,
Natasha


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-18 18:04 [Slightly OT] How to process lightweight text markup languages? Natasha Kerensikova
@ 2015-01-18 20:21 ` Dmitry A. Kazakov
  2015-01-19 11:09   ` G.B.
  2015-01-20 18:47   ` Natasha Kerensikova
  2015-01-21 14:54 ` Stephen Leake
  1 sibling, 2 replies; 27+ messages in thread
From: Dmitry A. Kazakov @ 2015-01-18 20:21 UTC (permalink / raw)


On Sun, 18 Jan 2015 18:04:08 +0000 (UTC), Natasha Kerensikova wrote:

[...] 

> My latest attempts involve keeping the online architecture with separate
> input and output types and streams, and keeping a stack of currently
> opened constructs, with a dynamically dispatching ending test on each
> character for each construct on the stack. It feels horribly
> inefficient and complicated.

Nothing complicated, and it is in fact most efficient, in the sense that,
depending on the formal language classification, you may not be able to
eliminate the stack (memory) in some form or other. You could think of it
in terms of possible states of the parser. If the number of states is
infinite you must have a stack, or else the parser itself must be
infinite. A simple example is parsing the language of balanced brackets:
()(()).
 
> Or maybe giving up the online capability, making the assumption that
> any reasonable input can fit in memory, building a serious
> semantically-loaded token sequence, and building a tree from it.

This is equivalent to having stacks.

It seems that the language of yours has brackets and prefix operators.
Nothing complicated, really. You push operands (terms) onto one stack.
Operators and brackets are pushed onto another stack. Precedence rules wind
up the second stack while popping operands from the first. This is a pretty
straightforward and simple technique.
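
A minimal sketch of this winding-up, with the output stream standing in
for the operand stack (the classic a+b*c, printed as postfix):

   with Ada.Text_IO; use Ada.Text_IO;

   procedure Tiny_Shunting is
      Input : constant String := "a+b*c";
      Ops   : String (Input'Range);   --  the operator stack
      Top   : Natural := 0;

      function Prec (Op : Character) return Natural is
        (case Op is when '*' => 2, when '+' => 1, when others => 0);
   begin
      for C of Input loop
         if C in 'a' .. 'z' then
            Put (C);   --  operands go straight to the output
         else
            --  wind up: pop operators of higher or equal precedence
            while Top > 0 and then Prec (Ops (Top)) >= Prec (C) loop
               Put (Ops (Top));
               Top := Top - 1;
            end loop;
            Top := Top + 1;
            Ops (Top) := C;
         end if;
      end loop;
      while Top > 0 loop   --  flush at end of input
         Put (Ops (Top));
         Top := Top - 1;
      end loop;
      New_Line;   --  prints "abc*+"
   end Tiny_Shunting;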

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-18 20:21 ` Dmitry A. Kazakov
@ 2015-01-19 11:09   ` G.B.
  2015-01-19 13:21     ` Dmitry A. Kazakov
  2015-01-20 18:47   ` Natasha Kerensikova
  1 sibling, 1 reply; 27+ messages in thread
From: G.B. @ 2015-01-19 11:09 UTC (permalink / raw)


On 18.01.15 21:21, Dmitry A. Kazakov wrote:
> This is a pretty
> straightforward and simple technique.

The trouble is with expectations:

Input:

  ((){)([()[[]])]

Typical parsers will respond with such useless results
as "error at EOF". Not something that a (close to)
natural language processor can afford, I think.

What is needed, maybe, is a way to judiciously transcend
the simplicity of a dumb, straightforward stack with REJECT
at the end. The processor should, after all, output text
that is maximally useful because this justifies the effort
in the first place. What syntactical criteria are there,
if any, that could be the input to finding these maxima?

Is context dependence required? A very simple example is
EOL, if there is one: one corrupted line of output is
better than all remaining lines corrupted.

Can "judicious" entail the use of special casing and be
done?

Would it be possible to describe a fixed point so that
the translation functions would close in on this best
result?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-19 11:09   ` G.B.
@ 2015-01-19 13:21     ` Dmitry A. Kazakov
  2015-01-19 16:58       ` G.B.
  0 siblings, 1 reply; 27+ messages in thread
From: Dmitry A. Kazakov @ 2015-01-19 13:21 UTC (permalink / raw)


On Mon, 19 Jan 2015 12:09:40 +0100, G.B. wrote:

> On 18.01.15 21:21, Dmitry A. Kazakov wrote:
>> This is a pretty straightforward and simple technique.
> 
> The trouble is with expectations:
> 
> Input:
> 
>   ((){)([()[[]])]
> 
> Typical parsers will respond with such useless results
> as "error at EOF". Not something that a (close to)
> natural language processor can afford, I think.

Not with the technique I described. In your example, the operator stack
will contain:

  (  at pos. 2   <--- stack top
  (  at pos. 1

when } will try to wind it up by popping the last unmatched (. Since } does
not match ( you will easily generate "the closing curly bracket at pos. 3
does not match the opening round bracket at pos. 2"

Your experience probably comes from grammar-generated parsers. The
straightforward technique is so much better for all practical purposes,
and for error message generation especially.
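
Here is a runnable sketch of that technique on your example input (the
diagnostic wording is approximate):

   with Ada.Text_IO; use Ada.Text_IO;

   procedure Check_Brackets is
      Input : constant String := "((){)([()[[]])]";

      type Open_Bracket is record
         Symbol : Character;
         Pos    : Positive;
      end record;
      Stack : array (Input'Range) of Open_Bracket;
      Top   : Natural := 0;

      function Matches (L, R : Character) return Boolean is
        ((L = '(' and R = ')') or (L = '[' and R = ']')
         or (L = '{' and R = '}'));
   begin
      for I in Input'Range loop
         case Input (I) is
            when '(' | '[' | '{' =>
               Top := Top + 1;
               Stack (Top) := (Input (I), I);
            when ')' | ']' | '}' =>
               if Top = 0 then
                  Put_Line ("unmatched " & Input (I) & " at pos."
                            & Integer'Image (I));
               elsif not Matches (Stack (Top).Symbol, Input (I)) then
                  Put_Line ("the closing " & Input (I) & " at pos."
                            & Integer'Image (I)
                            & " does not match the opening "
                            & Stack (Top).Symbol & " at pos."
                            & Integer'Image (Stack (Top).Pos));
                  Top := Top - 1;   --  pop anyway and keep going
               else
                  Top := Top - 1;   --  proper match
               end if;
            when others =>
               null;
         end case;
      end loop;
      for K in reverse 1 .. Top loop
         Put_Line ("unclosed " & Stack (K).Symbol & " at pos."
                   & Integer'Image (Stack (K).Pos));
      end loop;
   end Check_Brackets;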

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-19 13:21     ` Dmitry A. Kazakov
@ 2015-01-19 16:58       ` G.B.
  2015-01-19 17:58         ` Dmitry A. Kazakov
  2015-01-19 20:12         ` Randy Brukardt
  0 siblings, 2 replies; 27+ messages in thread
From: G.B. @ 2015-01-19 16:58 UTC (permalink / raw)


On 19.01.15 14:21, Dmitry A. Kazakov wrote:
> On Mon, 19 Jan 2015 12:09:40 +0100, G.B. wrote:
>
>> On 18.01.15 21:21, Dmitry A. Kazakov wrote:
>>> This is a pretty straightforward and simple technique.
>>
>> The trouble is with expectations:
>>
>> Input:
>>
>>    ((){)([()[[]])]
>>
>> Typical parsers will respond with such useless results
>> as "error at EOF". Not something that a (close to)
>> natural language processor can afford, I think.
>
> Not with the technique I described. In your example, the operator stack
> will contain:
>
>    (  at pos. 2   <--- stack top
>    (  at pos. 1
>
> when } will try to wind it up by popping the last unmatched (. Since } does
> not match ( you will easily generate "the closing curly bracket at pos. 3
> does not match the opening round bracket at pos. 2"

That's a possible answer, but may not be what should
have happened next if the brackets weren't tied together
properly and something is in need of recovery. See also
http://www.youtube.com/watch?v=cog2a3YeDMM

> Your experience probably comes from grammar-generated parsers. The
> straightforward technique is so much better for all practical purposes,
> and for error message generation especially.

Leaving some issues aside such as right brackets being far away,
or missing altogether, or superfluous due to having been placed
twice as in Natasha's example, or structured and misspelled, this
setup falls a little short of what is to be achieved. In
particular in a live system where there is no human involved,
something must be produced: If

  [alpha`]beta`

is a legitimate input, although possibly ungrammatical,
then what is to be produced?

A good translator needs to make the best of it. The
output should reflect the intention. That's only possible
when there is a likely, or legitimate interpretation,
as judged after the fact by readers of the output. What
they will recognize should be what the author had wanted
them to recognize.

If it was the writer's intention to write "`]", then the parser
must not touch the input and a non-translation is the best
solution. If not, then maybe error correction could switch
the positions of "`" and "]", maybe when looking ahead reveals
a likely match for "`". In any case, the input could be
shown alongside the translation, or at least be available
for checking.

I think the best solution is to come to terms with computers
and use them for text editing! Do not again start an even
more ad-hoc markup business than the one against which they
drew up GML in 1969.  I guess :-)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-19 16:58       ` G.B.
@ 2015-01-19 17:58         ` Dmitry A. Kazakov
  2015-01-20 14:41           ` Robert A Duff
  2015-01-19 20:12         ` Randy Brukardt
  1 sibling, 1 reply; 27+ messages in thread
From: Dmitry A. Kazakov @ 2015-01-19 17:58 UTC (permalink / raw)


On Mon, 19 Jan 2015 17:58:38 +0100, G.B. wrote:

> On 19.01.15 14:21, Dmitry A. Kazakov wrote:
>> On Mon, 19 Jan 2015 12:09:40 +0100, G.B. wrote:
>>
>>> On 18.01.15 21:21, Dmitry A. Kazakov wrote:
>>>> This is a pretty straightforward and simple technique.
>>>
>>> The trouble is with expectations:
>>>
>>> Input:
>>>
>>>    ((){)([()[[]])]
>>>
>>> Typical parsers will respond with such useless results
>>> as "error at EOF". Not something that a (close to)
>>> natural language processor can afford, I think.
>>
>> Not with the technique I described. In your example, the operator stack
>> will contain:
>>
>>    (  at pos. 2   <--- stack top
>>    (  at pos. 1
>>
>> when } will try to wind it up by popping the last unmatched (. Since } does
>> not match ( you will easily generate "the closing curly bracket at pos. 3
>> does not match the opening round bracket at pos. 2"
> 
> That's a possible answer, but may not be what should
> have happened next if the brackets weren't tied together
> properly and something is in need of recovery.

Of course you can do recovery of any kind: you could simply throw } away
and treat this as () instead, or alternatively you could treat it as {}.
You have all necessary information to make a choice.

I am not a fan of recovery; what GNAT does, for example, is quite
annoying to me. I prefer a full stop after the first error. But it is a
matter of taste.

>> Your experience probably comes from grammar-generated parsers. The
>> straightforward technique is so much better for all practical purposes,
>> and for error message generation especially.
> 
> Leaving some issues aside such as right brackets being far away,
> or missing altogether, or superfluous due to having been placed
> twice as in Natasha's example, or structured and misspelled, this
> setup falls a little short of what is to be achieved.

This is why there are lexical and syntactical elements. You don't deal with
syntax if keywords are mangled. Quite simple.

> In
> particular in a live system where there is no human involved,
> something must be produced: If
> 
>   [alpha`]beta`
> 
> is a legitimate input, although possibly ungrammatical,
> then what is to be produced?

`` are either paired brackets (syntax) or else quotation marks (a lexical
term). In the first case you get a bracket mismatch, in the second case
you get a missing quotation mark of a literal starting at pos.7.

> If it was the writer's intention to write "`]", then the parser
> must not touch the input and a non-translation is the best
> solution. If not, then maybe error correction could switch
> the positions of "`" and "]", maybe when looking ahead reveals
> a likely match for "`".

The parser does not guess anything. It simply parses. The failure of
generators lies in the wrong assumption that there is one language to
parse and anything not recognized is a fault. In reality there is a large
number of languages the parser must parse. Only one of these languages is
legal, e.g., Ada. All other languages the parser understands are variants
of broken Ada. The parser does not reject them; it translates them into a
set of error messages rather than into an AST.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-19 16:58       ` G.B.
  2015-01-19 17:58         ` Dmitry A. Kazakov
@ 2015-01-19 20:12         ` Randy Brukardt
  2015-01-19 21:37           ` gautier_niouzes
  2015-01-20 19:16           ` Natasha Kerensikova
  1 sibling, 2 replies; 27+ messages in thread
From: Randy Brukardt @ 2015-01-19 20:12 UTC (permalink / raw)


"G.B." <bauhaus@futureapps.invalid> wrote in message 
news:m9jd2u$j08$1@dont-email.me...
...
> Leaving some issues aside such as right brackets being far away,
> or missing altogether, or superfluous due to having been placed
> twice as in Natasha's example, or structured and misspelled, this
> setup falls a little short of what is to be achieved. In
> particular in a live system where there is no human involved,
> something must be produced:

No, this is a fallacy. The Internet (and the world in general) would be
better off if documents that are sufficiently malformed were simply not
displayed at all. In that case, the document source would be fixed if
anyone cared enough.

As it is, the attempts at autocorrection leave huge holes in the safety
of the Internet (thus the world at large, as the Internet has become so
integral). Malware filters cannot by their nature guess all of the ways
that some idiot tool will deal with malformed input, meaning that there
are always holes in the defenses to exploit.

And for what? So that people who can't be bothered to get their
documents formatted correctly still get some results (thus reinforcing
their bad behavior). Bah humbug!!

                              Randy.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-19 20:12         ` Randy Brukardt
@ 2015-01-19 21:37           ` gautier_niouzes
  2015-01-20  8:44             ` Dmitry A. Kazakov
                               ` (2 more replies)
  2015-01-20 19:16           ` Natasha Kerensikova
  1 sibling, 3 replies; 27+ messages in thread
From: gautier_niouzes @ 2015-01-19 21:37 UTC (permalink / raw)


On Monday, 19 January 2015 at 21:12:41 UTC+1, Randy Brukardt wrote:

> No, this is a fallacy. The Internet (and the world in general) would be
> better off if documents that are sufficiently malformed were simply not
> displayed at all. In that case, the document source would be fixed if
> anyone cared enough.

Alas, the forces of competition resulted in the opposite: if a browser doesn't display malformed HTML, users blame the browser and try another one. While developing Wasabee I was amazed by the proportion of sites with malformed HTML code. I guess it was always close to 100%, but over time it gets even closer. Chances are that only the W3 site is okay...

> As it is, the attempts at autocorrection leave huge holes in the safety
> of the Internet (thus the world at large, as the Internet has become so
> integral). Malware filters cannot by their nature guess all of the ways
> that some idiot tool will deal with malformed input, meaning that there
> are always holes in the defenses to exploit.

Only the input is idiotic. The tool has to be smart. See for instance the algorithm in Wasabee that corrects an ill-formed HTML tree. Same in real life: sometimes smart guys have to find solutions for idiots (customers, bosses,...)

> And for what? So that people who can't be bothered to get their
> documents formatted correctly still get some results (thus reinforcing
> their bad behavior). Bah humbug!!
> 
>                               Randy.

You'd like to start a new, parallel Web with only well-formatted documents?
_________________________ 
Gautier's Ada programming 
http://sf.net/users/gdemont

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-19 21:37           ` gautier_niouzes
@ 2015-01-20  8:44             ` Dmitry A. Kazakov
  2015-01-20 12:36               ` G.B.
  2015-01-20 20:36               ` Shark8
  2015-01-20 19:19             ` Natasha Kerensikova
  2015-01-20 21:43             ` Randy Brukardt
  2 siblings, 2 replies; 27+ messages in thread
From: Dmitry A. Kazakov @ 2015-01-20  8:44 UTC (permalink / raw)


On Mon, 19 Jan 2015 13:37:57 -0800 (PST), gautier_niouzes@hotmail.com
wrote:

> On Monday, 19 January 2015 at 21:12:41 UTC+1, Randy Brukardt wrote:
> 
>> No, this is a fallacy. The Internet (and the world in general) would be
>> better off if documents that are sufficiently malformed were simply not
>> displayed at all. In that case, the document source would be fixed if
>> anyone cared enough.
> 
> Alas, the forces of competition resulted in the opposite: if a browser
> doesn't display malformed HTML, users blame the browser and try another
> one.

In a hypothetical case of "doing things right" they would not be able to
deploy broken pages on the server side. Can you execute a malformed
executable?

The problem is not only the lack of an enforced standard but also the
whole concept of a directly interpreted weak language which could be
equally understood by a human and the machine. This is an inherently
flawed idea, as HTML, XML and scripting languages promptly prove. In the
end it is neither readable nor executable, with huge overhead for all
parties and no safety at all; plus, it is totally unmaintainable.

> Same in real
> life: sometimes smart guys have to find solutions for idiots (customers,
> bosses,...)

Yes, you could also put it this way: idiots are busy piling up work for
smart people...

> You'd like to start a new, parallel Web with only well-formatted documents?

If I had a say, I would begin with a file-/data-less typed OS. That would
preclude the very notion of "formatted document". It should be just
"document", a type with operations, no directly accessible data.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-20  8:44             ` Dmitry A. Kazakov
@ 2015-01-20 12:36               ` G.B.
  2015-01-20 13:14                 ` Dmitry A. Kazakov
  2015-01-20 20:36               ` Shark8
  1 sibling, 1 reply; 27+ messages in thread
From: G.B. @ 2015-01-20 12:36 UTC (permalink / raw)


On 20.01.15 09:44, Dmitry A. Kazakov wrote:
> Can you execute a malformed executable?

You can run an executable infected with a virus,
so form does not seem to help.

> If I had a say, I would begin with a file-/data-less typed OS. That would
> preclude the very notion of "formatted document". It should be just
> "document", a type with operations, no directly accessible data.

Where is your pessimism? Someone has to overcome HALT
to make this OS's type system sufficiently capable! ;-)




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-20 12:36               ` G.B.
@ 2015-01-20 13:14                 ` Dmitry A. Kazakov
  0 siblings, 0 replies; 27+ messages in thread
From: Dmitry A. Kazakov @ 2015-01-20 13:14 UTC (permalink / raw)


On Tue, 20 Jan 2015 13:36:50 +0100, G.B. wrote:

> On 20.01.15 09:44, Dmitry A. Kazakov wrote:
>> Can you execute a malformed executable?
> 
> You can run an executable infected with a virus,
> so form does not seem to help.

So what? This is still a well-formed program. The programmer's intention
is irrelevant. "Harmful" is not a language term. Nobody ever suggested
that harmful content could be stopped by [web-]language means. People do
not even agree on which content is harmful, as you surely have noticed
watching recent news.

>> If I had a say, I would begin with a file-/data-less typed OS. That would
>> preclude the very notion of "formatted document". It should be just
>> "document", a type with operations, no directly accessible data.
> 
> Where is your pessimism? Someone has to overcome HALT
> to make this OS's type system sufficiently capable! ;-)

How so? Nominal typing is nowhere close to the halting problem.
"Understanding" intentions, on the contrary, is more than just halting;
it is no less than full AI.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-19 17:58         ` Dmitry A. Kazakov
@ 2015-01-20 14:41           ` Robert A Duff
  0 siblings, 0 replies; 27+ messages in thread
From: Robert A Duff @ 2015-01-20 14:41 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> I am not a fan of recovery; what GNAT does, for example, is quite
> annoying to me. I prefer a full stop after the first error. But it is a
> matter of taste.

Try the -gnatm=1 switch.

- Bob

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-18 20:21 ` Dmitry A. Kazakov
  2015-01-19 11:09   ` G.B.
@ 2015-01-20 18:47   ` Natasha Kerensikova
  2015-01-20 19:44     ` Dmitry A. Kazakov
  2015-01-23 10:24     ` Stephen Leake
  1 sibling, 2 replies; 27+ messages in thread
From: Natasha Kerensikova @ 2015-01-20 18:47 UTC (permalink / raw)


On 2015-01-18, Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
> On Sun, 18 Jan 2015 18:04:08 +0000 (UTC), Natasha Kerensikova wrote:
>
> [...] 
>
>> My latest attempts involve keeping the online architecture with separate
>> input and output types and streams, and keeping a stack of currently
>> opened constructs, with a dynamically dispatching ending test on each
>> character for each construct on the stack. It feels horribly
>> inefficient and complicated.
>
> Nothing complicated, and it is in fact most efficient, in the sense
> that, depending on the formal language classification, you may not be
> able to eliminate the stack (memory) in some form or other. You could
> think of it in terms of possible states of the parser. If the number of
> states is infinite you must have a stack, or else the parser itself must
> be infinite. A simple example is parsing the language of balanced
> brackets: ()(()).

Well, it feels insurmountably complicated to me; that's why I posted in
the first place -- to be enlightened.

What I still can't make fit with what I know is how to deal
simultaneously with the precedence and the "implicit escaping", which is
further muddied by the fact that the interpretation of what is inside a
construct depends on the particular current construct.

To put it in a grammar-like way (even though I doubt the considered
language has a grammar), I would have something like:

   formatted-text ::= code-fragment | link | ... | literal-text
   code-fragment ::= '`' literal-text '`'
   link ::= '[' formatted-text ']' '(' url [ link-title ] ')'
   link-title ::= '"' literal-text '"'

So if you remember my example,

   [alpha`](http://example.com/)`](http://other.com)

the part "](http://example.com/)" matches the end of a link, but only in
a formatted-text context, not inside a code fragment.

Therefore, considering an online algorithm, when the part
"[alpha`](http://example.com/)" has been read, there is no way to decide
whether it's a link or not, until the closing backtick is encountered,
or until there is proof there is no closing backtick, by encountering
the end of the text (or of the enclosing even-higher-precedence
structure).

This led me to the conclusion that the only solutions are backtracking
or simultaneously parsing all available possibilities (i.e. using a
nondeterministic automaton). Considering the current state of computing,
I should probably go for backtracking.

Basic backtracking would be pushing the current state when encountering
an opening marker, and if the end is reached with a non-empty stack,
popping the top-most state, replacing the opening marker by the literal
sequence of its representation, and restarting parsing.

If I'm not mistaken, adding precedences to the mix would just change the
meaning of "end is reached" in the previous paragraph: it would not only
mean the end of input, but also the ending marker of any
higher-precedence construct currently in the stack.

Am I right so far? Am I missing something?


Thanks for your help,
Natasha

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-19 20:12         ` Randy Brukardt
  2015-01-19 21:37           ` gautier_niouzes
@ 2015-01-20 19:16           ` Natasha Kerensikova
  1 sibling, 0 replies; 27+ messages in thread
From: Natasha Kerensikova @ 2015-01-20 19:16 UTC (permalink / raw)


On 2015-01-19, Randy Brukardt <randy@rrsoftware.com> wrote:
> No, this is a fallacy. The Internet (and the world in general) would be
> better off if documents that are sufficiently malformed were simply not
> displayed at all. In that case, the document source would be fixed if
> anyone cared enough.
>
> As it is, the attempts at autocorrection leave huge holes in the safety
> of the Internet (thus the world at large, as the Internet has become so
> integral). Malware filters cannot by their nature guess all of the ways
> that some idiot tool will deal with malformed input, meaning that there
> are always holes in the defenses to exploit.
>
> And for what? So that people who can't be bothered to get their
> documents formatted correctly still get some results (thus reinforcing
> their bad behavior). Bah humbug!!

I wholeheartedly agree, and I would have said as much myself, but...

... but I take the world as it is, and I won't go on a crusade against
behaviors I consider bad.

In that particular case, I'm taking an existing corpus and an existing
user base, so I feel my engineering duty is to not cause a regression,
even when doing so promotes sloppiness.

And even more so when my contribution (or lack thereof) to sloppiness
promotion is such a tiny droplet in a markdown-powered ocean of
stack-overflow, reddit, github, etc.

I'm aware of how anti-Ada it is to cater to the majority's sloppiness, but
I do have different standards for different areas of my life.


Natasha


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-19 21:37           ` gautier_niouzes
  2015-01-20  8:44             ` Dmitry A. Kazakov
@ 2015-01-20 19:19             ` Natasha Kerensikova
  2015-01-20 21:43             ` Randy Brukardt
  2 siblings, 0 replies; 27+ messages in thread
From: Natasha Kerensikova @ 2015-01-20 19:19 UTC (permalink / raw)


Hello,

On 2015-01-19, gautier_niouzes@hotmail.com <gautier_niouzes@hotmail.com> wrote:
>> And for what? So that people who can't be bothered to get their
>> documents formatted correctly still get some results (thus reinforcing
>> their bad behavior). Bah humbug!!
>
> You'd like to start a new, parallel Web with only well-formatted documents?

How about we take over the gophersphere and declare it a safe haven for
well-formatted documents and proper engineering?

I wonder whether Ada enthusiasts are numerous enough to actually carry
weight in the gophersphere...


Natasha

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-20 18:47   ` Natasha Kerensikova
@ 2015-01-20 19:44     ` Dmitry A. Kazakov
  2015-01-20 22:00       ` Randy Brukardt
  2015-01-23 10:24     ` Stephen Leake
  1 sibling, 1 reply; 27+ messages in thread
From: Dmitry A. Kazakov @ 2015-01-20 19:44 UTC (permalink / raw)


On Tue, 20 Jan 2015 18:47:13 +0000 (UTC), Natasha Kerensikova wrote:

> On 2015-01-18, Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
>> On Sun, 18 Jan 2015 18:04:08 +0000 (UTC), Natasha Kerensikova wrote:
>>
>> [...] 
>>
>>> My latest attempts involve keeping the online architecture with separate
>>> input and output types and streams, and keeping a stack of currently
>>> opened constructs, with a dynamically dispatching ending test on each
>>> character for each construct on the stack. It feels horribly
>>> inefficient and complicated.
>>
>> Nothing complicated, and it is in fact most efficient, in the sense
>> that, depending on the formal language classification, you may not be
>> able to eliminate the stack (memory) in some form or other. You could
>> think of it in terms of possible states of the parser. If the number of
>> states is infinite you must have a stack, or else the parser itself must
>> be infinite. A simple example is parsing the language of balanced
>> brackets: ()(()).
> 
> Well, it feels insurmountably complicated to me; that's why I posted in
> the first place -- to be enlightened.

Nothing complicated to me, so far.

> What I still can't make fit with what I know is how to deal
> simultaneously with the precedence and the "implicit escaping", which is
> further muddied by the fact that the interpretation of what is inside a
> construct depends on the particular current construct.
> 
> To put it in a grammar-like way (even though I doubt the considered
> language has a grammar), I would have something like:
> 
>    formatted-text ::= code-fragment | link | ... | literal-text
>    code-fragment ::= '`' literal-text '`'
>    link ::= '[' formatted-text ']' '(' url [ link-title ] ')'
>    link-title ::= '"' literal-text '"'

[] - brackets
() - brackets
`` - quotation marks
"" - quotation marks

> So if you remember my example,
> 
>    [alpha`](http://example.com/)`](http://other.com)

Since `` are quotation marks, the above should be:

  +
  |_ []
  |   |_ +
  |       |_ alpha
  |       |_ ](http://example.com/)
  |_ ()
       |_ http://other.com

+ is an assumed infix catenation operation. No backtracking needed.

[...]
> Am I right so far? Am I missing something?

Distinguishing lexical and syntactical elements? You don't bother with
operators until expression terms (lexemes) are matched. Once you have
matched them you never go back. They are all on the stack if not already
bound by an operation. If `` is declared a literal, it is a term of the
expression, atomically matched. It naturally takes precedence over
anything else.
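
In code, atomically matching the `` literal is just this kind of scan
(a sketch; the name is invented):

   --  Assumes S (From) = '`'. Returns the index just past the closing
   --  backtick, or 0 when unterminated (the ` is then ordinary text).
   function Scan_Code_Literal (S : String; From : Positive) return Natural is
   begin
      for J in From + 1 .. S'Last loop
         if S (J) = '`' then
            return J + 1;
         end if;
      end loop;
      return 0;
   end Scan_Code_Literal;

Nothing between the two backticks can ever be seen by the syntax stage.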

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-20  8:44             ` Dmitry A. Kazakov
  2015-01-20 12:36               ` G.B.
@ 2015-01-20 20:36               ` Shark8
  2015-01-20 21:16                 ` Dmitry A. Kazakov
  1 sibling, 1 reply; 27+ messages in thread
From: Shark8 @ 2015-01-20 20:36 UTC (permalink / raw)


On 20-Jan-15 01:44, Dmitry A. Kazakov wrote:
>
> If I had a say, I would begin with a file-/data-less typed OS. That would
> preclude the very notion of "formatted document". It should be just
> "document", a type with operations, no directly accessible data.

Oh, you mean Oberon.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-20 20:36               ` Shark8
@ 2015-01-20 21:16                 ` Dmitry A. Kazakov
  2015-01-20 22:55                   ` J-P. Rosen
  0 siblings, 1 reply; 27+ messages in thread
From: Dmitry A. Kazakov @ 2015-01-20 21:16 UTC (permalink / raw)


On Tue, 20 Jan 2015 13:36:36 -0700, Shark8 wrote:

> On 20-Jan-15 01:44, Dmitry A. Kazakov wrote:
>>
>> If I had a say, I would begin with a file-/data-less typed OS. That would
>> preclude the very notion of "formatted document". It should be just
>> "document", a type with operations, no directly accessible data.
> 
> Oh, you mean Oberon.

No, I don't. I mean an OO OS with memory-mapped persistent objects. No
files, no I/O, no network.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-19 21:37           ` gautier_niouzes
  2015-01-20  8:44             ` Dmitry A. Kazakov
  2015-01-20 19:19             ` Natasha Kerensikova
@ 2015-01-20 21:43             ` Randy Brukardt
  2 siblings, 0 replies; 27+ messages in thread
From: Randy Brukardt @ 2015-01-20 21:43 UTC (permalink / raw)



<gautier_niouzes@hotmail.com> wrote in message 
news:a26a1003-dff6-4f07-90e7-b9aa041ad1bd@googlegroups.com...
>On Monday, 19 January 2015 at 21:12:41 UTC+1, Randy Brukardt wrote:
>
>> No, this is a fallacy. The Internet (and the world in general) would be
>> better off if documents that are sufficiently malformed were simply not
>> displayed at all. In that case, the document source would be fixed if
>> anyone cared enough.
>
>Alas, the forces of competition resulted in the opposite: if a browser
>doesn't display malformed HTML, users blame the browser and try another
>one.

You're right today, of course, but it didn't have to be that way. And 
shouldn't have been...

>While developing Wasabee I was amazed by the proportion of sites with
>malformed HTML code. I guess it was always close to 100%, but over time
>it gets even closer. Chances are that only the W3 site is okay...

We checked all of the pages on the previous AdaIC site through the W3C
tools. (Us Ada people care about Standards, you see.) We didn't
intentionally post anything with errors (although it happened a few
times). So there were at least two such sites. I'm pretty sure that the
various Ada standards and other documents produced by our tools are also
correct (I've checked them periodically).

...
>Only the input is idiotic. The tool has to be smart. See for instance
>the algorithm in Wasabee that corrects an ill-formed HTML tree. Same in
>real life: sometimes smart guys have to find solutions for idiots
>(customers, bosses,...)

It's impossible to guess all of the ways that some "smart" tool will 
interpret malformed input, and indeed those ways can be different for 
different tools. That provides a near-infinite number of ways for bad guys 
to sneak junk past one's defenses.

>> And for what? So that people who can't be bothered to get their
>> documents formatted correctly still get some results (thus reinforcing
>> their bad behavior). Bah humbug!!
>
>You'd like to start a new, parallel Web with only well-formatted
>documents?

Sounds good to me. I pretty much do that for e-mail today, and that 
certainly helps prevent spam from being delivered. I'd do it for the web, 
too, if I could figure out how.

                               Randy.





^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-20 19:44     ` Dmitry A. Kazakov
@ 2015-01-20 22:00       ` Randy Brukardt
  2015-01-22 13:41         ` Natasha Kerensikova
  0 siblings, 1 reply; 27+ messages in thread
From: Randy Brukardt @ 2015-01-20 22:00 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net...
> On Tue, 20 Jan 2015 18:47:13 +0000 (UTC), Natasha Kerensikova wrote:
>
>> On 2015-01-18, Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
>>> On Sun, 18 Jan 2015 18:04:08 +0000 (UTC), Natasha Kerensikova wrote:
>>>
>>> [...]
>>>
>>>> My latest attempts involve keeping the online architecture with 
>>>> separate
>>>> input and output types and streams, and keeping a stack of currently
>>>> opened constructs, with a dynamically dispatching ending test on each
>>>> character for each construct on the stack. It feels horribly
>>>> inefficient and complicated.
>>>
>>> Nothing complicated, and it is in fact most efficient, in the sense
>>> that, depending on the formal language classification, you may not be
>>> able to eliminate the stack (memory) in some form or other. You could
>>> think of it in terms of possible states of the parser. If the number
>>> of states is infinite you must have a stack, or else the parser itself
>>> must be infinite. A simple example is parsing the language of balanced
>>> brackets: ()(()).
>>
>> Well, it feels insurmountably complicated to me; that's why I posted in
>> the first place -- to be enlightened.
>
> Nothing complicated to me, so far.
>
>> What I still can't make fit with what I know is how to deal
>> simultaneously with the precedence and the "implicit escaping", which is
>> further muddied by the fact that the interpretation of what is inside a
>> construct depends on the particular current construct.
>>
>> To put it in a grammar-like way (even though I doubt the considered
>> language has a grammar), I would have something like:
>>
>>    formatted-text ::= code-fragment | link | ... | literal-text
>>    code-fragment ::= '`' literal-text '`'
>>    link ::= '[' formatted-text ']' '(' url [ link-title ] ')'
>>    link-title ::= '"' literal-text '"'
>
> [] - brackets
> () - brackets
> `` - quotation marks
> "" - quotation marks
>
>> So if you remember my example,
>>
>>    [alpha`](http://example.com/)`](http://other.com)
>
> Since `` are quotation marks, the above should be:
>
>  +
>  |_ []
>  |   |_ +
>  |       |_ alpha
>  |       |_ ](http://example.com/)
>  |_ ()
>       |_ http://other.com
>
> + is an assumed infix catenation operation. No backtracking needed.
>
> [...]
>> Am I right so far? Am I missing something?
>
> Distinguishing lexical and syntactical elements? You don't bother with
> operators until expression terms (lexemes) are matched. Once you have
> matched them you never go back. They are all on the stack if not already
> bound by an operation. If `` is declared a literal, it is a term of the
> expression, atomically matched. It naturally takes precedence over
> anything else.

I agree with Dmitry; "standard" parsing has two stages, lexing
(converting into a token stream) and parsing. You're trying to make do
with only one, which complicates the problem a lot for no reason.

Also note that an LR parser acts similarly to your "parsing all 
possibilities at once". The parser state encodes all of the possibilities at 
the current point in the parsing, so it generally can handle quite a bit of 
complication. LR parsers are usually generated by a tool, and thus if there 
is not a unique solution, that's determined at the time of parser 
generation. As Dmitry says, such parsers (like the one we use in Janus/Ada) 
make it more challenging to deal with error correction (the parser generator 
we use originated as a research project into automated error correction --  
we don't use that error correction, which tells you how well that worked 
:-), but they can be quite small and very fast (especially on larger 
grammars like Ada's).

                                    Randy.






^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-20 21:16                 ` Dmitry A. Kazakov
@ 2015-01-20 22:55                   ` J-P. Rosen
  2015-01-21  8:35                     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 27+ messages in thread
From: J-P. Rosen @ 2015-01-20 22:55 UTC (permalink / raw)


On 20/01/2015 22:16, Dmitry A. Kazakov wrote:
> No, I don't. I mean an OO OS with memory-mapped persistent objects. No
> files, no I/O, no network.

Multics?

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-20 22:55                   ` J-P. Rosen
@ 2015-01-21  8:35                     ` Dmitry A. Kazakov
  0 siblings, 0 replies; 27+ messages in thread
From: Dmitry A. Kazakov @ 2015-01-21  8:35 UTC (permalink / raw)


On Tue, 20 Jan 2015 23:55:12 +0100, J-P. Rosen wrote:

> On 20/01/2015 22:16, Dmitry A. Kazakov wrote:
>> No, I don't. I mean an OO OS with memory-mapped persistent objects. No
>> files, no I/O, no network.
> 
> Multics?

Anybody to suggest TOS/360? Same age as Multics... (:-))

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-18 18:04 [Slightly OT] How to process lightweight text markup languages? Natasha Kerensikova
  2015-01-18 20:21 ` Dmitry A. Kazakov
@ 2015-01-21 14:54 ` Stephen Leake
  1 sibling, 0 replies; 27+ messages in thread
From: Stephen Leake @ 2015-01-21 14:54 UTC (permalink / raw)


Natasha Kerensikova <lithiumcat@instinctive.eu> writes:

> I hope you'll forgive my straying slightly beyond the group topic,

I appreciate the interesting subject (but then, I'm into parsers
http://stephe-leake.org/ada/opentoken.html :).

> As a simple example, let's consider Markdown inline code fragments,
> which are delimited by backtick characters, like `this`, and a type of
> links, where the link text is surrounded by square brackets, followed by
> its target between parentheses, like [this](http://example.com/).
> Now consider mixing them in the following sequence:
>    [alpha`](http://example.com/)`](http://other.com)
>
> It can be interpreted either as a link to example.com whose text
> is "alpha`" with an implicitly escaped opening code span marker,
> followed by some normal text, or as a link to other.com whose text
> contains the code fragment "](http://example.com/)".
>
> In my unbounded-look-ahead online parser, I have to make the decision
> about which closing bracket ends the link text when encountering the
> opening bracket, based only on the raw input text. So if I don't want to
> mix code-fragment logic into the link logic, I will have to choose the
> first closing bracket, and interpret the example the first way, without
> a code fragment.
>
> Unfortunately, CommonMark came up with the idea of "element precedence",
> and code fragments somehow have precedence over link constructions, so
> only the second interpretation of the example, with a code fragment and
> a link to other.com, is valid.
>
> So, lacking ideas on language processing, I turn to you for inspiration:
> a design as clean as my original one, but able to deal with such rules.

This is an example of coupling the lexer with the parser; normally the
parser would enforce precedence rules, but in this case that affects how
the code is lexed.

> After much thought, I came to the realization that lightweight text
> markup processors don't actually follow my axioms 1 and 2. Instead, they
> take input text as a mutable blob, and repeatedly apply transformations
> on it.
>
> In that sense, it only means that code fragments are identified before
> links, so at some point there is the following string in memory:
>    [alpha<code>](http://example.com/)</code>](http://other.com)

No, you don't need to transform the input text itself. You do need to
mess with the token stream returned by the lexer. The two
interpretations could be:

 LEFT_BRACKET STRING "alpha" ESCAPED_BACKTICK RIGHT_BRACKET  LINK
 "http://example.com/" STRING "`](http://other.com)"

 LEFT_BRACKET STRING "alpha" CODE "](http://example.com/)" RIGHT_BRACKET
 STRING "(http://other.com)"
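
In Ada, such a token stream could be modeled roughly like this (the
names follow the listing above and are otherwise invented):

   type Token_Kind is (LEFT_BRACKET, RIGHT_BRACKET, LINK, CODE,
                       ESCAPED_BACKTICK, STRING_TOKEN);

   type Token is record
      Kind        : Token_Kind;
      First, Last : Natural;   --  slice of the input buffer with the value
   end record;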

One possible implementation:

    lexer returns the first interpretation

    parser realizes it is wrong, tells lexer to re-interpret
    ESCAPED_BACKTICK as CODE, and reparses.

That requires push-back of tokens from the parser to the lexer; messy. I
don't think OpenToken can do this currently.

Another choice is for the lexer to fork, and return both interpretations
in parallel; the parser would also fork. That's similar to the
generalized LR parser (http://en.wikipedia.org/wiki/GLR_parser), which
the current version of the OpenToken parser supports (but not the
lexer). When you get to the closing backtick, the first parser reports
an error and closes, and the second parser keeps going.

> I don't like it, because I don't like blind text replacement, especially
> when escaping is supposed to be a replacement like any other. Moreover,
> it couples the input format and the output format much more tightly:
> at what point in the transformation list should the escaping occur, when
> you don't want to depend on what symbols need escaping? Escaping too
> early risks mangling the symbols of your language, while escaping too
> late risks escaping constructs built in previous rounds.

Yes, backtracking is a pain. The parallel lexer/parser is simpler, at
the potential cost of being slower and taking more memory. I'm using GLR
in the Emacs Ada mode, and I got exponential explosions in the number of
parsers a couple of times; I had to clean up the grammar to avoid that.

> Is there a way to implement these languages with proper engineering?

I would pursue the parallel lexer/parser approach.

> My latest attempts involve keeping the online architecture with separate
> input and output types and streams, and keeping a stack of currently
> opened constructs, with a dynamically dispatching ending test on each
> character for each construct on the stack. It feels horribly
> inefficient and complicated.

Well, the problem is complicated, so that's not necessarily bad.

Parsing "natural" language is a complex problem. Structured languages,
like Ada, reduce parser complexity by forcing the user to structure the
input. CommonMark is attempting to remove that burden from the user, which
means the parser has to get more complex.

> Or maybe giving up the online capability, making the assumption that
> any reasonable input can fit in memory, building a serious
> semantically-loaded token sequence, and building a tree from it.

Keeping the input in memory is probably practical; it is assumed in
Emacs, for example. Unless you have an explicit requirement to deal with
text > 100 MB, it's a reasonable approach.

-- 
-- Stephe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-20 22:00       ` Randy Brukardt
@ 2015-01-22 13:41         ` Natasha Kerensikova
  2015-01-22 18:38           ` Dmitry A. Kazakov
  0 siblings, 1 reply; 27+ messages in thread
From: Natasha Kerensikova @ 2015-01-22 13:41 UTC (permalink / raw)


Hello,

On 2015-01-20, Randy Brukardt <randy@rrsoftware.com> wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
> news:1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net...
>> On Tue, 20 Jan 2015 18:47:13 +0000 (UTC), Natasha Kerensikova wrote:
>>> So if you remember my example,
>>>
>>>    [alpha`](http://example.com/)`](http://other.com)
>>
>> Since `` are quotation marks, the above should be:
>>
>>  +
>>  |_ []
>>  |   |_ +
>>  |       |_ alpha
>>  |       |_ ](http://example.com/)
>>  |_ ()
>>       |_ http://other.com
>>
>> + is an assumed infix catenation operation. No backtracking needed.

Maybe I got the vocabulary wrong, but the tree above is exactly what I
expect as an output for the whole parser. Except I don't need it to be
stored as a tree, if it's only an iterator or an event driver with a
callback collection, that's fine for me too.

However I'm still completely lost on how to reach that point.

Would you have a link that explains that? Or keywords to feed a search
engine?

>> [...]
>>> Am I right so far? Am I missing something?
>>
>> Distinguishing lexical and syntactical elements? You don't bother with
>> operators until expression terms (lexemes) are matched. Once you have
>> matched them you never go back. They are all on the stack if not already
>> bound by an operation. If `` is declared a literal, it is a term of the
>> expression, atomically matched. It naturally takes precedence over
>> anything else.
>
> I agree with Dmitry; "standard" parsing has two stages, lexing
> (converting into a token stream) and parsing. You're trying to make do
> with only one, which complicates the problem a lot for no reason.

Dmitry's answer led me to believe that he was claiming an
online algorithm (i.e. operating on stream input and output) is possible,
provided there are internal stacks. So I followed that idea, because I
like online algorithms, for the same reason I like implementations
without explicit memory allocation: it's easier to understand and reason
about, but it's too limited for some situations.

So if there has to be two stages, that's fine, I'll go with two stages.

The next problem is that I still have no clue about what these two
stages are. The markers that begin or end elements depend on the
context, and knowing both what the markers are and what the context is
amounts to having the full parser. So I can't imagine what data is
produced by the first stage to be fed into the second.

For example, a slight variation on my example above yields a completely
different result:

   [alpha`](http://example.com/)'](http://other.com)

+
|_ +
|   |_ []
|   |   |_ alpha`
|   |_ ()
|       |_ http://example.com/
|_ '](http://other.com)

This means that the parser can't know how to parse the stuff around
"example" unless it knows whether the backtick after "alpha" is matched
or not. So determining whether the "](" in there is a lexical element
or part of a normal string means having completely parsed that backtick.

Unless, again, I'm missing something. But I'm starting to wonder whether
I'm too ignorant and/or you too knowledgeable to bridge the conceptual
gap between us.

> Also note that an LR parser acts similarly to your "parsing all 
> possibilities at once".

Thanks a lot. The keyword here is duly noted for future research.

>             As Dmitry says, such parsers (like the one we use in Janus/Ada) 
> make it more challenging to deal with error correction

The thing is that in these languages, there is really no error. Take the
unmatched tick in my second example, or the last bracket in the first
one: if it doesn't fit a rule, then it's literal. I guess that would
look like a catch-all rule at the end of a list of alternatives. That's
also the reason why I'm not completely sure these languages can be
described by a (mathematical) grammar, but I'll admit my ignorance of
the practical uses of such algebraic constructs.




Thanks for trying to help me,
Natasha


* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-22 13:41         ` Natasha Kerensikova
@ 2015-01-22 18:38           ` Dmitry A. Kazakov
  2015-01-22 21:48             ` Randy Brukardt
  0 siblings, 1 reply; 27+ messages in thread
From: Dmitry A. Kazakov @ 2015-01-22 18:38 UTC (permalink / raw)


On Thu, 22 Jan 2015 13:41:12 +0000 (UTC), Natasha Kerensikova wrote:

> On 2015-01-20, Randy Brukardt <randy@rrsoftware.com> wrote:
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
>> news:1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net...
>>> On Tue, 20 Jan 2015 18:47:13 +0000 (UTC), Natasha Kerensikova wrote:
>>>> So if you remember my example,
>>>>
>>>>    [alpha`](http://example.com/)`](http://other.com)
>>>
>>> Since `` are quotation marks, the above should be:
>>>
>>>  +
>>>  |_ []
>>>  |   |_ +
>>>  |       |_ alpha
>>>  |       |_ ](http://example.com/)
>>>  |_ ()
>>>       |_ http://other.com
>>>
>>> + is an assumed infix catenation operation. No backtracking needed.
> 
> Maybe I've got the vocabulary wrong, but the tree above is exactly what
> I expect as the output of the whole parser. Except that I don't need it
> to be stored as a tree; if it's only an iterator, or an event driver
> with a collection of callbacks, that's fine for me too.

The tree is merely an unambiguous representation of the result of the
parsing. Whether the parser's callback actually generates a tree node or
does something else is up to you.

> However I'm still completely lost on how to reach that point.
> 
> Would you have a link that explains that? Or keywords to feed a search
> engine?

http://www.dmitry-kazakov.de/ada/components.htm#Parsers_etc

The method is described in

"Compiler Construction for Digital Computers" by D. Gries, if I correctly
remember.

> Dmitry's answer led me to believe that he was claiming an online
> algorithm (i.e. one operating on stream input and output) is possible,
> provided there are internal stacks. So I followed that idea, because I
> like online algorithms for the same reason I like implementations
> without explicit memory allocation: they are easier to understand and
> reason about, even though they are too limited for some situations.

You can allocate memory for stacks statically. The point is that a
certain class of formal languages requires memory to be recognized. You
can convert memory into parser states and states into memory, but you
must have it.
 
> So if there has to be two stages, that's fine, I'll go with two stages.

Note that this does not mean that the second stage must start after the
source ends. You can resume that stage right after each lexeme is
recognized. Only early compilers used to produce a file after each stage.

> The next problem is that I still have no clue about what these two
> stages are. The markers that begin or end elements depends on the
> context, and knowing both what are markers and in what context is having
> the full parser. So I can't imagine what data is produced by the first
> stage to be fed into the second.

The division into syntax elements and lexemes is somewhat arbitrary; the
motivation behind having them is that lexemes can be recognized
independently of the context. Syntax elements interact with each other,
e.g. in a+b*c the operands are bound according to precedence. The next
stage, semantic analysis, is even more "ambiguous".
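
For instance, for a+b*c the lexer just emits the five lexemes a, +, b,
*, c in order, without caring what surrounds them; only the parser,
which knows the precedences, groups them as:

   +
   |_ a
   |_ *
       |_ b
       |_ c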

> For example, a slight variation on my example above yields a completely
> different result:
> 
>    [alpha`](http://example.com/)'](http://other.com)
> 
> +
> |_ +
> |   |_ []
> |   |   |_ alpha`
> |   |_ ()
> |       |_ http://example.com/
> |_ '](http://other.com)
> 
> This means that the parser can't know how to parse the stuff around
> "example" unless it knows whether the backtick after "alpha" is matched
> or not.

Why should it care? The lexer simply goes to the first delimiter where it
would stop anyway. E.g. in Ada, if you write

    X := "hello;

that delimiter is the line end. Similarly in C, unless it hits \ before
the line end. At that point you decide how to react: you can stop at
once, or assume that the literal ends there and continue.

> So determining whether the "](" in there is a lexical element
> or part of a normal string means having completely parsed that backtick.

It is part of the literal's body, provided that `xxx` is a literal. If
you are not satisfied with that choice, make it a bracket instead; that
would make sense for, e.g., Linux bash.
 
> Unless, again, I'm missing something. But I'm starting to wonder whether
> I'm too ignorant and/or you too knowledgeable to bridge the conceptual
> gap between us.

I think you are overdesigning it. The lexer does not need to know
anything. It simply matches a lexeme and moves on to the next one. A
lexeme in your case could be

  [
  ]
  (
  )
  `anything-in-between`
  a-sequence-of-other-characters

When the lexer sees ` it moves forward accumulating the literal (which may
have some escaping) until the first non-escaped `. The accumulated body of
the literal feeds the second stage. The lexer can accumulate it straight on
the stack of arguments.
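
In Ada terms, a minimal sketch of that case could be the following
(the Token record and all names are invented for illustration, and
escaping inside the span is ignored):

   type Token_Kind is
     (Open_Bracket,   --  [
      Close_Bracket,  --  ]
      Open_Paren,     --  (
      Close_Paren,    --  )
      Code_Span,      --  `anything-in-between`
      Text);          --  a-sequence-of-other-characters

   type Token is record
      Kind        : Token_Kind;
      First, Last : Positive;  --  bounds of the lexeme in the input
   end record;

   --  On `, scan to the first closing ` and emit the whole span as
   --  one token.  With no closing `, fall back to literal text, per
   --  the "no error" rule of these languages.
   function Scan_Code_Span
     (Input : String; Start : Positive) return Token is
   begin
      for Pos in Start + 1 .. Input'Last loop
         if Input (Pos) = '`' then
            return (Kind => Code_Span, First => Start, Last => Pos);
         end if;
      end loop;
      return (Kind => Text, First => Start, Last => Start);
   end Scan_Code_Span;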

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-22 18:38           ` Dmitry A. Kazakov
@ 2015-01-22 21:48             ` Randy Brukardt
  0 siblings, 0 replies; 27+ messages in thread
From: Randy Brukardt @ 2015-01-22 21:48 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:lskzqkn5ssua$.zt9p9m6m2yro$.dlg@40tude.net...
> On Thu, 22 Jan 2015 13:41:12 +0000 (UTC), Natasha Kerensikova wrote:
...
>> Unless, again, I'm missing something. But I'm starting to wonder whether
>> I'm too ignorant and/or you too knowledgeable to bridge the conceptual
>> gap between us.
>
> I think you are overdesigning it. The lexer does not need to know
> anything. It simply matches a lexeme and moves on to the next one. A
> lexeme in your case could be
...

I agree, the lexical level should be simple and as context-free as possible. 
And it may need some amount of lookahead. If you see a `, for instance, you 
may want to scan ahead until the natural end (whatever that is) to see if 
there is another ` (if not, then you just return the `). I wouldn't worry 
about pathological programs that have 10 million characters between the 
`s -- in most circumstances, that's just going to be a stand-alone ` 
followed by the next one in the markup. So even if there is no limit in the 
specification, I would add one just to prevent quoting the entire text when 
one quote mark is left out. (Ada uses line ends for this purpose, and it 
allows limiting the line length, so the maximum lookahead is the maximum 
line length. I suggest something similar in your case.)
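
As a sketch (the cap of 1000 characters here is invented; pick
whatever matches your line-length policy):

   Max_Lookahead : constant := 1_000;  --  invented cap

   --  Is there a closing ` within the lookahead window?  If not,
   --  the caller just returns the ` as a stand-alone token.
   function Has_Closing_Tick
     (Input : String; After : Positive) return Boolean
   is
      Stop : constant Natural :=
        Natural'Min (Input'Last, After + Max_Lookahead);
   begin
      for Pos in After + 1 .. Stop loop
         if Input (Pos) = '`' then
            return True;
         end if;
      end loop;
      return False;
   end Has_Closing_Tick;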

And these stages are all interleaved. For instance, the front-end of the 
Janus/Ada compiler is driven by its parser. The parse starts by calling 
Get_Token to get the first token (lexeme) in the program. The parser then 
does its thing until that token is consumed, at which point Get_Token is 
called again. In the other direction, the parser calls various routines to 
do things when grammar productions are recognized. And that's the whole 
design.
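
In outline (with invented names; this is not the actual Janus/Ada
source):

   loop
      Tok := Get_Token;   --  lexer reads and buffers input as needed
      exit when Tok.Kind = End_Of_Input;
      Consume (Tok);      --  parser advances; calls the semantic
                          --  routines as productions are recognized
   end loop;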

Get_Token does the reading and buffering of the source file as needed, and 
determines the tokens based solely on the source text (it knows nothing 
about the state of the parser). (Note that you will not want to read one 
character at a time if you want anything resembling decent performance; 
you'll want to read chunks of text at a time, so that the lookahead 
requirement becomes a non-problem. Just be sure to buffer more text than 
your lookahead requirement.)

Note that this effectively reads a file as a stream (modulo some buffering 
of lookahead). There's no backtracking in the parser, all of the lookahead 
is purely lexical.

                           Randy.






* Re: [Slightly OT] How to process lightweight text markup languages?
  2015-01-20 18:47   ` Natasha Kerensikova
  2015-01-20 19:44     ` Dmitry A. Kazakov
@ 2015-01-23 10:24     ` Stephen Leake
  1 sibling, 0 replies; 27+ messages in thread
From: Stephen Leake @ 2015-01-23 10:24 UTC (permalink / raw)


Natasha Kerensikova <lithiumcat@instinctive.eu> writes:

> This led me to the conclusion that the only solutions are backtracking
> and simultaneously parsing all available possibilities (i.e. using a
> nondeterministic automaton). Considering the current state of computing,
> I should probably go for backtracking.

I don't follow; why is backtracking better than parallel parsing? They
both search the same tree of possible grammar sentences; it's just
depth-first vs breadth-first search.

> Basic backtracking would be pushing the current state when encountering
> an opening marker, and if the end is reached with a non-empty stack, pop
> the top-most state, replace the opening marker by the literal
> sequence of its representation, and restart parsing.

Right.

In parallel parsing, you don't wait for the first one to fail; you just
assume it will.
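
Schematically, one breadth-first step looks like this (not OpenToken
code, just an illustration with placeholder types):

   with Ada.Containers.Vectors;

   procedure Parallel_Step is
      type Parser_State is new Natural;  --  placeholder for a real state

      package State_Vectors is new Ada.Containers.Vectors
        (Index_Type => Positive, Element_Type => Parser_State);

      --  Placeholder: the states reachable from S on the next token.
      --  A dead branch returns an empty vector and just disappears.
      function Advance (S : Parser_State) return State_Vectors.Vector is
         Result : State_Vectors.Vector;
      begin
         Result.Append (S + 1);
         return Result;
      end Advance;

      Live, Next : State_Vectors.Vector;
   begin
      Live.Append (0);
      --  Advance every live state at once instead of backtracking
      --  on the first failure.
      for S of Live loop
         for T of Advance (S) loop
            Next.Append (T);
         end loop;
      end loop;
      Live := Next;
   end Parallel_Step;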

> If I'm not mistaken, adding precedences to the mix would just change the
> meaning of "end is reached" in the previous paragraph: it would not only
> mean the end of input, but also the ending marker of any
> higher-precedence construct currently in the stack.

Sounds right.

> Am I right so far? Am I missing something?

You are only missing an implementation of a parallel parser and lexer.
There is an example of the first in OpenToken; the second should not be
hard.

-- 
-- Stephe


