From: Natasha Kerensikova
Newsgroups: comp.lang.ada
Subject: [Slightly OT] How to process lightweight text markup languages?
Date: Sun, 18 Jan 2015 18:04:08 +0000 (UTC)

Hello,

I hope you'll forgive my straying slightly beyond the group topic, since I'm asking about a design that could be used whatever the target language (though I will implement it in Ada), but I have the feeling some regulars here can provide me with useful insights on the matter.

I have been dealing with so-called "lightweight text markup languages" for a while now, and despite having written numerous implementations in various languages, I have recently come to the realization that I've only ever used one design, which feels inadequate for the latest such languages I'm trying to use. So I will briefly describe what these languages look like, what my current design looks like, why I feel it's not adequate for some of these languages, how I believe others deal with them, and finally why I don't like their way.

The primary motivation behind "lightweight text markup languages" is to define a language over raw ASCII or Unicode text for describing richly formatted text (usually equivalent to a subset of HTML), while being easy to learn, read and write in its raw or source form. Most wikis use such languages, and other examples include reStructuredText, mostly used in automatically generated Python documentation, and AsciiDoc, which aims to be equivalent to DocBook.

For some reason I still have trouble understanding, that easiness is conflated with acceptable sloppiness. In effect, most of these languages accept absolutely any input text: if, for example, an opening symbol has no matching closing symbol, then instead of the text being rejected with a syntax error, the opening symbol is merely considered implicitly escaped and treated as raw text. That point isn't bad in itself, but it is the source of one of my main unresolved difficulties.

When writing code to read such input formats and produce HTML or some other similarly powerful format, I have always instinctively fallen back on the same design, based on the following axioms (which I considered obvious until I realized how seldom they are employed):

1. Input text and output text are conceptually of different, incompatible types. So the only way a piece of input ends up in the output is by going through a conversion function, which has to take care of at least escaping (see the sketch after this list). Even when the target language does not support such typing, input and output are at least clearly separated in the design. This helps prevent injection vulnerabilities, but I came to this axiom mainly for clarity. By separating the code dealing with input text further and further from the code dealing with output, I got them so loosely coupled that different output formats can easily be plugged behind the same input code, which I consider a good thing.

2. Input and output are treated as streams: output is append-only, and the input position only moves forward, at least one character per cycle, though I have often needed unbounded look-ahead. Again, this reduces the power of the code dealing with them, which helps clarity and makes termination proofs and memory bounds much easier.

3. The lexer, parser and semantic analyzer are not well separated or well defined, mostly due to ignorance on my part, though that might also be one of the goals of the "lightweight" part. As I said, any text is syntactically valid, so considering the input as a stream of code points felt like enough of a lexical analysis.
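To make axiom 1 concrete, here is roughly what the separation looks like in Ada with an HTML back-end (a minimal sketch; all the names are invented for illustration):

   with Ada.Strings.Unbounded;

   package Renderer is
      --  Distinct types, so that input cannot reach the output
      --  without going through To_Output.
      type Input_Text is new String;
      type Output_Text is new Ada.Strings.Unbounded.Unbounded_String;

      --  The only bridge between the two worlds: it escapes as it copies.
      function To_Output (Raw : Input_Text) return Output_Text;
   end Renderer;

   package body Renderer is
      function To_Output (Raw : Input_Text) return Output_Text is
         use Ada.Strings.Unbounded;
         Result : Unbounded_String;
      begin
         for C of Raw loop
            case C is
               when '&'    => Append (Result, "&amp;");
               when '<'    => Append (Result, "&lt;");
               when '>'    => Append (Result, "&gt;");
               when others => Append (Result, C);
            end case;
         end loop;
         return Output_Text (Result);
      end To_Output;
   end Renderer;

Every place where input text reaches the output has to name To_Output explicitly, so the escaping is easy to audit.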
These axioms, while helpful for development, are quite limiting in what processing can easily be performed, and that usually shows in how malformed input is handled. Most lightweight text markup languages don't specify what output is expected on malformed input, so whatever is easiest is usually fine, and that is how my code worked for years.

Then came the CommonMark project, aiming at standardizing Markdown, including how malformed text is expected to be interpreted, to reduce the surprise of users accustomed to ambiguous constructs.

As a simple example, let's consider Markdown inline code fragments, which are delimited by backtick characters, like `this`, and one type of link, where the link text is surrounded by square brackets and followed by its target between parentheses, like [this](http://example.com/). Now consider mixing them in the following sequence:

   [alpha`](http://example.com/)`](http://other.com)

It can be interpreted either as a link to example.com whose text is "alpha`", with an implicitly escaped opening code-span marker, followed by some normal text; or as a link to other.com whose text contains the code fragment "](http://example.com/)".

In my unbounded-look-ahead online parser, I have to decide which closing bracket ends the link text when encountering the opening bracket, based only on the raw input text. So if I don't want to mix code-fragment logic into the link logic, I have to choose the first closing bracket, and interpret the example as the first variant, without a code fragment. Unfortunately, CommonMark came up with the idea of "element precedence", and code fragments somehow have precedence over link constructions, so only the second interpretation of the example, with a code fragment and a link to other.com, is valid.

So, lacking ideas about language processing, I turn to you for inspiration: I am after a design as clean as my original one, but able to deal with such rules.

After much thought, I came to the realization that lightweight text markup processors don't actually follow my axioms 1 and 2. Instead, they take the input text as a mutable blob and repeatedly apply transformations to it. To be honest, I had to study AsciiDoc, where such a model is almost explicit in the language description, to realize it. In that sense, precedence only means that code fragments are identified before links, so at some point there is the following string in memory:

   [alpha<code>](http://example.com/)</code>](http://other.com)

or something like that.
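As far as I understand it, one pass in that model looks roughly like this (a minimal Ada sketch of the blob model, not taken from any real implementation; running this pass before the link pass is exactly what gives code fragments their precedence):

   with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;

   --  One pass over the mutable blob: mark up `code` spans in place.
   procedure Identify_Code_Spans (Text : in out Unbounded_String) is
      From  : Positive := 1;
      Open  : Natural;
      Close : Natural;
   begin
      while From <= Length (Text) loop
         Open := Index (Text, "`", From);
         exit when Open = 0 or else Open = Length (Text);
         Close := Index (Text, "`", Open + 1);
         exit when Close = 0;  --  unmatched backtick: left as raw text
         --  Rewrite the closing marker first so Open stays valid.
         Replace_Slice (Text, Close, Close, "</code>");
         Replace_Slice (Text, Open, Open, "<code>");
         From := Close + 12;  --  first position after both inserted tags
      end loop;
   end Identify_Code_Spans;

Each pass blindly trusts whatever text the previous passes left behind, which is exactly the kind of blind replacement I object to below.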
I don't like this model, because I don't like blind text replacement, especially when escaping is supposed to be a replacement like any other. Moreover, it couples the input format and the output format much more tightly: at what point in the transformation list should the escaping occur, when you don't want the input-side code to depend on which symbols need escaping? Escaping too early risks mangling the symbols of your own language, while escaping too late risks escaping constructs built in previous rounds. It feels like a sloppy process, the kind I find acceptable in one-shot grep/sed sequences, but not as durable engineering.

Is there a way to implement these languages with proper engineering?

My latest attempt involves keeping the online architecture, with separate input and output types and streams, plus a stack of currently open constructs, with a dynamically dispatching ending test on each character for each construct on the stack (see the P.S. below for a skeleton). It feels horribly inefficient and complicated.

Or maybe I should give up the online capability, assume that any reasonable input fits in memory, build a proper semantically loaded token sequence, and build a tree from it. However, that also feels horribly complicated compared to my earlier design, only to resolve malformed input according to some "precedence" rather than in text order.

Thanks in advance for your patience and advice,
Natasha
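P.S. To make the stack idea concrete, here is the skeleton I have in mind (a minimal sketch; every name is invented):

   with Ada.Containers.Vectors;

   package Inline_Parser is
      --  Each kind of inline construct knows how to recognize its own
      --  closing marker.
      type Construct is abstract tagged null record;

      --  Dispatching test: does this construct's closing marker start
      --  at Position in Input?
      function Ends_Here
        (C        : Construct;
         Input    : String;
         Position : Positive) return Boolean is abstract;

      type Construct_Access is access all Construct'Class;

      package Construct_Stacks is
        new Ada.Containers.Vectors (Positive, Construct_Access);

      --  The main loop would walk Input one character at a time and,
      --  at each position, ask every construct on the stack, top
      --  first, whether it ends here; when one does, everything
      --  opened above it is reinterpreted as raw text.
   end Inline_Parser;

That per-character, per-construct dispatching is the cost I find hard to justify.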