From: Natasha Kerensikova
Newsgroups: comp.lang.ada
Subject: [Slightly OT] How to process lightweight text markup languages?
Date: Sun, 18 Jan 2015 18:04:08 +0000 (UTC)

Hello,

I hope you'll forgive my straying slightly beyond the group topic, since I'm asking about a design that could be used whatever the target language (though I will implement it in Ada), but I have the feeling some regulars here can provide me with useful insights on the matter.

I have been dealing with so-called "lightweight text markup languages" for a while now, and despite having written numerous implementations in various languages, I have recently come to the realization that I've only ever used one design, which feels inadequate for the latest such languages I'm trying to use. So I will briefly describe what these languages look like, what my current design looks like, why I feel it's not adequate for some of these languages, how I believe others deal with them, and finally why I don't like their way.

The primary motivation behind "lightweight text markup languages" is to define a language over raw ASCII or Unicode text for describing richly formatted text (usually equivalent to a subset of HTML), while being easy to learn, read and write in its raw or source form. Most wikis use such languages, and other examples include reStructuredText, mostly used in automatically generated Python documentation, and AsciiDoc, which aims to be equivalent to DocBook.

For some reason I still have trouble understanding, that easiness is conflated with acceptable sloppiness. In effect, most of these languages accept absolutely any input text: if, for example, an opening symbol has no matching closing symbol, then instead of the text being rejected with a syntax error, the opening symbol is merely considered implicitly escaped and treated as raw text. That point isn't bad in itself, but it is the source of one of my main unresolved difficulties.

When writing code to read such input formats and produce HTML or some other similarly powerful format, I have always instinctively fallen back on the same design, based on the following axioms (which I considered obvious until I realized how seldom they are employed):

1. Input text and output text are conceptually of different, incompatible types. So the only way a piece of input ends up in the output is by going through a conversion function, which has to take care of at least escaping (see the sketch after this list). Even when the target language does not support such typing, input and output are at least clearly separated in the design. This helps prevent injection vulnerabilities, but I came to this axiom mainly for clarity. By separating the code dealing with input text further and further from the code dealing with output, I got them so loosely coupled that different output formats can easily be plugged behind the same input code, which I consider a good thing.

2. Input and output are treated as streams: output is append-only, and the input position only moves forward, at least one character per cycle, though I have often needed unbounded look-ahead. Again, this reduces the power of the code dealing with them, which helps clarity and makes termination proofs and memory bounds much easier.

3. The lexer, parser and semantic analyzer are not well separated or well defined, mostly due to ignorance on my part, though that might also be one of the goals of the "lightweight" part. As I said, any text is syntactically valid, so considering the input as a stream of code points felt like enough of a lexical analysis.
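To make axiom 1 concrete, here is roughly what the separation looks like in Ada with an HTML back-end (a minimal sketch; all the names are invented for illustration):

   with Ada.Strings.Unbounded;

   package Renderer is
      --  Distinct types, so that input cannot reach the output
      --  without going through To_Output.
      type Input_Text is new String;
      type Output_Text is new Ada.Strings.Unbounded.Unbounded_String;

      --  The only bridge between the two worlds: it escapes as it copies.
      function To_Output (Raw : Input_Text) return Output_Text;
   end Renderer;

   package body Renderer is
      function To_Output (Raw : Input_Text) return Output_Text is
         use Ada.Strings.Unbounded;
         Result : Unbounded_String;
      begin
         for C of Raw loop
            case C is
               when '&'    => Append (Result, "&amp;");
               when '<'    => Append (Result, "&lt;");
               when '>'    => Append (Result, "&gt;");
               when others => Append (Result, C);
            end case;
         end loop;
         return Output_Text (Result);
      end To_Output;
   end Renderer;

Every place where input text reaches the output has to name To_Output explicitly, so the escaping is easy to audit.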
These axioms, while helpful for development, are quite limiting in what processing can easily be performed, and that usually shows in how malformed input is handled. Most lightweight text markup languages don't specify what output is expected on malformed input, so whatever is easiest is usually fine, and that is how my code worked for years.

Then came the CommonMark project, aiming at standardizing Markdown, including how malformed text is expected to be interpreted, to reduce the surprise of users accustomed to ambiguous constructs.

As a simple example, let's consider Markdown inline code fragments, which are delimited by backtick characters, like `this`, and one type of link, where the link text is surrounded by square brackets and followed by its target between parentheses, like [this](http://example.com/). Now consider mixing them in the following sequence:

   [alpha`](http://example.com/)`](http://other.com)

It can be interpreted either as a link to example.com whose text is "alpha`", with an implicitly escaped opening code-span marker, followed by some normal text; or as a link to other.com whose text contains the code fragment "](http://example.com/)".

In my unbounded-look-ahead online parser, I have to decide which closing bracket ends the link text when encountering the opening bracket, based only on the raw input text. So if I don't want to mix code-fragment logic into the link logic, I have to choose the first closing bracket, and interpret the example as the first variant, without a code fragment. Unfortunately, CommonMark came up with the idea of "element precedence", and code fragments somehow have precedence over link constructions, so only the second interpretation of the example, with a code fragment and a link to other.com, is valid.

So, lacking ideas about language processing, I turn to you for inspiration: I am after a design as clean as my original one, but able to deal with such rules.

After much thought, I came to the realization that lightweight text markup processors don't actually follow my axioms 1 and 2. Instead, they take the input text as a mutable blob and repeatedly apply transformations to it. To be honest, I had to study AsciiDoc, where such a model is almost explicit in the language description, to realize it. In that sense, precedence only means that code fragments are identified before links, so at some point there is the following string in memory:

   [alpha<code>](http://example.com/)</code>](http://other.com)

or something like that.
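As far as I understand it, one pass in that model looks roughly like this (a minimal Ada sketch of the blob model, not taken from any real implementation; running this pass before the link pass is exactly what gives code fragments their precedence):

   with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;

   --  One pass over the mutable blob: mark up `code` spans in place.
   procedure Identify_Code_Spans (Text : in out Unbounded_String) is
      From  : Positive := 1;
      Open  : Natural;
      Close : Natural;
   begin
      while From <= Length (Text) loop
         Open := Index (Text, "`", From);
         exit when Open = 0 or else Open = Length (Text);
         Close := Index (Text, "`", Open + 1);
         exit when Close = 0;  --  unmatched backtick: left as raw text
         --  Rewrite the closing marker first so Open stays valid.
         Replace_Slice (Text, Close, Close, "</code>");
         Replace_Slice (Text, Open, Open, "<code>");
         From := Close + 12;  --  first position after both inserted tags
      end loop;
   end Identify_Code_Spans;

Each pass blindly trusts whatever text the previous passes left behind, which is exactly the kind of blind replacement I object to below.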
I don't like this model, because I don't like blind text replacement, especially when escaping is supposed to be a replacement like any other. Moreover, it couples the input format and the output format much more tightly: at what point in the transformation list should the escaping occur, when you don't want the input-side code to depend on which symbols need escaping? Escaping too early risks mangling the symbols of your own language, while escaping too late risks escaping constructs built in previous rounds. It feels like a sloppy process, the kind I find acceptable in one-shot grep/sed sequences, but not as durable engineering.

Is there a way to implement these languages with proper engineering?

My latest attempt involves keeping the online architecture, with separate input and output types and streams, plus a stack of currently open constructs, with a dynamically dispatching ending test on each character for each construct on the stack (see the P.S. below for a skeleton). It feels horribly inefficient and complicated.

Or maybe I should give up the online capability, assume that any reasonable input fits in memory, build a proper semantically loaded token sequence, and build a tree from it. However, that also feels horribly complicated compared to my earlier design, only to resolve malformed input according to some "precedence" rather than in text order.

Thanks in advance for your patience and advice,
Natasha
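P.S. To make the stack idea concrete, here is the skeleton I have in mind (a minimal sketch; every name is invented):

   with Ada.Containers.Vectors;

   package Inline_Parser is
      --  Each kind of inline construct knows how to recognize its own
      --  closing marker.
      type Construct is abstract tagged null record;

      --  Dispatching test: does this construct's closing marker start
      --  at Position in Input?
      function Ends_Here
        (C        : Construct;
         Input    : String;
         Position : Positive) return Boolean is abstract;

      type Construct_Access is access all Construct'Class;

      package Construct_Stacks is
        new Ada.Containers.Vectors (Positive, Construct_Access);

      --  The main loop would walk Input one character at a time and,
      --  at each position, ask every construct on the stack, top
      --  first, whether it ends here; when one does, everything
      --  opened above it is reinterpreted as raw text.
   end Inline_Parser;

That per-character, per-construct dispatching is the cost I find hard to justify.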