From: Stephen Leake
Newsgroups: comp.lang.ada
Subject: Re: [Slightly OT] How to process lightweight text markup languages?
Date: Wed, 21 Jan 2015 08:54:03 -0600

Natasha Kerensikova writes:

> I hope you'll forgive my straying slightly beyond the group topic,

I appreciate the interesting subject (but then, I'm into parsers
http://stephe-leake.org/ada/opentoken.html :).

> As a simple example, let's consider Markdown inline code fragments,
> which are delimited by backtick characters, like `this`, and a type
> of links, where the link text is surrounded by square brackets,
> followed by its target between parentheses, like
> [this](http://example.com/). Now consider mixing them in the
> following sequence:
>
> [alpha`](http://example.com/)`](http://other.com)
>
> It can be interpreted either as a link to example.com whose text is
> "alpha`", with an implicitly escaped opening code-span marker,
> followed by some normal text, or as a link to other.com whose text
> contains the code fragment "](http://example.com/)".
>
> In my unbounded-lookahead online parser, I have to decide which
> closing bracket ends the link text when I encounter the opening
> bracket, based only on the raw input text. So if I don't want to
> mix code-fragment logic into the link logic, I have to choose the
> first closing bracket, and interpret the example as the first
> variant, without a code fragment.
>
> Unfortunately, CommonMark came up with the idea of "element
> precedence", and code fragments somehow have precedence over link
> constructions, so only the second interpretation of the example,
> with a code fragment and a link to other.com, is valid.
>
> So, lacking ideas on language processing, I turn to you for
> inspiration for a design as clean as my original one, but able to
> deal with such rules.

This is an example of coupling the lexer with the parser; normally
the parser would enforce precedence rules, but in this case they
affect how the input is lexed.

> After much thought, I came to the realization that lightweight text
> markup processors don't actually follow my axioms 1 and 2. Instead,
> they take the input text as a mutable blob, and repeatedly apply
> transformations to it.
>
> In that sense, it only means that code fragments are identified
> before links, so at some point there is the following string in
> memory:
>
> [alpha](http://example.com/)](http://other.com)

No, you don't need to transform the input text itself. You do need
to mess with the token stream returned by the lexer.
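To make that concrete, here is a minimal sketch of a token type that
could carry either reading; the names are mine, invented for
illustration, not OpenToken's actual API:

   --  Hypothetical token representation; not OpenToken's API.
   with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
   with Ada.Containers.Vectors;

   package Markup_Tokens is

      type Token_Kind is
        (Left_Bracket,      --  [
         Right_Bracket,     --  ]
         Escaped_Backtick,  --  a backtick taken literally
         Code,              --  `...` span; payload is its contents
         Link,              --  (...) target; payload is the URL
         Text);             --  plain text (STRING in the streams below)

      type Token is record
         Kind  : Token_Kind;
         Value : Unbounded_String;  --  payload for Code, Link, Text
      end record;

      package Token_Vectors is new Ada.Containers.Vectors
        (Index_Type => Positive, Element_Type => Token);

      --  The lexer produces one of these per interpretation.
      subtype Token_Stream is Token_Vectors.Vector;

   end Markup_Tokens;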
The two interpretations could be:

   LEFT_BRACKET STRING "alpha" ESCAPED_BACKTICK RIGHT_BRACKET
   LINK "http://example.com/" STRING "`](http://other.com)"

   LEFT_BRACKET STRING "alpha" CODE "](http://example.com/)"
   RIGHT_BRACKET STRING "(http://other.com)"

One possible implementation: the lexer returns the first
interpretation; the parser realizes it is wrong, tells the lexer to
re-interpret ESCAPED_BACKTICK as CODE, and reparses. That requires
push-back of tokens from the parser to the lexer; messy. I don't
think OpenToken can do this currently.

Another choice is for the lexer to fork and return both
interpretations in parallel; the parser would also fork. That is
similar to the generalized LALR parser
(http://en.wikipedia.org/wiki/GLR_parser), which the current version
of the OpenToken parser supports (but not the lexer). When you get
to the closing backtick, the first parser reports an error and
closes, and the second parser keeps going. (A rough sketch of this
forking idea appears at the end of this post.)

> I don't like it, because I don't like blind text replacement,
> especially when escaping is supposed to be a replacement like any
> other. Moreover, it couples the input format and the output format
> much more tightly: at what point in the transformation list should
> the escaping occur, when you don't want to depend on what symbols
> need escaping? Escaping too early risks mangling the symbols of
> your language, while escaping too late risks escaping constructs
> built in previous rounds.

Yes, backtracking is a pain. The parallel lexer/parser is simpler,
at the potential cost of being slower and taking more memory. I'm
using GLR in the Emacs Ada mode, and I got exponential explosions in
the number of parsers a couple of times; I had to clean up the
grammar to avoid that.

> Is there a way to implement these languages with proper engineering?

I would pursue the parallel lexer/parser approach.

> My latest attempts involve keeping the online architecture with
> separate input and output types and streams, and keeping a stack of
> currently open constructs, with a dynamically dispatching ending
> test on each character for each construct on the stack. It feels
> horribly inefficient and complicated.

Well, the problem is complicated, so that's not necessarily bad.

Parsing "natural" language is a complex problem. Structured
languages, like Ada, reduce parser complexity by forcing the user to
structure the input. CommonMark is attempting to remove that burden
from the user, which means the parser has to get more complex.

> Or maybe giving up the online capacity, making the assumption that
> any reasonable input can fit in memory, building a serious
> semantically-loaded token sequence, and building a tree from it.

Keeping the input in memory is probably practical; it is assumed in
Emacs, for example. Unless you have an explicit requirement to deal
with text > 100 MB, it's a reasonable approach.

-- 
-- Stephe
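[Sketch referenced above: a toy illustration of the forking idea.
Keep a list of live parse states, fork each state at an ambiguous
backtick, and drop a state when its parser later reports an error.
All names are invented for illustration; OpenToken's lexer does not
currently work this way.]

   with Ada.Containers.Doubly_Linked_Lists;
   with Ada.Text_IO; use Ada.Text_IO;

   procedure Fork_Demo is

      type Parse_State is record
         In_Code_Span : Boolean := False;  --  between backticks?
      end record;

      package State_Lists is new
        Ada.Containers.Doubly_Linked_Lists (Parse_State);

      Live : State_Lists.List;

      --  Fork every live state on an ambiguous backtick.
      procedure Fork_On_Backtick is
         Forked : State_Lists.List;
      begin
         for S of Live loop
            --  Reading 1: the backtick is a literal character.
            Forked.Append (S);
            --  Reading 2: the backtick opens or closes a code span.
            Forked.Append
              (Parse_State'(In_Code_Span => not S.In_Code_Span));
         end loop;
         Live := Forked;
      end Fork_On_Backtick;

   begin
      Live.Append (Parse_State'(In_Code_Span => False));
      Fork_On_Backtick;  --  the backtick after "alpha"
      Put_Line ("live parses:" &
                  Ada.Containers.Count_Type'Image (Live.Length));
   end Fork_Demo;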