From: "Dmitry A. Kazakov"
Newsgroups: comp.lang.ada
Subject: Re: [Slightly OT] How to process lightweight text markup languages?
Date: Thu, 22 Jan 2015 19:38:52 +0100
Organization: cbb software GmbH
References: <1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net>
Reply-To: mailbox@dmitry-kazakov.de

On Thu, 22 Jan 2015 13:41:12 +0000 (UTC), Natasha Kerensikova wrote:

> On 2015-01-20, Randy Brukardt wrote:
>> "Dmitry A. Kazakov" wrote in message
>> news:1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net...
>>> On Tue, 20 Jan 2015 18:47:13 +0000 (UTC), Natasha Kerensikova wrote:
>>>> So if you remember my example,
>>>>
>>>> [alpha`](http://example.com/)`](http://other.com)
>>>
>>> Since `` are quotation marks, the above should be:
>>>
>>> +
>>> |_ []
>>> |  |_ +
>>> |     |_ alpha
>>> |     |_ ](http://example.com/)
>>> |_ ()
>>>    |_ http://other.com
>>>
>>> + is an assumed infix catenation operation. No backtracking needed.
>
> Maybe I got the vocabulary wrong, but the tree above is exactly what I
> expect as an output for the whole parser. Except I don't need it to be
> stored as a tree; if it's only an iterator or an event driver with a
> callback collection, that's fine for me too.

The tree is merely an unambiguous representation of the result of
parsing. Whether the parser's callback actually generates a tree node or
does something else is up to you.

> However I'm still completely lost on how to reach that point.
>
> Would you have a link that explains that? Or keywords to feed a search
> engine?

http://www.dmitry-kazakov.de/ada/components.htm#Parsers_etc

The method is described in "Compiler Construction for Digital Computers"
by D. Gries, if I remember correctly.

> Dmitry's answer position led me to believe that he was claiming an
> online algorithm (i.e. operating on stream input and output) is
> possible, provided there are internal stacks. So I followed that idea,
> because I like online algorithms, for the same reason I like
> implementations without explicit memory allocation: it's easier to
> understand and reason about, but it's too limited for some situations.

You can allocate the memory for the stacks statically. The point is that
a certain class of formal languages requires memory in order to be
recognized. You can convert memory into parser states and states into
memory, but you must have it.

> So if there has to be two stages, that's fine, I'll go with two stages.

Note that this does not mean that the second stage must start at the end
of the source. You can resume that stage right after each lexeme is
recognized. Only early compilers used to produce a file after each stage.
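To illustrate resuming the second stage after each lexeme, here is a
minimal sketch. It is not my actual components; Lexer_Sketch, Emit and
Scan are names invented for this post, and escaping inside `...` is left
out. The lexer calls Emit as soon as a lexeme is matched instead of
writing it to an intermediate file:

   --  Sketch only: the names below are invented for this post
   with Ada.Text_IO;  use Ada.Text_IO;

   procedure Lexer_Sketch is

      type Lexeme_Kind is
         (Left_Bracket, Right_Bracket, Left_Paren, Right_Paren,
          Literal, Text);

      --  The second stage resumes here after each lexeme.  A real
      --  parser would push onto its stacks instead of printing.
      procedure Emit (Kind : Lexeme_Kind; Value : String) is
      begin
         Put_Line (Lexeme_Kind'Image (Kind) & " " & Value);
      end Emit;

      procedure Scan (Source : String) is
         I : Positive := Source'First;
         J : Positive;
      begin
         while I <= Source'Last loop
            case Source (I) is
               when '[' => Emit (Left_Bracket,  "["); I := I + 1;
               when ']' => Emit (Right_Bracket, "]"); I := I + 1;
               when '(' => Emit (Left_Paren,    "("); I := I + 1;
               when ')' => Emit (Right_Paren,   ")"); I := I + 1;
               when '`' =>
                  --  Accumulate the literal body (no escaping here)
                  J := I + 1;
                  while J <= Source'Last and then Source (J) /= '`' loop
                     J := J + 1;
                  end loop;
                  Emit (Literal, Source (I + 1 .. J - 1));
                  I := J + 1;
               when others =>
                  --  A sequence of other characters
                  J := I;
                  while J <= Source'Last
                    and then Source (J) not in
                      '[' | ']' | '(' | ')' | '`'
                  loop
                     J := J + 1;
                  end loop;
                  Emit (Text, Source (I .. J - 1));
                  I := J;
            end case;
         end loop;
      end Scan;

   begin
      Scan ("[alpha`](http://example.com/)`](http://other.com)");
   end Lexer_Sketch;

Here Emit merely prints what it gets; a real second stage would push the
lexeme onto its stacks instead.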
> The next problem is that I still have no clue about what these two
> stages are. The markers that begin or end elements depend on the
> context, and knowing both what the markers are and what the context is
> amounts to having the full parser. So I can't imagine what data is
> produced by the first stage to be fed into the second.

The division into syntax elements and lexemes is somewhat arbitrary; the
motivation behind it is that lexemes can be recognized independently of
the context. Syntax elements interact with each other, e.g. in a+b*c the
operands are bound by precedence. The next stage, semantic analysis, is
even more "ambiguous".

> For example, a slight variation on my example above yields a completely
> different result:
>
> [alpha`](http://example.com/)'](http://other.com)
>
> +
> |_ +
> |  |_ []
> |  |  |_ alpha`
> |  |_ ()
> |     |_ http://example.com/
> |_ '](http://other.com)
>
> This means that the parser can't know how to parse the stuff around
> "example" unless it knows whether the backtick after "alpha" is matched
> or not.

Why should it care? The lexer simply goes to the first delimiter where it
would stop anyway. E.g. in Ada, if you write

   X := "hello;

it would be the line end. Similarly in C, unless it hits \ before the
line end. At that point you decide how to react. You can stop at once, or
assume that the literal ends there and continue.

> So determining whether the "](" in there are a lexical element or part
> of a normal string means having completely parsed that backtick.

It is part of the literal's body, provided that `xxx` is a literal. If
you are not satisfied with that choice, make it a bracket; that would
make sense, e.g., for Linux bash.

> Unless, again, I'm missing something. But I'm starting to wonder whether
> I'm too ignorant and/or you too knowledgeable to bridge the conceptual
> gap between us.

I think you are overdesigning it. The lexer need not know anything. It
simply matches a lexeme and moves to the next one. A lexeme in your case
could be

   [
   ]
   (
   )
   `anything-in-between`
   a-sequence-of-other-characters

When the lexer sees ` it moves forward, accumulating the literal (which
may have some escaping), until the first non-escaped `. The accumulated
body of the literal feeds the second stage. The lexer can accumulate it
straight on the stack of arguments.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de