From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=unavailable autolearn_force=no version=3.4.4 Path: eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!.POSTED!not-for-mail From: Natasha Kerensikova Newsgroups: comp.lang.ada Subject: Re: [Slightly OT] How to process lightweight text markup languages? Date: Thu, 22 Jan 2015 13:41:12 +0000 (UTC) Organization: A noiseless patient Spider Message-ID: References: <1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net> Injection-Date: Thu, 22 Jan 2015 13:41:12 +0000 (UTC) Injection-Info: mx02.eternal-september.org; posting-host="eab84d932a0f4c9d4606240766f0f5e7"; logging-data="30367"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+lR0HL26CCJ8pdwujgExvx" User-Agent: slrn/1.0.2 (FreeBSD) Cancel-Lock: sha1:vWvSqvHVNojSgk56gueq6Ljl8XA= Xref: news.eternal-september.org comp.lang.ada:24695 Date: 2015-01-22T13:41:12+00:00 List-Id: Hello, On 2015-01-20, Randy Brukardt wrote: > "Dmitry A. Kazakov" wrote in message > news:1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net... >> On Tue, 20 Jan 2015 18:47:13 +0000 (UTC), Natasha Kerensikova wrote: >>> So if you remember my example, >>> >>> [alpha`](http://example.com/)`](http://other.com) >> >> Since `` are quotation marks, the above should be: >> >> + >> |_ [] >> | |_ + >> | |_ alpha >> | |_ ](http://example.com/) >> |_ () >> |_ http://other.com >> >> + is an assumed infix catenation operation. No backtracking needed. Maybe I got the vocabulary wrong, but the tree above is exactly what I expect as an output for the whole parser. Except I don't need it to be stored as a tree, if it's only an iterator or an event driver with a callback collection, that's fine for me too. However I'm still completely lost on how to reach that point. Would you have a link that explains that? Or keywords to feed a search engine? >> [...] >>> Am I right so far? Am I missing something? >> >> Distinguishing lexical and syntactical elements? You don't bother with >> operators until expression terms (lexemes) matched. Once you matched them >> you never return back. They are all on the stack if not already bound by >> an >> operation. If `` is declared literal, it is a term of the expression, >> atomically matched. It naturally takes precedence over anything else. > > I agree with Dmitry; "standard" parsing has two stages, lexing (converting > into a token stream) and parsing. You're trying to make due with only one, > which complicates the problem a lot for no reason. Dmitry's answer position led me to believe that he was claiming an online algorithm (i.e. operating on stream input and output) is possible, provided there are internal stacks. So I followed that idea, because I like online algorithms, for the same reason I like implementations without explicit memory allocation: it's easier to understand and reason about, but it's too limited for some situations. So if there has to be two stages, that's fine, I'll go with two stages. The next problem is that I still have no clue about what these two stages are. The markers that begin or end elements depends on the context, and knowing both what are markers and in what context is having the full parser. So I can't imagine what data is produced by the first stage to be fed into the second. For example, a slight variation on my example above yields a completely different result: [alpha`](http://example.com/)'](http://other.com) + |_ + | |_ [] | | |_ alpha` | |_ () | |_ http://example.com/ |_ '](http://other.com) This means that the parser can't know how to parse the stuff around "example" unless it knows whether the backtick after "alpha" is matched or not. So determining whether the "](" in there are a lexical element or part a normal string means having parsed completely that backtick. Unless, again, I'm missing something. But I'm starting to wonder whether I'm too ignorant and/or you too knowledgeable to bridge the conceptual gap between us. > Also note that an LR parser acts similarly to your "parsing all > possibilities at once". Thanks a lot. The keyword here is duly noted for future research. > As Dmitry says, such parsers (like the one we use in Janus/Ada) > make it more challenging to deal with error correction The thing is that in these languages, there is really no error. Like the unmatched tick in my second example, or the last bracket in the first one. If it doesn't fit the rule, then it's literal. I guess that would look like a catch-all rule at the end of a list of alternatives. That's also the reason why I'm not completely sure these languages can be described by a (mathematical) grammar, but I'll admit my ignorance in the practical uses of such algebra constructs. Thanks for your tentative help, Natasha