From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: [Slightly OT] How to process lightweight text markup languages?
Date: Thu, 22 Jan 2015 15:48:34 -0600
Organization: Jacob Sparre Andersen Research & Innovation
References: <1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net>

"Dmitry A. Kazakov" wrote in message news:lskzqkn5ssua$.zt9p9m6m2yro$.dlg@40tude.net...
> On Thu, 22 Jan 2015 13:41:12 +0000 (UTC), Natasha Kerensikova wrote:
...
>> Unless, again, I'm missing something. But I'm starting to wonder whether
>> I'm too ignorant and/or you too knowledgeable to bridge the conceptual
>> gap between us.
>
> I think you are overdesigning it. The lexer need not to know anything. It
> simply matches a lexeme and moves to the next one. A lexeme in your case
> could be ...

I agree, the lexical level should be simple and as context-free as possible. And it may need some amount of lookahead.
If you see a `, for instance, you may want to scan ahead until the natural end (whatever that is) to see if there is another ` (if not, then you just return the `). I wouldn't worry about pathological programs that have 10 million characters between the `s -- in most circumstances, that's just going to be a stand-alone ` followed by the next one in the markup. So even if there is no limit in the specification, I would add one just to prevent quoting the entire text when one quote mark is left out. (Ada uses line ends for this purpose, and it allows limiting the line length, so the maximum lookahead is the maximum line length. I suggest something similar in your case.)

And these stages are all interleaved. For instance, the front-end of the Janus/Ada compiler is driven by its parser. The parse starts by calling Get_Token to get the first token (lexeme) in the program. The parser then does its thing until that token is consumed, at which point Get_Token is called again. In the other direction, the parser calls various routines to do things when grammar productions are recognized. And that's the whole design.

Get_Token does the reading and buffering of the source file as needed, and determines the tokens just based on the source (it knows nothing about the state of the parser). (Note that you will not want to read one character at a time if you want anything resembling decent performance; you'll want to read chunks of text at a time, so that the lookahead requirement becomes a non-problem. Just be sure to buffer more text than your lookahead requirement.)

Note that this effectively reads a file as a stream (modulo some buffering of lookahead). There's no backtracking in the parser; all of the lookahead is purely lexical.

Randy.
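
For illustration only, here is a minimal sketch of the design described above, written in Python rather than Ada. It is not the Janus/Ada code. The names Lexer, get_token and parse, the MAX_LOOKAHEAD and CHUNK_SIZE values, the token kinds, and the <code> output are all assumptions made for the example. It shows a lexer that knows nothing about the parser, reads in chunks, and limits the search for a closing ` to a fixed lookahead, with a parser that drives everything by asking for one token at a time.

```python
import io

MAX_LOOKAHEAD = 80   # assumed limit on the text between a pair of backticks
CHUNK_SIZE = 4096    # assumed read size; larger than the lookahead requirement


class Lexer:
    """Determines tokens from the source alone; no knowledge of parser state."""

    def __init__(self, stream):
        self.stream = stream
        self.buf = ""
        self.pos = 0

    def _fill(self, need):
        # Keep at least `need` unread characters buffered, if the source has them.
        while len(self.buf) - self.pos < need:
            chunk = self.stream.read(CHUNK_SIZE)
            if not chunk:
                break
            self.buf = self.buf[self.pos:] + chunk
            self.pos = 0

    def get_token(self):
        self._fill(MAX_LOOKAHEAD + 2)
        if self.pos >= len(self.buf):
            return ("EOF", "")
        if self.buf[self.pos] == "`":
            # Purely lexical lookahead: look for the closing backtick,
            # but only within MAX_LOOKAHEAD characters.
            limit = self.pos + 2 + MAX_LOOKAHEAD
            end = self.buf.find("`", self.pos + 1, limit)
            if end != -1:
                token = ("CODE", self.buf[self.pos + 1:end])
                self.pos = end + 1
                return token
            # No partner in range: return the backtick as ordinary text.
            self.pos += 1
            return ("TEXT", "`")
        # Ordinary text runs to the next backtick or the end of the buffer.
        end = self.buf.find("`", self.pos)
        if end == -1:
            end = len(self.buf)
        token = ("TEXT", self.buf[self.pos:end])
        self.pos = end
        return token


def parse(stream):
    """The parser drives: it pulls tokens and acts on what it recognizes."""
    lexer = Lexer(stream)
    out = []
    kind, text = lexer.get_token()
    while kind != "EOF":
        if kind == "CODE":
            out.append("<code>" + text + "</code>")
        else:
            out.append(text)
        kind, text = lexer.get_token()
    return "".join(out)


print(parse(io.StringIO("use `x` here, but a lone ` stays as it is")))
```

With the limit in place, a ` whose partner is missing or too far away comes back as plain text, so one forgotten quote mark cannot swallow the rest of the input. A long run of ordinary text may arrive as several TEXT tokens split at chunk boundaries, which the parser here simply concatenates.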