From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 border2.nntp.dca1.giganews.com!nntp.giganews.com!usenet.blueworldhosting.com!feeder01.blueworldhosting.com!feeder.erje.net!eu.feeder.erje.net!gandalf.srv.welterde.de!news.jacob-sparre.dk!loke.jacob-sparre.dk!pnx.dk!.POSTED!not-for-mail
From: "Randy Brukardt" <randy@rrsoftware.com>
Newsgroups: comp.lang.ada
Subject: Re: [Slightly OT] How to process lightweight text markup languages?
Date: Thu, 22 Jan 2015 15:48:34 -0600
Organization: Jacob Sparre Andersen Research & Innovation
Message-ID: <m9rr7j$38v$1@loke.gir.dk>
References: <slrnmbntdm.19vl.lithiumcat@nat.rebma.instinctive.eu>
 <ynm6coktfevl.1esu61g1n9477.dlg@40tude.net>
 <slrnmbt8mf.19vl.lithiumcat@nat.rebma.instinctive.eu>
 <1wclq766iu82d.b0k1hx30rgrt.dlg@40tude.net> <m9mj5b$92m$1@loke.gir.dk>
 <slrnmc1vgn.19vl.lithiumcat@nat.rebma.instinctive.eu>
 <lskzqkn5ssua$.zt9p9m6m2yro$.dlg@40tude.net>
NNTP-Posting-Host: rrsoftware.com
X-Trace: loke.gir.dk 1421963315 3359 24.196.82.226 (22 Jan 2015 21:48:35 GMT)
X-Complaints-To: news@jacob-sparre.dk
NNTP-Posting-Date: Thu, 22 Jan 2015 21:48:35 +0000 (UTC)
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-RFC2646: Format=Flowed; Original
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Xref: number.nntp.giganews.com comp.lang.ada:192010
Date: 2015-01-22T15:48:34-06:00
List-Id: <comp.lang.ada>

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:lskzqkn5ssua$.zt9p9m6m2yro$.dlg@40tude.net...
> On Thu, 22 Jan 2015 13:41:12 +0000 (UTC), Natasha Kerensikova wrote:
...
>> Unless, again, I'm missing something. But I'm starting to wonder whether
>> I'm too ignorant and/or you too knowledgeable to bridge the conceptual
>> gap between us.
>
> I think you are overdesigning it. The lexer need not to know anything. It
> simply matches a lexeme and moves to the next one. A lexeme in your case
> could be
...

I agree, the lexical level should be simple and as context-free as possible. 
And it may need some amount of lookahead. If you see a `, for instance, you 
may want to scan ahead until the natural end (whatever that is) to see if 
there is another ` (if not, then you just return the `). I wouldn't worry 
about pathological inputs that have 10 million characters between the 
`s -- in most circumstances, that's really a stand-alone ` that happens to 
be followed, much later, by the next one in the markup. So even if there is 
no limit in the specification, I would add one just to prevent quoting the 
entire text when one quote mark is left out. (Ada uses line ends for this 
purpose, and it 
allows limiting the line length, so the maximum lookahead is the maximum 
line length. I suggest something similar in your case.)
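
To make the bounded lookahead concrete, here is a minimal sketch in Python 
(not Ada, purely for brevity). Everything named here is an assumption made 
for illustration: scan_backtick, MAX_LOOKAHEAD, the value 200, the token 
kinds "CODE" and "TEXT", and the choice of the line end as the "natural 
end". None of it comes from any markup specification.

```python
MAX_LOOKAHEAD = 200  # hypothetical limit, playing the role of a maximum line length


def scan_backtick(text, pos, limit=MAX_LOOKAHEAD):
    """Scan the lexeme starting at the backtick text[pos].

    Returns (kind, lexeme, next_pos). The search for a closing backtick
    examines at most `limit` characters and never crosses a line end
    (both are assumptions of this sketch).
    """
    assert text[pos] == "`"
    # Never examine more than `limit` characters after the opening backtick.
    window_end = min(len(text), pos + 1 + limit)
    # Treat the end of the line as the natural end of the search.
    newline = text.find("\n", pos + 1, window_end)
    if newline != -1:
        window_end = newline
    close = text.find("`", pos + 1, window_end)
    if close == -1:
        # No partner in range: return the backtick as a stand-alone lexeme.
        return ("TEXT", "`", pos + 1)
    return ("CODE", text[pos + 1:close], close + 1)
```

With this, scan_backtick("a `b` c", 2) yields a "CODE" token for b, while 
a lone backtick, or one whose partner lies beyond the limit or on a later 
line, comes back as an ordinary one-character lexeme and scanning simply 
continues after it.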

And these stages are all interleaved. For instance, the front-end of the 
Janus/Ada compiler is driven by its parser. The parser starts by calling 
Get_Token to get the first token (lexeme) in the program. The parser then 
does its thing until that token is consumed, at which point Get_Token is 
called again. In the other direction, the parser calls various routines to 
do things when grammar productions are recognized. And that's the whole 
design.
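
The shape of that design can be sketched in a few lines of Python (not 
Ada, and certainly not the Janus/Ada code). The toy grammar (a sum of 
numbers), the class names Lexer and Parser, and the action routine 
on_number are all hypothetical; only the control flow is the point: the 
parser pulls tokens with get_token and calls an action routine when it 
recognizes something.

```python
class Lexer:
    """Produces tokens from the source alone; it knows nothing about the parser."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def get_token(self):
        # Skip blanks between tokens.
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return ("EOF", "")
        ch = self.text[self.pos]
        if ch.isdigit():
            start = self.pos
            while self.pos < len(self.text) and self.text[self.pos].isdigit():
                self.pos += 1
            return ("NUMBER", self.text[start:self.pos])
        self.pos += 1
        return ("PLUS", ch) if ch == "+" else ("UNKNOWN", ch)


class Parser:
    """Drives the whole front end: pulls tokens, calls action routines."""

    def __init__(self, lexer):
        self.lexer = lexer
        self.token = lexer.get_token()  # parsing starts by fetching the first token
        self.total = 0

    def advance(self):
        # The current token has been consumed; ask the lexer for the next one.
        self.token = self.lexer.get_token()

    def on_number(self, lexeme):
        # Action routine, called when a NUMBER is recognized.
        self.total += int(lexeme)

    def expect_number(self):
        kind, lexeme = self.token
        if kind != "NUMBER":
            raise SyntaxError(f"expected a number, got {kind} {lexeme!r}")
        self.on_number(lexeme)
        self.advance()

    def parse_sum(self):
        # Toy grammar: sum ::= NUMBER { "+" NUMBER }
        self.expect_number()
        while self.token[0] == "PLUS":
            self.advance()
            self.expect_number()
        if self.token[0] != "EOF":
            raise SyntaxError(f"unexpected {self.token[1]!r}")
        return self.total
```

Parser(Lexer("1 + 20 + 300")).parse_sum() walks the input once, front to 
back; the lexer never asks the parser what it expects.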

Get_Token does the reading and buffering of the source file as needed, and 
determines the tokens just based on the source (it knows nothing about the 
state of the parser). (Note that you will not want to read one character at 
a time if you want anything resembling decent performance; you'll want to 
read chunks of text at a time. Then the lookahead requirement becomes a 
non-problem; just be sure to buffer more text than your lookahead 
requirement.)
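
As an illustration of the buffering, here is a small Python sketch of a 
source reader that fetches text in chunks and guarantees a bounded 
lookahead. The class name BufferedSource, its methods peek and advance, 
and the default sizes are assumptions invented for this example; a real 
lexer would pick its own interface and sizes.

```python
import io


class BufferedSource:
    """Reads a text stream in chunks and offers a bounded peek ahead."""

    def __init__(self, stream, chunk_size=4096, max_lookahead=256):
        # Buffer more text than the lookahead requirement, as advised above.
        if chunk_size <= max_lookahead:
            raise ValueError("chunk_size must exceed max_lookahead")
        self.stream = stream
        self.chunk_size = chunk_size
        self.max_lookahead = max_lookahead
        self.buf = ""
        self.pos = 0

    def _fill(self, needed):
        # Make sure `needed` characters are buffered, unless the input ends first.
        while len(self.buf) - self.pos < needed:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                break  # end of input
            # Drop text that was already consumed, keep the rest, append the chunk.
            self.buf = self.buf[self.pos:] + chunk
            self.pos = 0

    def peek(self, count=1):
        """Return up to `count` upcoming characters without consuming them."""
        if count > self.max_lookahead:
            raise ValueError("lookahead limit exceeded")
        self._fill(count)
        return self.buf[self.pos:self.pos + count]

    def advance(self, count=1):
        """Consume and return up to `count` characters."""
        self._fill(count)
        taken = self.buf[self.pos:self.pos + count]
        self.pos += len(taken)
        return taken
```

For example, BufferedSource(io.StringIO("abcdefghij"), chunk_size=4, 
max_lookahead=3) answers peek(3) with "abc" and, after advance(2), answers 
peek(3) with "cde", refilling from the stream behind the scenes. Consumed 
text is discarded, so the file is still processed as a stream.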

Note that this effectively reads a file as a stream (modulo some buffering 
of lookahead). There's no backtracking in the parser; all of the lookahead 
is purely lexical.

                           Randy.