comp.lang.ada
From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: [Slightly OT] How to process lightweight text markup languages?
Date: Thu, 22 Jan 2015 15:48:34 -0600
Message-ID: <m9rr7j$38v$1@loke.gir.dk>
In-Reply-To: lskzqkn5ssua$.zt9p9m6m2yro$.dlg@40tude.net

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:lskzqkn5ssua$.zt9p9m6m2yro$.dlg@40tude.net...
> On Thu, 22 Jan 2015 13:41:12 +0000 (UTC), Natasha Kerensikova wrote:
...
>> Unless, again, I'm missing something. But I'm starting to wonder whether
>> I'm too ignorant and/or you too knowledgeable to bridge the conceptual
>> gap between us.
>
> I think you are overdesigning it. The lexer need not to know anything. It
> simply matches a lexeme and moves to the next one. A lexeme in your case
> could be
...

I agree, the lexical level should be simple and as context-free as possible.
It may need some amount of lookahead, though. If you see a `, for instance,
you may want to scan ahead to the natural end (whatever that is) to see if
there is another `; if there isn't, you just return the ` by itself. I
wouldn't worry about pathological input that has 10 million characters
between the `s. In most circumstances, that's just a stand-alone ` that
happens to be followed, much later, by the opening ` of the next quoted span
in the markup. So even if there is no limit in the specification, I would add
one, just to prevent quoting the entire text when one quote mark is left out.
(Ada uses line ends for this purpose, and it allows limiting the line length,
so the maximum lookahead is the maximum line length. I suggest something
similar in your case.)
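
To make the bounded lookahead concrete, here is a minimal sketch in Python
(not Ada, and not taken from any real lexer). The names scan_backtick and
MAX_LOOKAHEAD, the limit of 200 characters, and the choice of the line end as
the "natural end" are all assumptions made for illustration.

```python
MAX_LOOKAHEAD = 200  # assumed limit, playing the role of a maximum line length


def scan_backtick(text, pos, limit=MAX_LOOKAHEAD):
    """Scan the lexeme that starts at text[pos], which must be a backtick.

    Returns (kind, lexeme, next_pos). If a closing backtick is found on the
    same line and within `limit` characters, the whole span is one CODE
    lexeme; otherwise the backtick is returned alone as a TEXT lexeme.
    """
    assert text[pos] == "`"
    end = min(len(text), pos + 1 + limit)
    for i in range(pos + 1, end):
        ch = text[i]
        if ch == "\n":      # the line end is the "natural end" in this sketch
            break
        if ch == "`":       # found the partner: the span is a single lexeme
            return ("CODE", text[pos:i + 1], i + 1)
    # No partner within the limit: the backtick stands alone.
    return ("TEXT", "`", pos + 1)
```

With this shape, a forgotten closing ` costs at most `limit` characters of
scanning and never swallows the rest of the document.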

And these stages are all interleaved. For instance, the front-end of the 
Janus/Ada compiler is driven by its parser. The parse starts by calling 
Get_Token to get the first token (lexeme) in the program. The parser then 
does its thing until that token is consumed, at which point Get_Token is 
called again. In the other direction, the parser calls various routines to 
do things when grammar productions are recognized. And that's the whole 
design.

Get_Token does the reading and buffering of the source file as needed, and
determines the tokens based on the source alone (it knows nothing about the
state of the parser). (Note that you will not want to read one character at
a time if you want anything resembling decent performance; you'll want to
read chunks of text at a time. That also makes the lookahead requirement a
non-problem: just be sure to buffer more text than your lookahead
requirement.)
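
One possible way to arrange that buffering, sketched in Python under assumed
names and sizes (ChunkedSource, CHUNK_SIZE, and MAX_LOOKAHEAD are made up for
this example and are not from Janus/Ada): the source is read in chunks, and a
refill always leaves more than the lookahead limit in the buffer unless the
end of the input has been reached.

```python
import io

CHUNK_SIZE = 4096     # assumed chunk size
MAX_LOOKAHEAD = 200   # assumed lexical lookahead limit


class ChunkedSource:
    def __init__(self, stream, chunk_size=CHUNK_SIZE, lookahead=MAX_LOOKAHEAD):
        self._stream = stream
        self._chunk_size = chunk_size
        self._lookahead = lookahead
        self._buf = ""
        self._pos = 0
        self._eof = False

    def _fill(self):
        # Drop consumed text, then read chunks until more than the
        # lookahead limit is buffered or the input runs out.
        self._buf = self._buf[self._pos:]
        self._pos = 0
        while not self._eof and len(self._buf) <= self._lookahead:
            chunk = self._stream.read(self._chunk_size)
            if chunk:
                self._buf += chunk
            else:
                self._eof = True

    def peek(self, n=1):
        """Return up to n upcoming characters without consuming them."""
        if n > self._lookahead:
            raise ValueError("peek exceeds the lookahead limit")
        if len(self._buf) - self._pos < n:
            self._fill()
        return self._buf[self._pos:self._pos + n]

    def advance(self, n=1):
        """Consume n characters that were previously peeked."""
        self._pos += n
```

A lexer built on this calls peek to look ahead and advance to consume, for
example on ChunkedSource(io.StringIO(text)), and never sees the chunk
boundaries.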

Note that this effectively reads a file as a stream (modulo some buffering
of lookahead). There's no backtracking in the parser; all of the lookahead
is purely lexical.

                           Randy.




