From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: [Slightly OT] How to process lightweight text markup languages?
Date: Thu, 22 Jan 2015 15:48:34 -0600
Message-ID: <m9rr7j$38v$1@loke.gir.dk>
In-Reply-To: lskzqkn5ssua$.zt9p9m6m2yro$.dlg@40tude.net
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:lskzqkn5ssua$.zt9p9m6m2yro$.dlg@40tude.net...
> On Thu, 22 Jan 2015 13:41:12 +0000 (UTC), Natasha Kerensikova wrote:
...
>> Unless, again, I'm missing something. But I'm starting to wonder whether
>> I'm too ignorant and/or you too knowledgeable to bridge the conceptual
>> gap between us.
>
> I think you are overdesigning it. The lexer need not to know anything. It
> simply matches a lexeme and moves to the next one. A lexeme in your case
> could be
...
I agree, the lexical level should be simple and as context-free as possible.
It may need some amount of lookahead, though. If you see a `, for instance,
you may want to scan ahead until the natural end (whatever that is) to see if
there is another ` (if not, then you just return the ` by itself). I wouldn't
worry about pathological inputs that have 10 million characters between the
`s -- in most circumstances, that's just going to be a stand-alone ` that
happens to be followed, much later, by the next one in the markup. So even if
there is no limit in the specification, I would add one just to prevent
quoting the entire text when one quote mark is left out. (Ada uses line ends
for this purpose, and it allows limiting the line length, so the maximum
lookahead is the maximum line length. I suggest something similar in your
case.)
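To make the bounded lookahead concrete, here is a minimal sketch in Python (chosen only for brevity). Everything named in it is an assumption for illustration: the limit of 80 characters, the choice of line end as the "natural end", and the token kinds CODE and TEXT are not taken from any markup specification.

```python
MAX_LOOKAHEAD = 80  # assumed limit, playing the role of a maximum line length


def scan_backtick(text, pos, limit=MAX_LOOKAHEAD):
    """Lex the backtick at text[pos]; return (kind, lexeme, next_pos).

    Scans ahead at most `limit` characters, and never past a line end,
    looking for a closing backtick. If none is found, the opening
    backtick is returned as an ordinary one-character lexeme.
    """
    assert text[pos] == "`"
    end = min(len(text), pos + 1 + limit)
    for i in range(pos + 1, end):
        ch = text[i]
        if ch == "\n":
            break  # assumed "natural end": give up at the line end
        if ch == "`":
            # Found the partner: the span between the two is one lexeme.
            return ("CODE", text[pos + 1:i], i + 1)
    # No partner within the limit: just return the stand-alone backtick.
    return ("TEXT", "`", pos + 1)
```

With this, a forgotten closing ` costs at most `limit` characters of scanning and can never swallow the rest of the document.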
And these stages are all interleaved. For instance, the front-end of the
Janus/Ada compiler is driven by its parser. The parser starts by calling
Get_Token to get the first token (lexeme) in the program. The parser then
does its thing until that token is consumed, at which point Get_Token is
called again. In the other direction, the parser calls various routines to
do things when grammar productions are recognized. And that's the whole
design.
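As a rough illustration of that control flow (not the actual Janus/Ada code, which is not shown here), the sketch below has a parser loop that pulls tokens from a lexer on demand and calls action routines as it recognizes things. The class `Lexer`, the token kinds, the trivial "grammar" and the callbacks `on_text` and `on_line_end` are all hypothetical.

```python
import re

# Hypothetical lexemes: a backtick, a line end, or a run of anything else.
TOKEN_RE = re.compile(r"`|\n|[^`\n]+")


class Lexer:
    """Knows only the source text, nothing about the parser's state."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def get_token(self):
        if self.pos >= len(self.text):
            return ("EOF", "")
        match = TOKEN_RE.match(self.text, self.pos)  # always matches here
        self.pos = match.end()
        lexeme = match.group()
        if lexeme == "`":
            return ("TICK", lexeme)
        if lexeme == "\n":
            return ("NEWLINE", lexeme)
        return ("TEXT", lexeme)


def parse(lexer, on_text, on_line_end):
    """Parser-driven loop: fetch a token, act on it, fetch the next."""
    token = lexer.get_token()          # first token of the input
    while token[0] != "EOF":
        kind, lexeme = token
        if kind == "NEWLINE":
            on_line_end()              # action for "end of line" recognized
        else:
            on_text(lexeme)            # action for "piece of text" recognized
        token = lexer.get_token()      # token consumed, so ask for another
```

Calling `parse(Lexer(source), print, lambda: print("--"))` streams through the source once; the lexer and the actions never call each other directly.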
Get_Token does the reading and buffering of the source file as needed, and
determines the tokens based solely on the source (it knows nothing about the
state of the parser). (Note that you will not want to read one character at
a time if you want anything resembling decent performance; you'll want to
read chunks of text at a time, so that the lookahead requirement becomes a
non-problem. Just be sure to buffer more text than your lookahead
requirement.)
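One way the chunked buffering could look, again as a hedged Python sketch: the class name `BufferedSource`, its `peek`/`advance` methods and the 4096-character chunk size are assumptions made for this example. The buffer is refilled a chunk at a time until it holds at least as much text as the lexer asks to look ahead at, or the input ends.

```python
import io

CHUNK_SIZE = 4096  # assumed chunk size; any reasonable value works


class BufferedSource:
    """Reads a text stream in chunks and serves bounded lookahead."""

    def __init__(self, stream, chunk_size=CHUNK_SIZE):
        self.stream = stream
        self.chunk_size = chunk_size
        self.buf = ""      # unread text plus whatever was read ahead
        self.pos = 0       # index of the next unread character in buf
        self.eof = False

    def _fill(self, needed):
        # Keep reading chunks until `needed` characters are available
        # from the current position, or the stream is exhausted.
        while not self.eof and len(self.buf) - self.pos < needed:
            chunk = self.stream.read(self.chunk_size)
            if chunk == "":
                self.eof = True
            else:
                # Drop already-consumed text so the buffer stays small.
                self.buf = self.buf[self.pos:] + chunk
                self.pos = 0

    def peek(self, n=1):
        """Return up to n upcoming characters without consuming them."""
        self._fill(n)
        return self.buf[self.pos:self.pos + n]

    def advance(self, n=1):
        """Consume up to n characters."""
        self._fill(n)
        self.pos += min(n, len(self.buf) - self.pos)
```

For example, `BufferedSource(io.StringIO(text)).peek(80)` gives a lexer its whole lookahead window in one call, and nothing is ever re-read from the file.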
Note that this effectively reads a file as a stream (modulo some buffering
for lookahead). There's no backtracking in the parser; all of the lookahead
is purely lexical.
Randy.