From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: Parser interface design
Date: Thu, 14 Apr 2011 19:22:43 -0500
Message-ID: <io834n$f6a$1@munin.nbi.dk>
In-Reply-To: slrniqd6if.2fnq.lithiumcat@sigil.instinctive.eu
"Natasha Kerensikova" <lithiumcat@gmail.com> wrote in message
news:slrniqd6if.2fnq.lithiumcat@sigil.instinctive.eu...
> Hello,
>
> On 2011-04-13, Randy Brukardt <randy@rrsoftware.com> wrote:
>> "Natasha Kerensikova" <lithiumcat@gmail.com> wrote in message
>> news:slrniqan7b.2fnq.lithiumcat@sigil.instinctive.eu...
>> ...
>>> These dangerous features are what made me want to cripple the parser in
>>> the first place, and I thought it makes no sense to allow only a few
>>> features to be disabled when I can just as easily allow all of them to
>>> be independently turned on or off -- hence my example of disabling
>>> emphasis.
>>>
>>> Are my motivations clearer now, or is it still just a whim of the
>>> customer imposing a fragile design?
>>
>> Your intentions are fine, but I still don't think you should be trying to
>> modify the behavior of the parser; that's the job for the "interpretation"
>> layer. Maybe that's because of my compiler background, but what you are
>> trying to do is very similar to a compiler, or to the Ada Standard
>> formatter, or many other batch-oriented tools.
>
> Well, I intended to do both, modify the parser behavior and put some
> logic on the interpretation/output layer.
>
> Isn't it the parser role to tell whether the string "<script>" is normal
> text or an HTML tag? That's the kind of modification I was thinking
> about.
Not really, but I suspect that we are talking about different things. To me,
"a parser" is a very specific piece of technology (yacc being one example,
but of course they can be hand-coded as well). These days, people seem to be
lumping a lot of non-parser stuff into the term "parser". To take a concrete
example, an "XML parser" is some very complex piece of software. But hardly
anything it does has anything to do with parsing! The syntax of XML is so
simple that no parser is actually needed, just a smart scanner. (None of my
HTML tools [nor the RM tool] have a formal parser, because the input
language is so simple that a couple of helpers in the scanner are
sufficient.)
Anyway, it doesn't make sense to "modify" a parser, because that implies
that you are taking a different input grammar. And doing that means that you
have a *different* parser. It might make sense in some circumstances to take
multiple input languages, but I would consider that (and implement that) as
a forest of different parsers (one per grammar) with a common output format.
(That is, I would create an abstract Parser type, and then create a separate
derived object to represent the specific parser. Again, look at the RM
formatter code to see how I did it there.)
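To illustrate the "forest of parsers" organization, here is a minimal sketch, in Python rather than Ada for brevity. Every name in it (Node, Parser, PlainParser, EmphasisParser) is hypothetical and is not taken from the RM formatter; the emphasis grammar is deliberately tiny and does not nest.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Common output format shared by every parser (hypothetical)."""
    kind: str  # "text" or "emphasis"
    text: str


class Parser(ABC):
    """Abstract parser: one derived class per input grammar."""

    @abstractmethod
    def parse(self, source: str) -> List[Node]:
        ...


class PlainParser(Parser):
    """Grammar with no markup at all: everything is ordinary text."""

    def parse(self, source: str) -> List[Node]:
        return [Node("text", source)] if source else []


class EmphasisParser(Parser):
    """Grammar in which *...* marks emphasis (no nesting, for brevity)."""

    def parse(self, source: str) -> List[Node]:
        pieces = source.split("*")
        if len(pieces) % 2 == 0:
            # Odd number of "*": the last one is unmatched, keep it literal.
            tail = pieces.pop()
            pieces[-1] += "*" + tail
        nodes = []
        for index, piece in enumerate(pieces):
            if piece:
                # Pieces at odd positions sit between a pair of "*".
                kind = "emphasis" if index % 2 == 1 else "text"
                nodes.append(Node(kind, piece))
        return nodes
```

The caller picks the object that matches the input language; "disabling emphasis" then means choosing PlainParser, not patching EmphasisParser.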
It might make sense to "modify" a scanner or some other phase, but there too
the best organization probably is a forest of objects (choose the right one
for the job). If there was a minor difference, I'd probably control it with
an "options" parameter when the object is set up.
> Isn't it the HTML renderer role to escape angular bracket when the
> script "<script>" is normal text? I believe it is, because the escaping
> is HTML-specific. It wouldn't need the same escaping if the output was
> PDF, for example.
Yes, see the rest of my message.
> Isn't it again the renderer role to make whatever sense it can out of a
> "<script>" tag depending on the output format? For HTML output it's a
> simple copy, but it seems non-trivial for a PDF output, and impossible
> for a plain-text output. But that's not something for the parser to
> worry about.
Yes, this is exactly what I was suggesting.
>> In your specific case, I believe that preventing "execution" of embedded
>> HTML and the like is the job of the output layer (renderer), because that
>> way it is impossible to forget a case and allow something through. In the
>> RM Formatter tool, that is accomplished by having all text that is
>> intended to be visible in the output format go through a particular
>> output interface: "Ordinary_Text". And that interface is responsible for
>> quoting any characters that might be interpreted as commands ("<", ">",
>> "&" for HTML, "\" for RTF, and so on.) You would have a separate
>> interface for anything that you wanted to output directly (so that it
>> could be executed), such as your script example.
>
> In my case, escaping special character like angular bracket so that they
> are considered normal text when it is normal text, is indeed something
> on the renderer level. But this is different from enabling or disabling
> language features.
Right. I normally do that in the middle layer. That is, the parser returns
the structures that it finds, and then the middle layer decides what to do
with them (including ignoring them).
But I wouldn't even consider trying to allow "commands" or whatever you are
trying to parse in the input. I've always required them to be escaped
somehow. So perhaps we're solving different problems.
>> If the rule is that the renderer should always make everything it
>> outputs harmless unless it is explicitly instructed otherwise, you'll
>> have a lot less trouble.
>
> I never intended not to follow that rule. But a script tag *is*
> harmless, if the input can be trusted.
The number one rule of secure programming is that *no* input can be trusted.
Yes, we all violate that from time to time, but it is a good rule to keep in
mind.
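A minimal sketch of a renderer that is "harmless unless explicitly instructed otherwise", in Python for brevity. The class and method names (HtmlRenderer, RtfRenderer, ordinary_text, raw_markup) are hypothetical and only loosely modelled on the "Ordinary_Text" interface mentioned in the quoted text; this is not the RM Formatter's code.

```python
import html


class HtmlRenderer:
    """All visible text goes through ordinary_text, which always escapes."""

    def __init__(self):
        self._out = []

    def ordinary_text(self, text):
        # html.escape quotes "<", ">" and "&" (and, by default, quote marks).
        self._out.append(html.escape(text))

    def raw_markup(self, markup):
        # Separate, explicit interface for output that must not be escaped.
        self._out.append(markup)

    def result(self):
        return "".join(self._out)


class RtfRenderer(HtmlRenderer):
    """Same interface, but with the escaping rules of RTF."""

    def ordinary_text(self, text):
        # Escape the backslash first so the ones added for braces survive.
        escaped = text.replace("\\", "\\\\")
        escaped = escaped.replace("{", "\\{").replace("}", "\\}")
        self._out.append(escaped)
```

Because escaping is the default path, forgetting a case leaves the output safe; only a deliberate call to raw_markup can emit something executable.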
> Now if it was a matter of forbidding specifically the script-tag, while
> allowing others deemed "harmless", then I agree it should be done on the
> renderer level. But changing the language grammar to wipe out the very
> concept of inline HTML tag is definitely something to be handled in the
> parser.
As I said, that's a *different* parser from one that supports HTML. I'd use
a different object to represent each, rather than trying to share them.
>> To take an example, an Ada compiler doesn't "modify the behavior of the
>> parser" to deal with comments or strings in the source; these are treated
>> as single elements and aren't parsed at all. If one of these needs to be
>> output, it will just be output with the renderer making any
>> transformations needed to keep the output safe. Thus, there is no need to
>> look inside of these constructs to see what is in them.
>
> Does an Ada compiler modify the behavior of the parser when selecting
> Ada83 vs Ada95 vs Ada05? That's exactly what this is about here: it's
> different feature sets, except that for convenience and coherence the
> features are not enabled or disabled individually.
No, absolutely not. There is only one grammar for the compiler (an extended
Ada 2005); anything not supported is flagged by the middle layer (the
semantic pass).
There is a practical reason for this; error handling by parsers tends to be
somewhere between sorta OK and terrible. We can provide much more targeted
error messages (like "Silly programmer, you used not null, an Ada 2005
feature, in your Ada 83 program" :-) by putting them into the middle pass.
That's probably one reason that I tend to avoid parsing at all when possible
(just keeping the scanning part).
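A rough sketch of that division of labour, in Python for brevity: the parser accepts the full grammar, and a middle pass checks each construct against the selected language level. The table, the Construct record, and check_features are hypothetical illustrations, not how any particular compiler is organized.

```python
from dataclasses import dataclass

# Language levels in chronological order.
LEVELS = ["Ada 83", "Ada 95", "Ada 2005"]

# Hypothetical table: feature name -> level that introduced it.
INTRODUCED_IN = {
    "package": "Ada 83",
    "tagged type": "Ada 95",
    "not null": "Ada 2005",
}


@dataclass
class Construct:
    """Something the parser found, with its source line."""
    feature: str
    line: int


def check_features(constructs, mode):
    """Return one targeted message per construct that is too new for mode."""
    allowed = LEVELS.index(mode)
    messages = []
    for construct in constructs:
        needed = INTRODUCED_IN[construct.feature]
        if LEVELS.index(needed) > allowed:
            messages.append(
                f"line {construct.line}: '{construct.feature}' is an "
                f"{needed} feature, not available in {mode} mode"
            )
    return messages
```

The parser never fails on "not null"; the message comes from a pass that knows exactly which feature was used and why it is rejected.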
> The standard Markdown grammar might look like this:
>
> ...
> Span_Element ::= Normal_Text | Emphasis | Code_Span | ...
> Emphasis ::= "*" Span_Element "*" | "_" Span_Element "_"
> Code_Span ::= "`" Inner_Code_Span "`"
> Inner_Code_Span ::= Code_Text | Code_Span
> ...
>
> Now when I'm talking about "disabling emphasis", I mean parsing the
> following grammar instead:
>
> ...
> Span_Element ::= Normal_Text | Code_Span | ...
> Code_Span ::= "`" Inner_Code_Span "`"
> Inner_Code_Span ::= Code_Text | Code_Span
> ...
>
> This is of course very different from "rendering emphasis spans like
> normal text" or "apply no formatting to mark emphasis" or whatever. It's
> just ensuring that the feature cannot cause any harm by preventing its
> very existence. How can you make it any safer than that?
As I said, these are thus different parsers. The table-driven parsers that I
typically use can't be modified, so the issue never comes up. OTOH, if the
grammar is simple enough that a hand-written parser would do, I probably
wouldn't write a parser at all and just use the scanner directly (that's what
the RM Formatter does).
And I'd control the scanner/parser by a combination of separate objects and
a parameter to the object-creation routine to set whatever options are needed.
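As a sketch of the "options at creation" idea, in Python for brevity: the scanner is configured once when it is constructed, and no call-backs are involved. ScannerOptions, Scanner, and the token names are hypothetical, and the scanner only classifies single marker characters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScannerOptions:
    """Settings fixed when the scanner object is created."""
    emphasis: bool = True
    code_spans: bool = True


class Scanner:
    def __init__(self, options=ScannerOptions()):
        # Only enabled features get marker characters; with a feature
        # switched off, its characters are scanned as ordinary text.
        self._markers = {}
        if options.emphasis:
            self._markers["*"] = "emphasis_marker"
            self._markers["_"] = "emphasis_marker"
        if options.code_spans:
            self._markers["`"] = "code_marker"

    def tokens(self, source):
        """Split source into (kind, text) pairs."""
        result = []
        text = ""
        for ch in source:
            kind = self._markers.get(ch)
            if kind is None:
                text += ch
                continue
            if text:
                result.append(("text", text))
                text = ""
            result.append((kind, ch))
        if text:
            result.append(("text", text))
        return result
```

With Scanner(ScannerOptions(emphasis=False)), the input "a *b*" comes back as a single text token, so later layers never see an emphasis marker at all.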
But then again, I hate call-back subprograms, and would only use them when
there is no other solution. An OO solution would work well here, so I don't
see any reason to use unstructured call-backs. Thus, using them as some sort
of parameter control isn't an idea that I would ever intend to use (and it
seems unnecessarily tricky on top of that). At best, it's premature
optimization (you're saving one byte somewhere, and perhaps one compare
instruction, although on a lot of architectures, it probably doesn't save
any instructions).
Randy.