Re: Parser interface design

comp.lang.ada

From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: Parser interface design
Date: Thu, 14 Apr 2011 19:22:43 -0500
Message-ID: <io834n$f6a$1@munin.nbi.dk>
In-Reply-To: slrniqd6if.2fnq.lithiumcat@sigil.instinctive.eu

"Natasha Kerensikova" <lithiumcat@gmail.com> wrote in message 
news:slrniqd6if.2fnq.lithiumcat@sigil.instinctive.eu...
> Hello,
>
> On 2011-04-13, Randy Brukardt <randy@rrsoftware.com> wrote:
>> "Natasha Kerensikova" <lithiumcat@gmail.com> wrote in message
>> news:slrniqan7b.2fnq.lithiumcat@sigil.instinctive.eu...
>> ...
>>> These dangerous features are what made me want to cripple the parser in
>>> the first place, and I thought it makes no sense to allow only a few
>>> features to be disabled when I can just as easily allow all of them to
>>> be independently turned on or off -- hence my example of disabling
>>> emphasis.
>>>
>>> Are my motivations clearer now, or is it still just a whim of the
>>> customer imposing a fragile design?
>>
>> Your intentions are fine, but I still don't think you should be trying to
>> modify the behavior of the parser; that's the job for the "interpretation"
>> layer. Maybe that's because of my compiler background, but what you are
>> trying to do is very similar to a compiler, or to the Ada Standard
>> formatter, or many other batch-oriented tools.
>
> Well, I intended to do both, modify the parser behavior and put some
> logic on the interpretation/output layer.
>
> Isn't it the parser role to tell whether the string "<script>" is normal
> text or an HTML tag? That's the kind of modification I was thinking
> about.

Not really, but I suspect that we are talking about different things. To me, 
"a parser" is a very specific piece of technology (yacc being one example, 
but of course they can be hand-coded as well). These days, people seem to be 
lumping a lot of non-parser stuff into the term "parser". To take a concrete 
example, an "XML parser" is some very complex piece of software. But hardly 
anything it does has anything to do with parsing! The syntax of XML is so 
simple that no parser is actually needed, just a smart scanner. (None of my 
HTML tools [nor the RM tool] have a formal parser, because the input 
language is so simple that a couple of helpers in the scanner are 
sufficient.)

Anyway, it doesn't make sense to "modify" a parser, because that implies 
that you are taking a different input grammar. And doing that means that you 
have a *different* parser. It might make sense in some circumstances to take 
multiple input languages, but I would consider that (and implement that) as 
a forest of different parsers (one per grammar) with a common output format. 
(That is, I would create an abstract Parser type, and then create a separate 
derived object to represent the specific parser. Again, look at the RM 
formatter code to see how I did it there.)
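A rough sketch of that "forest of parsers" organization follows, in Python rather than Ada for brevity. All of the names (`Element`, `Parser`, `PlainParser`, `StarEmphasisParser`) and the toy grammars are assumptions made up for illustration; this is not how the RM formatter is actually written.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Common output format shared by every parser (hypothetical)."""
    kind: str   # e.g. "text" or "emphasis"
    text: str


class Parser(ABC):
    """Abstract parser: one concrete subclass per input grammar."""

    @abstractmethod
    def parse(self, source: str) -> List[Element]:
        ...


class PlainParser(Parser):
    """Grammar with no markup at all: everything is ordinary text."""

    def parse(self, source: str) -> List[Element]:
        return [Element("text", source)] if source else []


class StarEmphasisParser(Parser):
    """Toy grammar in which *...* marks emphasis; nothing else is special."""

    def parse(self, source: str) -> List[Element]:
        elements = []
        pos = 0
        while pos < len(source):
            start = source.find("*", pos)
            end = source.find("*", start + 1) if start != -1 else -1
            if start == -1 or end == -1:
                # No complete *...* pair remains: the rest is ordinary text.
                elements.append(Element("text", source[pos:]))
                break
            if start > pos:
                elements.append(Element("text", source[pos:start]))
            elements.append(Element("emphasis", source[start + 1:end]))
            pos = end + 1
        return elements
```

"Disabling emphasis" then means picking `PlainParser` instead of `StarEmphasisParser`; the code that consumes the `Element` list does not need to know which one ran.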

It might make sense to "modify" a scanner or some other phase, but there too 
the best organization probably is a forest of objects (choose the right one 
for the job). If there was a minor difference, I'd probably control it with 
an "options" parameter when the object is set up.

> Isn't it the HTML renderer role to escape angular bracket when the
> script "<script>" is normal text? I believe it is, because the escaping
> is HTML-specific. It wouldn't need the same escaping if the output was
> PDF, for example.

Yes, see the rest of my message.

> Isn't it again the renderer role to make whatever sense it can out of a
> "<script>" tag depending on the output format? For HTML output it's a
> simple copy, but it seems non-trivial for a PDF output, and impossible
> for a plain-text output. But that's not something for the parser to
> worry about.

Yes, this is exactly what I was suggesting.

>> In your specific case, I believe that preventing "execution" of embedded
>> HTML and the like is the job of the output layer (renderer), because that
>> way it is impossible to forget a case and allow something through. In the
>> RM Formatter tool, that is accomplished by having all text that is intended
>> to be visible in the output format go through a particular output interface:
>> "Ordinary_Text". And that interface is responsible for quoting any
>> characters that might be interpreted as commands ("<", ">", "&" for HTML,
>> "\" for RTF, and so on.) You would have a separate interface for anything
>> that you wanted to output directly (so that it could be executed), such as
>> your script example.
>
> In my case, escaping special character like angular bracket so that they
> are considered normal text when it is normal text, is indeed something
> on the renderer level. But this is different from enabling or disabling
> language features.

Right. I normally do that in the middle layer. That is, the parser returns 
the structures that it finds, and then the middle layer decides what to do 
with them (including ignoring them).
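To make that split concrete, here is a minimal sketch (Python rather than Ada, purely for brevity). Everything in it is hypothetical: the `(kind, text)` element tuples, the `interpret` function, and the `HtmlRenderer` class are illustrative names, not the RM Formatter's actual code. The middle layer decides which structures survive, and the renderer quotes everything that goes through its ordinary-text interface.

```python
import html


def interpret(elements, enabled):
    """Middle layer: keep enabled kinds, treat all others as ordinary text.

    `elements` is a list of (kind, text) pairs as returned by some parser
    (a hypothetical format). Dropping disabled elements entirely would be
    an equally valid policy.
    """
    result = []
    for kind, text in elements:
        if kind == "text" or kind in enabled:
            result.append((kind, text))
        else:
            result.append(("text", text))
    return result


class HtmlRenderer:
    """Output layer: everything visible goes through ordinary_text()."""

    def __init__(self):
        self.out = []

    def ordinary_text(self, text):
        # Quote characters that HTML would interpret as commands.
        self.out.append(html.escape(text))

    def raw(self, markup):
        # Separate, explicit interface for output that is meant to be live.
        self.out.append(markup)

    def emphasis(self, text):
        self.raw("<em>")
        self.ordinary_text(text)
        self.raw("</em>")


def render(elements):
    """Dispatch each element to the matching renderer interface."""
    renderer = HtmlRenderer()
    for kind, text in elements:
        if kind == "emphasis":
            renderer.emphasis(text)
        elif kind == "raw_html":
            renderer.raw(text)
        else:
            renderer.ordinary_text(text)
    return "".join(renderer.out)
```

With only emphasis enabled, a `raw_html` element is demoted to text and comes out quoted, so forgetting a case fails safe rather than letting markup through.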

But I wouldn't even consider trying to allow "commands" or whatever you are 
trying to parse in the input. I've always required them to be escaped 
somehow. So perhaps we're solving different problems.

>> If the rule is that the renderer should always make everything it
>> outputs harmless unless it is explicitly instructed otherwise, you'll
>> have a lot less trouble.
>
> I never intended not to follow that rule. But a script tag *is*
> harmless, if the input can be trusted.

The number one rule of secure programming is that *no* input can be trusted. 
Yes, we all violate that from time to time, but it is a good rule to keep in 
mind.

> Now if it was a matter of forbidding specifically the script-tag, while
> allowing others deemed "harmless", then I agree it should be done on the
> renderer level. But changing the language grammar to wipe out the very
> concept of inline HTML tag is definitely something to be handled in the
> parser.

As I said, that's a *different* parser from one that supports HTML. I'd use 
a different object to represent each, rather than trying to share them.

>> To take an example, an Ada compiler doesn't "modify the behavior of the
>> parser" to deal with comments or strings in the source; these are treated as
>> single elements and aren't parsed at all. If one of these needs to be
>> output, it will just be output with the renderer making any transformations
>> needed to keep the output safe. Thus, there is no need to look inside of
>> these constructs to see what is in them.
>
> Does an Ada compiler modify the behavior of the parser when selecting
> Ada83 vs Ada95 vs Ada05? That's exactly what this is about here: it's
> different feature sets, except that for convenience and coherence the
> features are not enabled or disabled individually.

No, absolutely not. There is only one grammar for the compiler (an extended 
Ada 2005); anything not supported is flagged by the middle layer (the 
semantic pass).

There is a practical reason for this; error handling by parsers tends to be 
somewhere between sorta OK and terrible. We can provide much more targeted 
error messages (like "Silly programmer, you used not null, an Ada 2005 
feature, in your Ada 83 program" :-) by putting them into the middle pass.
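As a small illustration of that arrangement (a Python sketch with a made-up feature table and function name, not the compiler's actual data structures): the parser accepts the full grammar and merely records which features it saw, and a later pass compares them against the selected language version.

```python
# Hypothetical table: the language revision that introduced each feature.
FEATURE_INTRODUCED = {
    "not null": 2005,
    "interface": 2005,
    "protected type": 1995,
}

LABEL = {1983: "Ada 83", 1995: "Ada 95", 2005: "Ada 2005"}


def check_features(features_used, target):
    """Semantic-pass check: flag features newer than the target revision.

    `features_used` is whatever the single, full-grammar parser recorded;
    `target` is one of the keys of LABEL.
    """
    messages = []
    for feature in features_used:
        introduced = FEATURE_INTRODUCED.get(feature)
        if introduced is not None and introduced > target:
            messages.append(
                f"{feature!r} is an {LABEL[introduced]} feature, "
                f"not available in {LABEL[target]}"
            )
    return messages
```

Because the check runs after parsing, the message can name the exact feature and revision instead of reporting a generic syntax error.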

That's probably one reason that I tend to avoid parsing at all when possible 
(just keeping the scanning part).

> The standard Markdown grammar might look like this:
>
> ...
> Span_Element ::= Normal_Text | Emphasis | Code_Span | ...
> Emphasis ::= "*" Span_Element "*" | "_" Span_Element "_"
> Code_Span ::= "`" Inner_Code_Span "`"
> Inner_Code_Span ::= Code_Text | Code_Span
> ...
>
> Now when I'm talking about "disabling emphasis", I mean parsing the
> following grammar instead:
>
> ...
> Span_Element ::= Normal_Text | Code_Span | ...
> Code_Span ::= "`" Inner_Code_Span "`"
> Inner_Code_Span ::= Code_Text | Code_Span
> ...
>
> This is of course very different from "rendering emphasis spans like
> normal text" or "apply no formatting to mark emphasis" or whatever. It's
> just ensuring that the feature cannot cause any harm by preventing its
> very existence. How can you make it any safer than that?

As I said, these are thus different parsers. The table-driven parsers that I 
typically use can't be modified, so the issue never comes up. OTOH, if the 
grammar is simple enough that a hand-written parser would do, I probably 
wouldn't write a parser at all and just use the scanner directly (that's what 
the RM Formatter does).

And I'd control the scanner/parser by a combination of separate objects and 
a parameter to the create object routine to set whatever settings.
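For the "parameter to the create routine" half of that, a minimal sketch might look like the following (Python for brevity; `ScannerOptions`, `Scanner`, and the token names are hypothetical). The settings are fixed when the object is created, so no call-backs are involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScannerOptions:
    """Settings fixed when the scanner object is created (hypothetical)."""
    recognize_emphasis: bool = True


class Scanner:
    def __init__(self, options=ScannerOptions()):
        self.options = options

    def tokens(self, source):
        """Split source into ("star", "*") and ("text", run) tokens."""
        result = []
        buffer = []
        for ch in source:
            if ch == "*" and self.options.recognize_emphasis:
                if buffer:
                    result.append(("text", "".join(buffer)))
                    buffer = []
                result.append(("star", "*"))
            else:
                # With emphasis switched off, "*" is just another character.
                buffer.append(ch)
        if buffer:
            result.append(("text", "".join(buffer)))
        return result
```

Creating the scanner with `ScannerOptions(recognize_emphasis=False)` makes `*` ordinary text for the lifetime of that object.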

But then again, I hate call-back subprograms, and would only use them when 
there is no other solution. An OO solution would work well here, so I don't 
see any reason to use unstructured call-backs. Thus, using them as some sort 
of parameter control isn't an idea that I would ever intend to use (and it 
seems unnecessarily tricky on top of that). At best, it's premature 
optimization (you're saving one byte somewhere, and perhaps one compare 
instruction, although on a lot of architectures, it probably doesn't save 
any instructions).

                                                                      Randy.
