From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,79e55eadd97001c2
X-Google-Attributes: gid103376,public
From: dewar@merv.cs.nyu.edu (Robert Dewar)
Subject: Re: Compiler error messages
Date: 1998/01/23
Message-ID: <dewar.885556568@merv>
X-Deja-AN: 318585480
References: <01bd278c$bea48680$9dfc82c1@xhv46.dial.pipex.com>
X-Complaints-To: usenet@news.nyu.edu
X-Trace: news.nyu.edu 885557940 3238 (None) 128.122.140.58
Organization: New York University
Newsgroups: comp.lang.ada
Date: 1998-01-23T00:00:00+00:00
List-Id: <comp.lang.ada>


Nick Roberts said

<<My advice to compiler writers would be: make SURE that the compiler reports
any error 100% accurately.  That means making NO assumptions about what
caused the error ("oh, it was _probably_ because the user forgot to type a
semicolon", etc...).  It means reporting everything that could possibly
have caused the error (directly!), even if this means a humungous error
message.  It means producing a technically precise message, even if you
feel some users would prefer something more 'down to earth' (because 'down
to earth' invariably means inaccurate/incomplete/vague/wrong).
>>

The trouble is that this reasonable-sounding prescription is meaningless.

A program is either right or wrong from a formal point of view. Especially
when it comes to syntax errors, the only possible error message that the
above principle could permit is

"The above program does not meet the syntax in the Ada RM"

without any indication of where or what is wrong. To give *any* more 
detailed indication of what is wrong requires that you make assumptions
of the kind that you say you don't like.

I don't know how much you know about compiler techniques, but a compiler
never really knows anything about what is wrong in the absence of
assumptions of some kind.

The question always boils down to how to make these assumptions.
It is of course huge and unuseful hyperbole to say that compilers
that attempt to give a clear message "invariably [result in]
inaccurate/incomplete/vague/wrong [messages]".

Your mention of a "technically precise" message is not thought through
carefully. It makes me think that you are a user and not a builder of
compilers, since if you built them, you would be more aware of this
obvious point.

For example, in the discussion at hand

  a := b & + c;

all the following messages are technically precise in the only possible
sense that this can be meaningful

  Missing operand between & and +
  + c must be parenthesized
  Redundant + ignored
  
These are relatively reasonable. The following are just as precise
from a formal point of view

  identifier You_Did_Not_Want_This_Here missing between & and +
  above statement should have been "accept abc"
  & + replaced by minus operator

etc. The only reason these "technically correct" messages are "wrong" is
because they are making less likely assumptions than the first set.
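
To make the point about likelihood concrete, here is a minimal sketch in
Python (not taken from GNAT or any real compiler). It treats each message
above as a proposed repair of the token stream and ranks the repairs by
token edit distance from what was actually written. Using edit distance as
a stand-in for "likely" is purely an assumption of this sketch, and the
names edit_distance, CANDIDATES and rank are made up for the illustration.

```python
from typing import List, Tuple


def edit_distance(a: List[str], b: List[str]) -> int:
    """Token-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # keep or substitute
        prev = cur
    return prev[-1]


# The statement as the programmer typed it, split into tokens.
WRITTEN = "a := b & + c ;".split()

# (message, repaired token stream). Each repair is one way of turning the
# input into something a parser could accept; <operand> is a placeholder.
CANDIDATES: List[Tuple[str, List[str]]] = [
    ("Missing operand between & and +", "a := b & <operand> + c ;".split()),
    ("+ c must be parenthesized", "a := b & ( + c ) ;".split()),
    ("Redundant + ignored", "a := b & c ;".split()),
    ("& + replaced by minus operator", "a := b - c ;".split()),
    ('above statement should have been "accept abc"', "accept abc ;".split()),
]


def rank(written: List[str],
         candidates: List[Tuple[str, List[str]]]) -> List[Tuple[int, str]]:
    """Return (cost, message) pairs, cheapest (assumed most likely) first."""
    scored = [(edit_distance(written, fixed), msg) for msg, fixed in candidates]
    return sorted(scored, key=lambda pair: pair[0])
```

Under this cost model the "accept abc" repair comes out last simply because
it rewrites almost the whole statement. A different cost model would give a
different order, which is exactly the sense in which every message rests on
assumptions.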

Let's take an example where GNAT does a lot of work in trying to come
up with a correct message (try this on various Ada compilers).

Write a big package body that looks like

    package body XYZ is
      procedure A;
      procedure B;
      procedure C;
      ...
      procedure Z;
      
      procedure A is ...
      procedure B is ...
      ...
      procedure Z is ...
    end;

That's fine. Now change the semicolon after the procedure spec for M to
an "is":

      procedure L;
      procedure M is
      procedure N;

That's an *easy* cut and paste error.

GNAT will tell you that the "is" should be a semicolon.

This is obvious to a human, but not at all obvious to a compiler.
Why not?

Well, the text from "procedure M is" up to and including the final end
statement is a valid procedure body.

OOOPS, slight mistake: for this to be 100% true, give the package body a
null statement part, so that it ends with:

    begin
       null;
    end;


The favorite Ada compiler that I used for years before GNAT simply said

"unexpected end of file" pointing to the end of the program for this.

Easy to see why: it scanned out what it thought was the body of M
successfully, and then planned on resuming the scan of the package
body and was surprised to find an end of file.

This was a truly horrid error. After a while you got to know it meant
that somewhere you had an "is" in place of a semicolon, and sometimes I
would have to do edits in a binary search to find the bad one.
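
A rough sketch of how a tool might catch this case, in Python. It is a
hypothetical line-based heuristic that trusts indentation, and is not a
description of what GNAT actually does: a subprogram header ending in "is"
that is immediately followed by another subprogram keyword at the same
indentation is reported as a probable "is" for ";". The function name
suspicious_is is invented for this sketch.

```python
import re
from typing import List

# A subprogram header whose line ends in "is", e.g. "procedure M is".
HEADER_WITH_IS = re.compile(r"^(\s*)(procedure|function)\b.*\bis\s*$",
                            re.IGNORECASE)
# Any line that starts a subprogram declaration or body.
SUBPROGRAM_START = re.compile(r"^(\s*)(procedure|function)\b", re.IGNORECASE)


def suspicious_is(lines: List[str]) -> List[int]:
    """Return 1-based line numbers where "is" probably should be ";".

    Assumption: nested declarations are indented more deeply than the
    enclosing subprogram header, so a following subprogram at the *same*
    indentation suggests the header was meant to be a plain spec.
    """
    hits = []
    for n in range(len(lines) - 1):
        head = HEADER_WITH_IS.match(lines[n])
        nxt = SUBPROGRAM_START.match(lines[n + 1])
        if head and nxt and len(head.group(1)) == len(nxt.group(1)):
            hits.append(n + 1)
    return hits
```

Run over the three lines shown earlier (L, M, N), this flags the line for M.
A properly indented nested procedure is left alone, and a badly indented
but legal one would be flagged wrongly, which is the price of guessing.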

Note that the GNAT message and the other compiler's message are both
technically valid, but one is MUCH more helpful than the other.

My experience in error messages is that it is not something that can
be addressed by simplistic principles of the type Nick is reaching
for. On the contrary, getting to the point of generating useful
error messages is extremely difficult.

Most people are pleasantly surprised at how well GNAT does in pinning
down messages (one of the students in my compiler class last semester,
where everyone was using Ada, sent some email asking how GNAT manages
to give such accurate error messages).

Now when that student was asking that question, what did he mean by
accurate?

Technically accurate?

Not at all. He meant messages that corresponded to the error he had made.

Now only the programmer knows the true fix for an error.

An informative error message means guessing correctly at something that
is close enough to this "real" reason to click.

This is difficult. A huge amount of effort in the GNAT sources goes into
this. Let's take another example.


Suppose during parsing you encounter a junk end line, i.e. one that is
not what is expected.

There are three possibilities

  1. It is a piece of junk that should be ignored

  2. It is a corruption of the currently expected end line, and should
     be accepted as such

  3. There is a missing end line, and this one belongs to an outer scope

It is absolutely crucial to make the "right" decision here, since a
wrong one will cause chaos in cascaded messages.

Of course you can't always make the right decision, but you can try.
GNAT uses all sorts of heuristics. It pays close attention to any
tokens used, to help match up end lines, and it even looks at the
indentation for a clue as to what was meant. If you are interested
in pursuing this, have a look at unit par-endh.adb in the GNAT sources.
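
The three-way decision can be sketched as follows, in Python. This is only
an illustration under simple assumptions: each open scope remembers its
name and starting column, a name match counts for more than a column match,
and the weights 2 and 1 are invented. The real logic in par-endh.adb is
much more elaborate and is not reproduced here. Scope and classify_end are
hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scope:
    name: str     # e.g. "XYZ" for "package body XYZ"
    column: int   # column at which the construct started


def classify_end(stack: List[Scope], end_name: Optional[str],
                 end_column: int) -> str:
    """Classify an unexpected end line; stack[-1] is the innermost scope.

    Returns "ignore" (possibility 1), "accept" (possibility 2) or
    "missing_end" (possibility 3).
    """
    best_index, best_score = None, 0
    for index in range(len(stack) - 1, -1, -1):   # innermost scope first
        scope = stack[index]
        score = 0
        # Ada identifiers are case-insensitive, so compare in lower case.
        if end_name is not None and end_name.lower() == scope.name.lower():
            score += 2
        if end_column == scope.column:
            score += 1
        if score > best_score:                    # ties keep the inner scope
            best_index, best_score = index, score
    if best_index is None:
        return "ignore"        # matches nothing open: treat as junk
    if best_index == len(stack) - 1:
        return "accept"        # best match is the end we were expecting
    return "missing_end"       # belongs to an outer scope
```

For example, with package body XYZ at column 0 and procedure M at column 3
open, a misspelled "end Mm;" at column 3 is accepted as the end of M, while
"end XYZ;" at column 0 is taken to mean that the end of M is missing.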

Of course GNAT does not do a perfect job in generating error messages.
This is not possible, in the sense that it is not a well defined task.

But it does pretty well, and we work on improving it all the time.

It is much more instructive to look at specific examples than to
speak in generalities here.

I certainly agree with Nick that many compilers have appallingly
bad error message generation. In particular, I have never seen a C compiler
that I thought was even vaguely acceptable in this regard.

Ada compilers have generally been better, partly because Gerry Fisher's
interest in error detection meant that the original Ada Ed was pretty
good, and as a result the ACVC tests came to expect pretty decent
error recovery. Many of the Ada 83 compilers actually directly borrowed
some of the NYU work here.

We think GNAT takes the generation of good error messages to a stage
that is a definite notch better than what has been there previously,
but there is lots of room for improvement. 

We are always happy to get error message suggestions, and examples where
things did not work well. Sometimes the answer is "sorry, we can't be
this telepathic", other times the answer is "this may surprise you,
but actually this case is easy to fix!"

Robert Dewar
Ada Core Technologies