From: Niklas Holsti
Newsgroups: comp.lang.ada
Subject: Re: Verified compilers?
Date: Sat, 10 Mar 2012 22:35:09 +0200
Organization: Tidorum Ltd
Message-ID: <9s1s7tF6pcU1@mid.individual.net>
References: <9207716.776.1331054644462.JavaMail.geo-discussion-forums@ynaz38> <4edda5mav3cf$.149pbgyxl1wx5.dlg@40tude.net> <9rplcgF5a2U1@mid.individual.net> <1psd0g0womgxi.1sle7ol12x3d5.dlg@40tude.net> <9rsahhFmr3U1@mid.individual.net> <9rvdjvFfa8U1@mid.individual.net> <4pp58h1br5sp$.yylr20e5vzxb.dlg@40tude.net>
In-Reply-To: <4pp58h1br5sp$.yylr20e5vzxb.dlg@40tude.net>

On 12-03-10 12:36 , Dmitry A. Kazakov wrote:
> On Sat, 10 Mar 2012 00:13:19 +0200, Niklas Holsti wrote:
>
>> On 12-03-08 22:40 , Dmitry A. Kazakov wrote:
>>> On Thu, 08 Mar 2012 20:02:25 +0200, Niklas Holsti wrote:
>>>
>>>> On 12-03-07 22:17 , Dmitry A. Kazakov wrote:
>>>>> On Wed, 07 Mar 2012 19:49:04 +0200, Niklas Holsti wrote:
>>>>>
>>>>>> On 12-03-07 11:12 , Dmitry A. Kazakov wrote:
>>>>>>
>>>>>>> I was always wondering why people bother to study grammars, LR, LL parsers and other useless stuff.
>>>>>>
>>>>>> Grammars and the notion of parsing (derivations) rigorously define what a given string of characters means, in some formally defined language.
>>>>>
>>>>> No. A grammar defines some formal language: a set of strings.
>>>>
>>>> Ok, I was being too brief. You are right that a grammar defines a language. But notice that I was also speaking of parsing and derivations, not just grammars.
>>>>
>>>> If you want to define the meanings of the sentences of a language in a rigorous way, you have basically two options:
>>>>
>>>> 1. Define the meaning operationally, by giving a standard interpreter, a black box, for example a Turing machine, that accepts the language, and define that the actions of that machine, while reading a sentence, are the meaning of that sentence. (From your point of view, the meaning could be the "translation" that appears on the output tape.)
>>>>
>>>> 2. Define the meaning mathematically in some way. Note that by "mathematically" I don't necessarily mean formally proven math, but mathematically precise definitions with the usual amount of hand-waving and shortcuts.
>>>
>>> 1 and 2 are same to me. In all cases it is an input for some machine. An inference system based on Peano axioms or whatever, is no less a machine than the Turing one. The only difference is in the computational power of such machines.
>>
>> I see a big difference in the structure of the machine.
>
> Which is more or less equivalent to its power.

Van Wijngaarden grammars are Turing complete (which also causes problems, of course). One should use the simplest formalism that is powerful enough for one's needs. You probably don't consider grammars and parsing tools simple, but that is a matter of opinion and personal taste.

>>>> As far as I know, all methods -- formal or informal -- for defining the meaning of sentences in formal languages are based on how the sentence is derived, or equivalently how it is parsed.
>>>
>>> No. 1234 has the same meaning for Ada parser, human reader and an OCR system.
>>
>> 1234 has no meaning, except as a literal string, until you give it some other meaning, for example as the integer one thousand two hundred and thirty-four.
>
> 1234 is an example of a language (of decimal literal).

Or an example of octal literals, or hexadecimal literals, or telephone numbers. The meaning is not intrinsic in the string; it must be defined by other means.

> The point is that the meaning of the sentence in that language does depend on how you would parse it.

Is there a "not" missing in that sentence? Otherwise it seems to agree with me that the meaning depends on the parsing, roughly speaking.

> An OCR system could simply compare pictures. What human brain does, nobody knows. An Ada program would yield a machine number. Where is a tree?

Which tree? It is not intrinsic in the string. The grammar, and the corresponding parse trees, are tools for defining the strings in the language, and the meaning of each string.

>
>      +
>     / \
>    4   +
>       / \
>      *
>     / \
>    3  10
>
> This one? There is a countable infinity of such trees. A German would say that this is twelve hundred, four and thirty (12,4,3). A way different structure then?

Yes, the grammar of (decimal) literals can be written in several ways, and all of these grammars can be attributed to define the same meaning (numerical value) of the string (of course they can also be attributed in different ways, to define different meanings). So what? I never claimed that the grammar is unique, just that grammars and parse trees are IMO the only known method for defining the meanings of complex sentences.

> The meaning has nothing to do with parsing.

Ok, if you can define the meaning in some other way, and not by parsing. How do you define it?

> In fact parsing has to be validated against the meaning.

Why? If you define the meaning without using parsing, why is that validation necessary?

>>>> How else could you explain why "2*3+4*5" means 26, and not 70, nor something else?
>>>
>>> Explain? It is a subject of proof. There are more than one way to do it.
>>
>> Operator precedence, perhaps? See earlier post...
>
> I didn't mean variations of operator precedence generating different results. I meant that 2*3+4*5 = 26 is a statement to be proven. It is either true or false (or undecidable).

26 is the value of the formula. The truth of the formula depends on the meaning assigned to the expressions in the formula. Before one can prove anything, one must define the proof rules, including the rules for modifying sentences (expressions) while preserving their meaning. The latter rules are derived from the rules that define the meaning of an expression based on the structure (parse tree) of the expression, and the properties of the operators in the expression.
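To make that concrete, here is a minimal Ada sketch (the types and names are my own invention, not from any real compiler) of a "meaning" function defined by structural induction over an expression tree:

   --  Illustrative sketch only: a tiny expression-tree type and a
   --  meaning function defined by structural induction over it.

   type Op_Kind is (Add, Mul);

   type Expr;
   type Expr_Ref is access Expr;

   type Expr (Is_Leaf : Boolean) is record
      case Is_Leaf is
         when True =>
            Value : Integer;          --  a literal
         when False =>
            Op          : Op_Kind;    --  an operator node
            Left, Right : Expr_Ref;   --  its subtrees
      end case;
   end record;

   --  The meaning of a leaf is its value; the meaning of an operator
   --  node is the operator applied to the meanings of its subtrees.
   function Meaning (E : Expr) return Integer is
   begin
      if E.Is_Leaf then
         return E.Value;
      else
         case E.Op is
            when Add => return Meaning (E.Left.all) + Meaning (E.Right.all);
            when Mul => return Meaning (E.Left.all) * Meaning (E.Right.all);
         end case;
      end if;
   end Meaning;

Under these rules, the tree for 2*3+4*5 (a sum of two products) can mean only 26, never 70.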
>>> BTW the "meaning" you are advocating for looks rather something like 23*45*+.
>>
>> More like: the expression is a SUM of (a TERM that is the PRODUCT of (the number 2) and (the number 3)) and (a TERM that is the PRODUCT of (the number 4) and (the number 5)), together with a rule for evaluating such expression trees.
>
> The above is alike, the difference is that it uses the Polish Reverse instead of your tree. Why is your "meaning" better?

My intent was to show a tree. I was too lazy to use ASCII art, so I used parentheses instead, to make the groupings unambiguous. Please imagine a drawing of the tree instead. Note that the tree is not yet the meaning, but the meaning of the tree (and thus the meaning of the sentence) is defined by structural inductive rules over the tree.

>>>> In practice, the role of the parse/derivation tree appears in the fact that parser/compiler generators let you attach "semantic actions" to grammar productions, and often help you to consider larger parts of the derivation by passing information around the parse/derivation tree in well-defined ways. Attribute grammars are one example.
>>>
>>> Parsing tree is just a less ambiguous representation of the program.
>>
>> "Just"? It is the first step in the assignment of meaning to a sentence.
>
> It could be a step of evaluating the sentence, validating it etc. But the meaning of the sentence is already assigned by the sentence itself (provided the sentence is legal).

No, meanings are not inherent in sentences. There must be a definition of the meaning of an arbitrary sentence: a function "meaning" from the set of sentences to the set of meanings. For use in computation, this function must be defined in a practically computable way. It is not enough that we have an intuitive or automatic understanding of the sentence.

>> The second step is a computation defined inductively on the parse tree.
>
> The second step is semantic analysis, if we consider compilers. AST gets a lot of kicking at this stage: transformed, manipulated, pruned etc.

Right. The AST represents the meaning of the program, and the compiler applies meaning-preserving transformations to the AST. This requires a definition of the meaning of an AST (for compilers: the computation represented by the AST). The only practical way to define the meaning of an AST is by inductive rules using the structure of the AST.

>>> You don't even need to generate it in order to evaluate 2*3+4*5 using a finite state machine.
>>
>> You cannot do it with a finite state machine (assuming you want to allow literals of any size and expressions of any length). You need a stack of numbers, too.
>
> First, stack is not tree.

True, but it is the simplest structure that lets finite state machines handle this kind of computation. I was humoring you by avoiding trees.
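For illustration, a small Ada sketch of such a machine (demo names are mine; it assumes the precedence step has already rewritten the infix input 2*3+4*5 into the postfix form 2 3 * 4 5 * +). A finite control plus a stack of numbers, and no tree anywhere:

   with Ada.Text_IO;

   procedure Postfix_Demo is
      --  Illustrative sketch: evaluate "2 3 * 4 5 * +", the postfix
      --  form of 2*3+4*5, with a finite control and a stack of
      --  numbers. No tree is built.
      type Kind is (Num, Plus, Times);
      type Token (K : Kind := Num) is record
         case K is
            when Num    => Value : Integer;
            when others => null;
         end case;
      end record;

      Input : constant array (1 .. 7) of Token :=
        ((Num, 2), (Num, 3), (K => Times),
         (Num, 4), (Num, 5), (K => Times), (K => Plus));

      Stack : array (1 .. 10) of Integer;
      Top   : Natural := 0;
   begin
      for I in Input'Range loop
         case Input (I).K is
            when Num =>                     --  push the literal
               Top := Top + 1;
               Stack (Top) := Input (I).Value;
            when Plus =>                    --  pop two, push the sum
               Stack (Top - 1) := Stack (Top - 1) + Stack (Top);
               Top := Top - 1;
            when Times =>                   --  pop two, push the product
               Stack (Top - 1) := Stack (Top - 1) * Stack (Top);
               Top := Top - 1;
         end case;
      end loop;
      Ada.Text_IO.Put_Line (Integer'Image (Stack (Top)));  --  prints 26
   end Postfix_Demo;

Of course, producing that postfix form from the infix input is exactly where the parsing, and thus the grammar, hides.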
> Second, regarding finiteness, that is of course true, just because an infinite language cannot be mapped (injectively) onto a finite number of states. But this has nothing to do with the "structure." Once you limit the language you can parse it by brute force, no trees, no stacks, just strcmp.

Eh? I don't understand what you are talking about. Finite languages? Not worth discussing, IMO.

>> The generation of parsing machines from grammars is one of the main achievements of formal language and parsing theory. Why sweat to make the machine manually, when you can do it automatically?
>
> Because:
>
> 1. You need not to generate anything. All infix expressions can be parsed by the same engine controlled by a few tables.

Infix expressions are a simple case.

> 2. The efforts are immense. I already gave an example of Ada grammar capturing operator precedence, association constraints (e.g. 2**+6), left to right, right to left etc. It is a huge work.

Hardly huge. There are only 6 levels: expression, relation, simple_expression, term, factor, primary.
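For the record, here are those levels, slightly abbreviated from the expression grammar in the Ada RM (section 4.4):

   expression        ::= relation {and relation} | relation {or relation}
                       | relation {xor relation} | ...
   relation          ::= simple_expression
                            [relational_operator simple_expression] | ...
   simple_expression ::= [unary_adding_operator] term
                            {binary_adding_operator term}
   term              ::= factor {multiplying_operator factor}
   factor            ::= primary [** primary] | abs primary | not primary
   primary           ::= numeric_literal | null | string_literal
                       | aggregate | name | allocator | (expression) | ...

Note how the factor production already answers your 2**+6 example: both operands of "**" must be primaries, so a signed operand is illegal there without parentheses.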
> And you also need to verify if the mess of productions you have written indeed means that "unary + is illegal as power." Because productions is NOT the meaning, they are just a program.

And have you created tables for parsing Ada? How long did it take you, and are you sure they are correct?

> 3. You need tools, you need integration of the generated code into your project. You need to keep track of changes of the grammar program, of the generated program, of the rest.

True. But many SW projects already use one or more such tools: preprocessors, GUI code generators, DB interface generators, ...

>>> Care to carve a grammar where
>>>
>>>    declare
>>>       X : Integer := 1.0 + 2;
>>>
>>> would be a syntax error?
>>
>> Easy, the grammar that defines only the empty string :-)
>>
>> But I guess you are thinking about types, and how to make the mixing of integers and floating point illegal. It is trivial to write a grammar that makes it a syntax error to mix integer literals and floating point literals in the same expression, but you have to duplicate the definition of expressions and have separate metasymbols Integer_Expression and FP_Expression, and all the way down to literals.
>
> ... and for all types, user-defined included. And also - "a return required in the function". And also - all alternatives of a case do not intersect and covering all choices ... Just for the start.

The practical way is to use attribute grammars:

>> If you are thinking about identifiers with types, use an attribute grammar that checks types by means of a symbol table, and there you are. Perhaps you don't want to call that a "syntax error", but at least it is a "parse-time error".
>
> That is yet another nail in the coffin. Grammars are not only incapable to describe language semantics beyond classroom exercises.

I don't agree. But so be it.

> Even if they could, that would not yet make a compiler. A compiler should also explain why it rejects illegal program.

Here we are entering usability and friendliness, which are quality-of-implementation issues. There has certainly been work in the formal-languages and parsing-theory communities on error reporting and automatic recovery from parsing errors. And I know some manually written parsers that are not very friendly with error reports...

> Formally speaking, the language being parsed consists of all possible strings. Nothing is a syntax error. If the input is illegal in the language 1 (e.g. Ada), then in the language 2 it is legal and yields "closing bracket is missing for the opening bracket found at xx:yy."

I don't understand your point. Are you being ironical?

>>>>> Furthermore the merits of formal grammar definitions are debatable, because the complexity of grammar descriptions usually make it impossible to judge if a given string do belong to the formal language or not.

...

> If nobody can understand grammar descriptions, they cannot serve as a meaning-carrier.

To be sure, grammars can be complex and hard to grasp. But your alternatives seem worse, IMO.

>> If you want to describe what language your hand-written parser accepts, and what its processing means, I think you have to present the grammar and describe the meaning of the sentences.
>
> Or, I can present the tables with the same effect.

I'll bet most users would prefer the grammar to the tables.

>>>>> On one hand these classes do not capture real differences between syntaxes of programming languages.
>>>>
>>>> What sort of "real differences" do you mean?
>>>
>>> Like between C++ and Ada.
>>
>> Sorry, not clear yet. I'm sure you don't mean just that one uses {} and the other begin-end, but what do you mean?
>
> I mean that a human reader instantly recognizes that the language is "like C++" or "like Ada".

Sorry, not clear yet. Is the human reader looking at some C++ code and some Ada code, or at a C++ grammar and an Ada grammar? Or at a manually written C++ parser and a manually written Ada parser? Or at parsing tables for C++, and parsing tables for Ada?

> This means that somewhere in the brain there is a picture, a class of syntaxes, we, as humans, are capable to operate.

I suspect that if you show some non-programmers examples of C++ and Ada code, they will find both equally incomprehensible.

> On the contrary the classes of formal languages are difficult to us to recognize.

I agree, they are technical distinctions.

> If anybody were presented by a sample from this or that class, he would not be able to figure out what it were. It is like the FFT of an image. Can you tell if the picture shows a cat or a girl by looking at its spectrum? Ergo, spectral analysis is poor for scene recognition.

It is a poor way for humans to recognize scenes, sure. But it can be useful in machine vision and recognition applications.

> (The difference is though spectral analysis is useful for other things, while grammars are useful for nothing... (:-))

Going back to the differences between C++ and Ada, one systematic difference that comes to mind is the "overloading" of the same terminal symbol in C++, especially the closing brace '}'. Writing too few or too many closing braces can lead to confusing error messages. In contrast, Ada uses different terminal-symbol combinations (end if, end loop, end case, end <name>), which usually lets the compiler give more informative error messages. I don't think this corresponds to a difference in the formal language classes of C++ and Ada, but I think the difference could be observed by some kind of "grammar metric" (or "language metric", if you prefer) that would give Ada a better score than C++ on this point. But such grammar metrics have not been developed, AFAIK.

--
Niklas Holsti
Tidorum Ltd

niklas holsti tidorum fi
      .      @      .