From: Niklas Holsti
Newsgroups: comp.lang.ada
Subject: Re: Verified compilers?
Date: Sat, 10 Mar 2012 22:35:09 +0200
Organization: Tidorum Ltd
Message-ID: <9s1s7tF6pcU1@mid.individual.net>
References: <9207716.776.1331054644462.JavaMail.geo-discussion-forums@ynaz38> <4edda5mav3cf$.149pbgyxl1wx5.dlg@40tude.net> <9rplcgF5a2U1@mid.individual.net> <1psd0g0womgxi.1sle7ol12x3d5.dlg@40tude.net> <9rsahhFmr3U1@mid.individual.net> <9rvdjvFfa8U1@mid.individual.net> <4pp58h1br5sp$.yylr20e5vzxb.dlg@40tude.net>
In-Reply-To: <4pp58h1br5sp$.yylr20e5vzxb.dlg@40tude.net>

On 12-03-10 12:36 , Dmitry A. Kazakov wrote:
> On Sat, 10 Mar 2012 00:13:19 +0200, Niklas Holsti wrote:
>
>> On 12-03-08 22:40 , Dmitry A. Kazakov wrote:
>>> On Thu, 08 Mar 2012 20:02:25 +0200, Niklas Holsti wrote:
>>>
>>>> On 12-03-07 22:17 , Dmitry A. Kazakov wrote:
>>>>> On Wed, 07 Mar 2012 19:49:04 +0200, Niklas Holsti wrote:
>>>>>
>>>>>> On 12-03-07 11:12 , Dmitry A. Kazakov wrote:
>>>>>>
>>>>>>> I was always wondering why people bother to study grammars, LR, LL parsers and other useless stuff.
>>>>>>
>>>>>> Grammars and the notion of parsing (derivations) rigorously define what a given string of characters means, in some formally defined language.
>>>>>
>>>>> No. A grammar defines some formal language: a set of strings.
>>>>
>>>> Ok, I was being too brief. You are right that a grammar defines a language. But notice that I was also speaking of parsing and derivations, not just grammars.
>>>>
>>>> If you want to define the meanings of the sentences of a language in a rigorous way, you have basically two options:
>>>>
>>>> 1. Define the meaning operationally, by giving a standard interpreter, a black box, for example a Turing machine, that accepts the language, and define that the actions of that machine, while reading a sentence, are the meaning of that sentence. (From your point of view, the meaning could be the "translation" that appears on the output tape.)
>>>>
>>>> 2. Define the meaning mathematically in some way. Note that by "mathematically" I don't necessarily mean formally proven math, but mathematically precise definitions with the usual amount of hand-waving and shortcuts.
>>>
>>> 1 and 2 are same to me. In all cases it is an input for some machine. An inference system based on Peano axioms or whatever, is no less a machine than the Turing one. The only difference is in the computational power of such machines.
>>
>> I see a big difference in the structure of the machine.
>
> Which is more or less equivalent to its power.

Van Wijngaarden grammars are Turing complete (which also causes problems, of course). One should use the simplest formalism that is powerful enough for one's needs. You probably don't consider grammars and parsing tools simple, but that is a matter of opinion and personal taste.

>>>> As far as I know, all methods -- formal or informal -- for defining the meaning of sentences in formal languages are based on how the sentence is derived, or equivalently how it is parsed.
>>>
>>> No. 1234 has the same meaning for Ada parser, human reader and an OCR system.
>>
>> 1234 has no meaning, except as a literal string, until you give it some other meaning, for example as the integer one thousand two hundred and thirty-four.
>
> 1234 is an example of a language (of decimal literal).

Or an example of octal literals, or hexadecimal literals, or telephone numbers. The meaning is not intrinsic in the string; it must be defined by other means.

> The point is that the meaning of the sentence in that language does depend on how you would parse it.

Is there a "not" missing in that sentence? Otherwise it seems to agree with me that the meaning depends on the parsing, roughly speaking.

> An OCR system could simply compare pictures. What human brain does, nobody knows. An Ada program would yield a machine number. Where is a tree?

Which tree? It is not intrinsic in the string. The grammar, and the corresponding parse trees, are tools for defining the strings in the language, and the meaning of each string.

>
>      +
>     / \
>    4   +
>       / \
>      *
>     / \
>    3  10
>
> This one? There is a countable infinity of such trees. A German would say that this is twelve hundred, four and thirty (12,4,3). A way different structure then?

Yes, the grammar of (decimal) literals can be written in several ways, and all of these grammars can be attributed to define the same meaning (numerical value) of the string (of course they can also be attributed in different ways, to define different meanings). So what? I never claimed that the grammar is unique, just that grammars and parse trees are IMO the only known method for defining the meanings of complex sentences.

> The meaning has nothing to do with parsing.

Ok, if you can define the meaning in some other way, and not by parsing. How do you define it?

> In fact parsing has to be validated against the meaning.

Why? If you define the meaning without using parsing, why is that validation necessary?

>>>> How else could you explain why "2*3+4*5" means 26, and not 70, nor something else?
>>>
>>> Explain? It is a subject of proof. There are more than one way to do it.
>>
>> Operator precedence, perhaps? See earlier post...
>
> I didn't mean variations of operator precedence generating different results. I meant that 2*3+4*5 = 26 is a statement to be proven. It is either true or false (or undecidable).

26 is the value of the formula. The truth of the formula depends on the meaning assigned to the expressions in the formula. Before one can prove anything, one must define the proof rules, including the rules for modifying sentences (expressions) while preserving their meaning. The latter rules are derived from the rules that define the meaning of an expression based on the structure (parse tree) of the expression, and the properties of the operators in the expression.
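To make that concrete, here is a minimal Ada sketch (the types and names are my own invention, not from any real compiler) of a "meaning" function defined by structural induction over an expression tree:

   --  Illustrative sketch only: a tiny expression-tree type and a
   --  meaning function defined by structural induction over it.

   type Op_Kind is (Add, Mul);

   type Expr;
   type Expr_Ref is access Expr;

   type Expr (Is_Leaf : Boolean) is record
      case Is_Leaf is
         when True =>
            Value : Integer;          --  a literal
         when False =>
            Op          : Op_Kind;    --  an operator node
            Left, Right : Expr_Ref;   --  its subtrees
      end case;
   end record;

   --  The meaning of a leaf is its value; the meaning of an operator
   --  node is the operator applied to the meanings of its subtrees.
   function Meaning (E : Expr) return Integer is
   begin
      if E.Is_Leaf then
         return E.Value;
      else
         case E.Op is
            when Add => return Meaning (E.Left.all) + Meaning (E.Right.all);
            when Mul => return Meaning (E.Left.all) * Meaning (E.Right.all);
         end case;
      end if;
   end Meaning;

Under these rules, the tree for 2*3+4*5 (a sum of two products) can mean only 26, never 70.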
>>> BTW the "meaning" you are advocating for looks rather something like 23*45*+.
>>
>> More like: the expression is a SUM of (a TERM that is the PRODUCT of (the number 2) and (the number 3)) and (a TERM that is the PRODUCT of (the number 4) and (the number 5)), together with a rule for evaluating such expression trees.
>
> The above is alike, the difference is that it uses the Polish Reverse instead of your tree. Why is your "meaning" better?

My intent was to show a tree. I was too lazy to use ASCII art, so I used parentheses instead, to make the groupings unambiguous. Please imagine a drawing of the tree instead. Note that the tree is not yet the meaning, but the meaning of the tree (and thus the meaning of the sentence) is defined by structural inductive rules over the tree.

>>>> In practice, the role of the parse/derivation tree appears in the fact that parser/compiler generators let you attach "semantic actions" to grammar productions, and often help you to consider larger parts of the derivation by passing information around the parse/derivation tree in well-defined ways. Attribute grammars are one example.
>>>
>>> Parsing tree is just a less ambiguous representation of the program.
>>
>> "Just"? It is the first step in the assignment of meaning to a sentence.
>
> It could be a step of evaluating the sentence, validating it etc. But the meaning of the sentence is already assigned by the sentence itself (provided the sentence is legal).

No, meanings are not inherent in sentences. There must be a definition of the meaning of an arbitrary sentence: a function "meaning" from the set of sentences to the set of meanings. For use in computation, this function must be defined in a practically computable way. It is not enough that we have an intuitive or automatic understanding of the sentence.

>> The second step is a computation defined inductively on the parse tree.
>
> The second step is semantic analysis, if we consider compilers. AST gets a lot of kicking at this stage: transformed, manipulated, pruned etc.

Right. The AST represents the meaning of the program, and the compiler applies meaning-preserving transformations to the AST. This requires a definition of the meaning of an AST (for compilers: the computation represented by the AST). The only practical way to define the meaning of an AST is by inductive rules using the structure of the AST.

>>> You don't even need to generate it in order to evaluate 2*3+4*5 using a finite state machine.
>>
>> You cannot do it with a finite state machine (assuming you want to allow literals of any size and expressions of any length). You need a stack of numbers, too.
>
> First, stack is not tree.

True, but it is the simplest structure that lets finite state machines handle this kind of computation. I was humoring you by avoiding trees.
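For illustration, a small Ada sketch of such a machine (demo names are mine; it assumes the precedence step has already rewritten the infix input 2*3+4*5 into the postfix form 2 3 * 4 5 * +). A finite control plus a stack of numbers, and no tree anywhere:

   with Ada.Text_IO;

   procedure Postfix_Demo is
      --  Illustrative sketch: evaluate "2 3 * 4 5 * +", the postfix
      --  form of 2*3+4*5, with a finite control and a stack of
      --  numbers. No tree is built.
      type Kind is (Num, Plus, Times);
      type Token (K : Kind := Num) is record
         case K is
            when Num    => Value : Integer;
            when others => null;
         end case;
      end record;

      Input : constant array (1 .. 7) of Token :=
        ((Num, 2), (Num, 3), (K => Times),
         (Num, 4), (Num, 5), (K => Times), (K => Plus));

      Stack : array (1 .. 10) of Integer;
      Top   : Natural := 0;
   begin
      for I in Input'Range loop
         case Input (I).K is
            when Num =>                     --  push the literal
               Top := Top + 1;
               Stack (Top) := Input (I).Value;
            when Plus =>                    --  pop two, push the sum
               Stack (Top - 1) := Stack (Top - 1) + Stack (Top);
               Top := Top - 1;
            when Times =>                   --  pop two, push the product
               Stack (Top - 1) := Stack (Top - 1) * Stack (Top);
               Top := Top - 1;
         end case;
      end loop;
      Ada.Text_IO.Put_Line (Integer'Image (Stack (Top)));  --  prints 26
   end Postfix_Demo;

Of course, producing that postfix form from the infix input is exactly where the parsing, and thus the grammar, hides.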
> Second, regarding finiteness, that is of course true, just because an infinite language cannot be mapped (injectively) onto a finite number of states. But this has nothing to do with the "structure." Once you limit the language you can parse it by brute force, no trees, no stacks, just strcmp.

Eh? I don't understand what you are talking about. Finite languages? Not worth discussing, IMO.

>> The generation of parsing machines from grammars is one of the main achievements of formal language and parsing theory. Why sweat to make the machine manually, when you can do it automatically?
>
> Because:
>
> 1. You need not to generate anything. All infix expressions can be parsed by the same engine controlled by a few tables.

Infix expressions are a simple case.

> 2. The efforts are immense. I already gave an example of Ada grammar capturing operator precedence, association constraints (e.g. 2**+6), left to right, right to left etc. It is a huge work.

Hardly huge. There are only 6 levels: expression, relation, simple_expression, term, factor, primary.
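For the record, here are those levels, slightly abbreviated from the expression grammar in the Ada RM (section 4.4):

   expression        ::= relation {and relation} | relation {or relation}
                       | relation {xor relation} | ...
   relation          ::= simple_expression
                            [relational_operator simple_expression] | ...
   simple_expression ::= [unary_adding_operator] term
                            {binary_adding_operator term}
   term              ::= factor {multiplying_operator factor}
   factor            ::= primary [** primary] | abs primary | not primary
   primary           ::= numeric_literal | null | string_literal
                       | aggregate | name | allocator | (expression) | ...

Note how the factor production already answers your 2**+6 example: both operands of "**" must be primaries, so a signed operand is illegal there without parentheses.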
> And you also need to verify if the mess of productions you have written indeed means that "unary + is illegal as power." Because productions is NOT the meaning, they are just a program.

And have you created tables for parsing Ada? How long did it take you, and are you sure they are correct?

> 3. You need tools, you need integration of the generated code into your project. You need to keep track of changes of the grammar program, of the generated program, of the rest.

True. But many SW projects already use one or more such tools: preprocessors, GUI code generators, DB interface generators, ...

>>> Care to carve a grammar where
>>>
>>>    declare
>>>       X : Integer := 1.0 + 2;
>>>
>>> would be a syntax error?
>>
>> Easy, the grammar that defines only the empty string :-)
>>
>> But I guess you are thinking about types, and how to make the mixing of integers and floating point illegal. It is trivial to write a grammar that makes it a syntax error to mix integer literals and floating point literals in the same expression, but you have to duplicate the definition of expressions and have separate metasymbols Integer_Expression and FP_Expression, and all the way down to literals.
>
> ... and for all types, user-defined included. And also - "a return required in the function". And also - all alternatives of a case do not intersect and covering all choices ... Just for the start.

The practical way is to use attribute grammars:

>> If you are thinking about identifiers with types, use an attribute grammar that checks types by means of a symbol table, and there you are. Perhaps you don't want to call that a "syntax error", but at least it is a "parse-time error".
>
> That is yet another nail in the coffin. Grammars are not only incapable to describe language semantics beyond classroom exercises.

I don't agree. But so be it.

> Even if they could, that would not yet make a compiler. A compiler should also explain why it rejects illegal program.

Here we are entering usability and friendliness, which are quality-of-implementation issues. There has certainly been work in the formal-languages and parsing-theory communities on error reporting and automatic recovery from parsing errors. And I know some manually written parsers that are not very friendly with error reports...

> Formally speaking, the language being parsed consists of all possible strings. Nothing is a syntax error. If the input is illegal in the language 1 (e.g. Ada), then in the language 2 it is legal and yields "closing bracket is missing for the opening bracket found at xx:yy."

I don't understand your point. Are you being ironical?

>>>>> Furthermore the merits of formal grammar definitions are debatable, because the complexity of grammar descriptions usually make it impossible to judge if a given string do belong to the formal language or not.

...

> If nobody can understand grammar descriptions, they cannot serve as a meaning-carrier.

To be sure, grammars can be complex and hard to grasp. But your alternatives seem worse, IMO.

>> If you want to describe what language your hand-written parser accepts, and what its processing means, I think you have to present the grammar and describe the meaning of the sentences.
>
> Or, I can present the tables with the same effect.

I'll bet most users would prefer the grammar to the tables.

>>>>> On one hand these classes do not capture real differences between syntaxes of programming languages.
>>>>
>>>> What sort of "real differences" do you mean?
>>>
>>> Like between C++ and Ada.
>>
>> Sorry, not clear yet. I'm sure you don't mean just that one uses {} and the other begin-end, but what do you mean?
>
> I mean that a human reader instantly recognizes that the language is "like C++" or "like Ada".

Sorry, not clear yet. Is the human reader looking at some C++ code and some Ada code, or at a C++ grammar and an Ada grammar? Or at a manually written C++ parser and a manually written Ada parser? Or at parsing tables for C++, and parsing tables for Ada?

> This means that somewhere in the brain there is a picture, a class of syntaxes, we, as humans, are capable to operate.

I suspect that if you show some non-programmers examples of C++ and Ada code, they will find both equally incomprehensible.

> On the contrary the classes of formal languages are difficult to us to recognize.

I agree, they are technical distinctions.

> If anybody were presented by a sample from this or that class, he would not be able to figure out what it were. It is like the FFT of an image. Can you tell if the picture shows a cat or a girl by looking at its spectrum? Ergo, spectral analysis is poor for scene recognition.

It is a poor way for humans to recognize scenes, sure. But it can be useful in machine vision and recognition applications.

> (The difference is though spectral analysis is useful for other things, while grammars are useful for nothing... (:-))

Going back to the differences between C++ and Ada, one systematic difference that comes to mind is the "overloading" of the same terminal symbol in C++, especially the closing brace '}'. Writing too few or too many closing braces can lead to confusing error messages. In contrast, Ada uses different terminal-symbol combinations (end if, end loop, end case, end <name>), which usually lets the compiler give more informative error messages. I don't think this corresponds to a difference in the formal language classes of C++ and Ada, but I think the difference could be observed by some kind of "grammar metric" (or "language metric", if you prefer) that would give Ada a better score than C++ on this point. But such grammar metrics have not been developed, AFAIK.

--
Niklas Holsti
Tidorum Ltd

niklas holsti tidorum fi
      .      @      .