* OpenToken: Parsing Ada (subset)?
@ 2015-06-01 13:08 Jacob Sparre Andersen
  2015-06-02 22:12 ` Stephen Leake
  2015-06-17 17:50 ` AdaMagica
  0 siblings, 2 replies; 26+ messages in thread

From: Jacob Sparre Andersen @ 2015-06-01 13:08 UTC (permalink / raw)

I'm attempting to use OpenToken to parse an Ada subset without much
success.  I've reduced my problem all the way down to attempting to
parse a "dotted" identifier:

   Example           -> Dotted_Identifier EOF
   Dotted_Identifier -> Identifier
                      | Dotted_Identifier "." Identifier

I declare the syntax like this (I've removed the reserved words for now):

   Syntax : constant Tokenizer.Syntax :=
     (Dot_T         => Tokenizer.Get (OpenToken.Recognizer.Separator.Get (".")),
      Identifier_T  => Tokenizer.Get
        (OpenToken.Recognizer.Identifier.Get
           (Start_Chars => Ada.Strings.Maps.Constants.Letter_Set,
            Body_Chars  => Ada.Strings.Maps.Constants.Alphanumeric_Set)),
      Comment_T     => Tokenizer.Get (OpenToken.Recognizer.Line_Comment.Get ("--")),
      Whitespace_T  => Tokenizer.Get
        (OpenToken.Recognizer.Character_Set.Get
           (OpenToken.Recognizer.Character_Set.Standard_Whitespace)),
      End_Of_File_T => Tokenizer.Get (OpenToken.Recognizer.End_Of_File.Get));

I declare the token objects (for writing the grammar) like this:

   Dot               : constant Master_Token.Class := Master_Token.Get (Dot_T);
   EOF               : constant Master_Token.Class := Master_Token.Get (End_Of_File_T);
   Identifier        : constant Identifiers.Instance'Class := Identifiers.Get (Identifier_T);
   Compilation_Unit  : constant Nonterminal.Class := Nonterminal.Get (Compilation_Unit_T);
   Dotted_Identifier : constant Dotted_Identifiers.Class := Dotted_Identifiers.Get (Dotted_Identifier_T);

I declare the grammar as:

   Grammar : constant Production_List.Instance :=
     Compilation_Unit  <= Dotted_Identifier & EOF and
     Dotted_Identifier <= Dotted_Identifier & Dot & Identifier
                            + Dotted_Identifiers.Join_Dotted_Identifiers and
     Dotted_Identifier <= Identifier;

When I attempt to use this grammar to parse a "dotted" identifier, I run
into problems when OpenToken attempts to convert an identifier token to
a Dotted_Identifier.Instance object.  The problem is that the identifier
is represented as a plain Master_Token.Instance object, which doesn't
contain the actual identifier text - which means that I can't copy it
into the "dotted" identifier.

I'm using OpenToken 6.0b.  The full compilable source (including the
used OpenToken packages) can be downloaded from
<http://jacob.sparre-andersen.dk/temp/ada_parsing_with_opentoken-2015-06-01-a.zip>.

To build the example:  gnatmake parse_dotted_identifier
To test it:            echo Ada.Text_IO.Integer_IO | ./parse_dotted_identifier

I'm probably doing something obviously wrong, but what?

Greetings,

Jacob
--
"It is very easy to get ridiculously confused about the tenses of time
 travel, but most things can be resolved by a sufficiently large ego."

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-01 13:08 OpenToken: Parsing Ada (subset)? Jacob Sparre Andersen
@ 2015-06-02 22:12 ` Stephen Leake
  2015-06-03  1:43   ` Shark8
  2015-06-17 17:50 ` AdaMagica
  1 sibling, 1 reply; 26+ messages in thread

From: Stephen Leake @ 2015-06-02 22:12 UTC (permalink / raw)

Jacob Sparre Andersen <sparre@nbi.dk> writes:

> I'm attempting to use OpenToken to parse an Ada subset without much
> success.

(Shameless plug) An alternative approach is to start from the full Ada
grammar in Emacs ada-mode (ada-grammar.wy), and reduce it to what you
want.  You'd also have to change the actions; they are focused on
supporting indentation and navigation in Emacs.

That grammar is in a .wy file, which must be processed by wisi-generate
to produce Ada code.  The code produced by wisi-generate does not
support Ada actions at the moment (it only supports Emacs elisp, or a
process or dll communicating with Emacs), but you could either add that
to wisi-generate or treat the generated code as a template and add your
own actions.  wisi-generate also assumes a single token type, so that
changes how you write the actions (more code in the actions, less in
the tokens).

If you want to try adding actions to wisi-generate, I suggest you start
from FastToken (in monotone branch org.fasttoken, no release yet).  It
is passing all its tests.  I am still working on it, so the code will
be changing quite a bit.  I plan to completely eliminate the token type
hierarchy, to avoid the run-time dynamic memory management it requires.
But that should be mostly orthogonal to actions.

I'd be happy to help with this; it is on my list of things to do
someday, and having an actual user is fun :).

>    Identifier_T => Tokenizer.Get
>      (OpenToken.Recognizer.Identifier.Get
>         (Start_Chars => Ada.Strings.Maps.Constants.Letter_Set,
>          Body_Chars  => Ada.Strings.Maps.Constants.Alphanumeric_Set)),

This says to use the default 'Get' routine to specify the token type to
use for Identifier_T; that is Master_Token.Instance, as you complain
later (see opentoken-token-enumerated-analyzer.ads:88).  Change it to:

   Identifier_T => Tokenizer.Get
     (OpenToken.Recognizer.Identifier.Get
        (Start_Chars => Ada.Strings.Maps.Constants.Letter_Set,
         Body_Chars  => Ada.Strings.Maps.Constants.Alphanumeric_Set),
      Identifiers.Get (Identifier_T)),

I hate default parameters in general; this one is sort of convenient
(most terminals are happy being Master_Token.Instance, and you don't
have to repeat Identifier_T), but it easily leads to this kind of bug
(most default parameters lead to hard-to-find bugs!).  Note that the ID
parameter to Identifiers.Get could be defaulted; it is overwritten in
Analyzer.Initialize.

To see this in the code: terminal token objects are created on the
parser stack at opentoken-production-parser-lalr-parser.adb:318, as a
copy of a token returned by Analyzer.Get.  Analyzer.Get returns the
token stored in the syntax; see
opentoken-token-enumerated-analyzer.adb:772.  Nonterminal token objects
are created at opentoken-production-parser-lalr-parser.adb:177, as a
copy of the production left hand side (i.e. Dotted_Identifier below).

> Identifier : constant Identifiers.Instance'Class := Identifiers.Get (Identifier_T);
>
> I declare the grammar as:
>
>    Grammar : constant Production_List.Instance :=
>      Compilation_Unit  <= Dotted_Identifier & EOF and
>      Dotted_Identifier <= Dotted_Identifier & Dot & Identifier
>                             + Dotted_Identifiers.Join_Dotted_Identifiers and
>      Dotted_Identifier <= Identifier;

You might think that the use of Identifiers.Instance here does what you
want, but it doesn't; the grammar determines the types only of the
nonterminals, the syntax determines the types of the terminals.

I suppose I should add that to the OpenToken user manual, but I'd
rather work on the FastToken user manual, which won't have this problem
(only one type for tokens :).

> I'm probably doing something obviously wrong, but what?

Obvious to me, but I've been messing with the lexer code in FastToken
recently; I switched to using regular expressions for the token
recognizers (to be closer to Aflex, which is also now supported).
While doing that, I deleted the default Get above.

Using regular expressions instead of the old OpenToken recognizers is a
big change, but wisi-generate takes care of most of the drudge work for
you (you just have to specify the actual regular expressions).

Glad to hear someone is using OpenToken, and it's fun to actually talk
about this stuff once in a while :).

--
-- Stephe
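[Editor's illustration]  The mechanism Stephen describes (the analyzer's syntax table stores one prototype token per terminal ID, and the parser works with copies of that prototype, so its type decides what survives) can be sketched outside Ada.  This is a hypothetical Python model, not OpenToken's actual API; the names `Token`, `IdentifierToken`, and `lex_one` are invented for illustration only.

```python
# Hypothetical model of the behavior described above: the lexer returns a
# copy of the prototype registered in the syntax table, so registering a
# plain token prototype for Identifier_T silently discards the lexeme.

class Token:
    """Plain prototype (like Master_Token.Instance): keeps no payload."""
    def recognize(self, lexeme):
        return Token()                      # copy carries no text

class IdentifierToken(Token):
    """Identifier-aware prototype (like Identifiers.Instance)."""
    def __init__(self, lexeme=""):
        self.lexeme = lexeme
    def recognize(self, lexeme):
        return IdentifierToken(lexeme)      # copy keeps the matched text

def lex_one(syntax, token_id, lexeme):
    # The analyzer hands back a copy of whatever prototype was stored.
    return syntax[token_id].recognize(lexeme)

buggy_syntax = {"Identifier_T": Token()}            # the default Get
fixed_syntax = {"Identifier_T": IdentifierToken()}  # the suggested change

lost = lex_one(buggy_syntax, "Identifier_T", "Text_IO")
kept = lex_one(fixed_syntax, "Identifier_T", "Text_IO")
print(hasattr(lost, "lexeme"), kept.lexeme)  # prints: False Text_IO
```

With the plain prototype the identifier text is gone by the time the parser tries to build the Dotted_Identifier, which is exactly the symptom in the original post.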
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-02 22:12 ` Stephen Leake
@ 2015-06-03  1:43   ` Shark8
  2015-06-03  7:36     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 26+ messages in thread

From: Shark8 @ 2015-06-03 1:43 UTC (permalink / raw)

On Tuesday, June 2, 2015 at 4:12:37 PM UTC-6, Stephen Leake wrote:
>
> Obvious to me, but I've been messing with the lexer code in FastToken
> recently; I switched to using regular expressions for the token
> recognizers (to be closer to Aflex, which is also now supported). While
> doing that, I deleted the default Get above.
>
> Using regular expressions instead of the old OpenToken recognizers is a
> big change, but wisi-generate takes care of most of the drudge work for
> you (you just have to specify the actual regular expressions).

Is that a good idea?

From my experience [mainly maintenance] RegEx is almost always a bad
solution (though I will grant that most of my encounters w/ it involved
applying it as a formatting/parsing tool for items that generally
weren't amenable to such breakdowns [street addresses, for example, are
good at containing info/formatting that kills a simple regex]).
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-03  1:43 ` Shark8
@ 2015-06-03  7:36   ` Dmitry A. Kazakov
  2015-06-05  9:03     ` Stephen Leake
  0 siblings, 1 reply; 26+ messages in thread

From: Dmitry A. Kazakov @ 2015-06-03 7:36 UTC (permalink / raw)

On Tue, 2 Jun 2015 18:43:50 -0700 (PDT), Shark8 wrote:

> On Tuesday, June 2, 2015 at 4:12:37 PM UTC-6, Stephen Leake wrote:
>>
>> Obvious to me, but I've been messing with the lexer code in FastToken
>> recently; I switched to using regular expressions for the token
>> recognizers (to be closer to Aflex, which is also now supported). While
>> doing that, I deleted the default Get above.
>>
>> Using regular expressions instead of the old OpenToken recognizers is a
>> big change, but wisi-generate takes care of most of the drudge work for
>> you (you just have to specify the actual regular expressions).
>
> Is that a good idea?

No.

> From my experience [mainly maintenance] RegEx is almost always a bad
> solution (though I will grant that most of my encounters w/ it involved
> applying it as a formatting/parsing tool for items that generally
> weren't amenable to such breakdowns [street addresses, for example, are
> good at containing info/formatting that kills a simple regex]).

Yes.  Maintenance is one problem; another is that the family of
languages recognized by regular expressions is far too weak.  More
powerful languages, e.g. SNOBOL patterns, are slower.

In the end it is always worth the effort to write a token scanner by
hand.  Firstly, there are not so many things you would have to
recognize that way.  Secondly, it is much more efficient than pattern
matching.  Thirdly, it would allow sane error messaging, because
usually there are more outcomes than just matched vs. not matched, e.g.
malformed identifier or missing quotation mark.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
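[Editor's illustration]  Dmitry's third point (a hand-written scanner can report more outcomes than matched / not matched) can be made concrete with a small sketch.  This is illustrative Python, not code from any of the libraries discussed; the three status labels are invented, and only the Ada rules "strings exclude new line" and "a doubled quote stands for itself" are assumed from the language.

```python
# A hand-written scanner for a double-quoted string literal can return
# three outcomes where a plain regex match gives only two.

def scan_string(text, pos):
    """Return (status, end_pos).  Status is one of:
    'no_match'      - text at pos is not a string literal at all,
    'ok'            - a complete literal was scanned,
    'missing_quote' - the literal started but never closed."""
    if pos >= len(text) or text[pos] != '"':
        return ("no_match", pos)
    i = pos + 1
    while i < len(text):
        if text[i] == '"':
            if i + 1 < len(text) and text[i + 1] == '"':
                i += 2          # doubled quote inside an Ada string
                continue
            return ("ok", i + 1)
        if text[i] == '\n':     # Ada excludes new line from strings
            break
        i += 1
    return ("missing_quote", i)

print(scan_string('"hello"', 0)[0])   # ok
print(scan_string('123', 0)[0])       # no_match
print(scan_string('"oops\n', 0)[0])   # missing_quote
```

The third outcome is what lets the lexer say "missing closing quote" instead of silently failing to match, which is the error-messaging advantage being argued here.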
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-03  7:36 ` Dmitry A. Kazakov
@ 2015-06-05  9:03   ` Stephen Leake
  2015-06-05  9:23     ` Georg Bauhaus
                       ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread

From: Stephen Leake @ 2015-06-05 9:03 UTC (permalink / raw)

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On Tue, 2 Jun 2015 18:43:50 -0700 (PDT), Shark8 wrote:
>
>> On Tuesday, June 2, 2015 at 4:12:37 PM UTC-6, Stephen Leake wrote:
>>>
>>> Obvious to me, but I've been messing with the lexer code in FastToken
>>> recently; I switched to using regular expressions for the token
>>> recognizers (to be closer to Aflex, which is also now supported).
>
>> From my experience [mainly maintenance] RegEx is almost always a bad
>> solution (though I will grant that most of my encounters w/ it involved
>> applying it as a formatting/parsing tool for items that generally
>> weren't amenable to such breakdowns [street addresses, for example, are
>> good at containing info/formatting that kills a simple regex]).
>
> Yes.  Maintenance is one problem; another is that the family of
> languages recognized by regular expressions is far too weak.  More
> powerful languages, e.g. SNOBOL patterns, are slower.

The simple lexer in FastToken is intended only to support the FastToken
unit tests, which focus on testing the parser generator and executor.
Simplicity is the prime driver here; no need for expressive power or
speed.

Aflex compiles all the regular expressions for all of the tokens into
one state machine, that visits each character in the input stream once.
You can't get faster than that.

> In the end it is always worth the effort to write a token scanner by
> hand.

"always" is way too strong a statement here.

If you trust that the regexp engine is well written and maintained, the
expressive power is adequate for your language, and the speed is
adequate for your application, then why waste resources reimplementing
the tools?  Use them and get on with the interesting work.

regexps are perfectly adequate for Ada.

> Firstly, there are not so many things you would have to recognize that
> way.

I guess you are saying that implementing a lexer for a restricted set
of tokens is easier than implementing a general regular expression
engine.

True, but that's not the choice at hand; the choice is between
implementing a new lexer for a restricted set of tokens, or reusing an
existing regular expression engine (supported and maintained
externally) and specifying a small set of regular expressions (most of
which are simple strings for the reserved words).

> Secondly, it is much more efficient than pattern matching.

Not if you use the Aflex approach; the hand-written OpenToken lexer is
far less efficient than the compiled state machine that Aflex produces.

> Thirdly, it would allow sane error messaging, because usually there
> are more outcomes than just matched vs. not matched, e.g. malformed
> identifier or missing quotation mark.

This is a valid but minor point.

For Ada strings, since new line is excluded, a missing quotation mark
does not produce a very confusing error message (which is precisely why
new line is excluded).  Other languages are worse for string errors,
but the parser stage can provide a better error message if desired.

Syntax highlighting in the typical IDE is the best way to address this
particular problem; when the entire rest of the file changes color, you
know you are missing a quote.

It's not an issue for the FastToken unit tests, and the syntax error
messages from OpenToken in other contexts have not bothered me yet
(certainly not as much as some Microsoft error messages: "cannot load
the specified module" indeed; tell me _which_ module!).  Nor have I
gotten any complaints in that area from ada-mode users.

--
-- Stephe
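[Editor's illustration]  The Aflex idea Stephen describes (combine the per-token regular expressions into one machine and scan the input once) can be approximated in Python with a single alternation of named groups; the `re` engine compiles the master pattern once and the input is visited in a single left-to-right pass.  The token names and patterns below are invented to mirror the OpenToken syntax from the original post, not taken from Aflex or FastToken.

```python
import re

# One master pattern built from per-token regexes.  The alternation is
# tried in order at each position, so COMMENT must precede anything that
# could match a leading '-'.
TOKEN_SPECS = [
    ("COMMENT",    r"--[^\n]*"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9_]*"),
    ("DOT",        r"\."),
    ("WHITESPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))

def tokenize(text):
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "WHITESPACE":     # drop whitespace, keep the rest
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("Ada.Text_IO  -- a comment"))
```

Note that `finditer` silently skips characters no pattern matches; a real lexer would raise an error there, which is one place the hand-written-scanner argument below gets its traction.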
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-05  9:03 ` Stephen Leake
@ 2015-06-05  9:23   ` Georg Bauhaus
  2015-06-05 20:49     ` Shark8
  2015-06-05 12:20 ` Dmitry A. Kazakov
  2015-06-05 20:53 ` Shark8
  2 siblings, 1 reply; 26+ messages in thread

From: Georg Bauhaus @ 2015-06-05 9:23 UTC (permalink / raw)

On 05.06.15 11:03, Stephen Leake wrote:
> Aflex compiles all the regular expressions for all of the tokens into
> one state machine, that visits each character in the input stream once.
> You can't get faster than that.

You can get faster than that if the OS supports line-oriented I/O of
text:

   NotInString -> '-' ... -> '-' -> SKIPThisLine

:)
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-05  9:23 ` Georg Bauhaus
@ 2015-06-05 20:49   ` Shark8
  2015-06-05 23:52     ` Dennis Lee Bieber
  0 siblings, 1 reply; 26+ messages in thread

From: Shark8 @ 2015-06-05 20:49 UTC (permalink / raw)

On Friday, June 5, 2015 at 3:23:40 AM UTC-6, Georg Bauhaus wrote:
> On 05.06.15 11:03, Stephen Leake wrote:
>> Aflex compiles all the regular expressions for all of the tokens into
>> one state machine, that visits each character in the input stream once.
>> You can't get faster than that.
>
> You can get faster than that if the OS supports
> line-oriented I/O of text:
>
>    NotInString -> '-' ... -> '-' -> SKIPThisLine :)

Hm, I seem to recall something about OpenVMS text-file formats that
might qualify there.
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-05 20:49 ` Shark8
@ 2015-06-05 23:52   ` Dennis Lee Bieber
  0 siblings, 0 replies; 26+ messages in thread

From: Dennis Lee Bieber @ 2015-06-05 23:52 UTC (permalink / raw)

On Fri, 5 Jun 2015 13:49:02 -0700 (PDT), Shark8
<onewingedshark@gmail.com> declaimed the following:

> On Friday, June 5, 2015 at 3:23:40 AM UTC-6, Georg Bauhaus wrote:
>> On 05.06.15 11:03, Stephen Leake wrote:
>>> Aflex compiles all the regular expressions for all of the tokens into
>>> one state machine, that visits each character in the input stream once.
>>> You can't get faster than that.
>>
>> You can get faster than that if the OS supports
>> line-oriented I/O of text:
>>
>>    NotInString -> '-' ... -> '-' -> SKIPThisLine :)
>
> Hm, I seem to recall something about OpenVMS text-file formats that
> might qualify there.

VMS had an abundance of RMS (Record Management Services) formats even
for text files -- one of which was "carriage control" (<lf> delimited
lines).

And don't get me started on the FORTRAN "native" "segmented record"
type (a logical record is split across one or more physical records,
wherein each physical record had marker bits for first, continuation,
and last record, and a record length value; a continuation set neither
flag, and a short record would have both the first and last bits set).

--
Wulfraed                 Dennis Lee Bieber         AF6VN
wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-05  9:03 ` Stephen Leake
  2015-06-05  9:23   ` Georg Bauhaus
@ 2015-06-05 12:20 ` Dmitry A. Kazakov
  2015-06-16 12:43   ` Stephen Leake
  0 siblings, 1 reply; 26+ messages in thread

From: Dmitry A. Kazakov @ 2015-06-05 12:20 UTC (permalink / raw)

On Fri, 05 Jun 2015 04:03:27 -0500, Stephen Leake wrote:

> Aflex compiles all the regular expressions for all of the tokens into
> one state machine, that visits each character in the input stream once.
> You can't get faster than that.

It is visiting source characters vs. visiting internal states of the
machine and transition computations.  You cannot say in advance which
is more expensive.  A transition computation may be much more expensive
than accessing a character stored in an array.  However, hand-written
scanners usually do not roll back either.

>> In the end it is always worth the effort to write a token scanner by
>> hand.
>
> "always" is way too strong a statement here.
>
> If you trust that the regexp engine is well written and maintained, the
> expressive power is adequate for your language, and the speed is
> adequate for your application, then why waste resources reimplementing
> the tools?  Use them and get on with the interesting work.
>
> regexps are perfectly adequate for Ada.

Examples I had in mind were the Ada identifier and the Ada string
literal, e.g. in UTF-8 encoding.  I don't think a regular expression
for these would be shorter than a scanner hand-written in Ada.

>> Firstly, there are not so many things you would have to recognize that
>> way.
>
> I guess you are saying that implementing a lexer for a restricted set
> of tokens is easier than implementing a general regular expression
> engine.
>
> True, but that's not the choice at hand; the choice is between
> implementing a new lexer for a restricted set of tokens, or reusing an
> existing regular expression engine (supported and maintained
> externally) and specifying a small set of regular expressions (most of
> which are simple strings for the reserved words).

Yes, it is writing a regular expression vs. writing an Ada program.  I
prefer the Ada program.

>> Thirdly, it would allow sane error messaging, because usually there
>> are more outcomes than just matched vs. not matched, e.g. malformed
>> identifier or missing quotation mark.
>
> This is a valid but minor point.
>
> For Ada strings, since new line is excluded, a missing quotation mark
> does not produce a very confusing error message (which is precisely why
> new line is excluded).

Which is tricky in the latest standard, because the RM refers to line
terminating characters, whereas the OS file system may have its own
definition of line end.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-05 12:20 ` Dmitry A. Kazakov
@ 2015-06-16 12:43   ` Stephen Leake
  2015-06-16 13:24     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 26+ messages in thread

From: Stephen Leake @ 2015-06-16 12:43 UTC (permalink / raw)

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On Fri, 05 Jun 2015 04:03:27 -0500, Stephen Leake wrote:
>
>> Aflex compiles all the regular expressions for all of the tokens into
>> one state machine, that visits each character in the input stream once.
>> You can't get faster than that.
>
> It is visiting source characters vs. visiting internal states of the
> machine and transition computations.  You cannot say in advance which
> is more expensive.

True; measuring is always required.

> A transition computation may be much more expensive than accessing a
> character stored in an array.  However, hand-written scanners usually
> do not roll back either.

Ok.

>>> In the end it is always worth the effort to write a token scanner by
>>> hand.
>>
>> "always" is way too strong a statement here.
>>
>> If you trust that the regexp engine is well written and maintained, the
>> expressive power is adequate for your language, and the speed is
>> adequate for your application, then why waste resources reimplementing
>> the tools?  Use them and get on with the interesting work.
>>
>> regexps are perfectly adequate for Ada.
>
> Examples I had in mind were the Ada identifier and the Ada string
> literal, e.g. in UTF-8 encoding.  I don't think a regular expression
> for these would be shorter than a scanner hand-written in Ada.

Yes, but also not longer!

   regular expression for identifier:  "foo"
   code for identifier:                lexeme = "foo"

>>> Firstly, there are not so many things you would have to recognize that
>>> way.
>>
>> I guess you are saying that implementing a lexer for a restricted set
>> of tokens is easier than implementing a general regular expression
>> engine.
>>
>> True, but that's not the choice at hand; the choice is between
>> implementing a new lexer for a restricted set of tokens, or reusing an
>> existing regular expression engine (supported and maintained
>> externally) and specifying a small set of regular expressions (most of
>> which are simple strings for the reserved words).
>
> Yes, it is writing a regular expression vs. writing an Ada program.
> I prefer the Ada program.

I'm confused; you start by agreeing with my statement, but then say
something else.

I said "implement a general regular expression engine", meaning the Ada
code that interprets regular expressions at run time.  I hope you agree
that writing that is harder than writing an Ada scanner.

To address your point, let's compare writing a regular expression with
writing Ada code to do the same thing.

Here's the regular expression I use for Ada numeric literals:

   "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?"

Given that you are at least a little familiar with regular expressions,
there's nothing hard about that.

It does not enforce all the lexical rules for numbers; it allows
repeated, leading, and trailing underscores; it doesn't enforce pairs
of '#'.  But there is no legal Ada text that will be misinterpreted by
this expression.  It is always a trade-off between enforcing that sort
of rule in the lexer or the parser.

The OpenToken Ada code that does the same job (while enforcing some of
the rules) is in opentoken-recognizer-based_integer_real_ada.ads, .adb,
opentoken-recognizer-extended_digits.ads, .adb.  That's a total of 469
lines of code.  That's a state-based scanner; a recursive descent one
would be simpler.

It took several hours to write the Ada code, and a few more for the
tests.  But later I found a bug that required a new test and code
change.

It took a few minutes to write the regular expression, and I reused the
tests; no bugs found yet.

Obviously, if you are not familiar with regular expressions, they will
be harder to write.  But a software engineer should be willing to learn
the appropriate language for the job at hand.

To be fair, if you are used to writing Ada code that does scanning, and
have a good library for it, that job will be easier as well.  So it may
just come down to experience and available tools.

I would guess that the average good programmer, starting with no
knowledge of either, can learn enough about regular expressions to
write the above faster than they can learn enough about writing
scanners in Ada to do that job well.

--
-- Stephe
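[Editor's illustration]  Stephen's regex is written in Emacs syntax, where `\(` `\)` delimit groups.  Transliterated to the more common syntax (plain parentheses), its behavior can be checked directly; the sketch below is an editor's translation, not code from FastToken, and it deliberately reproduces the caveats Stephen concedes (unpaired '#' accepted) as well as the missing '_' that G.B. points out later in the thread.

```python
import re

# Stephen's Emacs-syntax regex "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?",
# transliterated: \( \) grouping becomes ( ).
ada_number = re.compile(r"([0-9]+#)?[-+0-9a-fA-F.]+(#)?")

def lexes(text):
    """True if the whole text matches the numeric-literal pattern."""
    return ada_number.fullmatch(text) is not None

print(lexes("16#FF#"), lexes("3.14"), lexes("1.0E+6"))  # legal literals
print(lexes("16#FF"))   # unpaired '#' is accepted anyway, as conceded
print(lexes("1_000"))   # '_' is not in the character class, so rejected
```

So the pattern accepts every legal literal it was tested against here except those containing underscores, and also accepts some illegal forms; that asymmetry is exactly the lexer-vs-parser trade-off being discussed.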
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-16 12:43 ` Stephen Leake
@ 2015-06-16 13:24   ` Dmitry A. Kazakov
  2015-06-16 14:13     ` G.B.
  2015-06-17 17:29     ` Stephen Leake
  0 siblings, 2 replies; 26+ messages in thread

From: Dmitry A. Kazakov @ 2015-06-16 13:24 UTC (permalink / raw)

On Tue, 16 Jun 2015 07:43:49 -0500, Stephen Leake wrote:

> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>
>> Yes, it is writing a regular expression vs. writing an Ada program.
>> I prefer the Ada program.
>
> I'm confused; you start by agreeing with my statement, but then say
> something else.
>
> I said "implement a general regular expression engine", meaning the Ada
> code that interprets regular expressions at run time.  I hope you agree
> that writing that is harder than writing an Ada scanner.
>
> To address your point, let's compare writing a regular expression with
> writing Ada code to do the same thing.
>
> Here's the regular expression I use for Ada numeric literals:
>
>    "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?"
>
> Given that you are at least a little familiar with regular expressions,
> there's nothing hard about that.

It is hard.

> It does not enforce all the lexical rules for numbers; it allows
> repeated, leading, and trailing underscores; it doesn't enforce pairs
> of '#'.

That is exactly the point.  It does not parse the literal right, and
you have to reparse the matched chunk of text once again.  What was the
gain?  Why not do it right in a single step?

> It took a few minutes to write the regular expression, and I reused the
> tests; no bugs found yet.
>
> Obviously, if you are not familiar with regular expressions, they will
> be harder to write.  But a software engineer should be willing to learn
> the appropriate language for the job at hand.

That is not the point.  I am familiar with C, but I avoid writing
anything in C.  Regular expressions are a far worse language than C
and are, additionally, incapable of parsing Ada literals.  Why bother
with that mess?

> I would guess that the average good programmer, starting with no
> knowledge of either, can learn enough about regular expressions to
> write the above faster than they can learn enough about writing
> scanners in Ada to do that job well.

The problem is that writing a correct pattern for anything more complex
than trivial is hard even for people doing this on a daily basis.  For
an average programmer it is patently impossible.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-16 13:24 ` Dmitry A. Kazakov
@ 2015-06-16 14:13   ` G.B.
  2015-06-17 17:38     ` Stephen Leake
  0 siblings, 1 reply; 26+ messages in thread

From: G.B. @ 2015-06-16 14:13 UTC (permalink / raw)

On 16.06.15 15:24, Dmitry A. Kazakov wrote:
>> It does not enforce all the lexical rules for numbers; it allows
>> repeated, leading, and trailing underscores; it doesn't enforce pairs
>> of '#'.
>
> That is exactly the point.  It does not parse the literal right, and
> you have to reparse the matched chunk of text once again.  What was the
> gain?  Why not do it right in a single step?

(I believe the use case here permits simplifications, meaning that REs
are not being used for meticulous, final parsing of Ada.)

But '_' seems missing from "[-+0-9a-fA-F.]+".

(And obsolete Ada syntax, i.e. substitutes for '#'.  Which makes a CFG
parser more desirable if '#' or ':' should have matching occurrences.
;-)
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-16 14:13 ` G.B.
@ 2015-06-17 17:38   ` Stephen Leake
  0 siblings, 0 replies; 26+ messages in thread

From: Stephen Leake @ 2015-06-17 17:38 UTC (permalink / raw)

"G.B." <bauhaus@futureapps.invalid> writes:

> On 16.06.15 15:24, Dmitry A. Kazakov wrote:
>>> It does not enforce all the lexical rules for numbers; it allows
>>> repeated, leading, and trailing underscores; it doesn't enforce pairs
>>> of '#'.
>>
>> That is exactly the point.  It does not parse the literal right, and
>> you have to reparse the matched chunk of text once again.  What was
>> the gain?  Why not do it right in a single step?
>
> (I believe the use case here permits simplifications,
> meaning that REs are not being used for meticulous, final
> parsing of Ada.)

Yes, my examples are drawn from a parser for the indentation/navigation
engine in Emacs Ada mode, which is actually required to be looser about
Ada syntax rules than an Ada compiler.  Adding an "enforce strict
rules" requirement could change the lexer/parser design, especially if
you place an emphasis on good/helpful/useful error messages.

In addition, OpenToken (and FastToken) are intended for writing
grammar-based parsers quickly, not for getting the best possible
performance, or meeting other project-specific requirements.  So the
design prioritizes minimizing the amount of new code that must be
written for a new language.

> But '_' seems missing from "[-+0-9a-fA-F.]+".

Oops; good catch.  I would have found that in an ada-mode test, but I'm
not actually planning on using the Aflex lexer in Emacs.

> (And obsolete Ada syntax, i.e. substitutes for '#'.  Which makes a CFG
> parser more desirable if '#' or ':' should have matching occurrences.
> ;-)

Emacs ada-mode explicitly ignores such things (at least until someone
asks for it).

--
-- Stephe
* Re: OpenToken: Parsing Ada (subset)?
  2015-06-16 13:24 ` Dmitry A. Kazakov
  2015-06-16 14:13   ` G.B.
@ 2015-06-17 17:29 ` Stephen Leake
  2015-06-17 17:42   ` Shark8
  2015-06-17 19:03   ` Dmitry A. Kazakov
  1 sibling, 2 replies; 26+ messages in thread

From: Stephen Leake @ 2015-06-17 17:29 UTC (permalink / raw)

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On Tue, 16 Jun 2015 07:43:49 -0500, Stephen Leake wrote:
>
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>
>> Here's the regular expression I use for Ada numeric literals:
>>
>>    "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?"
>>
>> Given that you are at least a little familiar with regular expressions,
>> there's nothing hard about that.
>
> It is hard.

Ok, I gather you are not "at least a little familiar with regular
expressions".  Too bad.

>> It does not enforce all the lexical rules for numbers; it allows
>> repeated, leading, and trailing underscores; it doesn't enforce pairs
>> of '#'.
>
> That is exactly the point.  It does not parse the literal right

It's _not_ a "parser"; it's a "lexer".

Define "right".  The line between lexer and parser is a design
decision, not set in stone.

> and you have to reparse the matched chunk of text once again.  What
> was the gain?

Doing it this way allows reusing a regexp engine, which is easier than
writing a lexer from scratch.

>> It took a few minutes to write the regular expression, and I reused
>> the tests; no bugs found yet.
>>
>> Obviously, if you are not familiar with regular expressions, they will
>> be harder to write.  But a software engineer should be willing to learn
>> the appropriate language for the job at hand.
>
> That is not the point.  I am familiar with C, but I avoid writing
> anything in C.  Regular expressions are a far worse language than C
> and are, additionally, incapable of parsing Ada literals.  Why bother
> with that mess?

As I have explained several times, I believe this approach is easier
than writing a lexer by hand; why bother with _that_ mess?

It's your choice.  It would be nice if you could admit that other
people can make other choices, and still write good code.

Perhaps you could post lexer code that does this "right" by your
definition, so we could judge for ourselves?

>> I would guess that the average good programmer, starting with no
>> knowledge of either, can learn enough about regular expressions to
>> write the above faster than they can learn enough about writing
>> scanners in Ada to do that job well.
>
> The problem is that writing a correct pattern for anything more
> complex than trivial is hard even for people doing this on a daily
> basis.  For an average programmer it is patently impossible.

"patently" is overkill here; this is simply not my experience.

--
-- Stephe
* Re: OpenToken: Parsing Ada (subset)? 2015-06-17 17:29 ` Stephen Leake @ 2015-06-17 17:42 ` Shark8 2015-06-17 19:03 ` Dmitry A. Kazakov 1 sibling, 0 replies; 26+ messages in thread From: Shark8 @ 2015-06-17 17:42 UTC (permalink / raw) On Wednesday, June 17, 2015 at 11:29:51 AM UTC-6, Stephen Leake wrote: > > It's _not_ a "parser"; it's a "lexer". > > Define "right". The line between lexer and parser is a design decision, > not set in stone. > > > and you have to > > reparse the matched chunk of text once again. What was the gain? > > Doing it this way allows reusing a regexp engine, which is easier than > writing a lexer from scratch. I just implemented a parser for BB-code (a subset, anyway) -- lexing is essentially finding '[', then the ']' after that occurrence, and checking whether '/' is the character immediately after '['. -- Even though that, in itself, isn't beyond the capabilities of regex, I wouldn't recommend using regex: the "simple" usually turns out to be deceptive, esp. WRT regex. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-17 17:29 ` Stephen Leake 2015-06-17 17:42 ` Shark8 @ 2015-06-17 19:03 ` Dmitry A. Kazakov 1 sibling, 0 replies; 26+ messages in thread From: Dmitry A. Kazakov @ 2015-06-17 19:03 UTC (permalink / raw) On Wed, 17 Jun 2015 12:29:48 -0500, Stephen Leake wrote: > "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: > >> On Tue, 16 Jun 2015 07:43:49 -0500, Stephen Leake wrote: >> >>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: >>> >>> Here's the regular expression I use for Ada numeric literals: >>> >>> "\([0-9]+#\)?[-+0-9a-fA-F.]+\(#\)?" >>> >>> Given that you are at least a little familiar with regular expressions, >>> there's nothing hard about that. >> >> It is hard. > > Ok, I gather you are not "at least a little familiar with regular > expressions". *Nobody* is familiar enough with them to be sure that the language generated by a pattern like the one above is that of the Ada numeric literals. Note "like", because your pattern obviously does not generate the Ada numeric literals. Things are actually much worse than mere complexity. It is a combination of complexity and weakness. Regular expressions cannot do stuff like Ada literals. Thus the patterns actually used are only approximations to what is required. The designer must know how the generated language differs from the required one. And the reader must read not only the program but also the mind of the pattern designer. >>> It does not enforce all the lexical rules for numbers; it allows >>> repeated, leading, and trailing underscores; it doesn't enforce pairs of >>> '#'. >> >> That is exactly the point. It does not parse the literal right > > It's _not_ a "parser"; it's a "lexer". > > Define "right". def Right: No false positives, no false negatives <=> Rejects only illegal literals, accepts only legal literals. >The line between lexer and parser is a design decision, > not set in stone. True, but we are not talking about higher-level things like maximum fraction length supported. 
Simple lexical stuff like: - matching '#'s - non-repeating '_'s - valid base number - the set of digits corresponding to the base etc., are all beyond the power of regular expressions. (Unlike SNOBOL patterns.) >> and you have to >> reparse the matched chunk of text once again. What was the gain? > Doing it this way allows reusing a regexp engine, which is easier than > writing a lexer from scratch. You still have to parse it again. With or without regular expressions, that work must be done; the only difference is in detecting the end of the lexeme, which is not a problem for a manually written scanner at all. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 26+ messages in thread
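[Editorial sketch] For contrast, here is the kind of hand-written check Dmitry is alluding to, sketched in Python for brevity. It enforces exactly the points he lists: matched '#'s, non-repeating '_'s, a valid base, and digits valid for that base. It is deliberately simplified (based integer literals only, no fractional part or exponent), and the function name is illustrative:

```python
import re

def is_valid_based_integer(text: str) -> bool:
    """Check an Ada based integer literal such as 16#FF# or 2#1010_1011#.
    Simplified sketch: no exponent, no fractional part."""
    # '#'s must come in a matching pair
    m = re.fullmatch(r'([0-9_]+)#([0-9a-fA-F_]+)#', text)
    if m is None:
        return False
    base_text, digits = m.groups()
    for part in (base_text, digits):
        # '_' may only appear singly, between digits
        if part.startswith('_') or part.endswith('_') or '__' in part:
            return False
    base = int(base_text.replace('_', ''))
    if not 2 <= base <= 16:
        return False          # Ada allows bases 2 .. 16
    # every extended digit must be valid for the stated base
    return all(int(d, 16) < base for d in digits.replace('_', ''))
```

Note that the regexp here is only used to split the lexeme; the actual rules live in ordinary code, which is the crux of the disagreement in this thread.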
* Re: OpenToken: Parsing Ada (subset)? 2015-06-05 9:03 ` Stephen Leake 2015-06-05 9:23 ` Georg Bauhaus 2015-06-05 12:20 ` Dmitry A. Kazakov @ 2015-06-05 20:53 ` Shark8 2015-06-16 14:46 ` Stephen Leake 2 siblings, 1 reply; 26+ messages in thread From: Shark8 @ 2015-06-05 20:53 UTC (permalink / raw) On Friday, June 5, 2015 at 3:03:30 AM UTC-6, Stephen Leake wrote: > > If you trust that the regexp engine is [1]well written and maintained, > [2]the expressive power is adequate for your language, and [3]the speed is > adequate for your application, then why waste resources reimplementing > the tools? Use them and get on with the interesting work. While #3 seems to not come up often, #1 and #2 seem to be far more "question-able" -- #2 is especially misevaluated a LOT. (If it wasn't we wouldn't see people trying to parse HTML or CSV with regex.) > regexp are perfectly adequate for Ada. Even something like Character'('C')? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-05 20:53 ` Shark8 @ 2015-06-16 14:46 ` Stephen Leake 2015-06-16 15:31 ` G.B. 2015-06-16 21:34 ` Randy Brukardt 0 siblings, 2 replies; 26+ messages in thread From: Stephen Leake @ 2015-06-16 14:46 UTC (permalink / raw) Shark8 <onewingedshark@gmail.com> writes: > On Friday, June 5, 2015 at 3:03:30 AM UTC-6, Stephen Leake wrote: >> >> If you trust that the regexp engine is [1]well written and maintained, >> [2]the expressive power is adequate for your language, and [3]the speed is >> adequate for your application, then why waste resources reimplementing >> the tools? Use them and get on with the interesting work. > > While #3 seems to not come up often, #1 and #2 seem to be far more > "question-able" -- #2 is especially misevaluated a LOT. (If it wasn't > we wouldn't see people trying to parse HTML or CSV with regex.) Yes. >> regexp are perfectly adequate for Ada. > > Even something like Character'('C')? Hmm. I've never had a problem with code like that, but it does seem like the lexer could treat '(' as a character literal, which would produce a parse error. Testing ... The Emacs lexer handles these properly: Character'('C') Character'( 'C' ) Character ' ( 'C' ) but not: Character '('C') (Which may be one reason I never write code the latter way :) The Emacs lexer regular expression for character literal is: "[^a-zA-Z0-9)]'[^'\n]'" which says if the first tick is preceded by identifier characters or right paren, it's not a character literal; that explains the above behavior, and works for typical Ada code in Emacs. But that regular expression doesn't work in a normal lexer, since it references text before the beginning of the desired lexeme. Emacs is _not_ a "normal lexer". 
Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL (handling the special case ''''), the Aflex lexer handles the above cases as follows: Character'('C') IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN Character'( 'C' ) IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN Character '('C') IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN Character ' ( 'C' ) IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN This is as expected (but not desired). One way to handle this is to provide for feedback from the parser to the lexer; if a parse fails, push back the character literal, tell the lexer to treat the first single quote as a TICK, and proceed. I'll work on implementing that in FastToken with the Aflex lexer; it will be a good example. Another way is to treat this particular sequence of tokens as a valid expression, but rewrite it before handing off to the rest of the parser. That requires identifying all such special cases; not too hard. A third choice is to not define a CHARACTER_LITERAL token; then the sequence of tokens is always IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN and the parser must identify the character literal, or the grammar must be re-written in the same manner. That may be the simplest solution. If I recall correctly, this issue has been discussed here before, and the proposed solutions were similar. I don't know how GNAT handles this. I think the statement "regular expressions are perfectly adequate for Ada" stands; this case just shows that the parser must be complicated if the lexer is not. This case is a good example of the possible trade-offs between lexer and parser complexity; the Emacs lexer handles all typical cases without feedback from the parser, but is more complex than an Aflex lexer. The Aflex lexer handles the same cases, but requires feedback from the parser or other complexity. -- -- Stephe ^ permalink raw reply [flat|nested] 26+ messages in thread
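[Editorial sketch] The Emacs look-back trick has a close analogue in any regexp engine that supports lookbehind. This Python transliteration (an illustration, not the Emacs engine) reproduces both the handled cases and the one documented failure:

```python
import re

# Transliteration of the Emacs pattern "[^a-zA-Z0-9)]'[^'\n]'" using a
# lookbehind, so the preceding character is inspected but not consumed.
CHAR_LIT = re.compile(r"(?<![A-Za-z0-9)])'[^'\n]'")

print(CHAR_LIT.findall("Character'('C')"))    # ["'C'"]  - handled
print(CHAR_LIT.findall("Character'( 'C' )"))  # ["'C'"]  - handled
print(CHAR_LIT.findall("Character '('C')"))   # ["'('"]  - the mislexed case
```

In the last case the first quote is preceded only by a space, so the pattern greedily takes "'('" as a character literal, which is exactly the failure Stephen reports for `Character '('C')`.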
* Re: OpenToken: Parsing Ada (subset)? 2015-06-16 14:46 ` Stephen Leake @ 2015-06-16 15:31 ` G.B. 2015-06-17 17:44 ` Stephen Leake 2015-06-16 21:34 ` Randy Brukardt 1 sibling, 1 reply; 26+ messages in thread From: G.B. @ 2015-06-16 15:31 UTC (permalink / raw) On 16.06.15 16:46, Stephen Leake wrote: > Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL ISO/IEC 10646 question: Is [^'] matching a single wide character? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-16 15:31 ` G.B. @ 2015-06-17 17:44 ` Stephen Leake 0 siblings, 0 replies; 26+ messages in thread From: Stephen Leake @ 2015-06-17 17:44 UTC (permalink / raw) "G.B." <bauhaus@futureapps.invalid> writes: > On 16.06.15 16:46, Stephen Leake wrote: >> Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL > > ISO/IEC 10646 question: Is [^'] matching a single wide character? I don't think this is an ISO question; the Emacs regexp engine does not claim to match an ISO regexp spec (at least, not that I'm aware of). In practice, this does work for non-ASCII characters in Emacs, so the answer is "yes" for Emacs. I haven't tested this with Aflex yet; I don't think Aflex will work with non-ASCII text (one of the reasons I'm not planning to use it with Emacs). I suspect it would require a major rewrite to get Aflex to work with UTF-8 or some other non-ASCII representation. -- -- Stephe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-16 14:46 ` Stephen Leake 2015-06-16 15:31 ` G.B. @ 2015-06-16 21:34 ` Randy Brukardt 2015-06-17 17:58 ` Stephen Leake 1 sibling, 1 reply; 26+ messages in thread From: Randy Brukardt @ 2015-06-16 21:34 UTC (permalink / raw) "Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message news:85k2v3aeyv.fsf@stephe-leake.org... > Shark8 <onewingedshark@gmail.com> writes: > >> On Friday, June 5, 2015 at 3:03:30 AM UTC-6, Stephen Leake wrote: >>> >>> If you trust that the regexp engine is [1]well written and maintained, >>> [2]the expressive power is adequate for your language, and [3]the speed >>> is >>> adequate for your application, then why waste resources reimplementing >>> the tools? Use them and get on with the interesting work. >> >> While #3 seems to not come up often, #1 and #2 seem to be far more >> "question-able" -- #2 is especially misevaluated a LOT. (If it wasn't >> we wouldn't see people trying to parse HTML or CSV with regex.) > > Yes. > >>> regexp are perfectly adequate for Ada. >> >> Even something like Character'('C')? > > Hmm. I've never had problem with code like that, but it does seem like > the lexer could treat '(' as a character literal, which would produce a > parse error. > > Testing ... > > The Emacs lexer handles these properly: > > Character'('C') > Character'( 'C' ) > Character ' ( 'C' ) > > but not: > > Character '('C') > > (Which may be one reason I never write code the latter way :) > > The Emacs lexer regular expression for character literal is: > > "[^a-zA-Z0-9)]'[^'\n]'" > > which says if the first tick is preceded by identifier characters or > right paren, it's not a character literal; that explains the above > behavior, and works for typical Ada code in Emacs. > > But that regular expression doesn't work in a normal lexer, since it > references text before the beginning of the desired lexeme. Emacs is > _not_ a "normal lexer". 
> > Using the regular expression "'[^']'|''''" for CHARACTER_LITERAL > (handling the special case ''''), the Aflex lexer handles the > above cases as follows: > > Character'('C') > IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN > > Character'( 'C' ) > IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN > > Character '('C') > IDENTIFIER CHARACTER_LITERAL IDENTIFIER TICK RIGHT_PAREN > > Character ' ( 'C' ) > IDENTIFIER TICK LEFT_PAREN CHARACTER_LITERAL RIGHT_PAREN > > This is as expected (but not desired). > > One way to handle this is to provide for feedback from the parser to the > lexer; if a parse fails, push back the character literal, tell the lexer > to treat the first single quote as a TICK, and procede. I'll work on > implementing that in FastToken with the Aflex lexer; it will be a good > example. > > Another way is to treat this particular sequence of tokens as a valid > expression, but rewrite it before handing off to the rest of the parser. > That requires identifying all such special cases; not too hard. > > A third choice is to not define a CHARACTER_LITERAL token; then the > sequence of tokens is always > > IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN > > and the parser must identify the character literal, or the grammar must > be re-written in the same manner. That may be the simplest solution. > > If I recall correctly, this issue has been discussed here before, and > the proposed solutions were similar. I don't know how GNAT handles this. I don't think you identified the solution that is typically used: remember the previous token identified by the lexer. Then, when encountering an apostrophe, the token is unconditionally an apostrophe if the preceding token is "all", an identifier, a character or string literal, or an rparen; else it might be a character literal. No "feedback from the parser" needed (that seems like a nightmare to me). The method was originally proposed by Tischler in Ada Letters in July 1983, pg 36. 
(I got this out of the comments of Janus/Ada, of course.) I tend to agree with Dmitry; for lexing Ada, regular expressions are just not going to work; you'll need too many fixups to make them worth the trouble. Just write the thing in Ada, it won't take you any longer than figuring out the correct regular expression for an identifier. And that makes it easy to handle the weird special cases of Ada. Other approaches are going to lex some programs incorrectly; how important that is will vary depending on what kind of tool you are writing but since the effort is similar, it's hard to see the advantage of a regular expression or other "automatic" lexer. (It makes much more sense for a parser, where the effort can be orders of magnitude different.) Randy. ^ permalink raw reply [flat|nested] 26+ messages in thread
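[Editorial sketch] The previous-token rule Randy describes is straightforward to sketch. The toy Python lexer below (identifiers, parens, strings, character literals, and ticks only; all names are mine, not Janus/Ada's) applies it: a quote following an identifier, a literal, or a right paren is unconditionally a tick, otherwise the lexer tries to read a full character literal. (The reserved word "all" falls under IDENT here; the '''' special case is omitted for brevity.)

```python
import re

# Token kinds that force a following ' to be an apostrophe (tick)
# rather than the start of a character literal (Tischler's rule).
TICK_BEFORE = {'IDENT', 'CHAR', 'STRING', 'RPAREN'}

TOKEN_RE = re.compile(r"""
      (?P<IDENT>[A-Za-z][A-Za-z0-9_]*)
    | (?P<STRING>"(?:[^"]|"")*")
    | (?P<LPAREN>\() | (?P<RPAREN>\))
    | (?P<QUOTE>')
    | (?P<WS>\s+)
""", re.VERBOSE)

def lex(text):
    tokens, prev, i = [], None, 0
    while i < len(text):
        m = TOKEN_RE.match(text, i)
        if not m:
            raise ValueError(f"unexpected character at {i}")
        kind = m.lastgroup
        if kind == 'WS':                      # skip, but keep prev token
            i = m.end()
            continue
        if kind == 'QUOTE':
            if prev in TICK_BEFORE:           # unconditionally a tick
                kind, lexeme = 'TICK', "'"
                i = m.end()
            elif i + 2 < len(text) + 1 and text[i + 2:i + 3] == "'":
                kind, lexeme = 'CHAR', text[i:i + 3]   # 'x'
                i += 3
            else:
                kind, lexeme = 'TICK', "'"
                i = m.end()
        else:
            lexeme = m.group()
            i = m.end()
        tokens.append((kind, lexeme))
        prev = kind
    return tokens
```

With this rule, `Character'('C')` lexes as IDENT TICK LPAREN CHAR RPAREN with no feedback from the parser, which is Randy's point.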
* Re: OpenToken: Parsing Ada (subset)? 2015-06-16 21:34 ` Randy Brukardt @ 2015-06-17 17:58 ` Stephen Leake 2015-06-17 20:44 ` Randy Brukardt ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Stephen Leake @ 2015-06-17 17:58 UTC (permalink / raw) "Randy Brukardt" <randy@rrsoftware.com> writes: > "Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message > news:85k2v3aeyv.fsf@stephe-leake.org... >> One way to handle this is to provide for feedback from the parser to the >> lexer; if a parse fails, push back the character literal, tell the lexer >> to treat the first single quote as a TICK, and procede. I'll work on >> implementing that in FastToken with the Aflex lexer; it will be a good >> example. >> >> Another way is to treat this particular sequence of tokens as a valid >> expression, but rewrite it before handing off to the rest of the parser. >> That requires identifying all such special cases; not too hard. >> >> A third choice is to not define a CHARACTER_LITERAL token; then the >> sequence of tokens is always >> >> IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN >> >> and the parser must identify the character literal, or the grammar must >> be re-written in the same manner. That may be the simplest solution. >> >> If I recall correctly, this issue has been discussed here before, and >> the proposed solutions were similar. I don't know how GNAT handles this. > > I don't think you identified the solution that is typically used: remember > the previous token identified by the lexer. Then, when encountering an > apostrophe, the token is unconditionally an apostrophe if the preceding > token is "all", an identifier, a character or string literal, or an rparen; > else it might be a character literal. That's the third choice above; the lexer returns TICK (= apostrophe) for all cases, and the parser deals with further classification. Hmm. Unless you are saying that logic is in the lexer; I don't see that it matters much. 
Aflex does have a provision for adding some logic in a lexer, although I'm not sure it supports "remember the previous token". > No "feedback from the parser" needed > (that seems like a nightmare to me). The method was originally proposed by > Tischler in Ada Letters in July 1983, pg 36. (I got this out of the comments > of Janus/Ada, of course.) > > I tend to agree with Dmitry; for lexing Ada, regular expressions are just > not going to work; you'll need too many fixups to make them worth the > trouble. Just write the thing in Ada, it won't take you any longer than > figuring out the correct regular expression for an identifier. And that > makes it easy to handle the weird special cases of Ada. > > Other approaches are going to lex some programs incorrectly; how important > that is will vary depending on what kind of tool you are writing but since > the effort is similar, it's hard to see the advantage of a regular > expression or other "automatic" lexer. (It makes much more sense for a > parser, where the effort can be orders of magnitude different.) Ok. I guess I'd like to see some actual examples of hand-written lexers. The one in OpenToken is not inspiring to me; that's why I got rid of it for FastToken (it's definitely easier for me to write regexps than to write another OpenToken recognizer (= lexer module)). I have looked briefly at the GNAT lexer. It is highly optimized, and is apparently generated from some SNOBOL sources (ie, _not_ "hand written"). For example, it uses nested if-then-else on each character of each keyword; not something you want to do by hand. -- -- Stephe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-17 17:58 ` Stephen Leake @ 2015-06-17 20:44 ` Randy Brukardt 2015-06-18 7:51 ` AdaMagica 2015-06-18 9:12 ` Georg Bauhaus 2 siblings, 0 replies; 26+ messages in thread From: Randy Brukardt @ 2015-06-17 20:44 UTC (permalink / raw) "Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message news:85h9q68bf8.fsf@stephe-leake.org... > "Randy Brukardt" <randy@rrsoftware.com> writes: > >> "Stephen Leake" <stephen_leake@stephe-leake.org> wrote in message >> news:85k2v3aeyv.fsf@stephe-leake.org... > >>> One way to handle this is to provide for feedback from the parser to the >>> lexer; if a parse fails, push back the character literal, tell the lexer >>> to treat the first single quote as a TICK, and procede. I'll work on >>> implementing that in FastToken with the Aflex lexer; it will be a good >>> example. >>> >>> Another way is to treat this particular sequence of tokens as a valid >>> expression, but rewrite it before handing off to the rest of the parser. >>> That requires identifying all such special cases; not too hard. >>> >>> A third choice is to not define a CHARACTER_LITERAL token; then the >>> sequence of tokens is always >>> >>> IDENTIFIER TICK LEFT_PAREN TICK IDENTIFIER TICK RIGHT_PAREN >>> >>> and the parser must identify the character literal, or the grammar must >>> be re-written in the same manner. That may be the simplest solution. >>> >>> If I recall correctly, this issue has been discussed here before, and >>> the proposed solutions were similar. I don't know how GNAT handles this. >> >> I don't think you identified the solution that is typically used: >> remember >> the previous token identified by the lexer. Then, when encountering an >> apostrophe, the token is unconditionally an apostrophe if the preceding >> token is "all", an identifier, a character or string literal, or an >> rparen; >> else it might be a character literal. 
> > That's the third choice above; the lexer returns TICK (= apostrophe) for > all cases, and the parser deals with further classification. > > Hmm. Unless you are saying that logic is in the lexer; I don't see that > it matters much. Of course it's in the lexer, I don't see how it could be in the parser (especially if the parser is table-driven). The character following the Tick in a character literal could be anything, including characters that the parser doesn't even see (white space, characters like $, %, and # that aren't legal tokens, etc.). Trying to handle that in the parser would greatly complicate the grammar and most likely would introduce ambiguities into it. I could see it working in a hand-written parser, but that doesn't make much sense (why use a table-driven lexer and a hand-written parser? -- the parser is far more work). I also could see it working if one introduced a filter stage in between the lexer and parser -- but then you've just introduced a lot more complexity. ... > I have looked briefly at the GNAT lexer. It is highly optimized, and is > apparently generated from some SNOBOL sources (ie, _not_ "hand > written"). For example, it uses nested if-then-else on each character of > each keyword; not something you want to do by hand. The one in Janus/Ada started out life as a table-driven lexer that we were required to use in our original CS 701 compiler. But that couldn't handle the little glitches of Ada and it was huge, we needed something much smaller for our original CP/M compiler (which had to fit in 48K RAM). So it was replaced by hand-written code in the same style (essentially replacing the table transitions with case statements). (Actually, I think it was originally generated by a program from the transition table, and then hand-pruned because much of it was clearly redundant.) It uses a character classification table to reduce the number of transitions needed. 
For most classes, there are very few transitions (if you see "=", there isn't much that can follow it). It's fairly large (the entire unit is 2400 lines), but a large part of that is error handling code, code to create token records (which for efficiency is repeated everywhere rather than put into a subprogram - this code is very "hot" in a profile of the compiler, and anything that we can save matters) and code to manage identifiers and reserved words. I suspect I could do it in much less space if I was starting from scratch (which I probably will have to do to support UTF-8). Randy. ^ permalink raw reply [flat|nested] 26+ messages in thread
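[Editorial sketch] The structure Randy describes (a character classification table feeding a dispatch with few transitions per class) can be illustrated as follows; everything here is a toy of my own, not Janus/Ada's actual code:

```python
def classify(ch):
    """Character classification step (a function here; Janus/Ada uses a table)."""
    if ch.isalpha() or ch == '_':
        return 'letter'
    if ch.isdigit():
        return 'digit'
    if ch.isspace():
        return 'space'
    return 'delim'

def scan(text):
    """One branch per character class; each class has few transitions."""
    tokens, i = [], 0
    while i < len(text):
        cls = classify(text[i])
        if cls == 'space':
            i += 1
        elif cls == 'letter':              # identifier: letters, digits, '_'
            j = i + 1
            while j < len(text) and classify(text[j]) in ('letter', 'digit'):
                j += 1
            tokens.append(('IDENT', text[i:j]))
            i = j
        elif cls == 'digit':               # grossly simplified numeric literal
            j = i + 1
            while j < len(text) and classify(text[j]) == 'digit':
                j += 1
            tokens.append(('NUM', text[i:j]))
            i = j
        else:                              # delimiter: e.g. "=" has few followers
            if text[i] == '=' and text[i + 1:i + 2] == '>':
                tokens.append(('ARROW', '=>'))
                i += 2
            else:
                tokens.append(('DELIM', text[i]))
                i += 1
    return tokens
```

The point of the classification step is that each branch only has to consider a handful of successor characters, as with "=" above.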
* Re: OpenToken: Parsing Ada (subset)? 2015-06-17 17:58 ` Stephen Leake 2015-06-17 20:44 ` Randy Brukardt @ 2015-06-18 7:51 ` AdaMagica 2015-06-18 9:12 ` Georg Bauhaus 2 siblings, 0 replies; 26+ messages in thread From: AdaMagica @ 2015-06-18 7:51 UTC (permalink / raw) On Wednesday, June 17, 2015 at 19:58:06 UTC+2, Stephen Leake wrote: > Ok. I guess I'd like to see some actual examples of hand-written lexers. > The one in OpenToken is not inspiring to me; that's why I got rid of it > for FastToken (it's definitely easier for me to write regexps than to > write another OpenToken recognizer (= lexer module)). I found it very simple to write the multitude of recognizers needed for the Ada, Java, and HTML lexers I wrote for OpenToken. Maybe not inspiring, but simple. OK, forget the HTML lexer - it's unsatisfying. HTML syntax is incomparable to Ada's. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-17 17:58 ` Stephen Leake 2015-06-17 20:44 ` Randy Brukardt 2015-06-18 7:51 ` AdaMagica @ 2015-06-18 9:12 ` Georg Bauhaus 2 siblings, 0 replies; 26+ messages in thread From: Georg Bauhaus @ 2015-06-18 9:12 UTC (permalink / raw) On 17.06.15 19:58, Stephen Leake wrote: > Ok. I guess I'd like to see some actual examples of hand-written lexers. My silly Ada source highlighter effectively used look-back, too, as a state in the tokenizer, IIRC. Bigger language tokens would be built from smaller token pieces; there are definitions of Delimiter_1 and Delimiter_2, a constraint expressing a relation between Operator_1 and Delimiter_1 (of inclusion); a number of Character_Set-s and Is_* functions for classifying; but still not all tokens are classified correctly (bugs or leniency, depending on POV of this program). Incidentally, the ubiquitous highlite program has similar shortcomings when parsing Ada or Perl (they share the '''), last time I looked. Some more input for testing, omitting spaces and combinations: Character'('''); Character('''); Character'('('); Character('('); Character'(-'''); Character'('-'-'-'); Character'(-'''-'-'); Character'(''')'Alignment; Name(''')'Has_Same_Storage; Character'Pos('''); Character'Base'('''); Character'Base'Pos('''); ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: OpenToken: Parsing Ada (subset)? 2015-06-01 13:08 OpenToken: Parsing Ada (subset)? Jacob Sparre Andersen 2015-06-02 22:12 ` Stephen Leake @ 2015-06-17 17:50 ` AdaMagica 1 sibling, 0 replies; 26+ messages in thread From: AdaMagica @ 2015-06-17 17:50 UTC (permalink / raw) Just to add for completeness: The Ada lexer written by me in OpenToken handles Character'('C') correctly. The solution I found is not ideal. ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2015-06-18 9:12 UTC | newest] Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-06-01 13:08 OpenToken: Parsing Ada (subset)? Jacob Sparre Andersen 2015-06-02 22:12 ` Stephen Leake 2015-06-03 1:43 ` Shark8 2015-06-03 7:36 ` Dmitry A. Kazakov 2015-06-05 9:03 ` Stephen Leake 2015-06-05 9:23 ` Georg Bauhaus 2015-06-05 20:49 ` Shark8 2015-06-05 23:52 ` Dennis Lee Bieber 2015-06-05 12:20 ` Dmitry A. Kazakov 2015-06-16 12:43 ` Stephen Leake 2015-06-16 13:24 ` Dmitry A. Kazakov 2015-06-16 14:13 ` G.B. 2015-06-17 17:38 ` Stephen Leake 2015-06-17 17:29 ` Stephen Leake 2015-06-17 17:42 ` Shark8 2015-06-17 19:03 ` Dmitry A. Kazakov 2015-06-05 20:53 ` Shark8 2015-06-16 14:46 ` Stephen Leake 2015-06-16 15:31 ` G.B. 2015-06-17 17:44 ` Stephen Leake 2015-06-16 21:34 ` Randy Brukardt 2015-06-17 17:58 ` Stephen Leake 2015-06-17 20:44 ` Randy Brukardt 2015-06-18 7:51 ` AdaMagica 2015-06-18 9:12 ` Georg Bauhaus 2015-06-17 17:50 ` AdaMagica