From: Stephen Leake
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Tue, 16 Jun 2015 07:43:49 -0500
Message-ID: <85pp4vakmy.fsf@stephe-leake.org>

"Dmitry A. Kazakov" writes:

> On Fri, 05 Jun 2015 04:03:27 -0500, Stephen Leake wrote:
>
>> Aflex compiles all the regular expressions for all of the tokens
>> into one state machine, that visits each character in the input
>> stream once. You can't get faster than that.
>
> It is visiting source character vs. visiting internal states of the
> machine and transition computations. You cannot say what is more
> expensive in advance.
True; measuring is always required.

> A transition computation may be much more expensive than accessing a
> character stored in an array. However hand-written scanners do not
> roll back either, usually.

Ok.

>>> In the end it is always worth of efforts writing a manual token
>>> scanner by hand.
>>
>> "always" is way too strong a statement here.
>>
>> If you trust that the regexp engine is well written and maintained,
>> the expressive power is adequate for your language, and the speed is
>> adequate for your application, then why waste resources
>> reimplementing the tools? Use them and get on with the interesting
>> work.
>>
>> regexp are perfectly adequate for Ada.
>
> Examples I had in mind were Ada identifier and Ada string literal,
> e.g. in UTF-8 encoding. I don't think regular expression for these
> would be shorter than a hand-written in Ada scanner.

Yes, but also not longer!

regular expression for identifier: "foo"
code for identifier:               lexeme = "foo"

>>> Firstly, there are not so many things you would have to recognize
>>> that way.
>>
>> I guess you are saying that implementing a lexer for a restricted
>> set of tokens is easier than implementing a general regular
>> expression engine. True, but that's not the choice at hand; the
>> choice is between implementing a new lexer for a restricted set of
>> tokens, or reusing an existing regular expression engine (supported
>> and maintained externally) and specifying a small set of regular
>> expressions (most of which are simple strings for the reserved
>> words).
>
> Yes, it is writing regular expression vs. writing Ada program. I
> prefer Ada program.

I'm confused; you start by agreeing with my statement, but then say
something else.

I said "implement a general regular expression engine", meaning the Ada
code that interprets regular expressions at run time. I hope you agree
that writing that is harder than writing an Ada scanner.
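The "small set of regular expressions" approach can be sketched concretely. The following is a toy illustration in Python, not OpenToken's actual API; the token names and regexps are invented for the example. It combines a few reserved-word patterns and an identifier pattern into one alternation, so the engine visits the input in a single pass, much as an aflex-generated state machine does. One caveat: Python alternation takes the first matching branch rather than the longest match, so the reserved words are listed before IDENTIFIER and anchored with word boundaries.

```python
import re

# A toy scanner over a tiny Ada subset: a few reserved words plus
# identifiers and whitespace. Illustrative only, not OpenToken's tables.
TOKENS = [
    ("IF",         r"\bif\b"),
    ("THEN",       r"\bthen\b"),
    ("END",        r"\bend\b"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9_]*"),
    ("WHITESPACE", r"\s+"),
]

# Combine all token regexps into one pattern; the engine then scans
# the input in a single pass.
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKENS))

def scan(text):
    """Yield (token-name, lexeme) pairs, skipping whitespace."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "WHITESPACE":
            yield (m.lastgroup, m.group())

# list(scan("if x then end"))
# -> [('IF', 'if'), ('IDENTIFIER', 'x'), ('THEN', 'then'), ('END', 'end')]
```

Note that the word-boundary anchors keep "iffy" from being split into a reserved word plus a remainder; it scans as a single IDENTIFIER.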
To address your point: let's compare writing a regular expression with
writing Ada code to do the same thing.

Here's the regular expression I use for Ada numeric literals:

"\([0-9]+#\)?[-+0-9a-fA-F._]+\(#\)?"

Given that you are at least a little familiar with regular expressions,
there's nothing hard about that.

It does not enforce all the lexical rules for numbers; it allows
repeated, leading, and trailing underscores; it doesn't enforce pairs
of '#'. But there is no legal Ada text that will be misinterpreted by
this expression. It is always a trade-off between enforcing that sort
of rule in the lexer or in the parser.

The OpenToken Ada code that does the same job (while enforcing some of
the rules) is in opentoken-recognizer-based_integer_real_ada.ads, .adb,
and opentoken-recognizer-extended_digits.ads, .adb. That's a total of
469 lines of code. It is a state-based scanner; a recursive descent one
would be simpler.

It took several hours to write the Ada code, and a few more for the
tests. Later I found a bug that required a new test and a code change.
It took a few minutes to write the regular expression, and I reused the
tests; no bugs have been found yet.

Obviously, if you are not familiar with regular expressions, they will
be harder to write. But a software engineer should be willing to learn
the appropriate language for the job at hand.

To be fair, if you are used to writing Ada code that does scanning, and
have a good library for it, that job will be easier as well. So it may
just come down to experience and available tools.

I would guess that the average good programmer, starting with no
knowledge of either, can learn enough about regular expressions to
write the above faster than they can learn enough about writing
scanners in Ada to do that job well.

-- 
-- Stephe
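The numeric-literal expression can be exercised directly. Below is a minimal sketch in Python syntax (an assumption: the Emacs-style \(...\) groups become (...), and '_' is included in the character class so that underscored literals pass, per the description of what the expression allows). re.fullmatch here only checks that a whole lexeme is accepted; it is not a lexer.

```python
import re

# Python translation of the Emacs-style regexp for Ada numeric
# literals; '_' included in the class so underscored literals pass.
NUMERIC = re.compile(r"([0-9]+#)?[-+0-9a-fA-F._]+(#)?")

def accepts(lexeme: str) -> bool:
    # True if the whole lexeme is matched by the expression.
    return NUMERIC.fullmatch(lexeme) is not None
```

As noted above, the expression is deliberately loose: lexemes such as "1__0" or "16#FF" (unpaired '#') are accepted here, with enforcement of those rules left to the parser.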