From: Georg Bauhaus
Newsgroups: comp.lang.ada
Subject: Re: OpenToken: Parsing Ada (subset)?
Date: Thu, 18 Jun 2015 11:12:30 +0200
In-Reply-To: <85h9q68bf8.fsf@stephe-leake.org>

On 17.06.15 19:58, Stephen Leake wrote:

> Ok. I guess I'd like to see some actual examples of hand-written lexers.

My silly Ada source highlighter effectively used look-back, too, as a
state in the tokenizer, IIRC.
Bigger language tokens were built from smaller token pieces: there are
definitions of Delimiter_1 and Delimiter_2, a constraint expressing a
relation of inclusion between Operator_1 and Delimiter_1, and a number
of Character_Set-s and Is_* functions for classifying. Still, not all
tokens are classified correctly (bugs or leniency, depending on the POV
of this program).

Incidentally, the ubiquitous highlite program had similar shortcomings
when parsing Ada or Perl (they share the '''), last time I looked.

Some more input for testing, omitting spaces and combinations:

   Character'(''');
   Character(''');
   Character'('(');
   Character('(');
   Character'(-''');
   Character'('-'-'-');
   Character'(-'''-'-');
   Character'(''')'Alignment;
   Name(''')'Has_Same_Storage;
   Character'Pos(''');
   Character'Base'(''');
   Character'Base'Pos(''');