From mboxrd@z Thu Jan 1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level:
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=unavailable autolearn_force=no version=3.4.4
Path: eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!feeder.eternal-september.org!nntp-feed.chiark.greenend.org.uk!ewrotcd!newsfeed.xs3.de!io.xs3.de!news.jacob-sparre.dk!franka.jacob-sparre.dk!pnx.dk!.POSTED.rrsoftware.com!not-for-mail
From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: Encaspulation: What to export
Date: Thu, 30 Nov 2017 15:42:23 -0600
Organization: JSA Research & Innovation
Message-ID:
References: <8666203a-4e42-438d-8fe0-1a63f643955f@googlegroups.com> <1aab7965-08cf-472f-9322-bfabb6f2c728@googlegroups.com>
Injection-Date: Thu, 30 Nov 2017 21:42:23 -0000 (UTC)
Injection-Info: franka.jacob-sparre.dk; posting-host="rrsoftware.com:24.196.82.226"; logging-data="9167"; mail-complaints-to="news@jacob-sparre.dk"
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-RFC2646: Format=Flowed; Original
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.7246
Xref: reader02.eternal-september.org comp.lang.ada:49280
Date: 2017-11-30T15:42:23-06:00
List-Id:

wrote in message news:1aab7965-08cf-472f-9322-bfabb6f2c728@googlegroups.com...
> Randy Brukardt:
>
>> Really? We don't have any parser (just a lexer) in the search engine
>> crawler. As I recall, section closes are counted rather than anything
>> more complex.
>
> Sure. You could say the same for gathering identifiers from Ada sources
> for a search engine for Ada sources. A lexer is ok for that job. Would you
> conclude that Ada doesn't need to be parsed ?
> Or reversely, how would you manage to display HTML lists or tables without
> a parser ? With a search engine crawler you just throw away the HTML
> structures.
> This is okay for your crawler: you just need the text between the tags.

Not really true, since we tag the URLs with the type of reference (automatic, like images, or manual, like links), and that requires identifying the enclosing construct as well as the "attribute" containing the URL. And there are a few cases where the meaning of the attribute is different in different constructs.

As far as lists or tables go, you can build an HTML tree without any parsing, so I don't see any requirement to parse HTML. (It might be easier to build a parser using a tool than a hand-constructed tree builder if one's needs are complex enough, but that doesn't change the underlying issue.)

                    Randy.
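
To make the two claims above concrete, here is a rough sketch in Python. It is not the actual crawler code. It assumes well-formed input with explicit close tags and double-quoted attribute values, and the names `REFERENCE_KIND`, `VOID`, `tag_urls`, and `build_tree` are all hypothetical, as is the classification table itself. The first function tags each URL by looking only at the enclosing construct and the attribute name; the second builds a tree from the same token stream with nothing more than a stack of open elements, with no grammar involved.

```python
import re

# Hypothetical classification table (an assumption for illustration only):
# the pair (construct, attribute) decides the kind of reference, since the
# same attribute name can mean different things in different constructs.
REFERENCE_KIND = {
    ("img", "src"): "automatic",
    ("script", "src"): "automatic",
    ("link", "href"): "automatic",
    ("a", "href"): "manual",
}

# Constructs that never have a close tag (a partial list, for the sketch).
VOID = {"img", "br", "hr", "link", "meta", "input"}

# The whole "lexer": one pattern for tags, one for quoted attributes.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")
ATTR = re.compile(r'([a-zA-Z-]+)\s*=\s*"([^"]*)"')


def tag_urls(html):
    """Return (url, kind) pairs using only the tag tokens."""
    found = []
    for match in TAG.finditer(html):
        closing, name, rest = match.group(1), match.group(2).lower(), match.group(3)
        if closing:
            continue
        for attr, value in ATTR.findall(rest):
            kind = REFERENCE_KIND.get((name, attr.lower()))
            if kind:
                found.append((value, kind))
    return found


def build_tree(html):
    """Build a nested dict tree by tracking open constructs on a stack."""
    root = {"name": "#root", "children": []}
    stack = [root]
    pos = 0
    for match in TAG.finditer(html):
        text = html[pos:match.start()].strip()
        if text:
            stack[-1]["children"].append(text)
        pos = match.end()
        closing, name, rest = match.group(1), match.group(2).lower(), match.group(3)
        if closing:
            # Close the nearest open construct with this name, if there is one;
            # a stray close tag is ignored. The root is never popped.
            for depth in range(len(stack) - 1, 0, -1):
                if stack[depth]["name"] == name:
                    del stack[depth:]
                    break
        else:
            node = {"name": name, "children": []}
            stack[-1]["children"].append(node)
            if name not in VOID and not rest.rstrip().endswith("/"):
                stack.append(node)
    tail = html[pos:].strip()
    if tail:
        stack[-1]["children"].append(tail)
    return root


page = '<ul><li><a href="a.html">A</a></li><li><img src="b.png"></li></ul>'
print(tag_urls(page))  # [('a.html', 'manual'), ('b.png', 'automatic')]
```

Real-world HTML would need more care (unquoted attributes, comments, implied close tags such as a second `<li>` ending the first), which is where a tool-built parser might start to pay off.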