From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: Ann: Natools.Chunked_Strings, beta 1
Date: Thu, 1 Dec 2011 18:39:53 -0600
References: <4ed4fc37$0$2537$ba4acef3@reader.news.orange.fr>

"Dmitry A. Kazakov" wrote in message
news:de6vkicxgv4x$.1c89iragml7xf$.dlg@40tude.net...
> On Wed, 30 Nov 2011 18:11:10 -0600, Randy Brukardt wrote:
...
>> (1) The Trash-Finder spam filter uses an "append-all" pattern to handling
>> text and html filtering (along with a few replacements). That's mainly
>> because it is best to ignore line-breaks in such matching.
>> I could have
>> invented a different data-structure for that use, but it would have just
>> meant more work (especially to recreate the string pattern-matching
>> operations, which are used extensively).
>
> See, the pattern matcher should have the "line end" atom. My pattern
> matcher has it.

The "pattern matcher" I'm talking about is the one built into the
Ada.Strings packages. I'm not going to add anything to that.

Let me reiterate that the design of this spam filter was specifically
intended to use as much as possible from the predefined string packages.
I wanted to learn more about using them (I had not done that in the first
8 or so years of their existence) and possibly build an example of good
use of these strings. Your attitude seems to be that you should
roll-your-own (my first tendency as well), and that clearly is not the
best idea for many software projects (as you spend a lot of time rolling
that would be better spent on the application).

With this in mind, I chose a representation for mail messages as a linked
list of unbounded strings, one per line. (Line endings are significant for
many of the header elements in MIME and SMTP, so I need to preserve those
initially.)

[In hindsight, I would have dumped the unbounded strings and made the
linked list use discriminated records instead -- the memory management
work would be the same and direct access to slices would be convenient for
some operations. But I was committed to using unbounded strings in order
to gain experience. Another alternative would be to declare the linked
list using the containers package, which did not yet exist at the time I
created this application; then the unbounded strings would make more
sense.]

Pattern matching of the body (only) of the message occurs on "normalized"
text that has encodings, extra spaces, and line endings removed.
And if the body is HTML, the body is split into text and markup portions
(with each having a separate set of filters applied). This I did in large
single unbounded strings, because it was a better match for the (simple)
matching facilities offered in the Ada.Strings packages. Making a copy of
the text is necessary in any case (since we don't want to modify the
original for lots of reasons), so the result might as well be put into the
most convenient format.

Declaring some other array of characters type to hold this text would be
silly -- it surely is a string, and it would require writing pattern
matching code from scratch (or using lots of type conversions). What would
be the point?

I understand your comment about "weak typing", but it doesn't make sense
here -- the entire task is weakly typed. There is hardly anything you can
assume about incoming messages, because spammers look to exploit such
assumptions to evade your filters. So you have to verify everything from a
"blob" format, and once you have done so, it makes little sense to spend a
lot of extra effort changing the type of everything. I put the strong
typing into the data structures that store the determined structure of the
message, for instance (it points at the various lines that start and end
each section).

                                    Randy.
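
For illustration, a minimal sketch of the "append-all" normalization
described above, using only Ada.Strings.Unbounded. This is not the
Trash-Finder code: Line_Array, Normalize, and the sample text are
hypothetical names invented for this example, decoding of encodings is
left out, and treating each line break as a single blank is an
assumption rather than something stated in the post.

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Append_All_Sketch is

   --  Hypothetical stand-in for the message body: one unbounded
   --  string per line (the post describes a linked list instead).
   type Line_Array is array (Positive range <>) of Unbounded_String;

   --  Join all lines into one string, collapsing runs of blanks and
   --  turning each line break into at most one blank (an assumption),
   --  so that a pattern can match across the original line boundaries.
   function Normalize (Lines : Line_Array) return Unbounded_String is
      Result     : Unbounded_String;
      Last_Blank : Boolean := True;  --  True also drops leading blanks
   begin
      for L in Lines'Range loop
         for I in 1 .. Length (Lines (L)) loop
            declare
               C : constant Character := Element (Lines (L), I);
            begin
               if C = ' ' then
                  if not Last_Blank then
                     Append (Result, ' ');
                  end if;
                  Last_Blank := True;
               else
                  Append (Result, C);
                  Last_Blank := False;
               end if;
            end;
         end loop;

         --  End of line: emit one separating blank unless one is pending.
         if not Last_Blank then
            Append (Result, ' ');
            Last_Blank := True;
         end if;
      end loop;
      return Result;
   end Normalize;

   Message_Body : constant Line_Array :=
     (To_Unbounded_String ("Buy   cheap"),
      To_Unbounded_String ("  watches now"));

   Text : constant Unbounded_String := Normalize (Message_Body);

begin
   --  Index is the predefined search in Ada.Strings.Unbounded;
   --  it returns 0 when the pattern does not occur.
   if Index (Text, "cheap watches") > 0 then
      Ada.Text_IO.Put_Line ("pattern found");
   end if;
end Append_All_Sketch;
```

The point of the single normalized string is that the predefined Index
can find "cheap watches" even though the phrase is split across two
input lines with irregular spacing, without any custom matcher.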