From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Thread: 103376,a65bb7bde679ed1d
X-Google-NewGroupId: yes
X-Google-Attributes: gida07f3367d7,domainid0,public,usenet
X-Google-Language: ENGLISH,ASCII-7-bit
Received: by 10.68.31.165 with SMTP id b5mr1868162pbi.1.1322786397551;
        Thu, 01 Dec 2011 16:39:57 -0800 (PST)
MIME-Version: 1.0
Path: 
 lh20ni54722pbb.0!nntp.google.com!news1.google.com!goblin2!goblin1!goblin.stu.neva.ru!news.tornevall.net!news.jacob-sparre.dk!pnx.dk!jacob-sparre.dk!ada-dk.org!.POSTED!not-for-mail
From: "Randy Brukardt" <randy@rrsoftware.com>
Newsgroups: comp.lang.ada
Subject: Re: Ann: Natools.Chunked_Strings, beta 1
Date: Thu, 1 Dec 2011 18:39:53 -0600
Organization: Jacob Sparre Andersen Research & Innovation
Message-ID: <jb96or$b95$1@munin.nbi.dk>
References: <slrnjd9tpk.1lme.lithiumcat@sigil.instinctive.eu>
 <4ed4fc37$0$2537$ba4acef3@reader.news.orange.fr>
 <op.v5q874xcule2fv@douda-yannick>
 <ouubrb3trn06$.1jl5q3ausoy2v.dlg@40tude.net> <jb6gn0$47g$1@munin.nbi.dk>
 <de6vkicxgv4x$.1c89iragml7xf$.dlg@40tude.net>
NNTP-Posting-Host: static-69-95-181-76.mad.choiceone.net
X-Trace: munin.nbi.dk 1322786396 11557 69.95.181.76 (2 Dec 2011 00:39:56 GMT)
X-Complaints-To: news@jacob-sparre.dk
NNTP-Posting-Date: Fri, 2 Dec 2011 00:39:56 +0000 (UTC)
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-RFC2646: Format=Flowed; Original
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Xref: news1.google.com comp.lang.ada:19298
Date: 2011-12-01T18:39:53-06:00
List-Id: <comp.lang.ada>

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:de6vkicxgv4x$.1c89iragml7xf$.dlg@40tude.net...
> On Wed, 30 Nov 2011 18:11:10 -0600, Randy Brukardt wrote:
...
>> (1) The Trash-Finder spam filter uses an "append-all" pattern to handling
>> text and html filtering (along with a few replacements). That's mainly
>> because it is best to ignore line-breaks in such matching. I could have
>> invented a different data-structure for that use, but it would have just
>> meant more work (especially to recreate the string pattern-matching
>> operations, which are used extensively).
>
> See, the pattern matcher should have the "line end" atom. My pattern
> matcher has it.

The "pattern matcher" I'm talking about is the one built into the 
Ada.Strings packages. I'm not going to add anything to that.

Let me reiterate that the design of this spam filter was specifically 
intended to use as much as possible from the pre-defined string packages. 
I wanted to learn more about using them (I had not done that in 
the first 8 or so years of their existence) and possibly build an example of 
good use of these strings. Your attitude seems to be that you should 
roll-your-own (my first tendency as well), and that clearly is not the best 
idea for many software projects (as you spend a lot of time rolling that 
would be better spent on the application).

With this in mind, I chose a representation for mail messages as a linked 
list of unbounded strings, one per line. (Line endings are significant for 
many of the header elements in MIME and SMTP, so I need to preserve those 
initially.) [In hindsight, I would have dumped the unbounded strings and 
made the linked list use discriminated records instead -- the memory 
management work would be the same and direct access to slices would be 
convenient for some operations. But I was committed to using unbounded 
strings in order to gain experience. Another alternative would be 
to declare the linked lists using the containers package, which did not yet 
exist at the time I created this application; then the unbounded strings 
would make more sense.]
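
To make the two representations concrete, here is a minimal sketch. All of 
the names (Line_Node, Fixed_Node, Append_Line) are invented for illustration 
and are not the declarations used in the actual filter. The first type is the 
list of unbounded strings, one per line; the second is the 
discriminated-record alternative, where the text is an ordinary String and 
slices are directly available.

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Message_Lines_Sketch is

   --  Hypothetical names throughout; not the real filter's declarations.
   type Line_Node;
   type Line_Access is access Line_Node;

   --  One node per message line, text held in an unbounded string.
   type Line_Node is record
      Text : Unbounded_String;
      Next : Line_Access;
   end record;

   --  The hindsight alternative: a discriminated record, so the text is
   --  an ordinary String and slices can be taken without a conversion.
   type Fixed_Node (Length : Natural);
   type Fixed_Access is access Fixed_Node;
   type Fixed_Node (Length : Natural) is record
      Text : String (1 .. Length);
      Next : Fixed_Access;
   end record;

   Head, Tail : Line_Access;

   --  Add a line at the end of the list, preserving message order.
   procedure Append_Line (Item : String) is
      Node : constant Line_Access :=
        new Line_Node'(Text => To_Unbounded_String (Item), Next => null);
   begin
      if Head = null then
         Head := Node;
      else
         Tail.Next := Node;
      end if;
      Tail := Node;
   end Append_Line;

   Cursor : Line_Access;
   Fixed  : Fixed_Access;
begin
   Append_Line ("Subject: Re: Ann: Natools.Chunked_Strings, beta 1");
   Append_Line ("MIME-Version: 1.0");

   --  Walk the list; each line keeps its own boundaries.
   Cursor := Head;
   while Cursor /= null loop
      Ada.Text_IO.Put_Line (To_String (Cursor.Text));
      Cursor := Cursor.Next;
   end loop;

   --  With the discriminated record, slicing is plain array slicing.
   Fixed := new Fixed_Node'(Length => 7, Text => "Subject", Next => null);
   Ada.Text_IO.Put_Line (Fixed.Text (1 .. 3));
end Message_Lines_Sketch;
```

Either way the nodes are allocated and freed by hand, which is why the memory 
management work comes out about the same.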

Pattern matching of the body (only) of the message occurs on "normalized" 
text that has encodings, extra spaces, and line endings removed. And if the 
body is HTML, the body is split into text and markup portions (with each 
having a separate set of filters applied). This I did in large single 
unbounded strings, because it was a better match for the (simple) matching 
facilities offered in the Ada.Strings packages. Making a copy of the text is 
necessary in any case (since we don't want to modify the original for lots 
of reasons), so the copy might as well go into the most convenient format. 
Declaring some other array of characters type to hold this text would be 
silly -- it surely is a string, and it would require writing pattern 
matching code from scratch (or using lots of type conversions). What would 
be the point?
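
A rough sketch of that "append-all" step, under assumptions of my own: the 
Normalize function, the sample lines, and the choice to treat a line break as 
a single space are all invented here, and decoding of encodings is left out. 
The only library facilities used are Append, Element, Length and Index from 
Ada.Strings.Unbounded, plus Lower_Case_Map from Ada.Strings.Maps.Constants.

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Strings.Maps.Constants;
with Ada.Text_IO;

procedure Normalized_Match_Sketch is

   --  Stand-in for the per-line message store (hypothetical data).
   type Line_Array is array (Positive range <>) of Unbounded_String;

   Message_Body : constant Line_Array :=
     (To_Unbounded_String ("Buy   cheap"),
      To_Unbounded_String ("watches  today"));

   --  Build one large string from all lines. Runs of spaces collapse to
   --  one space. Assumption: a line break counts as a single space.
   --  The original lines are left untouched.
   function Normalize (Lines : Line_Array) return Unbounded_String is
      Result     : Unbounded_String;
      Last_Space : Boolean := True;
      Ch         : Character;
   begin
      for I in Lines'Range loop
         for J in 1 .. Length (Lines (I)) loop
            Ch := Element (Lines (I), J);
            if Ch /= ' ' then
               Append (Result, Ch);
               Last_Space := False;
            elsif not Last_Space then
               Append (Result, ' ');
               Last_Space := True;
            end if;
         end loop;
         --  End of line: emit at most one separating space.
         if not Last_Space then
            Append (Result, ' ');
            Last_Space := True;
         end if;
      end loop;
      return Result;
   end Normalize;

   Text : constant Unbounded_String := Normalize (Message_Body);
begin
   --  Index is the predefined search. The mapping is applied to the
   --  source, so a lower-case pattern matches regardless of case.
   if Index (Source  => Text,
             Pattern => "cheap watches",
             Mapping => Ada.Strings.Maps.Constants.Lower_Case_Map) /= 0
   then
      Ada.Text_IO.Put_Line ("pattern found");
   end if;
end Normalized_Match_Sketch;
```

The pattern here spans what was a line break in the original lines, which is 
the case that per-line matching would miss.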

I understand your comment about "weak typing", but it doesn't make sense 
here -- the entire task is weakly typed. There is hardly anything you can 
assume about incoming messages, because spammers look to exploit such 
assumptions to evade your filters. So you have to verify everything from a 
"blob" format, and once you have done so, it makes little sense to spend a 
lot of extra effort changing the type of everything. I put the strong typing 
into the data structures that store the determined structure of the message, 
for instance (they point at the various lines that start and end each 
section).
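
As an illustration only, a strongly typed description of the verified 
structure might look something like the following. The type names, the 
section kinds, and the line numbers are all made up, and line numbers stand 
in for whatever references to list nodes a real implementation would hold.

```ada
with Ada.Text_IO;

procedure Section_Map_Sketch is

   --  Hypothetical types; a distinct numeric type keeps line positions
   --  from being mixed up with other integers.
   type Line_Number is new Positive;

   type Section_Kind is (Headers, Plain_Text, HTML, Attachment);

   --  Each verified section records where it starts and ends.
   type Section is record
      Kind  : Section_Kind;
      First : Line_Number;
      Last  : Line_Number;
   end record;

   type Section_List is array (Positive range <>) of Section;

   --  Invented values, as if produced by the verification pass.
   Layout : constant Section_List :=
     ((Kind => Headers,    First => 1,  Last => 12),
      (Kind => Plain_Text, First => 14, Last => 40));
begin
   for I in Layout'Range loop
      Ada.Text_IO.Put_Line
        (Section_Kind'Image (Layout (I).Kind)
         & Line_Number'Image (Layout (I).First)
         & " .."
         & Line_Number'Image (Layout (I).Last));
   end loop;
end Section_Map_Sketch;
```

The raw text stays a string; only the conclusions drawn from it get their own 
types.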

                                   Randy.