From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: Ann: Natools.Chunked_Strings, beta 1
Date: Thu, 1 Dec 2011 18:39:53 -0600
Message-ID: <jb96or$b95$1@munin.nbi.dk>
In-Reply-To: de6vkicxgv4x$.1c89iragml7xf$.dlg@40tude.net
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:de6vkicxgv4x$.1c89iragml7xf$.dlg@40tude.net...
> On Wed, 30 Nov 2011 18:11:10 -0600, Randy Brukardt wrote:
...
>> (1) The Trash-Finder spam filter uses an "append-all" pattern for handling
>> text and html filtering (along with a few replacements). That's mainly
>> because it is best to ignore line-breaks in such matching. I could have
>> invented a different data-structure for that use, but it would have just
>> meant more work (especially to recreate the string pattern-matching
>> operations, which are used extensively).
>
> See, the pattern matcher should have the "line end" atom. My pattern
> matcher has it.
The "pattern matcher" I'm talking about is the one built into the
Ada.Strings packages. I'm not going to add anything to that.
Let me reiterate that the design of this spam filter was specifically
intended to use as much as possible from the predefined string packages.
I wanted to learn about using them more (I had not done that in
the first 8 or so years of their existence) and possibly build an example of
good use of these strings. Your attitude seems to be that you should
roll your own (my first tendency as well), and that clearly is not the best
idea for many software projects (you spend a lot of time rolling your own
that would be better spent on the application).
With this in mind, I chose a representation for mail messages as a linked
list of unbounded strings, one per line. (Line endings are significant for
many of the header elements in MIME and SMTP, so I need to preserve those
initially.) [In hindsight, I would have dumped the unbounded strings and
made the linked list use discriminated records instead -- the memory
management work would be the same and direct access to slices would be
convenient for some operations. But I was committed to using unbounded
strings in order to gain experience. Another alternative would be to
declare the linked lists using the containers package, which did not yet
exist when I created this application; then the unbounded strings would
make more sense.]
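
For illustration only, here is a minimal sketch of the containers-based alternative mentioned above: one unbounded string per message line, held in a standard doubly linked list. It is not Trash-Finder's actual code; the names Message_Lines_Sketch, Line_Lists, and Message are hypothetical, and it assumes an Ada 2005 or later compiler.

```ada
with Ada.Containers.Doubly_Linked_Lists;
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Message_Lines_Sketch is

   --  One list element per message line. Line endings are not stored;
   --  they are implied by the element boundaries, so they stay recoverable.
   package Line_Lists is new Ada.Containers.Doubly_Linked_Lists
     (Element_Type => Unbounded_String);

   Message : Line_Lists.List;        --  hypothetical name
   Pos     : Line_Lists.Cursor;

begin
   Message.Append (To_Unbounded_String ("Subject: Hello"));
   Message.Append (To_Unbounded_String ("Content-Type: text/plain"));
   Message.Append (Null_Unbounded_String);   --  blank line ends the headers
   Message.Append (To_Unbounded_String ("Body text."));

   --  Walk the list with a cursor and print each line.
   Pos := Message.First;
   while Line_Lists.Has_Element (Pos) loop
      Ada.Text_IO.Put_Line (To_String (Line_Lists.Element (Pos)));
      Line_Lists.Next (Pos);
   end loop;
end Message_Lines_Sketch;
```

The container takes over the list's memory management, which is the work that otherwise has to be done by hand in either of the other two designs.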
Pattern matching of the body (only) of the message occurs on "normalized"
text that has encodings, extra spaces, and line endings removed. And if the
body is HTML, the body is split into text and markup portions (with each
having a separate set of filters applied). This I did in large single
unbounded strings, because it was a better match for the (simple) matching
facilities offered in the Ada.Strings packages. Making a copy of the text is
necessary in any case (since we don't want to modify the original, for lots
of reasons), so it makes sense to put the result into the most convenient
format.
Declaring some other array of characters type to hold this text would be
silly -- it surely is a string, and it would require writing pattern
matching code from scratch (or using lots of type conversions). What would
be the point?
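
As a rough sketch of the idea (not the filter's real code), the following builds one normalized unbounded string from several body lines and then searches it with the predefined Index function. Add_Line and Normalized are hypothetical names, and joining lines with a single space is an assumption made for the example; decoding and collapsing of interior runs of spaces are omitted.

```ada
with Ada.Strings.Fixed;
with Ada.Strings.Maps.Constants;
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Normalized_Match_Sketch is

   --  A copy of the body text; the original message lines stay untouched.
   Normalized : Unbounded_String;

   --  Append one body line with leading and trailing blanks trimmed.
   --  Assumption: lines are joined by a single space in place of the
   --  line break.
   procedure Add_Line (Line : String) is
      Trimmed : constant String :=
        Ada.Strings.Fixed.Trim (Line, Ada.Strings.Both);
   begin
      if Trimmed'Length = 0 then
         return;
      end if;
      if Length (Normalized) > 0 then
         Append (Normalized, ' ');
      end if;
      Append (Normalized, Trimmed);
   end Add_Line;

begin
   Add_Line ("Buy cheap   ");
   Add_Line ("   WATCHES today");

   --  The mapping is applied to the source text only, so the pattern
   --  itself must already be in lower case.
   if Index (Source  => Normalized,
             Pattern => "cheap watches",
             Mapping => Ada.Strings.Maps.Constants.Lower_Case_Map) > 0
   then
      Ada.Text_IO.Put_Line ("pattern found");
   end if;
end Normalized_Match_Sketch;
```

Because the whole body sits in one string, a phrase that a spammer splits across a line break is still found by a single Index call.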
I understand your comment about "weak typing", but it doesn't make sense
here -- the entire task is weakly typed. There is hardly anything you can
assume about incoming messages, because spammers look to exploit such
assumptions to evade your filters. So you have to verify everything from a
"blob" format, and once you have done so, it makes little sense to spend a
lot of extra effort changing the type of everything. I put the strong typing
into the data structures that store the determined structure of the message
(for instance, they point at the various lines that start and end each
section).
Randy.