comp.lang.ada
From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: If anybody wants to make something in Ada but do not know what
Date: Wed, 16 Apr 2003 15:01:09 -0500
Message-ID: <v9rdn4e8tej0d2@corp.supernews.com>
In-Reply-To: slrnb9qkhg.2t6.randhol+news@kiuk0152.chembio.ntnu.no

Preben Randhol wrote in message ...
>Then perhaps a Bayesian Spam filter could be a nice challenge. Or if
>somebody are heading a university student project/diploma work it could
>a suitable project?


What kind of spam filter are you talking about? A filter for a server
differs in a number of ways from a filter for a mail client. And a
filter for an ISP or large company differs from a filter for a tiny
organization.

That said, an anti-spam filter written in Ada already exists: it's
called Trash Finder, and it works with the IMS mail server on Windows. I
haven't publicized it here precisely because no one here can use it. :-)
It is of course 100% in Ada, and it filters for literally dozens of
criteria -- after fully decoding and unfolding the message (a
significant percentage of spam is encoded). Among other things, it
filters on character sets, attachment types, violations of RFCs in the
mail format (spammers have a hard time following RFCs), specific HTML
features (forms, scripts, graphics, text outside of the markup, etc.),
From, To, Subjects, Text (without the HTML markup, which often can be
used to hide things), HTML markup, and (most recently) domains in URLs
given either in HTML markup or text.
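
As a rough illustration of the "decode and unfold first" step (this is
not Trash Finder's code, which is in Ada and is not shown here), the
following Python sketch uses the standard email package to undo
transfer encodings, unfold headers, and separate visible text from HTML
markup. The names TextExtractor and visible_text are made up for this
example, and error handling for malformed messages is omitted.

```python
import email
from email import policy
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(raw_bytes):
    """Return (subject, text) with encodings undone and headers unfolded.

    Hypothetical helper for illustration only.
    """
    # policy.default unfolds headers and decodes RFC 2047 encoded words.
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    subject = str(msg["Subject"] or "")
    parts = []
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype == "text/plain":
            # get_content() reverses base64 / quoted-printable encoding.
            parts.append(part.get_content())
        elif ctype == "text/html":
            extractor = TextExtractor()
            extractor.feed(part.get_content())
            extractor.close()
            parts.append("".join(extractor.chunks))
    return subject, "\n".join(parts)
```

Filtering on the decoded text rather than the raw bytes means an
encoded message cannot hide its words from the later checks.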

It filters about 98% of the incoming spam on my system.

You'll note that there isn't a Bayesian filter. That has always been on
the 'wish list', but there are a variety of reasons I no longer expect it
to be very effective. I study a lot of spam each week in an effort to
find more ways to automatically discard spam without discarding good
mail, so I think I'm reasonably qualified to talk on this subject.

First of all, Bayesian filters are most effective the closer to the
client that they are. On the server, they have to filter everyone's mail,
and that necessarily means that they have to let more stuff through. For
example, 'casino' would be very unlikely to appear in my mail, but an
ISP could hardly block mail containing that word. So it doesn't make a
lot of sense to put such a filter on the server.
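
To make the client-versus-server point concrete, here is a minimal
Python sketch of one common way to estimate how "spammy" a single word
is from counts in a spam corpus and a good-mail corpus. The function
name and all the counts used with it are hypothetical, chosen only to
illustrate the effect; equal prior weight for spam and good mail is an
assumption.

```python
def spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate P(spam | word) from how often the word occurs in each corpus.

    spam_count / ham_count: messages of each kind containing the word.
    n_spam / n_ham: total messages of each kind seen in training.
    Assumes spam and good mail are equally likely beforehand.
    """
    spam_rate = spam_count / n_spam
    ham_rate = ham_count / n_ham
    if spam_rate + ham_rate == 0.0:
        return 0.5  # word never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical counts for the word 'casino'.
single_user = spam_probability(40, 0, 100, 100)       # never in this user's good mail
whole_server = spam_probability(400, 150, 1000, 1000) # some customers do get casino mail
```

With these made-up counts the single-user estimate is 1.0, while the
pooled server estimate drops to about 0.73, so the same word is much
weaker evidence once everyone's mail is mixed together.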

Secondly, most of the effectiveness of Bayesian filters has come from
the fact that they include the HTML markup in the text stored. Spammers
have figured that out, and are now sending a lot more plain text
messages. I've seen a number of spammers that as a matter of course send
an HTML version and a text version of the same message a couple of days
apart.

Thirdly, spammers have started sending random strings of junk (usually
placed so it won't display) as part of messages. Depending on the
filter, that can make a lot of messages look "OK" to a Bayesian filter,
because they often treat unknown words as unlikely to be spam. Even if
they don't do that, they tend to clog up the database with lots of junk
'words'.
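
The dilution effect can be sketched in a few lines of Python. This is a
deliberately naive combiner that uses every word in the message and
gives unknown words an assumed default probability of 0.4; real filters
differ (many only look at the most extreme words), so treat the names
and numbers here as illustrative assumptions, not as how any particular
filter works.

```python
import math

# Assumed default for words never seen in training (slightly "innocent").
UNKNOWN_WORD_PROBABILITY = 0.4


def combined_spam_score(words, known):
    """Naive-Bayes style combination of per-word spam probabilities.

    known maps a word to its P(spam | word), strictly between 0 and 1.
    """
    log_spam = 0.0
    log_ham = 0.0
    for word in words:
        p = known.get(word, UNKNOWN_WORD_PROBABILITY)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Same as prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


known = {"viagra": 0.99, "free": 0.9}             # hypothetical trained values
plain = ["viagra", "free"]
padded = plain + ["xqzvw%d" % i for i in range(20)]  # 20 random junk 'words'

plain_score = combined_spam_score(plain, known)
padded_score = combined_spam_score(padded, known)
```

Under these assumptions the two-word message scores well above 0.9,
while the same message padded with twenty junk words falls below 0.5.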

Fourthly, I've been getting quite a few very short messages advertising
porn and other stuff. These are just too short (usually only 5-8 words
and a unique URL) to be caught by any filter.

Lastly, a Bayesian filter can never be accurate enough to entrust with
discarding of messages, at least for me. I'll only trust a pinpoint
filter for that, such as discarding messages that include a particular URL.
Even so, I'm discarding 70% of the incoming spam here.

My preference is to filter on the URLs (and in some cases, phone numbers
and snail mail addresses) that the spammers use for contacts. They can
hardly not provide a contact, and it often is something that would never
appear in a real mail. (Do you want mail that links to
"beefupyourp---s.com"? - hyphens inserted as this is a family newsgroup.
:-) It's possible to automate handling of links as well.
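
A pinpoint filter on contact URLs might look something like the Python
sketch below. It is only an illustration of the idea: the regular
expression, the function names, and the blocklist contents are all
assumptions, and it handles only http/https links.

```python
import re
from urllib.parse import urlsplit

# Matches http/https URLs in either plain text or HTML attribute values.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_domains(text):
    """Return the set of host names that the message links to."""
    domains = set()
    for url in URL_PATTERN.findall(text):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL (e.g. broken IPv6 literal): skip it
        if host:
            domains.add(host.lower())
    return domains


def is_blocked(host, blocked):
    """True if host is a blocked domain or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in blocked)


def should_discard(text, blocked):
    """True if any link in the message points at a blocked domain."""
    return any(is_blocked(host, blocked) for host in linked_domains(text))


blocked = {"spam-example.test"}  # hypothetical blocklist entry
spam = 'Click <a href="http://www.Spam-Example.test/offer">here</a>'
good = "The minutes are at http://example.org/minutes.html"
```

Because the match is on the exact domain the spammer needs people to
reach, should_discard(spam, blocked) is true while ordinary mail such
as the second message is left alone.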

Anyway, I'm no longer planning to write a Bayesian filter. I'm still
thinking about an unknown word filter, but I expect that to be
high-maintenance (and thus not for everyone).

             Randy.




