From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,e5c972d04da95d51
X-Google-Attributes: gid103376,public
X-Google-ArrivalTime: 2003-04-16 13:02:44 PST
Path: 
 archiver1.google.com!news1.google.com!newsfeed.stanford.edu!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!sn-xit-03!sn-xit-01!sn-post-01!supernews.com!corp.supernews.com!not-for-mail
From: "Randy Brukardt" <randy@rrsoftware.com>
Newsgroups: comp.lang.ada
Subject: Re: If anybody wants to make something in Ada but do not know what
Date: Wed, 16 Apr 2003 15:01:09 -0500
Organization: Posted via Supernews, http://www.supernews.com
Message-ID: <v9rdn4e8tej0d2@corp.supernews.com>
References: <slrnb9qkhg.2t6.randhol+news@kiuk0152.chembio.ntnu.no>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Newsreader: Microsoft Outlook Express 4.72.3612.1700
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3719.2500
X-Complaints-To: abuse@supernews.com
Xref: archiver1.google.com comp.lang.ada:36205
Date: 2003-04-16T15:01:09-05:00
List-Id: <comp.lang.ada>

Preben Randhol wrote in message ...
>Then perhaps a Bayesian Spam filter could be a nice challenge. Or if
>somebody is heading a university student project/diploma work, it could
>be a suitable project?


What kind of spam filter are you talking about? A filter for a server is
different in a number of ways than a filter for a mail client. And a
filter for an ISP or large company is different than a filter for a tiny
organization.

That said, an anti-spam filter written in Ada already exists: it's
called Trash Finder, and it works with the IMS mail server on Windows. I
haven't publicized it here precisely because no one here can use it. :-)
It is of course 100% in Ada, and it filters for literally dozens of
criteria -- after fully decoding and unfolding the message (a
significant percentage of spam is encoded). Among other things, it
filters on character sets, attachment types, violations of RFCs in the
mail format (spammers have a hard time following RFCs), specific HTML
features (forms, scripts, graphics, text outside of the markup, etc.),
From, To, Subjects, Text (without the HTML markup, which often can be
used to hide things), HTML markup, and (most recently) domains in URLs
given either in HTML markup or text.
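
As a rough illustration of the "decode and unfold first" step, here is
a minimal sketch in Python using the standard email package. It is not
Trash Finder's Ada code; decode_message and strip_markup are made-up
names for this example, and the tag stripping is deliberately crude.

```python
import email
import re
from email import policy


def decode_message(raw_bytes):
    """Parse a raw message and return its unfolded subject plus every
    text part with the transfer encoding (base64, quoted-printable)
    already undone."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    text_parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() decodes the transfer encoding and charset.
            text_parts.append((part.get_content_type(), part.get_content()))
    return {
        "subject": str(msg.get("Subject", "")),
        "text_parts": text_parts,
    }


def strip_markup(html):
    """Crude tag removal so the visible text can be filtered separately
    from the HTML markup (an assumption; real HTML needs a parser)."""
    return re.sub(r"<[^>]*>", "", html)
```

Filters then run against the decoded text rather than the raw bytes,
so an encoded body cannot hide its contents.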

It filters about 98% of the incoming spam on my system.

You'll note that there isn't a Bayesian filter. That has always been on
the 'wish list', but there are a variety of reasons I no longer expect it
to be very effective. I study a lot of spam each week in an effort to
find more ways to automatically discard spam without discarding good
mail, so I think I'm reasonably qualified to talk on this subject.

First of all, Bayesian filters are most effective the closer to the
client that they are. On the server, they have to filter everyone's mail,
and that necessarily means that they have to let more stuff through. For
example, 'casino' would be very unlikely to appear in my mail, but an
ISP could hardly block mail containing that word. So it doesn't make a
lot of sense to put such a filter on the server.

Secondly, most of the effectiveness of Bayesian filters has come from
the fact that they include the HTML markup in the text stored. Spammers
have figured that out, and are now sending a lot more plain text
messages. I've seen a number of spammers that as a matter of course send
an HTML version and a text version of the same message a couple of days
apart.

Thirdly, spammers have started sending random strings of junk (usually
placed so it won't display) as part of messages. Depending on the
filter, that can make a lot of messages look "OK" to a Bayesian filter,
because they often treat unknown words as unlikely to be spam. Even if
they don't do that, they tend to clog up the database with lots of junk
'words'.
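
To make the dilution effect concrete, here is a small Python sketch of
a naive Bayesian combiner. It assumes a filter that scores every token
and gives unknown words a mildly "innocent" probability; the 0.4 value
and the per-word probabilities are illustrative assumptions, not taken
from any particular filter.

```python
import math

# Assumed probability that a never-seen word indicates spam.
UNKNOWN_WORD_PROB = 0.4


def spam_score(tokens, word_probs):
    """Combine per-word spam probabilities with naive Bayes, working
    in log-odds space so long messages do not underflow."""
    log_odds = 0.0
    for token in tokens:
        p = word_probs.get(token, UNKNOWN_WORD_PROB)
        log_odds += math.log(p) - math.log(1.0 - p)
    # Clamp so math.exp cannot overflow on extreme messages.
    log_odds = max(-50.0, min(50.0, log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical training data: two strongly spammy words.
word_probs = {"casino": 0.99, "viagra": 0.95}

spam = ["casino", "viagra"]
junk = ["xqzvb%d" % i for i in range(20)]  # random 'words' never seen before

plain_score = spam_score(spam, word_probs)
padded_score = spam_score(spam + junk, word_probs)
```

With these numbers the two-word message scores well above 0.9, while
the same message padded with 20 junk strings falls below 0.5.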

Fourthly, I've been getting quite a few very short messages advertising
porn and other stuff. These are just too short (usually only 5-8 words
and a unique URL) to be caught by any filter.

Lastly, a Bayesian filter can never be accurate enough to entrust with
discarding of messages, at least for me. I'll only trust a pinpoint
filter for that, such as discarding messages that include a particular URL.
Even so, I'm discarding 70% of the incoming spam here.

My preference is to filter on the URLs (and in some cases, phone numbers
and snail mail addresses) that the spammers use for contacts. They can
hardly not provide a contact, and it often is something that would never
appear in a real mail. (Do you want mail that links to
"beefupyourp---s.com"? - hyphens inserted as this is a family newsgroup.
:-) It's possible to automate handling of links as well.
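
A minimal sketch of that kind of URL filter, in Python. The regular
expression, the function names, and the example domains are all
assumptions made for illustration; a real filter would also need to
handle obfuscated links.

```python
import re
from urllib.parse import urlsplit

# Simplified URL pattern: http(s) links up to whitespace, quote or bracket.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""", re.IGNORECASE)


def url_domains(text):
    """Return the set of lower-cased host names linked from text,
    whether the text is plain or HTML markup."""
    domains = set()
    for url in URL_RE.findall(text):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL; skip it
        if host:
            domains.add(host)
    return domains


def matches_blocklist(text, blocked):
    """True if any linked host, or one of its parent domains, is in
    the blocked set."""
    for host in url_domains(text):
        labels = host.split(".")
        for i in range(len(labels) - 1):
            if ".".join(labels[i:]) in blocked:
                return True
    return False
```

Checking parent domains means that blocking "spam-example.test" also
catches "www.spam-example.test" and any other host name under it.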

Anyway, I'm no longer planning to write a Bayesian filter. I'm still
thinking about an unknown word filter, but I expect that to be
high-maintenance (and thus not for everyone).

             Randy.