From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: If anybody wants to make something in Ada but do not know what
Date: Wed, 16 Apr 2003 15:01:09 -0500

Preben Randhol wrote in message ...
>Then perhaps a Bayesian Spam filter could be a nice challenge. Or if
>somebody are heading a university student project/diploma work it could
>a suitable project?

What kind of spam filter are you talking about? A filter for a server differs in a number of ways from a filter for a mail client. And a filter for an ISP or large company differs from a filter for a tiny organization.

That said, an anti-spam filter written in Ada already exists: it's called Trash Finder, and it works with the IMS mail server on Windows. I haven't publicized it here precisely because no one here can use it. :-) It is of course 100% in Ada, and it filters on literally dozens of criteria -- after fully decoding and unfolding the message (a significant percentage of spam is encoded).
Among other things, it filters on character sets, attachment types, violations of RFCs in the mail format (spammers have a hard time following RFCs), specific HTML features (forms, scripts, graphics, text outside of the markup, etc.), From, To, Subject, text (without the HTML markup, which often can be used to hide things), HTML markup, and (most recently) domains in URLs given either in HTML markup or text. It filters about 98% of the incoming spam on my system.

You'll note that there isn't a Bayesian filter. That has always been on the 'wish list', but there are a variety of reasons I no longer expect it to be very effective. I study a lot of spam each week in an effort to find more ways to automatically discard spam without discarding good mail, so I think I'm reasonably qualified to talk on this subject.

First of all, Bayesian filters are more effective the closer they are to the client. On the server, they have to filter everyone's mail, and that necessarily means that they have to let more stuff through. For example, 'casino' would be very unlikely to appear in my mail, but an ISP could hardly block mail containing that word. So it doesn't make a lot of sense to put such a filter on the server.

Secondly, most of the effectiveness of Bayesian filters has come from the fact that they include the HTML markup in the text stored. Spammers have figured that out, and are now sending a lot more plain text messages. I've seen a number of spammers that as a matter of course send an HTML version and a text version of the same message a couple of days apart.

Thirdly, spammers have started sending random strings of junk (usually placed so it won't display) as part of messages. Depending on the filter, that can make a lot of messages look "OK" to a Bayesian filter, because such filters often treat unknown words as unlikely to be spam. Even if they don't do that, the junk tends to clog up the database with lots of meaningless 'words'.
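
To make the point about random junk concrete, here is a minimal sketch in Python (not Ada, and not taken from Trash Finder). It assumes a deliberately simplified scorer that combines every token with the naive Bayes product rule and gives unseen tokens a default spam probability of 0.4. The token table, the 0.4 default, and the names `TOKEN_SPAM_PROB`, `UNKNOWN_PROB` and `spam_score` are all hypothetical; real filters differ, for instance by scoring only the most significant tokens.

```python
import math

# Hypothetical per-token spam probabilities, as if learned from training mail.
TOKEN_SPAM_PROB = {
    "casino": 0.99,
    "free": 0.90,
    "winner": 0.95,
    "meeting": 0.05,
}

# Assumption: tokens never seen in training lean slightly toward "not spam".
UNKNOWN_PROB = 0.4


def spam_score(tokens):
    """Combine per-token probabilities with the naive Bayes product rule."""
    log_spam = 0.0
    log_ham = 0.0
    for token in tokens:
        p = TOKEN_SPAM_PROB.get(token.lower(), UNKNOWN_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # P(spam) = prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    # The exponent is clamped so very long messages cannot overflow exp().
    diff = min(log_ham - log_spam, 700.0)
    return 1.0 / (1.0 + math.exp(diff))


plain = ["casino", "free", "winner"]
# The same message padded with 40 random junk "words" unknown to the filter.
padded = plain + ["xqzvb%d" % i for i in range(40)]

print("plain :", spam_score(plain))
print("padded:", spam_score(padded))
```

Under these assumptions the padded message scores far lower than the plain one, because each unknown token nudges the combined probability toward "not spam". A filter that also learned from such messages would end up storing every junk token in its database.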

Fourthly, I've been getting quite a few very short messages advertising porn and other stuff. These are just too short (usually only 5-8 words and a unique URL) to be caught by any filter.

Lastly, a Bayesian filter can never be accurate enough to entrust with discarding messages, at least for me. I'll only trust a pinpoint filter for that, such as discarding messages that include a particular URL. Even so, I'm discarding 70% of the incoming spam here.

My preference is to filter on the URLs (and in some cases, phone numbers and snail mail addresses) that the spammers use for contacts. They can hardly not provide a contact, and it often is something that would never appear in a real mail. (Do you want mail that links to "beefupyourp---s.com"? - hyphens inserted as this is a family newsgroup. :-) It's possible to automate handling of links as well.

Anyway, I'm no longer planning to write a Bayesian filter. I'm still thinking about an unknown word filter, but I expect that to be high-maintenance (and thus not for everyone).

Randy.
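
As an illustration of the URL-based "pinpoint" filtering described above, the following is a rough Python sketch. It is not how Trash Finder is implemented. It assumes the message body has already been decoded, and the blocklist entries and the names `BLOCKED_DOMAINS`, `extract_domains` and `is_blocked` are made up for the example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blocklist of domains seen in spam contact links.
BLOCKED_DOMAINS = {"spam-example.test", "pills-example.test"}

# Matches http/https URLs in plain text or inside HTML attributes;
# stops at whitespace, quotes and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_domains(body):
    """Return the set of lowercased host names found in URLs in the body."""
    domains = set()
    for url in URL_PATTERN.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URL (for example a broken IPv6 literal): skip it.
            continue
        if host:
            domains.add(host.lower())
    return domains


def is_blocked(body):
    """True if any URL host is a blocked domain or a subdomain of one."""
    for host in extract_domains(body):
        for blocked in BLOCKED_DOMAINS:
            if host == blocked or host.endswith("." + blocked):
                return True
    return False


html_spam = 'Click <a href="http://www.spam-example.test/buy?id=1">here</a>'
good_mail = "The notes are at http://www.example.org/notes.html"

print(is_blocked(html_spam))
print(is_blocked(good_mail))
```

Matching on the host name rather than the full URL means a per-message unique path or query string does not defeat the filter, and the same check works whether the link appears in HTML markup or in plain text.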