From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: If anybody wants to make something in Ada but do not know what
Date: Thu, 17 Apr 2003 17:58:23 -0500
Organization: Posted via Supernews, http://www.supernews.com

Wesley Groleau wrote in message ...
>> First of all, Bayesian filters are most effective the closer to the
>> client that they are. On the server, they have to filter everyone's mail,
>> and that necessarily means that they have to let more stuff through. For
>
>Not necessarily. The one I proposed does the filtering on the server
>based on feedback from the addressee. In other words, each user would
>have his/her dedicated statistical DB.

I don't think that would work. The word list for the AdaIC search engine
is 270,000 words, and it takes up 6 MB. The database of counts for a
Bayesian filter would take at least two counters (presumably 32-bit,
although 24-bit would probably be enough) for each word. That's at least
2 MB. The word list could be shared, but the DB certainly could not. (And
I'd expect the word list to be much larger than that, given the random
words and the puree approach of the Graham filter.)

For a very small server like mine, that might not be a big deal. But some
of the people running Trash Finder have servers handling 50,000 messages
per day with more than 1,000 users. The overhead of scanning and scoring
all of those messages would be very high, especially as the DBs would be
too large to stay in the machine's cache.

>> Thirdly, spammers have started sending random strings of junk (usually
>> placed so it won't display) as part of messages. Depending on the
>> filter, that can make a lot of messages look "OK" to a Bayesian filter,
>> because they often treat unknown words as unlikely to be spam. Even if
>> they don't do that, they tend to clog up the database with lots of junk
>> 'words'.
>
>The way some implementations work, these tricks won't work.
>For example, Paul Graham's algorithm only looks at the strongest
>indicators at both ends. If a spammer puts in a lot of random
>words, they won't be consistent and will not have much weight.
>If the spammer puts in the same words all the time, and these words
>are common in non-spam, they will not be strong indicators and
>won't be used. If they are not common in non-spam, they will
>catch the spam.

That might prevent spam from passing, but it does nothing to avoid the
overhead. The problem is that in order to find the strongest indicators,
you have to score every 'word' in the message.
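To make that concrete, here is a rough sketch, in Ada, of the scoring
pass a Graham-style filter has to make over every message. The package
name, the Token_Array type, and the generic Token_Probability hook are
all illustrative only; this is not code from Trash Finder or from
Graham's article.

--  graham_scoring.ads
with Ada.Strings.Unbounded;

generic
   --  Looks one token up in the per-user counter database and returns
   --  its spam probability; tokens never seen before get a neutral
   --  default (Graham's article uses 0.4).
   with function Token_Probability (Token : String) return Float;
package Graham_Scoring is

   type Token_Array is
     array (Positive range <>) of Ada.Strings.Unbounded.Unbounded_String;

   --  Score a whole message: every token is looked up, then only the
   --  15 probabilities farthest from 0.5 are combined with Bayes' rule.
   function Message_Probability (Tokens : Token_Array) return Float;

end Graham_Scoring;

--  graham_scoring.adb
package body Graham_Scoring is

   function Message_Probability (Tokens : Token_Array) return Float is

      --  The 15 most "interesting" probabilities seen so far, kept
      --  sorted by distance from 0.5 (0.5 itself is neutral filler).
      Best : array (1 .. 15) of Float := (others => 0.5);

      Prod_P     : Float := 1.0;
      Prod_Not_P : Float := 1.0;

      function Interest (P : Float) return Float is
      begin
         return abs (P - 0.5);
      end Interest;

   begin
      --  One database lookup per token: this part happens no matter
      --  how much junk the message contains.
      for I in Tokens'Range loop
         declare
            P : constant Float :=
              Token_Probability
                (Ada.Strings.Unbounded.To_String (Tokens (I)));
         begin
            --  Keep P only if it is more interesting than something
            --  already in the small sorted buffer.
            for J in Best'Range loop
               if Interest (P) > Interest (Best (J)) then
                  Best (J + 1 .. Best'Last) := Best (J .. Best'Last - 1);
                  Best (J) := P;
                  exit;
               end if;
            end loop;
         end;
      end loop;

      --  Combine only the strongest indicators, using Bayes' rule.
      for J in Best'Range loop
         Prod_P     := Prod_P * Best (J);
         Prod_Not_P := Prod_Not_P * (1.0 - Best (J));
      end loop;

      return Prod_P / (Prod_P + Prod_Not_P);
   end Message_Probability;

end Graham_Scoring;

The expensive part is the outer loop, not the arithmetic: every token
costs a hash and a database lookup before the fifteen survivors are even
known.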
When a lot of trash words are in the message, you have to allocate new
words and new counters for them; and when there are a lot of such
messages, the size of the DB grows rapidly. (We saw this happen in the
search engine when we accidentally indexed some Unix .lib files.) That
adds overhead; a lot of overhead for a filter like mine, which gets
invoked on each message individually. (Writing out the word list each
time is expensive.)

>> Lastly, a Bayesian filter can never be accurate enough to entrust with
>> discarding of messages, at least for me. I'll only trust a pinpoint
>> filter for that, such as discarding names that include a particular URL.
>> Even so, I'm discarding 70% of the incoming spam here.
>
>My _limited_ tests with a Bayesian filter had no false negatives or false
>positives. And the 'net being what it is, an occasional message gets
>lost somehow anyway. Besides, no filter is required to discard anything.
>
>I would just like the presumed spam messages stored on the server
>until I say trash them (or until I ignore them for some length
>of time).

I used to do that, but I discovered that I was spending more than an hour
each Monday and a half hour every other day going through them and
getting rid of them. That's silly. I don't mind occasionally quarantining
a good message, but deleting it is a no-no. For servers that get 50,000
messages a day, manual deletion of the 20% that is spam is impractical.

>> My preference is to filter on the URLs (and in some cases, phone numbers
>> and snail mail addresses) that the spammers use for contacts.
>
>But a Bayesian filter can do stats on that as well.

Only with a lot more overhead and less precision. And overhead matters if
you're processing 50,000 messages a day (or even 500 on a machine that's
also running a search engine, a web server, and soon, a software update
server).

The Graham filter is especially bad with overhead, because it just uses
everything about the message as it comes. A lot of spam is encoded in
various ways, and that greatly increases the number of versions of the
same word. And if the spammer hits on a new encoding, the spam will go
through.

Another overhead issue with Bayesian filters is that you have to scan the
message twice: once to classify it, and a second time to store the
counters appropriately. (And you have to have a way to handle false
positives and false negatives.) You can trade off memory vs. CPU time in
various ways, but it would be preferable to do neither.

Anyway, it's best that multiple approaches be used to filter spam. If
everybody did it the same way, the spammers would find it easier to get
their trash through!

Randy.
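P.S. For completeness, a sketch of the other half of the bookkeeping: the
per-user counter database that a Token_Probability function would read,
and the per-token update made on the second pass, once the verdict is in.
Again, the names are illustrative only, not Trash Finder code; the map is
just the standard hashed-map container, and any hash table would do.

--  counter_db.ads
with Ada.Containers.Indefinite_Hashed_Maps;
with Ada.Strings.Hash;

package Counter_DB is

   type Counters is record
      Good : Natural := 0;   --  occurrences in legitimate mail
      Spam : Natural := 0;   --  occurrences in spam
   end record;
   --  Two counters per distinct token: at 270,000+ tokens that is a
   --  couple of megabytes per user, and it cannot be shared.

   package Token_Maps is new Ada.Containers.Indefinite_Hashed_Maps
     (Key_Type        => String,
      Element_Type    => Counters,
      Hash            => Ada.Strings.Hash,
      Equivalent_Keys => "=");

   --  Called once per token on the second pass, after the verdict is
   --  known (including corrections of false positives and negatives).
   --  Tokens never seen before (a spammer's random junk included) get
   --  brand-new entries, which is how the database grows; afterwards
   --  the whole thing has to be written back out.
   procedure Bump (DB       : in out Token_Maps.Map;
                   Token    : String;
                   Was_Spam : Boolean);

end Counter_DB;

--  counter_db.adb
package body Counter_DB is

   procedure Bump (DB       : in out Token_Maps.Map;
                   Token    : String;
                   Was_Spam : Boolean)
   is
      Cnt : Counters;   --  starts at (0, 0) for an unseen token
   begin
      if DB.Contains (Token) then
         Cnt := DB.Element (Token);
      end if;
      if Was_Spam then
         Cnt.Spam := Cnt.Spam + 1;
      else
         Cnt.Good := Cnt.Good + 1;
      end if;
      DB.Include (Token, Cnt);   --  insert new, or replace existing
   end Bump;

end Counter_DB;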