From: Wesley Groleau
Reply-To: wesgroleau@despammed.com
Newsgroups: comp.lang.ada
Subject: Re: If anybody wants to make something in Ada but do not know what
Date: Wed, 16 Apr 2003 18:21:17 -0500

> First of all, Bayesian filters are most effective the closer to the
> client that they are.
> On the server, they have to filter everyone's mail,
> and that necessarily means that they have to let more stuff through. For

Not necessarily.  The one I proposed does the filtering on the server,
based on feedback from the addressee.  In other words, each user would
have his/her dedicated statistical DB.

> Secondly, most of the effectiveness of Bayesian filters have come from
> the fact that they include the HTML markup in the text stored. Spammers
> have figured that out, and are now sending a lot more plain text

> Thirdly, spammers have started sending random strings of junk (usually
> placed so it won't display) as part of messages. Depending on the
> filter, that can make a lot of messages look "OK" to a Bayesian filter,
> because they often treat unknown words as unlikely to be spam. Even if
> they don't do that, they tend to clog up the database with lots of junk
> 'words'.

Given the way some implementations work, these tricks won't succeed.
For example, Paul Graham's algorithm only looks at the strongest
indicators at both ends.  If a spammer puts in a lot of random words,
they won't be consistent and will not have much weight.  If the spammer
puts in the same words all the time, and these words are common in
non-spam, they will not be strong indicators and won't be used.  If
they are not common in non-spam, they will catch the spam.

> Lastly, a Bayesian filter can never be accurate enough to entrust with
> discarding of messages, at least for me. I'll only trust a pinpoint
> filter for that, such as discarding names that include a particular URL.
> Even so, I'm discarding 70% of the incoming spam here.

My _limited_ tests with a Bayesian filter had no false negatives or
false positives.  And the 'net being what it is, an occasional message
gets lost somehow anyway.

Besides, no filter is required to discard anything.  I would just like
the presumed spam messages stored on the server until I say trash them
(or until I ignore them for some length of time).
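
To make the earlier point about the "strongest indicators" concrete,
here is a minimal Python sketch of a per-user filter, loosely following
the scheme in Paul Graham's "A Plan for Spam".  The class name
UserFilter, the tokenizer, and the use of one set of tokens per message
are my own simplifications.  The constants (double-weighted good
counts, a minimum of 5 occurrences, 0.4 for unknown words, clamping to
0.01/0.99, the 15 most interesting tokens) are as I recall them from
that essay and should be treated as assumptions, not as a reference
implementation.

```python
import re
from collections import Counter

# Hypothetical tokenizer: letters, digits, and a few punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokens(text):
    """Return the set of distinct lower-cased tokens in a message."""
    return set(TOKEN_RE.findall(text.lower()))


class UserFilter:
    """Statistics for one addressee (one instance per user)."""

    def __init__(self):
        self.good = Counter()   # token -> number of non-spam messages
        self.bad = Counter()    # token -> number of spam messages
        self.ngood = 0
        self.nbad = 0

    def train(self, text, is_spam):
        """Record the addressee's verdict on one message."""
        if is_spam:
            self.bad.update(tokens(text))
            self.nbad += 1
        else:
            self.good.update(tokens(text))
            self.ngood += 1

    def token_prob(self, tok):
        """Estimated probability that a message containing tok is spam."""
        g = 2 * self.good[tok]      # good counts weighted double
        b = self.bad[tok]
        if g + b < 5:
            return 0.4              # rarely seen or unknown: mildly innocent
        gf = min(1.0, g / max(1, self.ngood))
        bf = min(1.0, b / max(1, self.nbad))
        return max(0.01, min(0.99, bf / (gf + bf)))

    def score(self, text, n=15):
        """Combine only the n tokens whose probability is furthest from 0.5."""
        probs = sorted((self.token_prob(t) for t in tokens(text)),
                       key=lambda p: abs(p - 0.5), reverse=True)[:n]
        spam = ham = 1.0
        for p in probs:
            spam *= p
            ham *= 1.0 - p
        return spam / (spam + ham)
```

Random junk words all land at 0.4, so they sort behind any token the
user's own mail has made a strong indicator, and they only fill
whatever slots are left over.  A score above some threshold (0.9 in
Graham's essay, if I remember right) would mark the message as
presumed spam.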
Ideally, have the filter put the subject lines in an e-mail to me,
containing a CGI form with two choices:

 - trash all of them
 - send an individual choice message

The individual choice message would let me select specific messages
to be kept.

> My preference is to filter on the URLs (and in some cases, phone numbers
> and snail mail addresses) that the spammers use for contacts.

But a Bayesian filter can do stats on that as well.
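
As a rough illustration of that last point, a statistical filter only
needs the contact URLs turned into tokens; after that they are counted
like any other word.  The sketch below is an assumption about how one
might do it, not a description of any existing filter: the function
name contact_tokens and the "url:" prefix are made up, and it handles
only http/https URLs, not phone numbers or postal addresses.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: anything that starts with http:// or https://
# and runs up to whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def contact_tokens(text):
    """Return tokens such as 'url:example.com' for each URL host in text."""
    found = set()
    for url in URL_RE.findall(text):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue                # malformed URL, e.g. a bad IPv6 literal
        if host:
            # Drop a trailing dot picked up from sentence punctuation.
            found.add("url:" + host.rstrip("."))
    return found
```

These tokens would simply be merged with the ordinary word tokens
before training and scoring, so a host name that shows up only in spam
becomes a strong indicator on its own.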