From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-0.3 required=5.0 tests=BAYES_00,FREEMAIL_FROM,
	REPLYTO_WITHOUT_TO_CC autolearn=no autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,e5c972d04da95d51
X-Google-Attributes: gid103376,public
X-Google-ArrivalTime: 2003-04-16 16:21:12 PST
Path: 
 archiver1.google.com!news1.google.com!newsfeed.stanford.edu!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!small1.nntp.aus1.giganews.com!nntp.giganews.com!nntp3.aus1.giganews.com!nntp.gbronline.com!news.gbronline.com.POSTED!not-for-mail
NNTP-Posting-Date: Wed, 16 Apr 2003 18:21:10 -0500
Date: Wed, 16 Apr 2003 18:21:17 -0500
From: Wesley Groleau <wesgroleau@despammed.com>
Reply-To: wesgroleau@despammed.com
Organization: Ain't no organization here!
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US;
 rv:1.2.1) Gecko/20021130
X-Accept-Language: en-us, en, es-mx, pt-br, fr-ca
MIME-Version: 1.0
Newsgroups: comp.lang.ada
Subject: Re: If anybody wants to make something in Ada but do not know what
References: <slrnb9qkhg.2t6.randhol+news@kiuk0152.chembio.ntnu.no>
 <v9rdn4e8tej0d2@corp.supernews.com>
In-Reply-To: <v9rdn4e8tej0d2@corp.supernews.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Message-ID: <VZ2dndUZ8_77eACjXTWcpA@gbronline.com>
NNTP-Posting-Host: 216.117.18.101
X-Trace: 
 sv3-7HsCRW/IrYiQOcsZ2+QMZv3Qdt5xGkz1dzHxACz1z7zDqz5ThAksedgj/AatflYM09cUjzKdSpVaJW3!Gz/M73jAkUtf+ZJfZaSP97v51gUvo1Lg2SswAeYdr7wtlospAEg7OHoOFp/voyI7E7OOaHUgHlEx!YhXgoA==
X-Complaints-To: abuse@gbronline.com
X-DMCA-Complaints-To: abuse@gbronline.com
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint
 properly
X-Postfilter: 1.1
Xref: archiver1.google.com comp.lang.ada:36218
Date: 2003-04-16T18:21:17-05:00
List-Id: <comp.lang.ada>


> First of all, Bayesian filters are most effective the closer to the
> client that they are. On the server, they have to filter everyone's mail,
> and that necessarily means that they have to let more stuff through. For

Not necessarily.  The one I proposed does the filtering on the server
based on feedback from the addressee.  In other words, each user would
have his/her dedicated statistical DB.
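
A rough sketch of what "one statistical DB per user" could look like
on the server.  The class and method names are made up for
illustration, and a real server would persist the counts rather than
keep them in memory:

```python
from collections import Counter, defaultdict


class PerUserStats:
    """Keeps separate spam/ham token counts for every addressee."""

    def __init__(self):
        # user -> {"spam": Counter, "ham": Counter}
        self.counts = defaultdict(lambda: {"spam": Counter(), "ham": Counter()})

    def feedback(self, user, tokens, is_spam):
        """Record one user's verdict on one message."""
        label = "spam" if is_spam else "ham"
        # Count each token once per message.
        self.counts[user][label].update(set(tokens))

    def seen(self, user, token):
        """Return (spam_count, ham_count) from this user's DB only."""
        db = self.counts[user]
        return db["spam"][token], db["ham"][token]
```

The point is only that one user's feedback never touches another
user's counts, so the server is not forced into a one-size-fits-all
threshold.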

> Secondly, most of the effectiveness of Bayesian filters has come from
> the fact that they include the HTML markup in the text stored. Spammers
> have figured that out, and are now sending a lot more plain text

> Thirdly, spammers have started sending random strings of junk (usually
> placed so it won't display) as part of messages. Depending on the
> filter, that can make a lot of messages look "OK" to a Bayesian filter,
> because they often treat unknown words as unlikely to be spam. Even if
> they don't do that, they tend to clog up the database with lots of junk
> 'words'.

In some implementations, these tricks won't work.
For example, Paul Graham's algorithm only looks at the strongest
indicators at both ends.  If a spammer puts in a lot of random
words, they won't be consistent and will not have much weight.
If the spammer puts in the same words all the time, and these words
are common in non-spam, they will not be strong indicators and
won't be used.  If they are not common in non-spam, they will
catch the spam.
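
To make the "strongest indicators" point concrete, here is a minimal
sketch of the selection-and-combination step as I understand it from
Graham's "A Plan for Spam".  It assumes the per-token probabilities
have already been computed and clamped to 0.01..0.99; the function
name and the token_prob dictionary are my own, not Graham's code:

```python
from math import prod

UNKNOWN = 0.4  # probability Graham assigns to never-seen tokens
TOP_N = 15     # number of "most interesting" tokens he combines


def spam_probability(tokens, token_prob, top_n=TOP_N):
    """Combine only the top_n tokens whose probability is farthest from 0.5."""
    probs = [token_prob.get(t, UNKNOWN) for t in set(tokens)]
    # Strongest indicators (at either end) first.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    strongest = probs[:top_n]
    spam = prod(strongest)
    ham = prod(1.0 - p for p in strongest)
    return spam / (spam + ham)
```

A random junk word scores 0.4, only 0.1 away from neutral, so it is
used only if the message has fewer than top_n stronger tokens.  In a
message that already has enough strong indicators, padding it with
junk does not change the result at all.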

> Lastly, a Bayesian filter can never be accurate enough to entrust with
> discarding of messages, at least for me. I'll only trust a pinpoint
> filter for that, such as discarding names that include a particular URL.
> Even so, I'm discarding 70% of the incoming spam here.

My _limited_ tests with a Bayesian filter had no false negatives or false 
positives.  And the 'net being what it is, an occasional message gets
lost somehow anyway.  Besides, no filter is required to discard anything.

I would just like the presumed spam messages stored on the server
until I say trash them (or until I ignore them for some length
of time).  Ideally, have the filter put the subject lines in an
e-mail to me, containing a CGI form with two choices:
  - trash all of them
  - send an individual choice message
The individual choice message would let me select specific messages
to be kept.
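
The bookkeeping behind that could be as small as the sketch below.
Everything in it is hypothetical (the Quarantine class, the message
ids), and it leaves out the CGI form itself and the time-out:

```python
from dataclasses import dataclass, field


@dataclass
class Quarantine:
    """Presumed spam held on the server for one user."""

    held: dict = field(default_factory=dict)  # message id -> subject line

    def hold(self, msg_id, subject):
        self.held[msg_id] = subject

    def digest(self):
        """Body of the summary e-mail: one subject line per held message."""
        return "\n".join(f"[{mid}] {subj}" for mid, subj in self.held.items())

    def trash_all(self):
        """First choice on the form: discard everything held."""
        self.held.clear()

    def keep_only(self, keep_ids):
        """Individual choice: release the selected messages, trash the rest."""
        released = {m: s for m, s in self.held.items() if m in keep_ids}
        self.held.clear()
        return released
```

Nothing is discarded until the user answers, which is the whole point.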

> My preference is to filter on the URLs (and in some cases, phone numbers
> and snail mail addresses) that the spammers use for contacts.

But a Bayesian filter can do stats on that as well.
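
For instance, the tokenizer can emit each contact URL's host as a
token of its own, so the statistics pick it up like any other word.
A minimal sketch, with a deliberately simple URL pattern and a
function name and "url:" prefix that are just my choices; phone
numbers and postal addresses would need their own patterns:

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s<>]+", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z$'-]+")


def tokens_with_urls(text):
    """Return one 'url:<host>' token per link, followed by the word tokens."""
    tokens = []
    for url in URL_RE.findall(text):
        host = urlsplit(url).hostname  # lower-cased by urlsplit
        if host:
            tokens.append("url:" + host)
    # Ordinary words, taken from the text with the URLs blanked out.
    tokens.extend(WORD_RE.findall(URL_RE.sub(" ", text).lower()))
    return tokens
```

A host that shows up only in spam then becomes one of the strongest
indicators in the database without any hand-written rule.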