From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: If anybody wants to make something in Ada but do not know what
Date: Thu, 17 Apr 2003 17:58:23 -0500
Organization: Posted via Supernews, http://www.supernews.com

Wesley Groleau wrote in message ...
>> First of all, Bayesian filters are most effective the closer to the
>> client that they are. On the server, they have to filter everyone's mail,
>> and that necessarily means that they have to let more stuff through. For
>
>Not necessarily. The one I proposed does the filtering on the server
>based on feedback from the addressee. In other words, each user would
>have his/her dedicated statistical DB.

I don't think that would work. The word list for the AdaIC search engine
is 270,000 words, and it takes up 6 MB. The database of counts for a
Bayesian filter would take at least two counters (presumably 32-bit,
although 24-bit would probably be enough) for each word. That's at least
2 MB. The word list could be shared, but the DB certainly could not. (And
I'd expect the word list to be much larger than that, given the random
words and the puree approach of the Graham filter.)

For a very small server like mine, that might not be a big deal. But some
of the people running Trash Finder have servers handling 50,000 messages
per day with more than 1,000 users. The overhead of scanning and scoring
all of those messages would be very high, especially as the DBs would be
too large to stay in the machine's cache.

>> Thirdly, spammers have started sending random strings of junk (usually
>> placed so it won't display) as part of messages. Depending on the
>> filter, that can make a lot of messages look "OK" to a Bayesian filter,
>> because they often treat unknown words as unlikely to be spam. Even if
>> they don't do that, they tend to clog up the database with lots of junk
>> 'words'.
>
>The way some implementations work, these tricks won't work.
>For example, Paul Graham's algorithm only looks at the strongest
>indicators at both ends. If a spammer puts in a lot of random
>words, they won't be consistent and will not have much weight.
>If the spammer puts in the same words all the time, and these words
>are common in non-spam, they will not be strong indicators and
>won't be used. If they are not common in non-spam, they will
>catch the spam.

That might prevent spam from passing, but it does nothing to avoid the
overhead. The problem is that in order to find the strongest indicators,
you have to score every 'word' in the message.
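To make that concrete, here is a rough sketch, in Ada, of the scoring
pass a Graham-style filter has to make over every message. The package
name, the Token_Array type, and the generic Token_Probability hook are
all illustrative only; this is not code from Trash Finder or from
Graham's article.

--  graham_scoring.ads
with Ada.Strings.Unbounded;

generic
   --  Looks one token up in the per-user counter database and returns
   --  its spam probability; tokens never seen before get a neutral
   --  default (Graham's article uses 0.4).
   with function Token_Probability (Token : String) return Float;
package Graham_Scoring is

   type Token_Array is
     array (Positive range <>) of Ada.Strings.Unbounded.Unbounded_String;

   --  Score a whole message: every token is looked up, then only the
   --  15 probabilities farthest from 0.5 are combined with Bayes' rule.
   function Message_Probability (Tokens : Token_Array) return Float;

end Graham_Scoring;

--  graham_scoring.adb
package body Graham_Scoring is

   function Message_Probability (Tokens : Token_Array) return Float is

      --  The 15 most "interesting" probabilities seen so far, kept
      --  sorted by distance from 0.5 (0.5 itself is neutral filler).
      Best : array (1 .. 15) of Float := (others => 0.5);

      Prod_P     : Float := 1.0;
      Prod_Not_P : Float := 1.0;

      function Interest (P : Float) return Float is
      begin
         return abs (P - 0.5);
      end Interest;

   begin
      --  One database lookup per token: this part happens no matter
      --  how much junk the message contains.
      for I in Tokens'Range loop
         declare
            P : constant Float :=
              Token_Probability
                (Ada.Strings.Unbounded.To_String (Tokens (I)));
         begin
            --  Keep P only if it is more interesting than something
            --  already in the small sorted buffer.
            for J in Best'Range loop
               if Interest (P) > Interest (Best (J)) then
                  Best (J + 1 .. Best'Last) := Best (J .. Best'Last - 1);
                  Best (J) := P;
                  exit;
               end if;
            end loop;
         end;
      end loop;

      --  Combine only the strongest indicators, using Bayes' rule.
      for J in Best'Range loop
         Prod_P     := Prod_P * Best (J);
         Prod_Not_P := Prod_Not_P * (1.0 - Best (J));
      end loop;

      return Prod_P / (Prod_P + Prod_Not_P);
   end Message_Probability;

end Graham_Scoring;

The expensive part is the outer loop, not the arithmetic: every token
costs a hash and a database lookup before the fifteen survivors are even
known.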
When a lot of trash words are in the message, you have to allocate new
words and new counters for them; and when there are a lot of such
messages, the size of the DB grows rapidly. (We saw this happen in the
search engine when we accidentally indexed some Unix .lib files.) That
adds overhead; a lot of overhead for a filter like mine, which gets
invoked on each message individually. (Writing out the word list each
time is expensive.)

>> Lastly, a Bayesian filter can never be accurate enough to entrust with
>> discarding of messages, at least for me. I'll only trust a pinpoint
>> filter for that, such as discarding names that include a particular URL.
>> Even so, I'm discarding 70% of the incoming spam here.
>
>My _limited_ tests with a Bayesian filter had no false negatives or false
>positives. And the 'net being what it is, an occasional message gets
>lost somehow anyway. Besides, no filter is required to discard anything.
>
>I would just like the presumed spam messages stored on the server
>until I say trash them (or until I ignore them for some length
>of time).

I used to do that, but I discovered that I was spending more than an hour
each Monday and a half hour every other day going through them and
getting rid of them. That's silly. I don't mind occasionally quarantining
a good message, but deleting it is a no-no. For servers that get 50,000
messages a day, manual deletion of the 20% that is spam is impractical.

>> My preference is to filter on the URLs (and in some cases, phone numbers
>> and snail mail addresses) that the spammers use for contacts.
>
>But a Bayesian filter can do stats on that as well.

Only with a lot more overhead and less precision. And overhead matters if
you're processing 50,000 messages a day (or even 500 on a machine that's
also running a search engine, a web server, and soon, a software update
server).

The Graham filter is especially bad with overhead, because it just uses
everything about the message as it comes. A lot of spam is encoded in
various ways, and that greatly increases the number of versions of the
same word. And if the spammer hits on a new encoding, the spam will go
through.

Another overhead issue with Bayesian filters is that you have to scan the
message twice: once to classify it, and a second time to store the
counters appropriately. (And you have to have a way to handle false
positives and false negatives.) You can trade off memory vs. CPU time in
various ways, but it would be preferable to do neither.

Anyway, it's best that multiple approaches be used to filter spam. If
everybody did it the same way, the spammers would find it easier to get
their trash through!

Randy.
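P.S. For completeness, a sketch of the other half of the bookkeeping: the
per-user counter database that a Token_Probability function would read,
and the per-token update made on the second pass, once the verdict is in.
Again, the names are illustrative only, not Trash Finder code; the map is
just the standard hashed-map container, and any hash table would do.

--  counter_db.ads
with Ada.Containers.Indefinite_Hashed_Maps;
with Ada.Strings.Hash;

package Counter_DB is

   type Counters is record
      Good : Natural := 0;   --  occurrences in legitimate mail
      Spam : Natural := 0;   --  occurrences in spam
   end record;
   --  Two counters per distinct token: at 270,000+ tokens that is a
   --  couple of megabytes per user, and it cannot be shared.

   package Token_Maps is new Ada.Containers.Indefinite_Hashed_Maps
     (Key_Type        => String,
      Element_Type    => Counters,
      Hash            => Ada.Strings.Hash,
      Equivalent_Keys => "=");

   --  Called once per token on the second pass, after the verdict is
   --  known (including corrections of false positives and negatives).
   --  Tokens never seen before (a spammer's random junk included) get
   --  brand-new entries, which is how the database grows; afterwards
   --  the whole thing has to be written back out.
   procedure Bump (DB       : in out Token_Maps.Map;
                   Token    : String;
                   Was_Spam : Boolean);

end Counter_DB;

--  counter_db.adb
package body Counter_DB is

   procedure Bump (DB       : in out Token_Maps.Map;
                   Token    : String;
                   Was_Spam : Boolean)
   is
      Cnt : Counters;   --  starts at (0, 0) for an unseen token
   begin
      if DB.Contains (Token) then
         Cnt := DB.Element (Token);
      end if;
      if Was_Spam then
         Cnt.Spam := Cnt.Spam + 1;
      else
         Cnt.Good := Cnt.Good + 1;
      end if;
      DB.Include (Token, Cnt);   --  insert new, or replace existing
   end Bump;

end Counter_DB;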