From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,e5c972d04da95d51
X-Google-Attributes: gid103376,public
X-Google-ArrivalTime: 2003-04-18 23:29:02 PST
Path: 
 archiver1.google.com!news1.google.com!newsfeed.stanford.edu!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!newsfeed.icl.net!newsfeed.fjserv.net!colt.net!news.tele.dk!news.tele.dk!small.news.tele.dk!newsfeed1.e.nsc.no!nsc.no!nextra.com!news2.e.nsc.no.POSTED!53ab2750!not-for-mail
From: "Tarjei T. Jensen" <tarjei@online.no>
Newsgroups: comp.lang.ada
References: <slrnb9qkhg.2t6.randhol+news@kiuk0152.chembio.ntnu.no>
 <v9rdn4e8tej0d2@corp.supernews.com> <VZ2dndUZ8_77eACjXTWcpA@gbronline.com>
 <v9ucatrcl1ik89@corp.supernews.com>
Subject: Re: If anybody wants to make something in Ada but do not know what
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2800.1106
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
Message-ID: <O06oa.5030$8g5.77428@news2.e.nsc.no>
NNTP-Posting-Host: 130.67.226.24
X-Complaints-To: news-abuse@telenor.net
NNTP-Posting-Date: Sat, 19 Apr 2003 08:29:02 MEST
X-Trace: news2.ulv.nextra.no 1050733742 130.67.226.24
Date: Sat, 19 Apr 2003 08:28:59 +0200
Xref: archiver1.google.com comp.lang.ada:36302
Date: 2003-04-19T08:28:59+02:00
List-Id: <comp.lang.ada>

Randy Brukardt wrote:
> That might prevent passing spam, but it does nothing to avoid the
> overhead. The problem is in order to find out the strongest indicator,
> you have to score every 'word' in the message. When a lot of trash words
> are in the message, you have to allocate new words and new counters for
> them; and when there are a lot of such messages, the size of the DB
> grows rapidly. (We saw this happen in the search engine when we
> accidentally indexed some Unix .lib files.) That adds overhead; a lot
> of overhead for a filter like mine which gets invoked on each message
> individually. (Writing out the word list each time is expensive.)

Why not do it another way? Check every URL in the message. If any of them
points to a known porn/spam server, mark the message as suspect. Then do some
processing on what is left of the text.
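
To make the idea concrete, here is a rough sketch in Python (not Ada, purely
for brevity). The blocklist contents, the regular expression and the function
names are all my own assumptions for illustration; a real filter would load
its list of known servers from wherever the site maintains it.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blocklist of known spam hosts (placeholder names only).
KNOWN_BAD_HOSTS = {"spam.example", "ads.example"}

# Deliberately simple pattern: only plain http/https URLs are recognised.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def hosts_in(body):
    """Return the set of lower-cased host names of all URLs found in body."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, ignore it
        if host:
            hosts.add(host.rstrip("."))
    return hosts


def is_suspect(body, bad_hosts=KNOWN_BAD_HOSTS):
    """True if any URL points at a listed host or one of its subdomains."""
    return any(
        host == bad or host.endswith("." + bad)
        for host in hosts_in(body)
        for bad in bad_hosts
    )


def strip_urls(body):
    """Remove the URLs so that later processing only sees the remaining text."""
    return URL_RE.sub(" ", body)
```

A message for which is_suspect() returns True gets marked; either way,
strip_urls() gives the text that is left for further processing.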

One could also derive some sort of unique signature from the mail and compare
it with the signatures of other messages received. If a lot of messages share
the same signature, they are likely to be spam. Known mailing lists would of
course be excluded. The only problem is generating a signature that is not
trivial to evade. Preferably there should be a number of signature
algorithms to choose from, so that it becomes difficult to optimize the mail
for all of them, since the spammer can't know which algorithm is in use on any
given day or hour. The algorithm would of course be chosen arbitrarily at
each site.
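
A minimal sketch of the signature scheme, again in Python and again under my
own assumptions: the three signature functions, the BulkDetector class and the
threshold of 3 are invented for illustration, and SHA-1 is used only as a
convenient digest. These particular signatures are easy to evade; they are
only meant to show the shape of the thing.

```python
import hashlib
import re
from collections import Counter


def _digest(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def sig_normalized(body):
    """Signature of the lower-cased text with whitespace runs collapsed."""
    return _digest(" ".join(body.lower().split()))


def sig_letters_only(body):
    """Signature of the letters a-z only; digits and punctuation are dropped."""
    return _digest(re.sub(r"[^a-z]", "", body.lower()))


def sig_sorted_words(body):
    """Signature of the sorted set of distinct words, ignoring their order."""
    return _digest(" ".join(sorted(set(re.findall(r"[a-z]+", body.lower())))))


# Each site picks one of these, and may switch whenever it likes.
ALGORITHMS = [sig_normalized, sig_letters_only, sig_sorted_words]


class BulkDetector:
    """Counts signatures and flags a message once enough copies were seen."""

    def __init__(self, algorithm, threshold=3, exempt_senders=()):
        self.algorithm = algorithm
        self.threshold = threshold
        self.exempt = set(exempt_senders)  # e.g. known mailing lists
        self.counts = Counter()

    def seen(self, sender, body):
        """Record the message; return True if it now looks like bulk mail."""
        if sender in self.exempt:
            return False
        sig = self.algorithm(body)
        self.counts[sig] += 1
        return self.counts[sig] >= self.threshold
```

A site would construct it with something like
BulkDetector(random.choice(ALGORITHMS)) and rebuild it with a different
algorithm every so often, so that the spammer cannot tune a mailing against
one fixed signature.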

greetings,