From: "Tarjei T. Jensen"
Newsgroups: comp.lang.ada
Subject: Re: If anybody wants to make something in Ada but do not know what
Date: Sat, 19 Apr 2003 08:28:59 +0200

Randy Brukardt wrote:
> That might prevent passing spam, but it does nothing to avoid the
> overhead. The problem is in order to find out the strongest indicator,
> you have to score every 'word' in the message. When a lot of trash words
> are in the message, you have to allocate new words and new counters for
> them; and when there are a lot of such messages, the size of the DB
> grows rapidly. (We saw this happen in the search engine when we
> accidentially indexed some Unix .lib files.) That adds overhead; a lot
> of overhead for a filter like mine which gets invoked on each message
> individually. (Writing out the word list each time is expensive.)

Why not do it another way: check all URLs in the message. If any of them point to a known porn/spam server, mark the message as suspect. Then do some processing on what is left of the text.

One could also derive some sort of unique signature from the mail and compare it to the signatures of other messages received. If a lot of messages have the same signature, they are likely to be spam. Known mailing lists would of course be excluded.

The only problem is to generate a signature that is not trivial to evade. Preferably there should be a number of signature algorithms to choose from, so that it becomes difficult to optimize the mail for all of them, since the spammer can't know which algorithm is in use on any given day or hour. The algorithm would of course be chosen arbitrarily at each site.

greetings,
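
A rough sketch of the two ideas above (URL check first, then a signature over what is left), in Python. Everything here is an assumption made for illustration: the names `is_suspect`, `BulkDetector` and `ALGORITHMS`, the two toy signature functions, the normalisation rules, and the threshold. None of it is a tested filter, and real signatures would need to be much harder to evade.

```python
import hashlib
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, deliberately simple URL pattern.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_hosts(body):
    """Return the lower-cased host names of all URLs found in the body."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlparse(url).hostname
        except ValueError:  # malformed URL, ignore it
            continue
        if host:
            hosts.add(host.lower())
    return hosts


def is_suspect(body, bad_hosts):
    """True if any URL in the body points at a known-bad host."""
    return not url_hosts(body).isdisjoint(bad_hosts)


def strip_urls(body):
    """What is 'left of the text' once the URLs are taken out."""
    return URL_RE.sub(" ", body)


def _words(body):
    return re.findall(r"[a-z]+", body.lower())


def sig_all_words(body):
    """Signature 1: hash of the normalised word sequence, order kept."""
    return hashlib.sha256(" ".join(_words(body)).encode()).hexdigest()


def sig_long_words(body):
    """Signature 2: hash of the sorted set of words longer than 5 letters."""
    picked = sorted({w for w in _words(body) if len(w) > 5})
    return hashlib.sha256(" ".join(picked).encode()).hexdigest()


# Each site picks one of these arbitrarily (and may change it over time).
ALGORITHMS = {"all_words": sig_all_words, "long_words": sig_long_words}


class BulkDetector:
    """Counts signatures and reports messages seen 'a lot' of times."""

    def __init__(self, algorithm, threshold=3, trusted_senders=()):
        self.sign = ALGORITHMS[algorithm]
        self.threshold = threshold          # assumed cut-off for "a lot"
        self.trusted = set(trusted_senders) # e.g. known mailing lists
        self.seen = Counter()

    def looks_bulk(self, sender, body):
        if sender in self.trusted:
            return False
        sig = self.sign(strip_urls(body))
        self.seen[sig] += 1
        return self.seen[sig] >= self.threshold
```

A site would pick its algorithm with something like `random.choice(sorted(ALGORITHMS))`. Signatures from different algorithms are not comparable, so a site that switches algorithm has to start a fresh counter (or keep one per algorithm). Only one counter per signature is stored, not one per word, which is what keeps the overhead lower than scoring every word.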