From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Subject: Re: Parallel Text Corpus Processing with Ada?
Newsgroups: comp.lang.ada
User-Agent: 40tude_Dialog/2.0.15.1
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Reply-To: mailbox@dmitry-kazakov.de
Organization: cbb software GmbH
References: <1194735959.240323.38210@v2g2000hsf.googlegroups.com> <1t1ab1hzsng9p.101gcl2uomeoy.dlg@40tude.net> <1194821365.830120.106600@o3g2000hsb.googlegroups.com>
Date: Mon, 12 Nov 2007 17:17:31 +0100
Message-ID: <8s767qqrk0iw.x5fwu5eaj345$.dlg@40tude.net>

On Sun, 11 Nov 2007 14:49:25 -0800, braver wrote:

> On Nov 11, 11:23 am, "Dmitry A. Kazakov" wrote:
>> But see above. What kind of processing do you have?
>>
>> 1. Do you run one complex pattern along a long text?
>> 2. Multiple patterns matching the same (long) text?
>> 3. Multiple patterns matching different texts?
>
> I do large corpora research, finding all kinds of n-grams in millions
> of files.
> I'm primarily interested in utilizing all 8 cores of my current Linux
> server to speed up things like grepping those files, so I would be
> curious to see Ada 2005 code doing both
>
> -- tasking
> -- dictionary counting of occurrences
> -- n-gram counting
>
> Tasking is definitely more interesting. As I see, I can already use
> hash maps from Ada.Containers; the question is how to split a corpus
> and unleash 8 tasks on it so that they occupy their own cores.

I would concentrate on preventing memory access collisions. Memory
access will almost certainly be the bottleneck. So, when choosing the
recognition and counting algorithms, I would shift the
memory/computation trade-off towards using less memory, in order to get
as much as possible into the processor's caches.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
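P.S. As a rough sketch of the splitting you ask about (names, the
worker count, and the toy in-memory corpus are mine; in practice each
corpus element would be a file name and the worker would read and
tokenize the file itself): give each task its own slice of the corpus
and its own local hash map, so the tasks never contend while counting,
and merge each local map into the shared map exactly once through a
protected object.

```ada
with Ada.Text_IO;
with Ada.Containers.Hashed_Maps;
with Ada.Strings.Unbounded;       use Ada.Strings.Unbounded;
with Ada.Strings.Unbounded.Hash;

procedure Corpus_Count is

   Worker_Count : constant := 8;   --  one task per core

   --  Toy corpus; stands in for the millions of files
   type Text_Array is array (Positive range <>) of Unbounded_String;
   Corpus : constant Text_Array :=
     (To_Unbounded_String ("the"),  To_Unbounded_String ("quick"),
      To_Unbounded_String ("the"),  To_Unbounded_String ("fox"),
      To_Unbounded_String ("the"),  To_Unbounded_String ("quick"));

   package Count_Maps is new Ada.Containers.Hashed_Maps
     (Key_Type        => Unbounded_String,
      Element_Type    => Natural,
      Hash            => Ada.Strings.Unbounded.Hash,
      Equivalent_Keys => "=");
   use Count_Maps;

   --  All cross-task traffic goes through this protected object,
   --  once per worker, so counting itself is contention-free.
   protected Global is
      procedure Merge (Local : Map);
      function Snapshot return Map;
   private
      Counts : Map;
   end Global;

   protected body Global is
      procedure Merge (Local : Map) is
         C        : Cursor := Local.First;
         Position : Cursor;
         Inserted : Boolean;
      begin
         while Has_Element (C) loop
            Counts.Insert (Key (C), Element (C), Position, Inserted);
            if not Inserted then
               Counts.Replace_Element
                 (Position, Element (Position) + Element (C));
            end if;
            Next (C);
         end loop;
      end Merge;

      function Snapshot return Map is
      begin
         return Counts;
      end Snapshot;
   end Global;

   task type Worker is
      entry Start (From : Positive; To : Natural);
   end Worker;

   task body Worker is
      First    : Positive;
      Last     : Natural;
      Local    : Map;
      Position : Cursor;
      Inserted : Boolean;
   begin
      accept Start (From : Positive; To : Natural) do
         First := From;
         Last  := To;
      end Start;
      for I in First .. Last loop     --  may be empty for some slices
         Local.Insert (Corpus (I), 1, Position, Inserted);
         if not Inserted then
            Local.Replace_Element (Position, Element (Position) + 1);
         end if;
      end loop;
      Global.Merge (Local);
   end Worker;

begin
   declare
      Workers : array (1 .. Worker_Count) of Worker;
      N       : constant Natural := Corpus'Length;
   begin
      for I in Workers'Range loop     --  partition the index range
         Workers (I).Start
           (From => Corpus'First + (I - 1) * N / Worker_Count,
            To   => Corpus'First + I * N / Worker_Count - 1);
      end loop;
   end;  --  the block does not exit until all workers terminate

   declare
      Result : constant Map := Global.Snapshot;
      C      : Cursor := Result.First;
   begin
      while Has_Element (C) loop
         Ada.Text_IO.Put_Line
           (To_String (Key (C)) & Natural'Image (Element (C)));
         Next (C);
      end loop;
   end;
end Corpus_Count;
```

Note the trade-off, in line with the caching remark above: per-task
local maps cost more memory than one shared map, but they keep each
task's working set in its own core's cache and reduce synchronization
to Worker_Count merges in total.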