From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Subject: Re: Parallel Text Corpus Processing with Ada?
Newsgroups: comp.lang.ada
User-Agent: 40tude_Dialog/2.0.15.1
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Reply-To: mailbox@dmitry-kazakov.de
Organization: cbb software GmbH
References: <1194735959.240323.38210@v2g2000hsf.googlegroups.com> <1t1ab1hzsng9p.101gcl2uomeoy.dlg@40tude.net> <1194821365.830120.106600@o3g2000hsb.googlegroups.com>
Date: Mon, 12 Nov 2007 17:17:31 +0100
Message-ID: <8s767qqrk0iw.x5fwu5eaj345$.dlg@40tude.net>

On Sun, 11 Nov 2007 14:49:25 -0800, braver wrote:

> On Nov 11, 11:23 am, "Dmitry A. Kazakov" wrote:
>> But see above. What kind of processing do you have?
>>
>> 1. Do you run one complex pattern along a long text?
>> 2. Multiple patterns matching the same (long) text?
>> 3. Multiple patterns matching different texts?
>
> I do large corpora research, finding all kinds of n-grams in millions
> of files.
> I'm primarily interested in utilizing all 8 cores of my current Linux
> server to speed up things like grepping those files, so I would be
> curious to see Ada 2005 code doing both
>
> -- tasking
> -- dictionary counting of occurrences
> -- n-gram counting
>
> Tasking is definitely more interesting. As I see, I can already use
> hash maps from Ada.Containers; the question is how to split a corpus
> and unleash 8 tasks on it so that they occupy their own cores.

I would concentrate on preventing memory access collisions. Memory
access will almost certainly be the bottleneck. So, when choosing the
recognition and counting algorithms, I would shift the
memory/computation trade-off towards using less memory, in order to get
as much as possible into the processor's caches.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
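P.S. As a rough sketch of the splitting you ask about (names, the
worker count, and the toy in-memory corpus are mine; in practice each
corpus element would be a file name and the worker would read and
tokenize the file itself): give each task its own slice of the corpus
and its own local hash map, so the tasks never contend while counting,
and merge each local map into the shared map exactly once through a
protected object.

```ada
with Ada.Text_IO;
with Ada.Containers.Hashed_Maps;
with Ada.Strings.Unbounded;       use Ada.Strings.Unbounded;
with Ada.Strings.Unbounded.Hash;

procedure Corpus_Count is

   Worker_Count : constant := 8;   --  one task per core

   --  Toy corpus; stands in for the millions of files
   type Text_Array is array (Positive range <>) of Unbounded_String;
   Corpus : constant Text_Array :=
     (To_Unbounded_String ("the"),  To_Unbounded_String ("quick"),
      To_Unbounded_String ("the"),  To_Unbounded_String ("fox"),
      To_Unbounded_String ("the"),  To_Unbounded_String ("quick"));

   package Count_Maps is new Ada.Containers.Hashed_Maps
     (Key_Type        => Unbounded_String,
      Element_Type    => Natural,
      Hash            => Ada.Strings.Unbounded.Hash,
      Equivalent_Keys => "=");
   use Count_Maps;

   --  All cross-task traffic goes through this protected object,
   --  once per worker, so counting itself is contention-free.
   protected Global is
      procedure Merge (Local : Map);
      function Snapshot return Map;
   private
      Counts : Map;
   end Global;

   protected body Global is
      procedure Merge (Local : Map) is
         C        : Cursor := Local.First;
         Position : Cursor;
         Inserted : Boolean;
      begin
         while Has_Element (C) loop
            Counts.Insert (Key (C), Element (C), Position, Inserted);
            if not Inserted then
               Counts.Replace_Element
                 (Position, Element (Position) + Element (C));
            end if;
            Next (C);
         end loop;
      end Merge;

      function Snapshot return Map is
      begin
         return Counts;
      end Snapshot;
   end Global;

   task type Worker is
      entry Start (From : Positive; To : Natural);
   end Worker;

   task body Worker is
      First    : Positive;
      Last     : Natural;
      Local    : Map;
      Position : Cursor;
      Inserted : Boolean;
   begin
      accept Start (From : Positive; To : Natural) do
         First := From;
         Last  := To;
      end Start;
      for I in First .. Last loop     --  may be empty for some slices
         Local.Insert (Corpus (I), 1, Position, Inserted);
         if not Inserted then
            Local.Replace_Element (Position, Element (Position) + 1);
         end if;
      end loop;
      Global.Merge (Local);
   end Worker;

begin
   declare
      Workers : array (1 .. Worker_Count) of Worker;
      N       : constant Natural := Corpus'Length;
   begin
      for I in Workers'Range loop     --  partition the index range
         Workers (I).Start
           (From => Corpus'First + (I - 1) * N / Worker_Count,
            To   => Corpus'First + I * N / Worker_Count - 1);
      end loop;
   end;  --  the block does not exit until all workers terminate

   declare
      Result : constant Map := Global.Snapshot;
      C      : Cursor := Result.First;
   begin
      while Has_Element (C) loop
         Ada.Text_IO.Put_Line
           (To_String (Key (C)) & Natural'Image (Element (C)));
         Next (C);
      end loop;
   end;
end Corpus_Count;
```

Note the trade-off, in line with the caching remark above: per-task
local maps cost more memory than one shared map, but they keep each
task's working set in its own core's cache and reduce synchronization
to Worker_Count merges in total.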