From: "Dmitry A. Kazakov"
Subject: Re: Parallel Text Corpus Processing with Ada?
Newsgroups: comp.lang.ada
Organization: cbb software GmbH
Date: Sun, 11 Nov 2007 09:23:34 +0100

On Sat, 10 Nov 2007 15:05:59 -0800, braver wrote:

> Greetings -- I'm working with large text corpora, and am wondering
> what tools are there for implementing parallel apps working with
> corpora. E.g., one could imagine a parallel grep. This is for a
> single Linux box with multiple CPUs and shared memory -- an ideal
> setup for Ada concurrency. What tools do we have to use things like
> Python and Ruby, also widely used for text processing, and what's the
> state of regexps?

Why necessarily RE? Or else why patterns? Patterns come at a high price. They are significantly slower than tailored string processing algorithms. The more power you get, the slower it works.
Especially for parallel processing I would consider a specialized implementation first.

As for patterns, GNAT has both RE and SNOBOL ones. SNOBOL patterns were mentioned by Georg. REs are in the package GNAT.Regexp. I have Ada bindings to different SNOBOL-like patterns (http://www.dmitry-kazakov.de/match/match.htm). But see above.

What kind of processing do you have?

1. Do you run one complex pattern along a long text?
2. Multiple patterns matching the same (long) text?
3. Multiple patterns matching different texts?

I.e. what is concurrent, and how well can you split it into different tasks? For example, matching alternatives concurrently, as in the pattern "green"|"red"|"blue", would likely be slower than doing it sequentially: too much overhead, impossible to implement heuristics, etc.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
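
To make case 1 above more concrete (one search over a long text, split among workers), here is a minimal sketch of a tailored literal search in Python, not Ada. The names find_all and parallel_find are hypothetical and exist only for this illustration. It assumes the text fits in memory and that the search key is a plain string rather than a pattern. Note that CPython threads will not necessarily run the searches on several CPUs at once, so this shows only how the work can be split, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def find_all(text, needle, start, stop):
    """Return offsets of every occurrence of needle that begins in [start, stop)."""
    hits = []
    # Allow a match that begins before `stop` to extend past it, so an
    # occurrence straddling a chunk boundary is found exactly once.
    limit = min(len(text), stop + len(needle) - 1)
    pos = text.find(needle, start, limit)
    while pos != -1:
        hits.append(pos)
        pos = text.find(needle, pos + 1, limit)
    return hits


def parallel_find(text, needle, workers=4):
    """Split text into roughly equal chunks and search them concurrently."""
    if not needle:
        raise ValueError("needle must be non-empty")
    chunk = max(1, -(-len(text) // workers))  # ceiling division
    bounds = [(lo, min(lo + chunk, len(text)))
              for lo in range(0, len(text), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so the offsets come back sorted.
        parts = list(pool.map(lambda b: find_all(text, needle, *b), bounds))
    return [pos for part in parts for pos in part]
```

The only subtle point is the chunk boundary: each worker owns the matches that start inside its chunk, and is allowed to read a few characters into the next chunk to complete them. A pattern with unbounded match length would not split this easily, which is one reason to ask what exactly is being matched before parallelizing.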