From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-0.3 required=5.0 tests=BAYES_00,
	REPLYTO_WITHOUT_TO_CC autolearn=no autolearn_force=no version=3.4.4
X-Google-Thread: 103376,b4b864fa2b61bbba
X-Google-Attributes: gid103376,public,usenet
X-Google-Language: ENGLISH,ASCII-7-bit
Path: 
 g2news1.google.com!news1.google.com!news.glorb.com!feed.xsnews.nl!border-1.ams.xsnews.nl!feeder1.cambrium.nl!feed.tweaknews.nl!news.netcologne.de!newsfeed-hp2.netcologne.de!newsfeed.arcor.de!newsspool4.arcor-online.net!news.arcor.de.POSTED!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Subject: Re: Parallel Text Corpus Processing with Ada?
Newsgroups: comp.lang.ada
User-Agent: 40tude_Dialog/2.0.15.1
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Reply-To: mailbox@dmitry-kazakov.de
Organization: cbb software GmbH
References: <1194735959.240323.38210@v2g2000hsf.googlegroups.com>
Date: Sun, 11 Nov 2007 09:23:34 +0100
Message-ID: <1t1ab1hzsng9p.101gcl2uomeoy.dlg@40tude.net>
NNTP-Posting-Date: 11 Nov 2007 09:23:39 CET
NNTP-Posting-Host: 45845cd9.newsspool1.arcor-online.net
X-Trace: 
 DXC=A7j9W^f9_joUoRk[hk2Walic==]BZ:afn4Fo<]lROoRaFl8W>\BH3YbI`YFje\m4<bDNcfSJ;bb[eIRnRBaCd<MnY<6@FFDc\Pn\ZZJ@lCaEon
X-Complaints-To: usenet-abuse@arcor.de
Xref: g2news1.google.com comp.lang.ada:18269
Date: 2007-11-11T09:23:39+01:00
List-Id: <comp.lang.ada>

On Sat, 10 Nov 2007 15:05:59 -0800, braver wrote:

> Greetings -- I'm working with large text corpora, and am wondering
> what tools are there for implementing parallel apps working with
> corpora.  E.g., one could imagine a parallel grep.  This is for a
> single Linux box with multiple CPUs and shared memory -- an ideal
> setup for Ada concurrency.  What tools do we have to use things like
> Python and Ruby, also widely used for text processing, and what's the
> state of regexps?

Why necessarily REs? Or, for that matter, why patterns at all? Patterns come
at a high price. They are significantly slower than tailored string
processing algorithms: the more power you get, the slower it works.
Especially for parallel processing I would consider a specialized
implementation first.

As for patterns, GNAT has both RE and SNOBOL ones. The SNOBOL patterns were
mentioned by Georg. REs are in the package GNAT.Regexp. I have Ada bindings
to different SNOBOL-like patterns:
http://www.dmitry-kazakov.de/match/match.htm

But see above. What kind of processing do you have?

1. Do you run one complex pattern along a long text?
2. Multiple patterns matching the same (long) text?
3. Multiple patterns matching different texts?

I.e. what is concurrent, and how well can you split it into different tasks?
For example, matching alternatives concurrently, as in the pattern
"green"|"red"|"blue", would likely be slower than doing it sequentially: too
much overhead, impossible to implement heuristics, etc.
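
To illustrate the point about where to split, here is a rough sketch (in
Python rather than Ada, purely for brevity) of the alternative I would try
first: partition the text among workers and let each worker test the
alternatives sequentially with plain substring search. The names WORDS,
count_hits and parallel_count and the sample corpus are made up for the
example. The thread pool merely stands in for a set of Ada tasks; in CPython
a process pool would be needed to get real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

# The alternatives from the example pattern "green"|"red"|"blue".
WORDS = ("green", "red", "blue")


def count_hits(lines):
    """Scan one chunk sequentially, trying each alternative in turn."""
    hits = 0
    for line in lines:
        # Plain substring search, no pattern engine involved.
        if any(word in line for word in WORDS):
            hits += 1
    return hits


def parallel_count(lines, workers=4):
    """Split the lines into contiguous chunks and scan the chunks concurrently."""
    if not lines:
        return 0
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_hits, chunks))


# Hypothetical sample corpus, one string per line of text.
corpus = [
    "the green door",
    "nothing here",
    "red and blue",
    "a plain line",
    "blue sky",
]
total = parallel_count(corpus, workers=2)
```

The concurrency here is across the text, not across the alternatives, so
each worker is free to use whatever sequential heuristic suits its chunk.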

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de