comp.lang.ada
 help / color / mirror / Atom feed
* Parallel Text Corpus Processing with Ada?
@ 2007-11-10 23:05 braver
  2007-11-11  0:11 ` tmoran
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: braver @ 2007-11-10 23:05 UTC (permalink / raw)


Greetings -- I'm working with large text corpora, and am wondering
what tools are there for implementing parallel apps working with
corpora.  E.g., one could imagine a parallel grep.  This is for a
single Linux box with multiple CPUs and shared memory -- an ideal
setup for Ada concurrency.  What tools do we have to use things like
Python and Ruby, also widely used for text processing, and what's the
state of regexps?

Cheers,
Alexy




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Parallel Text Corpus Processing with Ada?
  2007-11-10 23:05 Parallel Text Corpus Processing with Ada? braver
@ 2007-11-11  0:11 ` tmoran
  2007-11-11  1:10 ` Georg Bauhaus
  2007-11-11  8:23 ` Dmitry A. Kazakov
  2 siblings, 0 replies; 17+ messages in thread
From: tmoran @ 2007-11-11  0:11 UTC (permalink / raw)


> E.g., one could imagine a parallel grep.
  A year or something ago I tried a parallel word counter and as I recall
it ran pleasing faster on a dual-core than a single-core CPU.  It had a
subroutine that accepted a buffer load of text updated a count parameter.
So parallelizing just consisted in making more than one counter task,
then having the read task give each a buffer load in turn, and after all
was done, harvesting the several counts.  Of course there was also a
bit of special handling for words that were split across buffer loads,
but for a decent size buffer that took an inconsequential amount of time.
  So yes, there are lots of opportunities to utility multiple cpus using
very simple Ada tasking.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-10 23:05 Parallel Text Corpus Processing with Ada? braver
  2007-11-11  0:11 ` tmoran
@ 2007-11-11  1:10 ` Georg Bauhaus
  2007-11-11  8:23 ` Dmitry A. Kazakov
  2 siblings, 0 replies; 17+ messages in thread
From: Georg Bauhaus @ 2007-11-11  1:10 UTC (permalink / raw)



On Sat, 2007-11-10 at 15:05 -0800, braver wrote:
> Greetings -- I'm working with large text corpora,... could imagine
>  a parallel grep.  ... what's the
> state of regexps?

GNAT has an original implementation of SPITBOL (SNOBOL4)
pattern matching. This is good equipment for text analysis.






^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-10 23:05 Parallel Text Corpus Processing with Ada? braver
  2007-11-11  0:11 ` tmoran
  2007-11-11  1:10 ` Georg Bauhaus
@ 2007-11-11  8:23 ` Dmitry A. Kazakov
  2007-11-11 15:54   ` Georg Bauhaus
  2007-11-11 22:49   ` braver
  2 siblings, 2 replies; 17+ messages in thread
From: Dmitry A. Kazakov @ 2007-11-11  8:23 UTC (permalink / raw)


On Sat, 10 Nov 2007 15:05:59 -0800, braver wrote:

> Greetings -- I'm working with large text corpora, and am wondering
> what tools are there for implementing parallel apps working with
> corpora.  E.g., one could imagine a parallel grep.  This is for a
> single Linux box with multiple CPUs and shared memory -- an ideal
> setup for Ada concurrency.  What tools do we have to use things like
> Python and Ruby, also widely used for text processing, and what's the
> state of regexps?

Why necessarily RE? Or else why patterns? Patterns come at a high price.
They are sufficiently slower than tailored string processing algorithms.
More power you get, slower it works. Especially for parallel processing I
would consider a specialized implementation first.

As for patterns, GNAT has both RE and SNOBOL ones. SNOBOL patterns were
mentioned by Georg. REs are in the package GNAT.Regexp. I have Ada bindings
to different SNOBOL-like patterns
http://www.dmitry-kazakov.de/match/match.htm.

But see above. What kind of processing you have?

1. Do you run one complex pattern along a long text?
2. Multiple patterns matching the same (long) text?
3. Multiple patterns matching different texts?

I.e. what is concurrent and how well you can split it into different tasks.
For example matching alternatives concurrently like in the pattern
"green"|"red"|"blue" would likely be slower than doing it sequential, too
much overhead, impossible to implement heuristics etc.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-11  8:23 ` Dmitry A. Kazakov
@ 2007-11-11 15:54   ` Georg Bauhaus
  2007-11-11 16:13     ` Georg Bauhaus
  2007-11-12 13:31     ` Dmitry A. Kazakov
  2007-11-11 22:49   ` braver
  1 sibling, 2 replies; 17+ messages in thread
From: Georg Bauhaus @ 2007-11-11 15:54 UTC (permalink / raw)



On Sun, 2007-11-11 at 09:23 +0100, Dmitry A. Kazakov wrote:

> Why necessarily RE? Or else why patterns? Patterns come at a high price.
> They are sufficiently slower than tailored string processing algorithms.

SPITBOL patterns serve "tailored string processing algorithms";
I don't quite understand how you could have more targetted algorithms
for large corpora. What do you have in mind?
For example, BreakX is hard to beat; I believe that your assessment that
"patterns are more powerful, but slower in matching"[1] is really true.

A few RE packages have a Boyer-Moore algorithm built in, and more.
How can string processing be substantially faster?

(I guess that the new Ada.Strings.Fixed.Index/5 subprograms are
an improvement, maybe depending on how string slices would otherwise
be passed to searches starting at some Position.)

[1] http://www.dmitry-kazakov.de/match/match.htm





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-11 15:54   ` Georg Bauhaus
@ 2007-11-11 16:13     ` Georg Bauhaus
  2007-11-12 13:31     ` Dmitry A. Kazakov
  1 sibling, 0 replies; 17+ messages in thread
From: Georg Bauhaus @ 2007-11-11 16:13 UTC (permalink / raw)



On Sun, 2007-11-11 at 16:54 +0100, Georg Bauhaus wrote:

> For example, BreakX is hard to beat; I believe that your assessment that
> "patterns are more powerful, but slower in matching"[1] is really true.

There is a "not" missing in the last sentence.






^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-11  8:23 ` Dmitry A. Kazakov
  2007-11-11 15:54   ` Georg Bauhaus
@ 2007-11-11 22:49   ` braver
  2007-11-12 16:17     ` Dmitry A. Kazakov
  2007-11-13 22:45     ` Simon Wright
  1 sibling, 2 replies; 17+ messages in thread
From: braver @ 2007-11-11 22:49 UTC (permalink / raw)


On Nov 11, 11:23 am, "Dmitry A. Kazakov" <mail...@dmitry-kazakov.de>
wrote:
> But see above. What kind of processing you have?
>
> 1. Do you run one complex pattern along a long text?
> 2. Multiple patterns matching the same (long) text?
> 3. Multiple patterns matching different texts?

I do large corpora research, finding all kinds of n-grams in millions
of files.  I'm primarily interested in utilizing all 8 cores of my
current Linux server to speed up things like grepping those files, so
would be curious to see Ada 2005 code doing both

-- tasking
-- dictionary counting of occurrences -- n-gram counting

Tasking is definitely more interesting as I see already from
Ada.Containers I can use hash maps, the questions is how to split a
corpus and unleash 8 tasks on it so they occupy their own cores.

Cheers,
Alexy




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-11 15:54   ` Georg Bauhaus
  2007-11-11 16:13     ` Georg Bauhaus
@ 2007-11-12 13:31     ` Dmitry A. Kazakov
  2007-11-12 15:07       ` Georg Bauhaus
  1 sibling, 1 reply; 17+ messages in thread
From: Dmitry A. Kazakov @ 2007-11-12 13:31 UTC (permalink / raw)


On Sun, 11 Nov 2007 16:54:40 +0100, Georg Bauhaus wrote:

> On Sun, 2007-11-11 at 09:23 +0100, Dmitry A. Kazakov wrote:
> 
>> Why necessarily RE? Or else why patterns? Patterns come at a high price.
>> They are sufficiently slower than tailored string processing algorithms.
> 
> SPITBOL patterns serve "tailored string processing algorithms";
> I don't quite understand how you could have more targetted algorithms
> for large corpora. What do you have in mind?

There is wide class of problems which patterns do not solve or solve
inefficiently. Like building dictionaries finding longest common substring
etc. (Compiling Ada programs is also in this class. (:-))

> A few RE packages have a Boyer-Moore algorithm built in, and more.
> How can string processing be substantially faster?

That depends on the problem. RE is not just about string search. Searching
for a substring is itself a very specialized problem which IMO is rarely
needed. In the other hand string processing often is more than pure
matching (i.e. cursor moving + failure/success outcome). SNOBOL had
immediate assignment to respond the problem. As a general note, pattern
matching languages are declarative with all disadvantage of.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-12 13:31     ` Dmitry A. Kazakov
@ 2007-11-12 15:07       ` Georg Bauhaus
  2007-11-12 16:11         ` Dmitry A. Kazakov
  0 siblings, 1 reply; 17+ messages in thread
From: Georg Bauhaus @ 2007-11-12 15:07 UTC (permalink / raw)


On Mon, 2007-11-12 at 14:31 +0100, Dmitry A. Kazakov wrote:

> There is wide class of problems which patterns do not solve or solve
> inefficiently. Like building dictionaries finding longest common substring
> etc. (Compiling Ada programs is also in this class. (:-))

OK. This valuable context information would, I think, be a nice
addition to your SNOBOL patterns comments.
In line with the SNOBOL4 motto: Don't use pattern matching...

> As a general note, pattern
> matching languages are declarative with all disadvantage of.

OK, but being declarative isn't true of SNOBOL4's, because you can
modify patterns during scanning, if you are crazy enough to do this. ;-)







^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-12 15:07       ` Georg Bauhaus
@ 2007-11-12 16:11         ` Dmitry A. Kazakov
  0 siblings, 0 replies; 17+ messages in thread
From: Dmitry A. Kazakov @ 2007-11-12 16:11 UTC (permalink / raw)


On Mon, 12 Nov 2007 16:07:53 +0100, Georg Bauhaus wrote:

> On Mon, 2007-11-12 at 14:31 +0100, Dmitry A. Kazakov wrote:
> 
>> There is wide class of problems which patterns do not solve or solve
>> inefficiently. Like building dictionaries finding longest common substring
>> etc. (Compiling Ada programs is also in this class. (:-))
> 
> OK. This valuable context information would, I think, be a nice
> addition to your SNOBOL patterns comments.
> In line with the SNOBOL4 motto: Don't use pattern matching...

I am just trying to be honest. (:-))

Then existence of unsolvable problems does not imply that any problem is
like that. There is a class of problems where patterns are suitable. They
best fit in scripting environments.

But when I see patterns in the context of parallel processing of massive
amounts of data, then the first thing I would do is to verify if I really
should use them.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-11 22:49   ` braver
@ 2007-11-12 16:17     ` Dmitry A. Kazakov
  2007-11-13 22:45     ` Simon Wright
  1 sibling, 0 replies; 17+ messages in thread
From: Dmitry A. Kazakov @ 2007-11-12 16:17 UTC (permalink / raw)


On Sun, 11 Nov 2007 14:49:25 -0800, braver wrote:

> On Nov 11, 11:23 am, "Dmitry A. Kazakov" <mail...@dmitry-kazakov.de>
> wrote:
>> But see above. What kind of processing you have?
>>
>> 1. Do you run one complex pattern along a long text?
>> 2. Multiple patterns matching the same (long) text?
>> 3. Multiple patterns matching different texts?
> 
> I do large corpora research, finding all kinds of n-grams in millions
> of files.  I'm primarily interested in utilizing all 8 cores of my
> current Linux server to speed up things like grepping those files, so
> would be curious to see Ada 2005 code doing both
> 
> -- tasking
> -- dictionary counting of occurrences -- n-gram counting
> 
> Tasking is definitely more interesting as I see already from
> Ada.Containers I can use hash maps, the questions is how to split a
> corpus and unleash 8 tasks on it so they occupy their own cores.

I would concentrate on prevention of memory access collisions. Memory
access should certainly be the bottleneck. So choosing the algorithm of
recognition and counting I would move memory / computing trade-off towards
memory in order to get as many things as possible into the processor's
caches.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-11 22:49   ` braver
  2007-11-12 16:17     ` Dmitry A. Kazakov
@ 2007-11-13 22:45     ` Simon Wright
  2007-11-14 23:38       ` braver
  1 sibling, 1 reply; 17+ messages in thread
From: Simon Wright @ 2007-11-13 22:45 UTC (permalink / raw)


braver <deliverable@gmail.com> writes:

> Tasking is definitely more interesting as I see already from
> Ada.Containers I can use hash maps, the questions is how to split a
> corpus and unleash 8 tasks on it so they occupy their own cores.

You did spot that Ada.Containers aren't task-safe? (ie, you need to
lock the containers yourself ...)



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-13 22:45     ` Simon Wright
@ 2007-11-14 23:38       ` braver
  2007-11-15  9:39         ` Ludovic Brenta
  2007-11-15 21:11         ` Simon Wright
  0 siblings, 2 replies; 17+ messages in thread
From: braver @ 2007-11-14 23:38 UTC (permalink / raw)


On Nov 14, 1:45 am, Simon Wright <simon.j.wri...@mac.com> wrote:
> You did spot that Ada.Containers aren't task-safe? (ie, you need to
> lock the containers yourself ...)

Man, that's a problem!  What about BCs, are they task-safe?  What do
folks do to use Ada.Containers in a tasking setup, and why on Earth
would they be in a standard and contardict some other chapters such as
the tasks?

Cheers,
Alexy




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-14 23:38       ` braver
@ 2007-11-15  9:39         ` Ludovic Brenta
  2007-11-15 11:12           ` Dmitry A. Kazakov
  2007-11-15 21:11         ` Simon Wright
  1 sibling, 1 reply; 17+ messages in thread
From: Ludovic Brenta @ 2007-11-15  9:39 UTC (permalink / raw)


braver writes:
> On Nov 14, 1:45 am, Simon Wright <simon.j.wri...@mac.com> wrote:
>> You did spot that Ada.Containers aren't task-safe? (ie, you need to
>> lock the containers yourself ...)
>
> Man, that's a problem!  What about BCs, are they task-safe?  What do
> folks do to use Ada.Containers in a tasking setup, and why on Earth
> would they be in a standard and contardict some other chapters such
> as the tasks?

They do not contradict the tasking; on the contrary they are
orthogonal to tasking.  This way, if you do not need tasking, the
containers do not force you to pay the price of tasking.

You can wrap a container in a protected object if you want concurrent
access.  Alternatively, you can have one container per task (private
to that task, so not protection needed).  Before starting each task,
you Move elements into that task's private container.

-- 
Ludovic Brenta.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-15  9:39         ` Ludovic Brenta
@ 2007-11-15 11:12           ` Dmitry A. Kazakov
  0 siblings, 0 replies; 17+ messages in thread
From: Dmitry A. Kazakov @ 2007-11-15 11:12 UTC (permalink / raw)


On Thu, 15 Nov 2007 10:39:59 +0100, Ludovic Brenta wrote:

> You can wrap a container in a protected object if you want concurrent
> access.

It is not that simple. One problem is that O(>1) operations are poor
candidates for being protected actions. Another is with cursors and stuff
like that which is decoupled from the container. Thus a cursor cannot be
used outside the container when the latter is protected. Otherwise it might
get corrupt.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-14 23:38       ` braver
  2007-11-15  9:39         ` Ludovic Brenta
@ 2007-11-15 21:11         ` Simon Wright
  2007-11-17  1:05           ` Randy Brukardt
  1 sibling, 1 reply; 17+ messages in thread
From: Simon Wright @ 2007-11-15 21:11 UTC (permalink / raw)


braver <deliverable@gmail.com> writes:

> On Nov 14, 1:45 am, Simon Wright <simon.j.wri...@mac.com> wrote:
>> You did spot that Ada.Containers aren't task-safe? (ie, you need to
>> lock the containers yourself ...)
>
> Man, that's a problem!  What about BCs, are they task-safe?

At one point the BCs followed their original C++ source by providing a
couple of synchronised forms.

I found that for any serious use I couldn't find a one-size-fits-all
(well, few-sizes-fit-all perhaps) approach that let me build real
apps. Iterators were one problem, applications where the data
structure involves multiple containers was another. I did consider
some shared-mutex schemes (would have involved something like
constraining container instances by access Mutex'Class or some
such). But on the whole it seemed that it would be better to require
people to but the protection _they_ need round their data.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Parallel Text Corpus Processing with Ada?
  2007-11-15 21:11         ` Simon Wright
@ 2007-11-17  1:05           ` Randy Brukardt
  0 siblings, 0 replies; 17+ messages in thread
From: Randy Brukardt @ 2007-11-17  1:05 UTC (permalink / raw)


"Simon Wright" <simon.j.wright@mac.com> wrote in message
news:m2oddvw6xn.fsf@mac.com...
> braver <deliverable@gmail.com> writes:
>
> > On Nov 14, 1:45 am, Simon Wright <simon.j.wri...@mac.com> wrote:
> >> You did spot that Ada.Containers aren't task-safe? (ie, you need to
> >> lock the containers yourself ...)
> >
> > Man, that's a problem!  What about BCs, are they task-safe?
>
> At one point the BCs followed their original C++ source by providing a
> couple of synchronised forms.
>
> I found that for any serious use I couldn't find a one-size-fits-all
> (well, few-sizes-fit-all perhaps) approach that let me build real
> apps. Iterators were one problem, applications where the data
> structure involves multiple containers was another. I did consider
> some shared-mutex schemes (would have involved something like
> constraining container instances by access Mutex'Class or some
> such). But on the whole it seemed that it would be better to require
> people to but the protection _they_ need round their data.

Which is the same reason that the standard Ada containers don't allow
multiple tasks accessing a single container. (Nor does any other Ada
predefined library, for that matter. This is the standard behavior of Ada
predefined libraries - see RM A(3) - that's paragraph 3 at the beginning of
Annex A - there is nothing specific to containers here.)

There is some work going on to provide some for of "protected" containers,
but it is not clear what or when that will come to fruition.

                     Randy.





^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2007-11-17  1:05 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-10 23:05 Parallel Text Corpus Processing with Ada? braver
2007-11-11  0:11 ` tmoran
2007-11-11  1:10 ` Georg Bauhaus
2007-11-11  8:23 ` Dmitry A. Kazakov
2007-11-11 15:54   ` Georg Bauhaus
2007-11-11 16:13     ` Georg Bauhaus
2007-11-12 13:31     ` Dmitry A. Kazakov
2007-11-12 15:07       ` Georg Bauhaus
2007-11-12 16:11         ` Dmitry A. Kazakov
2007-11-11 22:49   ` braver
2007-11-12 16:17     ` Dmitry A. Kazakov
2007-11-13 22:45     ` Simon Wright
2007-11-14 23:38       ` braver
2007-11-15  9:39         ` Ludovic Brenta
2007-11-15 11:12           ` Dmitry A. Kazakov
2007-11-15 21:11         ` Simon Wright
2007-11-17  1:05           ` Randy Brukardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox