comp.lang.ada
* Reliability and deadlock in Annex E/distributed code
@ 2006-09-10 20:58 Dr. Adrian Wrigley
  2006-09-11 18:52 ` Jerome Hugues
  2006-09-12 20:31 ` Dr. Adrian Wrigley
  0 siblings, 2 replies; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-10 20:58 UTC (permalink / raw)


Hi guys!

I've been having difficulty getting my Annex E/glade code to run reliably.

Under gnat 3.15p for x86 Linux, things were tolerably OK, with failures
of the code about weekly (running one instance continuously).
Sometimes the program simply wouldn't allow new partitions to run, as if
there was some boot server failure.  Sometimes the server would suddenly
start consuming all the CPU cycles it could get.

I think there may have been one or two bugs in the 3.15p version of glade,
particularly under certain error conditions of partitions being killed
while in use.

It has proved impossible to get the program to be as reliable as I want,
so I have tried running under GNAT GPL 2006 from https://libre2.adacore.com/

I built and installed GLADE, and have been testing my code.  It doesn't
work at all.  I have also tried the gnat and glade from Martin Krischik's
builds for FC5, and got the same problem:

There are three partitions A, B, C
The program starts up normally.
A procedure (in a normal unit) in partition C calls a function (in a
normal package) in partition B (using dynamic dispatch on a remote access
to class-wide type).  The function in partition B calls a function (in an
RCI package) in partition A.  The function in partition A never executes,
and the program stops executing.

If I enable glade debugging (S_PARINT=true S_RPC=true), I can see that
partition A gets the RPC message instructing it to do the call, but
then it doesn't actually call the necessary function (parameterless
return of an integer, but any function call fails).

If I call the function in A directly from B, it works fine.  It only seems
to be when A is called from B while executing a call from C that the problem occurs.

It's as if there is some deadlock or shortage of tasks to allocate or something.

Any ideas?

Using gdb, I find that each time a call in A is made, but doesn't execute, I
get a new task:

(gdb) info tasks
...
* 12   81d16a8    1  46 Waiting on entry call  rpc_handler
(gdb) where
#0  0x42028d69 in sigsuspend () from /lib/i686/libc.so.6
#1  0x4005b108 in __pthread_wait_for_restart_signal () from /lib/i686/libpthread.so.0
#2  0x4005804b in pthread_cond_wait () from /lib/i686/libpthread.so.0
#3  0x080c5172 in system.tasking.entry_calls.wait_until_abortable ()
#4  0x080c29c4 in system.tasking.protected_objects.operations.protected_entry_call ()
#5  0x080b0afa in system.rpc.server.rpc_handler (<_task>=0x81d1698) at s-tpobop.ads:200
#6  0x080bed4f in system.tasking.stages.task_wrapper ()

It looks like the call cannot proceed until "wait_until_abortable" returns.

Am I doing something wrong by making one remote call inside another?
Maybe the new glade detects an error unnoticed by 3.15p?  Perhaps this
is the cause of previous 'hangs'?

On the topic of Annex E support:

I've tried building PolyORB from https://libre2.adacore.com/ but it seems
to be missing the src/dsa directory needed to support Annex E.  If I build the
version from CVS, I get the error "raised RTSFIND.RE_NOT_AVAILABLE : rtsfind.adb:497".
What's the best way to build the DSA personality?

If I used the DSA personality from PolyORB, will it be any different from the
GARLIC PCS?  Might it be more/less robust? faster?

Thanks in advance for any input!
--
Dr. Adrian Wrigley, Cambridge, UK.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-10 20:58 Reliability and deadlock in Annex E/distributed code Dr. Adrian Wrigley
@ 2006-09-11 18:52 ` Jerome Hugues
  2006-09-12 20:40   ` Dr. Adrian Wrigley
  2006-09-12 20:31 ` Dr. Adrian Wrigley
  1 sibling, 1 reply; 20+ messages in thread
From: Jerome Hugues @ 2006-09-11 18:52 UTC (permalink / raw)


In article <pan.2006.09.10.20.55.57.113998@linuxchip.demon.co.uk.uk.uk>, Dr. Adrian Wrigley wrote:

> There are three partitions A, B, C
> The program starts up normally.
> A procedure (in a normal unit) in partition C calls a function (in a
> normal package) in partition B (using dynamic dispatch on a remote access
> to class-wide type).  The function in partition B calls a function (in an
> RCI package) in partition A.  The function in partition A never executes,
> and the program stops executing.

> It's as if there is some deadlock or shortage of tasks to allocate
> or something.
>
> Any ideas?

How many application tasks do you have on each node?  Did you
configure a task pool for each node?  (Just to check you do not run
out of tasks.)
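
For reference, a task pool is configured per partition in the gnatdist
configuration file with the Task_Pool attribute, which takes three
values bounding the pool of anonymous tasks that serve incoming remote
calls (roughly a minimum, a soft ceiling and a hard maximum; the
partition name and sizes below are illustrative, see the GLADE user
guide for the exact semantics):

  MyPart : Partition := (Some_RCI_Unit);
  for MyPart'Task_Pool use (4, 4, 10);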
 
> I've tried building PolyORB from https://libre2.adacore.com/ but it
> seems to be missing the src/dsa directory needed to support Annex E.
> If I get the version from cvs, it gets the error "raised
> RTSFIND.RE_NOT_AVAILABLE : rtsfind.adb:497".  What's the best way to
> build the DSA personality?

Short answer: wait for an AdaCore announcement stating it is ready ;)

Long answer: DSA requires that the GNAT, GLADE (gnatdist) and PolyORB
versions be consistent, so getting it from CVS exposes you to
problems.
 
AFAICT, the error you see means expanded code references an entity
that does not exist. It is the symptom of a strong inconsistency
between the compiler and the PCS.

> If I used the DSA personality from PolyORB, will it be any different
> from the GARLIC PCS?  Might it be more/less robust? faster?

Lots of new configuration options, new protocols. 

-- 
Jerome




* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-10 20:58 Reliability and deadlock in Annex E/distributed code Dr. Adrian Wrigley
  2006-09-11 18:52 ` Jerome Hugues
@ 2006-09-12 20:31 ` Dr. Adrian Wrigley
  2006-09-12 23:24   ` tmoran
                     ` (3 more replies)
  1 sibling, 4 replies; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-12 20:31 UTC (permalink / raw)


On Sun, 10 Sep 2006 20:58:33 +0000, Dr. Adrian Wrigley wrote:

> I've been having difficulty getting my Annex E/glade code to run reliably.
> 
> Under gnat 3.15p for x86 Linux, things were tolerably OK, with failures
> of the code about weekly (running one instance continuously).
> Sometimes the program simply wouldn't allow new partitions to run, as if
> there was some boot server failure.  Sometimes the server would suddenly
> start consuming all the CPU cycles it could get.
...

OK.  I have produced a fairly short example.
There are three partitions, A, B, C.
C calls B which calls A.
Compiler is GNAT GPL 2006 + GLADE 2006 on x86 Linux

The partition C (executable in ./cpart) runs OK on *alternate*
invocations.  Every other time, it hangs indefinitely.
This seems strange.

The dialogue below shows the source files, the build, and two
invocations of partition C, one of which hangs.
Host machine is archimedes, running bash.


archimedes$ ls -l *.ad[sb] dist.cfg
-rw-rw-rw-    1 amtw     amtw          188 Sep 12 16:04 a.adb
-rw-rw-rw-    1 amtw     amtw           88 Sep 12 16:14 a.ads
-rw-rw-rw-    1 amtw     amtw          106 Sep 12 16:11 amain.adb
-rw-rw-r--    1 amtw     amtw          466 Sep 12 16:17 b.adb
-rw-rw-r--    1 amtw     amtw          116 Sep 12 16:08 b.ads
-rw-rw-rw-    1 amtw     amtw          252 Sep 12 16:10 cmain.adb
-rw-rw-rw-    1 amtw     amtw          540 Sep 12 16:16 dist.cfg

archimedes$ head -n 100 *.ad[sb] dist.cfg
==> a.adb <==
package body A is

   X : Integer := 0;

   function Next return Integer is
   begin
      X := X + 1; -- Return next integer in sequence, unprotected
      return X;
   end Next;

end A;

==> a.ads <==
package A is

   pragma Remote_Call_Interface;
   function Next return Integer;

end A;

==> amain.adb <==
with A;

procedure Amain is
begin

   delay 1000.0; -- Wait around for a while, then complete

end Amain;

==> b.adb <==
with Text_IO;
with A;

package body B is

-- Return A.Next simply by passing call through
   function Next return Integer is
   begin
      Text_IO.Put_Line ("B: B Next called");

      return A.Next;
   end Next;

   task Main;
   task body Main is
   begin
      Text_IO.Put_Line ("B: B making direct call to RCI function in A:");
-- Direct call to function in A works fine
      Text_IO.Put_Line ("B: A Next gives" & Integer'Image (A.Next));
   end Main;

end B;

==> b.ads <==
package B is

   pragma Remote_Call_Interface;
   function Next return Integer; -- Pass through of A's Next

end B;

==> cmain.adb <==
with Text_IO;
with B;

-- Each time this program is run, should produce the next integer in sequence

procedure CMain is

begin

   Text_IO.Put_Line ("C: Running  B.Next:");
   Text_IO.Put_Line ("C: B Next gives" & Integer'Image (B.Next));

end CMain;

==> dist.cfg <==
configuration Dist is

-- Boot server specification:
  pragma Starter (None);
  pragma Boot_Location ("tcp", "localhost:6788"); -- Choose spare port

  APart : Partition := (A);
  procedure AMain is in APart;
  for APart'Task_Pool use (4, 4, 10);

  BPart : Partition := (B);
  for BPart'Task_Pool use (4, 4, 10);

  procedure CMain;
  CPart : Partition := (CMain);
  for CPart'Task_Pool use (4, 4, 10);
  for CPart'Main use CMain;
  for CPart'Termination use Local_Termination;
  for CPart'Reconnection use Block_Until_Restart;

end Dist;
archimedes$

archimedes$ gcc -v   # Test the compiler version
Reading specs from /data2/gnat-gpl/bin/../lib/gcc/i686-pc-linux-gnu/3.4.6/specs
Configured with: /cardhu.b/gnatmail/release-gpl/build-cardhu/src/configure --prefix=/usr/gnat --enable-languages=c,ada --disable-nls --disable-libada --target=i686-pc-linux-gnu --host=i686-pc-linux-gnu --disable-checking --enable-threads=posix
Thread model: posix
gcc version 3.4.6 for GNAT GPL 2006 (20060522)


archimedes$ gnatdist -g dist.cfg   # Build the partitions
gnatdist: checking configuration consistency
 ------------------------------
 ---- Configuration report ----
 ------------------------------
Configuration :
   Name        : dist
   Main        : amain
   Starter     : none
   Protocols   : tcp://localhost:6788

Partition apart
   Main        : amain
   Task Pool   : 4 4 10 
   Units       : 
             - a (rci)
             - amain (normal)

Partition bpart
   Task Pool   : 4 4 10 
   Units       : 
             - b (rci)

Partition cpart
   Main        : cmain
   Task Pool   : 4 4 10 
   Termination : local
   Units       : 
             - cmain (normal)

 -------------------------------
gnatdist:    a caller stubs is up to date
gnatdist:    a receiver stubs is up to date
gnatdist: building b caller stubs from b.ads
gnatdist: building b receiver stubs from b.adb
gnatdist: building partition bpart
gnatdist: building partition cpart
archimedes$
archimedes$ ./apart &   # Start partition A
[1] 20904
archimedes$ ./bpart &   # Start partition B
[2] 20911
archimedes$ B: B making direct call to RCI function in A:
B: A Next gives 1

archimedes$ ./cpart     # Test partition C
C: Running  B.Next:
B: B Next called
C: B Next gives 2       # Works!
archimedes$ ./cpart     # Test partition C again
C: Running  B.Next:     # Hangs :(


--
Adrian









* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-11 18:52 ` Jerome Hugues
@ 2006-09-12 20:40   ` Dr. Adrian Wrigley
  2006-09-13  7:16     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-12 20:40 UTC (permalink / raw)


Thanks for the reply!

On Mon, 11 Sep 2006 18:52:56 +0000, Jerome Hugues wrote:

> In article <pan.2006.09.10.20.55.57.113998@linuxchip.demon.co.uk.uk.uk>, Dr. Adrian Wrigley wrote:
> 
>> There are three partitions A, B, C
>> The program starts up normally.
>> A procedure (in a normal unit) in partition C calls a function (in a
>> normal package) in partition B (using dynamic dispatch on a remote access
>> to class-wide type).  The function in partition B calls a function (in an
>> RCI package) in partition A.  The function in partition A never executes,
>> and the program stops executing.
> 
>> It's as if there is some deadlock or shortage of tasks to allocate
>> or something.
>>
>> Any ideas?
> 
> How many application tasks do you have on each node ? did you
> configure a task pool for each node ? (just to check you do not run
> out of task)

I have about two or three.  I have configured a task pool.
I don't think running out of tasks causes the problem, although
it may trigger later failures, since each time execution hangs,
tasks are retained.

In another reply to this thread, I give example code which
behaves unexpectedly on GNAT GPL 2006.

>> I've tried building PolyORB from https://libre2.adacore.com/ but it
>> seems to be missing the src/dsa directory needed to support Annex E.
>> If I get the version from cvs, it gets the error "raised
>> RTSFIND.RE_NOT_AVAILABLE : rtsfind.adb:497".  What's the best way to
>> build the DSA personality?
> 
> Short answer: wait for an AdaCore announcement stating it is ready ;)

I've heard for a year or two that PolyORB supports DSA.  But the
whole schizophrenic middleware thing confuses me, so I haven't
paid much attention before.

> Long answer: DSA requires that the GNAT, GLADE (gnatdist) and PolyORB
> versions be consistent, so getting it from CVS exposes you to
> problems.
>  
> AFAICT, the error you see means expanded code references an entity
> that does not exist. It is the symptom of a strong inconsistency
> between the compiler and the PCS.

The PolyORB CVS suggests that GNAT GPL 2006 is suitable, but the
change log implies that there is still significant turmoil in
the code base.

>> If I used the DSA personality from PolyORB, will it be any different
>> from the GARLIC PCS?  Might it be more/less robust? faster?
> 
> Lots of new configuration options, new protocols.

I'd like to see some clear examples of what real problems the tool
can solve.  How do I know when to use MOMA, SOAP, GIOP etc?
The DSA "application personality" seems the easiest to integrate
into existing Ada applications, but I have no experience of
any of the "protocol personalities" to choose between them.

What matters to me most at the moment is being able to make
calls between partitions absolutely reliably, with confidence
that partitions will start and stop when expected.
As soon as partitions go AWOL, not terminating or not starting,
hopes of a robust system fade :(
--
Adrian








* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-12 20:31 ` Dr. Adrian Wrigley
@ 2006-09-12 23:24   ` tmoran
  2006-09-13 11:00     ` Dr. Adrian Wrigley
  2006-09-13 11:21   ` Dr. Adrian Wrigley
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: tmoran @ 2006-09-12 23:24 UTC (permalink / raw)


I'm not familiar with the configuration control for the partitions,
so I have some questions:
  Why is there a "delay 1000.0;" in Amain in partition A?
  I see explicit starts for A and B - are they automatically
allowed to terminate when C ends?
  When C is explicitly started a second time what causes A and B
to still be present?




* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-12 20:40   ` Dr. Adrian Wrigley
@ 2006-09-13  7:16     ` Dmitry A. Kazakov
  0 siblings, 0 replies; 20+ messages in thread
From: Dmitry A. Kazakov @ 2006-09-13  7:16 UTC (permalink / raw)


On Tue, 12 Sep 2006 20:40:16 GMT, Dr. Adrian Wrigley wrote:

> What matters to me most at the moment is being able to make
> calls between partitions absolutely reliably, with confidence
> that partitions will start and stop when expected.
> As soon as partitions go AWOL, not terminating or not starting,
> hopes of a robust system fade :(

I don't think that CORBA is the right thing for that.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de




* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-12 23:24   ` tmoran
@ 2006-09-13 11:00     ` Dr. Adrian Wrigley
  0 siblings, 0 replies; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-13 11:00 UTC (permalink / raw)


On Tue, 12 Sep 2006 18:24:56 -0500, tmoran wrote:

> I'm not familiar with the configuration control for the partitions,
> so I have some questions:
>   Why is there a "delay 1000.0;" in Amain in partition A?
>   I see explicit starts for A and B - are they automatically
> allowed to terminate when C ends?
>   When C is explicitly started a second time what causes A and B
> to still be present?

The program terminates when all the partitions are ready to
terminate.  All the partitions with RCI and Remote Types units
have to run.  Partitions with normal units only can be run
any number of times (including not at all) - if started, they
have to complete before the program can terminate.

The Local_Termination policy on CPart allows the partition to
exit when it has nothing left to do.  Otherwise, it exits when
APart and BPart exit.

Without the delay 1000.0, the program terminates as soon as
APart and BPart have been started and completed.  This would
not give time for the user to start a CPart.  With the delay,
APart and BPart stay running for 1000 seconds, allowing
the user to run several CPart (perhaps concurrently).
When the time elapses and the CParts have completed, the
program terminates.
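
As a sketch of an alternative (not from the actual code; the names
Shutdown, Wait_For_Shutdown and Stop are made up), the fixed delay
could be replaced by an explicit shutdown call exported from the RCI
package, so that APart terminates on demand instead of after a fixed
interval:

==> a.ads (hypothetical extension) <==
package A is

   pragma Remote_Call_Interface;
   function Next return Integer;
   procedure Shutdown;           -- a client calls this to end the run
   procedure Wait_For_Shutdown;  -- Amain blocks here instead of delaying

end A;

==> a.adb (hypothetical extension) <==
package body A is

   protected Stop is
      procedure Signal;
      entry Wait;
   private
      Stopped : Boolean := False;
   end Stop;

   protected body Stop is
      procedure Signal is
      begin
         Stopped := True;
      end Signal;
      entry Wait when Stopped is
      begin
         null;
      end Wait;
   end Stop;

   X : Integer := 0;

   function Next return Integer is
   begin
      X := X + 1;  -- Return next integer in sequence, unprotected
      return X;
   end Next;

   procedure Shutdown is
   begin
      Stop.Signal;
   end Shutdown;

   procedure Wait_For_Shutdown is
   begin
      Stop.Wait;
   end Wait_For_Shutdown;

end A;

Amain then contains just "A.Wait_For_Shutdown;" in place of
"delay 1000.0;".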

In the full program I have, there are several different
client partitions like CPart, but with different functionality.
I have several instantiations of the server like BPart
(which has normal units in my case).  APart provides
naming services allowing the BParts to be located.
It has been running with 3.15p for a year or two with
a separate watchdog to restart the server if it hangs.
Under GNAT GPL 2006, it works provided I avoid calling
different partitions in a chain.

Did you get the example running?
I use three separate windows - one for each partition, so
you can see any messages and watch for termination.
--
Adrian






* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-12 20:31 ` Dr. Adrian Wrigley
  2006-09-12 23:24   ` tmoran
@ 2006-09-13 11:21   ` Dr. Adrian Wrigley
  2006-09-21 21:18   ` Dr. Adrian Wrigley
  2006-09-22 13:52   ` Dr. Adrian Wrigley
  3 siblings, 0 replies; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-13 11:21 UTC (permalink / raw)


On Tue, 12 Sep 2006 20:31:55 +0000, Dr. Adrian Wrigley wrote:

> On Sun, 10 Sep 2006 20:58:33 +0000, Dr. Adrian Wrigley wrote:
> 
>> I've been having difficulty getting my Annex E/glade code to run reliably.
>> 
>> Under gnat 3.15p for x86 Linux, things were tolerably OK, with failures
>> of the code about weekly (running one instance continuously).
>> Sometimes the program simply wouldn't allow new partitions to run, as if
>> there was some boot server failure.  Sometimes the server would suddenly
>> start consuming all the CPU cycles it could get.
> ...
> 
> OK.  I have produced a fairly short example.
> There are three partitions, A, B, C.
> C calls B which calls A.
> Compiler is GNAT GPL 2006 + GLADE 2006 on x86 Linux
> 
> The partition C (executable in ./cpart) runs OK on *alternate*
> invocations.  Every other time, it hangs indefinitely.
> This seems strange.

(talking to myself again...)

If I change function Next in b.adb so that it doesn't call A
(returning a constant instead),  there are absolutely no problems.
Only when B.Next calls A does the deadlock happen.

There must be something in BPart that isn't completing properly
when running B.Next and calling into A.  Each time the call into
B hangs, it uses up a task.  It will create anonymous tasks
to replace them until the whole program grinds to a halt :(

I've tried using gdb on BPart to see what's going on.  From
what I can tell, the key code is in s-rpcser.adb, function
RPC_Handler.  On alternate occasions, it executes the
remote subprogram or just stops.  If I could see why this
happens, I might be able to fix it...
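
For reference, the experiment described above amounts to this change in
b.adb (my reconstruction; the constant returned is arbitrary):

==> b.adb (modified Next, no call into A) <==
   function Next return Integer is
   begin
      Text_IO.Put_Line ("B: B Next called");

      return 1;  -- constant instead of A.Next: no deadlock observed
   end Next;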
--
Adrian





* Re: Reliability and deadlock in Annex E/distributed code
@ 2006-09-15 21:24 Anh Vo
  2006-09-17 13:33 ` Dr. Adrian Wrigley
  0 siblings, 1 reply; 20+ messages in thread
From: Anh Vo @ 2006-09-15 21:24 UTC (permalink / raw)
  To: comp.lang.ada, Dr. Adrian Wrigley

I successfully compiled and ran your original code. The important thing
is the starting sequence of the partitions: cpart, bpart and apart
should be started in that order. Cpart terminates first. Then, after 1000
seconds, apart and bpart terminate. Therefore, the 1000-second delay in
amain.adb is unnecessary. If you comment out this delay, the run
completes quickly instead of taking approximately 17 minutes.

AV 

>>> "Dr. Adrian Wrigley" <amtw@linuxchip.demon.co.uk.uk.uk> 9/13/2006
4:21 AM >>>
On Tue, 12 Sep 2006 20:31:55 +0000, Dr. Adrian Wrigley wrote:

[..]
(talking to myself again...)

If I change function Next in b.adb so that it doesn't call A
(returning a constant instead),  there are absolutely no problems.
Only when B.Next calls A does the deadlock happen.

There must be something in BPart that isn't completing properly
when running B.Next and calling into A.  Each time the call into
B hangs, it uses up a task.  It will create anonymous tasks
to replace them until the whole program grinds to a halt :(

I've tried using gdb on BPart to see what's going on.  From
what I can tell, the key code is in s-rpcser.adb, function
RPC_Handler.  On alternate occasions, it executes the
remote subprogram or just stops.  If I could see why this
happens, I might be able to fix it...
--
Adrian





* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-15 21:24 Reliability and deadlock in Annex E/distributed code Anh Vo
@ 2006-09-17 13:33 ` Dr. Adrian Wrigley
  0 siblings, 0 replies; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-17 13:33 UTC (permalink / raw)


On Fri, 15 Sep 2006 16:24:34 -0500, Anh Vo wrote:

> I successfully compiled and ran your original code. The important thing
> is the starting sequence of the partitions: cpart, bpart and apart
> should be started in that order. Cpart terminates first. Then, after 1000
> seconds, apart and bpart terminate. Therefore, the 1000-second delay in
> amain.adb is unnecessary. If you comment out this delay, the run
> completes quickly instead of taking approximately 17 minutes.

Thank you very much for trying this!

The code runs fine for me with GNAT 3.15p on Linux.  Which version are
you using (gcc -v)?

The code runs fine on GNAT GPL 2006 too. But the second time cpart
is run, it hangs.  The third time it works and so on.

I start apart and bpart first as "servers".  If you take out the
delay in amain and run cpart first, it will work fine as you
describe.  But the "servers", apart and bpart will immediately
terminate.  The failure occurs if apart and bpart carry on
running, and a second cpart is invoked.

In my application it is important that I can run multiple
clients at the same time (like cpart).

Since I wrote my message last week, I have found that the failure
also occurs on alternate calls to B.Next within one invocation
of a partition.  I show a modified version of cmain.adb below,
which simply calls B.Next twice.  The partition outputs

C: Running  B.Next:
C: B Next gives 2
<hangs>

Do you get the same problem on your system with this code?

Thanks for your time!
--
Adrian


==> cmain.adb <==
with Text_IO;
with B;

-- Each time this program is run, should produce the next two integers
-- in sequence

procedure CMain is

begin

   Text_IO.Put_Line ("C: Running  B.Next:");
-- The next line works
   Text_IO.Put_Line ("C: B Next gives" & Integer'Image (B.Next));

-- but the next hangs on GNAT GPL 2006 and recent GLADEs
   Text_IO.Put_Line ("C: B Next gives" & Integer'Image (B.Next));

end CMain;
==>





* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-12 20:31 ` Dr. Adrian Wrigley
  2006-09-12 23:24   ` tmoran
  2006-09-13 11:21   ` Dr. Adrian Wrigley
@ 2006-09-21 21:18   ` Dr. Adrian Wrigley
  2006-09-22 13:52   ` Dr. Adrian Wrigley
  3 siblings, 0 replies; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-21 21:18 UTC (permalink / raw)


On Tue, 12 Sep 2006 20:31:55 +0000, Dr. Adrian Wrigley wrote:

> On Sun, 10 Sep 2006 20:58:33 +0000, Dr. Adrian Wrigley wrote:
> 
>> I've been having difficulty getting my Annex E/glade code to run reliably.
>> 
>> Under gnat 3.15p for x86 Linux, things were tolerably OK, with failures
>> of the code about weekly (running one instance continuously).
>> Sometimes the program simply wouldn't allow new partitions to run, as if
>> there was some boot server failure.  Sometimes the server would suddenly
>> start consuming all the CPU cycles it could get.
> ...
> 
> OK.  I have produced a fairly short example.
> There are three partitions, A, B, C.
> C calls B which calls A.
> Compiler is GNAT GPL 2006 + GLADE 2006 on x86 Linux
> 
> The partition C (executable in ./cpart) runs OK on *alternate*
> invocations.  Every other time, it hangs indefinitely.
> This seems strange.
...

Just a quick update on (non) progress...

OK.  The test case is now shorter and very easy to run,
failing with just one partition used.

Simply gnatchop the text below, and paste the dist2.cfg into its own file.

# compile
gnatdist -g dist2.cfg

# run
./apart

B: B Next called
C: B Next gives 1
<hangs>

The program should output numbers up to 100 and exit.

So far, it fails on:  GNAT GPL 2005, GNAT GPL 2006,  GNAT 4.1.1
  (using corresponding glade distributions)
and it succeeds on:   GNAT 3.15p
on FC5,  Red Hat 8.0 and knoppix (debian) (arch i386/i686)

The problem is that 3.15p glade fails in other, more interesting
ways, some of which have since been fixed.  RACW calls are proving
particularly problematic.


------ gnatchop-able text follows
with Text_IO;
with B;
with A;

-- Each time this program is run, should produce the next integer in sequence

procedure CMain is

begin

   for I in 1 .. 100 loop
      Text_IO.Put_Line ("C: B Next gives" & Integer'Image (B.Next));
   end loop;

end CMain;


package body A is

   X : Integer := 0;

   function Next return Integer is
   begin
      X := X + 1; -- Return next integer in sequence, unprotected
      return X;
   end Next;

end A;


package A is

   pragma Remote_Call_Interface;
-- The next line causes failure.  Without it,
-- the calls are local and succeed without problem; also see b.ads
   pragma All_Calls_Remote;
   function Next return Integer;

end A;



with Text_IO;
with A;

package body B is

-- Return A.Next simply by passing call through
   function Next return Integer is
   begin
      Text_IO.Put_Line ("B: B Next called");

      return A.Next;
   end Next;

end B;

package B is

   pragma Remote_Call_Interface;

-- The next line causes failure.  Without it,
-- the calls are local and succeed without problem
   pragma All_Calls_Remote;

   function Next return Integer; -- Pass through of A's Next

end B;
-----end gnatchop-able text

-- Configuration file dist2.cfg
configuration Dist2 is

-- Boot server specification:
  pragma Starter (None);
  pragma Boot_Location ("tcp", "localhost:6788"); -- Choose spare port

  APart : Partition := (A, B, CMain);
  procedure CMain is in APart;
  for APart'Task_Pool use (2, 40, 60);

end Dist2;
-------------END------------





* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-12 20:31 ` Dr. Adrian Wrigley
                     ` (2 preceding siblings ...)
  2006-09-21 21:18   ` Dr. Adrian Wrigley
@ 2006-09-22 13:52   ` Dr. Adrian Wrigley
  2006-09-22 23:11     ` Ludovic Brenta
  3 siblings, 1 reply; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-22 13:52 UTC (permalink / raw)


On Tue, 12 Sep 2006 20:31:55 +0000, Dr. Adrian Wrigley wrote:

> On Sun, 10 Sep 2006 20:58:33 +0000, Dr. Adrian Wrigley wrote:
> 
>> I've been having difficulty getting my Annex E/glade code to run reliably.
>> 
>> Under gnat 3.15p for x86 Linux, things were tolerably OK, with failures
>> of the code about weekly (running one instance continuously).
>> Sometimes the program simply wouldn't allow new partitions to run, as if
>> there was some boot server failure.  Sometimes the server would suddenly
>> start consuming all the CPU cycles it could get.

...

Building the GNAT GPL 2006 GLADE, I find that the examples don't build
correctly, and a couple that do build crash:

The two source files I used are:

MD5 Sum
6504bed94037ac5ccc9e80f1831104f8  gnat-gpl-2006-i686-gnu-linux-libc2.3-bin.tar.gz
8ed3151978111ce6c26a857d3d4642ed  tools/glade/glade-gpl-2006-src.tgz

The examples directory Examples/MultiPro fails to build.

On line 636 of glade-2006-src/Examples/MultiPro/s-gaprxy.adb a call is made:

Soft_Links.Set_Stamp (From_SEA (Data));

This appears to call a procedure declared at line 271 of glade-2006-src/Glade/s-gasoli.ads:

--    procedure Set_Stamp (S : Float);

As you can see, the declaration is commented out; the build therefore fails.

Is there a more recent version of these files which will build?

I have tested all the examples that build on three different machines.
Examples/Eratho/dynamic and Examples/Eratho/cycle fail in deadlock
on Fedora Core 5, Debian and Red Hat distributions.

Has anyone here managed to run these examples?

I have emailed gnat-gpl@adacore.com
--
Adrian





* Re: Reliability and deadlock in Annex E/distributed code
  2006-09-22 13:52   ` Dr. Adrian Wrigley
@ 2006-09-22 23:11     ` Ludovic Brenta
  2006-09-23 16:03       ` Reliability and deadlock in Annex E/distributed code (progress at last!) Dr. Adrian Wrigley
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Brenta @ 2006-09-22 23:11 UTC (permalink / raw)


Dr. Adrian Wrigley writes:
[glade 2006 examples]
> Has anyone here managed to run these examples?

I haven't gotten around to packaging glade 2006 for Debian yet, but
it's on my list.  Your input is very valuable to me.

> I have emailed gnat-gpl@adacore.com

Please keep me posted.

-- 
Ludovic Brenta.




* Re: Reliability and deadlock in Annex E/distributed code (progress at last!)
  2006-09-22 23:11     ` Ludovic Brenta
@ 2006-09-23 16:03       ` Dr. Adrian Wrigley
  2006-09-23 19:17         ` Björn Persson
  0 siblings, 1 reply; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-23 16:03 UTC (permalink / raw)


On Sat, 23 Sep 2006 01:11:14 +0200, Ludovic Brenta wrote:

> Dr. Adrian Wrigley writes:
> [glade 2006 examples]
>> Has anyone here managed to run these examples?
> 
> I haven't gotten around to packaging glade 2006 for Debian yet, but
> it's on my list.  Your input is very valuable to me.
> 
>> I have emailed gnat-gpl@adacore.com
> 
> Please keep me posted.

The Glade test case MultiPro does not build in GPL 2006.
This is a known problem.

The deadlock is in Examples/Eratho/dynamic and Examples/Eratho/spiral
(cycle and absolute work fine <I got this the wrong way round in my last
message>). This seems to be the same problem as the deadlock in the simple
test case elsewhere in this thread. It doesn't seem to be a known problem.

There is a *separate* deadlock issue when running multiple
partitions doing RACW calls into different partitions, not the
boot partition.  This problem is also shown by Glade 3.15p and
all other versions I have tried.  I have a test case for this
if anyone is interested (a bit longer than the other).

Finally, however, there is good news (for me that is)!

I have found that by removing all the cases of RCI calls
into other RCI units, I can avoid the deadlock that I have been
struggling with in recent Glades.

That just leaves the intermittent deadlock when doing RACW
calls from two different partitions into a third.
This problem is solved by putting the RACW unit into the boot partition.

By making the RACW unit (a server) into a generic package, I
can instantiate it from library level for each server instance
that I need (I just need a few).  The instantiations can then
all be configured into the root partition.
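For illustration, the workaround described above might look roughly like
this. It is only a sketch under assumed names (Generic_Server, Server_1,
Get_Value are all invented, since the original units aren't shown): a
generic package carries the Remote_Types categorization and declares the
RACW type, and each library-level instantiation becomes a unit that the
configuration file can then assign to the boot partition.

```ada
--  Sketch only: every name here is hypothetical.
--  A generic Remote_Types unit declaring the RACW server type:
generic
package Generic_Server is
   pragma Remote_Types;

   type Server is tagged limited private;
   type Server_Ptr is access all Server'Class;  --  the RACW type

   function Get_Value (S : access Server) return Integer;

private
   type Server is tagged limited record
      Value : Integer := 0;
   end record;
end Generic_Server;

--  Library-level instantiations, one per server instance needed.
--  These are the units that get listed in the boot partition:
with Generic_Server;
package Server_1 is new Generic_Server;

with Generic_Server;
package Server_2 is new Generic_Server;
```

Access values of Server_Ptr would still be handed out to the other
partitions in the usual way, e.g. through an RCI registry function.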

So I have managed to make the code run absolutely reliably.
In particular, running multiple clients concurrently and
killing clients prematurely have caused *no* anomalous
behaviour.  This is a great improvement over the same code
running on 3.15p, which would hang sporadically.

The only functionality lost is the ability to instantiate
servers dynamically.  Whenever the servers are in separate
partitions, the failures occur.

The next step is to get the GtkAda components working,
and then to get the 64-bit and 32-bit partitions built
and working together (this works nicely, AFAICT).

Thanks to everyone who has made suggestions here and by email!
--
Adrian






* Re: Reliability and deadlock in Annex E/distributed code (progress at last!)
  2006-09-23 16:03       ` Reliability and deadlock in Annex E/distributed code (progress at last!) Dr. Adrian Wrigley
@ 2006-09-23 19:17         ` Björn Persson
  2006-09-23 20:53           ` Dr. Adrian Wrigley
  0 siblings, 1 reply; 20+ messages in thread
From: Björn Persson @ 2006-09-23 19:17 UTC (permalink / raw)


Dr. Adrian Wrigley wrote:
> So I have managed to make the code run absolutely reliably.

Congratulations! You deserve some success after all the trouble you've had.

-- 
Björn Persson                              PGP key A88682FD
                    omb jor ers @sv ge.
                    r o.b n.p son eri nu




* Re: Reliability and deadlock in Annex E/distributed code (progress at last!)
  2006-09-23 19:17         ` Björn Persson
@ 2006-09-23 20:53           ` Dr. Adrian Wrigley
  2006-09-23 22:21             ` Björn Persson
  2006-09-25 11:41             ` Alex R. Mosteo
  0 siblings, 2 replies; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-23 20:53 UTC (permalink / raw)


On Sat, 23 Sep 2006 19:17:51 +0000, Björn Persson wrote:

> Dr. Adrian Wrigley wrote:
>> So I have managed to make the code run absolutely reliably.
> 
> Congratulations! You deserve some success after all the trouble you've had.

Thank you.

I feel like Annex E Ada is a niche within a niche in programming.
That puts me in a tiny minority, even among readers of c.l.a.
Moving to 64-bit with GtkAda puts me in another minority.

I hope the relatively sparse responses to the thread on this
topic just reflect that these are minority Ada topics, and not
a reaction against someone with unreasonable demands or behaviour!

The goal of the software is ambitious.  To archive live stock market
data.  To analyse it mathematically and make useful predictions.
To display account data on multiple screens/computers at once.
To simulate various strategies on historic data.  To integrate
multiple brokerages and information sources, with redundancy.
To place stock orders with brokers autonomously.  Last but
not least, to make a decent living out of it.  This final
step has proved to be a tough challenge, but as the software
gets stronger, the results keep improving...

Without Annex E, I would have had to devise an alternative
client-server architecture.  It would certainly have been more
complex, perhaps involving specialised libraries, web servers
or CORBA.  So it still surprises me how underused (and
under-tested) it is.

Hopefully, the issues I have met with Glade will be
solved in GNAT GPL 2007.  But I'm not relying on this.

If I go quiet for a bit, it means it's working!
--
Adrian





* Re: Reliability and deadlock in Annex E/distributed code (progress at last!)
  2006-09-23 20:53           ` Dr. Adrian Wrigley
@ 2006-09-23 22:21             ` Björn Persson
  2006-09-23 23:31               ` tmoran
  2006-09-25 11:41             ` Alex R. Mosteo
  1 sibling, 1 reply; 20+ messages in thread
From: Björn Persson @ 2006-09-23 22:21 UTC (permalink / raw)


Dr. Adrian Wrigley wrote:
> I hope the relatively sparse responses to the thread on this
> topic just reflect that these are minority Ada topics, and not
> a reaction against someone with unreasonable demands or behaviour!

I don't see anything unreasonable in your posts. I've been following 
them with interest, but I haven't had anything to add as I've never even 
tried to use Annex E myself.

-- 
Björn Persson                              PGP key A88682FD
                    omb jor ers @sv ge.
                    r o.b n.p son eri nu




* Re: Reliability and deadlock in Annex E/distributed code (progress at last!)
  2006-09-23 22:21             ` Björn Persson
@ 2006-09-23 23:31               ` tmoran
  2006-09-24  0:19                 ` Dr. Adrian Wrigley
  0 siblings, 1 reply; 20+ messages in thread
From: tmoran @ 2006-09-23 23:31 UTC (permalink / raw)


>> topic just reflect that these are minority Ada topics, and not
>them with interest, but I haven't had anything to add as I've never even
>tried to use Annex E myself.
  It appears to me the problems have been in the vendor-specific part
that the ARM doesn't describe:  ARM 5(E) "The implementation shall provide means
for explicitly assigning library units to a partition and for the
configuring and execution of a program consisting of multiple partitions
on a distributed system; the means are implementation defined."




* Re: Reliability and deadlock in Annex E/distributed code (progress at last!)
  2006-09-23 23:31               ` tmoran
@ 2006-09-24  0:19                 ` Dr. Adrian Wrigley
  0 siblings, 0 replies; 20+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-24  0:19 UTC (permalink / raw)


On Sat, 23 Sep 2006 18:31:34 -0500, tmoran wrote:

>>> topic just reflect that these are minority Ada topics, and not
>>them with interest, but I haven't had anything to add as I've never even
>>tried to use Annex E myself.
>   It appears to me the problems have been in the vendor-specific part
> that the ARM doesn't describe:  ARM 5(E) "The implementation shall provide means
> for explicitly assigning library units to a partition and for the
> configuring and execution of a program consisting of multiple partitions
> on a distributed system; the means are implementation defined."

Glade has a perfectly satisfactory means of configuring partitions.
(The main complaint I have about gnatdist is that it returns a
SUCCESS exit code even when the compilation failed. :()
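For readers unfamiliar with gnatdist: the configuration language being
referred to is a small Ada-like language described in the GLADE User's
Guide.  A minimal sketch of the kind of file it processes (partition and
unit names are invented here, and the exact pragmas should be checked
against the examples shipped with Glade):

```ada
configuration Example is
   pragma Starter (None);        --  start each partition by hand

   --  Library units are assigned to partitions explicitly:
   Main_Part : Partition := (Server_1, Server_2);
   Client    : Partition := (Client_Unit);

   --  The program's main subprogram determines the main partition:
   procedure Prog_Main is in Main_Part;
end Example;
```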

The problems I have been encountering are in the partition
communication subsystem (PCS), I think.  The behaviour I've been
seeing fails to meet the requirements of Annex E semantics, AFAICT.
-- 
Adrian





* Re: Reliability and deadlock in Annex E/distributed code (progress at last!)
  2006-09-23 20:53           ` Dr. Adrian Wrigley
  2006-09-23 22:21             ` Björn Persson
@ 2006-09-25 11:41             ` Alex R. Mosteo
  1 sibling, 0 replies; 20+ messages in thread
From: Alex R. Mosteo @ 2006-09-25 11:41 UTC (permalink / raw)


Dr. Adrian Wrigley wrote:

> On Sat, 23 Sep 2006 19:17:51 +0000, Björn Persson wrote:
> 
>> Dr. Adrian Wrigley wrote:
>>> So I have managed to make the code run absolutely reliably.
>> 
>> Congratulations! You deserve some success after all the trouble you've
>> had.
> 
> Thank you.
> 
> I feel like Annex E Ada is a niche within a niche in programming.
> That puts me in a tiny minority, even of readers at c.l.a
> Moving to 64-bit with GtkAda puts me in another minority.
> 
> I hope the relatively sparse responses to the thread on this
> topic just reflect that these are minority Ada topics, and not
> a reaction against someone with unreasonable demands or behaviour!

I'm sure many people are interested in your findings. As you say, it seems
you're in a niche within a niche so for now you're a bit lonely there...




end of thread, other threads:[~2006-09-25 11:41 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
2006-09-10 20:58 Reliability and deadlock in Annex E/distributed code Dr. Adrian Wrigley
2006-09-11 18:52 ` Jerome Hugues
2006-09-12 20:40   ` Dr. Adrian Wrigley
2006-09-13  7:16     ` Dmitry A. Kazakov
2006-09-12 20:31 ` Dr. Adrian Wrigley
2006-09-12 23:24   ` tmoran
2006-09-13 11:00     ` Dr. Adrian Wrigley
2006-09-13 11:21   ` Dr. Adrian Wrigley
2006-09-21 21:18   ` Dr. Adrian Wrigley
2006-09-22 13:52   ` Dr. Adrian Wrigley
2006-09-22 23:11     ` Ludovic Brenta
2006-09-23 16:03       ` Reliability and deadlock in Annex E/distributed code (progress at last!) Dr. Adrian Wrigley
2006-09-23 19:17         ` Björn Persson
2006-09-23 20:53           ` Dr. Adrian Wrigley
2006-09-23 22:21             ` Björn Persson
2006-09-23 23:31               ` tmoran
2006-09-24  0:19                 ` Dr. Adrian Wrigley
2006-09-25 11:41             ` Alex R. Mosteo
  -- strict thread matches above, loose matches on Subject: below --
2006-09-15 21:24 Reliability and deadlock in Annex E/distributed code Anh Vo
2006-09-17 13:33 ` Dr. Adrian Wrigley
