comp.lang.ada
* Re: Reliability and deadlock in Annex E/distributed code
@ 2006-09-15 21:24 Anh Vo
  2006-09-17 13:33 ` Dr. Adrian Wrigley
  0 siblings, 1 reply; 13+ messages in thread
From: Anh Vo @ 2006-09-15 21:24 UTC
  To: comp.lang.ada, Dr. Adrian Wrigley

I successfully compiled and ran your original code. The important thing
is the starting sequence of the partitions: cpart, bpart and apart
should be started in that order. Cpart terminates first. Then, after
1000 seconds, apart and bpart terminate. The 1000-second delay in
amain.adb is therefore unnecessary. With this delay commented out, the
run completes quickly instead of taking approximately 17 minutes.
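
For reference, the change in amain.adb (a sketch; only the delay
statement itself is from your code, the surrounding procedure is my
assumption):

   procedure AMain is
   begin
      --  delay 1000.0;  --  unnecessary; with this commented out the
                         --  run finishes in seconds, not ~17 minutes
      null;              --  rest of the main procedure unchanged
   end AMain;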

AV 

>>> "Dr. Adrian Wrigley" <amtw@linuxchip.demon.co.uk.uk.uk> 9/13/2006
4:21 AM >>>
On Tue, 12 Sep 2006 20:31:55 +0000, Dr. Adrian Wrigley wrote:

[..]
(talking to myself again...)

If I change function Next in b.adb so that it doesn't call A
(returning a constant instead),  there are absolutely no problems.
Only when B.Next calls A does the deadlock happen.
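
For clarity, the two variants of B.Next being compared (a sketch; the
signature is a guess, and A.Value stands in for the parameterless
integer function in the RCI package on partition A):

   function Next return Integer is
   begin
      return A.Value;  --  nested remote call B -> A: hangs
   end Next;

   function Next return Integer is
   begin
      return 42;       --  constant instead: no problems at all
   end Next;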

There must be something in BPart that isn't completing properly
when running B.Next and calling into A.  Each time the call into
B hangs, it uses up a task, and the runtime creates anonymous tasks
to replace them until the whole program grinds to a halt :(

I've tried using gdb on BPart to see what's going on.  From
what I can tell, the key code is in s-rpcser.adb, function
RPC_Handler.  On alternate occasions, it executes the
remote subprogram or just stops.  If I could see why this
happens, I might be able to fix it...
--
Adrian




* Reliability and deadlock in Annex E/distributed code
@ 2006-09-10 20:58 Dr. Adrian Wrigley
  2006-09-11 18:52 ` Jerome Hugues
  2006-09-12 20:31 ` Dr. Adrian Wrigley
  0 siblings, 2 replies; 13+ messages in thread
From: Dr. Adrian Wrigley @ 2006-09-10 20:58 UTC


Hi guys!

I've been having difficulty getting my Annex E/glade code to run reliably.

Under gnat 3.15p for x86 Linux, things were tolerably OK, with the code
failing about once a week (running one instance continuously).
Sometimes the program simply wouldn't allow new partitions to run, as if
there was some boot server failure.  Sometimes the server would suddenly
start consuming all the CPU cycles it could get.

I think there may have been one or two bugs in the 3.15p version of glade,
particularly under certain error conditions of partitions being killed
while in use.

It has proved impossible to get the program to be as reliable as I want,
so I have tried running it under GNAT GPL 2006 from https://libre2.adacore.com/

I built and installed glade, and have been testing my code.  It doesn't
work at all.  I have also tried the gnat and glade from Martin Krischik's
builds for FC5, and got the same problem:

There are three partitions: A, B and C.
The program starts up normally.
A procedure (in a normal unit) in partition C calls a function (in a
normal package) in partition B, using dynamic dispatch on a remote access
to class-wide type.  The function in partition B calls a function (in an
RCI package) in partition A.  The function in partition A never executes,
and the program stops executing.
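
For concreteness, a minimal sketch of that structure (all unit and
subprogram names here are assumptions for illustration, not my actual
code):

   --  RCI package, assigned to partition A
   package A is
      pragma Remote_Call_Interface;
      function Value return Integer;  --  parameterless, returns Integer
   end A;

   --  Remote_Types package declaring the remote access-to-class-wide type
   package B_Types is
      pragma Remote_Types;
      type Item is abstract tagged limited private;
      type Item_Ref is access all Item'Class;  --  remote access type
      function Next (X : Item) return Integer is abstract;
   private
      type Item is abstract tagged limited null record;
   end B_Types;

   --  Concrete implementation living in partition B
   with B_Types;
   package B_Impl is
      type Impl is new B_Types.Item with null record;
      function Next (X : Impl) return Integer;
   end B_Impl;

   with A;
   package body B_Impl is
      function Next (X : Impl) return Integer is
      begin
         return A.Value;  --  the nested remote call B -> A that never runs
      end Next;
   end B_Impl;

   --  In partition C, with Ref : B_Types.Item_Ref designating an
   --  object created in partition B:
   N := B_Types.Next (Ref.all);  --  remote dispatching call into B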

If I enable glade debugging (S_PARINT=true S_RPC=true), I can see that
partition A gets the RPC message instructing it to do the call, but
then it doesn't actually call the necessary function (a parameterless
function returning an integer, though any function call fails the same way).

If I call the function in A directly from B, it works fine.  The problem
only seems to occur when A is called from B while B is itself executing
a call from C.

It's as if there is some deadlock or shortage of tasks to allocate or something.

Any ideas?

Using gdb, I find that each time a call into A is made but doesn't execute,
I get a new task:

(gdb) info tasks
...
* 12   81d16a8    1  46 Waiting on entry call  rpc_handler
(gdb) where
#0  0x42028d69 in sigsuspend () from /lib/i686/libc.so.6
#1  0x4005b108 in __pthread_wait_for_restart_signal () from /lib/i686/libpthread.so.0
#2  0x4005804b in pthread_cond_wait () from /lib/i686/libpthread.so.0
#3  0x080c5172 in system.tasking.entry_calls.wait_until_abortable ()
#4  0x080c29c4 in system.tasking.protected_objects.operations.protected_entry_call ()
#5  0x080b0afa in system.rpc.server.rpc_handler (<_task>=0x81d1698) at s-tpobop.ads:200
#6  0x080bed4f in system.tasking.stages.task_wrapper ()

It looks like the call cannot proceed until "wait_until_abortable" returns.

Am I doing something wrong by making one remote call inside another?
Maybe the new glade detects an error unnoticed by 3.15p?  Perhaps this
is the cause of previous 'hangs'?

On the topic of Annex E support:

I've tried building PolyORB from https://libre2.adacore.com/ but it seems
to be missing the src/dsa directory needed to support Annex E.  If I build
the version from cvs, it fails with "raised RTSFIND.RE_NOT_AVAILABLE : rtsfind.adb:497".
What's the best way to build the DSA personality?

If I use the DSA personality from PolyORB, will it behave any differently
from the GARLIC PCS?  Might it be more or less robust?  Faster?

Thanks in advance for any input!
--
Dr. Adrian Wrigley, Cambridge, UK.






Thread overview: 13+ messages
2006-09-15 21:24 Reliability and deadlock in Annex E/distributed code Anh Vo
2006-09-17 13:33 ` Dr. Adrian Wrigley
2006-09-10 20:58 Dr. Adrian Wrigley
2006-09-11 18:52 ` Jerome Hugues
2006-09-12 20:40   ` Dr. Adrian Wrigley
2006-09-13  7:16     ` Dmitry A. Kazakov
2006-09-12 20:31 ` Dr. Adrian Wrigley
2006-09-12 23:24   ` tmoran
2006-09-13 11:00     ` Dr. Adrian Wrigley
2006-09-13 11:21   ` Dr. Adrian Wrigley
2006-09-21 21:18   ` Dr. Adrian Wrigley
2006-09-22 13:52   ` Dr. Adrian Wrigley
2006-09-22 23:11     ` Ludovic Brenta
