Annex E, GLADE and fault-tolerance

comp.lang.ada

* Annex E, GLADE and fault-tolerance
From: Dr. Adrian Wrigley @ 2003-08-26 14:44 UTC


Hi all!

OK.  So I've got my client/server application working now, but have a problem
with fault tolerance. (I am using Annex E/GNAT 3.15/GLADE/Linux)

My objective is for clients to be able to run continuously for months at a time.
I also want the server to run for months too.  If a client terminates, I want
to be able to restart it.  If the server terminates, I want to be able to
restart it too.

So I use the "Reconnection" and "Termination" policy so that I can restart the
client or the server as necessary.  If the main boot server is separate from
the client and application server partition, it works fine.  Except when the
main boot server is restarted :(

When the main boot server is terminated and restarted, executing clients run into
problems.  The existing clients cannot communicate with the new main boot
server invocation, nor with partitions started after the new main server is started.

The effect is that if the main boot server dies, all the clients need to be
restarted too. I have read the GLADE Users' Guide carefully, and can't see a
solution to the problem - it seems to be a design feature(!)

How can I prevent failure of the server causing failure of all the clients too?

I just want the "simple" behaviour like web servers and clients, where you don't
need to restart all the clients each time a server reboots.

Thanks for any help on this!
-- 
Dr Adrian Wrigley, Cambridge, England.

Clients get the following exception, even after restarting the server:
Exception name        : SYSTEM.RPC.COMMUNICATION_ERROR
Exception message     : Partition 1 is unreachable
Exception information : Exception name: SYSTEM.RPC.COMMUNICATION_ERROR





* Re: Annex E, GLADE and fault-tolerance
From: Francisco Javier Loma Daza @ 2003-08-27 15:26 UTC


"Dr. Adrian Wrigley" <amtw@linuxchip.demon.co.uk.uk.uk.uk> wrote in message news:<%mK2b.4892$L15.72@newsfep4-winn.server.ntli.net>...
> Hi all!
> 
> OK.  So I've got my client/server application working now, but have a problem
> with fault tolerance. (I am using Annex E/GNAT 3.15/GLADE/Linux)
> 
> My objective is for clients to be able to run continuously for months at a time.
> I also want the server to run for months too.  If a client terminates, I want
> to be able to restart it.  If the server terminates, I want to be able to
> restart it too.
> 
> So I use the "Reconnection" and "Termination" policy so that I can restart the
> client or the server as necessary.  If the main boot server is separate from
> the client and application server partition, it works fine.  Except when the
> main boot server is restarted :(
> 
> When the main boot server is terminated and restarted, executing clients run into
> problems.  The existing clients cannot communicate with the new main boot
> server invocation, nor with partitions started after the new main server is started.
> 
> The effect is that if the main boot server dies, all the clients need to be
> restarted too. I have read the GLADE Users' Guide carefully, and can't see a
> solution to the problem - it seems to be a design feature(!)
> 
> How can I prevent failure of the server causing failure of all the clients too?
> 
> I just want the "simple" behaviour like web servers and clients, where you don't
> need to restart all the clients each time a server reboots.
> 
> Thanks for any help on this!

Can you make a trivial passive partition on which to install the boot
server? Since this partition will not contain any code, it can be assumed
that it will not die unexpectedly. If you are still not comfortable with
that, you can make another partition (the real server, for example) a boot
mirror with the boot_mirror command-line option. The real server can be
another partition that can be terminated on its own, leaving the passive
boot partition running.
I have had no time to test this, but I hope it helps.




* Re: Annex E, GLADE and fault-tolerance
From: Dr. Adrian Wrigley @ 2003-08-31 13:00 UTC

Francisco Javier Loma Daza wrote:

 > Can you make a trivial passive partition on which to install the boot
 > server? Since this partition will not contain any code, it can be assumed
 > that it will not die unexpectedly. If you are still not comfortable with
 > that, you can

I don't think the boot server can be passive - doesn't it run code to
allocate partition IDs and control termination?

 > make another partition (the real server, for example) a boot
 > mirror with the boot_mirror command-line option. The real server can be
 > another partition that can be terminated on its own, leaving the passive
 > boot partition running.

I have adopted a tactic like this.  The boot server runs continuously,
and since it does very little, is robust.  All other partitions can
be restarted.

The real server is in a separate partition, that will be restarted, if
necessary (for software upgrades, or program error).

The question of boot mirrors isn't very important in my application,
where only one machine is involved at present.  I have been unable to
get boot_mirror to work for me - maybe I don't understand it properly.

Unfortunately, separating the boot server and the real server hits
a serious bug, presumably in the PCS.  I shall post this separately.

 > I have had no time to test this, but I hope it helps.

Thank you for your suggestions.
--
Dr. Adrian Wrigley, Cambridge


* Re: Annex E, GLADE and fault-tolerance (GLADE bug???)
From: Dr. Adrian Wrigley @ 2003-08-31 13:27 UTC


OK.  So I have rewritten my application so that the boot server and
application server are in separate partitions.

I then expect to make connections from each client to the application
server.  Everything is fine... for a while.

But eventually, the whole thing crashes - either with unexpected exceptions
in "the wrong partition", or things simply hang.  I have reduced this to
a simple test case.

Does *anyone* out there in Ada land use Annex E/GLADE?

I have found the experience a little frustrating (because I am using
closed-source library code with serious threading bugs).  However, I was
really pleased with how straightforward the whole process is - assuming
the tools work properly.

I have included the GLADE failure test case below.
Hopefully the comments explain what is happening...

Thanks for any input.
--
Dr. Adrian Wrigley, Cambridge, England.



---------------------------- configuration file ----------------------
-- This configuration demonstrates an apparent bug somewhere in the
-- GLADE runtime, or perhaps the OS
--
-- It has been built using GNAT 3.15p and GLADE 3.15p
-- on Intel Linux kernel version 2.4.18 (RedHat 7.3)
-- also fails on RedHat 8.0
--
-- To build, use "gnatdist dist.cfg"
-- This will make three executables:
--    mybootpartition     -- the boot server
--    serveur             -- the application server
--    client              -- a client of the application
--
-- The failure is demonstrated as follows:
--
-- 1) Open (at least) four terminals on the same machine
-- 2) run "./mybootpartition" on the first terminal
-- 3) run "./serveur" on the second
-- 4) run "./soaktest" on both the third and fourth (and others)
--
-- What should happen?
-- each soaktest should output
-- Hello started
-- Hello World! 1
-- Hello World! 2
-- ...
-- up to 1000
-- before repeating
--
-- What does happen?
-- Erratic behaviour, including
-- various exceptions "raised SYSTEM.RPC.COMMUNICATION_ERROR : Partition 2 is
-- unreachable"
-- lockup
--
-- Why?
-- It seems that communication between partitions does not always
-- go to the right invocation of the client
-- Sometimes, stopping one soaktest crashes the other

configuration Dist is

-- Boot server specification:
   pragma Starter (None);
   pragma Boot_Location ("tcp", "localhost:5926");

-- The boot partition
   MyBootPartition : Partition := (BootServ);
   procedure MainProc is in MyBootPartition;

-- The server
   Serveur : Partition := (Message);

   procedure Hello;
   Client : Partition;
   for Client'Main use Hello;

end Dist;
------------------------- soaktest script ---------------------
#!/bin/bash
# soaktest script

while :
do
   ./client
done
------------------------- gnatchop this below -----------------
package BootServ is

    pragma Remote_Call_Interface;

end BootServ;
procedure MainProc is
begin

    loop
       delay 5.0;
    end loop;

end MainProc;
package Message is

    pragma Remote_Call_Interface;

    function Hello_World return String;

end Message;
package body Message is

    function Hello_World return String is
    begin
       return "Hello World!";
    end Hello_World;

end Message;
with Message;
with Text_IO;

procedure Hello is
begin

    Text_IO.Put_Line ("Hello started");

    for I in 1 .. 1000 loop
       Text_IO.Put_Line (Message.Hello_World & Integer'Image (I));
    end loop;

end Hello;
-------------------------
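
A possible variant of the Hello client above, shown only as a sketch: it
catches System.RPC.Communication_Error around each remote call and retries
a few times instead of letting the exception terminate the client.  The
names Hello_Retry, Max_Attempts and Retry_Delay are hypothetical and do not
come from the thread.  Whether a retried call can ever succeed depends on
the PCS re-establishing the connection (for instance under the partition's
Reconnection policy); this sketch does not address the boot server problem
reported in this thread, it only keeps the client alive while it waits.

```ada
with Ada.Text_IO;
with Message;      --  the RCI package from the test case above
with System.RPC;   --  declares Communication_Error (Ada RM E.5)

procedure Hello_Retry is

   --  Hypothetical tuning values, chosen only for illustration.
   Max_Attempts : constant := 5;
   Retry_Delay  : constant Duration := 2.0;

begin

   Ada.Text_IO.Put_Line ("Hello started");

   for I in 1 .. 1000 loop

      Attempts :
      for Attempt in 1 .. Max_Attempts loop
         begin
            --  Remote call; raises Communication_Error when the
            --  partition holding Message cannot be reached.
            Ada.Text_IO.Put_Line
              (Message.Hello_World & Integer'Image (I));
            exit Attempts;  --  success, go on to the next I
         exception
            when System.RPC.Communication_Error =>
               Ada.Text_IO.Put_Line
                 ("Server unreachable, attempt" & Integer'Image (Attempt));
               if Attempt = Max_Attempts then
                  raise;  --  out of attempts, propagate the failure
               end if;
               delay Retry_Delay;  --  wait before trying again
         end;
      end loop Attempts;

   end loop;

end Hello_Retry;
```

To try it with the configuration above, Hello_Retry would replace Hello as
the main procedure of the Client partition.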



