comp.lang.ada
* Distributed Ada, robustness etc.
@ 2006-05-23 12:14 Dr. Adrian Wrigley
  2006-05-25  1:12 ` Dr. Adrian Wrigley
  0 siblings, 1 reply; 11+ messages in thread
From: Dr. Adrian Wrigley @ 2006-05-23 12:14 UTC (permalink / raw)


Up until now, I have been using fairly elementary Annex E features
with GNAT/GLADE on Linux.

I have a client/server model, where the clients are uncategorized
units each in their own partition, and may be multiply invoked.  The
one server is in a Remote_Call_Interface unit, and is invoked once
when the application starts.

I have a couple of problems:

1)  I need to use multiple server partitions now (spec change!).
2)  If the server node or partition fails, the application fails.

I understand that a partition with RCI units can only be invoked once
and cannot be restarted. This is a single point of failure :(
For flexibility and reliability, I think I have to avoid RCI units.

If I changed the RCI unit into a Remote_Types (RT) unit, each partition
would get its own copy.  That is ruled out, since the server has
to access a single, common resource.

If I changed the server RCI into a normal unit, I could invoke it once
when starting up.  If the partition or node failed, it could be
restarted.  But how could I call the server in one partition from
another?  This is just like the example in ARM E.4.2, which uses
an RCI as a name server, holding access values to a class-wide,
abstract, tagged limited private type.  This would solve (1), because I
could invoke multiple servers (each with its own single, common
resource).  But it wouldn't solve (2), because there would still
be a single point of failure in the RCI of the name server.
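
For concreteness, the scheme I have in mind is essentially the ARM E.4.2
example, condensed (names follow the ARM, not my actual code):

package Tapes is
   pragma Remote_Types (Tapes);
   type Tape is abstract tagged limited private;
   -- primitive dispatching operations, with Tape as controlling operand
   procedure Rewind (T : access Tape) is abstract;
private
   type Tape is abstract tagged limited null record;  -- full view elided in the ARM
end Tapes;

with Tapes;
package Name_Server is
   pragma Remote_Call_Interface (Name_Server);
   -- dynamic binding to remote operations goes via this
   -- remote access-to-class-wide type
   type Tape_Ptr is access all Tapes.Tape'Class;
   procedure Register (Name : in String; T : in Tape_Ptr);
   function  Find     (Name : in String) return Tape_Ptr;
end Name_Server;

Each server partition derives its own type from Tape and calls Register at
elaboration; clients call Find and then dispatch.  The Name_Server partition
is the one piece that cannot be restarted.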

Is it possible to achieve full redundancy/restartability in a
distributed application with a client/server architecture using
Annex E?

One basis for a solution would be if the servers could broadcast
their registration to all the interested partitions.  But I don't
see any facility to "broadcast" in Annex E.  And I don't find
anything for enumerating active partitions in an application.
Can this be done?  Basically, it is the "access value" for the
server that the client needs before it can connect, but this
can only be communicated through a "third party" known to
both units.

On a related topic, are there any facilities in Annex E which result
in partitions being instantiated dynamically?  At the moment, I start
multiple instances of a partition manually (from a shell).

For the moment, I will have to use the "name server" approach
of E.4.2.  But as an application increases in scale and utility,
it becomes increasingly important to maintain robustness.

Annex E is amazingly easy to use, considering the power and
flexibility it has (is it still Annex E in the new standard?).
But it does seem to have its limitations.  I haven't looked
into any of the alternatives, and I still haven't studied
PolyORB to see what extra capability it might have.

Any thoughts?
--
Adrian





* Re: Distributed Ada, robustness etc.
  2006-05-23 12:14 Distributed Ada, robustness etc Dr. Adrian Wrigley
@ 2006-05-25  1:12 ` Dr. Adrian Wrigley
  2006-05-25 10:34   ` Dmitry A. Kazakov
  2006-05-29  0:55   ` Dr. Adrian Wrigley
  0 siblings, 2 replies; 11+ messages in thread
From: Dr. Adrian Wrigley @ 2006-05-25  1:12 UTC (permalink / raw)


On Tue, 23 May 2006 12:14:05 +0000, Dr. Adrian Wrigley wrote:

> Up until now, I have been using fairly elementary Annex E features
> with GNAT/GLADE on Linux.
<snip>

Hmm.  Seems to have gone quiet round here!

OK. I've prototyped a system based on LRM E.4.2 (p. 412), where a
Remote_Call_Interface unit registers servers as they are instantiated.

This will work nicely, except for the single point of failure
issue resulting from having RCI units, and the following nuisance:

EITHER
every subprogram declaration using the remote dispatching type
has to refer to it as "access Tape" (or whatever the type is called),

procedure Rewind (T : access Tape) is abstract; -- need to add "access"
...
TapeAccess := Find ("NINE-TRACK");
...
Rewind (TapeAccess);

Or

every dispatching call needs to dereference an access variable
(so calls become something like "Tapes.Rewind (TapeAccess.all);")

procedure Rewind (T : Tape) is abstract;
...
TapeAccess := Find ("NINE-TRACK");
...
Rewind (TapeAccess.all); -- Need to add ".all" for every call!

I'd rather not change all my code to say "access" or "all" every
time I define or use one of these calls (lazy).  The only solution
I have come up with to avoid modifying the existing code to have
"all" with every call is to define a corresponding set of subprograms
which take the access values as parameters, and call the underlying
(remote) dispatching interface by dereferencing it in the body.

So the code would become:

procedure Rewind (T : Tape) is abstract; -- unchanged code
procedure Rewind (T : access Tape) is begin Rewind (T.all); end Rewind;
...
TapeAccess := Find ("NINE-TRACK");
...
Rewind (TapeAccess); -- unchanged code

Am I missing a better, more obvious solution?

(interestingly, the remote dispatching call may fail from
communications errors or other failure.  The corresponding
non-remote call can handle the exception by using the "Find"
to get another server to call instead, or take other
recovery actions.  Perhaps this is the way to go...)
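
To be explicit, the non-remote wrapper I have in mind would look roughly
like this (TapeAccess_T and the retry policy are made up for illustration):

procedure Rewind (T : in out TapeAccess_T) is
begin
   Rewind (T.all);               -- the real (remote) dispatching call
exception
   when others =>                -- normally a SYSTEM.RPC comms exception
      T := Find ("NINE-TRACK");  -- ask the name server for a surviving server
      Rewind (T.all);            -- a single retry; a bounded loop would be safer
end Rewind;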

I'm still thinking about eliminating the RCI failure point.

I'm wondering whether Annex E has ever been used as a basis
for a big commercial project...
--
Adrian






* Re: Distributed Ada, robustness etc.
  2006-05-25  1:12 ` Dr. Adrian Wrigley
@ 2006-05-25 10:34   ` Dmitry A. Kazakov
  2006-05-29  0:55   ` Dr. Adrian Wrigley
  1 sibling, 0 replies; 11+ messages in thread
From: Dmitry A. Kazakov @ 2006-05-25 10:34 UTC (permalink / raw)


On Thu, 25 May 2006 01:12:08 GMT, Dr. Adrian Wrigley wrote:

> On Tue, 23 May 2006 12:14:05 +0000, Dr. Adrian Wrigley wrote:
> 
>> Up until now, I have been using fairly elementary Annex E features
>> with GNAT/GLADE on Linux.
> <snip>
> 
> Hmm.  Seems to have gone quiet round here!
> 
> OK. I've prototyped a system based on LRM E.4.2 (p. 412), where a
> Remote_Call_Interface unit registers servers as they are instantiated.
> 
> This will work nicely, except for the single point of failure
> issue resulting from having RCI units, and the following nuisance:
> 
> EITHER
> every subprogram declaration using the remote dispatching type
> has to refer to it as "access Tape" (or whatever the type is called),
> 
> procedure Rewind (T : access Tape) is abstract; -- need to add "access"
> ...
> TapeAccess := Find ("NINE-TRACK");
> ...
> Rewind (TapeAccess);
> 
> Or
> 
> every dispatching call needs to dereference an access variable
> (so calls become something like "Tapes.Rewind (TapeAccess.all);")
> 
> procedure Rewind (T : Tape) is abstract;
> ...
> TapeAccess := Find ("NINE-TRACK");
> ...
> Rewind (TapeAccess.all); -- Need to add ".all" for every call!
> 
> I'd rather not change all my code to say "access" or "all" every
> time I define or use one of these calls (lazy).  The only solution
> I have come up with to avoid modifying the existing code to have
> "all" with every call is to define a corresponding set of subprograms
> which take the access values as parameters, and call the underlying
> (remote) dispatching interface by dereferencing it in the body.
> 
> So the code would become:
> 
> procedure Rewind (T : Tape) is abstract; -- unchanged code
> procedure Rewind (T : access Tape) is begin Rewind (T.all); end Rewind;
> ...
> TapeAccess := Find ("NINE-TRACK");
> ...
> Rewind (TapeAccess); -- unchanged code
> 
> Am I missing a better, more obvious solution?

Basically I have the same problem with other, similar things; I deal with it
using smart pointers. A smart pointer cannot be a perfect substitute for the
thing it refers to, because you have to dereference it here and there.
Usually I make wrappers like you did, to delegate calls. It is a big
nuisance. And it does not work properly when you have a type hierarchy
and wish to have a corresponding hierarchy of pointers.

This is rather a language problem. Alas, Ada has no support for either
delegation or parallel type hierarchies.

> (interestingly, the remote dispatching call may fail from
> communications errors or other failure.  The corresponding
> non-remote call can handle the exception by using the "Find"
> to get another server to call instead, or take other
> recovery actions.  Perhaps this is the way to go...)

You have to decide whether objects are distributed or not. I would provide
both. Sometimes I would like to know the exact location of an object and
be sure that it is the one I am talking with. In other cases the object is
a clique of objects that provides a service, and I don't care which member
does the work. But the clique is itself an object, though distributed
across partitions.

> I'm wondering whether Annex E has ever been used as a basis
> for a big commercial project...

Firstly, one would have to get a big project for Ada (:-)). Then I'd like to
see remote entry calls; I don't know how that stands in Ada 2005.

[rant on]
In my application domain (automotive automation), it is very important to be
able to get a grip on the transport layer of Annex E. I haven't looked at
GLADE, and I don't know how deeply it is tangled with the compiler. But we
need the transport layer to talk our protocols. Some of them are multicast:
Ethernet-based PGM, or FlexRay, CAN and LIN-bus based protocols. That
could make synchronous RCIs difficult, as could RCI as a paradigm (it is
point-to-point). I have puzzled for some time over what an entry call to
multiple tasks, or a multiple protected action, could mean. It seems to go
in the direction of multiple dispatch anyway. Plain Java-style programming
in Ada makes little sense to me.
[rant off]

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de




* Re: Distributed Ada, robustness etc.
  2006-05-25  1:12 ` Dr. Adrian Wrigley
  2006-05-25 10:34   ` Dmitry A. Kazakov
@ 2006-05-29  0:55   ` Dr. Adrian Wrigley
  2006-05-30 15:11     ` Dr. Adrian Wrigley
  1 sibling, 1 reply; 11+ messages in thread
From: Dr. Adrian Wrigley @ 2006-05-29  0:55 UTC (permalink / raw)


On Thu, 25 May 2006 01:12:08 +0000, Dr. Adrian Wrigley wrote:

> <snip>
> 
> Hmm.  Seems to have gone quiet round here!

perhaps it's the long weekend...
(...continuing the monolog)

Anyway, it's all working nicely.  But for one small snag:

The client partitions usually complete when there are no
more active tasks in the partition.  I used:

  for myclient'Termination use Local_Termination;

so that clients don't have to wait for the whole program to
terminate.  So far, so good.
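
For reference, that attribute lives in the gnatdist configuration file.
Something like the sketch below (unit, partition and procedure names are
invented, and the exact syntax is from memory):

configuration Trading is
   nameserver : Partition := (NameServer_RCI);  -- the single RCI unit
   myclient   : Partition := (Client_Pkg);      -- launched by hand, repeatedly
   for myclient'Termination use Local_Termination;
   procedure Master is in nameserver;           -- main subprogram of the program
end Trading;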

But, on the rare occasions when a client call to the
"nameserver" RCI causes a new server partition to be created
(using a system call to start the code), the client
doesn't terminate until after the new partition terminates.
This is a big nuisance.

If I start the server in another terminal window, the client
terminates while the server continues.  If the nameserver
starts the server from a client call, the client waits
for the server to terminate.

I don't really understand the mechanism for this behavior,
nor how to solve it.  Does the client partition track which OS
processes need to complete before exiting?  How?  The client
waits, even if the server invocation is backgrounded (with "&").
I thought the "screen" utility might help, but I couldn't get
that to work properly either :(
--
Adrian






* Re: Distributed Ada, robustness etc.
  2006-05-29  0:55   ` Dr. Adrian Wrigley
@ 2006-05-30 15:11     ` Dr. Adrian Wrigley
  2006-05-31  5:49       ` Ludovic Brenta
  0 siblings, 1 reply; 11+ messages in thread
From: Dr. Adrian Wrigley @ 2006-05-30 15:11 UTC (permalink / raw)


On Mon, 29 May 2006 00:55:11 +0000, Dr. Adrian Wrigley wrote:

> On Thu, 25 May 2006 01:12:08 +0000, Dr. Adrian Wrigley wrote:
> 
>> <snip>
>> 
>> Hmm.  Seems to have gone quiet round here!
> 
> perhaps it's the long weekend...
> (...continuing the monolog)

...sometimes it feels lonely as an Ada programmer ;-|

I thought I'd put in some code to check if a server
partition is still alive (these functions live in an RCI unit):

-----------------
function PartitionIsLive1 (SDK : SDK_T) return Boolean is
begin
   declare
      Status : String := SDKStatus (SDK.all); -- call another partition
   begin
      return True; -- If we got a result... good!
   end;
exception -- normally a SYSTEM.RPC comms exception
   when others => return False; -- If we got any exception... Bad :(
end PartitionIsLive1;
---------------------

and this works about 99% of the time.
The other 1% of the time, it gets stuck forever on the SDKStatus call :(  (why?)

So, thinking I'd be clever, I put in a select/delay timeout:

-----------------
function PartitionIsLive2 (SDK : SDK_T) return Boolean is
begin
  select
     delay 30.0; -- give the partition adequate time to reply
  then abort
     begin
        declare
           Status : String := SDKStatus (SDK.all); -- call partition
        begin
           return True; -- If we got a result... good!
        end;
     exception -- normally a SYSTEM.RPC comms exception
        when others => return False; -- If we got any exception... Bad :(
     end;
  end select;

  return False; -- Couldn't get reply in time :(

end PartitionIsLive2;
---------------------

this works virtually all of the time.  But not quite.  Sometimes it
still jams.  And all subsequent calls (from other tasks) jam on
an SDKStatus call to the absent partition.  The whole system then gets
"gummed up".

Why doesn't the select/delay method guarantee a timely return from
PartitionIsLive2?

I'm trying to make the code resilient to unexpected partition termination,
bugs, perhaps reboots.  But the gremlins keep thwarting the attempts!
--
Adrian







* Re: Distributed Ada, robustness etc.
  2006-05-30 15:11     ` Dr. Adrian Wrigley
@ 2006-05-31  5:49       ` Ludovic Brenta
  2006-05-31 12:40         ` Dr. Adrian Wrigley
  0 siblings, 1 reply; 11+ messages in thread
From: Ludovic Brenta @ 2006-05-31  5:49 UTC (permalink / raw)


Dr. Adrian Wrigley writes:
> On Mon, 29 May 2006 00:55:11 +0000, Dr. Adrian Wrigley wrote:
>
>> On Thu, 25 May 2006 01:12:08 +0000, Dr. Adrian Wrigley wrote:
>> 
>>> <snip>
>>> 
>>> Hmm.  Seems to have gone quiet round here!
>> 
>> perhaps it's the long weekend...
>> (...continuing the monolog)
>
> ...sometimes it feels lonely as an Ada programmer ;-|

[...]

> I'm trying to make the code resilient to unexpected partition termination,
> bugs, perhaps reboots.  But the gremlins keep thwarting the attempts!

Hi Adrian,

I don't have anything useful to tell you, but please keep posting
here; I find this quite interesting.  Indeed, you seem to be at the
forefront of distributed Ada technology :)

-- 
Ludovic Brenta.




* Re: Distributed Ada, robustness etc.
  2006-05-31  5:49       ` Ludovic Brenta
@ 2006-05-31 12:40         ` Dr. Adrian Wrigley
  2006-05-31 13:21           ` Jean-Pierre Rosen
  2006-06-02 10:27           ` Stephen Leake
  0 siblings, 2 replies; 11+ messages in thread
From: Dr. Adrian Wrigley @ 2006-05-31 12:40 UTC (permalink / raw)


On Wed, 31 May 2006 07:49:20 +0200, Ludovic Brenta wrote:

> Dr. Adrian Wrigley writes:
>> On Mon, 29 May 2006 00:55:11 +0000, Dr. Adrian Wrigley wrote:
>>
>>> On Thu, 25 May 2006 01:12:08 +0000, Dr. Adrian Wrigley wrote:
>>> 
>>>> <snip>
>>>> 
>>>> Hmm.  Seems to have gone quiet round here!
>>> 
>>> perhaps it's the long weekend...
>>> (...continuing the monolog)
>>
>> ...sometimes it feels lonely as an Ada programmer ;-|
> 
> [...]
> 
>> I'm trying to make the code resilient to unexpected partition termination,
>> bugs, perhaps reboots.  But the gremlins keep thwarting the attempts!
> 
> Hi Adrian,
> 
> I don't have anything useful to tell you, but please keep posting
> here; I find this quite interesting.  Indeed, you seem to be at the
> forefront of distributed Ada technology :)

Thanks for the encouragement!

I had a suspicion that the silence was because c.l.a readers hadn't
met these problems before, rather than my descent into *everybody's*
kill-file :o

Anyway, continuing the story:

I realised that one of the problem areas of the design was that
calls are often attempted into terminated partitions.
The "nameserver" registers the partitions as they are elaborated,
but only unregisters them when communication with them has been
lost.

So I decided to add an unregister RCI call to the nameserver.
When the package that registered itself is in a terminating
partition, it makes the call to unregister itself.  Seemed
like a really sensible idea...

Unfortunately, the unregister call *always* fails.  I had
implemented the call through the finalization of a Controlled
type.  An instance of the type is in the package, and the
Finalize procedure is called when the partition terminates.
This call, however, seems to take place *after* the PCS for
the partition is brought down, and so the unregister
immediately fails.
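
In outline, the registering package body looks something like this
(NameServer_RCI, Register, Unregister, My_Name and My_Ref are stand-in
names, not my real ones):

with Ada.Finalization;
with NameServer_RCI;     -- the "nameserver" Remote_Call_Interface unit
package body Server_Pkg is

   type Lifetime is new Ada.Finalization.Controlled with null record;
   procedure Finalize (L : in out Lifetime);

   procedure Finalize (L : in out Lifetime) is
   begin
      NameServer_RCI.Unregister (My_Name);  -- *always* fails: PCS already shut down
   end Finalize;

   The_Lifetime : Lifetime;  -- library level; finalized as the partition terminates

begin
   -- My_Name and My_Ref are declared elsewhere in the real code
   NameServer_RCI.Register (My_Name, My_Ref);  -- at elaboration; this part works
end Server_Pkg;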

I do now try to call unregister when a partition terminates,
using another mechanism.  But this doesn't quite match the
needs, and can't be used in all circumstances.  And it's
more complicated :(   If a partition is aborted, the only
way the code can find out is by attempting a call and watching
it fail.  This may be a weakness in Annex E.

Is it legitimate to make RCI calls as library-level
units are finalized?  Shouldn't the PCS be brought up very early,
and be shut down late in a partition's life cycle, so that this
can be done?  Is there another way to implement package
finalization code?

Overall, the system seems to be working OK at the moment, with
overnight testing showing no anomalies.  But I'd like the whole
system to stay up for several months or more, with thousands of client
partitions being invoked (serially, not concurrently!).
It's also important that substitution or failure of third-party
library code can happen while the system runs.
I may have achieved this already - only time will tell!
--
Adrian





* Re: Distributed Ada, robustness etc.
  2006-05-31 12:40         ` Dr. Adrian Wrigley
@ 2006-05-31 13:21           ` Jean-Pierre Rosen
  2006-05-31 14:38             ` Dr. Adrian Wrigley
  2006-06-02 10:27           ` Stephen Leake
  1 sibling, 1 reply; 11+ messages in thread
From: Jean-Pierre Rosen @ 2006-05-31 13:21 UTC (permalink / raw)


Dr. Adrian Wrigley wrote:
> Unfortunately, the unregister call *always* fails.  I had
> implemented the call through the finalization of a Controlled
> type.  An instance of the type is in the package, and the
> Finalize procedure is called when the partition terminates.
> This call, however, seems to take place *after* the PCS for
> the partition is brought down, and so the unregister
> immediately fails.
> 
Note that finalization occurs in reverse order of elaboration. 
Therefore, this means that your package is elaborated *before* the PCS.

Maybe adding a (useless) "with System.RPC" to your package would suffice
to give an appropriate elaboration, and therefore finalization, order.
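
I.e. something like this at the top of your package (untested; the pragma
is an extra precaution and may not be necessary):

with System.RPC;                     -- not otherwise needed: only to force the PCS
pragma Elaborate_All (System.RPC);   -- to elaborate first, so it is finalized last
package Server_Pkg is
   -- ...as before
end Server_Pkg;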

-- 
---------------------------------------------------------
            J-P. Rosen (rosen@adalog.fr)
Visit Adalog's web site at http://www.adalog.fr




* Re: Distributed Ada, robustness etc.
  2006-05-31 13:21           ` Jean-Pierre Rosen
@ 2006-05-31 14:38             ` Dr. Adrian Wrigley
  2006-05-31 15:38               ` Jean-Pierre Rosen
  0 siblings, 1 reply; 11+ messages in thread
From: Dr. Adrian Wrigley @ 2006-05-31 14:38 UTC (permalink / raw)


On Wed, 31 May 2006 15:21:45 +0200, Jean-Pierre Rosen wrote:

> Dr. Adrian Wrigley wrote:
>> Unfortunately, the unregister call *always* fails.  I had
>> implemented the call through the finalization of a Controlled
>> type.  An instance of the type is in the package, and the
>> Finalize procedure is called when the partition terminates.
>> This call, however, seems to take place *after* the PCS for
>> the partition is brought down, and so the unregister
>> immediately fails.
>> 
> Note that finalization occurs in reverse order of elaboration. 
> Therefore, this means that your package is elaborated *before* the PCS.

OK.  I don't think I understand the problem I'm having.

The package makes a "register" RCI call when it is being
elaborated.  This works fine (it's what I expected).
Doesn't that mean that I should be able to "unregister" it
when finalizing?

The error I got was:

raised SYSTEM.GARLIC.COMMUNICATION_ERROR : Send: Cannot connect to 1
autotrade_client

on the terminal where autotrade_client was invoked.

The autotrade_client partition contains the server package being
terminated, but is a different partition from the RCI nameserver.
The nameserver never gets the unregister.

> Maybe adding a (useless) "with system.rpc" to your package would suffice 
> to have an appropriate elaboration, and therefore finalization, order

I'll try this when I can.  I'll also try to see whether the PCS
really is being finalized prematurely.

Thanks for the hint!
--
Adrian





* Re: Distributed Ada, robustness etc.
  2006-05-31 14:38             ` Dr. Adrian Wrigley
@ 2006-05-31 15:38               ` Jean-Pierre Rosen
  0 siblings, 0 replies; 11+ messages in thread
From: Jean-Pierre Rosen @ 2006-05-31 15:38 UTC (permalink / raw)


Dr. Adrian Wrigley wrote:
> The package makes a "register" RCI call when it is being
> elaborated.  This works fine (it's what I expected).
> Doesn't that mean that I should be able to "unregister" it
> when finalizing?

Normally, yes. If I understand correctly, your package declares a
finalizable object just to support the Initialize and Finalize
procedures. If that is correct, make sure that:
1) the registration is done from Initialize (to ensure symmetry);
2) your finalizable object is derived from Limited_Controlled, not from
Controlled. This will prevent the compiler from doing nasty optimizations.

-- 
---------------------------------------------------------
            J-P. Rosen (rosen@adalog.fr)
Visit Adalog's web site at http://www.adalog.fr




* Re: Distributed Ada, robustness etc.
  2006-05-31 12:40         ` Dr. Adrian Wrigley
  2006-05-31 13:21           ` Jean-Pierre Rosen
@ 2006-06-02 10:27           ` Stephen Leake
  1 sibling, 0 replies; 11+ messages in thread
From: Stephen Leake @ 2006-06-02 10:27 UTC (permalink / raw)


"Dr. Adrian Wrigley" <amtw@linuxchip.demon.co.uk.uk.uk> writes:

> On Wed, 31 May 2006 07:49:20 +0200, Ludovic Brenta wrote:
>
>> I don't have anything useful to tell you, but please keep posting
>> here; I find this quite interesting.  Indeed, you seem to be at the
>> forefront of distributed Ada technology :)
>
> Thanks for the encouragement!

I also don't have much to contribute, but find this interesting.

> <snip>
>
> I do now try to call unregister when a partition terminates,
> using another mechanism.  But this doesn't quite match the
> needs, and can't be used in all circumstances.  And it's
> more complicated :(   If a partition is aborted, the only
> way the code can find out is by attempting a call and watching
> it fail.  This may be a weakness in Annex E.

"abort" is a really nasty way to end a program. As you have seen, the
program can't do much in the way of cleaning up after itself.

In my systems, I always have some sort of "terminate" signal or
rendezvous to every component, and ensure that it is issued/called for
_every_ error condition that results in termination.

Sometimes this means the component must poll for the terminate signal,
even though otherwise it doesn't need to poll for anything.
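
A bare-bones sketch of the polling variant, just to make it concrete (the
names are arbitrary):

package Shutdown is
   protected Flag is
      procedure Request;                  -- called on every path that must terminate
      function  Requested return Boolean; -- polled by long-running components
   private
      Set : Boolean := False;
   end Flag;
end Shutdown;

package body Shutdown is
   protected body Flag is
      procedure Request is
      begin
         Set := True;
      end Request;

      function Requested return Boolean is
      begin
         return Set;
      end Requested;
   end Flag;
end Shutdown;

Components that would otherwise block forever check Shutdown.Flag.Requested
at a convenient point in their loop and wind themselves down cleanly.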

Yes, it is more complicated. But that's because shutting down any real
system _is_ complicated!

I don't think this is a "weakness" in Annex E; it's just reality.

-- 
-- Stephe



