Proposed standard for asynchronous activities on web services
Context
Currently, all our IVOA services run synchronously. I.e., they do what is requested of them during a single HTTP transaction. This is nice and simple but doesn't scale well to long-running activities. Such an activity might be
- a major archive query traversing a large DB table;
- a data-mining job run from a batch queue;
- a workflow with many steps;
- a workflow repeated for many data sets.
In any of these cases, the system is stressed once the activity lasts longer than a few minutes and unreasonably fragile if the activity lasts longer than a few hours. With synchronous operations, all the entities in the chain of command -- client, workflow engine, broker, processing services -- have to stay up for the duration of the activity. If any one is restarted then the context of the activity is lost and the job has to be restarted from the beginning.
It is critical that we allow selected activities to run asynchronously. By this, I mean that a web-service operation starts the activity and completes as soon as the service accepts the job.
The activity continues outside the scope of any particular web-service operation. The client of the activity can find out from the service the state of the work; possibly, the service notifies the client asynchronously to avoid polling. The client has some way of keeping track of the job and of recovering this information without needing to stay connected to the service. Finally, there should be a system for cleaning up resources used by the activity. I have described this kind of activity as part of the planning of the AstroGrid-2 project.
Clearly, we can build an ad-hoc solution for each kind of service, or even for each implementation of each kind of service. Equally clearly, it's more efficient in programming resources to have a standard solution and a reference implementation.
Proposal
I suggest that we need
- context: a way of associating a web-service operation with an activity;
- a way of getting information about the state of a context;
- management of the lifecycle of contexts such that resources are not leaked;
- notification by services of changes in contexts, e.g. "job complete".
The draft standards in the WS-ResourceFramework (WS-RF) family address all these issues and I propose that we base our asynchronous-activity convention on WS-RF. Some details of the proposed usage are listed below.
The context of an activity is identified by a resource identifier from WS-RF. An identifier is an opaque, unique token.
When an activity starts, the web-service operation that starts it ('operation' in the WSDL sense) returns the identifier for the resource.
A client associates a subsequent operation with the activity (e.g. an enquiry or 'abort' command) by passing the resource ID in the SOAP header, as described by WS-RF. The ID is carried inside an endpoint-reference structure as defined by WS-Addressing.
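As a rough illustration of the identifier pattern just described, the sketch below keeps a registry of activities keyed by an opaque token. It deliberately leaves out SOAP, WS-Addressing and the work itself; the names ActivityRegistry, UnknownActivity, start, status and abort are hypothetical and are not taken from WS-RF.

```python
import uuid
from typing import Dict


class UnknownActivity(KeyError):
    """Hypothetical fault for an ID the service does not recognise."""


class ActivityRegistry:
    """In-memory stand-in for a service that tracks activities by ID."""

    def __init__(self) -> None:
        self._states: Dict[str, str] = {}

    def start(self) -> str:
        # The starting operation returns at once with an opaque, unique
        # token; the work itself would carry on elsewhere.
        activity_id = uuid.uuid4().hex
        self._states[activity_id] = "running"
        return activity_id

    def status(self, activity_id: str) -> str:
        # A later operation is tied to the activity only by the ID.
        try:
            return self._states[activity_id]
        except KeyError:
            raise UnknownActivity(activity_id) from None

    def abort(self, activity_id: str) -> None:
        self.status(activity_id)  # raises UnknownActivity for a bad ID
        self._states[activity_id] = "ended"
```

The client needs to remember only the returned ID; it can disconnect and later present the same ID to ask about or abort the activity.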
Activities have timeouts. If an activity times out, then the service running it clears away all its state metadata and any locally-cached results. I.e. the client loses the output from a timed-out activity but the service reclaims its resources. The time-out on an activity is independent of any time-out of the work done by the activity. The activity time-out governs the time for which the client can access the state and results of the work; this is typically longer than the time-out for the work itself. Consider a batch job in a queue with a run-time limit of 30 minutes. The work timeout is 30 minutes, but the period for which the service retains the results will be longer, possibly a day or more.
The controls for the timeout on an activity are as specified in WS-ResourceLifetime. I.e., the service implements a port-type that controls the activity's life-cycle.
Using this port, a client may end an activity before its time-out. A client may also increase or reduce the time-out period; but the service may reject some values of the proposed time-out period.
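The life-cycle rules above can be sketched as follows. This is only a model of the behaviour, not an implementation of WS-ResourceLifetime: the method names loosely echo that specification's destroy and set-termination-time operations, while the class ActivityLifetime, the fault TerminationTimeRejected and the seven-day ceiling are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


class TerminationTimeRejected(Exception):
    """Hypothetical fault: the service refuses the requested time-out."""


@dataclass
class ActivityLifetime:
    created: datetime
    termination_time: datetime
    # Assumed service policy: no activity is kept longer than this.
    max_lifetime: timedelta = timedelta(days=7)
    destroyed: bool = False

    def set_termination_time(self, requested: datetime) -> datetime:
        # The client may lengthen or shorten the time-out, but the
        # service is free to reject values outside its policy.
        if requested > self.created + self.max_lifetime:
            raise TerminationTimeRejected(requested.isoformat())
        self.termination_time = requested
        return self.termination_time

    def destroy(self) -> None:
        # The client ends the activity before its time-out.
        self.destroyed = True

    def is_expired(self, now: datetime) -> bool:
        # Once expired, the service may reclaim state and results.
        return self.destroyed or now >= self.termination_time
```

For the batch-job example, the work would stop after 30 minutes, while termination_time might be set a day after creation so that the client can still collect the results.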
A client may wait for the end of an activity by calling a 'wait' operation on that activity. This operation is not part of WS-RF, so we must specify the details ourselves.
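Since the 'wait' operation is still to be specified, the following is no more than one possible reading of it: the call blocks until the activity ends or until an optional time limit passes. The class WaitableActivity and the timeout parameter are assumptions for illustration.

```python
import threading
from typing import Optional


class WaitableActivity:
    """Hypothetical activity whose end a caller can block on."""

    def __init__(self) -> None:
        self._ended = threading.Event()

    def end(self) -> None:
        # Called by the service when the work finishes or is aborted.
        self._ended.set()

    def wait(self, timeout: Optional[float] = None) -> bool:
        # Returns True if the activity has ended, False if the
        # (assumed) time limit elapsed first.
        return self._ended.wait(timeout)
```

A bounded wait of this kind would let a client avoid tight polling without holding a connection open indefinitely; whether the real operation should take a time limit is one of the open questions.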
Services maintain state metadata on their activities. The metadata for an activity must include:
- a flag indicating 'running' or 'ended';
- a flag indicating 'no errors', 'some errors, handled' or 'fatal errors';
- a list of errors that occurred (hopefully empty in most cases).
These particular metadata are not specified by WS-RF, so we must choose them ourselves. The activity metadata may include other items.
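A minimal sketch of the three required items is given below, on the assumption that the two flags are independent enumerations and that errors are recorded as plain strings. The exact form of the metadata is still to be decided, so every name here is hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RunState(Enum):
    RUNNING = "running"
    ENDED = "ended"


class ErrorState(Enum):
    NO_ERRORS = "no errors"
    HANDLED = "some errors, handled"
    FATAL = "fatal errors"


@dataclass
class ActivityMetadata:
    run_state: RunState = RunState.RUNNING
    error_state: ErrorState = ErrorState.NO_ERRORS
    errors: List[str] = field(default_factory=list)

    def record_error(self, message: str, fatal: bool = False) -> None:
        # Every error is listed; the flag only ever gets worse.
        self.errors.append(message)
        if fatal:
            self.error_state = ErrorState.FATAL
        elif self.error_state is ErrorState.NO_ERRORS:
            self.error_state = ErrorState.HANDLED

    def end(self) -> None:
        self.run_state = RunState.ENDED
```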
A service must provide operations for accessing the state metadata. These must be implemented according to the WS-ResourceProperties standard.
A client may subscribe to state metadata of an activity, as detailed in WS-BaseNotification. A subscribed client, which must itself be a web service, receives asynchronous notification of metadata changes. A service need not support notification, but if it does so it must support it as described by WS-BaseNotification.
When a service implements these interfaces, it is promising to maintain its activities as persistent resources: i.e. persistent across restarts of the service. A restarting service should restart all its activities where they were interrupted. If this is not possible, then the service must mark those activities as aborted in their state metadata. A service must not forget activities, or delete the results of those activities, when it restarts, unless the activities have timed out while the service was down.
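The restart rules can be sketched as a single pass over the activity records that the service saved before going down. The record layout (a dictionary with "state" and "termination_time" keys), the "aborted" marker and the can_resume callback are all assumptions for the example.

```python
from datetime import datetime
from typing import Callable, Dict


def recover_activities(
    stored: Dict[str, dict],
    now: datetime,
    can_resume: Callable[[str], bool],
) -> Dict[str, dict]:
    """Return the activity records a restarted service should keep."""
    recovered: Dict[str, dict] = {}
    for activity_id, record in stored.items():
        if record["termination_time"] <= now:
            # Timed out while the service was down: the only case in
            # which the activity and its results may be dropped.
            continue
        record = dict(record)  # leave the stored copy untouched
        if record["state"] == "running" and not can_resume(activity_id):
            # Cannot pick up where it was interrupted: mark as aborted.
            record["state"] = "aborted"
        recovered[activity_id] = record
    return recovered
```

Activities that had already ended before the restart pass through unchanged, so their results remain available until their own time-out.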
A service implementing the features above must implement an 'activity' port-type which aggregates:
- the definition of the activity metadata;
- the operations for access to the activity metadata;
- the operations for controlling the lifetime of the activities;
- the 'wait' operation.
Concentrating these features into one port makes it easier to generate client stubs. The service may build these features into another port-type that provides additional operations, but the service must not disperse the features among many port-types.
In summary, a service conforming to this proposed standard must:
Areas that need specification in more detail are:
- the exact form of the activity metadata;
- the details of the 'wait' operation (which port; parameters; any timeout?; etc.);
- what faults a client receives if it asks about an unknown activity.
Why WS-RF?
We would do better to adopt an existing standard than to define our own from scratch. This aspect of the VObs isn't at all specific to astronomy, so there is no need for a tailored standard. By using an existing standard, we get the possibilities of using external implementations and of easy interoperation with externally-written services. The question is, which standard?
There are several frameworks for asynchronous activities that are quasi-standard.
- OGSI
- WS-RF
- WS-Coordination
- WS-CoordinatedApplicationFramework (WS-CAF)
- WS-GridApplicationFramework (WS-GAF)
OGSI is a GGF standard but is now deprecated in favour of WS-RF. WS-GAF is a private experiment produced by the University of Newcastle; it is composed of other, simpler web-service standards. WS-RF, WS-Coordination and WS-CAF are each the subject of an OASIS technical committee.
It seems that any of these frameworks could satisfy our requirements; they all provide the 'plumbing' from which we can build our IVOA conventions.
WS-GAF isn't a standard. Currently, there are no WS-GAF products to reuse.
OGSI has several implementations as libraries and a few services that we could re-use. However, OGSI is deprecated and the supported services using OGSI will migrate soon to WS-RF.
WS-Coordination and WS-CAF aren't used in either astronomy or grid computing as far as I know. Therefore, there are no complete web-services to re-use. There are no open-source library implementations of the protocol, either. WS-CAF specifies a complex pattern of agents and operations to manage activities; it would be relatively hard to implement and might not support all the patterns we need. WS-Coordination and WS-CAF do support transactions.
WS-RF does support the patterns we need (with the exception of the 'wait' operation, which can be added). However, WS-RF has no support for transactions. WS-RF is factored into parts that may be implemented separately; this should make it cheaper to support if we cannot get WS-RF libraries. Several academic implementations are in progress and commercial support is promised by IBM, BEA and HP. Most OGSI services (e.g. OGSA-DAI) are expected to be ported to WS-RF in the near future. It now seems likely that implementations of OGSA services will be based on WS-RF (although other frameworks could be used with OGSA).
Using WS-RF gets us easier integration with GGF grid computing. Using WS-Coordination or WS-CAF might get us easier integration with commercial web services. On balance, WS-RF seems more useful to us.
--
GuyRixon - 06 May 2004