Shanghai interop 2017, discussion session for VOEvent
Change requests
Is there anything you would like to change in the current specifications?
- Errata - things that need fixing
- Wishlist - things you would like
- Full support for STC and/or named targets (e.g., Jupiter)
- IVORNs should be unique identifiers; see e.g. the NASA/PDS "lidvid" (logical identifier / version identifier) solution (short explanation at the NASA PDS Small Bodies Node)
- Embed cut-out images
- Do not repeat information when processing large blocks of similar events (e.g. share metadata across all 10000 events from an LSST visit)
- Alternative ways of referencing authors, for instance:
- refer to the SPASE (Space Physics Archive Search and Extract) registry (SPASE Metadata Working Group), where the Baptiste Cecconi registry record is referred to as: spase://SMWG/Person/Baptiste.Cecconi. The record content can be reached in several ways.
- refer to ORCID...
- "Cool kids like JSON better" — could have multiple serialization formats of the same data
Signing and checksums
Do we need to support ways of signing the events or streams?
- Do we need cryptographic signing (e.g. x509), or will arithmetic checksums do what we need (e.g. MD5sum)?
- Can we do this in the transport layer, or do we need to do this at the content level?
- What would be the cost in terms of processing time?
- Depends on what you do at the transport level.
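One way to frame the checksum-versus-signature question: a checksum only detects accidental corruption (anyone can recompute it), while a keyed signature also authenticates the sender. A minimal sketch, using HMAC as a lightweight stand-in for x509-style signing; the payload and key are hypothetical:

```python
import hashlib
import hmac

payload = b"<VOEvent>example event body</VOEvent>"

# Arithmetic checksum: detects corruption, but proves nothing about
# who sent the event, since anyone can recompute it.
checksum = hashlib.md5(payload).hexdigest()

# Keyed signature: only holders of the shared secret can produce a
# valid tag. (Hypothetical key; x509 would use public-key pairs.)
secret = b"broker-shared-secret"
signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()

# Verification: recompute and compare in constant time.
ok = hmac.compare_digest(
    signature, hmac.new(secret, payload, hashlib.sha256).hexdigest()
)
print(ok)
```

Either operation is a single pass over the payload, so the content-level processing cost is small compared to network transfer; the transport layer (e.g. TLS) can provide similar guarantees without touching the event body.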
Replay mechanism
Do we need to look at supporting some form of replay mechanism?
- Would this be part of the stream protocol or a separate queryable archive?
- Sounds as though Kafka (LSST/ZTF) already has this built in.
- Maybe depends if you are a high-volume producer (LSST) or a low-volume, high-value producer (LIGO)
- Can you iteratively refine a filter by tweaking it, replaying the event stream, and repeating?
- Another use case is catching up when you actually miss events by dropping off the stream.
- Is there a requirement for catch-up functionality vs just querying the archive for the time you were offline?
- May never be practical for high-volume producers to implement this. (LSST might buffer e.g. one night's worth of events)
- The room is split on how useful this is: there are arguments that having a single system support both streaming and rewinding is an unnecessary level of complexity, but conversely that e.g. Kafka gives it to you for "free".
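The Kafka-style replay discussed above can be illustrated with a toy in-memory stream: consumers replay from any surviving offset, and older events fall out of the retention window (class name and retention policy are illustrative, not any real broker's API):

```python
from collections import deque

class ReplayableStream:
    """Toy in-memory stream with Kafka-style offsets: consumers can
    seek back to any retained offset and replay from there."""

    def __init__(self, retention=1000):
        # Retain only the newest `retention` events, like a broker
        # keeping e.g. one night's worth of alerts buffered.
        self.retention = retention
        self.events = deque()
        self.next_offset = 0

    def publish(self, event):
        self.events.append((self.next_offset, event))
        self.next_offset += 1
        while len(self.events) > self.retention:
            self.events.popleft()  # oldest events leave the replay window

    def read_from(self, offset):
        """Replay every retained event at or after `offset`."""
        return [e for (o, e) in self.events if o >= offset]

stream = ReplayableStream(retention=5)
for i in range(8):
    stream.publish(f"event-{i}")

# Offsets 0-2 have expired; replaying from offset 4 yields events 4..7.
print(stream.read_from(4))
```

The retention parameter is where the high-volume vs high-value split shows up: a high-volume producer can only afford a short window, while a low-volume producer could retain everything and effectively become a queryable archive.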
Describing Events & Streams
What types of events are people interested in?
- What type of astronomical events do the VOEvents describe?
- What type of content do the VOEvents contain?
- How do we describe primary event streams? E.g. event streams from primary sources like Pan-STARRS, Atlas or LSST.
- How do we describe derived event streams? E.g. an aggregate stream combining events from both Pan-STARRS and Atlas.
- How important is provenance for third- or fourth-generation streams?
Suppose a user chooses stream A because it is from a well-known source with a good reputation, and stream B because it uses a new filtering algorithm they think is interesting.
- How can we describe the filtering algorithm for stream B in a way that users can look for similar streams using similar algorithms?
- Is response time important?
A new filtering algorithm may be much better at detecting specific types of astronomical events, but it requires a larger set of historic measurements to make the assessment, which increases the latency between the first event and the classification result.
In some cases latency might not be an issue, and classification accuracy is preferred over fast response time.
In other cases, fast response time is vital, and the event consumer has to be willing to cope with a corresponding drop in accuracy.
- How do we describe things like classification accuracy, false positive rates, etc?
Is it up to the event provider to measure and publish their own accuracy statistics, or would some form of third-party rating mechanism be useful, possibly based on feedback on results from event consumers?
- What happens when the processing behind an event stream changes?
A specific data release of an archive is fixed: running the same query will give the same results today and tomorrow. An event stream will change over time, not only because different data will flow through it tomorrow, but also because the processing pipeline that generates the events may change.
If a machine learning algorithm is trained to process data from an event stream, then the results of the learning algorithm are very sensitive to changes in the content of the event stream.
- How do we describe changes to an upstream processing pipeline in a way that is useful to end users?
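To make "accuracy statistics" concrete, here is a sketch of the kind of figures a stream publisher might derive from confusion-matrix counts. The function and field names are illustrative and not part of any VOEvent specification:

```python
def stream_quality(tp, fp, tn, fn):
    """Derive headline statistics a stream publisher might report,
    from raw confusion-matrix counts (names are illustrative)."""
    return {
        "precision": tp / (tp + fp),            # fraction of alerts that are real
        "recall": tp / (tp + fn),               # fraction of real events alerted on
        "false_positive_rate": fp / (fp + tn),  # bogus alerts per non-event
    }

# Hypothetical counts for one night of a stream's output.
print(stream_quality(tp=90, fp=10, tn=900, fn=30))
```

Publishing raw counts rather than derived percentages would let consumers (or a third-party rating service) recompute whichever figure of merit they care about, and track how it drifts when the upstream pipeline changes.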
Planning for the Future
TBD