Utypes Tiger Team
Utypes Use Cases
Raw use cases from Urbana-Champaign
- UC #1. Serialize DM instances into a file
- UC #2. Deserialize a DM instance from a file
- UC #3. Embed STC information into VOTables
- UC #3.1. Embed STC information in FITS
- UC #4. Provide an abstract (de)serialization strategy that can work with any expressive enough file format. A client can instantiate an object equivalent to the object that was originally serialized
- UC #4.1 Trivial roundtripping
- UC #5. Link columns in a relational model of the registry to VOResource schema elements
- UC #6. Tag metadata in a DAL query response
- UC #7. Render datasets/archives VO-compliant
- UC #8. Extensions of standard DMs
- UC #9. Support serialization of multiple instances of the same DM class
- UC #10. Standard, machine readable DM description
- UC #10.1 Versioning of DM descriptions
- UC #10.2 DM descriptions should express relationships between DMs (reuse, extensions)
- UC #11. Documentation of DM fields
- UC #12. Query archives by DM attribute (e.g. by observation’s target name)
Reviewed Use Cases and Requirements
--
MireilleLouys - 2012-09-10
Comments on the use cases:
A.
There is an underlying core use case not mentioned above:
Create labels to map a piece of metadata (any_name, value) to an IVOA data model field, if such a field exists and if its usage as defined in the model corresponds to its usage in the service or application under consideration.
Utypes were originally invented to "attach" to a metadata value the name of a data model attribute in an object-oriented model describing astronomical observations and simulations. This was conceived for a set of metadata attached to an observation and represented in a VOTable file.
The idea is simply to propose, as general uniform 'names', a set of strings logically constructed from the names of the classes and attributes defined in data models.
The variety of FITS keywords and vocabulary extensions defined in various archives shows that defining a uniform language for all pieces of metadata is too complex a problem.
Relying on a recommended arrangement of modeled metadata (the current set of recommended IVOA data models) provides a framework for that.
Even though it is restricted and bounded, it covers the common use cases.
B.
Flat serialisation is not the only case to consider
Today we can distinguish several ways of distributing data sets
(-- please iterate on this to enrich this section ... --):
- Separate extra metadata file: The data come in a file format familiar to astronomers (FITS table, image, bintable, TSV, other, ...). The metadata come in a separate file as extra documentation for this data set: where, when and how it was obtained, and what kind of information is inside. This was encouraged from the beginning as a strategy to enrich and describe existing data sets and circulate them with their added metadata "wrap".
- All-in-one file structure: The data are coded in a format familiar to astronomers, or in an ad hoc format (see Astrowise?). Metadata are inserted in the data file and describe where, when, how, etc., as well as the way the data are arranged in the files (table names, column names, content, etc.). An example is the enhanced FITS serialisation proposed in Spectrum 1.1, Section 9 (FITS Serialisation).
- Complex related data files: Archive files compressed and tied together, as in X-ray tarballs: data + metadata + interpretation charts and parameters, etc. Interpreting these data sets in applications requires identifying the mapping between metadata parameters and values and the individual parts of a data model. Publishing a data set to the VO requires applying IVOA data model items to local metadata values.
The interoperability relies on this mapping layer.
--
MireilleLouys - 2012-09-10/ updated Sept 12
Requirements:
- R #1 General requirement: provide a serialization and deserialization strategy for generic data model instances using VOTable. In other terms, given a complex object, it should be possible to serialize it, without loss of information, into a file so that a reader can reconstruct an instance of the original object.
- R #1a As R#1, but this should work for arbitrary data formats expressive enough to contain sufficient metadata
- R #1b As R#1, but readers and writers should not need to "know" more about the data model than a description of the data model in some specified language in order to (de-)serialize the objects
- R #1c As R#1, but readers and writers should be able to "pull through" data model instances they do not understand (i.e., preserve them through a load-save cycle).
- R #2 Utypes-XPATH: Utypes should be compatible with XPATH strings pointing to elements in a compliant XML instance: It should be possible, for instance, to use the utypes in a votable as XPATH strings to find the same information in an XML instance of a Data Model.
- R #3 Compliance with existing services and applications: since the main requirement is that compliant instances should be usable by end users and applications, a requirement is that transforming a non-compliant file into a compliant one of the same format should be as easy as possible and should not require changing the file but only adding metadata to it. This might not always be possible in some very complex cases.
- R #4 Support as many tabular formats as possible, in particular VOTable and FITS.
- R #5 Utypes must at least potentially work as opaque strings (e.g., when matching in database tables)
- R #6 Utypes must be case-insensitive
- R #7 It must be possible, from a utype or a group of utypes, to infer the data model used; more concretely, this probably means "infer a URI".
- R #8 When one data model P embeds another data model C, the utypes should allow an application that "understands" C but not P to at least deserialize/handle/treat instances of C
- R #9 Each FIELD or PARAM in a VOTable can fill more than one role within different data models, and all those roles must be denotable through utypes.
Remarks for R5-R7: I added them since these were at times design goals for utypes, and if we drop them we should do so consciously. R5 is a consequence of utype columns in TAP_SCHEMA, for example, and it also helps locating data model parts within larger documents. R6 is stipulated in SSA but has been hotly debated within Obscore. It's at least painful with current ADQL, and IMHO outside of it, too, so it's at least contentious. R7, finally, has been the reason for the abuse of xmlns in SSA VOTables. --
MarkusDemleitner - 2012-09-07
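To make R5-R7 concrete, here is a minimal Python sketch of what they could mean for a client. It assumes the "prefix:path" convention seen in the utypes on this page; the function names and the prefix-to-URI table are hypothetical (only the STC schema URL is taken from the VOTable example further down), and nothing here is an agreed mechanism.

```python
from typing import Optional, Tuple

# Hypothetical prefix-to-URI table; the STC URL is the one used in the
# stc:DataModel.URI PARAM of the VOTable example on this page.
DM_URIS = {"stc": "http://www.ivoa.net/xml/STC/stc-v1.30.xsd"}


def split_utype(utype: str) -> Tuple[str, str]:
    """Split 'prefix:path' into (prefix, path); the prefix is '' if absent."""
    prefix, sep, path = utype.partition(":")
    if not sep:
        return "", utype
    return prefix, path


def utypes_match(a: str, b: str) -> bool:
    """Compare two utypes as opaque strings (R5), ignoring case (R6)."""
    return a.strip().casefold() == b.strip().casefold()


def infer_dm_uri(utype: str) -> Optional[str]:
    """Infer a data model URI from the utype prefix (R7), or None if unknown."""
    prefix, _ = split_utype(utype)
    return DM_URIS.get(prefix.casefold())
```

Note that R6 forces a normalisation step on every comparison, which is exactly what makes it awkward when utypes are matched in database tables.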
Remarks on breaking out reqs 1a through 1c: Supporting 1a and 1b introduces a whole new set of constraints on the actual method(s) employed, and I'm pretty sure their cost in terms of specification complexity is rather high (e.g., 1b requires that we define a DM specification language). If we go this way, we should do so seeing the cost. As to 1c, that's an el-cheapo dumb-down of 1b that's probably much easier to achieve but might just be good enough. --
MarkusDemleitner - 2012-09-12
Remarks on 8: The classic example (I'm reluctant to call it use case) is that an STC library should be able to identify and e.g., transform coordinates given in characterization metadata even if it has no idea that something like char exists. --
MarkusDemleitner - 2012-09-12
Remark on 9: This is currently realized by way of FIELDref/PARAMref; an obvious -- and format-agnostic -- alternative is to allow multiple "pointers" in a single utype, e.g., by concatenating individual foo:bar.quux strings with a separator (";" was once floated as a candidate for that). Other solutions, possibly less demanding on the format (FIELDref/PARAMref) or the utype format itself (internal structure) are certainly conceivable. --
MarkusDemleitner - 2012-09-14
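As an illustration of the "multiple pointers in a single utype" alternative from the remark on 9, here is a minimal Python sketch. It assumes ";" as the separator (only floated as a candidate) and the "prefix:path" convention; the function names are hypothetical.

```python
def split_roles(utype_attr, separator=";"):
    """Split a combined utype attribute into its individual utypes."""
    return [role.strip() for role in utype_attr.split(separator) if role.strip()]


def roles_by_model(utype_attr):
    """Map each data model prefix (lower-cased) to the paths given for it."""
    roles = {}
    for role in split_roles(utype_attr):
        prefix, sep, path = role.partition(":")
        if not sep:  # no prefix present: file the role under ''
            prefix, path = "", role
        roles.setdefault(prefix.lower(), []).append(path)
    return roles


# One column playing a role in two data models at once.
combined = (
    "obscore:char.spatialaxis.coverage.location.coord.position2d.value2.c1;"
    "stc:AstroCoords.Position2D.Value2.C1"
)
print(roles_by_model(combined))
```

A client that only understands STC would pick out the "stc" entry and ignore the rest, but such a combined string no longer works as an opaque string for matching (R5).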
Abstract Use Cases:
Data Model (de)serialization
- UC #1 Serialize DM instances to file: given an instance of a Data Model and the DM machine readable description, a writer can serialize the instance into a number of supported tabular formats. The writer could be a DAL service.
- UC #2 Deserialize DM instance from file: given a serialized instance of a Data Model in a supported tabular format and the DM machine readable description, a reader can deserialize the instance into memory, building an object consistent with the DM itself.
- UC #3 Trivial round-tripping: given a serialized instance of a Data Model in a supported tabular format, an I/O library (possibly model-unaware) can convert the instance into a different, supported format without breaking its VO compliance.
- UC #4 Represent an arbitrary number of instances of the same class in a DM instance (for example, N instances of the PhotometryFilter class in a PhotometryCatalog instance of the Spectral DM). [Omar: in UTypes terms this means that the same UType could be used several times to describe attributes of several different instances, in the same file. Also, several Utyped values should be bundled together in some way, so that each instance of the class can be reconstructed].
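The bundling that UC #4 asks for can be sketched in Python as follows. This assumes each utyped value carries some identifier of the bundle (for instance a GROUP) it belongs to; the triple layout and the phot: utypes are hypothetical and only serve to show why repeated utypes need a grouping mechanism.

```python
def bundle_instances(cells):
    """Group (bundle_id, utype, value) triples into one dict per instance.

    The same utype may occur once per bundle; the bundle id is what keeps
    the values of different instances apart.
    """
    instances = {}
    for bundle_id, utype, value in cells:
        instances.setdefault(bundle_id, {})[utype] = value
    return list(instances.values())


# Two instances of the same class; the utype strings are made up.
cells = [
    ("filter1", "phot:PhotometryFilter.name", "SDSS.u"),
    ("filter1", "phot:PhotometryFilter.unit", "ABMAG"),
    ("filter2", "phot:PhotometryFilter.name", "SDSS.g"),
    ("filter2", "phot:PhotometryFilter.unit", "ABMAG"),
]
filters = bundle_instances(cells)
```

Without the bundle id, the four values above would collapse into two utypes with ambiguous values, and neither instance could be reconstructed.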
VO Tools
- UC #4a VO-Importer: given a non-compliant file (or set of files) and the library of all the DM descriptions, an importer application can allow users to map columns and parameters in the file (or the set of files, or database) to IVOA DM attributes, thus producing a compliant version of the file (or set of files).
- UC #5 VO-Publisher: given a database and the library of all the DM descriptions, a helper application can allow data providers to map tables and columns in a database to IVOA DM attributes, in order to build a DAL service.
- UC #6 VO-Query: given a compliant archive/service it is possible to query it by using Utypes to refer to data model elements (for instance, query all observations for a target whose name is given). [Omar: I am not sure I understand this: does it mean that I could, for instance, query an archive for all the SDSS.g and SDSS.u magnitudes using Utypes?]
Data Model description
- UC #7 Data Model representation: DMs should be represented by a machine-readable description that supports the following:
- UC #7.0 Mapping from one simple piece of metadata to one single data model attribute at the finest description level
- UC #7.1 Describe and document Data Model elements.
- UC #7.2 Keep versioning information about the DM.
- UC #7.3 Reuse an existing DM in a new DM.
- UC #7.4 Extend an existing DM with a new DM.
- UC #7.5 Abstract the creation of VO-compliant I/O libraries from the details of the single DM. According to the programming language, each DM would be represented by some kind of plugin of the generic library. [Omar: this use case is actually a consequence of the implementation of the other 7.x cases].
Others
- UC #8. Link columns in a relational model of the registry to VOResource schema elements
Concrete Use Cases:
(Photometry Catalog, points in columns)
Represent a Photometry Catalog with a definite number of Magnitudes expressed in columns and astronomical sources in rows. For example, an SDSS catalog with the following columns:
SDSSID | RA | DEC | U | G | R | I | Z
(Photometry Catalog, points in rows)
A Photometry Catalog could refer to a single object observed in a number of filters, or to different objects observed in a number of filters, and the number of filters could be arbitrary. An efficient relational approach suggests representing this as a table where each magnitude is expressed in a different row, and the other information (object name, coordinates, instrument, filter, etc.) is in columns, or is factored out into the table header if it is common to all points.
For instance, here is a (simple) example of an (unnormalized) catalog for different sources. Notice that this table doesn't use any controlled vocabulary for filters, target names and instruments, while VO documents should:
TargetName | RA | DEC | Instrument | Filter | Magnitude | Units
M51 | xx.yy | xx.yy | SDSS | u | xx | ABMAG
M51 | xx.yy | xx.yy | SDSS | g | xx | ABMAG
NGC1068 | xx.yy | xx.yy | GALEX | nuv | xx | Jy
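The relation between the two layouts can be sketched in Python. This is an illustration only: it assumes the column names of the SDSS example above, and the function name and sample values are made up.

```python
FILTER_COLUMNS = ["U", "G", "R", "I", "Z"]


def columns_to_rows(record, filter_columns=FILTER_COLUMNS):
    """Turn one 'points in columns' record into 'points in rows' records.

    Every non-filter column is copied into each output row; each filter
    column becomes its own row with a Filter and a Magnitude entry.
    """
    common = {k: v for k, v in record.items() if k not in filter_columns}
    return [
        dict(common, Filter=name.lower(), Magnitude=record[name])
        for name in filter_columns
        if name in record
    ]


# Made-up values, for illustration only.
source = {"SDSSID": "obj-1", "RA": 10.5, "DEC": -3.2,
          "U": 18.1, "G": 17.4, "R": 16.9, "I": 16.7, "Z": 16.5}
rows = columns_to_rows(source)
```

In the row layout every magnitude carries the same utype, so the filter has to be identified by a value in another column rather than by the column itself.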
(Aggregated SED)
An Aggregated SED is defined as an aggregation of different segments of spectro-photometric data, where each segment can be a Photometry Catalog, a Spectrum or an entire SED itself. It should be possible to serialize an SED as a list of several tables, each table representing a segment. Complex formats like VOTable and FITS can allow the tables to be stored in the same file.
(STC serialization)
It has to be possible to embed STC instances in tables and attach these instances to other objects in the table. This is a generic example of model reuse (the STC model is reused by other models).
(STC serialization in a VOTable)
Since right now, the only non-deprecated way to include STC metadata in VOTables relies on utypes, it is particularly unfortunate we have no way of doing this that's REC. I'm counting this as a separate use case since IMHO it's a particular shame something as basic is almost undefined in our recommended format -- the format we, as the VO community, actually control.
Also, IMHO ideally clients should not have to worry about what (if any) data model some data within a VOTable conforms to. Just as any VOTable library could support (the now deprecated) COOSYS element, support for "modern" STC metadata should be "generic".
Sorry for littering the use case with discussions on practice; if you have a better place for this stuff, please do move it there.
One plan to do this is described in Referencing STC in VOTable. The basic idea is to collect all information pertaining to STC (or even some other data model) in one group, like this:
<GROUP utype="stc:CatalogEntryLocation">
<PARAM name="CoordFlavor" datatype="char" arraysize="*"
utype="stc:AstroCoordSystem.SpaceFrame.CoordFlavor"
value="SPHERICAL"/>
<PARAM name="CoordRefFrame" datatype="char" arraysize="*"
utype="stc:AstroCoordSystem.SpaceFrame.CoordRefFrame"
value="ICRS"/>
<PARAM name="ReferencePosition" datatype="char" arraysize="*"
utype="stc:AstroCoordSystem.TimeFrame.ReferencePosition"
value="BARYCENTER"/>
<PARAM name="TimeScale" datatype="char" arraysize="*"
utype="stc:AstroCoordSystem.TimeFrame.TimeScale" value="TT"/>
<PARAM name="Epoch" datatype="char" arraysize="*"
utype="stc:AstroCoords.Position2D.Epoch" value="2010.2"/>
<PARAM name="yearDef" datatype="char" arraysize="*"
utype="stc:AstroCoords.Position2D.Epoch.yearDef" value="J"/>
<PARAM name="TimeInstant" datatype="char" arraysize="*"
utype="stc:AstroCoords.Time.TimeInstant"
value="2002-01-28T09:30:00"/>
<PARAM name="URI" datatype="char" arraysize="*"
utype="stc:DataModel.URI"
value="http://www.ivoa.net/xml/STC/stc-v1.30.xsd"/>
<FIELDref ref="raErr"
utype="stc:AstroCoords.Position2D.Error2.C1"/>
<FIELDref ref="deErr"
utype="stc:AstroCoords.Position2D.Error2.C2"/>
<FIELDref ref="ra" utype="stc:AstroCoords.Position2D.Value2.C1"/>
<FIELDref ref="de" utype="stc:AstroCoords.Position2D.Value2.C2"/>
<FIELDref ref="pmra"
utype="stc:AstroCoords.Velocity2D.Value2.C1"/>
<FIELDref ref="pmde"
utype="stc:AstroCoords.Velocity2D.Value2.C2"/>
</GROUP>
<FIELD ID="ra" name="ra" datatype="float"/>
<FIELD ID="de" name="de" datatype="float"/>
<FIELD ID="raErr" name="raErr" datatype="float"/>
<FIELD ID="deErr" name="deErr" datatype="float"/>
<FIELD ID="pmra" name="pmra" datatype="float"/>
<FIELD ID="pmde" name="pmde" datatype="float"/>
One advantage of this scheme is that it's fairly easy to isolate the STC parsing/unparsing code from the rest of the VOTable handling since the stuff it has to operate on is "just an element", not many elements spread out over the entire document.
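A minimal Python sketch of such isolated handling, using only the standard library, might look like this. It assumes a namespace-free fragment wrapped in a TABLE element (real VOTables declare an XML namespace, which would have to be handled), and the function name read_group is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free version of the GROUP example above.
VOTABLE_FRAGMENT = """
<TABLE>
  <GROUP utype="stc:CatalogEntryLocation">
    <PARAM name="CoordRefFrame" datatype="char" arraysize="*"
           utype="stc:AstroCoordSystem.SpaceFrame.CoordRefFrame"
           value="ICRS"/>
    <FIELDref ref="ra" utype="stc:AstroCoords.Position2D.Value2.C1"/>
    <FIELDref ref="de" utype="stc:AstroCoords.Position2D.Value2.C2"/>
  </GROUP>
  <FIELD ID="ra" name="ra" datatype="float"/>
  <FIELD ID="de" name="de" datatype="float"/>
</TABLE>
"""


def read_group(table, group_utype):
    """Return (params, columns) for the first GROUP with the given utype.

    params maps utype -> constant value (from PARAM children);
    columns maps utype -> column name (FIELDref resolved through FIELD ID).
    Returns None if no such GROUP exists.
    """
    field_names = {f.get("ID"): f.get("name") for f in table.iter("FIELD")}
    for group in table.iter("GROUP"):
        if group.get("utype", "").lower() != group_utype.lower():
            continue
        params = {p.get("utype"): p.get("value") for p in group.findall("PARAM")}
        columns = {r.get("utype"): field_names[r.get("ref")]
                   for r in group.findall("FIELDref")}
        return params, columns
    return None


table = ET.fromstring(VOTABLE_FRAGMENT)
params, columns = read_group(table, "stc:CatalogEntryLocation")
```

Everything the STC code needs is reachable from the one GROUP element plus an ID lookup, which is the point of the scheme.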
This scheme also lets you embed multiple data models in a single VOTable. The "primary" data model (e.g., obscore or spectrum in ObsTAP or SSAP, respectively) could still use the FIELD's utype attributes; even though the "primary" data models only have crippled STC metadata (and would not, e.g., support proper motions), such information can still be transmitted in the VOTable and evaluated by clients. Here's an example that could be part of an obscore response in which the server also provides information on SSA:
<GROUP utype="stc:CatalogEntryLocation">
<PARAM name="CoordFlavor" datatype=... etc, as above
</GROUP>
<GROUP utype="spec:Spectrum">
<FIELDref ref="ra" utype="spec:Target.pos.ra"/>
<FIELDref ref="de" utype="spec:Target.pos.dec"/>
... (whatever else is in the table for Spectrum) ...
</GROUP>
<FIELD ID="ra" name="ra" datatype="float"
utype="obscore:char.spatialaxis.coverage.location.coord.position2d.value2.c1"/>
<FIELD ID="de" name="de" datatype="float"
utype="obscore:char.spatialaxis.coverage.location.coord.position2d.value2.c2"/>
...
(ignoring the fact that I've probably made up spec utypes and obscore talks about observation rather than target; I guess you get the drift anyway). Of course, it would be much nicer if we could agree on throwing overboard most of the existing practice and could just say
<GROUP utype="stc:CatalogEntryLocation" ID="targetPos">
<PARAM name="CoordFlavor" datatype="char" arraysize="*"
utype="stc:AstroCoordSystem.SpaceFrame.CoordFlavor"
value="SPHERICAL"/>
... etc, as before
</GROUP>
<GROUP utype="spec:Spectrum">
<GROUPref ref="targetPos" utype="spec:Target.pos"/>
</GROUP>
But it's probably too late for that, we don't have a GROUPref element in the first place, and less referencing is better as a rule.
--
MarkusDemleitner - 2012-09-20
(STC in a Spectrum)
This is a more specific example of model reuse. Spectral DM uses some parts of the STC DM to describe the reference frame of the observations: the frame can include several axes: time, spectral coordinate, flux, photometry filters. Some of these axes could be different instances of the same STC or STC-derived class, like the photometry filters. This has to be represented in a generic tabular format using Utypes to describe the structure of the serialized instances. Different instances of the same class must somehow be disentangled from the others.
(Create compliant SSA service using a database and a non-compliant archive)
Given a database and an archive containing spectra serialized in a non-compliant way (but in a supported format, like FITS), a data publisher might want to create a VO service. In principle this would require creating a new database (or a view on the original one) and copying and changing the headers of the non-compliant files. A more efficient solution would be to leave the archive and the database untouched and to add an additional layer on top of them: the layer would add the required metadata to the original files on the fly (see R #4). For example, the service can read the information in the database and fill a VOTable-compliant header (combining the database values with the predefined Utypes) that wraps the original FITS file in the service response.
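Such a layer could be sketched in Python as follows, purely as an illustration. The database column name, the access_url PARAM and the example URL are hypothetical; ssa:Target.Name is the only utype taken from this page, and a real SSA response requires far more metadata than shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from local database columns to predefined utypes.
COLUMN_TO_UTYPE = {"target_name": "ssa:Target.Name"}


def build_header(db_row, access_url):
    """Build a VOTable-like RESOURCE wrapping an untouched archive file.

    Database values are copied into PARAMs carrying the mapped utypes;
    columns without a mapping are left out. The file itself is only
    referenced through a (hypothetical) access_url PARAM.
    """
    resource = ET.Element("RESOURCE")
    table = ET.SubElement(resource, "TABLE")
    for column, utype in COLUMN_TO_UTYPE.items():
        if column in db_row:
            ET.SubElement(table, "PARAM", name=column, datatype="char",
                          arraysize="*", utype=utype,
                          value=str(db_row[column]))
    ET.SubElement(table, "PARAM", name="access_url", datatype="char",
                  arraysize="*", value=access_url)
    return resource


header = build_header({"target_name": "M51", "internal_id": 42},
                      "http://example.org/spectra/42.fits")
print(ET.tostring(header, encoding="unicode"))
```

The archive file and the database stay as they are; only the mapping table has to be maintained by the publisher.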
Current Practices and Uses
Spectrum 1.1 REC, Quality Flags: Data.FluxAxis.Quality.n, where n is an integer. (Parsable? Anyway this is gone in Spectral 2.0)
Photometry 1.0 PR, Access class: “we use the Access class defined in
ObsTAP and inherited from SSA” -> PhotometryFilter.transmissionCurve.Access.*
Photometry 1.0 PR, Spectrum is imported using the spec namespace (notice the difference with the previous approach).
Namespace (in several DMs): the namespace must be parsed out of the Utype string… but then again which is the actual Utype string?
Extensibility (e.g. NED SED): Data.FluxAxis.Published.Value: is this Utype by any chance related to the standard Data.FluxAxis or to Target.Name? (How can I infer it?)
Introduced as an attribute for FIELD and PARAM in VOTable 1.2:
- Maps FIELD/PARAM to a DM attribute
- Encourages use of the XML namespace convention for avoiding name collisions
- Encourages use of the XML xmlns for linking to the DM
- Highlights the usefulness of utypes for space-time coordinates and provides an example for STC
- Does not say anything about parsability
Redefined in SSA 1.1:
- The goal of utypes is to “flatten a hierarchical data model so that all fields are represented by fixed strings in a flat namespace”
- They are introduced as “fixed” strings, but no explanation is given on the meaning of “fixed”.
- “Of course, if a data model becomes complex enough this will no longer be possible”
- Introduces a serialization mechanism for multiple instances (multiple equal Utypes in the same file), providing an example using serialization-specific features, for VOTable.
- Does not say anything explicit about parsability, however…
- In other sections (e.g. query response metadata) other features are introduced:
Utype is built with the pseudo-grammar “”.””
spec:Spectrum.Target.Name and ssa:Target.Name are the same thing.
- More information about utypes in Section 4.2.7 (Metadata Extension Mechanism)
Redefined in Spectrum 1.1, also introducing Data Model inheritance:
- Analogy with XPATH (‘.’ instead of ‘/’). “a.b.c.d”, dots indicate “has-a” relationship (3.5)
- ‘Data Model Field’ and ‘Utype’ interchangeable (3.5)
- “Other IVOA standards may use a different prefix instead of “Spectrum.” … This represents Data Model inheritance.” (3.5)
- “the utypes can be used to infer the data model structure” (8.2)
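The XPATH analogy can be illustrated with a minimal Python sketch. It assumes a hypothetical XML instance whose element names equal the utype steps, strips any prefix, and treats a leading step naming the root element as optional; none of this is prescribed by Spectrum 1.1.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML instance; element names mirror the utype steps.
INSTANCE = "<Spectrum><Target><Name>M51</Name></Target></Spectrum>"


def lookup(root, utype):
    """Resolve a dotted utype against an XML instance, '.' standing for '/'.

    The prefix (if any) is dropped, and a first step equal to the root
    element name is skipped. Returns the element text, or None.
    """
    steps = utype.rpartition(":")[2].split(".")
    if steps and steps[0] == root.tag:
        steps = steps[1:]
    node = root.find("/".join(steps)) if steps else root
    return None if node is None else node.text


root = ET.fromstring(INSTANCE)
print(lookup(root, "spec:Spectrum.Target.Name"))
print(lookup(root, "ssa:Target.Name"))
```

Element lookup in XML is case-sensitive, so this analogy is in tension with treating utypes as case-insensitive strings.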
Most DMs define utypes in tables, using different conventions
Utype strings can change when DMs are reused. Also, the namespace changes globally for each DM (spec:Target.Name, ssa:Target.Name)
Utypes are only partially used in FITS serializations: they can be used for columns, not for parameters: in this case, an arbitrary 8 char string is provided by the DM document.
DMs do not define an “xmlns” link to the DM URI
VO-URP and UTYPE-s
See here for a discussion on how VO-URP can support UTYPE-s.
Minutes of telecons