We describe a minor change to the interpretation of UType values in VOTables, which helps document UType meanings, and makes it easy to relate UTypes to each other, supporting interoperability while requiring minimal standardisation.
This is an IVOA Note.
This document is an IVOA Note expressing suggestions from and opinions of the authors. The first release of this document was YYYY Month DD.
It is intended to share best practices, possible approaches, or other perspectives on interoperability with the Virtual Observatory. It should not be referenced or otherwise interpreted as a standard specification.
A list of current IVOA Recommendations and other technical documents can be found at http://www.ivoa.net/Documents/.
None, yet
UTypes are defined in section 4.5 of the VOTable standard [std:votable], with a definition which is sufficiently compact that we can reproduce it in full here.
In some contexts, it can be important that FIELDs or PARAMeters are explicitly designed as being the parameter performing some well-defined role in some external data model. For instance, it might be important for an application to know that a given FIELD expresses the surface brightness processed by an explicit method. None of the existing name, ID or ucd attributes can fill this role, and the utype (usage-specific or unique type) attribute has been added in VOTable 1.1 to fill this gap. By extension, most elements may refer to some external data model, and the utype attribute is legal also in RESOURCE, TABLE and GROUP elements.
In order to avoid name collisions, the data model identification should be introduced following the XML namespace conventions, as utype="datamodel_identifier:role_identifier". The mapping of datamodel_identifier to an xml-type attribute is recommended, but not required.
At the time, this was addressing an anticipated, but not yet actual, need, and so this terse definition sensibly neither greatly constrains UType syntax, nor defines any specific instances.
Our situation is now different. The SIA protocol [std:sia] has acquired a number of UTypes (informally introduced in a mail message from J C McDowell), and the on-going Dataset Characterisation effort [std:characterisation] includes a list of UTypes in at least one version of its draft note. None of these have yet been formally standardised, so that now, with examples in mind and standardisation in prospect, is a good moment to refine the UType definition.
We make three suggestions, which we can summarise as follows.
Interpret the datamodel_identifier prefix above as an XML namespace, with the syntactic requirements that implies, and interpret the UType as a URI naming a concept.
The second and third suggestions build on the first, but are independent of each other.
@@TODO MBT strongly recommends that `require', above, be changed to `strongly recommend', on the practical grounds that that is how it would probably be used in fact. My own feeling is that blessing that degree of casualness in creating UTypes might be harmful to their usefulness, but I can appreciate the practical force of the argument, and can see the extra permissiveness as encouraging the uptake of UTypes.
Further discussion of each of these appears in the sections below, and a rationale for the overall approach appears in C–Rationale. Although simple uses of the reasoning framework described there would be immediately available, the more elaborate possibilities would require further work. We would like to stress, however, that this is not the only benefit of the UType refinement we are suggesting, and that the consistency and documentation benefits described here would follow even if the reasoning potential were never exploited.
The draft characterization document describes a possible mechanism for serialising a data source using a data model and UTypes. We presume the existence of such an agreed-upon mechanism in the discussion of data sharing below.
In this proposal, an organisation creating a UType must perform three steps, mirroring the steps described in section 1–Introduction.
The UType definition quoted above (section 1–Introduction) includes a datamodel_identifier which syntactically resembles an XML namespace identifier without necessarily being one, and in particular without being necessarily associated with a URI which would give it uniqueness and a potential reference to documentation.
We suggest slightly expanding the UType definition by interpreting this datamodel_identifier prefix as precisely an XML namespace identifier (which must therefore be defined using an xmlns attribute if it is used), and identifying the UType as the string concatenation of the namespace name and the local name as given in the utype attribute, using the terminology of [std:xmlns]. There is precedent for this approach in the definition of `Compact URIs' (CURIE, see [birbeck05]), and it is a syntax used extensively and successfully in the RDF world.
In this interpretation, the following three fragments would represent identical UTypes and would be deemed to be equivalent.
xmlns:utns="http://www.ivoa.net/ut/#" utype="utns:axis"
xmlns="http://www.ivoa.net/ut/#" utype="axis"
utype="http://www.ivoa.net/ut/#axis"
The first is the usual XML namespace mechanism, and closely resembles the VOTable definition; the second uses the XML notion of the default namespace; and the third explicitly gives the URI which the other two resolve to. As with XML namespaces, the string used as the prefix -- utns in the example here -- is arbitrary, and it is only the post-concatenation URI that has any meaning attached to it.
This proposal requires no syntactic changes to the VOTable specification. It is purely a mild reinterpretation of the syntax already defined and used.
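The equivalence of the three fragments can be sketched as a small expansion routine. This is an illustration in Python only, not part of the proposal: the function name expand_utype, the use of an empty-string key for the default namespace, and the rule that a value with an undeclared prefix is taken to be an already-expanded URI are all assumptions made for the example.

```python
def expand_utype(utype, namespaces):
    """Expand a utype attribute value to the full URI UType.

    `namespaces` maps in-scope prefixes to namespace URIs, with the empty
    string as the key for the default namespace (an assumed convention).
    """
    prefix, colon, local = utype.partition(":")
    if colon and prefix in namespaces:
        # Declared prefix: concatenate the namespace name and the local name.
        return namespaces[prefix] + local
    if not colon and "" in namespaces:
        # No prefix at all: fall back to the default namespace.
        return namespaces[""] + utype
    # Otherwise assume the value is already a full URI and leave it alone.
    return utype


NS = "http://www.ivoa.net/ut/#"
forms = [
    expand_utype("utns:axis", {"utns": NS}),  # prefixed form
    expand_utype("axis", {"": NS}),           # default-namespace form
    expand_utype(NS + "axis", {}),            # explicit URI form
]
```

Under these assumptions all three entries of forms are the same string, so an application can compare UTypes by simple string equality after expansion.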
The UType string that results from this concatenation must be a valid URI. Since the namespace name is necessarily a URI, this constraint is satisfied if the local name matches a restricted form of the URI syntax of RFC3986 (see [std:rfc3986]):
( path-absolute | path-rootless ) [ "?" query ] [ "#" fragment ]
In practice, we expect most UTypes' local name parts would match the fragment syntax, and more specifically that subset of it matching [0-9a-zA-Z_/.-]+.
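As a minimal sketch, the conservative character set quoted above can be checked with a regular expression. The function name is_conservative_local_name is hypothetical, and the check covers only that conservative subset, not the full RFC3986 grammar.

```python
import re

# The conservative local-name pattern quoted in the text.
LOCAL_NAME = re.compile(r"[0-9a-zA-Z_/.-]+")


def is_conservative_local_name(name):
    """Return True if `name` uses only the conservative character set."""
    # fullmatch requires the whole string to match, so the empty string
    # and any string containing another character are rejected.
    return LOCAL_NAME.fullmatch(name) is not None
```

A name such as characterizationAxis-coverage-bounds passes this check, while a name containing a space or a non-ASCII letter does not.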
@@TODO: what characters should be allowed in the local name? The above is a rather conservative set. XML allows the local name to be (Letter | "_") (NameChar-":")*, but NameChar includes large chunks of Unicode. This could be accommodated by requiring support for IRIs [std:rfc3987], but the XML namespace document includes only ambiguous support for that. Is the VO ready for kanji in its UTypes? Probably not.
Even without worrying about IRIs, we shouldn't rely on the fact that XML has Unicode sorted out. Other formats, and other software, will have to read UTypes, and so encoding issues rear their heads. In particular, we mustn't require any encoding which uses more than one byte per character, since that would generate various transcoding challenges, to put it mildly, when handling FITS files.
We could restrict ourselves to the characters of 7-bit ASCII, but it would probably be painless to use ISO-8859-1 in fact. The defined 0-127 characters in that set exactly match the printable 7-bit ASCII characters, and ISO-8859-1 as a whole matches Unicode code points 0-255. Thus, although this does not correspond to any Unicode encoding, there is a broad compatibility with Unicode in this case.
It would be wise to exclude '.' from the set of UType characters, as this character plays a syntactic role in Notation3, so that it would be mildly inconvenient to describe UTypes including a dot. Are there more similar restrictions?
In this example and below, we illustrate UTypes using the URI fragment identifier #: this is regarded as best practice in the RDF community and would generally be more convenient in the procedure we illustrate, but there is no technical reason why a set of distinct, fragmentless, URIs could not be used instead. One advantage of using the fragment identifier is that in this case it is natural to have the namespace URI refer to an overview document describing the namespace as a whole.
UTypes used in non-XML contexts -- such as FITS files -- would have to use either the third explicit mechanism or some separate namespacing mechanism, not specified here, though briefly discussed in appendix A–UTypes and FITS.
This mechanism makes it possible to mint URI UTypes through a wide variety of processes, from very formal and widely shared ones, managed by an elaborate standards process and probably in a www.ivoa.net namespace; through semi-formal ones specific to, and managed by, particular interest groups, perhaps on the way to full standardisation; to very precise ones, perhaps specific to a single instrument. Applications would choose which UTypes were most useful for them to support: presumably most generic VO applications would support most www.ivoa.net UTypes, and X-ray applications, for example, might support many X-ray-specific UTypes. Perhaps a few applications will support instrument-specific UTypes directly -- perhaps because they fill a gap in a community-supported vocabulary -- but most such UTypes would likely be handled via the reasoning mechanisms described below.
Once UTypes have been defined as URIs, then they immediately provide a source of documentation, if the namespace URI is made dereferenceable.
For example, to define a UType http://example.org/utypes/1.0#sharpBounds (presuming that we own the example.org domain), we would create a web page at the URL http://example.org/utypes/1.0 (see section B–Apache recipes for hints on making Apache return HTML for such URLs which don't end in .html), within which we have a link target with the same name, which leads to a human-readable description of the UType's semantics.
<h2><a name='sharpBounds'>Accurate bounds</a></h2>
<p>In our data, <code>#sharpBounds</code> are the bounds on a bandpass where the transmission goes from 0% to 100% within 10nm. This is distinguished from <code>#fuzzyBounds</code> data, where...
The description here can go into as much or as little detail as is appropriate for the formality and intricacy of the document. Thus the URI UType will, when entered into a browser, show the documentation for precisely that concept.
Obviously, any entity minting UTypes is making an institutional commitment to the long-term stability of the namespace URI. An entity unable or unwilling to make such a commitment should avoid creating externally visible UTypes.
While the UType documentation described in section 3–Documentation is useful for humans, it is of course unintelligible to the applications that must interpret the data source annotated with the UType.
To continue our example, we might wish to share data using our new #sharpBounds concept. Doing so means that any application which is written to understand our more precise concept can make good use of the more precise meaning, but we want to make it possible for applications which do not know about this concept to make use of the data also.
We suggest a minimal profile of the W3C best-practice document [w3c:swbp] which describes how best to share standard RDF [std:rdf] and RDFS [std:rdfs] vocabularies.
We wish to assert that our new #sharpBounds UType is a more specific version of a concept #characterizationAxis-coverage-bounds, which we presume has already been defined by the IVOA in the namespace http://www.ivoa.net/ut/characterization#, and which we can reasonably expect software to know about. We can do this using RDFS (here written in Notation3 syntax [std:n3]):
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix myns: <http://example.org/utypes/1.0#>.
@prefix ivoa: <http://www.ivoa.net/ut/characterization#>.

myns:sharpBounds a rdfs:Class;
    rdfs:subClassOf ivoa:characterizationAxis-coverage-bounds .
This asserts that http://example.org/utypes/1.0#sharpBounds is a concept -- a Class in RDFS terms -- and that it is a more specific concept than the Characterisation model's bounds concept.
We propose that the file containing this machine-readable documentation of our UTypes be available at the namespace URI, and returned when the URI is dereferenced using an HTTP Accept header of text/rdf+n3. All non-trivial HTTP APIs have support for manipulating request headers in this fashion, and if all else fails, the command-line curl application can do the retrieval:
% curl --header accept:text/rdf+n3 http://example.org/utypes/1.0
Recipes for setting up a web server to support such content negotiation are in section B–Apache recipes.
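The same retrieval can be sketched with the Python standard library, as one illustration of an HTTP API that manipulates the Accept header. The function names build_n3_request and fetch are hypothetical, and decoding the body as UTF-8 is an assumption about how the server is configured.

```python
import urllib.request


def build_n3_request(namespace_uri, media_type="text/rdf+n3"):
    """Build (but do not send) a GET request asking for `media_type`."""
    return urllib.request.Request(namespace_uri, headers={"Accept": media_type})


def fetch(namespace_uri, media_type="text/rdf+n3"):
    """Dereference the namespace URI and return the response body as text."""
    request = build_n3_request(namespace_uri, media_type)
    # urlopen follows HTTP redirects for GET requests by default.
    with urllib.request.urlopen(request) as response:
        # Assumption: the server serves the document as UTF-8.
        return response.read().decode("utf-8")


# Building the request does not touch the network; calling fetch() would.
request = build_n3_request("http://example.org/utypes/1.0")
```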
There are multiple systems (for example [app:jena] and [app:pellet]), in multiple languages, which can ingest such specifications and help an application make the necessary deduction. While an application could incorporate such functionality, it is straightforward to wrap such a reasoning system in a web-based service, and a system such as this has been prototyped.
Using such a resolver, an application which comes across the previously-unknown UType http://example.org/utypes/1.0#sharpBounds can resolve it in a single URL dereference (shown using curl here):
% curl http://localhost/resolver?q=http://example.org/utypes/1.0%23sharpBounds
http://www.w3.org/2000/01/rdf-schema#Class
http://www.ivoa.net/ut/#characterization.characterizationAxis.coverage.bounds
http://example.org/utypes/1.0#sharpBounds
This returns the list of superclasses of the #sharpBounds concept (which includes the #sharpBounds class itself, and the technical RDFS class), and so the application can simply work through this list until it finds a UType it recognises, and then proceed exactly as if that UType had been the one found in the input data stream, instead of the previously unknown #sharpBounds UType. By making the subClassOf assertion above we have stated that this is a reasonable thing for an application to do.
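The application-side logic described here can be sketched as follows. This is not the prototyped resolver: the function names, the dictionary representation of the subClassOf assertions, and the nearest-first ordering of the result are assumptions made for the illustration.

```python
from collections import deque


def superclasses(utype, sub_class_of):
    """Return `utype` followed by its ancestors, nearest first.

    `sub_class_of` maps a UType URI to a list of its direct superclasses.
    """
    seen = [utype]
    queue = deque([utype])
    while queue:
        current = queue.popleft()
        for parent in sub_class_of.get(current, ()):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


def first_recognised(candidates, known):
    """Return the first candidate UType the application knows, or None."""
    for candidate in candidates:
        if candidate in known:
            return candidate
    return None


MY = "http://example.org/utypes/1.0#sharpBounds"
IVOA = "http://www.ivoa.net/ut/characterization#characterizationAxis-coverage-bounds"

# The single subClassOf assertion from the Notation3 example.
relations = {MY: [IVOA]}
chosen = first_recognised(superclasses(MY, relations), known={IVOA})
```

An application that knows only the IVOA concept would then handle the data as if it had been annotated with that UType.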
The resolver does not need to be pre-loaded with a set of known UTypes. In fact, the reasoner can start off knowing about no UTypes at all, since when it is asked to resolve a hitherto unknown UType such as this one, it can simply dereference the URI as described in section 4.1–Describing subclass relationships, and add the retrieved relationships to its knowledgebase, ready to respond to this and any future queries. Since UType definitions will be stable, they can be aggressively cached (the assertions will be permanent in principle, but might include bugfixes and updates in practice). Thus this proposal requires no infrastructure beyond the dereferenceable URIs described above, and the commitment of the authors of those UTypes to maintain the URIs into the future.
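A resolver that starts empty and caches what it learns might be structured along the following lines. This is a sketch under stated assumptions rather than a description of the prototype: the class name CachingResolver is hypothetical, and fetch_relations stands in for whatever code dereferences a namespace URL and parses the returned RDF into (subclass, superclass) pairs.

```python
from urllib.parse import urldefrag


class CachingResolver:
    """Resolve UTypes on demand, dereferencing each namespace only once."""

    def __init__(self, fetch_relations):
        # fetch_relations(namespace_url) -> iterable of (subclass, superclass)
        # pairs; a hypothetical callable supplied by the caller.
        self._fetch = fetch_relations
        self._parents = {}    # UType URI -> set of direct superclasses
        self._loaded = set()  # namespace URLs already dereferenced

    def _load(self, utype):
        # The document to fetch is the UType URI without its fragment.
        namespace_url = urldefrag(utype).url
        if namespace_url in self._loaded:
            return
        self._loaded.add(namespace_url)
        for sub, sup in self._fetch(namespace_url):
            self._parents.setdefault(sub, set()).add(sup)

    def superclasses(self, utype):
        """Return the set containing `utype` and all of its ancestors."""
        found, todo = set(), [utype]
        while todo:
            current = todo.pop()
            if current in found:
                continue
            found.add(current)
            self._load(current)
            todo.extend(self._parents.get(current, ()))
        return found
```

Because the loaded namespaces are remembered, a second query for the same UType is answered from the knowledgebase without any further dereferencing.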
The description above is expressed in terms of XML, through its reference to XML namespaces and its use of VOTable examples, but it is not specific to XML. To demonstrate this, and illustrate the potential use of these UTypes in other systems, we present here an example of how one might include UTypes in FITS files.
In a message to the IVOA data-modelling group, Jonathan McDowell proposed FITS keywords for UCDs and UTypes, namely TUCDnnnn and TUTYPnnn, each providing a UCD and UType for the data in the nnnth column.
This is already enough to reliably associate UTypes with columns, but it has the disadvantage that the UTypes in question would probably quickly run into the 72-character limit on FITS card values.
We could expand Jonathan's proposal by requiring the TUTYPnnn to include a namespace prefix, exactly as the utype VOTable attribute has, and adding a further header card to define the namespace prefix. This could be done with a header card TUTNSnnn, as follows:
TUTNS001=pfx:http://www.ivoa.net/ut/#
TUTYP010=pfx:axis
where the numbers nnn in TUTYPnnn refer to the annotated column, and nnn in TUTNSnnn distinguishes the namespace header cards from each other.
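A reader of such a header could recover the full URI UTypes roughly as follows. This is a sketch only: the function name utypes_from_header is hypothetical, the header is represented as a plain dictionary of keyword and string value rather than being read from a real FITS file, and the TUTNSnnn card format is the one suggested above, not an agreed convention.

```python
def utypes_from_header(cards):
    """Map column number -> full URI UType from TUTNSnnn/TUTYPnnn cards.

    `cards` is a plain dict of keyword -> string value.
    """
    prefixes = {}
    for keyword, value in cards.items():
        if keyword.startswith("TUTNS"):
            # Value is "prefix:namespace-URI"; split at the first colon only,
            # since the namespace URI itself contains colons.
            prefix, _, namespace = value.partition(":")
            prefixes[prefix] = namespace
    utypes = {}
    for keyword, value in cards.items():
        if keyword.startswith("TUTYP"):
            prefix, _, local = value.partition(":")
            column = int(keyword[len("TUTYP"):])
            # An undeclared prefix raises KeyError in this sketch.
            utypes[column] = prefixes[prefix] + local
    return utypes


cards = {
    "TUTNS001": "pfx:http://www.ivoa.net/ut/#",
    "TUTYP010": "pfx:axis",
}
resolved = utypes_from_header(cards)
```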
Alternatively, namespaces could be defined with a card TUTNSaaa where the aaa letters define the necessarily short namespace prefix, as in
TUTNSpfx=http://www.ivoa.net/ut/#
TUTYP010=pfx:axis
This would have the side-effect of requiring that UTypes (or rather, the part of them following the namespace URI) have a maximum length of 68 characters (72 characters of a FITS card value, minus the three aaa characters and the colon). While this is unlikely to be a great imposition, it is worth noting that some of the proposed Characterisation UTypes [std:characterisation] are already tens of characters long.
@@TODO is there more to say, here?
In sections 3–Documentation and 4–Shared semantics above we describe dereferencing a URL and retrieving either HTML or RDF depending on the content-negotiation phase of the HTTP transaction -- that is, depending on the content of the HTTP Accept header. In this appendix we describe a simple recipe for configuring Apache to support this; there will be similar configurations for other web servers. We describe only a single configuration here; fuller examples are available in the W3C best-practice document [w3c:swbp].
A namespace such as http://www.ivoa.net/ut/# would (typically) correspond to a directory .../ut on the web server. Let us suppose that we have, in this server directory, HTML documentation in a file namespace.html and RDF in the Notation3 syntax in a file namespace.n3. For completeness, we might as well have the same information in (the largely unreadable) RDF/XML [std:rdfxml] syntax as well, in a file namespace.rdf.
We presume that this configuration is being done in a per-directory .htaccess file, and that the server has been configured to allow this, by allowing the FileInfo Options overrides. The following .htaccess file will have the desired effect:
AddType application/rdf+xml .rdf
# The MIME type for .n3 should be text/rdf+n3, not application/n3:
# see MIME notes at http://www.w3.org/2000/10/swap/doc/changes.html
AddType text/rdf+n3 .n3
AddCharset UTF-8 .n3

RewriteEngine on
# RewriteBase is the path to the current directory
RewriteBase /ut
# Use response code 303, 'See Other'.
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^$ namespace.rdf [R=303]
RewriteCond %{HTTP_ACCEPT} text/rdf\+n3
RewriteRule ^$ namespace.n3 [R=303]
# Default -- typically text/html
RewriteRule ^$ namespace.html
With this configuration we can dereference the namespace URL in two different ways, to retrieve two different results:
% curl http://www.ivoa.net/ut/
<html>
<head>
[...]
% curl -i --header accept:text/rdf+n3 http://www.ivoa.net/ut/
HTTP/1.0 303 See Other
Date: Thu, 30 Nov 2006 16:19:51 GMT
Server: Apache/1.3.33
Location: http://www.ivoa.net/ut/namespace.n3
Content-Type: text/html; charset=iso-8859-1
[...]
% curl -L --header accept:text/rdf+n3 http://www.ivoa.net/ut/
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
[...]
%
(the HTTP 303 `See Other' response is the appropriate RFC2616 [std:rfc2616] response, indicating that `[t]he response to the request can be found under a different URI and SHOULD be retrieved using a GET method on that resource', and the -L option tells curl to follow any Location headers in the initial response).
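The decision the rewrite rules make can be restated as a few lines of Python, which may help when porting the recipe to another server. This is only an approximation of the Apache behaviour for a request to the namespace URL itself: the function name negotiate is hypothetical, and a plain substring test stands in for the unanchored regular expressions of the RewriteCond lines.

```python
def negotiate(accept_header):
    """Return (status, target file) for a request to the namespace URL."""
    # The rules are tried in the order they appear in the .htaccess file.
    if "application/rdf+xml" in accept_header:
        return 303, "namespace.rdf"
    if "text/rdf+n3" in accept_header:
        return 303, "namespace.n3"
    # Default: serve the HTML documentation directly, without a redirect.
    return 200, "namespace.html"
```

Note that, as in the recipe, the order of the checks decides the outcome when a client lists both RDF media types; any quality values in the Accept header are ignored.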
We include in this appendix a more discursive introduction to the problem this proposal is attempting to solve, and the larger social structure we expect to arise from it.
Standardisation is expensive, in both time and effort.
A standard must be as small as possible, so that it is more easily agreed on, and so that its documentation is not overwhelming; and it must at the same time be as large as possible, so that it covers enough of what its users want to exchange, to justify the effort of agreeing. The pressure for expanding the standard arises because, while standardisation is expensive, going beyond the standard incurs crippling costs as a result of the consequent loss of interoperability. Thus standardisation is not an end in itself, but merely a means to reach the real goal of universal interoperability.
The costs of standardisation arise because the participants in the standardisation process will have different designs in mind, and bring different implementations to the discussion. Sometimes these differences are merely accidents of history and taste, but sometimes they arise because the participants have different and incompatible requirements, so that the resulting standard ends up substantially more complicated than the designs that preceded it, still without completely satisfying anybody. Our particular concern here is the data models which structure shared data, which are variously designed for the convenience of the various data providers, but which a wide variety of data reduction applications nonetheless hope to read.
In this Note, we propose a structure which allows the different participants to retain their data models, and achieve interoperability, not by transforming their data into some never quite satisfactory consensus model, but by `explaining' their data model in terms applications can understand. Data providers can `explain' their model by analogy, saying that a concept in their data model is the same as, or a more specific variant of, a concept in another data model; if the latter concept is one which an application understands, then it knows how to handle the underlying data.
We would therefore expect to see a hierarchy of sets of UTypes.
We would therefore expect to see a large number of UTypes, which are of equal status in principle, but not in practice. It is in data providers' interests to make their data as widely intelligible as possible, by either using well-known UTypes or, where that is insufficiently precise, by `explaining' more specific ones in those terms. This creates an instability which produces a consensus on which UTypes are recognised as `well-known'. Of course, this process could be primed with an initial set of high level IVOA standard UTypes.
With this proposal, this last highest-level set of UTypes can be smaller than it might otherwise be, because it is no longer a costly disaster to omit things. If in retrospect it appears that a high-level standard omitted important concepts, then those can be developed in an agile fashion and stitched into the larger structure.
This agility emerges because this proposal facilitates not only different levels of specification, but also versioning and deprecation. The costs of versioning arise because it is expensive for applications to be reworked to use an updated version of a standard. If the new version's concepts are described in terms of the older version's, however, then it becomes reasonable for data providers to use the new improved version of a UType set, knowing that applications can deduce the relationship with the previous version they have coded-in knowledge of.
As well as versioning, reducing the community's reliance on a small set of gold-plated standards makes it possible for components of, or extensions to, standards to be designed, prototyped and maintained by specific interest groups, working independently.