International Virtual Observatory Alliance
The main part of this document describes the adopted part of the VOTable standard; it is followed by appendices presenting extensions which have been proposed and/or discussed, but which are not part of the standard.
This is an IVOA Working Draft for review by IVOA members and other interested parties. It is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use IVOA Working Drafts as reference materials or to cite them as other than ``work in progress''.
This proposed recommendation is made available for public review. Comments on this document should be sent to votable@ivoa.net, a mailing list with a public archive. It is appropriate to reference this document only as a recommended standard that is under review and which may be changed before it is accepted as a full recommendation. A list of current IVOA Recommendations and other technical documents can be found at
http://ivoa.net/Documents/
Acknowledgments
This document is based on the W3C documentation standards, but has been adapted
for the IVOA.
The VOTable format is an XML standard for the interchange of data represented as a set of tables. In this context, a table is an unordered set of rows, each of a uniform structure, as specified in the table description (the table metadata). Each row in a table is a sequence of table cells, and each of these contains either a primitive data type, or an array of such primitives. VOTable is derived from the Astrores format [1], itself modeled on the FITS Table format [2]; VOTable was designed to be close to the FITS Binary Table format.
Astronomers have always been at the forefront of developments in information technology, and funding agencies across the world have recognized this by supporting the Virtual Observatory movement, in the hopes that other sciences and business can follow their lead in making online data both interoperable and scalable.
VOTable is designed as a flexible storage and exchange format for tabular data, with particular emphasis on astronomical tables.
Interoperability is encouraged through the use of standards (XML). The XML fabric allows applications to easily validate an input document, and facilitates transformations through XSLT (eXtensible Stylesheet Language Transformations) engines.
VOTable has built-in features for big-data and Grid computing. It allows metadata and data to be stored separately, with the remote data linked. Processes can then use metadata to `get ready' for their input data, or to organize third-party or parallel transfers of the data. Remote data allow the metadata to be sent in email and referenced in documents without pulling the whole dataset with it: just as we are used to the idea of sending a pointer to a document (URL) in place of the document, so we can now send metadata-rich pointers to data tables in place of the tables themselves. The remote data is referenced with the URL syntax protocol://location, meaning that arbitrarily complex protocols are allowed.
When we are working with very large tables in a distributed-computing environment (``the Grid''), the data stream between processors, with flows being filtered, joined, and cached in different geographic locations. It would be very difficult if the number of rows of the table were required in the header: we would need to stream the whole table into a cache, compute the number of rows, then stream it again for the computation. In the Grid-data environment, the component in short supply is not the computers, but rather these very large caches. Furthermore, these remote data streams may be created dynamically by another process or cached in temporary storage: for this reason VOTable can express that remote data may not be available after a certain time (expires). Data on the net may require authentication for access, so VOTable allows expression of password or other identity information (the `rights' attribute).
Data Storage: Flexible and Efficient
The data part in a VOTable may be represented using one of three different formats: TABLEDATA, FITS and BINARY. TABLEDATA is a pure XML format so that small tables can be easily handled in their entirety by XML tools. The FITS binary table format is well-known to astronomers, and VOTable can be used either to encapsulate such a file, or to re-encode the metadata; unfortunately it is difficult to stream FITS, since the dataset size is required in the header (NAXIS2 keyword), and FITS requires a specification up front of the maximum size of its variable-length arrays. The BINARY format is supported for efficiency and ease of programming: no FITS library is required, and the streaming paradigm is supported.
We hope that VOTable can be used in different ways, as a data storage and transport format, and also as a way to store metadata alone (table structure only). In the latter case, we can imagine a VOTable structure being sent to a server, which can then open a high-bandwidth connection to receive the actual data, using the previously-digested structure as a way to interpret the stream of bytes from the data socket. VOTable can be used for small numbers of small records (pure XML tables), or for large numbers of simple records (streaming data), or it can be used for small numbers of larger objects. In this last case, there will be software to spread large data blocks among multiple processors on the Grid. Currently the most complex structure that can be in a VOTable Cell is a multidimensional array.
VOTable is constructed with XML (eXtensible Markup Language), a powerful standard for structured data throughout the Internet industries. It derives from SGML, a standard used in the publishing industry and for technical documentation for many years. XML consists of elements and payload, where an element consists of a start tag (the part in angle brackets), the payload, and an end tag (with angle brackets and a slash). Elements can contain other elements. Elements can also bear attributes (keyword-value combinations).
The payload may be in two forms: parsed or unparsed character data. Examples are:
<text>Fran&#231;ois</text>
<text><![CDATA[ a & (b <= c) ]]></text>
In the first example, the sequence &#231; is interpreted as part of the ISO/IEC 10646 character set (Unicode), and translates to an accented character, so that the text is ``François''. The second example uses the special CDATA sequence so that the characters <, >, and & can be used without interpretation; in this case, any ASCII characters are allowed except the terminating sequence ]]>. For more information, see any book on XML.
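As an illustration of the two payload forms, the following minimal sketch parses both examples with Python's standard xml.etree.ElementTree module. The variable names are illustrative only, and the first example is written with the numeric character reference &#231; for the accented character.

```python
import xml.etree.ElementTree as ET

# Parsed character data: the numeric reference &#231; is decoded by the parser.
parsed = ET.fromstring("<text>Fran&#231;ois</text>")

# Unparsed character data: inside CDATA, '<' and '&' are taken literally.
unparsed = ET.fromstring("<text><![CDATA[ a & (b <= c) ]]></text>")

print(parsed.text)
print(unparsed.text)
```

In both cases the application sees only the resulting character string; the choice between a character reference and a CDATA section is a matter of how the document is written, not of what it contains.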
Following the general XML rule, element and attribute names are case-sensitive and have to be used with the specified capitalisation. For VOTable, we have adopted the convention that element names are spelled in uppercase and attribute names in lowercase (with an exception for the ID attribute). Element and attribute names are further distinguished in this paper by being typed with a fixed-width font.
In this section we define the data model of a VOTable, and in the next sections its syntax when expressed as XML. The data model of VOTable is summarized in the hierarchy listing given further below; its components are described as follows.
Metadata is divided into that which concerns the table itself
(parameters), and the definitions of the fields (or column
attributes) of the table.
Each FIELD represents the metadata
that can be found at the
top of the column in a paper version of the table:
in the example introduced in the
section below, the first FIELD has its name attribute
set to "RA". The Field can be thought of as a class definition,
and the table cells below it are the instances of that class.
A parameter (PARAM)
is similar to a FIELD,
except that it has a value attribute.
Parameters can be seen as ``constant columns'', containing for instance
FITS keywords or any other
information pertaining to the table itself or its environment, such as the
Telescope parameter in the example of section 3.1.
An informative parameter (INFO)
is a restricted form of the PARAM: it is always understood
as a string (i.e. datatype="char"
and arraysize="*" are implied).
The ordered list of Fields at the top of the table thus provides a
template for a Row object (also called a record). The
template allows interpretation of the data in the Row.
The
record is a set of Cells, with the number and order of Cells the same for each
Row, and the same as the number of Fields defined in the Metadata.
In VOTable,
there is generally no advance specification of the number of rows in the table:
this is to allow streaming of large tables, as discussed above.
However, if the number of rows is known, it may be specified in a
dedicated nrows attribute.
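The relationship between the list of Fields and the Rows can be sketched in a few lines of Python. This is only an illustration of the data model, not part of the standard: the class and function names (Field, Table, row_matches, stream_rows), the second column "Dec" and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class Field:
    """Column metadata: the template entry for one Cell of every Row."""
    name: str
    datatype: str


@dataclass
class Table:
    fields: List[Field]
    nrows: Optional[int] = None  # optional: may be unknown while streaming


def row_matches(table: Table, row: Tuple) -> bool:
    """A Row is valid only if it has exactly one Cell per Field."""
    return len(row) == len(table.fields)


def stream_rows(table: Table, rows: Iterable[Tuple]) -> Iterator[Tuple]:
    """Yield rows one at a time, without knowing the total count in advance."""
    count = 0
    for row in rows:
        if not row_matches(table, row):
            raise ValueError(f"row {count} does not match the field template")
        count += 1
        yield row
    # A declared nrows can only be verified once the stream has ended.
    if table.nrows is not None and count != table.nrows:
        raise ValueError(f"expected {table.nrows} rows, received {count}")


# Hypothetical two-column table; nrows is left unspecified.
table = Table(fields=[Field("RA", "float"), Field("Dec", "float")])
rows = list(stream_rows(table, iter([(10.68, 41.27), (287.43, -63.85)])))
```

Because the rows are consumed through an iterator, the consumer never needs the row count before it starts processing, which is the property the text relies on for streaming.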
From Version 1.1, columns may be logically grouped, so that it is
possible to define table substructures made of column associations.
Such an association is declared as a GROUP, which typically
contains column references (FIELDref)
and associated parameters (PARAM).
Each Cell is composed from Primitives, each of which is a datatype
of fixed-length binary representation, as listed in
the accompanying table.
Cells may consist of a single Primitive (this is
the default), or of an array (possibly multidimensional)
of Primitives (see the next section).
Except for the Bit type, each primitive has the fixed length in
bytes given in the table.
Bit scalars and arrays are stored in
the minimum number of bytes feasible (so that b bits take the integer
part of (b+7)/8 bytes). These primitives
are described in more detail in section 6.
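The storage rule for bits can be written as a one-line function. The sketch below is illustrative; the function name bit_array_bytes is hypothetical.

```python
def bit_array_bytes(b: int) -> int:
    """Number of bytes needed to store b bits: the integer part of (b+7)/8."""
    if b < 0:
        raise ValueError("the number of bits cannot be negative")
    return (b + 7) // 8


# A single bit and a full octet both fit in one byte; a ninth bit needs a second.
sizes = [bit_array_bytes(b) for b in (1, 8, 9, 16, 17)]
```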
VOTables support two kinds of characters: ASCII 1-byte characters
and Unicode (UCS-2) 2-byte characters. Unicode is a way to represent
characters that is an alternative to ASCII. In the UCS-2 form it uses
two bytes per character instead of one; it is strongly supported by
XML tools, and it can handle a large variety of international alphabets.
Therefore VOTable supports not only ASCII strings (datatype="char"),
but also Unicode (datatype="unicodeChar").
Note that strings are not a primitive type: strings are
represented in VOTable as an array of characters.
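The difference between the two character types can be illustrated by encoding a string as an array of characters. This is a minimal sketch under stated assumptions: the function name encode_string is hypothetical, big-endian byte order is assumed for the 2-byte characters, and Python's UTF-16 codec is used as a stand-in for UCS-2, which it matches only for characters in the Basic Multilingual Plane.

```python
def encode_string(text: str, datatype: str) -> bytes:
    """Encode a string as an array of 1-byte or 2-byte characters."""
    if datatype == "char":
        # Raises UnicodeEncodeError for any non-ASCII character.
        return text.encode("ascii")
    if datatype == "unicodeChar":
        # UCS-2 cannot represent code points above U+FFFF.
        if any(ord(c) > 0xFFFF for c in text):
            raise ValueError("character outside the range of UCS-2")
        # Assumption: big-endian byte order.
        return text.encode("utf-16-be")
    raise ValueError(f"not a character datatype: {datatype}")


ascii_cell = encode_string("RA", "char")                  # 2 characters
unicode_cell = encode_string("François", "unicodeChar")   # 8 characters
```

The accented name cannot be stored with datatype="char", and as an array of unicodeChar it occupies twice as many bytes as it has characters.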
VOTable   = hierarchy of Metadata + associated TableData,
            arranged as a set of Tables
Metadata  = Parameters + Infos + Descriptions + Links + Fields + Groups
Table     = list of Fields + TableData
TableData = stream of Rows
Row       = list of Cells
Cell      = Primitive
            or variable-length list of Primitives
            or multidimensional array of Primitives
Primitive = integer, character, float, floatComplex, etc.
            (see table of primitives below).
datatype          Meaning              FITS   Bytes
"boolean"         Logical              "L"    1
"bit"             Bit                  "X"    *
"unsignedByte"    Byte (0 to 255)      "B"    1
"short"           Short Integer        "I"    2
"int"             Integer              "J"    4
"long"            Long integer         "K"    8
"char"            ASCII Character      "A"    1
"unicodeChar"     Unicode Character           2
"float"           Floating point       "E"    4
"double"          Double               "D"    8
"floatComplex"    Float Complex        "C"    8
"doubleComplex"   Double Complex       "M"    16
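The byte lengths in the table make it possible to compute the size of a fixed-length record from the field definitions alone. The following sketch is an illustration only: the names PRIMITIVE_BYTES, cell_bytes and record_bytes are hypothetical, as is the three-column layout used in the example. The "bit" type is omitted because its length depends on the number of bits stored.

```python
# Fixed byte lengths taken from the table of primitives ("bit" excluded).
PRIMITIVE_BYTES = {
    "boolean": 1,
    "unsignedByte": 1,
    "short": 2,
    "int": 4,
    "long": 8,
    "char": 1,
    "unicodeChar": 2,
    "float": 4,
    "double": 8,
    "floatComplex": 8,
    "doubleComplex": 16,
}


def cell_bytes(datatype: str, count: int = 1) -> int:
    """Size of a cell holding `count` primitives of the given datatype."""
    return PRIMITIVE_BYTES[datatype] * count


def record_bytes(columns) -> int:
    """Size of one row, given (datatype, count) pairs for its cells."""
    return sum(cell_bytes(datatype, count) for datatype, count in columns)


# Hypothetical layout: two double scalars and a 10-character ASCII string.
layout = [("double", 1), ("double", 1), ("char", 10)]
size = record_bytes(layout)
```

For this layout each record occupies 8 + 8 + 10 = 26 bytes, so a reader that knows the metadata can locate any row in a stream of fixed-length records without further information.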