Scope and Role of TAP Parameter Query
Introduction
First some history. Back when we began discussing TAP a couple
of years ago, it was agreed that TAP should be able to query table
metadata as well as table data. While all agreed that VOSI support
was needed for use internal to the VO projects, some of us felt that
the VOSI approach of a big block of XML describing the full tableset
was not what we wanted to provide to science application writers (often
non-professionals) for table metadata queries. Rather we wanted to use
the same table query mechanism to query table metadata as well as table
data, in the spirit of the elegant SQL information schema concept.
While ADQL could be used for this purpose, at the time it was felt
that requiring ADQL to support fully general table metadata queries,
while nice as an advanced feature, was more than was needed and would
require data providers to implement the TAP_SCHEMA as actual database
tables, which we (AstroGrid in particular) wanted to avoid. A simple
parameter query interface could support basic table metadata queries
more flexibly than VOSI but without requiring full ADQL support for
table metadata queries, sharing the same basic query i/o interface
with the ADQL query. Most of what a client would want for table
metadata queries could be provided by simple queries such as
<preamble>&FROM=TAP_SCHEMA.tables
which would list and describe all tables supported by the TAP
service, and
<preamble>&FROM=TAP_SCHEMA.columns&WHERE=table_name,foo
which would describe the columns of table "foo". These basic queries
are simple enough to be implemented using static metadata if desired,
as with VOSI, but would return only the metadata required by typical
client use cases, in a format convenient for the client to process.
The standard query interface with all of its features would be
available without change for client table metadata queries.
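As a rough illustration of the two metadata queries above, the sketch
below builds the request URLs in Python. The endpoint URL and the
helper name are hypothetical, and the sketch assumes <preamble> is a
bare service URL with no parameters of its own.

```python
from urllib.parse import urlencode

# Hypothetical endpoint standing in for <preamble>; not a real service.
BASE_URL = "http://example.org/tap/query"


def param_query_url(base_url, **params):
    """Build a param query URL from keyword arguments.

    safe="," leaves the comma in WHERE values unescaped so the URL
    stays readable when typed or pasted into a browser.
    """
    return base_url + "?" + urlencode(params, safe=",")


# List and describe all tables supported by the service.
tables_url = param_query_url(BASE_URL, FROM="TAP_SCHEMA.tables")

# Describe the columns of table "foo".
columns_url = param_query_url(
    BASE_URL, FROM="TAP_SCHEMA.columns", WHERE="table_name,foo"
)

print(tables_url)
print(columns_url)
```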
Once such an interface was contemplated it quickly became clear
that it could also be useful for simple filter-type queries of
individual tables (e.g. astronomical catalogs). Adding a spatial
region constraint (cone search) capability was also easy, and would
provide an attractive upgrade path for phasing out legacy cone search.
This concept was easily extended to support multi-position queries
and general STC regions (REGION parameter) as well. Fully relational
DBMS queries would require ADQL (which TAP also provides) but are not
required for typical queries of individual astronomical catalogs.
A final motivation for the parameter query, at least for some of us,
was to provide something simple which could be implemented robustly
now, for our most common astronomical use cases, while the more
complex and powerful ADQL-based TAP query technology matures.
In particular advanced functionality such as scalable multi-position
("multicone") queries could be robustly implemented while providing a
simple interface to the scientist-programmer user. Once we succeed
in enticing astronomer-users with such a simple interface they will
be motivated to learn the more complex and powerful ADQL interface,
with all the advanced analysis capabilities that SQL provides
(one reason for providing both in the same service interface).
Professional programmers and the large projects developing advanced
portal applications would probably use the ADQL query capability from
the beginning, developing this technology into a mature capability
in the process.
The parameter query and ADQL query are merely two alternative ways of
expressing a query. Param query is much more constrained than ADQL
but provides explicit support for some important common use cases.
ADQL is a general parsed query language, directly leverages SQL, and
is much more flexible and powerful, but also more complex. Both share
the same service interface, execution engine, and output processing.
Prototypes of the TAP param query have thus far been implemented
only within NVO (to my knowledge). Fairly complete prototypes are
available from both STScI and IPAC/IRSA. Similar capabilities (each
with a different interface) have however been provided for years by
many of the major astronomical data centers, e.g., CDS, CADC, IRSA,
HEASARC, and others, and have been quite popular with users for basic
catalog access. A survey and analysis of these was done early on in
the development of the TAP param query proposal.
Scope of TAP Parameter-Based Queries
Thus far two proposals have been made to begin to define the scope of
TAP parameter-based queries. The first originated within NVO and deals
explicitly with the issue of querying table data and metadata within
the TAP interface. In a later phase of the TAP discussions the concept
of a generalized parameter query language (PQL) was also introduced.
The functionality proposed for parameter-based queries of table data
and metadata was summarized in late February. The details can be
found here:
http://www.ivoa.net/forum/dal/0902/1016.htm
We won't repeat the details here but the capabilities discussed in
the link above include the following:
- Simple table/DBMS metadata queries.
- Cone search replacement (spatial data model support).
- Multi-position queries ("multicone").
- Simple filter-type queries of astronomical catalogs.
- Query for table modifications (MTIME).
- Use of views to leverage SQL with simple param queries.
As part of TAP, param query would support inline or URL-based table
uploads, and querying of arbitrarily large catalogs using async
execution or streaming data transfers. Integration with VOTable
and use of UTYPE-based queries may also eventually be possible.
These capabilities are however shared with ADQL-based queries and
are not specific to the parameter query, and with the exception of
VOTable integration and UTYPE support have already been specified.
The parameter query language (PQL) concept proposes an ADQL-like
general query capability using parameters instead of a parsed language
to pose the query. To the extent that this is used to query tables it
is the same as what is described above; the issues arise if we attempt
to use a generic query to query virtual data or access typed data
(images, spectra, etc.) where the semantics of the query necessarily
depend upon the type of data being accessed. This issue is discussed
further under "Issues for Discussion" below.
Interface
The most complete definition of the proposed param query interface
and functionality may be found in section 3.3 of V0.3 of the TAP draft
(as presented in the Baltimore interop in fall 2008):
http://wiki.ivoa.net/internal/IVOA/TableAccess/TAP-v0.3.pdf
The basic interface is preserved in later versions of the draft
TAP spec but TAP-specific functionality such as use of param query
for table metadata queries and multi-position queries is no longer
fully specified.
The param query interface as currently proposed includes the following
parameters:
    POS, SIZE   "Cone search" type spatial queries, including
                multi-position queries using table uploads.
    REGION      Spatial queries using more general STC-based regions.
    SELECT      Specifies the table columns to be returned.
    FROM        Specifies the table to be queried (including
                TAP_SCHEMA tables).
    WHERE       Specifies an optional simple filter to be applied to
                specific table fields.
Usage of these parameters is more fully presented in the TAP draft
specifications, as at the link above.
Other parameters, common with the ADQL query, can also be used, e.g.,
FORMAT, UPLOAD, MAXREC, MTIME, RUNID (an issue not discussed further
here is whether MTIME should be limited to the param query).
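To make the parameter set concrete, here is a hedged sketch of a cone
search style catalog query combining the spatial and general
parameters. The endpoint, the table and column names, and the value
formats (POS as "ra,dec" in decimal degrees, SIZE in degrees, SELECT
as a comma-separated column list) are assumptions for illustration;
the TAP draft is the authority on the actual syntax.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and catalog; value formats are assumed, not
# taken from the TAP draft.
BASE_URL = "http://example.org/tap/query"

query = {
    "FROM": "foo",            # catalog table to search
    "POS": "180.0,-30.0",     # assumed: ra,dec in decimal degrees
    "SIZE": "0.1",            # assumed: search size in degrees
    "SELECT": "ra,dec,mag",   # assumed: comma-separated column list
    "FORMAT": "csv",
    "MAXREC": "100",
}

# Keep commas unescaped so the query stays readable in a browser.
cone_url = BASE_URL + "?" + urlencode(query, safe=",")
print(cone_url)
```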
PQL preserves all this but proposes a more general DAL parameter-based
query language, not specific to table data. Non-spatial query
parameters such as BAND and TIME are proposed; these are not normally
associated with table data but are used in the other DAL interfaces.
Aside from semantics the most significant change is replacement of
separate ParamQuery and AdqlQuery operations with a single query
operation, using LANG (or some such parameter) to specify the type
of query method to be used, i.e., ADQL, other-QL, or param.
Issues For Discussion
1. TAP-Specific Parameter Queries vs Generic PQL
The issue here is whether the TAP param query should be specific to
TAP, or a more generic query language, like ADQL, which could be
used in other contexts.
A primary requirement for parameter queries in TAP is that we fully
specify how to query data tables as well as table metadata - not
images, not spectra, not spectral line lists, etc., but actual tables
of some sort, as this is what TAP is primarily for. As noted under
"scope" above, we want to be able to do cone search or multi-position
queries of astronomical catalogs, possibly including a filter
constraint specified over the table fields. A simple filter-type
query with no spatial constraint is also needed. Param query should
provide a basic mechanism for table metadata queries. Whatever we
do, TAP param query needs to fully and explicitly specify how we do
these things.
The possibility of using a generic parameter-based query to query for
any type of data (not just tables) is also intriguing, and is part of
the motivation for the PQL proposal. While there is some potential here
(more on this below) there are two main issues with this proposal.
First, while DAL queries such as SIA, SSA, etc. may look similar and
have similar parameters, they are used for actual data access as well
as for data discovery and the semantics are necessarily specific to
the type of data being accessed and the need to specify virtual data.
For example, if we look at what is required for spectral extraction,
or slicing and dicing a data cube, or generation of synthetic spectra
from a theoretical model, this has little to do with some generic
query mechanism. Second, if we try to make TAP parameter queries
generic we must not in the process compromise our primary requirement
of fully specifying how to query table data and metadata.
Nonetheless there is a role in DAL for a generic data query mechanism,
known as the generic dataset query. This has been under discussion
for some years and is documented in the DAL2 architecture document:
http://wiki.ivoa.net/internal/IVOA/SiaInterface/DAL2_Architecture.pdf
In object modeling terms the generic dataset is the base class for
all the DAL interfaces, with SIA, SSA, etc. being subclassed from the
generic dataset, providing specialized access for each major type of
astronomical data. In addition, the generic dataset query would make
it possible to discover any type of data with a single query, describe
associations among related primary datasets to model complex data,
link to the actual physical (archival) datasets for retrieval, or link
to data services which could be used for more advanced data access.
DAL has long proposed adding an actual service to implement the generic
dataset query. However, one can't help but notice that the proposed
PQL and the generic dataset query have similarities, especially if we
restrict PQL to data discovery (no actual data access or virtual data).
While something like PQL could not serve as the base class for actual
data access services, it is similar to the generic dataset query.
The proposed generic dataset query would provide both parameter
and ADQL capabilities to query for generic datasets, returning the
result as a table providing associations and data links to model
complex data and provide access to such data. In practice a site
would probably construct an actual DBMS table providing an index to
all primary datasets (e.g., archive files) available at the site,
describing each such dataset using the generic dataset metadata (this
is essentially the same as the Observation data model in DM parlance).
If this generic dataset index is an actual table, can we use TAP to
query it? Clearly we can, with both parameter and ADQL interfaces,
because that is what TAP already specifies. The next question is
whether we can generalize the TAP parameter query capability to
provide the functionality of the proposed generic dataset query.
It is very close already. If we extend TAP param query to do the
generic dataset query (as well as general table data/metadata queries)
then we will have the generic dataset query as well as something
much like the proposed PQL. So long as the generic dataset index
at a site is an actual table, TAP ADQL queries would be supported
as well. What the TAP param/GDS query would add would be integrated,
high level support for the generic dataset data model.
2. Use of Range-List for Filter Specification (WHERE parameter)
The value of the WHERE parameter in TAP param query is a simple list
of table field constraints, each of which is a simple open or closed
range, list of allowable values, or textual whole or substring match.
Negation and null value comparisons are supported. Pattern matches
are case insensitive but can be made case sensitive by quoting.
Lexical analysis can also be defeated by quoting.
The use of a simple range-list for the WHERE parameter has always
been a debatable issue. It provides all that is needed for simple
table metadata queries (the original motivation), but one might like
a more powerful parsed expression capability for general table filter
constraints. The range list (a DAL2 standard parameter syntax) does
not require a rule based parser to process; all that is required is
simple lexical token generation. It is straightforward to convert
a param query WHERE clause into an equivalent native SQL (or ADQL)
expression.
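The conversion can be sketched in a few lines of Python. This is not
the syntax defined in the TAP draft: the sketch assumes a simplified
range-list in which constraints are separated by ";", each constraint
is "field,value[,value...]", and a value of the form "lo/hi", "lo/",
or "/hi" denotes a range. Negation, null comparisons, pattern matching
and quoting are left out. The function name is hypothetical.

```python
import re

# Accept only plain identifiers as field names, since they are
# interpolated into the SQL text; values go in as bound parameters.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*$")


def _literal(token):
    """Return the token as int or float when it parses, else as str."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token


def where_to_sql(where):
    """Translate the assumed range-list syntax into (sql, params)."""
    clauses, params = [], []
    for constraint in where.split(";"):
        field, *values = [tok.strip() for tok in constraint.split(",")]
        if not _IDENT.match(field) or not values:
            raise ValueError(f"bad constraint: {constraint!r}")
        alternatives = []
        for value in values:
            if "/" in value:
                # Range: either bound may be omitted for an open range.
                lo, hi = value.split("/", 1)
                parts = []
                if lo:
                    parts.append(f"{field} >= ?")
                    params.append(_literal(lo))
                if hi:
                    parts.append(f"{field} <= ?")
                    params.append(_literal(hi))
                if not parts:
                    raise ValueError(f"empty range in: {constraint!r}")
                alternatives.append("(" + " AND ".join(parts) + ")")
            else:
                # Single allowable value.
                alternatives.append(f"{field} = ?")
                params.append(_literal(value))
        # Listed values for one field are alternatives (OR).
        clauses.append("(" + " OR ".join(alternatives) + ")")
    # Constraints on different fields must all hold (AND).
    return " AND ".join(clauses), params


sql, params = where_to_sql("mag,10/15;band,V,R")
print(sql, params)
```

Only tokenizing on ";", "," and "/" is needed here, which is the point
made above: no rule based parser is involved, and the result can be
handed to any DBMS that accepts parameterized SQL.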
WHERE as proposed is simple to compose and process, and adequate for
simple filter-type table field constraints. It is tempting to permit
more general expressions but this could significantly complicate
implementations, and in any case we already have the ADQL query if
general expressions are required. The issue is whether a more general
expression mechanism is warranted, and if so what it would look like.
If possible we would like to maximize DAL2 compatibility to promote
code reuse, and minimize use of HTTP-unfriendly metacharacters to
simplify user submission of queries with common Web tools. The syntax
currently proposed is a compromise, taking all these considerations
into account.
3. Table Metadata Queries
Should VOSI (for the registry or registry-oriented client apps)
and param query be our primary TAP mechanisms for metadata queries?
MAXREC=0, with either the ADQL or param query, can also be used but
this provides more limited information.
Imagine we are demonstrating TAP to a user, using only a Web
browser. It is very tempting to type in something like
<preamble>?FROM=TAP_SCHEMA.tables&FORMAT=text (or html,csv,tsv)
to get a simple list of the tables the service supports, followed
by something like
<preamble>?FROM=TAP_SCHEMA.columns&WHERE=table_name,xxx
to examine the columns of a table, after which we are ready to
submit a data query.
So long as we have the TAP_SCHEMA this is all straightforward.
An ADQL query could be used as well, although it would be overkill
for simple metadata queries. The issue is whether we want to define
a minimum requirement for such metadata queries (as in the original
TAP 0.3 draft). The above queries are implementable without requiring
actual metadata tables in the DBMS.
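A minimal sketch of how a service might answer these two queries from
static metadata, with no DBMS involved, is shown below. The table
contents, column names and function name are made up for illustration,
and only a single "field,value" equality constraint is handled.

```python
# Static, hand-written metadata standing in for TAP_SCHEMA. The
# tables and columns listed here are invented examples.
STATIC_SCHEMA = {
    "TAP_SCHEMA.tables": [
        {"table_name": "foo", "description": "Example catalog"},
        {"table_name": "bar", "description": "Another example catalog"},
    ],
    "TAP_SCHEMA.columns": [
        {"table_name": "foo", "column_name": "ra", "datatype": "double"},
        {"table_name": "foo", "column_name": "dec", "datatype": "double"},
        {"table_name": "bar", "column_name": "id", "datatype": "int"},
    ],
}


def metadata_query(from_table, where=None):
    """Answer FROM=<TAP_SCHEMA table> with an optional WHERE=field,value."""
    rows = STATIC_SCHEMA[from_table]  # KeyError for unknown tables
    if where is None:
        return list(rows)
    field, value = where.split(",", 1)
    # Assumed: matching is case insensitive unless quoted; quoting is
    # not handled in this sketch.
    return [
        row for row in rows
        if str(row.get(field, "")).lower() == value.lower()
    ]


foo_columns = metadata_query("TAP_SCHEMA.columns", "table_name,foo")
print([row["column_name"] for row in foo_columns])
```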
4. Specifying Query Method
Since we no longer have separate ParamQuery and AdqlQuery service
operations, both having been combined into a single operation,
an issue is how we specify this in the service interface. What we
currently have is LANG; however, the parameter query is not a general
query language in the sense that ADQL is. Perhaps LANG should be
generalized to QUERYTYPE, QUERYMETHOD, or some such concept.
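Whatever the selector is called, the service side amounts to a simple
dispatch on its value. The sketch below is only illustrative: the
handler functions, the selector values "ADQL" and "PARAM", and the
QUERY parameter carrying the ADQL text are all assumptions, not part
of the proposal.

```python
def run_adql_query(params):
    # Placeholder: a real service would parse and execute the ADQL.
    return ("adql", params["QUERY"])


def run_param_query(params):
    # Placeholder: a real service would evaluate FROM, WHERE, POS, etc.
    return ("param", params["FROM"])


# Assumed selector values; the proposal does not fix these.
QUERY_METHODS = {
    "ADQL": run_adql_query,
    "PARAM": run_param_query,
}


def dispatch(params, selector="LANG"):
    """Route a request to a query method named by the selector parameter."""
    method = params.get(selector)
    if method is None:
        raise ValueError(f"missing {selector} parameter")
    handler = QUERY_METHODS.get(method.upper())
    if handler is None:
        raise ValueError(f"unsupported {selector}: {method}")
    return handler(params)


print(dispatch({"LANG": "ADQL", "QUERY": "SELECT * FROM foo"}))
print(dispatch({"QUERYTYPE": "param", "FROM": "TAP_SCHEMA.tables"},
               selector="QUERYTYPE"))
```

Renaming LANG to QUERYTYPE or QUERYMETHOD changes only the selector
name in such a scheme; the dispatch itself is unaffected.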
-- DougTody - 14 May 2009