This page contains my comments on the TAP 0.31 spec.
First a summary of the main points, then a list of more detailed comments with their location in the text.
Some of these comments have (no doubt) been noted by others. I have been writing comments down while reading the spec over a couple of days, so they may be somewhat repetitive.
Some comments may have been made irrelevant by the recent version 0.4.
NOTE: Notes like this one are included below once action has been taken with respect to each point (PD, aka PatrickDowler). If the TAP doc version is not specified, it is TAP-0.41.
Major points/issues/questions
- /sync vs /async: I think it would be preferable if implementers could choose to implement /sync and/or /async, rather than the spec mandating both /sync and /async ADQL. I think /async is so much harder to implement that a /sync-only service should be allowed, but I can imagine that some implementers would prefer always using /async for data queries. I propose that either (or both) be allowed, and that the choice be part of the service metadata.
NOTE: post to dal mailing list (2009-03-02) to explain and initiate discussion (PD)
- Metadata: (at the bottom of this page is a proposal for a UML data model containing everything already in the TAP_SCHEMA model, plus extras. XML schema and TAP_SCHEMA tables can easily be derived from it. It is based partially on discussions on the mailing list.)
- foreign keys MUST be queryable (though they may not exist, of course), and have therefore been added to the metadata
- indexes SHOULD [GL changed from MUST] be queryable (though they may not exist, of course), but MUST NOT be specified simply with an index=true attribute on column metadata
- "SQL type" SHOULD (MUST?) be added as a possible data type to column metadata. [GL see datatypes page]
- IF UDFs are really part of ADQL, metadata about them MUST be queryable (though ... ); maybe the standard functions such as INTERSECTS etc. should then also be specified here IF they are supported.
NOTE: metadata discussion deferred until after next draft (PD)
- Grouping of and dependencies between HTTP parameters for the different request types should be made explicit.
- Imho, MAXREC and MTIME parameters should not be mixed with ADQL.
NOTE: This should be much more clear in TAP-0.4 (PD)
- Case sensitivity: The QUERY parameter is supposed to be case sensitive. Imho this should not be the case.
- ADQL is case insensitive. So are some major online databases (SDSS, Millennium, others?). So are many default settings on relational databases.
- Propose that case sensitivity is only an issue for column values, not (never ?) for names of tables and columns etc.
- Propose to make this a capability, possibly can be added at level of complete database, or schema, or table, or even column level. It is only relevant for (VAR)CHAR columns, maybe the T and Z in iso8601 dates(?).
NOTE: posted explanation and request for comment to dal mailing list on 2009-03-02 (PD)
NOTE: After discussion on dal mailing list, the doc has been changed to defer to the query language spec in matter of case sensitivity.
line-by-line notes/questions/issues
(s=section,p=page,par=paragraph on page or in section).
- s1 p4 par2: "... it is not a table containing links to data object ...". I suppose that if someone publishes a table that contains links to data sets, images or spectra, there is no problem with that. Queries might then indeed produce such links.
NOTE: this text is no longer in as of TAP-0.4 (PD)
- s1 end p4: ".. is not visible to users." I don't know whether it is necessarily a good idea to completely abstract away from a user whether there is a relational database on the backend or not. In some sense the fact that one can send ADQL, which is clearly an SQL dialect, makes users expect relational database technology. They may then also expect, and use, some specific database features such as indexes and foreign keys when writing their queries.
Also I think that if this abstracting-away were to translate into a suggestion to potential implementers that they could just as well implement TAP on files, we'd do them a disservice. The best way to support ADQL queries is to store one's results in a relational database and pass it the ADQL, possibly slightly adapted, not to write one's own database engine.
NOTE: extra text about abstraction removed for clarity (PD)
- s1 p5 par2: "... joins ... and provided the service supports these capabilities.". I would think that services MUST support joins, as those are an integral part of ADQL and because a service MUST support ADQL queries. Or is it possible to specify that one supports only a subset of ADQL?
NOTE: In my opinion it is necessary to allow services to support a subset of ADQL; this would be described in the capabilities returned from the VOSI capabilities request... not sure if one lists all the ADQL features (keywords) that are supported or the version of ADQL and then the ones that are not (should be a smaller list)... TBD (PD)
- s1 p5 par3:".. conforming to the second generation (DAL2) interface standards [ref]." It would be really good to have this [ref]! Maybe such a "meta-specification" would be a good place to put some of the parameter query specification in.
NOTE: this text is no longer in as of TAP-0.4 (PD)
- s1.1.1: Confusing section. There seem to be at least three ways of querying for table metadata:
- querying standardised tables using ADQL or PARAMQUERY
- tableset queries
- VOSI queries
NOTE: deferring (as above for metadata)
- s1.1.2 p6 end par2: "... (ADQL), a standardized subset of SQL92...". This is not quite correct. ADQL is based on SQL92, but it is not a strict subset, as it adds extensions such as user-defined functions and of course all the REGION stuff.
NOTE: this text is no longer in as of TAP-0.4 (PD)
- s1.1.2 p6 par3: "... use an off-the-shelf ADQL parser...". This is the problem with ADQL, that in general one can not simply pass it through to the underlying database, even if it is properly supplied with the required user-defined-functions.
NOTE: this text is no longer in as of TAP-0.4; underlying issue not otherwise addressed (PD)
- s1.1.2 p6 par3: "... simplified parametric queries for the most common use cases." How do we know what the "most common use cases" are? I think this depends strongly on the database. It likely refers to the usual suspect cone search as the most common use case, but is that true? Could be changed to "some common use cases".
NOTE: this text is no longer in as of TAP-0.4 (PD)
- s1.1.3 p6 par3: Use of UWS, which is not accepted yet, in this specification, would seem to require that TAP must define its view of what UWS is. This would be particularly useful for those people who want to implement TAP before UWS is completely accepted. Same is true for possible dependencies on other not-yet-accepted standards such as VOSI.
NOTE: It was accepted in Trieste that the UWS spec would have to be developed and standardised ahead of TAP (PD)
- s1.1.3 p7 par1: "... there are many more advanced use cases where synchronous queries are not sufficient." I would argue that this has less to do with how "advanced" a use case is than with whether a query requires lots of work and/or resources on the server side. The query can be as simple as
select * from thattable
, not advanced at all. But it may lead to timeouts/overflows for /sync queries. Other queries, in contrast, make very advanced use of ADQL, and precisely because of that (calculating statistics on the server instead of downloading, proper index usage, proper database design etc.) can be supported with /sync just as well. And /sync is MUCH easier to implement.
NOTE: this text is mostly gone as of TAP-0.4; discussion of sync vs async (from main points above) redirected to DAL mailing list (PD)
- s2 "Requirements for a TAP service (normative)" (my italics). It seems to me that some requirements in this section are aimed at clients, not the service. Should those be identified, and if I am correct, must something be done about them?
NOTE: it is true that when describing a service interface that some things are requirements for the service and some for the clients; the latter also need to be described so that the correct response can be specified (e.g. an error when a required param is missing); will try to clarify after next draft (PD)
- s2.1: As /sync is SO MUCH easier to implement, and can nevertheless provide more than adequate support (from experience with sync-only Millennium database), is it possible to change the requirements to something like: "A TAP service MUST support at least one of sync-ADQL and async-ADQL". I first thought that sync alone should be made mandatory, but I guess some people would like to only implement async.
NOTE: discussion of sync vs async (from main points above) redirected to DAL mailing list (PD)
- s2.1 p9 3rd item in list I would think that table metadata MUST be provided. Without it no queries are possible.
NOTE: agreed, changed in TAP-0.41
- s2.1 p9 final par "...inheritance of requirements ...". This is relevant as well for SimDB. There we define a global data model for describing (3+1D/space-time/"cosmological") simulations. The model gets a mapping to TAP with the goal that users can use ADQL (sync only necessary!) to query SimDB implementations.
- s2.2 p9 par1+2 and p10 par2 "...service must be represented as a tree structure..." and "... represent the service as a whole" and "...web resource must represent the results...". Is "represent" a formal concept in REST or similar? If not, what is meant by this? Must everything under the root be related to the service?
NOTE: Yes, the R in REST is represent(...). I don't see any reason one could not have/serve other web resources from within the tree. TAP (and UWS) simply enumerate a (required) set of resources and what they mean (PD).
- s2.2 p10 par4 "...may return a cached copy...". I don't really understand this paragraph. Isn't this up to the service? If it knows that a certain query always corresponds to a particular cached data product, why would it depend on a GET or a POST? Also (see par7), does it mean that /async requests can never return cached data?
NOTE: This is just explaining how HTTP works in practice and really belongs in Use of HTTP (Section 7 in TAP-0.4).
- s2.2 p10 par1 and par5 "A TAP service must provide a web resource with relative URL /sync" and "A TAP service must provide a web resource with relative URL /async." See the comment (@*2.1*) above for motivation. Could this be SHOULD or MAY? Or allow implementers to choose one (or both)?
NOTE: as above, sync vs async discussion on mailing list (PD)
- s2.4 p11 par2 "Not all combinations of the parameters are meaningful." It would be good to indicate explicitly which combinations are valid.
NOTE: Let's revisit this w.r.t. TAP-0.4 now that many parameters have been moved to a separate document (PD).
- s2.4.1 p11 par1 "A TAP client must set this parameter correctly ...". This is an example of comment @*2* above, a MUST requirement on a client. Is this appropriate?
NOTE: It is informative for a service implementor though: it tells them what to assume and what is an error. It could be worded in a more service-implementor centric fashion, but then a client-centric doc would be needed -- maybe is? (PD).
- s2.4.1 p11 par2 "If a service receives a spurious parameter ...". Is a parameter that is not in the list of parameters to be considered spurious as well, or is it an error?
NOTE: It is spurious. It is assumed that the service will extract parameters it knows about from the request and ignore anything that is not applicable, which includes everything it does not know about.
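The behaviour described in this NOTE can be sketched as follows. This is only an illustration, not text from the spec: the set of known names is limited to parameters mentioned on this page, the helper name `extract_known_params` is hypothetical, and keeping the first value of a repeated parameter is an assumption made here.

```python
from urllib.parse import parse_qsl

# Parameter names mentioned on this page; the authoritative list is in the TAP spec.
KNOWN_PARAMS = {"REQUEST", "LANG", "QUERY", "MAXREC", "MTIME"}


def extract_known_params(query_string):
    """Return known parameters keyed by upper-cased name; ignore everything else."""
    known = {}
    for name, value in parse_qsl(query_string, keep_blank_values=True):
        key = name.upper()  # parameter names are matched case-insensitively
        if key in KNOWN_PARAMS and key not in known:
            known[key] = value  # values are kept exactly as sent
    return known


# "colour" is not a known parameter, so it is silently dropped rather than rejected.
params = extract_known_params(
    "request=doQuery&LANG=ADQL&query=select+*+from+thattable&colour=blue"
)
```

With this approach an unknown parameter never triggers an error; only a missing required one would.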
- s2.4.1 p11 par1 "If a TAP service receives a request without...". I assume that this concerns a TAP service request that has a /sync or /async added to the root, otherwise it seems to be inconsistent with the last par on p9, which does not mandate error.
NOTE: not sure what this refers to as the page numbers are not that helpful (did you print on A4?) but you are right that the REQUEST requirement applies to direct access to the /async and /sync endpoints. That is, you would not need REQUEST to access the child resources under a UWS job. Text clarified in TAP-0.41 (PD).
- s2.4.1 p11 par2, list The allowed values seem to have arbitrary case. Is this to be coordinated with the table on p11?
NOTE: consistent case in TAP-0.4 (PD).
- s2.4.1 p11 par2, list The statements on getCapabilities, getAvailability and especially getTableMetadata relate to corresponding VOSI metadata.
- As VOSI is not yet an accepted standard (correct?), it might be good (formally necessary) to give TAP's view on what this means explicitly. (Or is this done later?)
- Why does this spec, which seems to be the correct specification for defining how to talk to and about table sets/databases, defer to another, not yet accepted spec for table metadata? Actually, there seems to be no table metadata in the VOSI spec at all (I refer to http://www.ivoa.net/Documents/WD/GWS/VOSI-20081023.pdf, is that the correct VOSI spec?)
NOTE: As with UWS (above), we expect that VOSI as it pertains to TAP will be standardised ahead of TAP (PD). The returned XML is specified by the VODataService spec; anyway, this needs discussion as part of the whole metadata topic (PD).
- s2.4.2 p12 par1 "The query string is case sensitive."
- ADQL spec states (p4, 3rd line; p6 1st line): "Case insensitiveness otherwise stated" and "Both the identifiers and the keywords are case insensitive". So why does TAP go against this?
- IF this is sometimes desirable, could this be a capability, and would it be possible to state for a TAP service that it is in fact case-insensitive? SkyServer and the Millennium database are not case sensitive, as MS SQLServer is case insensitive by default. Note that for these databases the case-insensitivity even applies to values of CHAR and VARCHAR columns! The latter is not so in Postgres, though as far as keywords and table and column names are concerned Postgres also seems to be case insensitive (at least in the default installation on my desktop PC). It may be useful to look at the report on different database systems by JVO in Victoria. There might therefore be two modes of case insensitivity: keywords+schema and CHAR values. SQLServer allows case sensitivity, and this can even be configured at the column level. This might imply another metadata element for columns: isCaseSensitive. In any case it would be useful to see how other databases handle case sensitivity (by default).
NOTE: initiated discussion of case sensitiveness on mailing list 2009-03-02 (PD).
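The two modes of case insensitivity distinguished above (keywords and schema names versus CHAR values) can be seen side by side in a small experiment. The sketch below uses SQLite through Python's standard `sqlite3` module purely because it is easy to run; it is not one of the databases discussed above, and the `Galaxy` table is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Galaxy (name VARCHAR(20))")
conn.execute("INSERT INTO Galaxy VALUES ('Andromeda')")

# Mode 1: keywords and table/column names written in a different case still resolve.
by_identifier = conn.execute("select NAME from galaxy").fetchall()

# Mode 2: comparison of CHAR values is case-sensitive by default in SQLite...
default_match = conn.execute(
    "SELECT name FROM Galaxy WHERE name = 'ANDROMEDA'"
).fetchall()

# ...but can be made case-insensitive for a comparison by naming a collation.
nocase_match = conn.execute(
    "SELECT name FROM Galaxy WHERE name = 'ANDROMEDA' COLLATE NOCASE"
).fetchall()
conn.close()
```

In SQLite the same collation can also be declared on the column itself, which is the kind of per-column setting an isCaseSensitive metadata element would have to describe.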
- s2.4.2 p12 par1 "...the case of table and column names must be preserved..." This seems a requirement on the client, or does it imply that if the client uses a different case for a table for example the service MUST report an error?
NOTE: part of the whole case-sensitive topic above; it is a requirement on the client as stated (PD).
- s2.4.2 p12 par2 "...the service must support the use of datetime/timestamp values in ISO8601 format." Apparently ISO8601 is still rather liberal and has different versions.
- Is ISO8601:2004 intended?
- Must all of ISO8601(:2004) be supported?
- MS SQLServer 2005 does not seem to support all allowed ISO8601 forms, even though it claims to be compatible. For example, it seems (in my installation) not to allow the basic form yyyymmdd and to need the extended form yyyy-mm-dd.
- An overview of other RDBMSs would be useful.
NOTE: Agreed: reviewing what DBs mostly support so that dates can be passed through easily would be good... TBD (PD)
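One way a service could pass dates through easily, whatever the backend accepts, is to normalise incoming values to a single form first. The following is a minimal sketch under that assumption; it handles only the two calendar-date layouts mentioned above (no times or time zones), and the function name `to_extended_date` is hypothetical.

```python
from datetime import datetime

# The two ISO 8601 calendar-date layouts mentioned above, as strptime patterns.
ISO_DATE_PATTERNS = {
    "extended": "%Y-%m-%d",  # yyyy-mm-dd
    "basic": "%Y%m%d",       # yyyymmdd
}


def to_extended_date(text):
    """Normalise a basic or extended ISO 8601 date to yyyy-mm-dd, else raise ValueError."""
    for pattern in ISO_DATE_PATTERNS.values():
        try:
            return datetime.strptime(text, pattern).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next layout
    raise ValueError(f"not a supported ISO 8601 date: {text!r}")
```

A spec that names the exact subset of ISO8601 to be supported would make a table like `ISO_DATE_PATTERNS` something implementers could agree on rather than guess.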
NOTE: We wanted to specify a fixed set of ADQL region constructs that everyone supported, to make it easier on the client and on the implementor (fewer decisions). The text says "contains columns with spatial" AND "service wants to support". This is intended to mean that spatial querying support via ADQL region constructs is optional. The example of a range of dec above is independent of this and perfectly acceptable. Text clarified in TAP-0.41 (PD).
- s2.4.2 p12 par3 "the extent of STC/S support within the REGION function is left up to the implementation" I can read this as allowing no support for STC string at all, which implies really that I do not support REGION, which I MUST do when supporting spatial queries. Seems not consistent.
NOTE: For consistency with the direct ADQL constructs, we could require support for position, circle, and box in STC/S. In general, the claim is that services and applications can support whatever part of STC they like and that is OK... (PD).
- s2.4.2 p12 par4 "...should return an error if ... mix constants and column references for coordinate system and coordinate values." I do not understand the reason for this restriction at all. This was also noted by Markus Demleitner, I think. This seems like a change to the language, which might even require different parsers/interpreters than one would normally implement. How far does this restriction go? Is the following query OK, for example:
select POINT(c.coordSys, t.ra, t.dec)
from (select 'ICRS' as coordSys) c
, table t
...
NOTE: I agree that ADQL allows these, and in ADQL discussions where people didn't like the look of such constructs it was argued that this was just the nature of ADQL (SQL) and its treatment of argument types (literal is equivalent to column ref); this text was included in a provocative manner when it should be simply a warning to users that if they do this they are possibly going to make mistakes. Of course, there are plenty of ways to make mistakes with ADQL and this particular complexity is not going to solve that. Changed text to make this a note/warning in TAP-0.4 (PD).
- s2.4.4 p13 "The service SHOULD implement the LANG parameter." If the service does not, which language/version is supposed to be supported? Is this a capability?
NOTE: changed to MUST in TAP-0.4 (PD)
- s2.4.5 p13 par1 Could the acceptable MIME types be listed explicitly in the document?
NOTE: good idea (PD)
- s2.4.5 p13 list Might it be useful to have an html table (i.e. starting with <table..> and ending with </table>) as a possible return type? Such a result could be added to a wrapping web page, possibly AJAX-like. Might TeX tables be of interest?
NOTE: There is an html format but that is for the whole page. Can you plausibly get an html table element without associated CSS style sheets and expect something useful? Marginally maybe... (PD)
- s2.4.5 p13 list Is the VOTable allowed to contain data in all of the available DATA types (TABLEDATA, BINARY, FITS), and also LINKs instead of DATA? (Maybe answered in 2.12?)
NOTE: I think the intent is for TABLEDATA ONLY. TBD? Will clarify text to say TABLEDATA only for now in TAP-0.41 (PD)
- s2.4.6 p14 par1 "...name for the table name SHOULD be an unqualified tablename...". Seems a requirement on clients, but not a MUST. What if not obeyed?
NOTE: It must be a legal table name as defined by ADQL, so not following the should means that one has added optional schema (and maybe catalog) names as prefixes. If the schema name is TAP_UPLOAD (doc incorrectly says TAP_SCHEMA) that would be ok, but if it is anything else it would have to be an error. Clarifying to say "must be an unqualified table name" in TAP-0.41 (PD).
- s2.4.7 MAXREC seems not necessary for ADQL, as TOP plays that role there. Useful for ParamQueries though.
NOTE: MAXREC is used to possibly negotiate a query size limit with the service, which may not otherwise be able to tell what the query will return. Without adding MAXREC to an ADQL query (even one using TOP) the service may truncate the result at a different place due to default limits (PD).
- s2.4.7 p14 par4 "...if overflow occurs, MAXREC plus one rows should be returned to indicate that overflow occurred ...". In my opinion, if a user requests that MAXREC rows be returned, either using this parameter or using TOP in ADQL, then MAXREC rows (or fewer) MUST be returned, not MAXREC+1. In particular, enforcing the quoted rule would mean that the obvious implementation (using TOP or LIMIT in the SQL) would need to use TOP ..+1 etc. ONLY if the service's "maximum permitted value for MAXREC" is reached should an overflow warning be given, and then in the manner described in 2.8.4, using an INFO element.
NOTE: this paragraph was removed in TAP-0.4; rules for indicating truncation are described elsewhere (PD).
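For illustration, a service can detect truncation internally without ever handing more than MAXREC rows to the client. The sketch below is one possible approach, not what any TAP draft prescribes: it asks the backend for one extra row, drops it, and reports overflow through a separate flag (which a real service might turn into an INFO element). SQLite and the helper name `run_with_maxrec` are assumptions made for the example.

```python
import sqlite3


def run_with_maxrec(conn, sql, maxrec):
    """Return (rows, overflow): at most maxrec rows plus a truncation flag."""
    # Ask the database for one extra row purely to detect truncation.
    limited = f"SELECT * FROM ({sql}) LIMIT {int(maxrec) + 1}"
    rows = conn.execute(limited).fetchall()
    overflow = len(rows) > maxrec
    return rows[:maxrec], overflow  # the extra row never reaches the client


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thattable (id INTEGER)")
conn.executemany("INSERT INTO thattable VALUES (?)", [(i,) for i in range(5)])

rows, overflow = run_with_maxrec(conn, "SELECT id FROM thattable", 3)
all_rows, no_overflow = run_with_maxrec(conn, "SELECT id FROM thattable", 10)
conn.close()
```

The cost is the TOP ..+1 style rewrite mentioned above, but it stays an implementation detail of the service.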
- s2.4.7 p14 par5 "..null query, that is, a query which produces an empty table.." In its current form (i.e. using MAXREC) I would not call this a null query, but a null request.
NOTE: Last sentence mentioning null-query removed in TAP-0.41 (PD).
- s2.4.8 I don't think MTIME should be used together with ADQL. IF a table contains a "lastModified" column, users can use it in their ADQL queries. If there is no such column, it is an indication that it is not possible to pose this type of query. It might be suggested that in general it is good practice to have such columns ("createDate", "updateDate"), especially if tables get updated over time. If tables get created and filled in one bulk insert, it may be useful to add such information to the table's metadata?
NOTE: It is true that MTIME is intended for finding new/changed/deleted records and making a mirror. While that may generally be best done via param-query, at this point only ADQL support is required, so although MTIME is optional we did not want to make it dependent on other optional features. If the service cannot deal with MTIME it is ignored, as usual (PD).
- s2.4.11 This seems to me a perfect example of a meta-standard suitable for the "DAL-2 family of specifications": how to specify lists and ranges in DAL service parameters. Something similar was already specified in SSA as well. [I guess it has indeed been removed from version 0.4]
NOTE: Yes, this was moved to a separate document (PD).
- s2.4.13 "Parameter names must not be case sensitive, but parameter values must be so." Seems to conflict with the requirement on LANG in 2.4.4. See also my comment on case sensitivity of ADQL queries above.
NOTE: The section on LANG no longer says the value is case insensitive as of TAP-0.4; other case-sensitivity issues TBD (PD).
- s2.4.14 p17 par2 "Clients should not repeat parameters in a request". Seems to be a SHOULD requirement on clients.
NOTE: It is, although it also says that the service never has to deal with multi-valued parameters in the HTTP sense. Not sure why not.. will bring up on DAL list (PD).
- s2.5 This section seems to belong to 2.6, can it not be merged with that section?
- s2.5 p17 par1 "[[catalog_name”.”[schema_name”.”]table_name]]" Following ADQL, shouldn't this be [[catalog_name”.”]schema_name”.”]table_name ? Note, if I am not mistaken, ADQL does not allow catalog_name..table_name , i.e. schema_name="" (possible IF catalog_name = ""), something which is allowed in SQLServer and corresponds to using the default schema.
NOTE: Fixed in TAP 0.41. Clarified to say that table name is defined in the query language spec. In cases where one can use the .. construct to specify the default schema, there is still a schema and you can put that explicitly in the metadata, so I don't see a problem. (PD)
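The corrected form [[catalog_name.]schema_name.]table_name can be written down as a regular expression, which makes the c..t case easy to check. This is a sketch only: it assumes plain (undelimited) identifiers consisting of a letter followed by letters, digits and underscores, and `split_table_name` is a hypothetical helper.

```python
import re

# [[catalog_name.]schema_name.]table_name with plain (undelimited) identifiers only.
_IDENT = r"[A-Za-z][A-Za-z0-9_]*"
TABLE_NAME = re.compile(
    rf"^(?:(?:(?P<catalog>{_IDENT})\.)?(?P<schema>{_IDENT})\.)?(?P<table>{_IDENT})$"
)


def split_table_name(name):
    """Return (catalog, schema, table); missing prefixes are None."""
    match = TABLE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid table name: {name!r}")
    return match.group("catalog"), match.group("schema"), match.group("table")
```

Under this pattern a catalog can only appear together with a schema, so a name such as c..t is rejected, which matches the reading of ADQL given above.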
- s2.6 I understand this section to imply that TAP should expose these three tables and make them accessible through ADQL and Param queries. If so, that might be stated more explicitly. Some comments on the actual metadata prescription (a summary of the proposal can be inferred from the UML diagram at the bottom of this page):
- first table In first row (schema_name), "catalog.schema", should this be [catalog.]schema ?
NOTE: fixed in TAP 0.41 (PD)
- second table In first row (schema_name), "catalog.schema", should this be [catalog.]schema ?
NOTE: fixed in TAP 0.41 (PD)
- second table In second row (table_name), "catalog.schema.table", should this be [[catalog_name.]schema_name.]table_name ?
NOTE: fixed in TAP 0.41 (PD)
- second table In third row (table_type). As views are apparently described in TAP_SCHEMA.tables, I think it would be useful to store the SQL (ADQL?) that defines the view in this table as well. I suggest an extra row, "view_sql", containing the SQL that defines the view (for rows with table_type=view).
NOTE: ADQL does not specify CREATE statements, so this could not be described with ADQL. As for showing the SQL CREATE VIEW, is that actually worthwhile? It will not necessarily map to anything the user could infer from the metadata (table and column names could be arbitrarily different, for example). (PD)
- third table 2nd row (table_name), "catalog.schema.table" should this be [[catalog.]schema.]table ?
NOTE: fixed in TAP 0.41 (PD)
- third table, datatype I believe it would be very useful to also have an indication of the SQL type of a column. It is that type, and not its mapping to VOTable types that is of relevance when constructing queries.
NOTE: neither the SQL type nor the VOTable type is actually sufficient; one needs the ADQL type, which includes the region constructs as well. The SQL types of those will be (var)char or (var)binary (most likely)... will post discussion to dal list (PD)
It is understood that the result of a query is to be expressed as a VOTable, but VOTable is a messaging format, and should not determine how to express metadata for table sets (databases, really) that can be queried with ADQL. For example, date-like types are missing from the VOTable types. This issue has been discussed on the mailing list, in particular in some emails in the registry thread on VODataService starting with Ray's email here. One problem that has been identified there is that ADQL does not define data types explicitly. One reason why it seems not to need them in the language is that DDL is not supported. But as a consequence the CAST function cannot be supported either. One issue would therefore be which SQL types to use.
NOTE: There have been discussions about functions and so far the consensus has been that we should just leave it out of the initial version. That does mean that people will not be able to use any of the ADQL region stuff without just guessing it will work and being ready for an error. Will initiate further discussion on dal list (PD)
- s2.6 p19 par2 "The schema name TAP_UPLOAD should be included in the table name for any tables uploaded to the service by a client." I suppose this is a requirement on the client? Must TAP_UPLOAD also be added in the TAP_SCHEMA.schemas table?
- s2.6 p19 par3 "...may be queried for tables named TAP_SCHEMA.*..." Is this intended to imply the following ADQL query?
select *
from TAP_SCHEMA.tables
where table_name like 'TAP_SCHEMA.%'
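The query above can be tried out against a mock-up of the metadata tables. The sketch below uses SQLite via Python's `sqlite3`; since SQLite has no schemas, an attached in-memory database named TAP_SCHEMA stands in for one. The two-column table layout, the row contents and the user table `mydb.thattable` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no schemas; attaching a second in-memory database under the name
# TAP_SCHEMA lets the query keep its schema-qualified form.
conn.execute("ATTACH DATABASE ':memory:' AS TAP_SCHEMA")
conn.execute("CREATE TABLE TAP_SCHEMA.tables (table_name TEXT, table_type TEXT)")
conn.executemany(
    "INSERT INTO TAP_SCHEMA.tables VALUES (?, ?)",
    [
        ("TAP_SCHEMA.schemas", "table"),
        ("TAP_SCHEMA.tables", "table"),
        ("TAP_SCHEMA.columns", "table"),
        ("mydb.thattable", "table"),  # hypothetical user table
    ],
)

rows = conn.execute(
    "select * from TAP_SCHEMA.tables "
    "where table_name like 'TAP_SCHEMA.%' order by table_name"
).fetchall()
names = [row[0] for row in rows]
conn.close()
```

One subtlety: in a LIKE pattern the underscore is itself a single-character wildcard, so 'TAP_SCHEMA.%' would also match a name such as 'TAPXSCHEMA.foo'. That is harmless here but worth knowing if the spec text is meant to be read as this query.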
This is a JPEG version of a MagicDraw model which is available in UML form here.
In white: components that have been taken over unchanged. In orange: existing components that have been updated. In purple: completely new components.
In green: a suggestion by Francois Ochsenbein on primary keys and their use in the definition of foreign keys.
NB, the original MagicDraw diagram can be obtained from the VO-URP GoogleCode project as well.
That project is a split-off from the SimDB development in Volute. XML schema serialisations of the model, as well as a specific design for DDL schemas, can be derived from the UML automatically.
For those who don't like UML, here is an attempt at a summary:
- database [name,description, utype]
- schema [name,description, utype]
- table/view: [name,description, utype, sql (for views)]
- column [name, description, utype, datatype, ucd, etc]
- foreignkey [toTableName, ...]
- foreignKeyColumn [fromColumnName, toColumnName]
- index[name, description, ...]
- indexColumn [columnName, rank]
- group [name, id, ...]
- columnRef [columnName, rank]
- param(Ref) [...]
- group(Ref) [...]
- param [name, ucd, ..., value]
- QueryResult
- Result column
- ?source column?
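For readers who prefer code to UML, part of the summary above can be sketched as Python dataclasses. This is an illustration only: the nesting of the original list is not visible in the summary, so the containment chosen here (a table owning its columns, foreign keys and indexes) is an assumption, the snake_case attribute names are adapted from the names in the list, and groups, params and query results are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ForeignKeyColumn:
    from_column_name: str
    to_column_name: str


@dataclass
class ForeignKey:
    to_table_name: str
    columns: List[ForeignKeyColumn] = field(default_factory=list)


@dataclass
class IndexColumn:
    column_name: str
    rank: int  # position of the column within the index


@dataclass
class Index:
    name: str
    description: str = ""
    columns: List[IndexColumn] = field(default_factory=list)


@dataclass
class Column:
    name: str
    datatype: str
    description: str = ""
    utype: Optional[str] = None
    ucd: Optional[str] = None


@dataclass
class Table:
    name: str
    description: str = ""
    utype: Optional[str] = None
    sql: Optional[str] = None  # only meaningful for views
    columns: List[Column] = field(default_factory=list)
    foreign_keys: List[ForeignKey] = field(default_factory=list)
    indexes: List[Index] = field(default_factory=list)

    def indexed_columns(self):
        """Names of columns that take part in at least one index."""
        return {ic.column_name for ix in self.indexes for ic in ix.columns}


# Hypothetical example: a galaxy table pointing at a halo table.
galaxy = Table(
    name="galaxy",
    columns=[Column("id", "BIGINT"), Column("haloId", "BIGINT")],
    foreign_keys=[ForeignKey("halo", [ForeignKeyColumn("haloId", "id")])],
    indexes=[Index("ix_galaxy_halo", columns=[IndexColumn("haloId", 1)])],
)
```

Modelling an index as its own object with ranked columns is what allows multi-column indexes to be described, which a simple index=true flag on a column cannot do.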