NotesOnSP2020 < IVOA

IVOA Web>IvoaInteropPOC>InterOpMay2020>InterOpMay2020GWS>NotesOnSP2020 (2020-05-05, GiulianoTaffoni)

<div id="magicdomid53">This text is intended as a starting point for the discussion.</div> <div id="magicdomid54">We will edit the text together during the session and then transfer the final version back to the IVOA wiki afterwards.</div> <div id="magicdomid55"></div> <div id="magicdomid56">

---+ *Science Platforms: Topics to discuss*
</div> <div id="magicdomid75"></div> <div id="magicdomid853">participants: 80</div> <div id="magicdomid99"></div> <div id="magicdomid58">Main areas of discussion and open questions:</div> <div id="magicdomid59"></div> <div id="magicdomid314">
   * The role of IVOA
</div> <div id="magicdomid315">
   * Interoperable AAI
</div> <div id="magicdomid316">
   * Data access and data proximity
</div> <div id="magicdomid333">
   * [Gregory D-F] I'm interested in how to make data access both transparent, so that a reference to data works across multiple SPs (e.g., if the same notebook is executed on each one), but is appropriately optimized for efficient access to locally available data (delivering on the code-to-data aspect of SPs).
</div> <div id="magicdomid318">
   * Data staging and userspace
</div> <div id="magicdomid319">
   * Software metadata description and software registry
</div> <div id="magicdomid320">
   * Massive data analysis and ML
</div> <div id="magicdomid66"></div> <div id="magicdomid67">There will be some contributions:</div> <div id="magicdomid68">
   1 from Antonio Disanto
</div> <div id="magicdomid69">
   1 from Dave Morris
</div> <div id="magicdomid70"></div> <div id="magicdomid71"></div> <div id="magicdomid73"></div> <div id="magicdomid419"></div> <div id="magicdomid1233">(Christophe Arviset) For reference, at ESA, we are also developing a Science Platform that we call ESA Datalabs: http://tinyurl.com/y8mlqyrn</div> <div id="magicdomid1783">Use cases were collected through various internal workshops at ESAC with space mission teams and scientists. A summary is available at http://tinyurl.com/y9hq7glw and is mainly articulated around 4 pillars:</div> <div id="magicdomid1879">
   1 science data exploitation,
</div> <div id="magicdomid1880">
   1 pipeline development environment,
</div> <div id="magicdomid1881">
   1 collaborative research environment,
</div> <div id="magicdomid1882">
   1 software preservation
</div> <div id="magicdomid635"></div> <div id="magicdomid5328">(Mark Allen) For reference, there is a white paper about "A Science Platform Network to Facilitate Astrophysics in the 2020s" https://ui.adsabs.harvard.edu/abs/2019BAAS...51g.146D/abstract</div>

<div id="magicdomid5104"></div> <div id="magicdomid4837">Matthew Graham - The steep learning curve with all these science platforms means I will only run on one, not many.</div> <div id="magicdomid1805"></div>

Christine Banek: Agree, I think standardization of the interface and tools facing the user is as interesting, if not more interesting.
<div id="magicdomid1127"></div>

PeterT: I agree, I find them (not used many) all very cumbersome. Being used to a very finely tuned laptop, this is like working in the stone age.
<div id="magicdomid563">Here are my pet peeves from my last painful experience, skipping the cumbersome steps needed just to get a terminal:</div> <div id="magicdomid501">- did not have my own user account (wait, I can't use my own zsh setup?)</div> <div id="magicdomid502">- very cumbersome/impossible to cut and paste from my laptop, since I was forced to work in a browser</div> <div id="magicdomid703">- frustrating to see a load of 0.0 while things still run slowly; cannot see the other users or memory usage</div> <div id="magicdomid620">- many "standard" commands not working (csh, time, man, ...) though it was "nice" to be able to pick from many instantiations of your virtual machine.</div> <div id="magicdomid950">- cannot use VNC to do local graphics? (this may be peculiar to SciServer)</div> <div id="magicdomid727">This "user experience" really needs to be worked on.</div> <div id="magicdomid833">Mine was a special introduction to SciServer, and I've also used AWS, which was slightly better because I could install the packages that I needed.</div> <div id="magicdomid725">Could you let us know what platform you were using?</div> <div id="magicdomid1165"></div>

Matthew G.: We use Google Colab a lot and it's hard to see how this interfaces with astronomy resources, particularly data centers. It's the old question: how do I process TBs/PBs at data center 1 against data center 2 when my computing resources are in Cloud 3?
<div id="magicdomid2588"></div>

Andy: Following what users want to do is the crucial thing. It is not clear whether users want platforms to be interoperable. Challenge the CSP to find what IVOA can do. Find the user requirements.
<div id="magicdomid4833"></div>

Mike Fitzpatrick: For the NOAO Data Lab one design idea was that it would be user data (e.g. sub-selected tables in a "MyDB" or uploaded/generated files in a VOSpace) that would be moved around, not the petabytes in the main data center. So, aside from the custom interfaces peculiar to our SP, there are entry points for pure VO services like TAP and VOSpace to provide access points for other platforms/apps to pull data out of our SP into another in a standard way. Code is much harder (at this point) but can likewise mix local and remote data through VO service calls. Mixed auth systems are a bigger issue (for me) at this point, and CDP is not universally supported yet.
<div id="magicdomid2147"></div>
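As a rough illustration of the "pure VO entry point" idea above, the sketch below builds the URL of a synchronous TAP query that another platform could issue to pull a user's sub-selected table. The endpoint `https://sp.example.org/tap` and the table `mydb.my_selection` are hypothetical, and the parameter set (REQUEST, LANG, QUERY, MAXREC) follows the common TAP synchronous form; a real client would more likely use a library such as pyVO and would also have to deal with authentication.

```python
from typing import Optional
from urllib.parse import urlencode


def tap_sync_url(base_url: str, adql: str, maxrec: Optional[int] = None) -> str:
    """Build the GET URL for a synchronous TAP query against base_url."""
    params = {
        "REQUEST": "doQuery",  # required by TAP 1.0; assumed still accepted by newer services
        "LANG": "ADQL",
        "QUERY": adql,
    }
    if maxrec is not None:
        # Ask the service to cap the number of returned rows.
        params["MAXREC"] = str(maxrec)
    return base_url.rstrip("/") + "/sync?" + urlencode(params)


# Hypothetical endpoint and table name, for illustration only.
url = tap_sync_url(
    "https://sp.example.org/tap",
    "SELECT TOP 10 ra, dec FROM mydb.my_selection",
    maxrec=10,
)
print(url)
```

The same function works unchanged against any platform that exposes a TAP endpoint, which is the point of keeping the exchange on standard services rather than on platform-specific interfaces.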

Groom: Users would like to interact more closely with the data, so they want to run "close to the data". What does that mean from the perspective of the data service? What do those applications need that they can't get remotely, aside from not traversing the network? Is it just about eliminating the network, or is it also looking for a richer API to the data objects held at the server?
<div id="magicdomid1875">(thanks :-)</div> <div id="magicdomid2001"></div>

Kai: We forget that we are tech guys but astronomers are not. How can we package everything so that it works out of the box, so that users can focus on using the software and doing science?
<div id="magicdomid2416"></div>

Gregory: Existing VO standards tend to handle single-item requests, with limited support for bulk data.
<div id="magicdomid3273">Kai just adding: -&gt; Massive data is mandatory for a lot of deep-learning applications and this limit is currently often a show stopper.</div> <div id="magicdomid2187">The history behind different institutes means platforms will be different.</div> <div id="magicdomid2209">What is possible is optimised data access.</div> <div id="magicdomid2258">Write code locally on a laptop, then transfer the code to a science platform to get better/faster access.</div> <div id="magicdomid2298">So the code API is similar/same, but the graphical user interface may be different.</div> <div id="magicdomid3529">Would massive-data applications like ML be satisfied using only VO APIs to access the data, if those methods were very fast because they were "local"?</div> <div id="magicdomid2547">User environments are complex; we need a level of standardisation higher than we have now, but not as complex as a full user environment.</div> <div id="magicdomid2799">Standardise the libraries and use them at the platform level (e.g. astropy).</div> <div id="magicdomid2800">+1</div> <div id="magicdomid3045">Allowing users to share their workspace with others may let them share different competences: an environment to enhance collaboration.</div> <div id="magicdomid3103">Providing a platform to work together as a larger team of experts.</div> <div id="magicdomid3849">+1 SPs should be collaborative research platforms where users can share data and code (Christophe)</div> <div id="magicdomid6883"></div>
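The point above, that the code API can stay the same while each platform optimises data access underneath, could look roughly like the following sketch. All class and function names here are hypothetical, as are the dataset identifiers and paths; this is one possible shape for such an interface, not an existing IVOA or platform API.

```python
from abc import ABC, abstractmethod
from pathlib import PurePosixPath
from typing import Dict


class DataAccess(ABC):
    """Hypothetical interface that user code is written against."""

    @abstractmethod
    def locate(self, dataset_id: str) -> str:
        """Return a location (path or URL) from which the dataset can be read."""


class RemoteAccess(DataAccess):
    """Generic implementation: always go over the network."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")

    def locate(self, dataset_id: str) -> str:
        return f"{self.base_url}/{dataset_id}"


class LocalFirstAccess(DataAccess):
    """Site-optimised implementation: use a local mount when the site has one."""

    def __init__(self, local_index: Dict[str, PurePosixPath], fallback: DataAccess) -> None:
        self.local_index = local_index
        self.fallback = fallback

    def locate(self, dataset_id: str) -> str:
        path = self.local_index.get(dataset_id)
        if path is not None:
            return str(path)
        # Not held locally: defer to the generic remote implementation.
        return self.fallback.locate(dataset_id)


def analysis(access: DataAccess, dataset_id: str) -> str:
    """User code: identical on the laptop and on every platform."""
    return f"reading {access.locate(dataset_id)}"


# On a laptop: everything is remote.
remote = RemoteAccess("https://archive.example.org/data/")
# On a platform that holds survey/dr1 locally (hypothetical mount point).
site = LocalFirstAccess({"survey/dr1": PurePosixPath("/mnt/survey/dr1")}, remote)

print(analysis(remote, "survey/dr1"))
print(analysis(site, "survey/dr1"))
print(analysis(site, "survey/dr2"))
```

The user-facing `analysis` function never changes; only the object handed to it differs between the laptop and the platform.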

Intra-platform: inside platforms (how to access data/services consistently from platform to platform, but only using one platform at a time)
<div id="magicdomid5280"></div>

Extra-platform: protocols to use a platform (e.g. SciServer) from outside (how to access platform services programmatically from outside the platform)
<div id="magicdomid6337"></div>

Inter-platform: where we want to use resources of multiple platforms within a single analysis thread
<div id="magicdomid6053"></div>

Gregory: We develop with the intra- and inter-platform capabilities in mind. We can learn from the HP community in terms of Grid. The strength of containers is that I can put one anywhere and run it. But accessing data is a problem: a POSIX filesystem must be visible in the container.
<div id="magicdomid6350"></div>

Gregory: Levels of standardisation:
<div id="magicdomid6450">Containers - user defines the whole stack - sites can't modify what is inside the container, the container is a 'black box'.</div> <div id="magicdomid6249">API access - sites can swap the low-level libraries that implement the API to give similar behaviour, optimised for the site</div> <div id="magicdomid6911"></div> <div id="magicdomid3410">(Question: aren't the libraries and APIs we use to build the SP a sort of extension of the standard?)</div> <div id="magicdomid3209">Inter-platform capabilities</div>

Petr: Move code between platforms. But the original idea of the VO is that we have an agent moving from one site to another, working on different data. Move data instead of SW. Not running the same processing on different data, but sending data from place to place according to the applications. SPs are tuned to specific algorithms, and data is moved to the SP because of this algorithm.

<div id="magicdomid3566">Missing interfaces and things to improve.</div> <div id="magicdomid3760">JJ - Use containers to move SW into a SP. Build the container at home and move into the SP to process.</div>

Gerard: +1. Maybe a container can be built interactively on one SP, then exported to run on another. For notebooks on their own it is hard to provide an environment that can run them without containerizing them in some way.
<div id="magicdomid7075">The main issue, I think, is that I will likely run on "your" SP because you have interesting data. How can I write my analysis against this data if I cannot try it until it is running on your SP? Honestly I don't think I would want to write IVOA protocols to access data that is basically local. But maybe we can standardize on something like (I guess) LSST's Butler, and have different SPs implement it as efficiently as possible?</div> <div id="magicdomid6832">Giuliano +1 I agree (this was +1 vs JJ's comment not mine necessarily)</div> <div id="magicdomid4002">JJ - experience is that virtual machines are too heavyweight for users to transfer from laptop to platform. Containers may help to make this easier.</div> <div id="magicdomid4413"></div>

Dave Morris - when we were looking at access levels, one of the things we were thinking was that an external TAP service would have row limits, while an internal TAP service would have the same data, but faster access and row limits
<div id="magicdomid5085"></div>

Antonio: Balance between usability and trustworthiness in terms of users. Trust in the outcome of the platform (the SW).
<div id="magicdomid5412"></div>

Severin: Inter-platform interoperability. The SKA use case is to make data centers interoperable. We have use cases coming out of SKA that we can use.
<div id="magicdomid5484">SPs are stand-alone right now, not designed for interoperability.</div> <div id="magicdomid1659"></div>

Marcos Lopez-Caniego: It makes a lot of sense to connect science platforms, for example to analyze Euclid and LSST data by connecting ESA Datalabs and the LSST science platform, or SKA data that will be split across different data centers.
<div id="magicdomid7232">Yihan</div> <div id="magicdomid2056"></div>

Brian Major: Standardizing the software delivery is important so the same software can be sent to different platforms that have different data offerings.
<div id="magicdomid6715"></div>

Simon O'Toole +1 Agreed. This is where containers are useful. Users want prebuilt software that they can simply run on their data, wherever that data is. - but I don't mind typing "sudo brew install cfitsio"
<div id="magicdomid7055">- sure, but not all users are comfortable with this. Also, users often break their system by mixing brew, conda and other package managers.</div> <div id="magicdomid6065">Giuliano: containers can be an approach if we find a way to also annotate containers.</div> <div id="magicdomid6542">Simon: a thin layer that is container platform agnostic? Something that describes the container metadata: what it is and what it does, plus provenance information.</div> <div id="magicdomid5115"></div>
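One way to picture the "thin layer" of container metadata suggested above is a small descriptor that is independent of the container runtime and can be exchanged as JSON. The field names, the example image references and the repository URL below are all hypothetical; no such IVOA schema exists in these notes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ContainerDescriptor:
    """Hypothetical, runtime-agnostic description of a software container."""

    name: str          # what it is
    version: str
    description: str   # what it does
    entrypoint: str    # command a platform should run
    # One reference per container technology, so the record is not tied to one runtime.
    image_refs: List[str] = field(default_factory=list)
    # Provenance information: where the image was built from, and by whom.
    built_from: str = ""
    built_by: str = ""


def to_json(descriptor: ContainerDescriptor) -> str:
    """Serialise the descriptor so that it can be published or registered."""
    return json.dumps(asdict(descriptor), indent=2, sort_keys=True)


def from_json(text: str) -> ContainerDescriptor:
    """Rebuild a descriptor from its JSON form."""
    return ContainerDescriptor(**json.loads(text))


example = ContainerDescriptor(
    name="photometry-pipeline",
    version="0.1.0",
    description="Aperture photometry on FITS images",
    entrypoint="run-photometry",
    image_refs=["docker://example.org/photometry:0.1.0", "oci-archive:photometry-0.1.0.tar"],
    built_from="https://git.example.org/photometry-pipeline",
    built_by="A. Astronomer",
)
print(to_json(example))
```

A registry of such records would let a platform decide whether it can run an image, and let a user see what a container does, without pulling the image first.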

Dave Morris : In response to Steve - what if we added 'POSIX access' or 'filesystem access' as protocols that a service could advertise ?
<div id="magicdomid1859">+1</div> <div id="magicdomid6025"></div>
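If services advertised 'POSIX access' or 'filesystem access' alongside network protocols, as suggested above, client code could pick the best available option at run time. The sketch below assumes a service record is a simple mapping from protocol label to location; the labels "posix", "tap" and "http" are illustrative only and are not registered IVOA identifiers.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical protocol labels, listed from most to least preferred.
PREFERENCE: Tuple[str, ...] = ("posix", "tap", "http")


def choose_access(
    advertised: Dict[str, str],
    on_site: bool,
    preference: Sequence[str] = PREFERENCE,
) -> Optional[Tuple[str, str]]:
    """Return (protocol, location) for the best protocol the service advertises.

    Filesystem access is only usable when the code runs on the platform itself.
    """
    for protocol in preference:
        if protocol == "posix" and not on_site:
            continue
        if protocol in advertised:
            return protocol, advertised[protocol]
    return None


# Hypothetical service record advertising two access protocols.
service = {
    "tap": "https://sp.example.org/tap",
    "posix": "/data/catalogue",
}

print(choose_access(service, on_site=True))
print(choose_access(service, on_site=False))
```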

Ani Thakar: Should the IVOA define a set of core science capabilities (use cases) that science platforms should support? This should also boil down to a core set of libraries that each SP should support. IVOA also needs to define exactly what interoperability means in the SP context. What should users or agents be able to do seamlessly between SPs? Should Docker images be able to run on any SP?
<div id="magicdomid4974"></div>

Jesús: Should we have a set of IVOA docker containers that all the science platforms could offer? (including e.g. pyVO)+1
<div id="magicdomid5510"></div>

Stelios Voutsinas: I think that would be a good idea Jesus, perhaps we could use https://hub.docker.com/u/ivoa to provide an IVOA repo with a set of VO-related Docker images that we would maintain. (Although there is the danger that we get tied to one container technology) (true)+1
<div id="magicdomid4910"></div>

PeterT: Didn't we used to have this? There used to be a "Linux for astronomy" CD that had all the stuff you could dream about. Then ESO had a distro that contained a lot of tools. Now I see official Linux distros carry some tools (e.g. ds9, pgplot, cfitsio, libwcs). Maybe we as a community need to put more effort into making that easy. That would make building containers easier too (and help those working on their own laptop).
<div id="magicdomid5071">I think this would be to resurrect this idea but using containers. If all the science platforms have this set of containers, you can run your code in different platforms blindly (e.g. selecting the one that has the data closer)</div> <div id="magicdomid7015"></div>

Mathieu: another aspect would be to re-run an analysis executed on a science platform in a standard way (send a command with a given workflow+configuration that would re-execute a sequence). Alternatively, get in a standard way the provenance graph of what was done to get a result.
<div id="magicdomid6644"></div>
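To make the re-execution idea above concrete: if a platform returned the provenance graph of a result, a client could derive the order in which the steps must be re-run, or list everything a given step depends on. The step names and the dictionary layout below are hypothetical; the ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical provenance record: each step lists the steps it was derived from.
provenance: Dict[str, List[str]] = {
    "query_catalogue": [],
    "crossmatch": ["query_catalogue"],
    "fit_model": ["crossmatch"],
    "make_plot": ["crossmatch", "fit_model"],
}


def replay_order(graph: Dict[str, List[str]]) -> List[str]:
    """Order the steps so that every step comes after the steps it depends on."""
    return list(TopologicalSorter(graph).static_order())


def ancestors(graph: Dict[str, List[str]], step: str) -> Set[str]:
    """Collect every step that contributed, directly or indirectly, to `step`."""
    seen: Set[str] = set()
    stack = list(graph[step])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph[parent])
    return seen


print(replay_order(provenance))
print(sorted(ancestors(provenance, "fit_model")))
```

A standard way to retrieve such a graph, plus the configuration of each step, would cover both options mentioned above: re-running the sequence and inspecting how a result was obtained.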

Stelios Voutsinas: I think another (different) discussion, is in terms of reproducibility of the Science Platforms themselves. Meaning how does a (power) user take a Science platform (as a set of services) and recreate the environments on a generic cloud. (Something LSST and others are doing very well with Helm Charts / Containers).
<div id="magicdomid7076">This would potentially allow users to take such an environment and scale it as much as their funding allows (i.e. paying for resources in a burst, short-term model), in the case where their allocation on the original Science Platform is too limited for their requirements.</div> <div id="magicdomid7228"></div>

Kenny Lo: Another aspect to consider is to increase VISIBILITY of what's generally available in the science platforms. That'll go beyond the VO protocols, into things like file systems, datasets, system capacity, etc, etc.
Topic revision: r1 - 2020-05-05 - GiulianoTaffoni