The Mythical Data Lake: FAIR-EASE Post-webinar Report

The FAIR-EASE project is an EU-funded project that sails under the European Open Science Cloud (EOSC)-Association flag.

It brings together 28 partners from 7 countries and will run until August 2025.

The overall objective of FAIR-EASE is to customise and operate distributed and integrated services for the observation and modelling of the Earth system, the environment, and biodiversity.

Specifically, the project is set up to balance two major ambitions:

  1. Serve and advance five concrete scientific pilots (e.g. the Volcano Space Observatory Pilot), grouped into three designated “Use Cases”, namely: i) Earth and Environmental dynamics; ii) Environmental Biogeochemical Assets; iii) Biodiversity Observation.
  2. Develop an innovative, common, and interoperable technical architecture, designed through collaboration between the technical work packages of the project, but also in collaboration with the user communities and EOSC. 

The project team creates opportunities to gather feedback from the user communities: FAIR-EASE is represented at conferences (e.g. EGU and the EOSC Symposium 2023 in Madrid), and regular webinars are being organised.

The first webinar was held in April 2023 and introduced FAIR-EASE. The second webinar, which is the focus of this article, was held in July and aimed to present the essential elements of the FAIR-EASE "Data-Lake" architecture, as set out in Portier et al. (2023). The title of the webinar, "The Mythical Data Lake", makes a tongue-in-cheek reference to an observation made during FAIR-EASE's first year of operation: the realisation that the proclaimed "data lake" solution, as initially worded in the FAIR-EASE proposal, required a more solid definition. This definition had to be strong and clear enough to face the many and diverse challenges expressed in the meetings between the FAIR-EASE technical board and the leaders of the Use Cases.

The adjective “mythical” fits (1) the mismatch that was discovered between these parties' expectations of what this solution should and could do inside FAIR-EASE, which in turn reflected (2) the many possible definitions of “Data Lake” across the multidisciplinary and plural communities gathered within FAIR-EASE. More importantly, “mythical” also reflects the lake's rather ephemeral nature in the resulting FAIR-EASE architecture: there the "lake" is depicted as an overall functional solution, a property emerging from the interplay of many defined components, rather than a single physical entity (a single piece of software on some allocated hardware) that could be pointed at.

The webinar consisted of three 15-minute presentations. The first, “From Data Lake to Data Space” by Marc Portier (VLIZ), invited the audience to leave behind any preconceptions about existing data solutions (be they named lake or otherwise). The presentation also explored the boundaries of the FAIR-EASE architecture as a large black-box solution sitting between its two gatekeeping elements, the "Data Provider" and "Data Access" blocks, which control how data gets «in to» and «out from» it respectively. The design of and approach to these blocks were motivated by the core guiding principle of avoiding the needless copying or moving of data fragments (often referred to as subsets), regardless of how diverse or cross-domain the questions to be addressed may be.

The second presentation, “Novel approaches for decentralised knowledge graph querying in Data Space contexts” by Julián Rojas (UGent-imec), provided an evaluation of the FAIR-EASE design from an external viewpoint. It highlighted the similarities between the FAIR-EASE architecture and the contemporary "Data Space" approach, both in the formulated solution and in the challenges that they claim to address. Furthermore, it showed how the dream of "finding data fragments close to the source" could be fulfilled, by highlighting recent research on the continued standardisation and practical use of Linked Data Fragments and their application in the TREE specification, which enables data to be accessed in a decentralised way by, for example, the federated Web query engine Comunica (comunica.dev).
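
The idea of "finding data fragments close to the source" can be illustrated with a toy sketch. It is not taken from the webinar and uses neither the TREE vocabulary nor the Comunica API: the in-memory fragments, the example.org URLs, and the names `Relation`, `Fragment` and `query_at_most` are all assumptions made purely for illustration. Each fragment carries a few members plus links that describe what lies behind them, so that a client can decide which links are worth following and fetch only the fragments relevant to its question.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Relation:
    """Hypothetical link: every member reachable via `target` is >= `min_value`."""
    min_value: int
    target: str


@dataclass
class Fragment:
    """Hypothetical fragment: some member values plus links to other fragments."""
    members: List[int]
    relations: List[Relation] = field(default_factory=list)


# Toy "web" of fragments keyed by made-up URLs (stands in for HTTP requests).
FRAGMENTS: Dict[str, Fragment] = {
    "https://example.org/root": Fragment(
        [1, 4],
        [Relation(10, "https://example.org/a"), Relation(100, "https://example.org/b")],
    ),
    "https://example.org/a": Fragment([12, 57]),
    "https://example.org/b": Fragment([140, 220], [Relation(500, "https://example.org/c")]),
    "https://example.org/c": Fragment([510, 999]),
}


def query_at_most(
    root_url: str, upper: int, fetch: Callable[[str], Fragment]
) -> Tuple[List[int], List[str]]:
    """Return (members <= upper, URLs fetched), following only useful links."""
    matches: List[int] = []
    fetched: List[str] = []
    queue = deque([root_url])
    seen = {root_url}
    while queue:
        url = queue.popleft()
        fragment = fetch(url)
        fetched.append(url)
        matches.extend(m for m in fragment.members if m <= upper)
        for rel in fragment.relations:
            # Skip links whose members are all known to exceed the bound.
            if rel.min_value <= upper and rel.target not in seen:
                seen.add(rel.target)
                queue.append(rel.target)
    return sorted(matches), fetched


matches, fetched = query_at_most("https://example.org/root", 60, FRAGMENTS.__getitem__)
print(matches)  # values found
print(fetched)  # only the fragments that had to be retrieved
```

In this toy setup, a query for values up to 60 never retrieves fragments `b` and `c`, because the link descriptions already rule them out. Nothing is copied into a central store beforehand; the subset is assembled at query time from the sources themselves.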

The third presentation, “Examind and Geospatial Interoperability” by Dorian Ginane (Geomatys), brought the topic back to FAIR-EASE, while narrowing the field of subsetting down to the geospatial aspects of the datasets. The presented tool and strategy make it possible to prepare, share and view distributed geographic datasets in an aggregated way, as envisaged in the FAIR-EASE design. While showing the conceptual fit, and even that the goal is attainable in principle, the presentation also revealed the challenges FAIR-EASE faces in extending this approach to aspects other than geolocation.

The webinar gave the audience the chance to interact with the speakers and provide feedback. Participants learnt about specific elements of the architectural design of the FAIR-EASE project, and how these fit in with contemporary software research and innovation, both at the level of reaching cross-domain interoperability through managed data spaces and at the level of detailed practical solutions that already cover niche aspects.

The topics covered in the webinar brought together the approaches of the data-driven environmental research community and those of the data engineering community, in a new kind of partnership. Maintaining a fruitful dialogue between these communities means bridging the gap between them: avoiding confusion over terms that are perceived differently, avoiding detail when introducing an overview, and agreeing on both sides to use common vocabularies and definitions. Misunderstandings can arise from jargon or assumptions that unintentionally block the conversation. When that happens, the key is communication: asking for clarification of the piece of jargon that slipped through. As organisers, the lesson we learned is that our webinars should always create a space where open and natural interactions can take place.

A recording of this webinar is available on the FAIR-EASE YouTube channel. The slides are being made available through the FAIR-EASE community section on Zenodo. Further events and webinars will be organised by FAIR-EASE in the coming months, and will be advertised on the project’s website (https://fairease.eu/events) and on FAIR-EASE’s social media (Twitter and LinkedIn).

 
