Geoinformatics: Data to Knowledge
Managing scientific data: From data integration to scientific workflows*
Published: January 01, 2006
Scientists are confronted with significant data management problems due to the large volume and high complexity of scientific data. In particular, the latter makes data integration a difficult technical challenge. In this paper, we describe our work on semantic mediation and scientific workflows and discuss how these technologies address integration challenges in scientific data management. We first give an overview of the main data integration problems that arise from heterogeneity in the syntax, structure, and semantics of data. Starting from a traditional mediator approach, we show how semantic extensions can facilitate data integration in complex, multiple-world scenarios, where data sources cover different but related scientific domains. Such scenarios are not amenable to conventional schema integration approaches. The core idea of semantic mediation is to augment database mediators and query evaluation algorithms with appropriate knowledge representation techniques to exploit information from shared ontologies. Semantic mediation relies on semantic data registration, which associates existing data with semantic information from an ontology.
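As a rough illustration of semantic data registration, the following minimal sketch associates attributes of several data sources with concepts from a shared ontology and then answers a concept-level query by subsumption. The ontology, source names, attribute names, and the functions `is_subsumed_by` and `find_sources` are all hypothetical and chosen only for this example; they are not taken from the system described here, which relies on richer knowledge representation than a simple is-a hierarchy.

```python
# Toy shared ontology as an is-a hierarchy: child concept -> parent concept.
# All concept names are hypothetical.
ONTOLOGY = {
    "Basalt": "IgneousRock",
    "Granite": "IgneousRock",
    "IgneousRock": "Rock",
    "Sandstone": "SedimentaryRock",
    "SedimentaryRock": "Rock",
}


def is_subsumed_by(concept, ancestor):
    """Return True if `concept` equals `ancestor` or is a descendant of it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)  # None once the root is passed
    return False


# Semantic registration: (source, attribute) -> ontology concept.
# Source and attribute names are assumptions for illustration.
REGISTRATIONS = {
    ("geochem_db", "basalt_samples"): "Basalt",
    ("survey_db", "granite_outcrops"): "Granite",
    ("strat_db", "sandstone_units"): "Sandstone",
}


def find_sources(query_concept):
    """List registered (source, attribute) pairs relevant to a concept."""
    return sorted(
        key
        for key, concept in REGISTRATIONS.items()
        if is_subsumed_by(concept, query_concept)
    )
```

A query for `"IgneousRock"` selects the basalt and granite attributes even though neither source uses that term, which is the kind of cross-source link that plain schema matching would miss.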
The Kepler scientific workflow system addresses the problem of synthesizing, from existing tools and applications, reusable workflow components and analytical pipelines to automate scientific analyses. After presenting core features and example workflows in Kepler, we present a framework for adding semantic information to scientific workflows. The resulting system is aware of semantically plausible connections between workflow components as well as between data sources and workflow components. The scientist can use this information during workflow design, and the workflow engineer can use it to create data transformation steps between analytical steps that are semantically compatible but structurally incompatible.
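The distinction between semantic and structural compatibility can be sketched as follows. This is a minimal sketch under stated assumptions, not Kepler's API: the `Port` class, the `check_connection` function, the concept names, and the structural type labels are all hypothetical. A connection is treated as semantically plausible when the concept of the output port is subsumed by the concept of the input port, and a transformation step is flagged when the structural types differ.

```python
from dataclasses import dataclass

# Toy is-a hierarchy over semantic types (hypothetical concept names).
IS_A = {
    "SpeciesAbundance": "Measurement",
    "Temperature": "Measurement",
}


def subsumed(concept, ancestor):
    """Return True if `concept` equals `ancestor` or descends from it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


@dataclass(frozen=True)
class Port:
    structure: str  # structural type label, e.g. "csv_table" (assumed)
    concept: str    # semantic type drawn from the ontology


def check_connection(out_port, in_port):
    """Classify a proposed link from an output port to an input port."""
    if not subsumed(out_port.concept, in_port.concept):
        return "reject"  # not semantically plausible
    if out_port.structure == in_port.structure:
        return "connect"  # compatible on both levels
    return "needs_transformation"  # semantic match, structural mismatch
```

For example, linking `Port("csv_table", "SpeciesAbundance")` to `Port("matrix", "Measurement")` yields `"needs_transformation"`, which marks the place where a workflow engineer would insert a data transformation step.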