A new international initiative for facilitating data-driven Earth science transformation
Published: November 10, 2020
Qiuming Cheng, Roland Oberhänsli, Molei Zhao, 2020. "A new international initiative for facilitating data-driven Earth science transformation", The Changing Role of Geological Surveys, P. R. Hill, D. Lebel, M. Hitzman, M. Smelror, H. Thorleifson
Data-driven techniques including machine-learning (ML) algorithms with big data are re-activating and re-empowering research in traditional disciplines for solving new problems. For geoscientists, however, what matters is what we do with the data rather than the amount of it. While recent monitoring data will help risk and resource assessment, the long Earth record is fundamental for understanding processes. Thus, how big data technologies can facilitate geoscience research is a fundamental question for most organizations and geoscientists. A quick answer is that big data technology may fundamentally change the direction of geoscience research. In view of the challenges faced by governments and professional organizations in contributing to the transformation of Earth science in the big data era, the International Union of Geological Sciences has established a new initiative: the IUGS-recognized Big Science Program. This paper elaborates on the main opportunities and benefits of utilizing data-driven approaches in geosciences and the challenges in facilitating data-driven Earth science transformation. The main benefits may include transformation from human learning alone to integration of human learning and AI, including ML, as well as from known questions seeking answers to formulating as-yet unknown questions with unknown answers. The key challenges may be associated with intelligent acquisition of massive, heterogeneous data and automated comprehensive data discovery for complex Earth problem solving.
Numerous scientific advances indicate that geoscience is entering a new era of real-time Earth system innovation as a result of modern space observation technology, analytical and experimental facilities, outdoor experimental infrastructure, computer simulation and data discovery technology. Geoscientists have been continually making gratifying progress in many fields of Earth sciences including deep space, deep Earth, deep ocean and deep time. For example, researchers in South Africa reported that a 12 000 kg dinosaur named Ledumahadi mafube was one of the largest animals on Earth during the Mesozoic era of the planet's existence (McPhee et al. 2018). Australian scientists have examined fat molecules extracted from the fossil of a mysterious creature called Dickinsonia and confirmed that it lived 558 million years ago, making it the earliest known member of the animal kingdom (Bobrovskiy et al. 2018). British scientists have reported that life, based on genetic and fossil evidence, may have begun on Earth nearly 4.5 billion years ago, much earlier than previously thought (Betts et al. 2018).
In space and planetary science, Italian scientists have used radar measurements to detect a 20 km wide lake of liquid water underneath solid ice in the Planum Australe region on Mars (Orosei et al. 2018). The presence of liquid water on Mars has long been debated and has implications for astrobiology and future human exploration. NASA has launched a new mission, InSight, which has landed on Mars to pursue three main goals: taking the planet's temperature, measuring its size, and monitoring quakes (Voosen 2018). The new data about the thickness, size, and composition of Mars will help scientists better understand why Mars is so different from Earth. Plate tectonics has been found on Earth but not on any other planets.
In deep Earth, the 10-year Deep Carbon Observatory project announced the discovery of a significant variety of lifeforms living up to at least 4.8 km deep underground, including 2.5 km below the seabed (Schiffries et al. 2018). The high-pressure form of crystalline ice, Ice VII, was identified among inclusions found in natural diamonds (Tschauner et al. 2018), presumably formed when water trapped inside the diamonds retained the high pressure of the deep mantle. This discovery might provide new insight into the deep Earth water cycle, which is a key question of Earth dynamics and plate tectonics.
In the field of critical minerals and resources, researchers in Japan reported finding centuries’ worth of rare-earth metals in the deep ocean mud in the northwest Pacific (Yasukawa et al. 2016). One of the fastest-growing research fields in Earth system science is the automatic extraction and integration of information from big data in the cloud environment (Bush et al. 2017; Khozin et al. 2017; Ma et al. 2017; Gewin 2018; Savage 2018; Scheffer & Van Nes 2018; Knüsel et al. 2019; Reichstein et al. 2019).
Among the above fundamental geoscience research advances, some findings are related to questions that fill gaps in our knowledge about the Earth and its operation and others are directly associated with the sustainable development of our society. Some are known questions and knowledge gaps for which we seek answers in a process that can be referred to as ‘we know what we don't know’. In other cases, neither the questions nor their answers are known, i.e. ‘we don't know what we don't know’ or ‘unknown unknowns’, as famously quoted by former U.S. Secretary of Defense Donald Rumsfeld in the 2000s.
Traditional research approaches are mostly related to the search for unknown answers and solutions to known questions. This type of research can be considered as problem-driven or question-driven. With the development and application of revolutionary technology, especially digital technology, Earth science has just entered a new era of fast-growing innovation. The tools and technologies of the digital revolution combined synergistically with transdisciplinary and integrative efforts are boosting innovation and discovery in fundamental and applied Earth science research. The digital revolution is transforming scientific research from a problem-driven paradigm into a data-driven one. In the new paradigm, through a data discovery process, scientists might be interrogated by a machine about questions of which they might not yet be aware. From this perspective, big data discovery may boost innovation not only by finding answers to existing questions but also by revealing and defining as-yet unknown questions.
Therefore, the advantages of using big data are not only to access data volume, variety and velocity but also to implement innovative technology for enabling knowledge discovery, which may fundamentally change the methods of conducting Earth science research. For example, some types of data are captured and utilized in one field but are rarely known and available for use in other fields. Big data technology enables these types of data to become accessible and interoperable to all who are interested (Halpin et al. 2010). Integrating these ‘new’ data into the discovery process can increase the dimensions and variety of the data, affording a better description of the complexity of the phenomena being studied, and lead to the discovery of new patterns or previously unknown questions.
Using machine learning (ML) with big data is also an indispensable tool for reading and collecting data to support science, which is another notable advantage. For example, Macrostrat is a platform that links the GeoDeepDive digital library with a machine-reading system for rapidly aggregating geological data relevant to the spatial and temporal distribution of rocks as well as the data extracted from them. The efficiency of machine reading and extracting information from publications clearly convinces researchers that the cutting-edge technology of these approaches can not only significantly increase the big ‘V's of data but also, notably, release them from spending time and effort on data handling (Peters et al. 2018). In the remaining sections of this paper, we will elaborate on the opportunities and challenges facilitating data-driven Earth science transformation. First, we will show the advantages of big data and data-driven discovery and then provide suggestions concerning the challenges and solutions. Finally, we will introduce a new international initiative facilitating data-driven Earth science transformation, the International Union of Geological Sciences (IUGS)-led Deep-time Digital Earth (DDE).
Data-Driven Earth science paradigm transformation
From question-driven to data-driven research
From the point of view of Earth system science and a habitable planet Earth, the major questions include fundamental issues such as the interactions and co-evolution of multiple spheres of Earth systems (Fig. 1), as well as applied questions related to the occurrence and spatial–temporal distributions of natural resources (e.g. minerals, energy and water) and geohazards (e.g. earthquakes, volcanos, landslides, floods, tsunamis) (Fig. 2). Many global databases with either archived data accumulated in the past or monitoring data acquired from observatory networks are available for linking lithosphere, biosphere, hydrosphere and atmosphere (Fig. 1). With adequate deep time scale control, these data can be georeferenced to demonstrate the variability of each relevant Earth sphere through time as well as to analyze the associations and co-evolution of multiple spheres. As an example, Figure 1 shows several global datasets, from which time-series can be derived, to describe secular changes of Earth biodiversity, chemical indexes of atmosphere, chemical change in seawater, and a number of mineral phases. In addition to the general secular trends of the time series, more importantly, the data show the major extreme terrestrial events that have occurred during the evolution of the Earth. These types of events include the formation of supercontinents (e.g. labelled in Fig. 1c as P, Pangea; G, Gondwana; R, Rodinia; C, Columbia; K, Kenorland), the mass extinction of life, iron formation, big oxidation events, glaciations and clustering occurrence of mineral deposits etc. The results plotted in Figure 1 illustrate not only the variability of the data but also complex associations of these major events.
While some studies have focused on the intersections of these events, including timing and controlling factors, new work could focus on cross-correlation and long-term interdependency (symmetrical or asymmetrical) as well as the cascading and escalating effects on each other. When time-series data reach proper temporal resolution, they can be analyzed by spectral methods to reveal periodicity and internal associations. For example, five global databases of detrital zircon age data were analyzed by a new time-series filtering method, local singularity analysis (LSA; Cheng 2007), which revealed periodicity in the secular trends of age distributions of certain geological events (Cheng 2018). Surprisingly, the results of the LSA-filtered time series not only show the periodicities of the age peaks but also reveal systematic and simultaneous reductions of wavelength and intensity of the periodicity of peaks on the filtered time series (Fig. 1c). This gave rise to new scientific questions about why the peaks show a linear decrease and whether this can be extrapolated to predict the future of the Earth. After integrating other secular changes of Earth attributes (mantle temperature, crustal reworking rate, the ratio of collisional to collisional + accretionary orogens with time and the volume of continents; Rino et al. 2008; Herzberg et al. 2010; Spencer et al. 2014; Condie et al. 2015), two models indicate that the time required for the intensity of major magmatic activity to vanish and for the temperature of the mantle to reduce to below the mantle solidus would be 1.45 Gyr into the future (Cheng 2018).
This was the first attempt to predict the future dynamics of plate tectonics from the internal properties of the Earth, based on the observed geological data. The primary questions of when plate-tectonics-related events such as subduction, volcanism, earthquakes and rifting occur remain to be fully elucidated (Andrews 2018). This study is an excellent demonstration of an ‘unknown unknowns’ scenario.
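The remark that time series of sufficient temporal resolution can be analyzed by spectral methods can be illustrated with a minimal sketch. It is not the LSA filtering of Cheng (2007, 2018); it is a plain discrete Fourier periodogram, and the 200-sample synthetic series with a 25-sample period is an assumption made purely for illustration.

```python
import cmath
import math


def periodogram(series):
    """Power at each non-zero frequency index k = 1..n//2 of a demeaned series."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    power = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        power[k] = abs(coeff) ** 2 / n
    return power


def dominant_period(series, step=1.0):
    """Period (in units of `step`) carrying the most spectral power."""
    power = periodogram(series)
    k_best = max(power, key=power.get)
    return len(series) * step / k_best


# Synthetic series (hypothetical values): a strong 25-sample cycle
# plus a weaker 10-sample cycle.
n_samples = 200
synthetic = [
    math.sin(2 * math.pi * t / 25.0) + 0.3 * math.cos(2 * math.pi * t / 10.0)
    for t in range(n_samples)
]
estimated = dominant_period(synthetic)
```

With real age-frequency data the `step` argument would carry the bin width (for example in Myr), and trends would normally be removed before the spectrum is read.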
The second example introduced here is related to critical Earth science issues threatening human life and societal development. These problems include the utilization of natural resources (e.g. minerals, energy and water) and the reduction and prevention of risks caused by geohazards (e.g. earthquakes, volcanos, landslides, floods, tsunamis). Understanding the behaviors of these types of extreme events is essential for developing theory, methods and models to predict and assess their distributions and impact. Figure 2 shows the two major tectonic converging zones and global orogenic belts: the Alpine-Himalayan Orogenic Belt (Tethys Belt) and Circum-Pacific Orogenic Belt (Pacific Belt). These two orogenic belts control most known porphyry copper mineral deposits, earthquakes, and volcanic activity of the planet. They also host large cities and almost half the world's population. In addition, much synthetic infrastructure, including big dams and nuclear power plants, has been built along the coastal zones of these regions. As a consequence, human vulnerability is increasingly affected by earthquakes, volcanoes, landslides, floods, hurricanes and tsunamis. While research has focused on studying each type of event and their direct associations, few studies have emphasized the interconnection or teleconnections of these types of events, especially where the events are located far apart and their apparent interdependencies are weak. For example, do earthquakes occurring in the two different tectonic plate systems of the Tethys Belt and the Circum-Pacific Belt have any teleconnections? Are there indirect and weak connections that exist between these systems that could be revealed by big data mining? These are typical ‘unknown’ questions that could be revealed by a machine-based, data-driven research approach. Figure 2b illustrates the frequency distributions of U–Pb ages of igneous zircons from intrusions in the Tethys and Circum-Pacific belts.
The results show that the two patterns of time series depict some degree of similarity between the peaks of magmatic activity (<100 Ma). A large number of porphyry copper mineral deposits (PCDs) have been found in these two global scale orogenic belts. Two fundamental questions are whether there are differences and similarities between the PCDs in these orogenic belts, and what might drive their characteristics (Yang & Hou 2009; Richards 2013). A profound property of these PCDs is that they are clustered both spatially and temporally, suggesting a link between deep processes of plate subduction and mineralization occurring in the crust (Cheng 2019). This was demonstrated through the integration of various geodatabases with Earth time and paleogeography. Temporal and spatial clusters of PCDs can be associated with localized plate motion and the geometric properties of a subducting slab. Linked databases can reveal the elements that govern cluster distributions of PCDs, a fundamental question for modelling their formation (Cheng 2019). Comparative investigation of the causal associations between extreme events (mineral deposits, magmatic activity, etc.) that have occurred in the two orogenic belts is an excellent domain for calling on the assistance of big data, ML, AI (artificial intelligence), complex network analysis, and visualization.
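One simple way to quantify the similarity between two binned age-frequency series, such as those compared for the two belts, is a lagged cross-correlation. The sketch below is only an illustration of that general technique, not the analysis behind Figure 2b; the two series are synthetic, and the two-bin offset between them is an assumption.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def lagged_correlation(a, b, max_lag):
    """Correlation of a[t] with b[t + lag] for each lag in -max_lag..max_lag."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[:len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[:len(b) + lag]
        result[lag] = pearson(x, y)
    return result


# Synthetic age histograms (hypothetical): belt_b repeats belt_a two bins later.
belt_a = [
    math.exp(-((t - 10) / 3.0) ** 2) + 0.6 * math.exp(-((t - 25) / 2.0) ** 2)
    for t in range(40)
]
belt_b = [belt_a[t - 2] if t >= 2 else 0.0 for t in range(40)]

correlations = lagged_correlation(belt_a, belt_b, max_lag=5)
best_lag = max(correlations, key=correlations.get)
```

A high correlation at a non-zero lag would only suggest a candidate association between the two records; it says nothing about causality.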
From human learning to ML and AI in geosciences
Learning from experience and available knowledge, human beings can make effective judgements about new situations and in turn gain experience from the objective observation of the failure or success of these judgement calls. ML contributes to this positive feedback process by using accumulated experience, in the form of data, to improve the performance of the learning system itself. In short, the main purpose of ML is to find patterns hidden in data (Bishop 2006). The process of capturing models from data is termed ‘learning’ or ‘training’. From this perspective, ML could be considered as a form of AI (Provost & Kohavi 1998). ML was used initially in computer vision and natural language processing, but it has been rapidly applied to many fields of natural science, social science, and engineering. The application of ML has profoundly affected most industries in the past decade, and in particular the financial and commercial sectors. ML has been introduced and utilized in geoscience for several decades for various purposes, ranging from prediction and simulation to multivariate analysis. For example, a suite of techniques including logistic regression, neural networks, weights of evidence and fuzzy logic have been developed and applied to mineral potential mapping and resource assessments (Agterberg 1989; Bonham-Carter 1994; Cheng et al. 1994). Other applications of ML include quantitative stratigraphic comparison (Agterberg & Gradstein 1988), the classification of soil and vegetation using hyperspectral data by neural networks (Benediktsson et al. 1989) and reservoir modeling and applied geophysics by Markov models and Bayesian inference frameworks (Gavalas et al. 1976; Godfrey et al. 1980; Karamouz & Vasiliadis 1992), to name just a few. More thorough reviews of the application of ML in geosciences can be found in several recent publications (Karpatne et al. 2017; Bergen et al. 2019; Reichstein et al. 2019).
This paper is not intended to detail specific technological developments, but rather to introduce a few recent examples to demonstrate how ML is advancing the development of geoscience.
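Of the mineral potential mapping techniques mentioned above, weights of evidence is compact enough to sketch. The code below follows the standard textbook formulation for a single binary evidence layer, with W+ = ln[P(B|D)/P(B|not D)] and W- = ln[P(not B|D)/P(not B|not D)]; the cell and deposit counts are hypothetical, and combining several layers by summing weights additionally assumes conditional independence between them.

```python
import math


def weights_of_evidence(n_cells, n_deposits, n_pattern, n_overlap):
    """Return (W+, W-, contrast) for one binary evidence pattern.

    n_cells    total unit cells in the study area
    n_deposits cells containing a known deposit (D)
    n_pattern  cells where the evidence pattern is present (B)
    n_overlap  cells with both a deposit and the pattern
    """
    p_b_given_d = n_overlap / n_deposits
    p_b_given_not_d = (n_pattern - n_overlap) / (n_cells - n_deposits)
    w_plus = math.log(p_b_given_d / p_b_given_not_d)
    w_minus = math.log((1 - p_b_given_d) / (1 - p_b_given_not_d))
    return w_plus, w_minus, w_plus - w_minus


def posterior_probability(n_cells, n_deposits, weights):
    """Update the prior deposit probability with a list of applicable weights."""
    prior = n_deposits / n_cells
    logit = math.log(prior / (1 - prior)) + sum(weights)
    return 1 / (1 + math.exp(-logit))


# Hypothetical counts: 10 000 cells, 100 deposits, pattern on 2 000 cells,
# 60 of the deposits falling inside the pattern.
w_plus, w_minus, contrast = weights_of_evidence(10_000, 100, 2_000, 60)
inside = posterior_probability(10_000, 100, [w_plus])    # pattern present
outside = posterior_probability(10_000, 100, [w_minus])  # pattern absent
```

With a single layer, the posterior inside the pattern reduces to the observed deposit density there (60 of 2 000 cells, about 0.03), which is a convenient sanity check on the arithmetic.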
ML has been successfully applied to predict and monitor extreme events such as weather, mineral deposits, volcanos and earthquakes. Since the beginning of the twentieth century, scientists have been making great efforts to establish detection systems to accurately monitor these types of events. It has been demonstrated that ML has great potential for detecting extreme events from Earth-monitoring data. An example is the detection of extreme weather events from climate model simulations in the field of atmospheric sciences. Boers et al. (2019) used complex networks to reveal long-distance global-scale dependencies of extreme-rainfall events, which may potentially improve the predictability of associated natural hazards. ML has also been used for automatically detecting occurrences of earthquakes (Ruano et al. 2014; Perol et al. 2018), distinguishing between natural earthquakes and explosions caused by humans (Wu et al. 2018), automatic recognition of volcanogenic seismic events (Titos et al. 2018) and classification of volcanic ash particles (Shoji et al. 2018). Other examples include establishing a global surface water-monitoring system using remote sensing data (Jia et al. 2017), using radar data to predict the presence of liquid water on Mars (Orosei et al. 2018), predicting floods and tornadoes based on remote sensing and radar data (Yu et al. 2015; Zhuang et al. 2016), causality analysis of storm paths (Ebert-Uphoff & Deng 2014), and dimensionality reduction and clustering analysis of seismic attributes (Köhler et al. 2010; Zhao et al. 2017; Qian et al. 2018).
Another potential application of ML is in Earth dynamic simulation and prediction. ML can provide efficient numerical models for the improvement of accuracy and efficiency in inversion modelling of Earth processes. ML methods can provide elegant and accurate approximations to complex geophysical inversions such as predicting mantle flow processes by simulation of mantle convection using temperature fields as training data (Shahnas et al. 2018). Trugman & Shearer (2018) developed a set of non-parametric ground-motion prediction equations by the random forest method to associate stress reduction with peak ground acceleration in northern California. Rouet-Leduc et al. (2017, 2019) used the support vector machines method on seismic and GPS data to predict the instantaneous velocity of a subducting plate. It is worth mentioning that these types of technologies and physical simulations based on actual observations have achieved consistent results.
Challenges for Earth science in the digital revolution
Integrative questions require international cooperation and technology innovation
Earth scientists are still facing many profound scientific issues that require the pace of geoscience investigations to keep up with the urgent global science challenges. For example, since its formation 4.6 billion years ago, the Earth has undergone several major evolutionary stages, including the origin of life and the origin of plate tectonics. The solid Earth operates as a complex system with interactions between and within crust, mantle and core and through the transmission of matter and energy between these layers. These internal processes not only control the evolution of the lithosphere and the genesis of tectonism, magmatism and subduction processes, but also strongly influence surface systems such as mountain building, volcanism, derived sedimentation patterns and the configuration of ocean circulation through the geometry of continent and deep ocean ridges and troughs. Big data show that major geological events such as the formation of supercontinents, the mass extinction of life, ore formation, big oxidation events, glaciations, and clustering occurrences of mineral deposits in the Earth's surface system during geological history are highly consistent with intense periods of deep subduction, plume and large-scale magma activity (Cheng 2017). Understanding how these simultaneous internal processes control the occurrences of extreme events in the lithosphere or in the crust has been a focus of much attention for scientists through the twentieth century to the present.
Another geological challenge, which requires using big data and ML for support, is the exploration for resources in frontier regions such as in the deep ocean, deep Earth, deep space, polar regions, in areas covered by transported glacial or other material or obscured by vegetation, and in shallow complex environments where systematic sampling through drilling is prohibitive. The difficulty and cost of access for direct observation and mapping are problems that have invited the development and application of remote sensing technology, geophysical survey technology, indicator mineral methods, and geochemical surveys and research. New equipment, such as underwater robots, the InSight Mars Lander, and a variety of new geophysical methods, have made possible the evaluation of resources in these special regions. Therefore, data processing and interpretation are key for making new discoveries pertaining to geological problem-solving (Cheng 2012). An excellent example is the Australian effort with the Olympic Dam resources and the Australian big data cube presented at the 32nd IGC and again discussed during the 34th IGC in Brisbane. This led to the establishment of the Resourcing Future Generations (RFG) initiative by Ian Lambert (Williams et al. 2004). All these problems involve the unravelling of the interaction between multiple Earth systems over a wide range of spatial and temporal scales. To describe these complex geo-systems and their interactions requires multi-institutional, multi-organizational international cooperation in geoscience, and technology innovation with special considerations of several challenges that will be outlined below.
Explosive growth of Earth science data
Over the past few decades, with the rapid development of geophysical methods, Earth observation technology, digital technology and significant expansion of the number of sensors and related monitoring systems, there has been an enormous growth in the amount of data available to geoscientists. These data span from the molecular to the global and astronomical geospatial scales, and on the time scale from near-instantaneous events such as earthquakes and river discharge rates, to that of orogenic belts and sedimentary basins, spanning hundreds of millions of years. Among many advanced technologies, supercomputing has become a powerful tool for supporting geoscience research for simulating complex processes such as plate tectonics, mantle convection, formation of orogens and basins, the progression of tsunamis, and earthquake generation through fault rupture.
With the rapid growth of data resources, various specialized large-scale digital databases have emerged. A related trend is that the scientific community is realizing the benefits of sharing their data and computing services, and thereby promoting distributed data and computing community infrastructure (Ludäscher et al. 2006). Although a huge amount of data such as those from geological surveys or other established big thematic (e.g. global earthquake) databases are available in digital and machine-readable form, a major challenge for utilization of these is harmonization of data and interoperability of databases. Joint efforts are needed to facilitate the standardization, harmonization and integration of these diverse data, especially in distributed databases. One excellent example of these types of collaboration efforts is OneGeology (http://www.onegeology.org), which is an international initiative by a number of the world's geological surveys. OneGeology makes data obtained from worldwide geological providers accessible to those who would like to see and use them. Many are portrayed in traditional geological maps. They provide an excellent geological data hub, which can be extended to include more types of data from individuals and research groups who are willing to share and allow the free use of their data. Another influential work is GroundWaterML2 (GWML2) which is an international standard for online exchange of groundwater data. GWML2 aims to overcome the problem of heterogeneity in groundwater databases and to promote multiple forms of data exchange and information integration (Brodaric et al. 2018).
Government survey data generally have good diversity but limited geographic coverage that ends at national borders. Integrating the data compiled in academic research studies and government databases can yield a huge collection of unorganized information in the form of pictures and scanned images, tables, notes, sketches, cross-sections, videos, samples, measurements scattered in documents or even in geoscientists’ notebooks. Setting free these types of ‘volunteer’ generated information is essential for creating big data in earth science. Proper mechanisms including incentive policy and adequate computer technology such as AI techniques need to be found to motivate and facilitate organizations and geoscientists to share their data so that many can benefit from making them FAIR (findable, accessible, interoperable, and re-usable) (Wilkinson et al. 2016).
There are several geodatabases operated and maintained by geological surveys, other governmental agencies, scientific organizations and industry such as the International Seismological Centre (ISC - http://www.isc.ac.uk), the National Earthquake Information Centre (NEIC - https://www.usgs.gov/natural-hazards/earthquake-hazards/connect), and the Global Centroid Moment-Tensor Project (GCMT - https://www.globalcmt.org/). In addition, there are many other databases that are developed by individual scientists, teams of scientists or consortia through scientific programs with limited duration and specific objectives. Some established databases may lose their maintenance capacity, either technically or financially. Some of them even become ‘dead’ or evolve into isolated ‘data islands’ due to the lack of proper management. There is an urgent need to review their relevance and determine whether some of these require maintenance or attention so that they can be linked or transformed into modern databases to broaden their accessibility and usefulness. The accessibility and quality of these Earth science data are key to promoting big data utilization.
Linking knowledge systems and AI to automate data discovery
The main interest of geoscientists in the acquisition and utilization of scientific data is to elucidate natural processes using rational, hypothesis-driven problem solving. How much information and knowledge one can extract from data is therefore the primary measure of success. Human innovation in modern civilization is closely related to the accumulation of knowledge and experience, which are disseminated through publications and other forms of media for knowledge transfer. Engineering innovation often involves systematic design and automation of the flow of highly complex processes that could be far beyond the cognitive capacity of any individual human brain. The integration of the flow of human ideas and ML processes can be aided and automated through the application of modern AI technology and various advanced semantic knowledge engines. These flows can be extended to combine ML processes, information management and infrastructural organizations. The flow models developed by the knowledge engine can be edited, modified and reused for similar problem solving which, in turn, refine the models through positive feedback processes. These types of flow models can be represented visually and implemented through proprietary or native programming languages (Bennett et al. 2016; Ludäscher 2016; Goble et al. 2020). In building these types of flow models, the existing processes developed for specific tasks can be taken as primary building blocks to form more complex and sophisticated processes for tackling large-scale issues (Fig. 3). Techniques and availability of big data are the key elements of the knowledge engine and models for applications. To illustrate the concept of a process model, we can use the simple Model Builder concept developed in the field of geoinformatics. Models for applications in this field are referred to as workflows that string together sequences of geoprocessing tools, feeding the output of one tool into another tool as input. 
The models can not only ensure automation of processes built into the model but also enable smarter data processing and intelligent reasoning by iteration, optimization and conditional branching of the processes and tasks involved in the sequence of them (Cheng et al. 2009). The significance of these types of modeling processes is that human intelligence, including ideas and experiences, in addition to data, tools and processes that are usually shared in the science domain, can be readily shared, reused and improved by AI technology. Newer technologies are rapidly being developed to facilitate the linking of resources to automate the processes (Fig. 4). For example, Wolfram Alpha is a knowledge engine integrating expert-level knowledge and algorithms for automatically answering questions, doing analysis and generating reports. GitHub provides a platform and storehouse for project management and collaboration on code development and sharing.
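The workflow idea described above, in which tools are strung together so that the output of one becomes the input of the next, with iteration and conditional branching, can be sketched in a few lines. This is a generic illustration rather than the Model Builder implementation or any specific knowledge engine; the three tools, the outlier rule and the input values are all hypothetical.

```python
from statistics import median


def drop_missing(values):
    """Tool 1: discard missing measurements."""
    return [v for v in values if v is not None]


def trim_high_outliers(values):
    """Tool 2: discard values above three times the median (an arbitrary rule)."""
    limit = 3 * median(values)
    return [v for v in values if v <= limit]


def rescale(values):
    """Tool 3: rescale to the range 0..1; a conditional branch skips constant input."""
    low, high = min(values), max(values)
    if high == low:
        return list(values)
    return [(v - low) / (high - low) for v in values]


def until_stable(tool, max_rounds=10):
    """Wrap a tool so it is re-applied until its output stops changing (iteration)."""
    def wrapped(values):
        for _ in range(max_rounds):
            updated = tool(values)
            if updated == values:
                break
            values = updated
        return values
    return wrapped


def run_workflow(data, tools):
    """Feed the output of each tool into the next one in sequence."""
    for tool in tools:
        data = tool(data)
    return data


raw = [1.0, None, 2.0, 3.0, 2.5, 100.0, 1.5]  # hypothetical measurements
workflow = [drop_missing, until_stable(trim_high_outliers), rescale]
result = run_workflow(raw, workflow)
```

Because the workflow is just a list of functions, it can be stored, edited and reused on other datasets, which is the property that makes such models shareable building blocks for larger processes.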
There is no doubt that ML has been widely used to solve various geological problems, but it needs to be emphasized that the pursuit of learning technologies may effect a far-reaching change in the nature of research in geoscience. As an intelligent system, AI will bring new vitality to geoscience data acquisition, applied robotics, remote and in-situ sensing, information integration and human–computer interaction (Gil et al. 2018). There is still some basic work that needs to be conducted. This includes launching global universal geoscience benchmark datasets like ImageNet (Deng et al. 2009), and expanding the use and application of new data derived from interferometric synthetic aperture radar, high-resolution satellites and multispectral images (Biggs et al. 2014) to detect changes in the landscape that reflect deep rooted solid Earth processes and development of a framework that can incorporate prior knowledge of geoscience (Hoskins 2013; Karpatne et al. 2017; Gil et al. 2019). The opportunity for further progress lies in new AI techniques for tackling complex problems involving nonlinear Earth system processes.
IUGS big science programs for facilitating data-driven earth science transformation
Social and economic development and dramatic improvements in the quality of life of people living in the new century have increasingly imposed heavy pressure on water, energy and mineral resources, as well as increasing risk related to earthquakes, volcanic eruptions, floods, hurricanes, water contamination, air pollution, food security, clean energy, urban space utilization and health. These issues are closely related to the UN 2030 Sustainable Development Goals (SDGs) (Nilsson et al. 2016) and they require knowledge and solutions from geoscientists and geoengineers (Fig. 5). On the other hand, with the digital revolution, the paradigm of Earth science research will also undergo a tremendous transformation. The critical efforts enabling the success of a data-driven approach for knowledge discovery in Earth system science must include FAIR data (Wilkinson et al. 2016), knowledge systems for semantic searching, and integration of modern algorithms for computing and physical processes-based models for knowledge discovery. These ultimately need the integration of data, models, computers and people (Fig. 6). IUGS is keenly aware of this change and has launched new initiatives for facilitating big data and data discovery for Earth science innovation. Promoting big science programs for international collaboration in facilitating data-driven and knowledge-driven Earth system science has been recognized as both challenging and a strategic priority for the geoscience community (IUGS Annual Report 2017, https://www.iugs.org/annual-reports). IUGS takes the lead on this new initiative in line with its primary goals of strengthening its leading role in Earth science, increasing the level of interaction among IUGS communities as well as cooperation with other organizations, and ultimately improving its service to society and community building. 
In the rest of this paper we will elaborate on the origin of the initiative, with a focus on the general challenges and opportunities for boosting data-driven Earth sciences.
The vision of IUGS embraces the following three aspects: (1) to promote the development of Earth sciences through the support of broad-based scientific studies relevant to the entire Earth system; (2) to apply the results of these and other studies to preserve Earth's natural environment, to use natural resources wisely and to improve the prosperity of nations and the quality of human life; (3) to strengthen public awareness of geological sciences and advance geological education in the broadest sense (http://www.iugs.org). Over the past few decades, IUGS has established or jointly initiated various international science programs such as the International Geosphere–Biosphere Programme (IGBP) (Lindesay et al. 1996), the Global Sedimentary Geology Program (GSGP), the International Geoscience Program (IGCP) (http://www.unesco.org/new/en/natural-sciences/environment/earth-sciences/international-geoscience-programme/), the Global Geochemical Baseline (GGB) and the International Lithosphere Program (ILP) (http://www.scl-ilp.org), and has participated in the OneGeology initiative (http://www.onegeology.org) and Resourcing Future Generations (RFG) (http://www.iugs.org). These programs have created long-term impacts on Earth science research and community development. RFG is an initiative with the mission to focus the world on the challenge of sustainable resource supply and to achieve national development and poverty reduction through a sustainable resource development framework (Nickless et al. 2014).
Undoubtedly, the aforementioned programs have greatly advanced the frontiers of Earth sciences, stretched the limits of our scientific abilities in many directions, and ultimately promoted cooperation in the geoscience community. However, facilitating cross-disciplinary and convergent research and bridging natural science, social science and engineering, as well as fundamental research and solution-oriented science and technology, present both challenges and new opportunities for our communities. New international programs must focus on enhancing public awareness and education, promoting international collaboration, encouraging open data sharing, which is essential to facilitate ‘open science’ in the big data world, and facilitating transdisciplinary research. During the 72nd IUGS Executive meeting held in Potsdam, Germany, in January 2018, the initiative proposed by President Cheng for setting up new IUGS-recognized Big Science Programs and Centers of Excellence was approved. It aims to promote and support big science programs that focus on the integration of several aspects of resources and meet the following criteria (IUGS Annual Report 2018, https://www.iugs.org/annual-reports):
Global and major issues
Collaboration with ISC, UNESCO, etc.
Promotion in underrepresented nations.
Integrating FAIR data to form connected data hubs
To a large extent, understanding the long-term evolution of the Earth system and its controlling factors, including anthropogenic attributes, relies on the development of new methods for integrating and querying different types of observations and models of geoscience data. In this new era of data-driven Earth system science, new platforms and programs are required for facilitating efficient use and deep learning of geoscience data. For example, the construction of complete geoscientific databases would require integration of surveyed maps, which may be stored in government data warehouses, and other data collected by academia, which are either partially published or held on researchers' personal computers. The new initiative of IUGS also aims to provide interoperability of databases operated and maintained by national geological survey organizations (GSOs), other government organizations, academic organizations, and industry as well as databases developed by individual scientists, teams of scientists or consortia through scientific programs with limited duration. From these perspectives, big science programs need to address these critical issues and deliver community-based solutions. The first-tier goal is to promote the development and adoption of specifications to achieve distributed systems of connected geoscience databases with FAIR data standards.
Integrating data science and geoscience to build transdisciplinary community
The Earth is a complex system and the Earth sciences ought to be multidisciplinary (Horton 1998). Driven by major geoscientific issues, Earth science has been undergoing a significant transformation from a separate discipline to integrated Earth system science. New theoretical knowledge systems and technical and methodological platforms are required to facilitate the development of transdisciplinary and convergent research. The multiple spheres of Earth dynamically interact with and influence each other. The fundamental goal of contemporary Earth science is to study the operational mechanisms of the various spheres of the Earth, their interrelationships and the factors that control co-evolution and regulation. Such fundamental and integrative problems of Earth science require multidisciplinary efforts including new theoretical knowledge systems linking all disciplines by breaking the current discipline barriers. New platforms linking multidisciplinary knowledge systems digitally would enable integration of transdisciplinary and multidimensional Earth science data, and require new data discovery techniques for handling and mining big data for knowledge discovery. The transformation from a narrow focus on separate disciplines to a comprehensive and integrative focus will rely on integration of traditional Earth science disciplines and modern disciplines such as geoinformatics, geomathematics and data sciences (Fig. 7).
Integrating massive big data from different disciplines for problem-solving requires algorithms that enable the rapid and efficient processing of big data and that address the challenges of rapidly growing, heterogeneous, multi-source data volumes. Advanced information technologies such as cloud computing, parallel computing, supercomputing, complex networking, knowledge graphing, ML and AI can provide indispensable support to Earth science. This integration will ultimately facilitate the development of complex models to enhance abductive fusion of data-driven and model-driven approaches (Duerr et al. 2015; McPhillips et al. 2015), which in turn support the use of essential computing resources and the best scientific thinking to address challenges (Bergen et al. 2019).
Promoting interactions between national GSOs and International Scientific Associations
The mandates of national GSOs (as illustrated by the papers in this volume) are to collect, monitor, and analyze scientific information to provide knowledge about natural resources and other issues to support societal development and to improve the quality of human life. Most GSOs are responsible for database construction and maintenance. Some GSOs have broad expertise with varied sources of funds to carry out multi-scale and multidisciplinary investigations in order to provide impartial scientific information to resource managers, planners and other clients (http://www.usgs.gov). International scientific associations represent academic professionals including students and typically have the objective of promoting the development of science in specific disciplines.
IUGS is supported by two types of members: Adhering Members and Affiliated Members (the statutes and bylaws of the IUGS are available on-line at https://www.iugs.org/statutes-bylaws). A geoscientific organization from a country or geographic region, supported by an appropriate authority, may become a member of the Union as an Adhering Organization. Adhering Organizations constitute the contributing and voting membership of IUGS and typically include representatives from either one of, or an overarching committee of, GSOs, Geological Societies or Academies of Sciences. The Affiliated Membership generally includes international associations and societies. One of the strategic priorities of IUGS is to facilitate interactions and collaborations of all these groups to provide integrative knowledge essential for strengthening geological science for fundamental science problem-solving and for supporting the sustainable development of societies. Given the mandates of the GSOs, typically to promote the application of geoscience for the public good, and their national roles and reach to support government decision making, enable economic development, improve public safety and protect the environment, it is important for the IUGS to act as an enabler and facilitator of international collaboration, innovation, and knowledge exchange between the GSOs and the other constituent members of the IUGS.
Many successful examples can be listed to demonstrate the essential contribution of close collaboration between governmental agencies and international associations on major science programs that tackle global-scale issues. The International Lithosphere Program (ILP), which promotes cooperation between geology, geophysics and geotechnology, led to the establishment of the International Continental Drilling Program (ICDP). IUGS, the International Union of Geodesy and Geophysics (IUGG) and national members of ILP (http://www.scl-ilp.org) participate in this effort and facilitate the integration of imaging, monitoring and modelling for the study of the global Lithosphere. The IUGS Big Science Programs initiative aims to focus on fundamental and integrative science questions or global-scale issues closely related to the strategic objectives of GSOs. Such an initiative may include but is not limited to the understanding and assessment of global change due to natural geological processes and anthropogenic influences, global distribution and mechanisms of extreme geological events (e.g. extreme weather, volcanoes, earthquakes, tsunamis and floods) that threaten human life, and global assessments of water, air, mineral and energy resources. This is intended to support the establishment of integrated decision-support systems for resource utilization with environmental stewardship, disaster risk reduction, and ecological protection, systems which are in line with the general mission of GSOs (Hill et al. this volume, in review).
The survey data collected by GSOs and other government agencies, such as space agencies (NASA, CNSA, ESA, etc.), and other types of scientific data collected by academia and associations are complementary and ought to be integrated in order to describe the whole spectra of the Earth. While the survey and monitoring data, including satellite remote sensing data, can be vast in volume, the other types of scientific data collected by scientists or scientific programs and treated as ‘long-tail’ and ‘small’ data (such as isotope ages and fossil samples) can provide key information about the genetic properties of geological processes and events. While most survey and monitoring data are in relatively uniform and standard formats, those scattered in academic research studies, with small portions published and available in the repositories of publishers, are less organized with variable data standards. The collective efforts of all relevant organizations to standardize the databases and to make them FAIR will be required.
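To make the standardization step more concrete, the following minimal Python sketch maps ‘long-tail’ age records that use different field names and units onto one common schema with ages in millions of years (Ma). The field aliases, the target schema and the sample records are hypothetical assumptions made for illustration; they do not represent an IUGS specification or any existing database standard.

```python
# Conversion factors from common geochronological units to Ma.
TO_MA = {"a": 1e-6, "ka": 1e-3, "Ma": 1.0, "Ga": 1e3}

# Hypothetical source-specific field names mapped onto one common schema.
FIELD_ALIASES = {
    "sample_id": ("sample_id", "SampleID", "sample"),
    "age": ("age", "Age", "age_value"),
    "unit": ("unit", "Unit", "age_unit"),
}


def harmonize(record):
    """Return the record in the common schema {'sample_id', 'age_ma'}."""
    found = {}
    for target, aliases in FIELD_ALIASES.items():
        for name in aliases:
            if name in record:
                found[target] = record[name]
                break
        else:
            # No alias matched: the record cannot be standardized as is.
            raise KeyError(f"no field found for '{target}'")

    unit = found["unit"]
    if unit not in TO_MA:
        raise ValueError(f"unknown age unit: {unit!r}")

    return {
        "sample_id": str(found["sample_id"]),
        "age_ma": float(found["age"]) * TO_MA[unit],
    }


# Two hypothetical records from different sources with different conventions.
records = [
    {"SampleID": "A1", "Age": "2.5", "age_unit": "Ga"},
    {"sample": 7, "age_value": 558, "unit": "Ma"},
]
for rec in records:
    print(harmonize(rec))
```

In practice an agreed community vocabulary would replace the hand-written alias table, but the principle of converting every source to one documented schema is the same.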
In the new era of Earth system science, driven by the digital revolution, new programs and platforms are urgently needed to facilitate efficient use of geoscience data and move from traditional research approaches to a modern approach driven by digital technologies. Many transitions towards this goal reflect the new ‘big data’ paradigm for scientific research. Since different subject areas have various scientific focuses and needs, and scientists in separate regions experience different financial conditions and scientific infrastructure, IUGS is an ideal organization to bring together their expertise through big science programs and member engagement. International collaborations are required and encouraged in establishing big science programs, especially among IUGS constituent groups including Adhering National Committees and Affiliated International Associations. Over the past few years, President Cheng and his IUGS Executive Committee team have taken every opportunity at conferences and other events to promote new IUGS initiatives and to convince IUGS national committees, organizations and associations to develop proposals for establishing and participating in IUGS big science programs, as well as setting up centers of excellence with the primary objectives of sharing new knowledge, facilitating cutting-edge technological innovation and tackling global issues. It is believed that organizations jointly participating in big science programs as elaborated in this paper will benefit from them for as long as, and to the extent that, they invest in them and invite their collaborators to do likewise.
This investment should take the form of creating enabling projects to integrate their data and disciplinary knowledge systems; moving toward a multidisciplinary system facilitating ML, AI and data discovery; facilitating construction of knowledge engines for automated discovery flow of multiple models and techniques using big data resources; exploring fundamental and integrative questions, including not only the known questions seeking answers but also unknown questions for further innovation; engaging in a dynamic and international community of transdisciplinary sciences; and achieving a profound international impact on promoting the transformation of the geoscientific research paradigm.
The senior author thanks the IUGS EC for their comments and support during the development of the big science program initiative. We also thank Dr Daniel Lebel, Director General of the Geological Survey of Canada, for his invitation to present the paper at the session ‘Changing Role of Geological Surveys’ of the RFG2018 Conference held in Vancouver, Canada, in June 2018, and to write up this paper. Dr Phil Hill is thanked for editing and polishing the English of the manuscript. The two reviewers are thanked for their constructive comments on the earlier version of the manuscript.
This research has been supported by the National Key Technology R&D Program of the Ministry of Science and Technology of the People's Republic of China and the State Key Program of the National Natural Science Foundation of China awarded to Qiuming Cheng.
QC: conceptualization (lead), investigation (lead), methodology (lead), project administration (lead), resources (lead), supervision (lead), writing – review & editing (lead); RO: conceptualization (equal), investigation (equal), validation (equal); MZ: data curation (equal), formal analysis (equal), methodology (equal), software (equal), validation (equal), visualization (equal), writing – original draft (equal)