Minerals are information-rich materials that offer researchers a glimpse into the evolution of planetary bodies. Thus, it is important to extract, analyze, and interpret this abundance of information to improve our understanding of the planetary bodies in our solar system and the role our planet’s geosphere played in the origin and evolution of life. Over the past several decades, data-driven efforts in mineralogy have seen a gradual increase. The development and application of data science and analytics methods to mineralogy, while extremely promising, have also been somewhat ad hoc in nature. To systematize and synthesize the direction of these efforts, we introduce the concept of “Mineral Informatics,” which is the next frontier for researchers working with mineral data. In this paper, we present our vision for Mineral Informatics and the X-Informatics underpinnings that led to its conception, as well as the needs, challenges, opportunities, and future directions of the field. The intention of this paper is not to create a new specific field or a sub-field as a separate silo, but to document the needs of researchers studying minerals in various contexts and fields of study, to demonstrate how the systemization of, and enhanced access to, mineralogical data will increase cross- and interdisciplinary studies, and to show how data science and informatics methods are a key next step in integrative mineralogical studies.

The potential for data-driven methods to make novel, unintuitive, and groundbreaking discoveries in Earth and planetary science research will only grow as the volume and variety of data increases with time. Mineralogy, in particular, is ripe for the application of data-driven methods. Minerals form as a result of their unique chemical and physical conditions and, in the process, retain information regarding their formation that offers an opportunity to study the complex geologic and biologic past of planetary bodies (Prabhu et al. 2021b).

Mineralogy has been the subject of scientific curiosity and study for millennia (Agricola and Bandy 1955; Needham and Wang 1995; Bandy and Bandy 2004). In addition to their roles as captivating specimens for collection and study, minerals and their ores are essential to the survival and industrialization of humankind (Coates 1985; Murray 1995). This interest and utility have led to the characterization and systemization of mineralogy and mineral occurrence on Earth and other planetary bodies (Dana 1895; Bragg and Bragg 1913; Strunz and Tennyson 1941; Lehnert et al. 2000; Lafuente et al. 2015; Hazen and Morrison 2020). As a result of this rich history of scientific investigation, vast amounts of information are available on the occurrence and attributes of minerals. These data provide a robust platform for the analysis of more complex, multidimensional, and larger mineralogical systems; the integration of heterogeneous data types, linking to data from other fields of science; and predictive, data-driven scientific exploration. Together, these capabilities make it possible to answer complex, multidisciplinary questions. The potential of data-driven mineralogical research has been exemplified by important scientific advances in the last decade. Recent discoveries have demonstrated periodicity of mineral formation and diversification associated with supercontinent assembly (Bradley 2011; Voice et al. 2011; Hazen et al. 2014; Nance et al. 2014), an association of mineral redox state with the oxidation of Earth’s atmosphere (Liu et al. 2021; Hummer et al. 2022; Large et al. 2022), and that much of Earth’s mineral inventory is the direct or indirect result of interactions with water and/or biology (Hazen and Morrison 2020, 2022). Other studies have predicted the number of as-yet undiscovered mineral species (Hazen et al. 2015; Hystad et al. 2015, 2019), the chemical composition of minerals on Mars (Morrison et al. 2018a, 2018b, 2018c), and the location of undiscovered mineral deposits (Prabhu et al. 2019). Mineralogy is rapidly entering the data-driven era, tackling previously unanswerable questions while demonstrating the need and opportunity for a symbiotic relationship between mineralogy and the fields of data science and informatics.

Data-driven efforts in mineralogy have been gradually increasing in the past decades, and some promising studies have helped researchers uncover patterns hidden in the data that have led to scientific discoveries (Morrison et al. 2017, 2020; Gregory et al. 2019; Hazen et al. 2019; Prabhu et al. 2019; Hazen and Morrison 2020, 2022; Zhao et al. 2020; Boujibar et al. 2021; Hystad et al. 2021). While still nascent, the application of data science and data analytics methods in mineralogy shows a promising trajectory, though the development of these methods has so far been somewhat ad hoc in nature. However, development of mineral informatics can be guided in a more deliberate and systematic way by considering the underpinnings from information theory and data science advances, as exemplified by collaborations in other fields, including biology, medicine, chemistry, and astronomy. We believe this is the start of a new era in mineralogy, where utilizing data-driven methods to answer mineralogical (and broader scientific) questions takes center stage.

In this paper, we take a high-level look at our vision for “Mineral Informatics,” the underpinnings that led to its conception, as well as the needs, challenges, and opportunities for this emerging field. We also discuss the implications such advances will have on the field of mineralogy.

Informatics is the study of the structure, algorithms, behavior, and interactions of natural and artificial systems that store, process, access, and communicate information (Fox 2011). The term informatics has often been used in conjunction with the name of a domain/discipline, for example, Bioinformatics, Geoinformatics, Astroinformatics, and Cheminformatics. In the past, researchers with expertise in a specific domain worked on processing and engineering information systems designed for that domain only. But in the last decade, informatics has gained much wider visibility across a range of disciplines (Prabhu 2019). This wider visibility is in large part due to successful efforts at systematizing the core (i.e., discipline neutral) aspects of informatics, for example, use cases, human-centered design, iterative approaches, and information models (Fox 2020). The core methods of informatics are used as a foundation to explore raw data and extract information from the data that lead to scientific discoveries. As the volume and complexity of the data increase, so does the need for utilizing the solid foundations provided by informatics methods and combining them with the needs of the specific domain to pursue data-driven scientific discoveries.

Mineral informatics is a nascent approach compared to fields like Bioinformatics, Medical Informatics, and Geoinformatics that have been pursued for decades (Collen 1986; Fox et al. 2006; Sinha et al. 2010; Gauthier et al. 2019). The intention of this paper is not to create a new specific field or a sub-field as a separate silo but to document the needs of researchers studying minerals in various contexts and to show how data science and informatics methods are a key next step in mineralogical studies. We also need to learn from the successes and failures of more mature domains that have applied the informatics approach. Lastly, it is important to keep in mind the truly interdisciplinary questions that can be explored by studying minerals. So, while the term “mineral informatics” may seem to be creating a new subclass of geoinformatics, we assert that we are instead tying together various disciplines that use minerals as a key part of the pursuit to answer big science questions.

In this paper, we present a general methodology for mineral informatics (see Fig. 1). This methodology, adapted from Fox and McGuinness’ Semantic Web Methodology (Beaulieu et al. 2017), includes all the steps typically followed in a data-driven scientific exploration. This approach was created for mineral informatics but, as is the case with many data science and informatics approaches, is transferable and applicable to other domains.

Most informatics explorations start in one of two ways: (1) scientists have a research question they want to answer, or (2) scientists have data ready to be explored. In the second case, we perform preliminary data exploration, which helps generate new hypotheses and research questions based on interesting trends and anomalies in the data.

Once a specific research question has been selected for scientific exploration, we start by dividing the large problem into smaller, more tractable parts. Next, we iteratively develop use cases for every one of these parts. A “use case” is a documented collection of possible sequences of actions and interactions between a system and its users in pursuit of a particular goal. Identification and development of use cases help to define the needs (e.g., data, personnel, infrastructure) for this data-driven approach. The next step in the methodology is to create an interdisciplinary team (or to assign roles within an already established one) to conduct the data-driven research.

Next, we inventory the preliminary data set and/or existing mineral data resources to determine if they are what is necessary for the desired exploration. In some cases, we need to collect, compile, and extract data from other repositories or sources, including scientific literature, websites, digital PDFs, and experimental results. We then create an information model to better understand and mediate data from heterogeneous sources and data types, which provides a holistic picture of the relationships among the various data sources, types, and attributes. The information model allows us to extract the data sets and data attributes most relevant to answering the desired research question. Note that this step differs from the statistical and machine learning approaches used for feature selection.

We then begin applying data analytics methods (i.e., data visualization as well as descriptive, predictive, and prescriptive analysis) to identify and explore patterns and anomalies seen in the data. A team of domain and data scientists iteratively examines the results of the analytics methods, and its members use their respective expertise to: (1) provide interpretations and/or insight, and/or (2) recommend changes to the analysis. The data analysis and scientific interpretation are usually done over multiple iterations with small modifications to the approach, algorithms, and/or code to explore different aspects of the data.

If scientists come to an agreement that parts of the analysis would be widely used in the larger community, then they can choose to generalize and adapt their work into a system, technology, or infrastructure. This development can include the creation of tools, code snippets, reusable workflows, R packages, Python libraries, and other resources. Irrespective of whether there is a decision to create a general tool, technology, or package, we recommend using rapid prototyping coding practices (Gordon and Bieman 1995) for data science and informatics activities.

After obtaining the desired results from our data analysis, it is important to disseminate and effectively communicate the research products generated by mineral informatics explorations. Research products can include data sets, code, scientific literature, and executable workflows. Establishing best practices for disseminating research products is an ongoing effort, especially in the geoscience community. Data sets can be published as part of a data paper, or they can be assigned their own DOIs by data repositories such as Zenodo, Dryad, Figshare, or Dataverse (Assante et al. 2016). Existing mineral data repositories, including the EarthChem Library (ECL), Astromat, and the Open Data Repository (ODR), also provide DOIs for data sets deposited by researchers. Additionally, some journals host data associated with their publications. Similar to releasing data used in scientific exploration, code can be maintained and released in many ways, including GitHub (with a persistent identifier pointing to the repository), Figshare, or Zenodo. Saving executable code for an experiment in an interactive environment like Jupyter or R notebooks adds to the reproducibility of the code and of the scientific workflow in general (Prabhu and Fox 2021). Dissemination of scientific advances through scientific publications has been practiced for more than 300 years (Fyfe et al. 2015). In addition to journal publications, conference proceedings, preprint servers (such as arXiv, ESSOAr, and EarthArXiv), and even press releases associated with publications have considerably improved the landscape of disseminating research products.

The final stage of our informatics methodology follows the sharing of the research products. If researchers follow FAIR (Findable, Accessible, Interoperable, and Reusable) and Open Science practices (Wilkinson et al. 2016; Stall et al. 2019; Ramachandran et al. 2021) not only for the dissemination of their scientific results but also during the use case development, information modeling, and analysis stages, then it becomes easier to evolve, improve, redesign, or adapt their work. Ongoing research and recommendations on designing FAIR and Open scientific workflows will help improve the methodology of data-driven exploration (Sandve et al. 2013; Kluyver et al. 2016; Prabhu and Fox 2021).

It is important to evaluate the outcomes at almost every stage of the informatics methodology. The evaluation method or metric used at each stage will be significantly different, but it is important to stop at the end of every stage and assess not only the progress made but also lessons learned for future iterations in the same exploration or the beginning of a different exploration. For example, a data collection/resource may be evaluated based on a set of quality criteria (e.g., Prabhu et al. 2021b), but results from the data analysis may need to use quantitative metrics to evaluate results from a descriptive, prescriptive, or predictive model (e.g., Statnikov et al. 2008; Hossin and Sulaiman 2015; Tomašev and Radovanović 2016; Zhou et al. 2021). Established evaluation methods exist for each stage of the informatics methodology, and we recommend following those established best practices and standards set by the scientific community. Issues found during the evaluation will need to be documented in the use case and thus improve the data-driven exploration during the next iteration or redesign of the approach.
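As one illustration of a quantitative metric for a predictive model, the Python sketch below computes precision, recall, and the F1 score for a set of predicted mineral occurrences against a held-out set of confirmed occurrences. The function name and the toy labels are assumptions made for illustration; the studies cited above may use different metrics.

```python
def precision_recall_f1(predicted, confirmed):
    """Compare predicted occurrences with confirmed (held-out) occurrences.

    Both arguments are iterables of hashable labels, for example
    (mineral, locality) pairs. Returns (precision, recall, f1).
    """
    predicted, confirmed = set(predicted), set(confirmed)
    true_positives = len(predicted & confirmed)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, invented for illustration only.
predicted_pairs = {("mineral_A", "loc_1"), ("mineral_B", "loc_1"),
                   ("mineral_C", "loc_2"), ("mineral_D", "loc_3")}
confirmed_pairs = {("mineral_B", "loc_1"), ("mineral_C", "loc_2"),
                   ("mineral_E", "loc_4")}

precision, recall, f1 = precision_recall_f1(predicted_pairs, confirmed_pairs)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Recording such scores in the use case at each iteration gives the team a consistent basis for deciding whether a change to the analysis improved the model.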

Mineral informatics methods not only systematize the mineral data landscape, but also provide a path to answering longstanding interdisciplinary scientific questions. Figure 2 gives an example of the domains influenced by the research questions being broached with mineral informatics methods. In the following section we outline some significant scientific questions that can be addressed with mineral informatics.

Can complex chemical and physical attributes of mineral specimens reveal their paragenetic modes and function as proxies for biosignatures?

Minerals record the physical, chemical, and, in some cases, biological conditions of their paragenetic modes (i.e., formational and alteration environments). This information is stored in the myriad attributes of mineral specimens, including major, minor, and trace elements; stable isotopes and their ratios; solid and fluid inclusions; texture, twinning, exsolution, and other structural characteristics; grain size and shape; and much more. Therefore, conditions of mineralization, including whether or not there was biological input, can be characterized with cluster analysis performed on the various properties of mineral samples (Gregory et al. 2019). Furthermore, robust classification schemes can be developed from the clustering models that will enable prediction not only of the geologic environment of formation but also of any biogenic origins (Hazen 2019). Therefore, this work will deconvolve our understanding of the minerals that formed in environments influenced by life from those that formed under strictly abiotic conditions.
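The cluster analysis described above can be sketched with a plain k-means routine in Python. This is a minimal illustration, not the method of the cited studies: the two-feature sample values are invented, the farthest-point seeding is an assumption chosen to keep the example deterministic, and real analyses would normally scale features and test several cluster counts.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain k-means. Returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, labels


# Hypothetical samples: two log-scaled trace-element values per specimen.
samples = [(1.0, 0.2), (1.2, 0.1), (0.9, 0.3),
           (3.0, 2.1), (3.2, 1.9), (2.9, 2.2)]
centroids, labels = kmeans(samples, k=2)
print(labels)
```

Each resulting cluster can then be compared with independently known formation environments to judge whether the grouping reflects paragenetic mode.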

Is a planet’s diversity and statistical distribution of mineral species influenced by the presence of life?

Life creates unique niches of chemical disequilibrium for minerals to exploit. These processes likely drove a significant fraction of the mineral diversity we see on Earth today, influencing the spatial and temporal patterns of mineral distribution (Hazen 2018; Morrison et al. 2020; Hazen and Morrison 2022). These trends on Earth and other planetary bodies can be modeled, compared, and used to develop statistical biosignatures and abiosignatures that are reflected in the diversity and distribution of mineral species across a planetary body (Hystad et al. 2019) and provide models for planetary-scale mineralogical biosignatures of inhabited worlds.

Can we predict mineral occurrences on other planets given limited planetary data?

From orbital infrared spectroscopy, we have obtained global or near-global data sets of the mineralogy of other terrestrial worlds, including Mars, Mercury, Vesta, and Ceres (Murchie et al. 2009; De Sanctis et al. 2012; Ehlmann and Edwards 2014; Namur and Charlier 2017; Prettyman et al. 2019). Informatics methods, such as association analysis, can be used to predict the existence of minerals that cannot be detected from space. By understanding mineral affinities for assemblages, localities, and geochemical parameters, we may be able to use a sparse mineralogical data set to anticipate future discoveries (Prabhu et al. 2019), but first a robust small/sparse-data framework must be developed. Enhancing predictive capabilities will help to prioritize landing sites for future landers and rovers with broad science goals that relate to mineralogy, like understanding planetary history or searching for signs of life. Such predictions would be strategically important because interplanetary missions cost hundreds of millions to billions of dollars and take years to decades to develop, build, and launch.

We also have geochemical indicators of the mineralogy of the ice-covered ocean world Enceladus from plume flybys and E-ring analyses performed by the Cassini spacecraft (Postberg et al. 2008; Waite et al. 2017; Glein and Waite 2020). Mineral informatics methods can help predict the mineral composition of ice-covered ocean worlds, whose mineralogy is planetologically and perhaps astrobiologically relevant but cannot be accessed directly in the near future.

Co-occurrence of minerals and life: do minerals enable or shape the metabolic landscape?

Minerals play a key role in biological redox transformations. Many microorganisms (e.g., of the genus Geobacter) are able to use metals in their environment to power their metabolisms (Childers et al. 2002). Several studies have suggested deep similarities between minerals and metalloenzymes (Nitschke et al. 2013; Zhao et al. 2020; McGuinness et al. 2022). Thus, minerals may play an important role in shaping the metabolic landscape of ecosystems by providing electron donors/acceptors or raw materials (Novikov and Copley 2013) that organisms assimilate to create metalloenzymes. If minerals and their structures are found to be critical in shaping which metabolisms occur/do not occur in certain environments, these mineralogical data may allow for the prediction of metabolisms in terrestrial and extraterrestrial environments for which we have mineralogical data.

What role did minerals play in the origin of life?

Several studies have posited that minerals played critical roles in the emergence of life on Earth, whether by catalyzing critical biomolecular reactions, templating the formation of biopolymers, influencing the homochirality of organic molecules, or performing redox transformations and carbon fixation (Hazen and Sholl 2003; Hazen 2005; Hazen and Sverjensky 2010; Nitschke et al. 2013; Russell 2018). Others have even suggested that clays and other minerals with layered structures may have been the first self-replicating entities (Cairns-Smith and Hartman 1986; Cairns-Smith 1990; Greenwell and Coveney 2006; Brack 2013), though these hypotheses have not been confirmed experimentally (Bullard et al. 2007; Krivovichev et al. 2011). Mineral informatics, combined with phylogenetics, geology, and laboratory experiments, could be informative for deducing the likely role(s) that minerals played at the origin of life in Earth’s deep past. If certain minerals are found to be uniquely critical to the emergence of life on Earth, then this discovery would have profound implications for the emergence of life on other planetary bodies where those minerals may or may not occur. The origin of life from a non-living substance involves a considerable jump in the informational (static) complexity of the underlying molecular structures, which should be considered in any possible scenario of molecular evolution/revolution that led to the appearance of self-replicating living entities. The sudden rise in structural complexity corresponds to a drop in configurational entropy (Krivovichev 2016). Can the (local) entropic changes associated with the origin of life be measured quantitatively and understood using mineral informatics data?

Can mineral networks serve as a planetary-scale biosignature?

Roughly half of all known minerals are mediated by biology and 34% are exclusively biotic (Hazen et al. 2021, 2022; Morrison et al. 2021; Hazen and Morrison 2022). Many of these minerals are formed when life opens up a new compositional space for the planet, such as the Great Oxidation Event (Hazen et al. 2008; Sverjensky and Lee 2010). However, some of this biogenic chemical space may be abiotically accessed on other worlds. Abundant atmospheric O2, for instance, may be abiotically generated by various star–planet interactions (Meadows et al. 2018, and references therein). Earth and planetary mineral network analysis may reveal whether mineral networks of environmental, biological, geochemical, and mineralogical attributes can distinguish living from nonliving worlds.

Can mineral networks serve as a proxy for the extent of planetary evolution?

Mineralogical evolution occurs when processes create new pressure–temperature–compositional regimes where solids can form (Hazen et al. 2008, 2021; Hazen and Morrison 2020; Cleland et al. 2021). Each stage of mineral evolution expands the network of mineralogy through the introduction of new minerals, localities, and paragenetic modes. The network of martian mineralogy, therefore, is thought to be a subset of the network of Earth’s mineralogy, due to the halting or slowing of mineral-generating geological processes on Mars. One can consider Mars and Earth to be two points along a spectrum of terrestrial worlds whose geological (and biological) activities have differed in temporal extent. A hypothetical world where plate tectonics was sustained for ~1 Gyr but then ceased should have a mineral network that surpasses Mars’s mineral diversity, but is still a subset of Earth’s. In this way, mineral informatics helps us interpret the extent of a planet’s mineralogical network as a record of ancient and extinct processes, revealing a planet’s geological history.
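The subset relationship described above can be made concrete with a short Python sketch. It assumes, purely for illustration, that a planetary mineral network is stored as a set of (mineral, paragenetic mode) edges; the world names, mineral labels, and mode labels are invented placeholders and do not represent real data.

```python
def network_edges(occurrences):
    """Flatten {mineral: {mode, ...}} into a set of (mineral, mode) edges."""
    return {(mineral, mode)
            for mineral, modes in occurrences.items()
            for mode in modes}


def coverage(candidate, reference):
    """Fraction of the reference network's edges present in the candidate."""
    return len(candidate & reference) / len(reference)


# Hypothetical worlds: one with a long geological history, one arrested early.
long_lived_world = network_edges({
    "mineral_A": {"igneous", "hydrothermal"},
    "mineral_B": {"igneous"},
    "mineral_C": {"biotic"},
})
arrested_world = network_edges({
    "mineral_A": {"igneous"},
    "mineral_B": {"igneous"},
})

is_subset = arrested_world <= long_lived_world
extent = coverage(arrested_world, long_lived_world)
print(is_subset, extent)
```

Under this representation, the coverage value is one simple candidate for placing a world along the spectrum between an arrested and a long-lived mineral network.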

When considering exoplanetary systems where element ratios (e.g., C:O or Mg:Si) differ greatly from those of our own solar system, this linear spectrum on which Mars and Earth lie becomes a multidimensional phase space (Unterborn et al. 2016; Hinkel and Unterborn 2018; Unterborn and Panero 2019; Putirka et al. 2021). Understanding mineral networks from an informatics point of view may help to predict how planetary mineralogy might evolve in vastly different geochemical contexts.

Did the emergence and evolution of life play a role in the increase of average mineral structural complexity on Earth through deep time?

It has been shown that the complexity of Earth’s mineral kingdom increased gradually during planetary evolution (Krivovichev et al. 2018), but it is unclear whether this trend is related to the contemporaneous increase in complexity in the course of biological evolution. The average structural complexity of minerals on the abiotic Moon, for example, does not follow the same trend of increasing complexity through time. Minerals are relatively less complex than biological organisms, both in terms of their static (Krivovichev 2013, 2015) and functional (Hazen et al. 2007) complexities. However, since life and the mineral kingdom co-evolved, the character of the evolution of mineral complexity on Earth (Krivovichev et al. 2018) may have been influenced by biological activity, and is thereby a potential biosignature.
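To make the notion of static structural complexity concrete, the Python sketch below computes a Shannon-information measure of the kind used in the complexity literature cited above: atoms in the reduced unit cell are grouped into crystallographic orbits, and the information content is calculated per atom and per cell. The function names and the example multiplicities are assumptions made for illustration, and the exact definition should be checked against the cited works.

```python
import math


def information_per_atom(orbit_multiplicities):
    """Shannon information in bits per atom.

    orbit_multiplicities lists how many atoms of the reduced unit cell
    belong to each crystallographic orbit. With v atoms in total and
    p_i = m_i / v, the result is -sum(p_i * log2(p_i)).
    """
    v = sum(orbit_multiplicities)
    return -sum((m / v) * math.log2(m / v) for m in orbit_multiplicities)


def information_per_cell(orbit_multiplicities):
    """Total information in bits per reduced unit cell (v times bits/atom)."""
    return sum(orbit_multiplicities) * information_per_atom(orbit_multiplicities)


# Toy structure: two orbits with one atom each in the reduced cell.
simple_structure = [1, 1]
bits_per_atom = information_per_atom(simple_structure)
bits_per_cell = information_per_cell(simple_structure)
print(bits_per_atom, bits_per_cell)
```

A structure with more symmetrically distinct sites, or more atoms per cell, yields a higher value, which is what allows average complexity to be tracked across a mineral inventory through time.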

Strategies for future advances in mineral informatics are informed by previous efforts—“use cases” that have applied data science analytics and visualization to tackle key mineralogical problems. In the following section we review five of these recent and ongoing studies.

The evolution of mineralizing environments, as characterized by their myriad, complex attributes

Mineralization, and associated formational environments, vary significantly across Earth and neighboring planetary bodies, as well as throughout the different historical stages of planetary evolution. These stages and environmental parameters dictate the types of mineralization that occur and, likewise, leave their mark in the complex chemical and physical attributes of the resulting mineral specimens. Understanding the changing characteristics of mineralizing environments spatially and temporally across our planetary systems requires the examination of huge volumes of mineralogical information. The beginning steps of this work included a survey of all formational environments of ~5700 known mineral species, resulting in a compiled data set ripe for exploration (Hazen and Morrison 2022; Hazen et al. 2022). Initial exploration has led to the discovery that: (1) more than 80% of all mineral species formed through processes that involved water; (2) 50% of minerals formed through processes directly or indirectly related to biology, with 34% of minerals forming exclusively through biotic processes; (3) 42% of minerals contain one or more rare elements (e.g., REE, PGE, As, Mo, Sn), elements which all together represent only 0.01% of crustal atoms; and (4) most minerals have only one (59%) or two (24%) modes of formation, with a few notable exceptions, including pyrite with the most modes of formation at 21 (Hazen and Morrison 2022).

An additional component of this work involves analyzing those myriad attributes of mineral specimens via cluster analysis to relate their complex characteristics to their modes of formation, thereby determining the natural kind clustering of these mineral systems. There are many such projects underway, including those examining the formation of pyrite (Gregory et al. 2019; Zhang et al. 2019), garnet minerals (Chiama et al. 2020, 2022), spinel oxide phases (Hindrichs et al. 2022), and presolar moissanite (SiC) (Boujibar et al. 2021; Hystad et al. 2021). Boujibar et al. (2021) performed cluster analysis on a range of isotopic data from presolar SiC grains to examine and compare the origins of these materials. This study made several exciting discoveries—while the clustering model agreed with previously defined grain types and origins in several aspects, there were notable and important deviations, including (1) a division of one previously defined grain type into three distinct kinds based on the varying metallicity of the parent star; (2) the arbitrary nature of certain prior divisions in systems that in fact are continuous rather than discrete; (3) the observation that asymptotic giant branch (AGB) stars with narrow ranges of mass and metallicity tend to have enhanced production of SiC; and (4) enrichments in 15N and 26Al that are not explained by existing AGB models.

Next steps. This exploration of mineralizing environments and their characteristics not only provides an opportunity to integrate data from heterogeneous sources and types (e.g., X-ray diffraction, electron microprobe analysis, inductively coupled plasma mass spectrometry), but also to link data from different fields of science to better understand mineral paragenesis. Handling heterogeneous data is a challenge (Reichman et al. 2011; Wang 2017), and many researchers have been actively working on using heterogeneous data for their analysis by creating methods, approaches, and pipelines to seamlessly clean, integrate, process, and analyze data (Wiederhold 1999; Beneventano and Bergamaschi 2004; Wang 2017; Zhang et al. 2018; Nazábal et al. 2020). Additionally, the exploration conducted by Boujibar et al. (2021) provided another use case to test machine learning methods on sparse data sets, thereby aiding in the eventual development of a sparse data framework.

Mineral association analysis

Prediction of the locations of as yet undiscovered mineral deposits has long been a point of great scientific and economic interest. Mineralization and mineral co-occurrence across the varied geologic terrains of Earth and other planetary bodies has a level of complexity that makes prediction of mineral locations, or even the mineral inventory at a locality of interest, difficult. However, recent advances in the mineral locality data resources (e.g., mindat.org and the Mineral Evolution Database) have provided an opportunity to begin tackling this tough problem with machine learning. Association analysis can be used to create a recommender system (Burke et al. 2011; Shah et al. 2017) that generates association rules based on known co-occurrences, and these rules can be queried to determine the likelihood of currently unknown co-occurrences. In the case of minerals, we can query our mineral association rules to predict: (1) previously unknown locations of a mineral species; (2) previously unknown locations of mineral assemblages, including those that represent analog environments for study; and (3) the mineral inventory at a locality of scientific interest. The mindat.org team have conducted preliminary explorations using pairwise associations to predict the occurrence of certain minerals on Earth.
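A minimal Python sketch of pairwise association analysis is given below. It is not the implementation used in the cited work: the locality table is invented, the function names are hypothetical, and only the standard support, confidence, and lift definitions are assumed. Rules are generated from known co-occurrences and then queried to rank minerals not yet reported at a locality.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(localities, min_support=0.0):
    """Build rules 'antecedent -> consequent' from mineral co-occurrence.

    localities maps a locality name to the set of minerals reported there.
    """
    n = len(localities)
    single, pair = Counter(), Counter()
    for minerals in localities.values():
        single.update(minerals)
        pair.update(combinations(sorted(minerals), 2))
    rules = []
    for (a, b), count in pair.items():
        support = count / n  # fraction of localities containing both minerals
        if support < min_support:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / single[antecedent]
            lift = confidence / (single[consequent] / n)
            rules.append({"if": antecedent, "then": consequent,
                          "support": support, "confidence": confidence,
                          "lift": lift})
    return rules


def recommend(rules, observed):
    """Rank minerals absent from `observed` by their best rule confidence."""
    best = {}
    for rule in rules:
        if rule["if"] in observed and rule["then"] not in observed:
            best[rule["then"]] = max(best.get(rule["then"], 0.0),
                                     rule["confidence"])
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Invented occurrence table, for illustration only.
toy_localities = {
    "loc_1": {"mineral_A", "mineral_B", "mineral_C"},
    "loc_2": {"mineral_A", "mineral_B"},
    "loc_3": {"mineral_A", "mineral_C"},
    "loc_4": {"mineral_B"},
}
rules = pairwise_rules(toy_localities)
ranking = recommend(rules, toy_localities["loc_4"])
print(ranking)
```

Querying the rules with the minerals already known at a locality corresponds to use (3) above; queries (1) and (2) would instead scan all localities for a target mineral or assemblage.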

Next steps. Mineral association analysis provides a powerful approach to new types of data problems. We need to modify the association analysis algorithms to better handle larger mineral occurrence data sets. For example, our models can currently handle only 2473 minerals occurring in 87 306 localities (Prabhu et al. 2019), but there are at present more than 5800 mineral species in the International Mineralogical Association’s (IMA) list of approved mineral species (https://rruff.info/ima/, accessed 17 January 2023), which occur in more than 375 000 localities (https://www.mindat.org/stats.php, accessed 20 December 2022). To increase the scalability of the association analysis algorithm, we plan to introduce threshold checks and additional parameters during the association rule generation process, so that the number of rules generated is controlled. In addition to improving the scalability of association analysis methods, we also need to increase the dimensionality and reduce the minimum support of our method. For example, our method currently develops rules containing 4 minerals at a time, but there are localities with more than 50 coexisting minerals. Therefore, an important next step in our research is to increase the dimensionality of the association analysis method to handle more complex mineral assemblages. We plan to reduce the number of rules in a rule base by better identifying redundant rules or similar rules, thus leaving more disk space for higher dimensional rules. We also need to adapt our methods to enable inclusion of rarer mineral species that are known to occur in 17 or fewer localities (Prabhu et al. 2019). We plan to include rarer mineral species by weighting the mineral occurrence by other factors including tonnage, its paragenetic mode diversity, and criticality of the mineral. Lastly, we are currently developing a new approach to evaluate association rule mining methods (Prabhu et al. 2021a).
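The effect of threshold checks on the size of a rule base can be illustrated with a standard Apriori-style search for frequent mineral assemblages, sketched below in Python. This is a generic textbook technique shown under stated assumptions, not the planned implementation: the parameters `min_support` and `max_size` and the toy assemblages are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size):
    """Level-wise (Apriori-style) search for frequent assemblages.

    transactions: list of sets of minerals, one per locality.
    min_support:  minimum fraction of localities an assemblage must occur in.
    max_size:     largest assemblage size to generate (the dimensionality).
    Returns {frozenset: support}.
    """
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({mineral for t in transactions for mineral in t})
    current = {}
    for mineral in items:
        s = support(frozenset([mineral]))
        if s >= min_support:
            current[frozenset([mineral])] = s
    result = dict(current)

    size = 1
    while current and size < max_size:
        size += 1
        # Join step: merge frequent sets of the previous level.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        current = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:  # threshold check limits rule growth
                current[candidate] = s
        result.update(current)
    return result


# Invented assemblages, for illustration only.
toy_assemblages = [
    {"mineral_A", "mineral_B", "mineral_C"},
    {"mineral_A", "mineral_B", "mineral_C"},
    {"mineral_A", "mineral_B"},
    {"mineral_C", "mineral_D"},
]
loose = frequent_itemsets(toy_assemblages, min_support=0.5, max_size=3)
strict = frequent_itemsets(toy_assemblages, min_support=0.75, max_size=3)
print(len(loose), len(strict))
```

Raising `min_support` shrinks the rule base but excludes rare species, while raising `max_size` admits larger assemblages at greater cost; this is the trade-off between scalability, dimensionality, and rarity described above.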

Martian crystal chemistry

The scientific payload onboard the NASA Mars Science Laboratory (MSL) rover, Curiosity, is one of the most advanced instrument suites ever landed on another planet. Part of this payload is the CheMin X-ray diffraction (XRD) instrument, which is used to characterize the mineralogy of rock and soil samples. CheMin is capable of identifying mineral phases present in samples, as well as their abundances and, for phases with an abundance ≥1 to 3 wt%, their unit-cell parameters. While there are instruments that analyze the bulk composition of martian samples, there is no instrument that directly measures the chemical composition of these mineral phases. However, by compiling data resources on mineral unit-cell parameters and compositions measured on Earth, the CheMin XRD patterns and resulting mineralogical data can be used to predict the composition of the mineral phases observed on the martian surface (Morrison et al. 2018a, 2018c).

These initial studies, as with many investigations predating them, used unit-cell parameters to predict mineral composition in chemically limited systems, generally 2- or 3-element systems such as Fe-Mg olivine or Mg-Fe-Ca pyroxene (Morrison et al. 2018a, 2018c). This limitation was due to the complexity of the compositional and structural parameter space when four or more elements are considered together. One way to develop a model that accounts for the complexity associated with multi-component systems and predicts the chemical composition of crystalline phases based on their crystallographic parameters is by using Label Distribution Learning (LDL) (Geng et al. 2013, 2014; Geng 2016). LDL is a machine learning algorithm originally created for facial recognition applications. When the approach was adapted for application to crystallographic and chemical parameters, it resulted in a model that accurately predicted the multi-component chemical compositions (up to 12 elements, in some mineral systems) of samples based solely on their unit-cell parameters (Morrison et al. 2018b). This crystal-chemical method has expanded the capability of XRD on spacecraft to that of a powerful chemical analysis tool, such as an electron microprobe, and has dramatically deepened our understanding of the geologic history of Mars.
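The idea of predicting a composition as a distribution over elements can be illustrated with one of the simplest label-distribution strategies, averaging the label distributions of the nearest neighbors in unit-cell space. The Python sketch below is illustrative only: the training values are hypothetical, the nearest-neighbor approach is a stand-in chosen for brevity, and it should not be read as the model of Morrison et al. (2018b). The Kullback-Leibler divergence is included as one common way to compare a predicted distribution with a measured one.

```python
import math

# Hypothetical training set: unit-cell parameters (a, b, c) paired with a
# composition expressed as a label distribution (fractions summing to 1).
train = [
    ((4.75, 10.20, 5.98), {"Mg": 0.90, "Fe": 0.10, "Ca": 0.00}),
    ((4.78, 10.30, 6.02), {"Mg": 0.60, "Fe": 0.40, "Ca": 0.00}),
    ((4.80, 10.40, 6.06), {"Mg": 0.30, "Fe": 0.65, "Ca": 0.05}),
    ((4.82, 10.48, 6.09), {"Mg": 0.05, "Fe": 0.90, "Ca": 0.05}),
]


def predict_distribution(cell, train, k=2):
    """Average the label distributions of the k nearest training samples."""
    nearest = sorted(train, key=lambda item: math.dist(cell, item[0]))[:k]
    elements = nearest[0][1].keys()
    mean = {el: sum(dist[el] for _, dist in nearest) / len(nearest)
            for el in elements}
    total = sum(mean.values())
    # Renormalize so the prediction is itself a valid distribution.
    return {el: value / total for el, value in mean.items()}


def kl_divergence(true, pred, eps=1e-12):
    """Kullback-Leibler divergence of a predicted from a true distribution."""
    return sum(p * math.log((p + eps) / (pred[el] + eps))
               for el, p in true.items() if p > 0)


predicted = predict_distribution((4.79, 10.35, 6.04), train)
error = kl_divergence(train[1][1], predicted)
```

In practice the unit-cell parameters would be scaled before distances are computed, and a held-out set of laboratory analyses would be needed to judge whether such a predictor is useful for a given mineral group.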

Next steps. This exploration was the initial inspiration that motivated us to create a framework for small and sparse data. In addition to developing that framework, we will also need to develop methods to evaluate the accuracy of predictions made by our data models. This evaluation will attempt to address sources of uncertainty and how they affect our predictions. The LDL evaluation method being developed will address uncertainty of measurement (instrument errors), uncertainty from sampling (various sampling strategies to train predictive models), and, most interestingly, scope compliance (Kläs 2018) of the LDL method.

Machine learning majorite barometer

Diamond-hosted majoritic garnet inclusions provide important insights into processes that occur in Earth’s deep mantle. Majoritic garnets provide the most accurate estimates for diamond formation pressures because laboratory experiments have shown that garnet chemistry varies as a function of pressure (Akaogi and Akimoto 1977; Irifune 1987; Collerson et al. 2010; Wijbrans et al. 2016; Beyer and Frost 2017; Thomson et al. 2021). Thomson et al. (2021) show that none of the available barometers in the literature reliably reproduces the pressures of experimentally synthesized majoritic garnet over the entire pressure-temperature-composition space investigated. Hence, they developed a barometer by using machine learning algorithms (specifically random forest regression) and experimental training data. This machine learning approach, tested with various cross-validation methods, produces a barometer with a much-improved fit to the experimental data, especially at the highest pressures and at extremes of composition space, and thus provides more reliable estimates of formation pressures of diamond-hosted majoritic inclusions. Applying the machine learning barometer to the global database of diamond-hosted inclusions reveals that their formation occurs over specific depth intervals that can be related to melting and decarbonation of subducted oceanic crust.

Next steps. While the machine learning approach improved the fit to the available experimental data, it also revealed regions in pressure, temperature, and most critically, composition space where the experimental data set is sparse. Because many of the mineral inclusions have compositions lying near or within sparse data regions, uncertainty remains as to whether the barometer is accurately capturing their pressure (and depth) of origin. Experiments can now be targeted to these specific P-T-X regimes for an even more improved barometer. Machine learning methods also can be used to predict the compositional variables that correlate most strongly with changes in pressure, leading to an improved crystal chemical and thermodynamic understanding of pressure-sensitive substitutions in garnet. These methods can also be applied to other mineral thermometers and barometers where large experimental data sets are fitted to extract thermodynamic solution parameters.

Comparison of mineral and protein metal clusters

Understanding the evolutionary stages of biology on a geological timescale is hampered by the propensity of organic matter to degrade within thousands of years without leaving physical fossil records. To understand how life evolved over the course of billions of years, proxy data are required.

At least five observations suggest that minerals can act as a source of proxy data from which to infer how biology evolved: (1) biology and geology are intimately connected, for instance, cellular organisms excrete minerals as metabolic end products [hazenite (Yang et al. 2011); greigite (Gorlas et al. 2018)]; and cellular organisms transmit electrons to and from minerals (Shi et al. 2016); (2) cellular organisms and minerals use transition metals (Fe, Mn, Co, Mo, Cu, V, W, Ni) to perform electron transfer reactions; (3) mineral surfaces are hypothesized and shown to be capable of prebiotic reactions similar to those that extant proteins perform (Wächtershäuser 1988; Novikov and Copley 2013); (4) minerals are similar to the rings of a tree in that they provide information (e.g., temperature, humidity, etc.) about the environment of formation; and (5) metal cluster structures of extant proteins were observed to be so similar to the structure of bulk mineral metal clusters as to be considered vestiges of minerals that were co-opted and assimilated into biological systems (Russell and Hall 1997; Nitschke et al. 2013; Zhao et al. 2020).

Access to large mineral and protein structure databases creates the potential to understand how mineral and protein metal clusters are connected. Connecting the mineral world with biology will allow a deeper understanding of how geology and biology co-evolved. Directly quantifying metal cluster similarity between minerals and proteins is a challenge because it requires comparing the finite cluster of a protein to the periodic lattice of a mineral. Solutions using graph-based methods have been proposed (Zhao et al. 2020; McGuinness et al. 2022). Each solution compared subgraphs of mineral and protein metal clusters; however, when metal coordination and mineral dimensionality (2D layer vs. 3D lattice) were not included, metal clusters were quantified as being highly similar (Zhao et al. 2020). Subsequent studies, building off the pioneering quantitative work of Zhao et al. (2020), included these chemically important characteristics and found that FeS minerals and proteins were significantly less similar (McGuinness et al. 2022) than previously proposed (Russell and Hall 1997; Nitschke et al. 2013). Even though McGuinness et al. (2022) show that FeS mineral lattices and protein metal clusters are not structurally similar, this method has not been applied to other metal types such as Ni or Cu. Applying the method developed by McGuinness et al. (2022) to additional metal types may help clarify the extent to which proteins and minerals co-evolved as cellular metabolism and minerals became more complex (Moore et al. 2017; Krivovichev et al. 2018).

Next steps. An additional step toward a potentially clearer understanding of how minerals and proteins are related is to compare mineral surface and protein metal cluster structures. Mineral surfaces expose the chemically active components that may have catalyzed biologically relevant products under hydrothermal conditions on early Earth (Novikov and Copley 2013). Comparing the surface properties of minerals to the chemical properties of protein metal clusters might elucidate the extent to which minerals acted as primitive enzymes at the dawn of life. Did biology co-opt the chemical configuration of the chemically active surface of minerals to reproduce the reactions that were possible abiotically? Or did biology incorporate and reconfigure metal building blocks (e.g., Fe2S2) to meet growing cellular needs? Answering these questions is challenging because mineral surfaces are complex, i.e., they are subject to structural relaxation, are chemically active, display complexly irregular surface topologies, and are affected by many solution conditions (pH, salinity, temperature, etc.). Alternatively, there also exists the possibility that protein metal clusters do not bear any significant resemblance to minerals (neither surface nor lattice structure), suggesting an alternative pathway and relationship between mineralogy and biology in which biology acts independently, only relying on minerals for the feedstock (i.e., metals) to nucleate the information-rich systems that remain far from equilibrium.

Table 1 is a non-exhaustive list of open access mineral data resources that are among the most widely used in the community. Note that many other useful and important mineral data resources are not yet available as open resources.

The global research community of mineralogy has made impressive progress on information models for database construction and data sharing in the past decades. From the point of view of data management, a good information model should be correct, complete, and consistent. An effective way for information modeling in real-world practice is to follow or adapt existing community agreements or standards on mineralogy, such as those on the physical, chemical, and biological characteristics of minerals. For instance, the Database of Mineral Properties (https://rruff.info/ima/) maintained by the International Mineralogical Association (IMA) keeps an up-to-date list of mineral species. The main components in the information model include mineral name, chemistry, mineral groups, origins, paragenetic mode, IMA status, relevant references, and links to external sources such as mindat.org, Google Images, and Wikipedia.
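The components listed above can be sketched as a small record type with a completeness check, one of the three qualities (correct, complete, consistent) named for a good information model. The class name, field names, and example values below are hypothetical and chosen for illustration; they are not the schema of the IMA Database of Mineral Properties.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class MineralRecord:
    """Hypothetical record mirroring the components named in the text."""
    name: str
    chemistry: str
    ima_status: str
    mineral_groups: List[str] = field(default_factory=list)
    origins: List[str] = field(default_factory=list)
    paragenetic_modes: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    external_links: Dict[str, str] = field(default_factory=dict)


def missing_fields(record):
    """List the components that are still empty (a completeness check)."""
    return [key for key, value in asdict(record).items()
            if value in ("", [], {})]


# Illustrative values only; the status string is a placeholder.
record = MineralRecord(
    name="forsterite",
    chemistry="Mg2SiO4",
    ima_status="status-placeholder",
    mineral_groups=["olivine"],
    external_links={"mindat": "https://www.mindat.org"},
)
gaps = missing_fields(record)
```

A check of this kind addresses completeness only; correctness and consistency would require controlled vocabularies and validation against the community standards discussed in this section.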

As open data and data-driven studies are increasingly accepted in the geoscience community, many databases in the field of mineralogy are also increasing the visibility of their information models and building machine interfaces for data query, access, and download. For instance, the RRUFF database (https://rruff.info, accessed 21 January 2023) has integrated records of Raman spectra, X-ray diffraction, and chemistry data for minerals. The user interface enables data query through mineral name and chemistry includes/excludes. Interested users can also contact the database manager for batch data download and sharing. Mindat.org (https://www.mindat.org, accessed 21 January 2023) is another widely used database in the field of mineralogy. Its construction and maintenance follow a crowd-sourcing style. Besides the physical and chemical attributes of mineral species, a unique attribute on mindat.org is a comprehensive list of the localities where each mineral species has been found. In the past years, many research activities have benefited from the open data shared by mindat.org. As each of those open databases has its own focus and information model, scientists in large-scale research activities often need to collect data from multiple sources. Recently, researchers in geoinformatics and data science have also discussed the need for a more comprehensive mineral information model to document the extensive facets of mineral data, such as the Global Earth Mineral Inventory (GEMI) proposed by Prabhu et al. (2021b). Complementing these efforts are initiatives using semantic technologies to build knowledge graphs for mineral species, as a preparation to explore new ways for annotating and discovering mineral data shared on the Internet (Brodaric and Richard 2020).

The FAIR (findable, accessible, interoperable, and reusable) data principles (Wilkinson et al. 2016) are now widely accepted in geoscience. Information models are an important part of FAIR data. More community efforts, such as through IMA, the Mineralogical Society of America (MSA), and the Geoinformation Committee of the International Union of Geological Sciences (IUGS-CGI), are needed to promote the quality and usefulness of the model outputs.

The previous sections of this paper (and many other informatics papers focusing on various domains) have clearly emphasized the value that informatics methods provide to their respective domains (Collen 1986; Lord et al. 2004; Goble and Stevens 2008; Gauthier et al. 2019; Heberling et al. 2021). However, a point often missed or overlooked in scientific literature discussions is that innovations in data science and informatics are usually driven by diverse data sets available in various domains and the needs of the use-cases utilizing those data sets. In this section we discuss some of the interesting data science challenges we have observed while working with mineral data to try to answer some of the unanswered questions in geoscience.

In the following section we summarize four examples of mineral data challenges that provide interesting and unique problems that limit the usability of existing machine learning methods meant to extract meaningful information from data.

Small and sparse data framework

It has been widely publicized that we live in the “Age of Big Data” (Borgman et al. 2008; Lohr 2012; Wise and Shaffer 2015; Yu 2016; Wachter 2019), and understandably there has been a lot of research into scaling up the algorithms, methods, software, and hardware needed to enable the exploration and use of very large data sets to gain valuable information. This focus has led to the creation and constant improvement of “big data frameworks,” which provide a roadmap on how to work with large data sets. However, mineralogy, along with many other fields in Earth and planetary sciences, provides a plethora of small and sparse data sets that do not fall into the realm of big data. These data sets therefore require the application of methodologies that lie outside the focus of traditional big data researchers. The next major hurdle for mineral informatics (and geoinformatics in general) is to work toward creating a framework for small and sparse data.

For example, mineral data collected by the CheMin X-ray diffractometer onboard the Mars Science Laboratory (Morrison et al. 2018c; Rampe et al. 2018) have few data points, having analyzed ~40 samples, each with around a dozen mineral species (as of January 2022). The CheMin team used small (on the order of dozens to a few hundred data points) data sets of mineral composition and associated unit-cell parameters to build models capable of predicting the basic chemical composition of major mineral phases observed on Mars based solely on their unit-cell parameters (Morrison et al. 2018b, 2018c). However, the team wished to push their chemical prediction further, to predict complex, multi-element mineral compositions for the martian crystallographic data. To do so, Morrison et al. (2018b) assembled data sets of laboratory-analyzed complex, multi-element mineral compositions and unit-cell parameters, which contained only a few hundred data points for each of the major mineral groups identified by CheMin. Morrison et al. (2018b) used the small data Label Distribution Learning approach to predict complex chemical compositions (up to 12 elements, in some mineral systems) of mineral samples collected by the CheMin instrument based on the unit-cell parameters of these samples. Significantly more work can be done here to increase the accuracy and performance of these models and such complex data sets with small sample sizes provide an interesting and rare challenge to data scientists.

Mineral geochemistry often contains information related to the geologic, chemical, and/or biological processes and materials that went into a mineral’s formation and any subsequent weathering and alteration. However, geochemical data are inherently sparse due to chemical variability in geologic deposits and materials, different elemental affinities among different mineral species, and analytical bias introduced by research aims or instrument limitations. The resulting frequency of “missing values” makes many geochemical data sets unsuitable for use with existing algorithms designed for complete or near-complete data sets. A prime example of the sparseness of geochemical data is the garnet data set compiled by Chiama et al. (2020), which contains over 95 000 geochemical analyses of garnet group mineral samples collected from various sources, ranging from large repositories (EarthChem, RRUFF, MetPetDB) to individual peer-reviewed literature. Even a compiled and curated data set such as this is considered sparse, largely due to the chemical variability among the various garnet mineral species, resulting in missing values in the chemical compositions of these samples (Chiama et al. 2022). For example, of the 95 000 analyses compiled, only five major elements (Mg, Fe, Ca, Al, and Si) are present and/or reported in most samples, while other elements, including Mn, Cr, and Ti, are much less common throughout the data set. An additional contribution to this sparseness is that studies may not analyze for all elements in a sample (e.g., limited to elements of interest, difficulty measuring light elements), resulting in missing values for which it is not known whether that element is present. Thus, while analyzing these data (using descriptive, prescriptive, or predictive methods) we need to consider these missing values and their effect on the results. Sparse data are not a problem new or unique to mineral data (Greenland et al. 2000, 2016; Sweeting et al. 2004; Rogers et al. 2018), but, as is the theme for the rest of this paper, we must learn from the successes and failures of other domains in addressing sparse data (Katz 1987; Shepperd and Cartwright 2001; Uzuner 2009; Derczynski et al. 2013).
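A first practical step with such a data set is to quantify how sparse it is before choosing a method. The Python sketch below uses a handful of hypothetical garnet analyses (the values are invented and are not taken from Chiama et al.) to compute the fraction of missing values per element and to show how many analyses survive if only complete cases are kept. `None` is used here to mean "not reported," which, as noted above, is not the same as "absent."

```python
MAJOR = ["Mg", "Fe", "Ca", "Al", "Si"]
MINOR = ["Mn", "Cr", "Ti"]

# Hypothetical analyses; None marks an element that was not reported.
analyses = [
    {"Mg": 10.1, "Fe": 20.3, "Ca": 3.2, "Al": 11.0, "Si": 18.5,
     "Mn": 0.4, "Cr": None, "Ti": None},
    {"Mg": 12.4, "Fe": 15.8, "Ca": 4.1, "Al": 10.7, "Si": 18.9,
     "Mn": None, "Cr": 1.2, "Ti": None},
    {"Mg": 8.3, "Fe": 22.0, "Ca": 5.0, "Al": 11.2, "Si": 18.2,
     "Mn": 0.9, "Cr": None, "Ti": 0.1},
    {"Mg": 9.0, "Fe": 21.1, "Ca": None, "Al": 10.9, "Si": 18.4,
     "Mn": None, "Cr": None, "Ti": None},
]


def missing_fraction(rows, elements):
    """Fraction of rows in which each element is missing."""
    return {el: sum(row.get(el) is None for row in rows) / len(rows)
            for el in elements}


def complete_cases(rows, elements):
    """Rows that report a value for every requested element."""
    return [row for row in rows
            if all(row.get(el) is not None for el in elements)]


fractions = missing_fraction(analyses, MAJOR + MINOR)
usable_major = complete_cases(analyses, MAJOR)
usable_all = complete_cases(analyses, MAJOR + MINOR)
```

In this toy case three of the four analyses are complete for the major elements but none is complete for all eight, which illustrates why algorithms that require complete rows discard most of a sparse geochemical data set.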

Other examples of small and sparse data challenges can be encountered in efforts to understand other planets and moons including Venus and Titan through their mineralogy and geochemistry. Frigid Titan’s exotic mineralogy, with water ice as a principal rock-forming mineral, oceans of liquid hydrocarbons, and varied postulated organic minerals, is mostly understood through laboratory analogs (Fegley et al. 1992; Bullock and Grinspoon 1996; Hashimoto and Abe 2005; Treiman and Bullock 2012; Gilmore et al. 2017; Hazen 2018; Maynard-Casely et al. 2018; Zolotov 2018; Cable et al. 2021).

Small and sparse data sets are a common occurrence in Earth and planetary science. Despite the limitations of the available information, the answers to key scientific questions are tied to these data sets. Therefore, an effort to create a framework to handle small and/or sparse data will be highly beneficial to scientific research in Earth and planetary science. Many researchers are working on “high-dimensional, small sample size” (HDSSS) or “high-dimensional, low sample size” (HDLSS) data and their use in data analytics (Hall et al. 2005; Golugula et al. 2011; Yata and Aoshima 2012; Liu et al. 2017; Shen et al. 2016). However, this area of research has received much less attention than its big data counterpart, and hence has lacked the synthesis and generalization that come with the popularity and maturity of well-established fields. The aforementioned examples clearly demonstrate how such a framework would open paths for exploring very important scientific questions within and beyond mineralogy.

Data discovery

An increasing trend in data science in recent years is doing research with open data shared by others (Fox and Hendler 2014). Several recent scientific advances in mineral informatics also reflect that trend (Hazen et al. 2019). From the point of view of data users, an ideal situation is that they can efficiently find data portals on the Internet, data sets on the portals, or subsets of the data. In comparison, data providers and data managers need to organize the data with shared community standards, detailed metadata, and persistent and stable facilities to increase reusability. As illustrated in the FAIR data principles for open data (Wilkinson et al. 2016), the first two key points to consider are the findability and accessibility of data. Correspondingly, three key technical items arise here. The first item is the metadata schema for describing the data sets. While there are many common-purpose metadata schemas, such as the Dublin Core, for describing data sets, for domain-specific data such as those in mineralogy there can also be specific metadata elements. The second item is the identifier for the data sets. Similar to the digital object identifier (DOI) for publications, data sets shared on the Internet should also have specific identifiers to enable persistent and stable discoverability. The third item with respect to findability and accessibility is the protocol for retrieving metadata through the identifier of data sets. Community efforts such as DataCite (Brase 2009) have made solid progress toward that goal. Nevertheless, the wide implementation of those best practices for open data in geosciences, including mineralogy, still needs more time. It is also important to remember that appropriate scientific credit must be given at every stage of informatics methodology, from the acquisition of data to data analytics, and finally the dissemination of the research products produced by the data analysis.

A very recent technical development regarding data discovery is the Data set Search Engine released by Google (Brickley et al. 2019), which is able to index millions of data sets on thousands of data portals, including their identifiers or web links. End users of the data set search engine (https://datasetsearch.research.google.com, accessed 21 January 2023) have integrated access to thousands of data portals. When a data set is found on the engine, users can go to its original data portal page through the identifier or web link and then download. The Google Data set Search Engine is built on top of Schema.org, which is designed as a comprehensive metadata schema for annotating digital objects on the web. The annotated objects, such as data sets, will then be indexed by the search engines. As its usage is expanding, Schema.org also provides space for extending the metadata elements of certain objects. A potential here is to have specific metadata elements designed for data sets of mineralogy, and this should be based on community collaboration. In the past few years, the EarthCube community has leveraged a list of open geoscience data portals to develop the GeoCODES search engine (https://geocodes.earthcube.org, accessed 21 January 2023). It is also based on Schema.org but has made extensions specifically for the registration and discovery of geoscience data. Any future efforts on the findability and accessibility of open mineralogy data can significantly benefit from the technical structure and experience of GeoCODES. Community agreements and standards, such as those developed by IMA, MSA, and IUGS-CGI, as well as best practices in existing data portals, such as those in RRUFF and mindat.org, will also be helpful to enrich the metadata of open mineralogy data.
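To make the role of Schema.org annotation concrete, the Python sketch below assembles a JSON-LD description of a data set using the Schema.org `Dataset` type and a few of its standard properties (`name`, `description`, `url`, `identifier`, `keywords`, `variableMeasured`). The data set, URL, and identifier are placeholders invented for this example, and no mineralogy-specific extension is implied; such elements would need to come from community agreement.

```python
import json


def dataset_jsonld(name, description, url, identifier, keywords, variables):
    """Build a Schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": keywords,
        "variableMeasured": variables,
    }


# Placeholder values for a hypothetical data set.
record = dataset_jsonld(
    name="Example mineral locality table",
    description="Hypothetical table of mineral species and localities.",
    url="https://example.org/datasets/mineral-localities",
    identifier="example-identifier-0001",
    keywords=["mineralogy", "mineral localities"],
    variables=["mineral name", "locality", "latitude", "longitude"],
)

# Markup of this form is embedded in a data set landing page so that
# crawlers can read the annotation.
markup = ('<script type="application/ld+json">\n'
          + json.dumps(record, indent=2)
          + "\n</script>")
```

In a real deployment the identifier would be a persistent one, such as a DOI, so that the annotation also supports the identifier and retrieval items discussed in this section.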

Data processing

Dozens of data repositories contain a wealth of mineralogical information from which large data resources can be extracted. Web-scraping algorithms allow for the retrieval and storage of large amounts of data from web sources (Glez-Peña et al. 2014; Zhao 2017). Scraping algorithms in scripting languages such as Python or R allow users to extract and compile large amounts of data from web sources or journal articles in minutes or seconds, but the structure (or lack thereof) of webpages can slow the production of new data resources. Open-access mineral databases tend to be very contributor friendly; thus, users can pick and choose which data to include for a particular entry. Recognizing the inconsistencies in the storage and representation of mineral attribute data within and across different mineralogical databases is essential when compiling large mineral data sets from open data sources.

Webpages associated with Webmineral and Mindat have hierarchical structures made up of Hypertext Markup Language (HTML), Extensible Markup Language (XML), or Cascading Style Sheets (CSS) elements that allow for the selection of nodes that can contain specific data needed by the user (Gunawan et al. 2019). The ubiquity or rarity of a mineral, the relative interest in it among the scientific community, and the age of its discovery cause significant differences in the amount of information available for a mineral, driving the differences in the structure of these web pages. Webpages and digital PDFs associated with the Handbook of Mineralogy (Anthony et al. 1990–2003) have very little structure, which places more importance on the use of keywords (e.g., space group or crystal system) or separators (e.g., each mineral attribute or property introduced may have a semicolon preceding the associated description) in the compilation of data. Nested conditional statements (i.e., if-else statements) are useful for compiling data from web databases that have variable or no structure, but this approach can be more time-consuming and prone to error. Some headers may be reused, such as “beta (β),” which is used as a descriptor of the refractive indices in biaxial minerals (Frazier et al. 1963; Gunter and Ribbe 1993) and can also refer to the geometry of the unit cell (Grove and Hazen 1974; Nesse 2013).
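The keyword-and-separator approach, including the reused "beta" header, can be sketched as follows. The input string is a hypothetical semicolon-separated entry written for this example, not text from any of the resources named above, and the rule used to tell a refractive index from a unit-cell angle (a simple magnitude threshold) is an assumption that would need checking against real entries.

```python
import re

# Hypothetical, loosely structured entry with semicolon separators.
entry = ("Crystal system: Monoclinic; Space group: P21/c; "
         "a = 9.74; b = 8.90; c = 5.25; beta = 105.6 deg; "
         "alpha = 1.67; beta = 1.68; gamma = 1.70")

# Headers that may describe either a unit-cell angle or a refractive index.
AMBIGUOUS = {"alpha", "beta", "gamma"}


def parse_entry(text):
    """Split on separators, then branch on the form of each chunk."""
    record = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        if ":" in chunk:
            # Keyword attributes such as "Space group: ...".
            key, value = (part.strip() for part in chunk.split(":", 1))
            record[key.lower()] = value
        elif "=" in chunk:
            key, value = (part.strip() for part in chunk.split("=", 1))
            match = re.match(r"[-+]?\d+(?:\.\d+)?", value)
            if match is None:
                continue  # no leading number, so skip the chunk
            number = float(match.group())
            if key in AMBIGUOUS:
                # Assumed heuristic: small values are refractive indices,
                # larger values are angles in degrees.
                kind = "refractive_index" if number < 10 else "cell_angle"
                key = f"{kind}_{key}"
            record[key] = number
    return record


parsed = parse_entry(entry)
```

Even in this small case the parser needs three levels of branching, which illustrates why compilation from weakly structured sources is time-consuming and why each rule should be spot-checked by a domain scientist.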

Quantifying and correcting bias

Critical to all of these aspects of data resource development and use is an understanding of and, where possible, modeling of the biases that exist in each of these systems. For example, significant biases occur in mineral sampling based on the physical appearance of the phase (e.g., large, brightly colored, euhedral crystals), the economic value, the scientific interest, proximity to major universities or research centers, and analytical technology. Such biases can be corrected with models of each of these parameters (Hazen et al. 2015; Grew et al. 2017; Hystad et al. 2019). Natural preservational biases are more complex, as they involve geologic history and mineral properties (e.g., chemistry, solubility, hardness), but work is underway to begin unraveling the history of preservational biases in mineral systems on Earth and other planetary bodies (Liu et al. 2019).

Research in the field of informatics is heavily dependent on interactions between data scientists and domain scientists (e.g., mineralogists and planetary scientists) (Ma et al. 2017). Conducting and applying informatics research is very much a socio-technical system (Fischer and Herrmann 2011). It is as much about the researchers, their interactions, the hypotheses generated, and the interpreting of results from visualizations or models as it is about the data, the algorithms, and the models. Collaborations in informatics include many iterations between data and domain scientists, starting from data explorations and problem formulation to interpreting the results and documenting the scientific insights learned from the data.

We recommend starting an informatics exploration with an in-person or virtual “datathon” (Anslow et al. 2016; Fritz et al. 2020). During this datathon, which usually lasts a day or two, collaborators mainly focus on nine aspects, as follows. (1) Interactions and discussions among data scientists and domain scientists to frame their goals and expectations. (2) Document the research questions to be explored. (3) Collate the data resources required to explore the documented research questions. (4) Explore the methods needed to examine the data (both analytically and visually). (5) Construct a roadmap for dividing the research question and tasks into smaller, more tractable parts. (6) Leverage descriptive, prescriptive, and predictive methods to gain preliminary insights from the data. (7) Form short-term and long-term goals based on the preliminary results. (8) Document the shortcomings of the methods explored and why these roadblocks hamper scientific exploration. And (9) document the innovation needs of both data science and domain science methods to overcome the previously documented hurdles.

Not all of these steps need to be done during the two-day datathon; steps 1 and 2 can be completed beforehand. The main goal of conducting a datathon is to expedite and streamline the initial data exploration to gain preliminary results that can be examined by the domain scientist, while also allowing the data scientist to explore and understand the intricacies of the data at hand. Additionally, all collaborators gain an understanding of the shortcomings, needs, and opportunities of their data and of the current methods to address the desired scientific questions. This inventory of needs and opportunities in both the data science and domain science can result in a datathon output of a list of projects and publications spurred by the creative and iterative processes of this closely collaborative effort.

After the initial datathon, each collaborator (or group of collaborators) has a plan of action for the projects and subtasks within the project they are leading. Subsequent communication and collaboration usually follow the preferred working model of the team. For example, the group may hold weekly meetings to discuss advances in the project, or the team may communicate by email for the same purpose. The steps taken after the datathon and the methods used to communicate and collaborate change depending on the work style and comfort levels of the collaborators. General recommendations for this step include “science of team science” best practices advocated by many communities (National Research Council 2015).

Durable and information-rich, minerals are the only ancient relics that offer direct, solid glimpses of eons of planetary transformation (Hazen et al. 2022). It is important to extract the abundance of information contained in these mineral samples to improve our understanding of the evolution of our planet, our solar system, and the role our planet’s evolving geosphere played in the origin and proliferation of life. Key synergistic aspects of the ongoing paradigm shift in mineralogy include systematic efforts to collect and curate mineralogical information in data resources that enable open and widespread dissemination, and the use of those data to make scientific discoveries.

As mentioned earlier in this paper, informatics methods have been followed, implemented, and improved upon in other fields over the past decades. The concept of “X-informatics” has also been around since its first conceptualization in 2007 (Gray and Szalay 2007; Hey 2009), and over the past decade there has been a steady decline in researchers conducting informatics research in the silos of their respective fields. When planning for a new paradigm like mineral informatics, it is important to learn from successes and failures of more mature fields of informatics (Lord et al. 2004; Goble and Stevens 2008; Heberling et al. 2021) and modify the methods developed by past researchers to apply them to comprehensively address our needs as a community.

Over the last decade, there have been some efforts at collating various data resources in the geosciences and providing these data to researchers with minimal barriers and maximum interoperability. These efforts include OneGeology (Jackson 2010), OneGeochemistry (Chamberlain et al. 2021; Wyborn et al. 2021), and OneStratigraphy (Wang et al. 2021). The OneGeochemistry initiative also includes plans to develop best practices for FAIR geochemical data, governance models to ensure participation and trust, and a business model to ensure long-term sustainability (https://www.earthchem.org/communities/onegeochemistry/; accessed 21 January 2023). Efforts to improve the access, usage, and impact of mineral data resources can learn from the successes and challenges faced by such global initiatives. Developing a set of best practices and recommendations for creating, linking, and releasing mineral data would improve the mineral data landscape and make it easier for researchers to produce and use mineral data without too many barriers.

Alongside increasing the findability, accessibility, interoperability, and reusability of mineral data and improving other important aspects of mineral data management and stewardship, obtaining scientific insights from mineral data using data-driven methods is another key facet of mineral informatics. For this, too, we can look to and learn from the successes and failures of other domains in applying informatics methods to answer their research questions. We hope the research directions documented in this paper for informatics, mineralogy, planetary science, and other related fields that use mineral data act as an initial step toward the ultimate goal of systematizing data-driven scientific exploration using mineral data.

Mineralogy is facing new opportunities and challenges with the increased interest in and applications of data-driven methods. We believe the next paradigm for the field of mineralogy is that of mineral informatics. Mineral informatics focuses on deciphering the patterns and trends hidden in mineralogical, geochemical, and related data and using these patterns to answer scientific questions, thus making important new discoveries. In this paper, we have shown how the study of minerals is essential to improving our understanding of the evolution of our planet, our solar system, and more. We have presented a broad methodology for the study and use of mineral informatics methods and documented the needs of the field and important scientific questions that may be answered using mineral informatics. We reiterate that a symbiotic relationship between data scientists and domain scientists (e.g., mineralogists, planetary scientists, biologists) is needed to make continuous and sustainable scientific progress.

In summary, our vision for the next decade of mineralogical research is built upon the systematic and coordinated study of mineral data and of the data science methods used to gain scientific insights.

We thank Editor Don Baker, Associate Editor Jennifer Kung, and the two anonymous reviewers for their thorough, thoughtful, and constructive reviews. This publication is a contribution to the 4D Initiative and the Deep-time Digital Earth (DDE) program. Studies of mineral evolution and mineral ecology have been supported by the Alfred P. Sloan Foundation, the W.M. Keck Foundation, the John Templeton Foundation, NASA Astrobiology Institute (Cycle 8) ENIGMA: Evolution of Nanomachines In Geospheres and Microbial Ancestors (80NSSC18M0093), a private foundation, and the Carnegie Institution for Science. Any opinions, findings, or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the National Aeronautics and Space Administration.

Open access: Article available to all readers online.