Faced with ongoing depletion of near-surface ore deposits, geologists are increasingly required to explore for deep deposits or those lying beneath surface cover. The result is increased drilling costs and a need to maximize the value of the drill hole samples collected. Laser ablation-inductively coupled plasma-mass spectrometry (LA-ICP-MS) analysis of pyrite is one tool that is showing promise in deep exploration. Since the trace element content of pyrite approximates the composition of the fluid from which it precipitated and the crystallization mechanism, the trace element characteristics can be used to predict the type of deposit with which a pyritic sample is associated. This possibility, however, is complicated by overlapping trace element abundances for many deposit types. The solution lies with simultaneous comparison of multiple trace elements through rigorous statistical analysis. Specifically, we used LA-ICP-MS pyrite trace element data and Random Forests, an ensemble machine learning supervised classifier, to distinguish barren sedimentary pyrite and five ore deposit categories: iron oxide copper-gold (IOCG), orogenic Au, porphyry Cu, sedimentary exhalative (SEDEX), and volcanic-hosted massive sulfide (VHMS) deposits. The preferred classifier utilizes in situ Co, Ni, Cu, Zn, As, Mo, Ag, Sb, Te, Tl, and Pb measurements to train the Random Forests. Testing of the Random Forests classifier using additional data from the same deposits and sedimentary basins (test data set) yielded an overall accuracy of 91.4% (94.9% for IOCG, 78.8% for orogenic Au, 81.1% for porphyry Cu, 93.6% for SEDEX, 97.2% for sedimentary pyrite, 91.8% for VHMS). 
Similarly, testing of the Random Forests classifier using data from deposits and sedimentary basins that did not have analyses in the training data set yielded an overall accuracy of 88.0% (81.4% for orogenic Au, 95.5% for SEDEX, 90.0% for sedimentary pyrite, 73.9% for VHMS; insufficient data were available to perform a blind test on porphyry Cu and IOCG). The performance of the classifier was further improved by instituting a criterion (at least 40% of total votes from the Random Forests needed for a conclusive identification) to remove uncertain or inconclusive classifications, increasing the classifier’s accuracy to 94.5% for the test data (94.6% for IOCG, 85.8% for orogenic Au, 87.8% for porphyry Cu, 95.4% for SEDEX, 98.5% for sedimentary pyrite, 94.6% for VHMS) and 93.9% for the blind test data (85.5% for orogenic Au, 96.9% for SEDEX, 96.7% for sedimentary pyrite, 84.6% for VHMS).

The Random Forests classification models for pyrite trace element data can be used as a predictive modeling tool in greenfield terrains by providing an accurate indication of ore deposit type. This advance will assist mineral explorers by allowing early implementation of predictive ore deposit models when prospecting for ore deposits. Furthermore, the ability of the classifier to accurately identify pyrite of sedimentary origin will allow researchers interested in paleoenvironmental conditions of ancient oceans to effectively screen prospective samples that are affected by a hydrothermal overprint.

The correct classification of ore deposits in the early stage of an exploration project can greatly enhance the efficiency of exploration, as it allows for the early application of predictive geologic models. This improvement is especially important when exploring beneath cover due to the increased costs of drilling deep drill holes and when the surface geology or geochemistry fails to reveal details about the deposit at depth. For example, minor disseminated pyrite in a sericite alteration zone intersected in a drill hole under cover could be related to a porphyry Cu outer halo, a volcanic-hosted massive sulfide (VHMS) system footwall alteration zone, a high-sulfidation epithermal Au zone, or barren pyrite unrelated to an ore system. Each of these mineralization types demands a different approach to exploration. Knowing which type is present can save exploration time and money.

Laser ablation-inductively coupled plasma-mass spectrometry (LA-ICP-MS) allows for the determination of the trace element content of individual minerals. These data are useful because different ore deposit types have different fluid sources, metal sources, and depositional mechanisms, all of which can significantly affect the trace element content of the minerals that precipitate from them (Gregory et al., 2014; Tardani et al., 2017). Furthermore, these trace elements can be preserved in their mineral hosts during successive hydrothermal and metamorphic events. In this study we focus on pyrite because it is present in many different types of ore deposits, its trace element content can be preserved up to midgreenschist facies (Large et al., 2009), and there are large data sets available that provide (background) trace element contents of pyrite formed in sedimentary environments without hydrothermal inputs (Large et al., 2014, 2015a; Gregory et al., 2015a). To achieve our objective, LA-ICP-MS analyses of pyrite from a series of different deposit types (iron oxide copper-gold [IOCG], orogenic Au, porphyry Cu, sedimentary exhalative [SEDEX], VHMS deposits, and barren sedimentary pyrite) were used to train a Random Forests classifier to predict deposit type using pyrite LA-ICP-MS analyses. The utility extends to the paleoceanography community, because the presence or absence of hydrothermal overprints/contributions is often unclear (Gregory et al., 2017), thus eroding confidence in reconstructions of ancient conditions in the oceans.

Random Forests, a supervised classification algorithm, has proven to be an ideal choice for accurately predicting categories from multivariate input features across a wide range of data sets (Fernández-Delgado et al., 2014), but it has only rarely been applied to economic geology problems. While notable exceptions exist, such as identifying zones of hydrothermal alteration and host-rock types (Cracknell et al., 2014) and modeling of mineral prospectivity (e.g., Rodriguez-Galiano et al., 2014; Carranza and Laborte, 2015), many other opportunities remain untested. Additionally, only one previous study (O’Brien et al., 2015) used Random Forests analysis of the trace element contents of individual mineral phases (i.e., gahnite), despite the large amount of multielement geochemistry data generated in recent years by LA-ICP-MS. In this contribution we provide a proof of concept—that is, we show how the Random Forests method can be used to classify ore deposit type both as an exploration tool and as a means of identifying samples most representative of primary marine conditions uncompromised by secondary overprints.

Supervised classification

The concept of supervised classification can be thought of as linking input features to target classes via a discrimination function y = f(x). Input features x are represented as m vectors of the form {x1,…,xm}, and y is a finite set of c class labels {y1,…,yc}. Given N instances of x and y, supervised classification attempts to train a classification model f based on a limited number of training samples (Gahegan, 2000; Hastie et al., 2009; Kovacevic et al., 2009).

In general, there are three stages to supervised classification: (1) data preprocessing, (2) classifier training, and (3) prediction evaluation (Cracknell and Reading, 2014). Data preprocessing involves compiling, correcting, and transforming inputs to a representative set of features containing information relevant to the classification problem (Guyon, 2008; Hastie et al., 2009). Classifier training usually requires the adjustment and selection of one or more parameters, specific to a given supervised classifier, that optimize performance on a given set of input features and target classes (Guyon, 2009). The selection of relevant features necessarily reduces the dimensionality of the input data, thus speeding up processing time while also facilitating interpretations of the relationships between categories and features (Cracknell et al., 2014). Prediction evaluation is vital for assessing the validity of classification outcomes and is typically carried out using a test data set not previously seen by the classifier. An assessment of test data and blind test classifications through a confusion matrix and standard classification metrics—such as overall accuracy, recall, and precision—provides an unbiased indication of the performance of trained classifiers (Congalton and Green, 1998).
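The evaluation metrics named above can be illustrated with a minimal Python sketch. The function names and the toy labels are hypothetical and are not data or software from this study; the sketch only shows how overall accuracy, recall, and precision follow from a confusion matrix.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Count occurrences of each (true, predicted) label pair."""
    return Counter(zip(true_labels, predicted_labels))

def overall_accuracy(cm):
    """Fraction of all samples whose predicted label equals the true label."""
    total = sum(cm.values())
    correct = sum(n for (t, p), n in cm.items() if t == p)
    return correct / total

def recall(cm, label):
    """Of the samples that truly belong to `label`, the fraction predicted as `label`."""
    actual = sum(n for (t, _), n in cm.items() if t == label)
    return cm[(label, label)] / actual if actual else float("nan")

def precision(cm, label):
    """Of the samples predicted as `label`, the fraction that truly belong to it."""
    predicted = sum(n for (_, p), n in cm.items() if p == label)
    return cm[(label, label)] / predicted if predicted else float("nan")

# Hypothetical toy labels, for illustration only.
y_true = ["VHMS", "VHMS", "SEDEX", "SEDEX", "SEDEX", "IOCG"]
y_pred = ["VHMS", "SEDEX", "SEDEX", "SEDEX", "VHMS", "IOCG"]
cm = confusion_matrix(y_true, y_pred)
```

In this toy case four of six labels match, so the overall accuracy is 4/6, while the recall for SEDEX is 2/3 because one of the three true SEDEX samples was assigned to VHMS.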

Random Forests

Random Forests (Breiman, 2001) is an ensemble supervised classifier that generates predictions based on a majority vote cast by multiple randomized decision trees, known as a forest. Randomness is introduced by randomly subsetting a number of input features to split at each node of a decision tree and by bagging (bootstrap aggregation). Bagging (Breiman, 1996) generates training data for a single decision tree by sampling, with replacement, a number of samples equal to the number of instances in the training data. The Gini index is used by Random Forests to determine a best split threshold at each node of a decision tree. The Gini index is defined as

$$G_j = 1 - \sum_{c} g_c^{2},$$

where $g_c$ is the probability or the relative frequency of class c at node j and is given by

$$g_c = \frac{n_c}{n},$$

where $n_c$ is the number of samples belonging to the class c, and n is the total number of samples within a particular node. For each candidate split, the threshold that defines maximum reduction in class heterogeneity of the resulting child nodes is selected (Breiman, 1984; Waske et al., 2009).
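As a minimal sketch of the split selection described above, the following Python code computes the standard Gini impurity (one minus the sum of squared class frequencies at a node) and searches a single feature for the threshold that gives the largest decrease in impurity. The function names and the toy Mo values are hypothetical; a full Random Forests implementation would repeat this search over a random subset of features at every node.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum(g_c^2), with g_c = n_c / n at the node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, impurity_decrease) for the best binary split on one feature."""
    parent = gini(labels)
    n = len(labels)
    best = (None, 0.0)
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Child impurities are weighted by the share of samples in each child.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        decrease = parent - weighted
        if decrease > best[1]:
            best = (t, decrease)
    return best

# Hypothetical Mo concentrations (ppm) and class labels, for illustration only.
mo = [0.5, 1.0, 1.5, 20.0, 25.0, 30.0]
cls = ["VHMS", "VHMS", "VHMS", "SEDEX", "SEDEX", "SEDEX"]
threshold, decrease = best_threshold(mo, cls)
```

For these toy values the parent node has an impurity of 0.5, and the split at 10.75 ppm separates the two classes completely, so the impurity decrease equals the full 0.5.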

In addition to a label indicating a predicted class for a given sample, Random Forests produces class membership probabilities. These occur in the form of a vector p comprising probabilities for individual predictions representing the proportion of decision trees that predict candidate classes.
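The vector p can be illustrated with a short Python sketch in which each tree casts one vote and the class membership probability is the share of trees voting for each class. The function name and the 10-tree example are hypothetical and serve only to show the calculation.

```python
from collections import Counter

def class_membership(tree_votes, classes):
    """Proportion of trees voting for each class, in the order given by `classes`."""
    counts = Counter(tree_votes)
    n_trees = len(tree_votes)
    return [counts[c] / n_trees for c in classes]

classes = ["IOCG", "orogenic Au", "porphyry Cu", "SEDEX", "sedimentary", "VHMS"]
# Hypothetical votes from a 10-tree forest for a single pyrite analysis.
votes = ["SEDEX"] * 6 + ["VHMS"] * 3 + ["sedimentary"]
p = class_membership(votes, classes)
# The predicted label is the class with the largest share of votes.
predicted = classes[p.index(max(p))]
```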

Data preprocessing was primarily executed in standard spreadsheet software (Microsoft Excel), with Random Forests classifier training and prediction evaluation conducted in the open source data mining software platform Orange version 3.18 (Demsar et al., 2013).

LA-ICP-MS data sources and preprocessing

This project arose from two major programs of pyrite analysis funded by the Geological Survey of Western Australia (Belousov et al., 2016) and the Geological Survey of South Australia (D. Gregory, unpub. report, 2015), where pyrite from a large number of ore deposits in both states was analyzed. Additional data from various ore deposits have been analyzed subsequently, leading to the current database of 3,579 pyrite analyses (Figs. 1, 2). LA-ICP-MS data were provided from a number of different sources, including published peer-reviewed manuscripts (Maslennikov et al., 2009, 2017; Large et al., 2014, 2015b; Revan et al., 2014; Gregory et al., 2015a, b, 2016, 2017; Gadd et al., 2016), project reports (G. Davidson, unpub. report, 2005; D. Gregory, unpub. report, 2015), Ph.D. theses (Maier, 2011), and new, previously unreported data from the Chalkidiki porphyry Cu district, Greece, and the Lady Loretta SEDEX deposit, Australia.

All pyrite analyses except those taken from Gadd et al. (2016) were conducted at the LA-ICP-MS facility located at the University of Tasmania, Australia; however, spot size and the number of standards varied. Detailed analytical procedures are available in the references in Table 1. All samples (except for the Gadd et al., 2016, data, which lacked Te and Au) were analyzed for Co, Ni, Cu, Zn, As, Mo, Ag, Sb, Te, Au, Tl, and Pb, and these are the elements emphasized here. When analyses were below detection limits, either half the detection limit was used or the value was taken from the referenced literature source. Because Gadd et al. (2016) did not report Te or Au, we used average values for these elements from the Lady Loretta SEDEX deposit. These values were assumed to be reasonable estimates, as these elements are commonly below detection in SEDEX deposits. Analyses were conducted on 2.5-cm-diameter polished laser mounts.
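The half-detection-limit substitution mentioned above can be sketched in Python as follows. This covers only the first of the two options described (it does not reproduce values taken from literature sources), and the function name, the use of None for a below-detection result, and the Te numbers are assumptions made for illustration.

```python
def substitute_below_detection(value, detection_limit):
    """Replace a below-detection measurement with half the detection limit.

    `value` is None when the element was reported as below detection.
    """
    if value is None or value < detection_limit:
        return detection_limit / 2
    return value

# Hypothetical Te measurements (ppm) and a hypothetical detection limit.
te_ppm = [0.8, None, 0.02, 1.5]
te_limit = 0.1
te_filled = [substitute_below_detection(v, te_limit) for v in te_ppm]
```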

Beam size varied from 10 to 100 μm, depending on the size of pyrite analyzed and the goals of the relevant study. For each analysis, background was measured for 30 s prior to a 40- to 60-s laser ablation period. The analyses were conducted in a pure He atmosphere, and Ar was added to the gas stream prior to injection into the ICP-MS to improve aerosol transport. No correction was applied for doubly charged species, because these species were kept at low levels (below 0.2%). Standards were analyzed at the start and end of each sample change and approximately every 25 analyses in between. The standard STDGL2b2 (Danyushevsky et al., 2011) was used to analyze the elements of interest (except those taken from Gadd et al., 2016).

The locations, pertinent references, and number of analyses used for Random Forests training, testing, and blind testing are given in Table 1. To limit the influence of trace elements from microinclusions of other minerals that might be included during the ablation of pyrite, the data were screened to ensure that no analyses had higher than 1% Zn, 2% As, 1% Cu, 1% Ni, or 2% Co. Also, for analyses on which matrix corrections were performed, samples with higher than 20% matrix were removed. This combination of newly acquired and compiled data yielded a total of 3,579 analyses from 70 different deposits and sedimentary units. Of these, 2,898 analyses from 43 individual deposits/sedimentary formations were used to train and initially test the Random Forests classifier to identify five distinct ore deposit types: IOCG, orogenic Au, porphyry Cu, SEDEX, and VHMS. In addition to these mineral deposit types, barren sedimentary pyrite was included as a class in the training data in an attempt to avoid misclassification of nonmineralized pyrite as from an ore deposit.
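The screening step can be sketched in Python as below. The limits come from the text, converted on the assumption that concentrations are stored in ppm (1 wt % = 10,000 ppm); the record layout and the field name "matrix_pct" are hypothetical and do not describe the actual data files.

```python
# Upper limits from the text, converted from wt% to ppm (1 wt% = 10,000 ppm).
MAX_PPM = {"Zn": 10_000, "As": 20_000, "Cu": 10_000, "Ni": 10_000, "Co": 20_000}
MAX_MATRIX_PCT = 20.0

def passes_screen(analysis):
    """True if no screened element exceeds its limit and matrix content is acceptable."""
    for element, limit in MAX_PPM.items():
        if analysis.get(element, 0.0) > limit:
            return False
    # "matrix_pct" is a hypothetical field, present only when a matrix
    # correction was performed on the analysis.
    matrix = analysis.get("matrix_pct")
    return matrix is None or matrix <= MAX_MATRIX_PCT

# Hypothetical analyses (ppm), for illustration only.
analyses = [
    {"id": "a", "Zn": 120.0, "As": 900.0, "Cu": 50.0, "Ni": 30.0, "Co": 15.0},
    {"id": "b", "Zn": 25_000.0, "As": 10.0, "Cu": 5.0, "Ni": 2.0, "Co": 1.0},
    {"id": "c", "Zn": 10.0, "As": 10.0, "Cu": 5.0, "Ni": 2.0, "Co": 1.0, "matrix_pct": 35.0},
]
kept = [a["id"] for a in analyses if passes_screen(a)]
```

Here analysis "b" is rejected for exceeding the Zn limit and analysis "c" for exceeding the matrix limit, so only "a" is kept.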

The remaining 681 analyses from 27 different deposits/sedimentary formations were used as blind tests of the trained classifier. These data are referred to as blind because analyses from these deposits/sedimentary formations were not present in the training or test data sets.

Data distributions

The geometric mean, multiplicative standard deviation, median, and median absolute deviation (MAD) values of element concentrations for the different ore deposit types from the training and total data sets are provided in Tables 2 and 3. The geometric mean and the median are both presented, because they provide robust summaries of the data, depending on their distributions. Where data are log-normally distributed, the geometric mean and multiplicative standard deviation provide a more useful summary. However, when data are not log-normally distributed, the median and MAD are more appropriate (Reimann and Filzmoser, 2000).
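The four summary statistics can be computed with the Python standard library, as in the minimal sketch below. Two choices here are assumptions rather than details given in the text: the multiplicative standard deviation is taken as the exponential of the sample standard deviation of the log-transformed values, and the MAD is left unscaled (no consistency constant is applied). The Cu values are hypothetical.

```python
import math
import statistics

def geometric_mean(values):
    """Exponential of the mean of the natural logs (values must be positive)."""
    return math.exp(statistics.fmean([math.log(v) for v in values]))

def multiplicative_sd(values):
    """Exponential of the sample standard deviation of the log-transformed values."""
    return math.exp(statistics.stdev([math.log(v) for v in values]))

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

# Hypothetical Cu concentrations (ppm), for illustration only.
cu_ppm = [1.0, 10.0, 100.0]
summary = {
    "geometric_mean": geometric_mean(cu_ppm),
    "multiplicative_sd": multiplicative_sd(cu_ppm),
    "median": statistics.median(cu_ppm),
    "mad": median_absolute_deviation(cu_ppm),
}
```

For values spaced by factors of ten, the geometric mean and median coincide at 10 ppm and the multiplicative standard deviation is a factor of 10, whereas the MAD of 9 ppm is expressed on the original linear scale.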

With the exception of the VHMS and IOCG deposits, the training data set used equal numbers of analyses from each deposit. Therefore, the training data set is less biased by the number of analyses performed on the different deposits (i.e., it limits the tendency of the classifier to skew toward the deposit that has more data points in the training set). VHMS and IOCG deposits did not have sufficient analyses from a variety of deposits to have equal numbers of analyses from each deposit in the training data set. Additionally, of the reported statistics, we assert that the medians of trace element content for the different ore deposit types from the training data set should be used rather than total data set statistics for comparisons in future studies. This is because the training set geometric mean and median attempt to represent equal contributions from the different deposits instead of being overly representative of one deposit from which we have more data.

Random Forests training and evaluation

To train and test the Random Forests classifier, we used a total of 3,579 analyses of pyrite that passed the screening process: 159 IOCG, 436 orogenic Au, 416 porphyry Cu, 863 SEDEX, 1,223 sedimentary pyrite, and 482 VHMS. The pyrite trace element data were then split into three groups for classifier training, testing, and blind testing. The 681 analyses used for the blind test were removed (Table 1): orogenic Au (118 from four deposits), SEDEX (66 from three deposits), sedimentary pyrite (451 from 17 formations/basins), and VHMS (46 from three deposits). From the remaining data, a total of 120 analyses from each ore deposit type were used to train Random Forests. To avoid bias toward classes with more analyses, an equal number of analyses from each deposit were randomly selected, except for VHMS and IOCG deposits, because some deposits lacked a sufficient number of analyses to have equal numbers of analyses (Table 1). The remaining data (2,178 analyses) were used as the initial test of the classifier. A total of 500 trees were used, and splitting was halted if there were five or fewer instances in the resulting child node.
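One way to draw a class-balanced training set that spreads the selection across deposits is sketched below in Python. This is an assumption about how such a selection could be made, not the procedure used in this study: deposits within a class are sampled in round-robin fashion so that deposits with few analyses contribute what they can and larger deposits make up the difference. The record keys "deposit_type" and "deposit" and the synthetic records are hypothetical.

```python
import random
from collections import defaultdict

def balanced_training_sample(records, per_class=120, seed=0):
    """Pick `per_class` records per class, as evenly as possible across deposits.

    Returns (training, test); records not selected for training form the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(lambda: defaultdict(list))
    for r in records:
        by_class[r["deposit_type"]][r["deposit"]].append(r)

    training = []
    for deposits in by_class.values():
        pools = list(deposits.values())
        for pool in pools:
            rng.shuffle(pool)  # random selection within each deposit
        picked = 0
        # Round-robin over deposits until the class quota is met or data run out.
        while picked < per_class and any(pools):
            for pool in pools:
                if pool and picked < per_class:
                    training.append(pool.pop())
                    picked += 1
    chosen = {id(r) for r in training}
    test = [r for r in records if id(r) not in chosen]
    return training, test

# Counts reported in the text: total, blind test, and six classes of 120.
assert 3579 - 681 - 6 * 120 == 2178

# Synthetic records: class A has one small deposit, class B has two large ones.
records = (
    [{"deposit_type": "A", "deposit": "A1"} for _ in range(100)]
    + [{"deposit_type": "A", "deposit": "A2"} for _ in range(30)]
    + [{"deposit_type": "A", "deposit": "A3"} for _ in range(200)]
    + [{"deposit_type": "B", "deposit": "B1"} for _ in range(150)]
    + [{"deposit_type": "B", "deposit": "B2"} for _ in range(150)]
)
training, test = balanced_training_sample(records)
```

In the synthetic example, deposit A2 contributes all 30 of its analyses and deposits A1 and A3 contribute 45 each, which mirrors the situation described for classes in which some deposits lacked enough analyses for equal representation.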

The mean decrease in Gini index, a measure of the contribution of a given variable to correctly classify training data, was used to determine the relevance of different elements during Random Forests classifier training (Fig. 3). This measure of variable importance compares the average total decrease in node impurity (based on the Gini index) when splitting on a given variable, weighted by the proportion of samples in that node. Nickel, As, and Co generated the lowest mean decrease in Gini index values (0.069, 0.069, and 0.062, respectively). To assess if any or all of these elements could be excluded from classifier training, Random Forests classifiers were trained with different combinations of these elements removed from the training data (Co, Ni, As, Co-Ni, Co-As, Ni-As, and Co-Ni-As). The classifier was also tested with Te and Au removed, because these elements have significant numbers of analyses below detection limits, which could bias classifier training through correlations between detection limits and individual deposits. In the end, Co, Ni, Cu, Zn, As, Mo, Ag, Sb, Te, Tl, and Pb were chosen as the preferred input variables.

Random Forests generates class predictions based on a majority of votes cast by all decision trees. Associated class membership probabilities provide an opportunity to evaluate the confidence of individual classifications (Cracknell and Reading, 2013). To assess the effectiveness of the trained classifier with respect to ambiguous classifications, a range of class membership probability thresholds for the winning class were tested: >33, >40, and >50% of votes. Two separate criteria are applied. The first concerns a single analysis: higher probability thresholds remove increasingly uncertain predictions, so an analysis is classified conclusively only if its winning class receives the required proportion of votes. The second concerns a deposit: rather than requiring a 50% or greater proportion of analyses from a given deposit to consider the deposit correctly identified, we require that ≥65% of analyses be classified as the correct deposit type for a conclusive identification. When ≤35% of analyses are classified as the correct deposit type, the identification is termed incorrect, and between 35% and 65% it is termed inconclusive.
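The two criteria can be expressed as a short Python sketch. Two points are assumptions: the analysis-level test keeps predictions with a winning vote share of at least 40% (the text gives both ">40%" and "at least 40%"), and the deposit-level proportion is calculated only over analyses that remain after inconclusive ones are removed. The function names and example values are hypothetical.

```python
def classify_analysis(probabilities, min_votes=0.40):
    """Return the winning class, or None when its vote share is below `min_votes`."""
    winner = max(probabilities, key=probabilities.get)
    return winner if probabilities[winner] >= min_votes else None

def classify_deposit(analysis_calls, expected, conclusive=0.65, incorrect=0.35):
    """Grade a deposit from its per-analysis calls; None marks an inconclusive analysis."""
    calls = [c for c in analysis_calls if c is not None]
    if not calls:
        return "inconclusive"
    share = calls.count(expected) / len(calls)
    if share >= conclusive:
        return "correct"
    if share <= incorrect:
        return "incorrect"
    return "inconclusive"

# Hypothetical vote shares for two analyses.
confident = classify_analysis({"SEDEX": 0.55, "VHMS": 0.30, "sedimentary": 0.15})
ambiguous = classify_analysis({"SEDEX": 0.38, "VHMS": 0.34, "sedimentary": 0.28})

# Hypothetical deposit with 10 conclusive analyses, 6 of them orogenic Au.
grade = classify_deposit(["orogenic Au"] * 6 + ["porphyry Cu"] * 4, "orogenic Au")
```

A deposit with 6 of 10 analyses assigned to the expected class falls between the 35% and 65% limits and is therefore graded inconclusive rather than correct.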

Ore deposit type classification

Mineralization type classification outcomes for the test data are summarized in Table 4. The Random Forests classifier was run 10 times with different random selections of training data to assess the effectiveness of the classifier with different random seeds of data (App. 1). Random Forests correctly identified the ore deposit type from pyrite trace element analyses with an overall accuracy of 91.0 ± 0.8%. Recall statistics for individual ore deposit types range from 76.9 ± 5.4 to 95.0 ± 0.8%. IOCG, orogenic Au, and porphyry Cu test data samples were predicted with recalls of 86.9 ± 5.0, 76.9 ± 5.4, and 84.2 ± 2.4%, respectively. SEDEX, sedimentary pyrite, and VHMS test data were predicted with noticeably better recalls of 95.0 ± 0.8, 92.8 ± 2.1, and 94.4 ± 1.8%, respectively. These results show that the different random selections for the training data produce similar results. As such, the same training data set (the one that produced the results in Table 4) was used for the following experiments.

Table 5 indicates that by removing a small percentage (7.8%) of Random Forests predictions with class membership probabilities of less than 40%, ambiguous classifications can be eliminated. This correction results in an increase in overall accuracy of 3.1% to a total of 94.5%. Similarly, the class recalls for individual ore deposit types changed by between –0.3 and 7.0%, with IOCG, orogenic Au, porphyry Cu, SEDEX, sedimentary pyrite, and VHMS deposits having adjusted individual recalls of 94.6, 85.8, 87.8, 95.4, 98.5, and 94.6%, respectively.

Blind test results indicate the Random Forests classifier generated predictions with an overall accuracy of 88% with class-dependent recalls between 73.9 and 95.5% (Table 6). The orogenic Au, SEDEX, sedimentary pyrite, and VHMS samples were classified with proportions of correct classification of 81.4, 95.5, 90.0, and 73.9%, respectively. More accurate results were again obtained when excluding predictions with maximum class membership probabilities of less than 40% (Table 7). Increases in recall ranged from 1.4 to 10.7%, resulting in class recalls for orogenic Au of 85.5%, SEDEX of 96.9%, sedimentary pyrite of 96.7%, and VHMS of 84.6%.

Different class membership thresholds were trialed (33, 40, and 50%). The 40% threshold was chosen because it led to an increase in recall rates of 3.1% (importantly, this includes an increase in orogenic Au recall of 7.0% and porphyry Cu recall of 6.7%) while preserving approximately 92.2% of the number of original analyses in the test data. The 33% threshold increased the recall rates by only 1.1%, and the 50% threshold increased the recall rates by 5.2% but required removal of 18.3% of the data. The results of these experiments are in Appendix 2.
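The trade-off between accuracy and data retention at different thresholds can be examined with a sketch such as the following. The function name and the five toy results are hypothetical and are not values from Appendix 2; each result is a tuple of true label, predicted label, and the vote share of the winning class.

```python
def evaluate_threshold(results, min_votes):
    """Return (accuracy, retained_fraction) after dropping low-confidence predictions."""
    kept = [(t, p) for t, p, share in results if share >= min_votes]
    retained = len(kept) / len(results)
    accuracy = sum(t == p for t, p in kept) / len(kept) if kept else float("nan")
    return accuracy, retained

# Hypothetical (true label, predicted label, winning vote share) tuples.
results = [
    ("SEDEX", "SEDEX", 0.90),
    ("SEDEX", "VHMS", 0.35),
    ("VHMS", "VHMS", 0.60),
    ("VHMS", "SEDEX", 0.45),
    ("IOCG", "IOCG", 0.80),
]
trade_off = {t: evaluate_threshold(results, t) for t in (0.33, 0.40, 0.50)}
```

In this toy case raising the threshold from 0.33 to 0.50 lifts accuracy from 0.6 to 1.0 but retains only 60% of the analyses, which is the same kind of compromise that motivated the choice of an intermediate threshold.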

A series of Random Forests classifications were rerun with different combinations of Te, Co, As, and Ni removed from the data. This exercise was included because Te has several analyses below detection limits and the data set from Gadd et al. (2016) did not include Te. The value of Co, As, and Ni was tested because these had the lowest mean decreases in the Gini index (Fig. 3). While the removal of one of these elements did not cause large changes in the ability of the Random Forests classifier to predict deposit type in general, it did significantly affect the ability to classify individual deposit types; thus, all these elements were included in the preferred classifier.

It has been proposed that pyrite trace element content is reset when trace elements are forced out of the pyrite lattice at metamorphic grades higher than midgreenschist facies (Large et al., 2009; Thomas et al., 2011). To test this assertion, we ran LA-ICP-MS pyrite trace element data (n = 93; Belousov et al., 2016) from orogenic gold deposits that have been metamorphosed to greater than midgreenschist facies through the classifier. This returned only 67.7% correct identifications, 11.1 percentage points lower than for the lower metamorphic-grade orogenic gold deposits. Similarly, when inconclusive (less than 40% of votes) analyses are removed, this increases only to 70.9% correct identifications, 14.9 percentage points lower than for the lower metamorphic-grade orogenic gold deposits (Table 8).

One of the drawbacks to using Random Forests is that it will always give an answer, even if the actual class of an unknown pyrite sample is not within the training data set. To test how the classifier will react to pyrite that does not fit the types we have included in the training data, we attempted to classify the data presented by Gregory et al. (2016) from the St. Ives Au district, which includes four different types of pyrite not included in the training data (note that the orogenic Au pyrite from this study has been included in the training and test data sets of the classifier). Gregory et al. (2016) presented LA-ICP-MS analyses of sedimentary pyrite (py1 and py2; n = 143), nonmineralization-related hydrothermal pyrite (py3, py4, and py5; n = 37, 8, and 17, respectively), orogenic Au pyrite (py6; n = 117), and greenstone-related pyrite (py7; n = 20). Of these, sedimentary pyrite and orogenic Au pyrite had 97.5 and 84.9% of the analyses correctly identified, respectively, and only 16 and 9% of their analyses were removed as inconclusive (received less than 40% of the votes). Py5 had 76% of its analyses removed as inconclusive, and for py3 and py7 the most frequently assigned class accounted for only 62.5% of the analyses in each case. Py4 had only 38% of its analyses removed as inconclusive, and 80% of the analyses were identified as orogenic Au. These results are summarized in Table 9.

Conventional X-Y element scatter plots

Conventional element scatter plots of pyrite chemistry have been used with some degree of success to differentiate pyrite from different ore types. However, X-Y scatter plots are less useful when discriminating pyrite from more than two deposit types. Examples are given in Figure 4 for the pyrite training data set from this study. In general terms, pyrite in the ore zones from medium- to low-temperature hydrothermal deposit types (VHMS and SEDEX) tends to contain higher concentrations of most trace elements compared to pyrite from higher-temperature hydrothermal deposit types (porphyry Cu, IOCG, and orogenic Au). This relationship is illustrated in Figure 4A through C and F (Zn-Cu, Mo-As, Ag-Pb, and Tl-Sb scatter plots). Sedimentary pyrite also contains high concentrations of most trace elements and plots in the same vicinity as data for SEDEX and VHMS deposits. Porphyry Cu, IOCG, and orogenic Au pyrites by comparison generally contain lower levels of Zn, Cu, Mo, Ag, Pb, Tl, and Sb. Commonly, the data for different deposit types exhibit strong overlaps such that it is virtually impossible to distinguish ore type based on simple trace element scatter plots (e.g., Fig. 4D).

By simultaneously using several different elements, Random Forests allows us to go beyond what is possible with traditional X-Y plots, but visualization of the distinctions can be challenging. By assessing the overall element concentrations of classified ore deposit types, however, some of the Random Forests decision boundaries can be depicted. For this discussion, we use training data median values, as they are less affected by imbalances in the number of samples from each deposit type compared to complete or test data sets, and they provide a reasonable estimate of the central tendency of populations that are not normally distributed. Copper and Zn can be used to separate SEDEX (medians of 495.49 ppm for Cu and 95.95 ppm for Zn) and VHMS (medians of 1,002.64 ppm for Cu and 180.02 ppm for Zn) deposits from the other deposit types, as they are one to two orders of magnitude more enriched in these elements (Fig. 4). Conversely, distinctly low As values (median 2.15 ppm) can be used to separate IOCG and, to a lesser extent, porphyry Cu mineralization (median 53.36 ppm). Enrichments in Mo are known to occur in a number of sedimentary settings, particularly when euxinic conditions are present (Lyons et al., 2003; Tribovillard et al., 2006; Scott et al., 2008; Lyons et al., 2009). Therefore, it follows that high Mo can be used to identify SEDEX and sedimentary pyrite (medians of 23.38 and 28.38 ppm Mo, respectively), both of which formed in marine settings. Similarly, VHMS deposits have low but above-detection Mo (median of 0.98 ppm), presumably due to the association of VHMS deposits with seawater and deposition at or near the sea floor. SEDEX (medians of 23.88 ppm for Ag and 963.86 ppm for Pb) and VHMS (medians of 22.00 ppm for Ag and 320.41 ppm for Pb) pyrite is enriched in Ag and Pb.

Interestingly, Co and Ni in sulfide minerals, which have long been used to determine pyrite source (Loftus-Hills and Solomon, 1967), were among the lowest ranked elements in terms of mean decrease in Gini index (Fig. 3). Nevertheless, porphyry Cu-related pyrite is enriched in Ni compared to the other deposit types (median of 590.40 ppm), and IOCG is very enriched in Co (median of 1,735.28 ppm). Even Au, which was left out of the favored Random Forests classifier due to concerns about the number of analyses that were below detection limits, is potentially significant for identifying orogenic Au (median 0.16 ppm) and VHMS (median 0.41 ppm) deposit types. However, the strength of the Random Forests method lies with its ability to combine all observations rapidly.

Ore deposit type predictions

The results of Random Forests predictions for test (91.4% correct predictions) and blind test (88%) data (Tables 4, 6) prove the efficacy of Random Forests analyses of pyrite databases to predict ore deposit type. The classification can be further refined by removing the analyses that did not meet the threshold of obtaining 40% or more of the votes from the Random Forests. This adjustment increased the accuracy of experiments with Au removed to 94.5% with 7.8% of data removed for the test data and to 93.9% with 11.3% of data removed for the blind test data (Tables 5, 7).

The very high proportion of correct predictions (98.5% for test data and 96.7% for blind test data) for sedimentary pyrite is particularly important. Specifically, those data represent the only nonmineralized pyrite samples investigated in this study, suggesting that Random Forests classification is able to accurately discriminate pyrite formed from mineralized systems from that formed at low temperature in the water column and in shallow marine sediments. There is often disagreement in the paleoceanographic community in discussions about whether hydrothermal overprints or ocean conditions are responsible for metal enrichments in the rock record. The Random Forests classifier developed here may facilitate the identification of hydrothermal overprints on sedimentary pyrite in future studies.

As there is a disparate number of analyses from different deposit types, it is possible that the classifier is only working well for the deposits that have larger amounts of data. To test whether this is the case, we checked the individual results of the classifier (with a >40% vote threshold) for each deposit from the test and blind test data set (Tables 10, 11). Of these, all but one of the deposits were conclusively (greater than 65%) correctly identified. The deposit that was inconclusive, the Youanmi orogenic Au deposit, still had 60% of the votes and only had 10 analyses to classify, so it may be that the pyrite trace element content was not accurately represented by the sample. This demonstrates that the Random Forests classifier can identify analyses from the deposits used in developing the classifier.

Effects of metamorphic grade on classifier predictions

To test and assess how high-grade metamorphic overprint will affect the ability of Random Forests to identify ore deposit type, we used analyses from Belousov et al. (2016) that were from upper greenschist or higher-grade metamorphic facies. These data resulted in a decrease of over 10% in the effectiveness of the classifier and, importantly, resulted in 50% of the deposits being inconclusively classified or misclassified using the initial results, or 37.5% after inconclusive analyses (analyses that received less than 40% of the votes) were removed (Table 8). This suggests that pyrite trace element content can give spurious results in high metamorphic-grade settings. The exact reason for this variation in trace element content is beyond the scope of this study; however, it is interesting to note that the Ni median is higher (258 ppm) and the Sb median is lower (0.49 ppm; Table 12) in the high metamorphic-grade orogenic gold deposits, making them more similar to high-temperature pyrite varieties such as those from porphyry deposits (Table 2; Franchini et al., 2015). This may reflect pyrite dissolution and reprecipitation or recrystallization of the pyrite at high temperatures imparting a chemistry more indicative of magmatic processes.

Identification of pyrite that has a source not included in the classifier

One of the limitations of using Random Forests to predict unknowns in a geologic setting is that it will always give an answer that corresponds with the input designations of the training data set. Because there is a wide variety of different deposits and pyrite sources not associated with economic mineral deposits, there is a risk that the classifier will assess everything as coming from a mineralized deposit. To check how the classifier responds to barren, nonsedimentary pyrite, we used pyrite data from sedimentary pyrite, orogenic gold-related pyrite, and four pyrite generations unrelated to the mineralization from the St. Ives Au district (Gregory et al., 2016). The sedimentary and orogenic Au pyrite was conclusively and correctly identified (note that the orogenic Au pyrite was included in the training data set earlier), while three of the nonmineralized pyrites returned inconclusive results (Table 9). The fourth was conclusively but incorrectly identified as orogenic Au. This shows that most barren pyrite can be identified correctly by calculating the proportion of analyses that are inconclusive and by establishing criteria for how many inconclusive identifications are present in a given sample or set of samples. At the same time, it serves as a reminder that this classifier still needs a large number of analyses from many of the deposit types listed, deposit types currently not represented in the classifier, and other types of nonmineralized pyrite before it can be confidently utilized in the mineral exploration industry. It also shows that the classifier has the potential to be used as one of several tools when making decisions regarding priority of drill targets but not as a replacement for traditional tools, such as petrography, when determining the paragenesis of an ore deposit.
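
The screening idea in this paragraph, using the proportion of inconclusive analyses in a sample, can be illustrated with a short sketch. The 40% vote threshold is taken from the text; the sample-level cutoff (`MAX_INCONCLUSIVE`) is a hypothetical placeholder, because no such criterion is established here, and the function names and vote fractions are likewise invented for illustration.

```python
VOTE_THRESHOLD = 0.40    # from the text: below 40% of votes is inconclusive
MAX_INCONCLUSIVE = 0.50  # hypothetical sample-level cutoff, not from the study


def inconclusive_fraction(vote_rows, threshold=VOTE_THRESHOLD):
    """Fraction of analyses whose winning class falls below the vote threshold."""
    if not vote_rows:
        raise ValueError("at least one analysis is required")
    inconclusive = sum(1 for votes in vote_rows if max(votes.values()) < threshold)
    return inconclusive / len(vote_rows)


def may_be_unrepresented(vote_rows, cutoff=MAX_INCONCLUSIVE):
    """Flag a sample whose analyses are mostly inconclusive."""
    return inconclusive_fraction(vote_rows) > cutoff


# Hypothetical vote fractions for one sample, for illustration only.
sample = [
    {"orogenic Au": 0.38, "VHMS": 0.32, "SEDEX": 0.30},
    {"orogenic Au": 0.36, "VHMS": 0.34, "SEDEX": 0.30},
    {"orogenic Au": 0.55, "VHMS": 0.25, "SEDEX": 0.20},
    {"orogenic Au": 0.34, "VHMS": 0.33, "SEDEX": 0.33},
]
fraction = inconclusive_fraction(sample)
flagged = may_be_unrepresented(sample)
```

A screen of this kind would not catch a case like the fourth St. Ives pyrite generation, which was conclusively but incorrectly identified, so it complements rather than replaces petrographic checks.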

Caveats and future work

The pyrite data investigated in this study were obtained from analyses collected over 10 years as part of a number of different projects with contrasting objectives. In addition, the LA-ICP-MS technology has continued to develop over this time, and detection limits for all trace elements vary significantly. This has resulted in a range of detection limits throughout the data, including SEDEX deposits with anomalously high limits for Se, Cd, Au, and Te (Maier, 2011). In the case of the data from Gadd et al. (2016), some of these elements were not analyzed (or reported). Cadmium and Se results were omitted from our training data for this reason but should be included in future analyses, as both these elements accumulate in pyrite and could be useful for discriminating ore deposit type.

Similarly, the optimal Random Forests classifier was refined to not include Au. Tellurium, however, was not omitted from this classifier despite the lack of Te data from SEDEX deposits. Without Te, the classifier has difficulty identifying orogenic Au mineralization, because Te is commonly associated with Au mineralization (Belousov et al., 2016). Because the Random Forests classifier requires all trace elements in the table to contain nonmissing values, the averages from the single SEDEX deposit that had good-quality Te and Au data (Lady Loretta) were used for all the SEDEX analyses. This has probably overestimated the ability of the classifier to identify SEDEX analyses, because the same value for Te was used by all the SEDEX samples. However, because SEDEX pyrite also has distinctly higher Cu, Mo, Sb, Tl, and Pb concentrations compared to most other deposits, it is thought that Te is not particularly important for SEDEX classification. Furthermore, concentrations of Te in SEDEX samples only differ significantly from those in orogenic Au, porphyry Cu, and VHMS samples. To further test this reasoning, the favored classifier (omitting Au) was rerun on the test data with Te excluded. The results are summarized in Table 13. This experiment showed that, indeed, the SEDEX results were enhanced by the substituted Te values; however, the SEDEX analyses without Te were still correctly identified most of the time, with a recall of 74.2% (for test data). The classifier will be strengthened by the addition of new SEDEX analyses with viable Te data, but until those data are available, the average Te concentration from Lady Loretta is used for SEDEX analyses with high detection limits or missing data.
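
The substitution described above amounts to mean imputation from a single reference deposit. The sketch below shows one way this could be written; it is an assumption-laden illustration rather than the study's procedure. The record layout, the function name `impute_te`, the placeholder deposit names, and all Te values are hypothetical, and only missing SEDEX values are filled here.

```python
from statistics import mean


def impute_te(records, reference_deposit, deposit_type="SEDEX"):
    """Fill missing Te for one deposit type with the reference deposit's mean Te.

    `records` is a list of dicts with keys "deposit", "type", and "Te",
    where a missing Te measurement is stored as None. A new list is returned.
    """
    reference_values = [
        r["Te"] for r in records
        if r["deposit"] == reference_deposit and r["Te"] is not None
    ]
    if not reference_values:
        raise ValueError("reference deposit has no usable Te values")
    fill = mean(reference_values)
    return [
        dict(r, Te=fill) if r["type"] == deposit_type and r["Te"] is None else dict(r)
        for r in records
    ]


# Hypothetical Te values in ppm, for illustration only (not data from the study).
records = [
    {"deposit": "Lady Loretta", "type": "SEDEX", "Te": 1.0},
    {"deposit": "Lady Loretta", "type": "SEDEX", "Te": 3.0},
    {"deposit": "Deposit X", "type": "SEDEX", "Te": None},
    {"deposit": "Deposit Y", "type": "orogenic Au", "Te": None},
]
filled = impute_te(records, "Lady Loretta")
```

Because every imputed analysis receives the same Te value, the feature carries no within-class variance for those rows, which is consistent with the concern that SEDEX performance is overestimated.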

Tin and W may be useful discriminators, as has been shown for VHMS (high Sn) and orogenic Au deposits (high W; Belousov et al., 2016). These elements were not included in the classifier because of a general lack of data in some data sources. As W and Sn have been proven effective for discriminating between some deposit types, future pyrite analyses should include W and Sn to further assess their utility.

A further weakness of the current classifier is the variability in the number of deposits for which data are available and the amount of data from those sites. Data are available from only two porphyry Cu districts and two IOCG deposits. This gap may mean that pyrite trace element concentrations for those deposit types are not fully representative of the ranges likely to be found in mineralized systems. Therefore, additional data from porphyry Cu and IOCG deposit types need to be collected so the variability observed between different deposits of the same type can be better represented.

While an attempt was made to include as many different deposits as possible, we concede that several important deposit types were missing, such as epithermal Au, Carlin-type Au, and Ni/platinum group element deposits. Future iterations of this classification experiment should include these and other deposit types. Similarly, in its current state, the classifier only includes one type of barren pyrite—sedimentary pyrite. Future work should include barren metamorphic and igneous pyrite.

Conclusions

The Random Forests classifier developed here, based on the concentrations of Co, Ni, Cu, Zn, As, Mo, Ag, Sb, Te, Tl, and Pb in pyrite, was found to correctly classify both test data and blind test data. These results yielded an overall accuracy for the test and blind test data of 94.5 and 93.9%, respectively, when inconclusive analyses (less than 40% of votes) are not considered. We can conclude that Random Forests classifiers developed from microanalyses of individual minerals are potentially useful for identifying ore deposit type and should be considered a viable geochemical exploration tool. However, this approach should be regarded as a preliminary positive result: before it can be widely applied in mineral exploration, additional ore-related and non-ore-related pyrite varieties need to be added to the classifier. Furthermore, we stress that it should be regarded as one of many tools rather than a single stand-alone classification method. Parties who are interested in using the classifier on their own data sets are encouraged to contact the lead author, who can arrange the processing of LA-ICP-MS pyrite data.

By testing how well the classifier can identify ore deposit type on pyrite that has passed through the midgreenschist facies metamorphic window, we have found that at least in some areas the trace element composition of pyrite has been significantly altered such that the classifier can no longer identify the original pyrite type conclusively. This supports the assertion that pyrite chemistry can be altered at these metamorphic grades.

These results are also important for fields of geology not interested in ore deposits or exploration for ore deposits. The high degree of effectiveness of the classifier for identifying sedimentary pyrite not associated with hydrothermal fluids has created an additional opportunity for recognizing hydrothermal overprints on sedimentary deposits included in paleoceanographic studies.

Acknowledgments

We would like to acknowledge the Western Australia and South Australia geological surveys for their support of the initial studies that generated much of the data from which this project arose. We also thank the University of Western Australia Centre for Exploration Targeting (UWA CET) for providing a sample set from Western Australia orogenic gold deposits. Funding for the compilation of additional data and the refining of the classifier was provided by the National Science Foundation Frontiers in Earth System Dynamics (NSF FESD) program and the National Aeronautics and Space Administration (NASA) Astrobiology Institute under cooperative agreement NNA15BB03A issued through the Science Mission Directorate. This study also benefited from data collected as part of the Australian Mineral Industry Research Association (AMIRA) International project P1060, Enhanced Geochemical Targeting in Magmatic-Hydrothermal Systems. The authors gratefully acknowledge Alan Goode and Adele Seymon (AMIRA International) and all the industry sponsors of P1060 for their generous sponsorship of this research. We also thank Artur Deditius and Denis Fougerouse for valuable suggestions on the manuscript.

Daniel Gregory is an assistant professor in economic geology at the University of Toronto, Canada. He worked as an exploration geologist in the Yukon Territory, Canada, before he moved to Australia to complete his Ph.D. degree in economic geology and geochemistry at the Centre for Ore Deposit and Earth Sciences (CODES), Tasmania. Daniel held postdoc positions at CODES and the National Aeronautics and Space Administration (NASA) Astrobiology Institute at the University of California Riverside (UCR) investigating basin-scale whole-rock geochemistry and mineral chemistry using macro- and nanoanalytical techniques. He focuses on in situ trace element analyses to understand the fluids related to ore deposit formation. Dan is testing machine learning techniques to identify ore deposit style and vector toward economic mineralization.

Gold Open Access: This article is published under the terms of the CC-BY 3.0 license.