A novel method of estimating the silica (SiO2) and loss-on-ignition (LOI) concentrations for the North American Soil Geochemical Landscapes (NASGL) project datasets is proposed. Combining the precision of the geochemical determinations with the completeness of the mineralogical NASGL data, we suggest a ‘reverse normative’ or inversion approach to first calculate the minimum SiO2, water (H2O) and carbon dioxide (CO2) concentrations in weight percent (wt%) in these samples. These can be used in a first step to compute minimum and maximum estimates for SiO2. In a recursive step, a ‘consensus’ SiO2 is then established as the average between the two aforementioned SiO2 estimates, trimmed as necessary to yield a total composition (major oxides converted from reported Al, Ca, Fe, K, Mg, Mn, Na, P, S and Ti elemental concentrations + ‘consensus’ SiO2 + reported trace element concentrations converted to wt% + ‘normative’ H2O + ‘normative’ CO2) of no more than 100 wt%. Any remaining compositional gap between 100 wt% and this sum is considered ‘other’ LOI and likely includes H2O and CO2 from the reported ‘amorphous’ phase (of unknown geochemical or mineralogical composition) as well as other volatile components present in soil. We validate the technique against a separate dataset from Australia where geochemical (including all major oxides) and mineralogical data exist on the same samples. The correlation between predicted and observed SiO2 is linear, strong (R2 = 0.91) and homoscedastic. We also compare the estimated NASGL SiO2 concentrations with a sparser, publicly available continental-scale survey over the conterminous USA, the ‘Shacklette and Boerngen’ dataset. This comparison shows the new data to be a reasonable representation of SiO2 values measured on the ground over the conterminous USA. 
We recommend the approach of combining geochemical and mineralogical information to estimate missing SiO2 and LOI by the recursive inversion approach in datasets elsewhere, with the caveat to always validate results.

Rock, sediment and soil chemical and mineralogical characterizations are fundamental to the discipline of geochemistry, particularly when it comes to applications in the fields of mineral exploration, environmental management, agronomy, horticulture and forestry, and landuse decision making. Most rocks, sediments and soils on Earth contain minerals that include silicon (Si) and oxygen (O) in their structure (e.g. silicates), and often also hydrogen (H; e.g. phyllosilicates) and, perhaps less commonly, carbon (C; e.g. carbonates) (e.g. Finkl 1981; Deer et al. 2013; Schaetzl and Anderson 2007). Traditionally, total analyses for major elements have been reported as oxides (the term ‘major’ is used here to include components with generally greater than 0.1 percent abundance), typically obtained using X-ray fluorescence (XRF) as the analytical method, which determines total content (regardless of the host, speciation or oxidation state of each major element). The analysis of rock, sediment and soil samples by XRF is often complemented by the gravimetric determination of ‘loss-on-ignition’ (LOI) obtained by heating the sample to a set temperature (e.g. 900°C) and measuring the mass loss compared to the starting sample at standardized temperature, pressure and humidity. LOI has several components, including adsorbed water (H2O; e.g. interlayer water in clay minerals), combined H2O (e.g. hydrated minerals and labile hydroxyl-compounds), carbon dioxide (CO2; e.g. from carbonates and organic matter) and volatile elements (e.g. Hg).

One advantage of reporting the major components of a rock, sediment or soil sample as oxides is that their sum, when complemented by LOI and trace elements (TEs), should add up to 100 weight percent (wt%) of the sample. Having a complete sample analysis, or at least as complete as practically feasible, is important to give confidence that the sample is well characterized, which implies that the composition is closed or full and not a subcomposition. This has implications in subsequent data processing, including in the development and application of Compositional Data Analysis methods (e.g. Chayes 1960; Aitchison 1986; Scealy et al. 2015).

Another benefit of a complete sample analysis is the direct relationship between the geochemical and mineralogical compositions, via the knowledge (or modelling) of the minerals’ stoichiometric compositions. Deriving the most plausible mineralogy from geochemistry is a non-unique inversion problem known as ‘normative analysis’ (e.g. Caritat et al. 1994; Aldis et al. 2023). This is a useful way to ensure that chemistry and mineralogy of a sample are mutually compatible, especially for finer-grained samples where optical or even electronic microscopic techniques reach their resolution limit to helpfully identify minerals.

In recent decades, another family of analytical methods has gained in popularity, mainly due to its high precision and multi-elemental capability, namely inductively coupled plasma atomic emission spectroscopy (ICP-AES) and -mass spectrometry (ICP-MS). The ICP-based methods typically require a digestion of the rock, sediment or soil sample to present it to the instrument as a liquid phase (laser ablation is an alternative input mode not discussed here). This digestion, which can range from near-total to weak, and from selective to nonselective, is crucial to document in detail as it controls how to interpret the geochemical results (e.g. Mann 2010). Analytical data generated by ICP often consist of a list of 30 or more elements, generally reported in parts per million mass/mass (ppm m/m; equivalent to mg kg–1 and μg g–1). These element concentrations for the major elements (Al, Ca, Fe, etc.) can readily be converted to oxide equivalents (yielding a ‘pseudo-XRF’ result).

Generally a full or complete analysis of a rock, sediment or soil, expressed as oxides, consists of Al2O3, CaO, FeO, Fe2O3, K2O, MgO, MnO, Na2O, P2O5, SiO2, SO3 and TiO2 concentrations typically expressed in wt%, either directly obtained by XRF or converted from ICP elemental data. Often, the two Fe analytes, reduced and oxidized Fe, are reported together as Fe2O3tot. Added together and supplemented by LOI and TE concentrations, the full analysis is considered complete and should sum to (or close to) 100 wt%. Any discrepancy represents components not analysed for and/or uncertainty.

The North American Soil Geochemical Landscapes (NASGL) project is a recent continental-scale geochemical survey of the conterminous USA (Smith et al. 2013, 2014, 2019; see the recent project review in Smith 2022). It sampled soil from three levels (0–5 cm depth, A horizon and C horizon) at 4857 sites, the <2 mm fraction of which was analysed for 45 major and trace element concentrations by methods yielding ‘total or near-total’ elemental content. The chemical elements reported were Ag, Al, As, Ba, Be, Bi, C, Ca, Cd, Ce, Co, Cr, Cs, Cu, Fe, Ga, Hg, In, K, La, Li, Mg, Mn, Mo, Na, Nb, Ni, P, Pb, Rb, S, Sb, Sc, Se, Sn, Sr, Te, Th, Ti, Tl, U, V, W, Y and Zn. Elements were mostly analysed and quantified by ICP-AES or ICP-MS, after a four-acid (hydrochloric, nitric, hydrofluoric and perchloric acids) digestion of the milled samples at a temperature of between 125 and 150°C (see Smith et al. 2013 for more detail). Note that Si was not included in the contracted analytical package. As ICP-based analytical techniques cannot quantify O and H present (in fact, abundant) in most if not all rock, sediment, or soil samples, the sum of all its analytes (Al, …, Zn) falls well short of one million ppm (a complete composition); indeed in the C horizon dataset, for instance, the sum of all ICP analytes ranges from 1134 to 390 740 ppm (average 144 869 ppm). Table 1 presents a brief statistical summary of the geochemical composition of NASGL C horizon soil samples for the major elements. The C horizon dataset is used herein to illustrate our method.

The NASGL project also analysed and quantified mineralogy in those samples. The minerals quantified were quartz, K-feldspars, plagioclases, (total feldspars), 14 Å clays, 10 Å clays, kaolinite, (total clays), gibbsite, calcite, dolomite, aragonite, (total carbonates), analcime, heulandite, (total zeolites), gypsum, talc, hornblende, serpentine, hematite, goethite, pyroxene, pyrite, other and amorphous (phases in parenthesis are summations of other minerals). The amorphous phase typically consists of material that is poorly diffracting; this will generally include clay minerals, various forms of micro-quartz, Fe-, Mn- and Al-oxyhydroxides, organic matter, volcanic glass, etc. (e.g. Tan et al. 1970; Smith et al. 2018; Tsukimura et al. 2021). Minerals were analysed by X-ray diffraction (XRD) and quantified using a Rietveld refinement method (Smith et al. 2013). Unlike the geochemical data, the XRD data are ‘complete’ in the sense that they add up to 100 wt% (range 99.6 to 100.2 wt%, average 100.03 wt%, for the C horizon). Table 2 presents a brief statistical summary of the mineralogical composition of NASGL C horizon soil samples.

One of the shortcomings of the NASGL project is that neither Si/SiO2 nor LOI were reported in the released geochemical datasets. The present paper aims to propose and test a method for estimating those missing, yet crucial, parameters.

As the NASGL project did not use XRF analysis, we first need to convert the 10 reported major elements (Al, Ca, Fe, K, Mg, Mn, Na, P, S and Ti) into oxides. This is readily achieved by dividing each elemental concentration by the atomic weight of the element multiplied by the number of atoms of that element in the oxide formula (e.g. 2 for Al in Al2O3), multiplying the result by the molecular weight of the oxide, and adjusting for any unit change (e.g. dividing by 10 000 to convert from ppm to wt%). These oxides are hereafter referred to as other_oxides to indicate that they do not include SiO2. For one of the most common soil components, Si, no reported elemental or oxide concentration exists and it must thus be estimated. The proposed method for estimating SiO2, which draws upon both the geochemical and the mineralogical analyses of the NASGL samples, is described below and the workflow illustrated in Figure 1. A worked example is provided in a Microsoft Excel™ spreadsheet (see the section below on Datasets).
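The element-to-oxide conversion described above can be sketched as follows. This is a minimal illustration, not the authors' spreadsheet: the atomic weights are rounded standard values, the function names are hypothetical, and Fe is assumed to be expressed as Fe2O3tot.

```python
# Rounded standard atomic weights (g/mol); assumed values for illustration.
ATOMIC_WEIGHT = {
    "Al": 26.982, "Ca": 40.078, "Fe": 55.845, "K": 39.098, "Mg": 24.305,
    "Mn": 54.938, "Na": 22.990, "P": 30.974, "S": 32.06, "Ti": 47.867,
    "O": 15.999,
}

# element -> (oxide name, atoms of element per formula, O atoms per formula)
# Fe is assumed to be reported as total ferric oxide (Fe2O3tot).
OXIDE = {
    "Al": ("Al2O3", 2, 3), "Ca": ("CaO", 1, 1), "Fe": ("Fe2O3tot", 2, 3),
    "K": ("K2O", 2, 1), "Mg": ("MgO", 1, 1), "Mn": ("MnO", 1, 1),
    "Na": ("Na2O", 2, 1), "P": ("P2O5", 2, 5), "S": ("SO3", 1, 3),
    "Ti": ("TiO2", 1, 2),
}


def oxide_factor(element):
    """Mass of oxide per unit mass of element (dimensionless)."""
    _, n_element, n_oxygen = OXIDE[element]
    element_mass = n_element * ATOMIC_WEIGHT[element]
    oxide_mass = element_mass + n_oxygen * ATOMIC_WEIGHT["O"]
    return oxide_mass / element_mass


def ppm_to_oxide_wt_pct(element, ppm):
    """Convert an elemental concentration in ppm to oxide wt%."""
    return ppm * oxide_factor(element) / 10_000.0


# Illustrative input only: 50 000 ppm Al expressed as Al2O3 wt%.
print(OXIDE["Al"][0], round(ppm_to_oxide_wt_pct("Al", 50_000), 2))
```

Summing the ten converted values for a sample gives the other_oxides term used in the steps below.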

Initially, two estimates for SiO2 are calculated by inverting mineralogical information; neither is ideal, as the first is likely to give a minimum, the second a maximum value for SiO2. Next, a ‘consensus’ SiO2 concentration is obtained recursively from the two aforementioned estimates. Finally, the LOI is calculated to obtain a closed full composition at 100 wt%. The detailed steps are described below.

  • Step 1: Data preparation. The geochemical and mineralogical data for soil samples of the conterminous USA (A and C horizon datasets) were downloaded from https://mrdata.usgs.gov/ds-801/. Samples (rows) which had either incomplete or missing geochemical or mineralogical quantification (e.g. insufficient sample material) were removed. Analytes (columns) with excessive censored values (below detection/reportable limit) were removed (e.g. Ag, Cs, Te; Grunsky et al. 2018). Concentration units were unified (ppm) and censored data were imputed using the zCompositions package (lrEM function) in the R computing environment (Palarea-Albaladejo et al. 2014). Note that the imputation step is not critical to the present estimation workflow and other ways of handling censored data may be applied. After imputation, the major elements were converted to oxides and all analytes were expressed as wt%.

  • Step 2: Inverting ‘normative’ SiO2 due to silicate minerals. The ‘normative’ SiO2 is the amount of SiO2 each sample must contain to be consistent with its mineralogy (technically this is a reverse normative or inversion approach). It is calculated by summing the SiO2 contributions of each Si-bearing mineral (silicate), for example, 1 * quartz + 0.6476 * K-feldspar + … + AVERAGE (0.483, 0.5985, 0.5549, 0.5173) * pyroxene. The multipliers are the mass proportions of the relevant oxide (e.g. SiO2) in each mineral (e.g. K-feldspar above), and were sourced from https://webmineral.com. Where more than one end-member mineral exists for a group (e.g. a solid solution), the average of the (most common) end-members is used (e.g. pyroxene above). Table 3 summarizes the proportional multipliers used in this paper. This first estimate of SiO2 does not consider the phases ‘other’ and ‘amorphous’. Amorphous has a median abundance of 17.5 wt% and a maximum of 95.2 wt% in the NASGL C horizon dataset. It is likely to contain forms of microcrystalline silica (such as opal-A; e.g. Achilles et al. 2018), and therefore the ‘normative’ SiO2 calculated here could, and most likely does, underestimate the real SiO2 concentration.

  • Step 3: Inverting LOI due to hydrate and carbonate minerals. The ‘normative’ H2O and ‘normative’ CO2 components of LOI in each sample were calculated to be consistent with the mineralogy (e.g. amounts of gypsum and calcite). This is done in a similar way as described above, but applied to all O- and H-bearing (hydrate) minerals and all C-bearing (carbonate) minerals, respectively. The relevant proportional multipliers used are also given in Table 3. As for SiO2, the ‘normative’ H2O and CO2 contents of the amorphous phase are not known and likely important (e.g. Achilles et al. 2018). Thus this method could, and most likely does, underestimate the real LOI concentration.

  • Step 4: Calculating a second estimate for SiO2. A second (maximum) SiO2 estimate is calculated by the difference 100 wt% – Sum(other_oxides, TEs, ‘normative’ H2O, ‘normative’ CO2). It could, and most likely does, overestimate the real SiO2 concentration because LOI is almost certainly underestimated (see above). Note that in some instances, the first estimate of SiO2 is larger than the second, which we interpret to result either from uncertainty in the mineralogical quantification (amounts of silicate, hydrate and carbonate minerals are not consistent with the geochemistry), or from overestimation of ‘normative’ H2O (‘normative’ CO2 being well constrained by carbonate minerals).

  • Step 5: Recursively estimating a ‘consensus’ SiO2. A ‘consensus’ SiO2 is then calculated recursively by first taking the average of the above two SiO2 estimates. For some samples, this SiO2 estimate causes the Sum(all_oxides, TEs, ‘normative’ H2O, ‘normative’ CO2), where all_oxides include the ‘consensus’ SiO2 determined in this step, to exceed 100 wt%; in these cases, the SiO2 estimate is trimmed so that this sum is 100 wt%.

  • Step 6: Calculating total LOI. Finally, the LOI_rest, that is volatiles other than the ‘normative’ H2O and ‘normative’ CO2 calculated at Step 3 above, are calculated as the difference 100 wt% – Sum(all_oxides, TEs, ‘normative’ H2O, ‘normative’ CO2). This LOI_rest is likely to comprise H2O and CO2 in the amorphous phase as well as any other volatiles not specifically accounted for above. From here, total LOI or LOItot is calculated as Sum(‘normative’ H2O, ‘normative’ CO2, LOI_rest). Note that in a few cases where LOItot is zero it is replaced by 0.0001 wt% to allow log-transformation.
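Steps 2 to 6 can be sketched in a few lines of code. The sketch below is a hedged illustration rather than the authors' implementation: the function and dictionary names are hypothetical, and only the quartz, K-feldspar and pyroxene SiO2 multipliers are taken from the text. The kaolinite, gypsum and calcite multipliers are assumptions computed here from ideal formulas and are not the Table 3 values. Negative second estimates are not handled.

```python
# Mass fraction of SiO2, H2O and CO2 per mineral. Quartz, K-feldspar and
# pyroxene values are quoted in the text; the others are ASSUMED from ideal
# stoichiometry for illustration only.
SIO2_FRACTION = {
    "quartz": 1.0,
    "K-feldspar": 0.6476,
    "pyroxene": (0.483 + 0.5985 + 0.5549 + 0.5173) / 4,
    "kaolinite": 0.4655,  # assumed, Al2Si2O5(OH)4
}
H2O_FRACTION = {"kaolinite": 0.1396, "gypsum": 0.2093}  # assumed
CO2_FRACTION = {"calcite": 0.4397}  # assumed, CaCO3


def normative(minerals, fractions):
    """Sum the contribution of one component over all quantified minerals."""
    return sum(fractions.get(name, 0.0) * wt for name, wt in minerals.items())


def estimate_sio2_loi(minerals, other_oxides, trace_elements):
    """minerals: {name: wt%}; other_oxides and trace_elements in wt%."""
    # Steps 2 and 3: 'normative' SiO2, H2O and CO2 from mineralogy.
    sio2_min = normative(minerals, SIO2_FRACTION)
    h2o = normative(minerals, H2O_FRACTION)
    co2 = normative(minerals, CO2_FRACTION)

    # Step 4: second (maximum) SiO2 estimate by difference.
    known = other_oxides + trace_elements + h2o + co2
    sio2_max = 100.0 - known

    # Step 5: average, then trim so that the total does not exceed 100 wt%.
    sio2 = min(0.5 * (sio2_min + sio2_max), 100.0 - known)

    # Step 6: remaining gap is LOI_rest; total LOI is the sum of all three.
    loi_rest = 100.0 - (known + sio2)
    loi_tot = h2o + co2 + loi_rest
    if loi_tot <= 0.0:
        loi_tot = 0.0001  # replacement value stated in Step 6
    return {"sio2_min": sio2_min, "sio2_max": sio2_max,
            "sio2_consensus": sio2, "loi_rest": loi_rest, "loi_tot": loi_tot}


# Illustrative inputs loosely based on the quartz-rich sample 3808; the
# split between oxides and trace elements is hypothetical.
result = estimate_sio2_loi({"quartz": 98.5, "K-feldspar": 1.5},
                           other_oxides=0.19, trace_elements=0.01)
print(round(result["sio2_consensus"], 1), round(result["loi_tot"], 2))
```

With these illustrative inputs the consensus SiO2 comes out at about 99.6 wt%. Because the trim caps the consensus at the second estimate, it only takes effect when the first (mineralogical) estimate exceeds the second.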

Distributions of the SiO2 and LOI estimates

The resultant final estimates for SiO2 in the C horizon samples from the NASGL project have a distribution as represented in the Tukey boxplots (Tukey 1977) of Figure 2, which seem reasonable compared to the distribution of the other oxides. SiO2 is clearly the most abundant major oxide in the NASGL soil samples, as is both expected and consistent with other regions (e.g. Australia; see Caritat and Cooper 2011a). The distribution of LOI is also illustrated in Figure 2. Table 4 summarizes the statistics of the estimated SiO2 and LOI concentrations derived herein for both the A and C horizons.

Figure 3 shows the cumulative frequency distributions of all major oxides, including SiO2 and LOItot estimated here. Note that in Figure 3, the concentration data have been Box–Cox transformed (Box and Cox 1964) to improve normality and homoscedasticity according to
$$ y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln y_i & \text{if } \lambda = 0 \end{cases} $$
where the exponent λ is optimized for each variable yi and reported in Table 5.
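A minimal sketch of the Box–Cox transformation for a given exponent is shown here. It assumes the standard one-parameter form of Box and Cox (1964) and takes λ as an input; the optimization of λ per variable (Table 5) is not reproduced, and the function name is hypothetical.

```python
import math


def box_cox(y, lam):
    """One-parameter Box-Cox transform of a strictly positive value."""
    if y <= 0:
        # The transform is undefined for zero or negative concentrations,
        # which is why zero values must be replaced beforehand.
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


# Illustrative values only; lambda = 0.5 is an arbitrary choice.
print([round(box_cox(v, 0.5), 3) for v in (0.0001, 1.0, 50.0)])
```

The positivity requirement is the practical reason for replacing zero LOItot values by a small positive number before transformation.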

Minerals not specifically quantified in the NASGL project are assumed to be included in the category ‘other’, so for instance if soils contain halite (NaCl) the mineralogy still adds up to 100 wt% but the geochemistry will not explicitly account for halite. The contained Na will be part of the reported total Na, or converted Na2O wt%, but the Cl, generally not reported in an ICP-based analysis, is assumed to be part of the ‘missing’ composition (estimated LOI in this case).

Application to selected NASGL samples

Figure 4 shows the major oxides (including the SiO2 estimated as described above), TEs, ‘normative’ H2O, ‘normative’ CO2 and LOI_rest of five selected samples from the NASGL C horizon dataset. Those samples were deliberately chosen to span the range of soil compositions in the dataset: the sample from Site ID 7327 (California) is Al-rich, 972 (Texas) is Ca-rich, 444 (Maryland) is Fe-rich, 12779 (Colorado) is K-rich and 3808 (Florida) is Si-rich. Without the estimates for SiO2 and LOI (and its components), only between 0.2 (3808) and 54 wt% (972) of those samples would be geochemically characterized; the rest would be unknown. This unknown ‘gap’ is shown by the present estimation technique to comprise widely varying proportions of SiO2 (from silicates), H2O (mainly from silicates), CO2 (from carbonates) and other volatile phases (from the amorphous phase and possibly other volatile components). It is thus important to provide estimates for each sample that honour the known mineralogical characteristics rather than apply a one-size-fits-all estimation of these parameters. Table 6 shows the mineralogy and geochemistry, including the ‘gap’-filling SiO2 and LOI estimates, for these five samples.

For instance, sample 7327 contains significant clays (40.5 wt% kaolinite) and thus has not only elevated Al2O3, but also SiO2 and LOI (H2O) concentrations. Sample 972 holds significant carbonates (64.1 wt% calcite) as reflected not only by the elevated CaO, but also CO2 concentrations. Sample 444 comprises significant amorphous material (41 wt%) as well as notable clay (24.2 wt% of combined 14 Å clay and kaolinite), pyroxene, talc and hematite contents, imparting a significant Fe2O3tot, MgO, moderate SiO2 and relatively low LOI concentrations. Sample 12779 contains 60 wt% combined K-feldspar and plagioclase and some 10 Å clay, translating into an SiO2-, Al2O3- and K2O-rich geochemical makeup. Finally, sample 3808 contains 98.5 wt% quartz and 1.5 wt% K-feldspar, giving a geochemical composition overwhelmed by SiO2 (estimated at 99.6 wt%); it probably also contains trace amounts of anatase or other Ti-bearing phase(s), undetected by the XRD method applied, to account for (some of) the 0.13 wt% TiO2 reported geochemically.

Spatial distributions of the SiO2 in NASGL A and C horizons

Maps of the distributions of estimated SiO2 concentrations in the NASGL A and C horizons are shown in Figure 5a and b, respectively. The data are classified in 10 quantile (decile) classes and coloured as per the mapping convention of Smith et al. (2014, 2019). The spatial distributions (circles in Fig. 5a, b) show strong similarities with the interpolated quartz distribution maps in the A and C horizons (rasters in Fig. 5a, b from figures 140 and 141 in Smith et al. 2014, respectively), reflecting a dominant mineralogical control on the SiO2 concentrations.


Application of the inversion approach to another dataset

The proposed method to estimate the missing SiO2 and LOI data was validated by applying it to an Exploring for the Future (EFTF; http://www.ga.gov.au/eftf) dataset (as yet unpublished) from Australia, which comprises geochemistry by XRF (including SiO2) and ICP-MS, and mineralogy by XRD. The dataset of 260 samples from the National Geochemical Survey of Australia (NGSA; Caritat and Cooper 2011a; Caritat 2022) crosses the continent from the temperate coast of South Australia (SA), through semi-arid parts of the Northern Territory (NT) and Queensland (Qld), to the tropical Gulf of Carpentaria coast, defining the SA–Qld–NT study area (large crosses in Fig. 6). Thus a wide range of geological, geomorphological and climate conditions are intersected by this dataset, making it a suitable comparison to the NASGL dataset.

Figure 7 shows the correlation between measured and estimated SiO2 as per the recursive inversion method described herein for the SA–Qld–NT dataset. The correlation is linear, strong (R2 = 0.91), with a slope close to unity (0.96) and a small intercept (1.7 wt%). The distribution is also fairly homoscedastic. We interpret this to mean that the method to estimate SiO2 in NASGL should be robust and widely applicable.

Comparison with measured SiO2

A second validation approach was tested whereby the spatial distribution obtained for the estimated SiO2 in the C horizon of the NASGL samples was compared with measured SiO2 in an independent dataset. The most extensive available dataset with SiO2 concentrations in the USA to our knowledge is that of Shacklette and Boerngen (1984), albeit at a much lower spatial density than the NASGL project. They reported inorganic chemical analyses of soil and other regolith collected across the conterminous USA mostly during the 1960s and 1970s (no mineralogy is reported). The target medium was the subsoil at ∼20 cm below surface to avoid any surface contamination; this depth commonly is within the range of a soil's B horizon, a zone of element accumulation (Boerngen and Shacklette 1981). Although more than 1300 sites were sampled in total, only 407 were analysed for Si (by emission spectrography of the <2 mm grainsize fraction) in ‘Phase two’ (∼1969–1975) of the project (Shacklette and Boerngen 1984). The Si (wt%) concentrations were converted to SiO2 (wt%) before use here.

Firstly the empirical distribution functions of the ‘Shacklette and Boerngen’ and NASGL datasets were compared using a Kolmogorov–Smirnov (K-S) test of distribution similarity (Kolmogorov 1933) (Fig. 8). This non-parametric test quantifies the maximum distance D separating two cumulative distribution functions (here the empirical distribution functions of the two datasets) and compares it with a critical value (CV) scaled by the sample sizes n. The null hypothesis (H0) being tested is that the two populations are indistinguishable, assessed at a given significance level. The K-S test applied to the ‘Shacklette and Boerngen’ measured SiO2 concentrations and NASGL C horizon-estimated SiO2 concentrations yields D = 0.0557, which is smaller than CV = 0.0703; H0 therefore cannot be rejected at the 0.05 significance level (AAT Bioquest 2023).
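The K-S comparison can be sketched as below. This assumes the two-sample form of the test with the usual large-sample approximation for the critical value; the text does not state which variant was used, so this is an assumption, and the function names are hypothetical. The NASGL sample count of 4857 is also an assumption (the number of sites; the count retained after data cleaning is not given here).

```python
import math


def ks_two_sample_d(a, b):
    """Maximum distance between two empirical distribution functions."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value <= x (handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the two-sample critical value."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))


# 407 'Shacklette and Boerngen' Si analyses versus an ASSUMED 4857 NASGL
# samples; the result should be close to the CV quoted in the text.
print(round(ks_critical_value(407, 4857), 4))
```

H0 is retained when the computed D is smaller than the critical value, as is the case for the reported D = 0.0557.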

Secondly a more rigorous test than comparing the general distributions, namely checking spatial consistency of SiO2 values, was applied. It has to be cautioned that (1) we are comparing two different soil horizons, B horizon in the ‘Shacklette and Boerngen’ dataset v. C horizon in the NASGL dataset; and (2) soil heterogeneity is present at all scales (e.g. Pedersen et al. 2015), implying that comparing sample pairs distant by even a few metres can give substantially different concentration values. Nonetheless we extracted the closest NASGL site to each ‘Shacklette and Boerngen’ site and filtered out those pairs where the distance between the two was 0.04 degrees of latitude/longitude (∼4 km) or greater. The scatterplot and linear regression for the resulting subset are shown in Figure 9. The regression is surprisingly strong (R2 = 0.79) and with a slope close to unity (0.96). The relative standard deviation (RSD) on these pairs of samples (5%) is only marginally greater than the RSD on field duplicates obtained in the NGSA for SiO2 (4%; table 1 of Caritat and Cooper 2011b), which is remarkable given the spatial distance between these ‘Shacklette and Boerngen’ and NASGL sites (∼1 to 4 km) and their different sample media (B v. C horizons).
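The site-pairing step can be illustrated with a brute-force nearest-neighbour search. This is a sketch under stated assumptions: distance is taken as the Euclidean separation in decimal degrees (the text does not specify how the 0.04 degree threshold was measured), and the function name, site identifiers and coordinates are hypothetical.

```python
import math


def nearest_pairs(ref_sites, candidate_sites, max_deg=0.04):
    """Pair each reference site with its closest candidate site.

    Sites are (site_id, longitude, latitude) tuples in decimal degrees.
    Pairs separated by max_deg or more are discarded.
    """
    pairs = []
    for ref_id, ref_lon, ref_lat in ref_sites:
        def separation(site):
            return math.hypot(site[1] - ref_lon, site[2] - ref_lat)

        closest = min(candidate_sites, key=separation)
        dist = separation(closest)
        if dist < max_deg:
            pairs.append((ref_id, closest[0], dist))
    return pairs


# Hypothetical coordinates for illustration only.
shacklette = [("S1", -100.00, 40.00), ("S2", -90.00, 35.00)]
nasgl = [("N1", -100.01, 40.02), ("N2", -95.00, 38.00)]
for ref_id, cand_id, dist in nearest_pairs(shacklette, nasgl):
    print(ref_id, cand_id, round(dist, 4))
```

A spatial index would be preferable for larger datasets, but an exhaustive search is adequate for a few hundred reference sites.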

Overall, the above validation assessments provide confidence that the SiO2 estimates for the NASGL dataset, and by inference the recursive inversion methodology in general, compare favourably with ground-based measurements (‘Shacklette and Boerngen’ dataset) and mineralogical/geochemical data from a large independent study area (SA–Qld–NT dataset).

Uncertainty propagation

The uncertainty of the SiO2 estimates was investigated by propagating the errors as reported in Smith et al. (2013). As the SiO2 estimate is computed from a summation of measurements or elements ej, its uncertainty, uSiO2, was determined using the root-sum-squares method following Ellison et al. (1997) and Taylor (2005):
$$ u_{\mathrm{SiO_2}} = \sqrt{e_1^2 + e_2^2 + \dots + e_k^2} $$
where e1… ek are the errors on the k elements that make up the variable. If we assume the error e of any measurement to be equal to three standard deviations (SDs) of that measurement,
$$ e = 3\,\mathrm{SD} $$
we get for n minerals involved in a measurement:
$$ u_{\mathrm{SiO_2}} = \sqrt{n\,(3\,\mathrm{SD})^2} = 3\,\mathrm{SD}\,\sqrt{n} $$
The SiO2 estimates in the NASGL datasets rely on the quantification of 18 silicates (to determine ‘normative’ SiO2), 12 hydrated minerals (to determine ‘normative’ H2O), and three carbonates (to determine ‘normative’ CO2) (Table 3). Thus 33 minerals overall are included in the estimates. Conservatively utilizing an SD of 0.82 wt% for mineral quantification, the largest SD of any mineral reported by Smith et al. (2013, table 10), we derive
$$ u_{\mathrm{SiO_2}} = \sqrt{33 \times (3 \times 0.82)^2} = 3 \times 0.82 \times \sqrt{33} \approx 14.1 $$
which gives an uncertainty for the SiO2 estimates of 14.1 wt%. In comparison, the uncertainty of the XRF-based SiO2 quantification in the NGSA is estimated at 7.3 wt% (three times the RSD of 0.04 × 61.06 wt% quoted in table 1 of Caritat and Cooper 2011b).
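The arithmetic behind both uncertainty figures can be checked with a short script. It assumes, as the text does, an error of three SDs per mineral and the same conservative SD for all 33 minerals; the function name is hypothetical.

```python
import math


def rss_uncertainty(errors):
    """Root-sum-squares combination of independent errors."""
    return math.sqrt(sum(e * e for e in errors))


n_minerals = 18 + 12 + 3      # silicates + hydrated minerals + carbonates
sd_mineral = 0.82             # wt%, largest SD quoted from Smith et al. (2013)
error_per_mineral = 3 * sd_mineral

u_sio2 = rss_uncertainty([error_per_mineral] * n_minerals)

# XRF comparison: three times (RSD x mean SiO2) as quoted for the NGSA.
u_xrf = 3 * 0.04 * 61.06

print(round(u_sio2, 1), round(u_xrf, 1))
```

Because every term is identical, the root-sum-squares reduces to 3 × SD × √n, so the result grows with the square root of the number of minerals included.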


The current method of estimating SiO2 when missing from a dataset can be applied to other situations, e.g. where ICP analysis has been used and Si not determined. However, the methodology currently relies on mineralogical data being available for the same samples. Another limitation is that total or near-total geochemical analytical methods have to be used to ensure internal consistency with the mineralogical data; weak or partial digestion/leach techniques for sample preparation do not lend themselves directly to being compared with bulk mineralogy. Despite these limitations, there are many cases where (near-)total geochemical analysis and mineralogy have been determined, for instance in industry and government datasets.

Future work

In complementary work in progress, we are developing a machine-learning approach using linear regression and random forest algorithms to estimate SiO2 where it is missing, based on geochemical information alone, mineralogical information alone, or both combined. This method will be tested on various datasets, including the NASGL and SA–Qld–NT datasets, to ensure its universal applicability and will be reported separately (Grunsky et al. 2018).

The original geochemical and mineralogical data for soils of the conterminous USA (A and C horizon datasets) were downloaded from https://mrdata.usgs.gov/ds-801/. The ‘Shacklette and Boerngen’ dataset was downloaded from https://mrdata.usgs.gov/ussoils/. A worked example for the five selected samples of Figure 4 is available as a Microsoft Excel™ spreadsheet (NALG_Ch_oxides_with_estimated_SiO2_LOI_worked example.xlsx) here: https://doi.org/10.5281/zenodo.8191287. The new datasets including sample identification, coordinates, converted major oxide concentrations, and the concentration estimates for SiO2 and LOI in wt% for the A and C horizon datasets from the NASGL project are available as comma-separated value files (NALG_Ah_oxides_with_estimated_SiO2_LOI.csv and NALG_Ch_oxides_with_estimated_SiO2_LOI.csv) here: https://doi.org/10.5281/zenodo.8191287.

We provide a novel method for estimating the concentrations of silica (SiO2 wt%) and loss-on-ignition (LOI wt%) in the NASGL project datasets. These datasets include comprehensive elemental and mineralogical compositions, determined mostly by four-acid digestion ICP-AES or ICP-MS, depending on the element, and Rietveld refinement XRD, respectively. Unfortunately, neither Si/SiO2 nor LOI are quantified, both of which are significant components of most soils. Our estimation method combines the precision of the ICP determinations with the completeness of the XRD data. As the NASGL samples contain up to 95 wt% amorphous material of unknown geochemical or mineralogical composition, it is not possible to directly calculate SiO2 or LOI contents from mineralogy alone. However, a recursive inversion approach, i.e. calculating geochemistry from mineralogy, can be invoked to calculate minimum SiO2, H2O and CO2 concentrations. Thus, we inverted an estimate for SiO2 by adding up the SiO2 contributions from all Si-bearing minerals (silicates). This ‘normative’ SiO2 represents a minimum estimation of the total SiO2 in each sample. Similarly, we inverted estimates for H2O by adding up the H2O contributions from all OH-bearing minerals (hydrates), and for CO2 by adding up the CO2 contributions from all C-bearing minerals (carbonates). Combining the latter two components gives a minimum estimate for LOI. Thus, 100 wt% – (all major oxides from ICP + TEs from ICP + ‘normative’ H2O + ‘normative’ CO2) yields a maximum estimate of the total SiO2 in each sample. The final or ‘consensus’ SiO2 estimate is then calculated as the average between the two aforementioned estimates, trimmed as necessary to yield a total composition (all major oxides from ICP + estimated SiO2 + TEs from ICP + ‘normative’ H2O + ‘normative’ CO2) of no more than 100 wt%. 
For most samples, the above sum falls below 100 wt% and the difference is taken to represent LOI not otherwise accounted for in the quantified hydrate and carbonate minerals. The source of this LOI contribution likely includes H2O and CO2 in the amorphous phase as well as other volatile components present in soil. We examine the statistical distributions of the SiO2 and LOI estimates and validate the technique against a separate dataset from Australia where XRF, ICP and XRD data on the same samples exist. The correlation between predicted and observed SiO2 is deemed strong (R2 = 0.91). Further, we compared the NASGL C horizon SiO2 estimates with an independent dataset covering the conterminous USA, the ‘Shacklette and Boerngen’ dataset. The distributions of these two datasets are shown by a Kolmogorov–Smirnov test not to be statistically different. Spatially we demonstrate that the closest NASGL sites and ‘Shacklette and Boerngen’ sites have highly correlated SiO2 concentrations (R2 = 0.79). Together, these validation assessments give us the confidence to recommend the approach of combining geochemical and mineralogical datasets to estimate missing SiO2 and LOI in datasets elsewhere. However, as each situation is different, any estimation results ideally should be ground-truthed.

The estimation method and Australian validation study were conducted as part of Geoscience Australia's Exploring for the Future (EFTF; https://www.eftf.ga.gov.au/) programme, which provides precompetitive information to inform decision-making by government, community and industry on the sustainable development of Australia's mineral, energy and groundwater resources. By gathering, analysing and interpreting new and existing precompetitive geoscience data and knowledge, the EFTF is building a national picture of Australia's geology and resource potential. This leads to a strong economy, resilient society and sustainable environment for the benefit of all Australians. This includes supporting Australia's transition to net zero emissions, strong, sustainable resources and agriculture sectors, and economic opportunities and social benefits for Australia's regional and remote communities. The EFTF programme, which commenced in 2016, is an eight year, $225 m investment by the Australian Government. The authors appreciate comments on a draft version of this manuscript by Laurel G. Woodruff (United States Geological Survey) and Clemens Reimann (Geological Survey of Norway, retired), as well as internal reviews by Philip Main, Tara Webster and Anthony Schofield (Geoscience Australia). We thank journal referees Alecos Demetriades and Ryan Noble, as well as Editor-in-Chief Scott Wood, for their detailed and constructive reviews of the paper. PdC publishes with permission from the Chief Executive Officer, Geoscience Australia.

PdeC: conceptualization (lead), data curation (lead), formal analysis (lead), investigation (lead), methodology (lead), validation (lead), visualization (lead), writing – original draft (lead), writing – review & editing (lead); ECG: data curation (supporting), formal analysis (supporting), methodology (supporting), writing – review & editing (supporting); DBS: data curation (supporting), formal analysis (supporting), methodology (supporting), writing – review & editing (supporting)

This study was funded as part of Geoscience Australia's Exploring for the Future (EFTF; https://www.eftf.ga.gov.au/) programme.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

The datasets generated during and/or analysed during the current study are available in the Zenodo repository, https://doi.org/10.5281/zenodo.8191287

This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)