## ABSTRACT

The nonuniqueness of the solution to the geophysical inverse problem can lead to misinterpretation when characterizing the subsurface. To mitigate this problem, ground-truth information from excavations and wells can be used to improve, calibrate, and interpret inverted models. By quantitative interpretation, we refer to decision analysis based on probability theory, here focused on solving a classification problem. First, we present a probabilistic approach to classify the different types of materials or “categories” observed in borehole logs using multiple data sources: inverted 2D electrical resistivity tomography and induced polarization data and the positions (*x*, *z*) of these boreholes. Then, using Bayes’ rule and the permanence of ratios, we compute the joint conditional probabilities of each category, given all data sources in the whole inverted model domain. We validate this approach with synthetic data modeling for a complex anthropogenic-geologic scenario and using real data from an old landfill. Afterward, we assess the performance of the probabilistic approach for classification and compare it with the machine learning algorithm of multilayer perceptron (MLP). In addition, we analyze the effect that the different data sources and the number of boreholes (and their distribution) have on both approaches with the synthetic case. Our results indicate that the MLP performance is better for delineating the different categories where the lateral contrasts in the synthetic resistivity model are small. Nevertheless, the classification obtained with the probabilistic approach using real data seems to provide a more geologically realistic distribution. We conclude that the probabilistic approach is robust for classifying categories when high spatial heterogeneity is expected and when ground-truth data are limited or sparsely distributed.
Finally, this approach can be easily extended to integrate multiple geophysical methods and does not require the optimization of hyperparameters as for MLP.

## INTRODUCTION

Geophysical imaging has been widely used to derive insights into subsurface structures and processes by mapping physical parameters through noninvasive techniques. After processing and interpretation of geophysical measurements, this information can be translated into geologic structures, quantification of volumes, and geometries, or can be used to provide insight into groundwater processes (Romero-Ruiz et al., 2018; Whiteley et al., 2019; Slater and Binley, 2021). To obtain these images or physical models, an inverse problem, which is usually ill-posed and presents a nonunique solution, is solved (Aster et al., 2018). Inverse problems may be solved using prior information in the form of structural or geostatistical constraints in the higher dimensional space of the model parameters (Kaipio et al., 1999; Chasseriau and Chouteau, 2003; Caterina et al., 2014) and more recently in lower dimensional spaces using machine learning methods provided that generative models may be trained to enforce consistent spatial patterns (Lopez-Alvis et al., 2021, 2022). Interpretation of geophysical images can be achieved with several approaches ranging from a qualitative analysis that might be validated by correlation with ground-truth data (Yannah et al., 2017; Magiera et al., 2019) to automated quantitative processes to directly assist interpretation when a large amount of data are available, for example, using deep learning approaches (Wang et al., 2018) or when the translation from bulk geophysical properties to properties of interest is sufficiently linear (e.g., Hermans et al., 2012). In the following, we focus on the quantitative approaches to improve data interpretation.

In geosciences, there is a recent increase in the use of machine learning, deeply rooted in applied statistics, in which computational models are built using inference and pattern recognition (Dramsch, 2020). In this context, supervised and unsupervised learning algorithms of linear and nonlinear methods also have been developed and adapted (Scheidt et al., 2018).

For example, Moghadas and Badorreck (2019) used neural networks to successfully link time-lapse electrical resistivity tomography (ERT) data to soil moisture, collecting reference data via an excavated pit and using the reference electrical conductivity and temperature values as data sources or inputs to train the supervised learning algorithm. To solve a classification problem also using supervised learning, Lysdahl et al. (2022) applied the algorithm of multilayer perceptron (MLP) to extract the depth to bedrock from airborne electromagnetics and sparse drillings. Training data were pairs of known depth points and resistivity data. This approach was applied to a field case with postglacial geomorphology, and the authors concluded that the geologic complexity was the main limitation on the MLP performance.

Combining two or more geophysical methods based on different physical properties can greatly reduce the ambiguities inherent to each method (Hellman et al., 2017), improving the interpretation and geophysics-based characterization. Paasche et al. (2006) adopted a statistical approach to integrate the physical models from individually inverted georadar and seismic data into one multiparameter model and to estimate the spatial distribution of petrophysical parameters (from limited geophysical and petrophysical databases) using fuzzy c-means clustering. Another example of unsupervised learning to solve a classification problem is given by Whiteley et al. (2021), who use a Gaussian mixture model algorithm to classify geophysical data into cluster groups to build a ground model and characterize landslide materials. They use three geophysical variables as data sources (resistivity and P- and S-wave velocities) and one spatial variable (depth from the ground surface). In the context of landfill investigations, Inauen et al. (2020) apply several algorithms of supervised learning to classify geoelectric and seismic data according to the materials observed in several trial pits and boreholes. The main goal was to derive a model of a solid waste landfill, for which the algorithm of MLP presented a good classification performance.

In the literature, we find fewer applications for the interpretation of geophysical data based on statistics or probability theory exclusively. For example, Dewar and Knight (2020) develop a methodology for estimating the top of the saturation zone from airborne electromagnetic data (1D resistivity models) and measurements from nearby wells. The methodology included the optimization of two parameters: (1) a search radius to integrate resistivity data within this area and (2) statistical properties of the resistivity distribution that best captures the transition from an unsaturated to a saturated zone, i.e., the minimum, the maximum, the difference between the minimum and maximum, the mean, the difference between the 75th and the 25th percentiles, and the standard deviation.

In most of the cases discussed previously, the interpretation of geophysical data provides a single model representing the physical reality in which the uncertainty often is not considered. In this regard, the Bayesian framework has become one of the leading paradigms to quantify uncertainty in geophysical modeling, inversion, and data interpretation, which can be translated into more rigorous decision making in subsurface systems, especially under noisy data (Ray et al., 2018; Scheidt et al., 2018; Bobe et al., 2020; Parsekian et al., 2021). The Bayesian analysis or inference refers to all procedures that use the Bayes’ theorem, where a quantitative relation is introduced to link predefined knowledge to new observations, thus comprising the computation of posterior distributions of a set of priors given a likelihood function (Piana Agostinetti and Bodin, 2018). For instance, Wellmann et al. (2018) combine the preexisting geologic modeling with additional geologic considerations and gravity data simulations, applying the Markov chain Monte Carlo to evaluate and sample from the posterior distributions to obtain suitable geophysical models. This approach successfully addressed uncertainty and optimized the geologic model of a sandstone greenstone belt. More recently, Fossum et al. (2022) use the ensemble randomized maximum likelihood to update the subsurface uncertainty in earth models and simulate electromagnetic logs generated with generative adversarial networks and a forward deep neural network, respectively.

In line with the Bayesian framework, another approach for the interpretation of already inverted geophysical data was used by Hermans and Irving (2017), where the inverted parameters were expressed in terms of categories defined by the probability distributions of hydrofacies. They assess the use of ERT to identify and classify the hydrofacies in alluvial aquifers, using colocated inverted data and borehole records and integrating the effect of the spatial variation of the sensitivity.

In this contribution, we use a probabilistic approach as an alternative to performing a rapid quantitative interpretation by classifying geophysical data according to the materials observed in a limited number of colocated borehole logs (hereafter referred to as categories or classes). The method is based on the previously mentioned approach used by Hermans and Irving (2017). We extend it to account for more than one geophysical model (i.e., ERT and time-domain induced polarization [IP]) and to include spatial trends in the colocated data. For comparison, we also apply the supervised machine learning algorithm of MLP and include the same data sources or input for training. We apply both approaches in a synthetic case study and analyze the effect that the data sources and the number of boreholes (and their distribution) have on the probabilistic approach and MLP. Finally, we compare and validate the approaches using the geophysical data acquired in an old landfill with available colocated trial pits. Such systems are notably difficult to characterize with geophysics, due to the strong heterogeneity and physical contrasts encountered, and with boreholes, due to the increased contamination risks. We observed that the probabilistic approach is suitable to classify inverted models that are highly variable along the whole domain, with predictions consistent with the prior sampling information and with the prediction uncertainty integrated. Note that a robust characterization is crucial in decision making for sustainable management, e.g., estimating the internal structure of a landfill to assess the potential for resource recovery, to prevent or evaluate the associated environmental pollution, and to prioritize (re)development scenarios (Jones et al., 2018; Van De Vijver et al., 2020, 2021).
In the following, we first present the methods used for classification after inversion; the results are then presented for the synthetic case and the field data, followed by the “Discussion” and “Conclusion” sections.

## METHODS OF CLASSIFICATION: PROBABILISTIC APPROACH AND MLP

### Input data

To compare the probabilistic approach and MLP, we used the same input or training data. These are the colocated inverted resistivity $\rho $, chargeability *C*, the position (*x*, *z*) of the boreholes, and target categories. In these data, a relative cumulative sensitivity threshold is used to keep only parts of the tomograms ($\rho $ and *C*) that are sufficiently well covered (e.g., Beaujean et al., 2014). Consequently, this also contributes to keeping reliable training data and reduces misclassification in the final model (Hermans and Irving, 2017).
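As an illustration of this masking step, the sketch below keeps only the model cells whose relative cumulative sensitivity exceeds a threshold. It is a minimal sketch with hypothetical array names; in practice, the coverage array comes from the inversion output.

```python
import numpy as np

def mask_by_sensitivity(rho, charge, coverage, threshold):
    """Keep only model cells whose relative cumulative sensitivity
    (coverage) exceeds a threshold; the rest are set to NaN.
    All arrays share the same shape; names are illustrative."""
    keep = coverage >= threshold
    rho_m = np.where(keep, rho, np.nan)
    charge_m = np.where(keep, charge, np.nan)
    return rho_m, charge_m, keep

# toy example: four cells, two below the threshold
rho = np.array([100.0, 50.0, 20.0, 80.0])
charge = np.array([5.0, 3.0, 1.0, 4.0])
coverage = np.array([1.0, 0.5, 0.01, 0.02])
rho_m, charge_m, keep = mask_by_sensitivity(rho, charge, coverage, 0.1)
print(keep.sum())  # 2 cells retained
```

The retained cells then serve both as the training data (at borehole locations) and as the domain over which the classification is evaluated.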

### Probabilistic approach

In the classification of the probabilistic approach, we first use the training data to define individual probability distributions, and then we compute the joint conditional probabilities of each category, given $\rho $, *C*, *x*, and *z* in the whole domain of the inverted models. The results are given as probability maps for each predefined category, which can then be translated into classes. In the following sections, we first introduce the permanence of ratios, which is an alternative for computing joint conditional probabilities of different sources in the presence of data interdependence (Journel, 2002). Then, we describe the procedure of the probabilistic approach where the permanence of ratios is used.

#### Principle of permanence of ratios

Let *A* be an unknown event that can be assessed with two data events from different sources, *B* and *D*, through its conditional probability $P(A|B,D)$. For instance, *A* may represent a category, such as inert waste, and events *B* and *D* represent resistivity and chargeability. The easiest way to recombine these probabilities is to assume independence of the data events, in which case the joint probability is the product of the marginal probabilities. However, this is a strong hypothesis as *B* and *D* are related to the common event *A*. To take this into account, Journel (2002) proposes an alternative to combine probabilities of different sources based on the permanence of updating ratios while guaranteeing all limit conditions (e.g., $P(A|B,D) \in [0,1]$) even in the presence of complex data interdependence. The principle of permanence of ratios indicates that the rates or ratios of increments are typically more stable than the increments themselves.

For simplicity, let us consider only two data events from different sources *B* and *D*; then, the logistic-type ratio of the marginal probability of the unknown event *A* is

$$a = \frac{1 - P(A)}{P(A)} \in [0, \infty], \tag{1}$$

where $P(A)$ is the prior probability of *A*. And similarly, the ratios of the single data event-conditioned probabilities are

$$b = \frac{1 - P(A|B)}{P(A|B)}, \qquad d = \frac{1 - P(A|D)}{P(A|D)}. \tag{2}$$

The ratio *a* can be seen as a measure of prior uncertainty about *A*: *a* = 0 if *A* is certain to occur and $a = \infty$ if *A* is an impossible event. Similarly, *d* can be seen as the distance to *A* occurring after observing the data event *D*. The ratio *d*/*a* is then the contribution of *D* to that distance starting from the prior distance *a*. Finally,

$$X = \frac{1 - P(A|B,D)}{P(A|B,D)} \tag{3}$$

would be the distance to *A* occurring after observing events *B* and *D*, and the ratio *X*/*b* is the incremental contribution of *D* starting from the distance *b*. The permanence of ratios assumes

$$\frac{X}{b} = \frac{d}{a},$$

i.e., that the incremental contribution of *D* to the knowledge of *A* is the same after or before knowing *B*. Then, the joint conditional probability of the two events *B* and *D* can be expressed as $P(A|B,D) = 1/(1+X) = a/(a + bd)$.

This result can be generalized to any number *n* of data events $G_i$, $i = 1, \ldots, n$. Denoting by $|G_i$, $i = 1, \ldots, n$ the joint conditioning to all *n* data events, the conditional probability provided by a succession of $(n-1)$ permanences of ratios is

$$P(A|G_i, i = 1, \ldots, n) = \frac{1}{1 + X}, \qquad X = a \prod_{i=1}^{n} \frac{g_i}{a}, \qquad g_i = \frac{1 - P(A|G_i)}{P(A|G_i)}, \tag{4}$$

which only requires the *n* elementary single data event-conditioned probabilities $P(A|G_i)$, which can be evaluated independently one from another using colocated data.
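The permanence-of-ratios combination can be sketched in a few lines of Python; this is a minimal sketch with illustrative variable names, not the authors' code.

```python
def joint_conditional(prior, conditionals):
    """Combine single-source conditional probabilities P(A|G_i) into
    P(A|G_1,...,G_n) with Journel's (2002) permanence of ratios.
    prior: P(A); conditionals: list of P(A|G_i)."""
    a = (1.0 - prior) / prior          # prior distance to A
    x = a
    for p in conditionals:
        g = (1.0 - p) / p              # distance to A given G_i
        x *= g / a                     # permanence of updating ratios
    return 1.0 / (1.0 + x)            # always in [0, 1]

# with a single data event, the formula reduces to P(A|G1)
p = joint_conditional(0.3, [0.6])
print(round(p, 6))  # 0.6
```

Note that, unlike the naive product of marginals, the result is guaranteed to stay in [0, 1] regardless of data interdependence.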

#### The procedure of the probabilistic approach

The goal of the procedure is to compute the joint conditional probability of each category $A_i$, given the data sources $\rho $, *C*, *x*, and *z* in the whole inverted model domain. In the first step, we use the training data to estimate the prior probability $P(A_i)$ of each category and to define the individual distributions of $\rho $, *C*, *x*, and *z* given each category. In the next step, we compute the conditional probability of each material using Bayes’ rule. For instance, to compute the conditional probability of a category $A_i$, given the resistivity, we use

$$P(A_i|\rho) = \frac{P(\rho|A_i)P(A_i)}{\sum_j P(\rho|A_j)P(A_j)}, \tag{7}$$

where $P(\rho|A_i)$ is estimated from the colocated training data. The same expression is applied for *C*, *x*, or *z*.

At this stage, we need to combine the prior probabilities $P(A_i)$ with the conditional probabilities of $A_i$ given the data sets into the joint conditional probabilities $P(A_i|\rho,C,x,z)$. To this aim, we use equation 4, where the event *A* becomes the categories $A_i$, *n* = 4, and the data events $G_i$ are $\rho $, *C*, *x*, and *z*. Therefore, the ratios $g_i = (1 - P(A|G_i))/P(A|G_i)$ will be given in terms of the marginal probabilities $P(A_i|\rho)$, $P(A_i|C)$, $P(A_i|x)$, and $P(A_i|z)$ for each category. The joint conditional probabilities are then normalized (divided by $\sum_i P(A_i|\rho,C,x,z)$) to ensure the closure condition, i.e., $P(A|\cdot) + P(\tilde{A}|\cdot) = 1$. Then, the results are given in terms of probability maps for each category in the whole inverted model domain and therefore may be used to assess the interpretation uncertainty. Finally, we also may derive a map in terms of the categories by comparing the normalized joint conditional probabilities of $A_i$ and selecting the category corresponding to the largest probability value, i.e., the classification model.
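For one model cell, this combination and normalization step might be sketched as follows; the marginal conditional probabilities and priors are hypothetical numbers, assumed to have been evaluated beforehand with Bayes' rule.

```python
import numpy as np

def classify_cell(priors, cond_probs):
    """Normalized joint conditional probabilities for all categories in
    one model cell, and the winning class.
    priors[i] = P(A_i); cond_probs[i] = [P(A_i|rho), P(A_i|C), P(A_i|x), P(A_i|z)]."""
    joint = []
    for prior, conds in zip(priors, cond_probs):
        a = (1 - prior) / prior              # prior distance (equation 1)
        x = a
        for p in conds:
            x *= ((1 - p) / p) / a           # permanence of ratios (equation 4)
        joint.append(1 / (1 + x))
    joint = np.asarray(joint)
    joint /= joint.sum()                     # closure: probabilities sum to one
    return joint, int(np.argmax(joint))

# two hypothetical categories; the first is favored by all four data sources
probs, label = classify_cell(
    [0.75, 0.25],
    [[0.9, 0.8, 0.7, 0.9], [0.1, 0.2, 0.3, 0.1]],
)
print(label)  # 0
```

Applying this cell by cell over the retained part of the tomograms yields both the probability maps and the classification model.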

### Supervised machine learning: MLP

Classification and regression methods are part of statistical learning and machine learning for which numerous methods have been developed, from simple linear regression to nonlinear methods such as neural networks or deep learning (Scheidt et al., 2018). Here, we focus on the multiclass classification problem, where we want to predict discrete class labels or categories for unlabeled patterns based on observations. We used the algorithm of MLP or a feedforward neural network, which proved to have a good performance for classification in the context of landfill investigations (Inauen et al., 2020).

#### Description of MLP

As explained by Goodfellow et al. (2016), the goal of MLP is to approximate some function $f^*$. For a classifier, the function $A_i = f^*(\cdot)$ maps input data $(\cdot)$ to a category $A_i$. This algorithm defines a mapping $A_i = f(\cdot, \theta)$ and learns the value of the parameters $\theta$ (weight and bias coefficients of the transformation function) that result in the best approximation. MLPs or feedforward neural networks are models where the information flows through the function from the input data, through the intermediate computations (linear and nonlinear data transformations followed by an activation function) used to define $f$, and finally to the output $A_i$. They are called networks because they can be composed of many different functions connected in a chain, i.e., $f(\cdot) = f^{(3)}(f^{(2)}(f^{(1)}(\cdot)))$, where superscript 1 refers to the first layer of the network, 2 refers to the second layer, and the final layer is called the output layer. The overall length of the chain gives the depth of the model: the more layers, the “deeper” the model. In this contribution, we are dealing with a small amount of geophysical data; thus, as shown in the next sections, a simple neural network (few layers in the chain) proves to be sufficient.

The neural network makes use of training data to drive $f({\xb7})$ to match $f*$. These data provide approximations of $f*$ evaluated at different training points, which are accompanied by a category $(Ai)$. Then, the learning algorithm decides how to use the other layers to produce the desired output, and as we cannot see the intermediate output of each of these layers, they are referred to as the hidden layers. Finally, each hidden layer of the network is composed of several units or neurons that can act in parallel representing a vector-to-scalar function.

In the multiclass classification, the output layer receives the values from the last hidden layer and transforms them into different classes, commonly with the Softmax function. This activation function of the last layer normalizes the data and transforms them into an output probability vector, based on which the output classes are selected (e.g., Williams and Barber, 1998). Therefore, the output is quite similar to that of the probabilistic approach.
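The Softmax transformation itself is a one-liner; the sketch below shows how real-valued output scores become a class probability vector (the scores are hypothetical).

```python
import numpy as np

def softmax(scores):
    """Softmax used in the MLP output layer: turns real-valued scores
    into a probability vector over the classes."""
    e = np.exp(scores - np.max(scores))  # shift for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.argmax())  # 0 -> the class with the largest score wins
```

As with the probabilistic approach, the winning class is the one with the largest probability, and the probability values themselves can be mapped to transparency to visualize uncertainty.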

#### MLP architecture

To design and optimize the architecture of MLP in terms of the hyperparameters (such as the number of hidden layers, neurons, or regularization), the total training data are divided into a validation data set (10%–20%) and a training data set (80%–90%). The partition is done randomly but preserves the relative frequency of the categories.

To tune the hyperparameters, we trained the MLP algorithm using combinations of different numbers of hidden layers (from 1 to 10), numbers of neurons per layer (1–100), solvers for weight optimization, and activation functions for the hidden layers, computing the accuracy score (the fraction of correct predictions) on the validation data set. Then, we select three MLP architectures from which the highest scores were obtained and compare them with the accuracy scores of the prediction in the training data set for several regularization values. We select the regularization parameter where the gap between the accuracies (from validation and training) is reduced while still preserving a relatively large accuracy score, i.e., generalization. In addition, as we have a multiclass problem, we first applied one-hot encoding to define each class (Potdar et al., 2017; Fu et al., 2019; Liu et al., 2021). This means we encoded the categories as a binary (one-hot) numeric array. Once the hyperparameters are selected and the architecture of the algorithm is defined, the training data set is the same as the one used for the probabilistic approach (or validation plus training data as indicated here).
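The tuning loop described above might be sketched with scikit-learn as follows. The data are a random stand-in for the colocated {ρ, C, x, z} matrix, and the candidate grids are illustrative, much smaller than the full search described in the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# toy stand-in for the colocated training data: three well-separated classes
X = rng.normal(size=(150, 4)) + np.repeat(np.arange(3)[:, None] * 3.0, 50, axis=0)
y = np.repeat(np.arange(3), 50)

# stratified split preserves the relative frequency of the categories
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

best = None
for layers in [(10,), (50,), (100, 100)]:     # candidate architectures
    for alpha in [0.01, 0.15, 1.0]:           # candidate L2 regularization
        clf = MLPClassifier(hidden_layer_sizes=layers, alpha=alpha,
                            solver="adam", activation="relu",
                            max_iter=2000, random_state=0).fit(X_tr, y_tr)
        score = clf.score(X_val, y_val)       # validation accuracy
        if best is None or score > best[0]:
            best = (score, layers, alpha)
print(best[1], best[2])
```

Note that scikit-learn's `MLPClassifier` handles the label encoding internally, so the explicit one-hot step is only needed when the framework requires it.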

### Classification performance assessment

## RESULTS

### Synthetic case study

#### Model generation and inversion

The synthetic model is inspired by a real near-surface scenario composed of anthropogenic materials deposited on a limestone quarry. The data were simulated for resistivity and chargeability distributions composed of five regions derived from nonorganic waste deposits, soil, lime, and backfill on a limestone quarry (Figure 1). In this geometry, we defined a surficial and continuous layer of waste, as this material was observed on the ground surface in the real landfill. We included two types of backfills: neutral soil and crushed limestone (here referred to as the backfill) to investigate if these materials were discriminated using ERT and IP data. Then, we defined two bodies of lime at different positions (*x*, *z*) and with different thicknesses to test the effect that including spatial trends has on the classification approaches. Finally, the upper limit of the bedrock has a step-like shape, which might be close to a limestone quarry structure. The resistivity and chargeability values that we used for the different regions are shown in Table 1. Then, similar to the field measurements, we created a dipole-dipole acquisition scheme with 64 electrodes spaced 1.5 m. For the numerical modeling of ERT and time-domain IP data sets, we used the open-source library pyBERT, which is based on the framework of pyGIMLI (Rücker et al., 2017).

The ERT data were modeled by adding a 3% voltage-dependent noise plus a 1 μV absolute error (e.g., Costall et al., 2020). Then, the apparent chargeability was modeled following Seigel’s formulation, carrying out two DC resistivity forward simulations: one with the resistivity of the medium and one with a resistivity decreased by the intrinsic chargeability (chargeability model, Table 1) (Seigel, 1959; Oldenburg and Li, 1994).
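A sketch of this noise model, assuming Gaussian noise whose standard deviation is the sum of the relative and absolute terms (the function name and array values are ours):

```python
import numpy as np

def add_noise(voltages, rel=0.03, abs_err=1e-6, seed=0):
    """Gaussian noise with a 3% voltage-dependent part plus a
    1 uV absolute floor (after, e.g., Costall et al., 2020)."""
    rng = np.random.default_rng(seed)
    sigma = rel * np.abs(voltages) + abs_err
    return voltages + rng.normal(0.0, sigma)

u = np.full(1000, 0.1)             # 100 mV synthetic readings
noisy = add_noise(u)
print(np.std(noisy - u) < 0.01)    # True: scatter on the order of 3 mV
```

Subtracting the clean from the noisy synthetic data, as done for the inversion, recovers an estimate of exactly this noise term.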

Finally, the synthetic data were inverted using the commercial software RES2DINV (Loke, 1997, 2004) to avoid the pitfall of using the same forward solver in the reconstruction algorithm (Lionheart, 2004). Here, we incorporated the data noise estimate for the apparent resistivity by subtracting the synthetic data modeled with and without added noise. We used a robust least-squares inversion with the Gauss-Newton method and an initial damping factor of 0.25. The inverted ERT and TDIP models are shown in Figure 2 together with the normalized sensitivity represented in the logarithmic scale and the real interfaces from Figure 1.

In the resistivity model, low values delineate the shallower lime deposit but the deeper lime cannot be imaged. There is no clear contrast of resistivity between the heterogeneous waste and the soil and backfill underneath. The upper limit of the bedrock is better imaged from $x \sim 20$ m to the end of the profile. Nevertheless, the resistivity values of the bedrock have a strong lateral variation as an effect of the regularization (there was a tradeoff between the damping factor and the root-mean-square [rms] error) and low sensitivity.

In the chargeability model, the surficial layer of waste is well delineated with large values. Nevertheless, the horizontal interface of the backfill/soil and the lime cannot be clearly distinguished. Artifacts of large chargeability are present at the locations of the lime, and larger artifacts appear in the bedrock area centered at $x \sim 10$ and 30 m.

For the assessment of the inverted models’ reliability, we used the normalized sensitivity in logarithmic scale (hereafter referred to as the sensitivity), which shows how the data are influenced by the respective resistivity of the model cells. In Figure 2, we can observe a general sensitivity decrease with depth, particularly below the shallowest lime deposit.

In addition, we present the crossplots of the inverted resistivity $(\rho)$ and chargeability $(C)$ values (Figure 3). For comparison, we also plot the mean of the inverted data $\mu_i = (\rho_\mu, C_\mu)$ together with the initial values used for modeling (Table 1) for each category. First, we can notice that the mean of the inverted data for the bedrock presents a largely underestimated value of resistivity compared with the initial value. The second category where we can see a large variation is the lime, where the mean of the inverted resistivity was larger than the initial modeling values. In addition, we notice that all the category clusters largely overlap, especially the bedrock and lime, whose $\rho$ and $C$ values are widely distributed. This gives an insight into the ability of ERT and IP to resolve the features of this complex anthropogenic scenario and into the uncertainty associated with the inversion process.

#### Synthetic borehole sampling

In this synthetic case, we assume a sampling scenario composed of several equidistant boreholes along the inverted 2D section. Notice that in real study cases, we want to reduce the number of excavations to mitigate costs and environmental and health risks. Then, to select and justify an optimum number of boreholes, we perform a statistical analysis. We used the mean of the inverted data for each class $i$, i.e., $\mu_i = (\rho_\mu, C_\mu)$ (see Figure 3, represented with triangles), and computed the mean of the inverted data within a variable number of boreholes *b*, represented as $\mu_i(b)$ for each class $i$. Then, we compute the summation of the difference between $\mu_i$ and $\mu_i(b)$ over all the classes $i$, i.e., $\sum_i (\mu_i - \mu_i(b))$. Figure 4 shows the plot of this summation versus the number of boreholes *b*. As *b* increases (and tends to cover the entire domain), the summation approaches zero as $\mu_i(b) \rightarrow \mu_i$. The first points of this plot vary depending on the location of the selected boreholes, and for $b < 6$, the summation decreases more significantly. Therefore, in the following sections, we start by using five boreholes equidistantly distributed (Figure 5) and test the effect of changing *b* in the probabilistic approach and the MLP model.
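This selection criterion can be illustrated with a toy implementation; absolute differences are summed here for readability (per component, the signed sum of the text behaves the same way near zero), and the values, labels, and borehole mask are hypothetical.

```python
import numpy as np

def mean_misfit(values, labels, mask):
    """Sum over classes of the difference between the class mean of all
    colocated data (values) and the class mean within a borehole subset
    (mask). Illustrative of the selection criterion, not the authors' code."""
    total = 0.0
    for c in np.unique(labels):
        mu_all = values[labels == c].mean()
        sel = (labels == c) & mask
        mu_b = values[sel].mean() if sel.any() else 0.0
        total += abs(mu_all - mu_b)
    return total

vals = np.array([1.0, 1.2, 5.0, 5.4, 9.0, 9.2])
labs = np.array([0, 0, 1, 1, 2, 2])
all_holes = np.ones(6, dtype=bool)
print(mean_misfit(vals, labs, all_holes))  # 0.0 when every sample is used
```

Dropping samples from the mask makes the misfit grow, which is the behavior exploited to pick the smallest acceptable number of boreholes.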

#### Interpretation — Classes prediction using a probabilistic approach

The training data were composed of $\{\rho, C, x, z\}$ at the borehole locations, which resulted in a matrix of 158 × 4, and their respective known categories (a vector of 158 × 1). In this case, we used a sensitivity threshold of 10^{−1.7} to keep only the parts of the tomograms that are sufficiently reliable. The threshold was chosen based on the maxima of the sensitivity gradient, taking the average sensitivity values located beneath the area of the shallowest local maxima (below the conductive zone corresponding to the lime deposit). This threshold leads to the use of 38% of the data, which is consistent with the intended use of this approach on heterogeneous models with few reliable data.

For each category observed in the borehole logs, $A_i$ = soil, waste, backfill, lime, and bedrock, we estimate a prior probability $P(A_i)$ based on the area that they occupy in the boreholes (see Figure 5). The proportions of $A_i$ are 9.57%, 5.5%, 2.5%, 7.3%, and 74.9%, respectively. Note that if the boreholes are limited in depth, one would naturally assume the vertical continuity of the bedrock once encountered. Then, we computed unimodal Gaussian distributions of the training data given each category $A_i$ and computed the corresponding conditional probabilities according to equation 7. We selected this type of distribution as it can integrate the overall uncertainty from data noise, inversion artifacts, and scarcity of the colocated data. If many colocated data are available, empirical distributions can be built without requiring any assumption about their shape. Figure 6 shows the conditional probabilities with and without considering the sensitivity threshold.
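The Gaussian-likelihood version of equation 7 can be sketched as follows, with toy means, standard deviations, and priors standing in for the values fitted to the borehole data.

```python
import numpy as np

def gaussian_pdf(v, mu, sigma):
    """Unimodal Gaussian density fitted per category."""
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def cond_prob(v, priors, mus, sigmas):
    """Bayes' rule with Gaussian likelihoods:
    P(A_i|v) = p(v|A_i) P(A_i) / sum_j p(v|A_j) P(A_j)."""
    like = np.array([gaussian_pdf(v, m, s) for m, s in zip(mus, sigmas)])
    post = like * np.asarray(priors)
    return post / post.sum()

# two hypothetical categories with distinct log-resistivity means
post = cond_prob(2.0, priors=[0.75, 0.25], mus=[2.5, 1.0], sigmas=[0.5, 0.5])
print(post.argmax())  # 0 -> the high-resistivity category
```

The same evaluation is repeated for each data source (ρ, C, x, z) before the permanence-of-ratios combination.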

The distributions of Figure 6 show the impact of the large prior probability considered for the bedrock. This category presents the largest probability values for $\log_{10} \rho > 0.5$, for $\log_{10} C > 0$, at the largest depths ($z < -6$ m), and nearly along the whole model domain in *x*. Yet, it can be observed that the resistivity model was able to discriminate between the bedrock and the lime. With the chargeability model, we can only discriminate the soil from the bedrock, due to the heterogeneities of large values distributed in the bedrock. This is the reason why it was not possible to clearly distinguish the waste (the material with the largest chargeability modeling values) from the other categories.

The conditional probabilities given the spatial coordinates show a trend on the vertical (*z*) and lateral (*x*) distribution of the materials according to the location of the boreholes, i.e., the probabilities are impacted by the spatial distribution and the number of boreholes. Given the depths, several categories are clearly resolved: bedrock at the largest depths, the soil at intermediate depths, and waste at the shallowest zone. However, these distributions cannot discriminate between different categories at similar depths. In the distributions given the distance *x*, we can roughly differentiate between the soil (maximum probability at shorter *x*) and the backfill with larger probabilities at the end of the profile. Note the impact that the small sampled region of backfill has on its conditional probability (large probabilities in a reduced range of *x*).

Afterward, we computed the joint conditional probabilities for each category $P(A_i|\rho,C,x,z)$ using equation 4 and normalized them so that they sum to one. Hereafter, we refer to the normalized joint conditional probability as the joint probability. We work with the ratios of each data event, $g_i = (1 - P(A|G_i))/P(A|G_i)$, whose corresponding conditional probabilities can be independently estimated, i.e., $P(A_i|\rho)$, $P(A_i|C)$, $P(A_i|x)$, and $P(A_i|z)$. The results are presented as probability maps; see, for instance, Figure 7, where the joint probabilities of the waste and the bedrock are represented in the whole model domain. The surficial layer of waste was accurately delineated, whereas the upper limit of the bedrock was overestimated at the beginning of the profile ($x < 20$ m).

We derived a map in terms of the categories by comparing the joint probabilities of the materials and selecting the category corresponding to the largest probability value (Figure 8). In this map, we added transparency to indicate the decrease in probability values: total opacity represents a probability of 1, and the strongest transparency represents a minimum probability of 0.25 for four categories. Finally, we computed the classification scores by comparing the results with the test data, i.e., the categories of the synthetic model (excluding the data at the boreholes). We obtained an accuracy score of 0.87, and the confusion matrix is shown in Figure 8.
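The accuracy score and confusion matrix can be obtained directly with scikit-learn; the labels below are hypothetical cell-by-cell predictions, not the actual model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical true vs. predicted categories for five model cells
y_true = np.array(["waste", "soil", "bedrock", "bedrock", "lime"])
y_pred = np.array(["waste", "soil", "bedrock", "lime", "lime"])

labels = ["waste", "soil", "lime", "bedrock"]
acc = accuracy_score(y_true, y_pred)            # fraction of correct cells
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(acc)  # 0.8
```

Rows of `cm` are the true categories and columns the predicted ones, so off-diagonal entries (here, one bedrock cell predicted as lime) identify the most frequent misclassifications.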

First, we can notice that the bedrock could be predicted along the entire model and especially in the deeper areas, although its occurrence was slightly overestimated. The waste deposit could be accurately delineated, whereas the soil and the lime deposit in the central part of the profile were only roughly delimited laterally and vertically. This is reflected in the confusion matrix, as we found that the bedrock obtained the largest number of correct predictions, followed by the categories of waste, soil, and lime. The backfill could only be predicted in the area immediately around the borehole. This is mostly an effect of integrating the horizontal trend *x* of the boreholes in the joint probabilities, and this is the reason why, in the confusion matrix, the lowest percentage of correct classifications corresponds to the backfill. The largest percentage of incorrectly predicted categories corresponded to the backfill, which was misclassified as bedrock.

The second lime body, close to the origin of the profile, could not be detected, first because the inverted geophysical models could not resolve this feature and second because this category has considerably lower prior probability values compared with the bedrock. This is the reason why this zone was predicted as bedrock.

The transparency added in Figure 8 allows us to identify the zones where there is a larger uncertainty in defining the categories, e.g., the lateral interface between the soil and the lime.

#### Interpretation — Classes prediction using MLP

We apply the MLP algorithm using the Python library scikit-learn (Pedregosa et al., 2011). The training data were composed of $\{\rho, C, x, z\}$ at the borehole locations, resulting in a matrix of 158 × 4, and their respective known categories (a vector of 158 × 1). Similar to the probabilistic approach, we apply a sensitivity threshold of $10^{-1.7}$ to the selected training data. To optimize the architecture of the MLP, we use a validation data set composed of 15% of the total training data; the remaining 85% of the data are used to train the algorithm. Several combinations of hyperparameters showed equally large accuracy scores. From the hyperparameters with the highest scores, we selected the simplest: two hidden layers of 100 neurons each, a regularization parameter of 0.15, a stochastic gradient-based optimizer, and the rectified linear unit as the activation function of the hidden layers.
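The setup above can be sketched with scikit-learn as follows. The hyperparameters follow the text, but the training matrix is random stand-in data (the real {ρ, *C*, *x*, *z*} values at the boreholes are not reproduced here).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(158, 4))       # stand-in for {rho, C, x, z} at the boreholes
y = rng.integers(0, 5, size=158)    # stand-in for the five categories

# Hold out 15% of the training data to validate the architecture
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.15, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(100, 100),  # two hidden layers of 100 neurons
                    alpha=0.15,                     # L2 regularization parameter
                    solver="sgd",                   # stochastic gradient-based optimizer
                    activation="relu",              # rectified linear unit
                    max_iter=1000,
                    random_state=0)
clf.fit(X_tr, y_tr)
val_score = clf.score(X_val, y_val)
```

With random labels the validation score is meaningless; the sketch only shows how the architecture selection in the text maps onto `MLPClassifier` parameters.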

At this step, we used the training data and the validation data to train the algorithm, i.e., the same input data as for the probabilistic approach. The category predictions in the whole model domain are shown in Figure 8, to which we also added transparency representing the probability values from the activation function of the output layer (Softmax). Total opacity represents a probability of one, whereas total transparency represents a probability of zero. In Figure 8, we can see that the waste, the soil, one lime deposit, and the backfill were well delineated. The lime body located at the origin of the profile was partially imaged. To assess the performance of this algorithm, we also used the true model defined in Figure 1 as the test data set (excluding the data at the boreholes) and obtained an accuracy score of 0.95. Figure 8 shows the confusion matrix, where the largest numbers of correct predictions are those for waste, bedrock, and soil. The category that was most often incorrectly classified was the lime (predicted as bedrock). Yet, we can observe that the classification of the lime and backfill improved using the MLP.

#### Effect of data sources as input

In this section, we show the effect that the data sources, or elements of the training data, have on the probabilistic approach and the MLP. Figures 9, 10, and 11 show the category predictions of both methods and the respective confusion matrices when we use only the inverted resistivity data $\rho $ colocated with the boreholes, the inverted resistivity together with the inverted chargeability data ($\rho $ and *C*), and these two inverted data sets together with the depth *z*, respectively.

First, we can notice that when we only use $\rho $, or $\rho $ and *C* (with a sensitivity threshold of $10^{-1.7}$), as training data for the MLP, the predictions are strongly influenced by the artifacts from the model inversion, leading to several misclassified zones (see Figures 9 and 10). Some categories remain correctly identified, such as the waste deposit and part of the soil. The probabilistic approach, in contrast, overestimates the distribution of the bedrock (largest prior probability). It partially distinguishes one lime deposit when using only $\rho $ and delineates most of the waste layer and part of the soil when using $\rho $ and *C* (Figures 9 and 10). Although the probabilistic approach is not able to predict all the categories in the whole domain, the results are still in line with the material proportions estimated from sampling and thus more realistic. This is not the case for the MLP, where the waste, soil, backfill, and lime are predicted at larger depths.

When we additionally use the depth from the colocated boreholes (i.e., $\rho $, *C*, and *z*), the results of both methods improve considerably (see Figure 11), and the categories are overall well delineated. Yet, the probabilistic approach indicates a larger uncertainty (more transparency) in the deposits of backfill and, partially, the soil. In addition, the classifications of both methods present a few locations where the soil and backfill are misclassified. These misclassifications are reduced when we use all the data sets ($\rho $, *C*, *z*, and *x*), especially for the MLP, where the soil is then only predicted at *x* < 50 m and the backfill only at *x* > 75 m (see Figure 8).

The probabilistic approach behaves differently when using {$\rho $, *C*, *z*} and {$\rho $, *C*, *z*, *x*}. The method is better at classifying the backfill deposit when using only {$\rho $, *C*, *z*} (accuracy score of 0.89). However, when *x* is included, the probabilistic approach is strongly impacted: even though misclassification between the soil and backfill is reduced at a few locations, the backfill is only roughly resolved in the area of a borehole and the accuracy score drops to 0.87.

In general, including spatial information in the training data should be done carefully. In the probabilistic approach, the conditional probabilities are clearly impacted by the spatial distribution of the boreholes. Therefore, highly localized sampling of certain materials may lead to small classification zones in the immediate vicinity of a borehole. Although the MLP presents large accuracy scores using the spatial training data, which indicates that the distribution of the boreholes reflected the real material distribution of the synthetic case, this is rarely the case in the field (Gahegan, 2000; Cracknell and Reading, 2014; Baasch et al., 2018), especially in heterogeneous landfills where abrupt vertical and lateral variations of materials may be present.
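The data-source experiments above amount to retraining on column subsets of the training matrix. A sketch with random stand-in data follows; the column order {ρ, *C*, *x*, *z*} and the training-score evaluation are our assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(158, 4))      # assumed column order: rho, C, x, z (stand-in data)
y = rng.integers(0, 5, size=158)   # stand-in categories

feature_sets = {
    "rho":        [0],
    "rho,C":      [0, 1],
    "rho,C,z":    [0, 1, 3],
    "rho,C,z,x":  [0, 1, 3, 2],
}

scores = {}
for name, cols in feature_sets.items():
    clf = MLPClassifier(hidden_layer_sizes=(100, 100), alpha=0.15,
                        solver="sgd", activation="relu",
                        max_iter=500, random_state=0)
    clf.fit(X[:, cols], y)
    scores[name] = clf.score(X[:, cols], y)  # training score, for illustration only
```

In the study itself, the scores would be computed against the held-out synthetic model rather than the training data.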

#### Effect of borehole sampling

In this section, we assess the effect that the number of available boreholes and their distribution have on both methods. We use the training data set {$\rho $, *C*, *z*}, which led to a larger accuracy score in the probabilistic approach compared with {$\rho $, *C*, *z*, *x*}. First, we assume a sampling survey with boreholes uniformly distributed along the whole domain (Figure 5). Second, we assume a survey composed of boreholes whose distribution does not cover the whole model domain in the *x*- and *z*-directions, i.e., boreholes might be concentrated in one area of the model and/or might not reach deep enough to map the upper interface of the bedrock (see Figure 12).

Figure 13 shows the accuracy score plotted against the number of boreholes *b* for both sampling scenarios. Although the minimum number of boreholes needed to capture all categories is two, this number leads to a highly variable accuracy depending on the distribution of the boreholes. This variability is largely reduced when *b* > 3. In Figure 13, we can note that the accuracy scores of the probabilistic approach under the uniform and nonuniform sampling scenarios are very similar. The MLP correctly predicts the classes at most locations of the model under the uniform borehole distribution. However, if the boreholes do not cover the whole domain in *x* and *z* and, therefore, the training samples do not reflect the real distribution of the materials (preferential sampling), the classification performance decreases. In addition, we observed that the MLP can lead to unrealistic classifications that underestimate the bedrock distribution even if the accuracy scores are similar to those of the probabilistic approach. See, for example, Figure 14, where we compare both approaches using three boreholes of a nonuniform sampling scenario (shown in Figure 12); note that the lime deposit was predicted in the deepest regions of the model.
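The variability of the accuracy score with the borehole subset can be measured by repeated random draws, as sketched below. The function name and the `score_fn` callback (which trains a classifier on the selected samples and returns its accuracy on the rest) are hypothetical.

```python
import numpy as np

def accuracy_vs_boreholes(X, y, borehole_id, n_boreholes, score_fn,
                          n_trials=20, seed=0):
    """Draw `n_boreholes` boreholes at random, train on their colocated
    samples, and score on the remaining cells; repeat `n_trials` times
    to estimate the mean and spread of the accuracy (sketch)."""
    rng = np.random.default_rng(seed)
    ids = np.unique(borehole_id)
    scores = []
    for _ in range(n_trials):
        chosen = rng.choice(ids, size=n_boreholes, replace=False)
        mask = np.isin(borehole_id, chosen)          # samples from chosen boreholes
        scores.append(score_fn(X[mask], y[mask], X[~mask], y[~mask]))
    return float(np.mean(scores)), float(np.std(scores))
```

A nonuniform scenario can be emulated by restricting `ids` to boreholes from one part of the profile before drawing.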

### Field case study: Onoz landfill

#### Site description

The study site is in a former limestone quarry in Onoz (Walloon Region, Belgium) that produced lime until 1967. At the end of the quarry activities, the eastern part of the site was filled with slaked lime and fly ash. The area of interest here is the central part of the site, which was used as a landfill where different types of waste were deposited: inert waste, household waste, industrial waste, backfill, etc. (see Figure 15).

Since the landfill’s closure, several sampling surveys mostly composed of trial pits and different geophysical measurements have been conducted. Here, we focus on one high-resolution 2D profile which has ERT and IP data (profile P2) and presents the largest number of colocated excavations where bedrock was reported. Profile P3 has only ERT data and profiles P1 and P3 present a very similar distribution of resistivity and chargeability (Caterina et al., 2019).

#### Data acquisition and inversion

ERT and IP measurements were collected with an ABEM Terrameter LS. The profile presented here was acquired using 64 electrodes at 1.5 m electrode spacing. We used a dipole-dipole array configuration and a stack of *n* = 2 and a protocol sorted to limit electrode polarization. For the IP measurements, the electrical current was injected for 2 s, using an integration window of 1.7 s for the electrical resistance measurements, and the decay of electrical potential after current shut off was measured for 3 s.

Data were first filtered by removing the resistance measurements that varied by more than 5% between two repeated measurements, which is commonly referred to as the repetition error (e.g., Robert et al., 2011). For the IP data, curves with inconsistent decay were also removed, representing 18% of the original data points. Inversion of the data was performed with RES2DINV (Loke, 1997, 2004). We used a robust least-squares inversion with the Gauss-Newton method and an initial damping factor of 0.25. The inverted ERT and IP models are shown in Figure 16 together with the normalized relative sensitivity. We obtained an rms of 4.74% and 3.36% for the resistivity and chargeability models, respectively, after seven iterations.
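The 5% repetition-error filter can be sketched as follows; the function name and the exact error definition (difference relative to the mean of the two readings) are our assumptions, and the resistance values are illustrative.

```python
import numpy as np

def repetition_filter(r1, r2, threshold=0.05):
    """Keep resistance measurements whose two repeated readings agree
    within `threshold` (5% repetition error) of their mean."""
    r1 = np.asarray(r1, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    mean = np.abs(r1 + r2) / 2.0
    err = np.abs(r1 - r2) / np.maximum(mean, 1e-30)  # avoid division by zero
    return err <= threshold

# Illustrative repeated readings: the middle pair differs by ~18%
keep = repetition_filter([100.0, 50.0, 10.0], [101.0, 60.0, 10.2])
```

The boolean mask `keep` is then applied to the full measurement table before inversion.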

The inverted resistivity section shows a conductive body on the top of a resistive horizon and a strong lateral contrast at approximately *x* = 30 m (Figure 16). The IP model presents scattered bodies of large chargeability on the surface and a smoother lateral contrast beneath. Similar to the synthetic model, the sensitivity of these inverted data presents a vertical decrease in the central part, below the conductive body.

#### Sampling

Several trial pits have been excavated in the landfill area (Figure 15). To test the probabilistic approach and the MLP algorithm, we consider the six trial pits that are colocated with the ERT/IP profile, and which do not cover the entire model domain. Only two shallow pits, not reaching the bedrock, are in the first half of the profile. We assume that once the bedrock is reached, it extends further at depth (in *z*) (see Figure 17).

#### Interpretation — Classes prediction using a probabilistic approach

As observed in the synthetic case, the training data set {$\rho $, *C*, *z*} led to a larger accuracy score for the probabilistic approach and the MLP. Therefore, here we interpret the inverted models using as input data the values of {$\rho $, *C*, *z*} colocated with the trial pits and ignore the variable *x*. This defines a matrix of 149 × 3 and a corresponding vector of categories of 149 × 1. We divided the input data into a training data set (87%) and a test data set (13%) to assess the classification performance. First, we define the material proportions $P(A_i)$ based on the area of the trial pits: 11.33%, 1.78%, 4.67%, and 82.20% for waste, soil, lime, and bedrock, respectively. Then, we computed the joint probabilities $P(A_i|\rho, C, z)$ and derived the classification maps. Note that the MLP algorithm intrinsically uses the prior distribution in its training step (if one class has a higher probability, the algorithm will more often classify an unknown point into that category).
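Deriving the priors from the pit-area percentages listed above is a one-step normalization; the sketch below uses the reported values.

```python
# Area percentages of each category observed in the trial pits
areas = {"waste": 11.33, "soil": 1.78, "lime": 4.67, "bedrock": 82.20}

# Normalize to obtain the prior probabilities P(A_i)
total = sum(areas.values())
priors = {k: v / total for k, v in areas.items()}
```

These priors enter the permanence-of-ratios combination directly, which is why a dominant category such as bedrock weighs heavily on the final classification.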

#### Interpretation — Classes prediction using MLP

Similar to the probabilistic approach, the training data were the values of {$\rho $, *C*, *z*} colocated with the trial pits, which defined a matrix of 149 × 3 and a corresponding vector of categories of 149 × 1. The data were then divided into a training data set (72% of the total), a validation data set to optimize the hyperparameters (15%), and a test data set (the remaining 13%). From the three MLP architectures that showed the largest classification performance, we selected the simplest one: one hidden layer with 50 neurons, a regularization value of 0.01, the rectified linear unit as the activation function of the hidden layers, and a quasi-Newton-based optimizer as the solver.

#### Comparison between the probabilistic approach and MLP

We use the same training data for the probabilistic approach and MLP (training and validation after tuning hyperparameters) and we use the same test data set to assess their performance. Figure 18 shows the category predictions for the probabilistic approach and MLP, the accuracy scores (which were 0.63 and 0.68, respectively), and the confusion matrices. In both approaches, the confusion matrix indicates that only the categories of waste and bedrock could be partially predicted.

First, we can observe that the MLP-predicted model is strongly influenced by the inversion. The bedrock is predicted where the resistivity values are largest, and the lime is roughly predicted in the area of very low resistivities. The soil is predicted in some small areas of high chargeability (which might be artifacts in the inverted model) and, because it was the class found at the largest depth of a pit at $x \approx 0$ m, the soil was also predicted nearby at larger depths. This is a consequence of including spatial training data that are not distributed over the entire survey area (Baasch et al., 2018) and that may not reflect the real distribution of the materials deposited in the landfill. The waste is also predicted at larger depths in the model and might be influenced by the intermediate resistivity values. In addition, this model presents large areas of transparency representing probability values of approximately 30%. This means that there are large uncertainties in the prediction of classes in nearly the whole model except in the area of the predicted bedrock.

The classification derived from the probabilistic approach yields a much simpler model. It mainly presents three nearly continuous deposits: a surficial layer of waste (with interspersed soil), an underlying layer of lime, and the bedrock at the bottom, which is continuous along the whole model. Here, the zones of transparency are distributed in the shallowest layers along the profile, at the depth corresponding to the intersection between the lime and the waste. This classification might represent a more realistic geology, with continuous bedrock and a nearly continuous surficial layer of heterogeneous waste on top of a lime deposit. In general, the anthropogenic-geologic scenario from the probabilistic approach might be more realistic as it agrees with the additional trial pit logs excavated near the profile's origin, where bedrock was found at depths similar to those presented here.

## DISCUSSION

We analyzed the performance of both approaches using a realistic synthetic benchmark and a field case where only a few ground-truth data are available. First, the MLP algorithm requires optimizing the hyperparameters beforehand using a validation data set (typically approximately 15% of the available data). Thus, when little data are available, optimizing the hyperparameters with a validation set can be difficult and yield highly variable results. For instance, several MLP architectures for the synthetic case led to large accuracy scores using the validation set (which may be a consequence of having little data). Moreover, applying the algorithm to new data requires a new optimization of hyperparameters using a validation data set (e.g., Aszemi and Dominic, 2019; Yu et al., 2020). Conversely, in the field case, the accuracy scores were, in general, smaller during the optimization of hyperparameters, and only a few architectures presented scores above 0.7. In both cases, we selected relatively simple neural networks with a small number of hidden layers and neurons to remain consistent with the small number of training data and to be comparable with the probabilistic approach, where no data transformations are performed. In addition, during the optimization of hyperparameters, we observed larger accuracy scores for neural networks with fewer hidden layers and neurons.

Second, in the synthetic case, the entire model domain was available to assess the performance of the algorithm. This was not the case for the field site, where the test data were only 13% of the ground-truth data. For the latter, the classification predictions were compared with the observations from nearby excavations. Finally, the transparency applied in this classification is derived from the probabilities of the output layer, obtained with the Softmax function. Note, however, that this function transforms the data into a probability vector with values between zero and one whose elements add up to one. Because it is a normalized exponential function, the largest values are transformed into values close to one, whereas the smallest are transformed into values close to zero. Therefore, the results may not represent exact probabilities as derived from the probabilistic approach (e.g., Gal and Ghahramani, 2016).
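The behavior of the Softmax function described above can be illustrated directly; the score vector is an arbitrary example.

```python
import numpy as np

def softmax(z):
    """Numerically stable Softmax: maps a score vector to values in
    (0, 1) that sum to one. Because of the exponential, the largest
    scores dominate, so the outputs are not calibrated probabilities."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

# Arbitrary output-layer scores for three classes
p = softmax([2.0, 1.0, 0.1])
```

The exponential stretching is what produces the near-opaque and near-transparent extremes in the MLP maps, even when the underlying scores are close.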

The probabilistic approach does not require an initial tuning of parameters. The prior probability of the different categories or the categories’ proportion can be defined from the volumes obtained in the excavations. This information is essential to ensure that the conditional probabilities of each category are in line with the ground-truth data. If one category dominates the prior probability, whereas the geophysical data are poorly informative, this can lead to a final classification favoring the most probable facies. This is the reason why, in the field case, the bedrock lies across most of the model domain (especially in comparison with MLP classification). Finally, the transparency applied in this classification shows joint probabilities of the selected category, indicating specific zones of larger classification uncertainty. The probabilistic approach also is less dependent on the location and number of the colocated data.

The integration of more (inverted) data that increase the training data set is likely to improve the performance of MLP while not necessarily the performance of the probabilistic approach. When the physical properties of the categories in the inverted model(s) are highly variable along the whole domain mapped, the classification uncertainty is likely to increase. Nevertheless, the predictions would still be in line with the material proportions or prior information from sampling. Note that the better the materials are resolved in the conditional probabilities given the geophysical or spatial variables, the better the performance of the probabilistic approach. Consequently, in highly heterogeneous environments, the probabilistic approach is likely to improve when it is applied locally, i.e., per profile if multiple profiles are available or in areas of a 3D model, depending on the heterogeneity observed in the conditional probabilities. Regardless, a representative sampling based on geophysical data (e.g., Van De Vijver et al., 2019) could improve data interpretation and therefore the classification performance of both methods.

Another option to increase the training data of the field case may be to use the data of the synthetic case. It is a common practice to create synthetic data to train artificial neural networks (Yu and Ma, 2021). Nevertheless, this contribution aims to present a probabilistic approach as an alternative to performing a rapid quantitative interpretation of site-specific anthropogenic environments incorporating ground-truth data.

We also studied the effect that different data sources used for training have on the probabilistic approach and the MLP. Here, we can notice that the category predictions derived from the MLP are highly influenced by the spatial heterogeneity of the inverted models when we only use the geophysical data as input. This leads to several misclassification zones, most of which are nonrealistic. In contrast, because the probabilistic approach relies on a Bayesian framework, it integrates the uncertainty related to data noise and inversion artifacts into the results. Therefore, the probabilistic approach is less sensitive to the heterogeneities of the inverted models when using only $\rho $ and *C*, although it overestimates the distribution of the bedrock due to the large prior probability of this category. When we also include spatial training data (*x*, *z*), both approaches largely improve the classification, in particular the MLP. However, including the position *x* in the probabilistic approach makes the classification strongly dependent on the sampled locations.

In addition, we analyzed the effect that the number of boreholes and their distribution has on both approaches using the synthetic case. The distribution of a low number of boreholes can lead to variable accuracy scores (in particular for MLP). The accuracy scores of the probabilistic approach are very similar under the uniform and nonuniform sampling scenarios.

Here, we only compared the probabilistic approach with one algorithm of supervised learning, both of which presented (overall) similar classification models. Unsupervised learning also has been used for data interpretation in cases with minimal prior knowledge or where few ground-truth data are available (Delforge et al., 2021; Sabor et al., 2021; Whiteley et al., 2021). These approaches have proven useful for the interpretation of geophysical data in geologic environments composed of layered models or when the geophysical method(s) resolve zones or structures that can be evidenced in intrusive data. The motivation of this probabilistic approach is to quantitatively interpret geophysical data in complex anthropogenic environments with extreme heterogeneity not only in terms of the spatial distribution of deposited wastes but also in terms of the high contrasts in physical properties that may lead to noisy data and artifacts in the inverted models. Because it provides probability values, its integration within other model types or in decision-making is relatively straightforward (e.g., Hermans et al., 2015).

Both the probabilistic approach and the MLP classify some zones with large uncertainty. Although adding the *x* and *z* information can improve the classification locally, it may simultaneously degrade the overall accuracy score. Another option is then to improve the inverted model by adding the prior information from boreholes in the inversion process (Linde et al., 2015; Ronczka et al., 2015). For example, adding the depth of the bedrock, located in a low-sensitivity zone, can improve the interfaces in the other parts of the model (Caterina et al., 2014; Thibaut et al., 2021). However, adding such information bears the same limitation, as it is highly dependent on the available boreholes and can lead to erroneous inverted models (Caterina et al., 2014). Nevertheless, using more advanced inversion methods does not prevent the use of the probabilistic approach or the MLP algorithm for post-inversion classification. Hermans and Irving (2017) show that an appropriate regularization could lead to an increase in the confidence of the classification (higher probabilities).

## CONCLUSION

In this study, we presented a probabilistic approach that can be used for classification problems, including uncertainty estimation, from multiple site-specific geophysical data sets when only a few ground-truth data are available. The classification is based on two geophysical models (ERT and IP) and spatial data colocated with boreholes or trial pits. We compared this approach with a machine learning approach, the MLP, on a synthetic model and a real field case. In addition, we tested the effects that the types of (training) data sources and the borehole sampling have on both approaches.

The probabilistic approach has proven to provide robust results with respect to the position and number of ground-truth data and the presence of inversion artifacts, in contrast to the MLP algorithm, whose performance is largely related to the number of training data. Therefore, we recommend the use of the probabilistic approach in complex anthropogenic-geologic scenarios or other site-specific environments where: (1) geophysical inverted models present spatial heterogeneities (laterally and vertically) or artifacts, (2) only a few ground-truth data are available, and (3) ground-truth data might not be evenly distributed nor cover most of the study area. The approach can be easily extended to integrate geophysical data from multiple methods or in three dimensions. It represents a suitable alternative for performing a rapid quantitative interpretation of geophysical data through probabilistic classification. Finally, as it integrates uncertainties in the prediction results, these can be used to complement decision support tools in sustainable landfill management. In contrast, when a large number of training data are available, the MLP algorithm is expected to outperform the probabilistic approach.

## ACKNOWLEDGMENTS

This work has been financed by the European Union's program of Interreg North-West Europe and the Walloon Region within the framework of the multidisciplinary projects of RAWFILL and NWE-REGENERATIS. We thank J. Whiteley, B. Baasch, an anonymous reviewer, and the editors for their valuable comments and suggestions.

## DATA AND MATERIALS AVAILABILITY

Codes necessary to reproduce the classification using the probabilistic approach in the synthetic case and in the field case study are available at https://doi.org/10.5281/zenodo.7121021.
