## ABSTRACT

Seismic processing often involves suppressing multiples that are an inherent component of collected seismic data. Elaborate multiple prediction and subtraction schemes such as surface-related multiple removal have become standard in industry workflows. In cases of limited spatial sampling, low signal-to-noise ratio, or conservative subtraction of the predicted multiples, the processed data frequently suffer from residual multiples. To tackle these artifacts in the postmigration domain, practitioners often rely on Radon transform-based algorithms. However, such traditional approaches are both time-consuming and parameter dependent, making them relatively complex. In this work, we present a deep learning-based alternative that provides competitive results, while reducing the complexity of its usage, and, hence simplifying its applicability. Our proposed model demonstrates excellent performance when applied to complex field data, despite it being exclusively trained on synthetic data. Furthermore, extensive experiments show that our method can preserve the inherent characteristics of the data, avoiding undesired oversmoothed results, while removing the multiples from seismic offset or angle gathers. Finally, we conduct an in-depth analysis of the model, where we pinpoint the effects of the main hyperparameters on real data inference, and we probabilistically assess its performance from a Bayesian perspective. In this study, we put particular emphasis on helping the user reveal the inner workings of the neural network and attempt to unbox the model.

## INTRODUCTION

In seismic exploration, geophysicists interpret reflections of acoustic waves to extract information from the subsurface. These reflections can be classified as primaries or multiples. Primary reflections are those seismic events whose energy has been reflected once, and they are used to describe the subsurface interfaces. In contrast, multiples are events whose energy has been reflected more than once and appear when the signal has not taken a direct path from the source to the receiver. The presence of multiples in the recorded data set can trigger erroneous interpretations because they interfere not only with the analysis in the poststack domain, (e.g., stratigraphic interpretation) but also with the prestack analysis, (e.g., amplitude-variation-with-offset (AVO) inversion). For this reason, the demultiple process plays a crucial role in any seismic processing workflow. Multiple-attenuation methods can be classified as predictability- and separation-based. Predictability-based approaches exploit the repetitive nature of multiples and their inherent connection to primaries. In general, they consist of two steps: a multiple prediction step, in which a model of multiples is created, followed by adaptive subtraction (Verschuur et al., 1992; Abma et al., 2005) where the predicted multiples are adaptively matched and removed from the recorded wavefield. Some of the most widely used methods are wavefield extrapolation (Berryhill and Kim, 1986; Wiggins, 1988; Wang et al., 2011), surface-related multiple elimination (SRME) (Berkhout, 1985; Verschuur, 1991; Ma et al., 2019) and the inverse scattering series free-surface multiple elimination (Carvalho et al., 1991; Weglein et al., 1997, 2003; Ma et al., 2019). All of these approaches are recognized for their effectiveness in mitigating free-surface multiples. Nevertheless, they involve numerous steps, and their efficacy is highly influenced by factors such as acquisition setting and geometry, as well as signal-to-noise ratio (S/N) (Gisolf and Verschuur, 2010; Kostov et al., 2015; Ma et al., 2019). In addition, to not risk damaging weak primaries, the adaptive subtraction step is often applied conservatively, resulting in residual multiple energy in the final image (Wang et al., 2011; Zhang et al., 2021). Recently, closed-loop SRME (CL-SRME) (Lopez and Verschuur, 2015; Zhang and Verschuur, 2021) has been proposed to tackle the shortcomings of SRME in shallow-water settings, nonetheless, the high computational demand still poses a challenge.

However, separation-based methods translate seismic data into intermediate domains, where one can eliminate multiples based on different characteristics of multiples and primaries (Weglein et al., 2011). The concept here is to exploit the fact that, on average, multiples have encountered a lower velocity than the primaries, and thus multiples are expected to exhibit an increasing residual moveout (RMO) along the offset dimension. Although suffering from their own set of limitations, separation-based methods are of a computationally simpler nature and can be applied at various stages of the processing workflow. One of the most widespread approaches to making use of this feature is the parabolic Radon transform (PRT) (Hampson, 1986). It translates prestack gathers from a time-offset to a $tau$-*p* space, by mapping them by a set of parabolic events. By design, PRT works best in the case of multiples perfectly following parabolic paths and for unlimited offset axis, both of these aspects are, nevertheless, not realizable in practice (Hampson, 1986). As a consequence, PRT can potentially degrade parts of the primary signal. Another limitation appears when dealing with gathers that are coarsely sampled. In such cases, data sparsity can lead to false energy mapping to the $tau$-*p* space, which in turn leads to insufficient separation of primaries and multiples. This either creates residual multiple energy or removes primary energy. To address some of the aforementioned weaknesses, the high-resolution Radon multiple removal method has been introduced (Sacchi and Ulrych, 1995; Sacchi and Porsani, 1999; Trad et al., 2003). It is, however, an approach of higher complexity, requiring the interpreter to manually fine-tune numerous parameters. Moreover, another disadvantage arises from the necessary time-consuming step of picking an appropriate mute function in the $tau$-*p* space to separate primaries from potential multiples. Oftentimes, the nature of the data set requires a laterally varying mute function design, adding yet another level of complexity. When it comes to industry workflows, the usage of predictability-based methods in the premigration domain, e.g., SRME, and separation-based methods in the postmigration gather conditioning, e.g., PRT demultiple, are typically combined. In this fashion, interpreters can leverage the best from both methodologies and achieve more reliable outcomes.

With the introduction of deep learning, a new vein of methods has emerged (Breuer et al., 2020; Bugge et al., 2021; Qu et al., 2021; Wang et al., 2022). These approaches are based on artificial neural network architectures, which are universal approximators, i.e., they can, in theory, model any continuous function. Breuer et al. (2020) present a deep learning-based method to trim statics and remove multiples on postmigration common-depth point (CDP) gathers using a moveout discriminator approach trained on synthetic data. Subsequently, Bugge et al. (2021) propose a similar approach that simultaneously tackled both demultiple and denoising on prestack gathers. Qu et al. (2021) present a hybrid workflow combining a deep neural network trained on synthetic data for shallow reflection reconstruction and PRT for deeper event reconstruction followed by CL-SRME. Finally, Wang et al. (2022) introduce a solution that exploits noise and data augmentation applied to training data generated using SRME or PRT for the free-surface multiple removal. Unfortunately, although the aforementioned methods have contributed to improving state-of-the-art results on multiple removal, they still suffer from generalization problems. To deal with this issue, Qu et al. (2021) require the generation of synthetic training data for each field of interest. Similarly, the approach by Wang et al. (2022) necessitates the synthetization of labeled data per survey using conventional multiple elimination methods for real data applications. Note, however, that these are proxy solutions, as they do not attempt to solve the survey-data set dependency of the model, but rather bypass it.

In this paper, we introduce and perform a detailed analysis of a separation-based automated end-to-end deep-learning approach, which can be applied on moveout-corrected post-migration CDP gathers to remove events that follow parabolic-like patterns while preserving the primary energy at cross-points. As already pointed out by Qu et al. (2021), training the model on data sets preprocessed using traditional methods introduces the limitations of such methods into the model as a side effect. To decouple the model from such limitations, we follow the workflow introduced in Breuer et al. (2020) and train a convolutional neural network (CNN) with synthetic pairs of multiple-contaminated and multiple-free gathers. The network is trained on feature-rich synthetic CDP gathers designed to enable the trained network to identify multiples in the prestack domain based on the reflection moveout paths rather than periodicity, thus making the model highly generalizable and independent of acquisition design. Furthermore, our approach works in a parameter-free manner, relieving the user from any manual task. In addition, we conduct an in-depth hyperparameter search, where we study the role played by the different components and their impact on the outcome. To that end, we visualize the inner workings of our neural network, to pinpoint the effect of the main hyperparameters on physical events. Finally, extensive in-field evaluations show that our model is able to preserve the inherent characteristics of the data in different scenarios, and thus, to generalize well. As a result, our approach can be seen as an alternative to traditional moveout separation-based approaches in the postmigration stage, such as PRT, in existing processing workflows.

## U-NET VISUALIZATION

U-net (Ronneberger et al., 2015) is a CNN topology, which was initially designed for semantic segmentation tasks in the medical domain. However, due to its generalization capacity, it has been widely adapted to various other domains. The architecture of U-net is divided into two paths: the contraction path, known as the encoder, designed to capture the image’s context, and the expanding path, referred to as the decoder, responsible for facilitating accurate localization. Both paths are symmetric and made of blocks of convolutional layers followed either by a down-sampling operation (encoder) or by an up-sampling operation (decoder). In addition to the encoder-decoder scheme, U-net has long skip connections that bypass some layers and connect different blocks from the encoder to their counterparts from the decoder. These shortcuts provide alternative paths for the gradient during back-propagation that help the model to incorporate fine-grained details in the predictions. Figure 1 shows the architecture of U-net for the demultiple scenario.

CNN architectures are successfully used in a large variety of applications, ranging from computer vision to natural language processing. They are made up of neurons that have learnable parameters arranged in filter-shape structures. Each of these neurons receives some inputs, performs a dot product, and finally, applies a nonlinear activation function (e.g., sigmoid or rectified linear unit [ReLU]) (Nair and Hinton, 2010). The output of the activation for a given filter is called a feature map or an activation map. Although the learning mechanism (back-propagation) is well understood, the intrinsic details, such as the reason why a specific decision or prediction is made, are not. As a result, neural networks are typically treated as black box models. To better understand the internal workings, we visualize different components of the network. In particular, we investigate the filters and the feature maps to try to conceptually unravel the learning of the model when dealing with demultiple problems.

On the left side of Figure 2, we can see some filters that the network has learned. Seemingly, they do not display any human-recognizable pattern from which one can draw conclusions. The statistics are, however, more informative. The filters’ weights appear to always follow a Gaussian distribution, independent of the layer. Similar observations by Gavrikov and Keuper (2022) suggest that convolution filters do not have distribution shifts along different axes of meta-parameters, such as data type, task, architecture, or layer depth. Nonetheless, we notice that the first block might break these empirical deductions, meaning that depth could indeed play a certain role in shallow layers. On the right side of Figure 2, we can observe some feature maps from different blocks. These intermediate representations display how the network modifies the input image and help us understand how multiples are identified and suppressed. On the one hand, as expected, we can visually assess a gradual loss of resolution (high-frequency components) in the first blocks, due to their down-sampling operations from the contraction path. The opposite effect is seen in the last blocks, caused by the up-sampling operations from the expanding path. However, contrary to what might be intuitive, the network is not learning to suppress multiples directly from the beginning. In fact, they are present in all the blocks, and almost in all feature maps. What the network seems to learn, instead, is to identify the multiples in each block to have a full understanding of the event. In this manner, in the very last layer, the model combines the feature maps in such a way that the undesirable events (multiples) are canceled out.

## TRAINING DATA SET

When interpreting real seismic data, we do not have the ground truth (GT) (annotated data). Unfortunately, these labeled data are some of the cornerstones of any supervised deep-learning model. Manual interpretation is an effective way to acquire GT, but it is an expensive and time-consuming process. Furthermore, its outcomes rarely contain all the events that would define the characteristics of the subsurface. To address this issue, in the demultiple scenario, one could create real labeled data, by using a traditional approach, for example, the PRT (Wang et al., 2022). Nevertheless, the network would be biased and limited by the performance of the traditional approach (Qu et al., 2021).

In this work, we introduce a network that is able to suppress multiples regardless of the domain and nature of the seismic gathers, i.e., offset or angle domain and time or depth domain. To achieve this, we systematically generate a substantial data set comprising 40,000 synthetic pairs of multiple-contaminated and multiple-free gathers. Exercising precise control over the features of the synthetically generated data set via an extensive parameter space empowers us to create a training data set that significantly enhances the model’s capacity to perform well on a wide range of real-world scenarios. Crucially, it is worth noting that our network’s proficiency does not stem from its ability to identify multiples based on their periodic relationships to specific primary signals. Instead, our goal is to exploit geometric differences in the RMO and cross-points between multiples and even barely visible primaries in the prestack gather. This parameter space consists of (1) variations of the density of multiples and primaries, and their position along the vertical axis; (2) variations of the strength of the RMO effect controlling minimum multiple moveout; (3) variations of the spectral components of the source wavelet together with a central frequency decay along the vertical axis; and (4) variations of amplitude change with offset/angle.

The synthetic gathers for training are created by generating a prestack reflectivity series $rp(t0)=rp(t,h=0)$ at zero offset $h=0$, first, for the primary reflections $p$. For the expansion to nonzero-offset $rp(t,h>0)$, linear interval velocity functions are defined, converted to RMS $vp$ velocity, and applied in the hyperbolic normal moveout (NMO) formula to calculate the event time $tp(t0,h,vp(t0))$. For the amplitude part, the Shuey approximation (Shuey, 1985) is used with $rp(t0)$ as the intercept of the amplitude variation with angle equation, to which we add a gradient term. A preliminary version of the primary-only gather is generated by convolving $rp(t,h)$ with a synthetic source wavelet of Ricker-type whose degrees of freedom are central frequency, bandwidth, and phase shift. Furthermore, we generate custom wavelets through the superposition of two individual wavelets, which are weighted and mutually shifted. Analogously, we also generate a nonzero-offset reflectivity series $rm(t,h)$ for the multiples, followed by convolution with the same wavelets as used for the primaries. The main difference from the primary reflectivity is the lower velocity $vm$ used to calculate $tm(t0,h,vm(t0))$. This gather of the multiples is added to the primary-only gather to generate a gather that contains both, the primaries and the multiples. Subsequent NMO correction of the gathers with perturbed RMS velocity, obtained by time-dependent perturbations of the interval primary velocity model, approximates gathers after prestack migration. The primaries appear almost flat (not necessarily perfectly flat) and multiples show stronger positive moveout than the primaries, thus they are seen in the gathers as events that intersect primaries and have a larger vertical extent. The NMO-corrected primary-only gathers are the GT in the process of training the network; the NMO-corrected gathers of primaries and multiples are the input gathers. The values of all parameters are obtained by Monte Carlo sampling of the parameter space within the bounds defined by the user of the modeling routine. Setting such bounds follows the guidelines of the variability of the corresponding parameters in field data acquisition and data preprocessing. It seems reasonable to let, e.g., the central frequency of the wavelet vary between 10 and 150 Hz to account (for deep seismic data) for the range of frequency content of various typical sources and the decay of frequency toward large depth. For the case of synthetic training data for shallow applications with typically much higher frequency content, one could stay within the same frequency limits and the same vertical resolution, because the network does not acknowledge physical units and, thus, makes no difference between realizations of *N* times higher frequency data on an *N* times higher resolved vertical grid. Some parameter bounds, however, have a steering effect on the functionality of the trained network. For example, defining a minimum-allowed moveout for the removed multiples teaches the network not to suppress potentially nonflattened primaries.

## ANALYSIS OF U-NET PARAMETERIZATION

Hyperparameters are values that control the learning process of neural networks. They define different aspects of the model, such as the learning rate, optimizer, depth, activation function, and loss function, just to mention a few. In general, neural networks are notorious for being very sensitive to the choice of hyperparameters, resulting in relatively different outcomes when the parameters are slightly modified.

In this section, we identify and describe the empirical effects that some hyperparameters have on our multiple-attenuation network. In particular, we focus on the impact of the optimizer, sampling technique, kernel size, loss function, and depth. To that end, we average validation results of 25 independent runs to guarantee reproducibility. We evaluate these results on four different metrics: mean-square error (MSE), S/N, structural similarity, and peak correlation. Furthermore, we validate the outcome on synthetic and real data sets. In this manner, we ensure certain generalizability and neutrality in our observations.

### Optimization functions

Within a neural network, the optimizer is an algorithm that modifies the weights of the network to minimize the loss function. They are built upon the idea of gradient descent, i.e., the greedy approach of iteratively decreasing the loss function by following the gradient. There are two main groups of optimizers: adaptive and nonadaptive methods. Hardt et al. (2016) argue that nonadaptive methods, such as stochastic gradient descent (SGD), are conceptually more stable for convex and continuous optimization, having smaller generalization errors. They also prove that, under certain conditions, the results can be carried over to nonconvex loss functions. Follow-up work by Wilson et al. (2017) finds empirical evidence of the poor generalization performance of adaptive optimization methods, such as adaptive moment estimation (Adam) (Kingma and Ba, 2014). Even when adaptive methods achieve a better training loss than nonadaptive methods, the test performance is worse. Finally, Choi et al. (2019) claim that the hyperparameter of the optimizer could be the reason that adaptive optimization algorithms failed to generalize.

In our experiments, we evaluate the impact of SGD with momentum and Adam optimizers for the demultiple task. Figure 3a shows the validation metrics in synthetic data for the two selected optimizers. In these plots, we can observe how the Adam optimization converges faster than the nonadaptive one (SGD) and also ends up in lower local minima, i.e., all the metrics reach better values. Nonetheless, although the gap between both optimizers seems to be significant when inspecting synthetic results, the differences are negligible (see Figure 3b). Furthermore, surprisingly, the demultiple outcomes on the real data set suggest that the model trained with the Adam optimizer tends to fail to generalize more often, and its results are not always consistent, varying among different runs. In Figure 3c, we display some results on real data, where we see how the Adam approach occasionally suppresses the primary energy, as it does for the reflection marked by the red rectangle from the second gather, and leaves some residual multiples in the far stack, as it does for the reflection marked by the red rectangle from the sixth and seventh gathers. Despite the fact that our model is trained using synthetic data, the system is meant to be applied to real data. Therefore, we prefer to use the SGD optimizer.

### Sampling technique and kernel size

The CNN-based models gradually down-sample their inputs so that the receptive fields of the deeper filters can reach most of the image at a certain depth. By doing that, the pixel dependencies, which lie far away from each other in an image, can be captured. This is an important aspect for any neural network that needs to interact with content that is spread on the input image, such as in fault detection or multiple removal. In our study, we conduct a twofold analysis related to the sampling, we evaluate the effect of different sampling techniques, and we analyze the impact of the kernel size.

Sampling techniques refer to those methods that decrease or increase the size of an input. In the contraction path of U-net, there are two down-sampling approaches: the pooling operation and the convolution operation. Although typically the pooling operation does not have learnable parameters (less computationally demanding), the convolutional operation does have such parameters. As a consequence, the latter can capture additional information, whereas the pooling will always imply a loss of information. In the expanding path, the decoder recombines the features sequentially until it recovers the original input size. To that end, this path requires up-sampling operations. Similarly to the contraction path, there are two main approaches: interpolation operation and transposed convolution operation. The first type of operation is parameter-free and lossy, and the second is the opposite. To evaluate the impact of the sampling methods, both down- and up-sampling, we check the different combinations. For the sake of simplicity, we restrict our analysis to the default configurations, which are max-pooling as a nonlearnable down-sampling technique and bilinear as a nonlearnable up-sampling technique.

Based on Figure 4a, experiments with transposed convolutions have less stable runs, nonetheless, all the sampling techniques have similar performance. Therefore, the extra computational cost of the learnable operations is not justified. Furthermore, the combination of max-pooling and bilinear, which are both nonlearnable sampling methods, provides the most stable results. Testing with synthetic and real data shows no difference among the configurations.

In addition to the sampling techniques, the kernel size might also contribute to the final outcomes. This hyperparameter determines to what degree the sampling operation down- and up-samples the corresponding input. Given that we work with elongated events, we empirically analyze the impact of kernels with square and nonsquare shapes and assess the impact of more aggressive sampling, i.e., the down- and up-sampling factors. Table 1 and Figure 5 describe the scenarios of our examples. Although the validation metrics seem to report the same behavior for all of the kernels, we observe a consistent improvement after quality control when using a 1 × 1, 2 × 2, 2 × 2, 2 × 2 kernel sequence (see Figure 4b). Models trained with the larger max-pooling kernels appear to remove multiples more aggressively, i.e., oversmoothing results and suppressing far stack energy of events that exhibit small moveout, marked by the rectangles in Figure 4c. According to Araujo et al. (2019), the effective maximum receptive field of the model trained with the chosen kernel sequence is 112 samples, meaning that a single pixel in the output is influenced by a square of 112 × 112 pixels from the input, as shown in Figure 6. This appears to be sufficient to observe multiples and their localized interactions with primaries, and hence we conclude that such a localized view is more important than the global view of the gather for this task. Moreover, the models trained with larger kernels seem to be more sensitive to the initial weights than their counterparts trained with smaller max-pooling kernels, as confirmed in the probabilistic study (see the following section).

### Loss function

*N*samples). Mathematically, it can be formulated as

Besides formulating the loss function, it is crucial to define the primary objective. This entails clearly outlining the specific task that the network is designed to achieve. To elaborate further, we propose two distinct objectives: direct and inverse. Given an input image *x*, the direct proposal tackles the demultiple problem by optimizing the prediction $y^$, which is a multiple-free image. The inverse approach, however, formulates the solution from another perspective. It defines the objective task as an optimization problem, where the prediction $y^$ should contain only the multiples of the input image, i.e., $x\u2212y$ (see Figure 7a). In this way, the network should focus exclusively on identifying the multiples, omitting the rest. Once the model is able to do that, we can subtract the prediction from the input image to obtain a multiple-free image. In Figure 7b, we plot the metrics using different objective functions. Interestingly, the results from both scenarios are similar. We hypothesize that the network learns to cancel out the same features, in the direct and inverse formulation, and consequently, the outcomes seem equivalent. Nonetheless, more advanced loss functions could potentially improve the results.

### Depth of the network

The goal of our neural network is to model a function *F* that maps the raw input data *x* to a multiple-free output. To that end, we create *F* by concatenating *n* nonlinear functions *f*, i.e., $F(x)=f1(f2(..fn(x))))$. Notice that adding more layers provides higher capacity to the network, which leads to deeper networks. In our experiments, we investigate the effect of three levels of depth. We take as a baseline the standard model shown in Figure 1, which consists of nine blocks. Then, we remove two down-sampling and two up-sampling layers to create a smaller version, called “small U-net.” Finally, we repeat the procedure, but this time adding two down-sampling and two up-sampling layers into the baseline. We call this last model big U-net. Table 2 shows the details of each topology and their inference times.

Figure 8a and 8b shows the depth analysis from which we derive the following statements. (1) The small U-net is too shallow and does not have sufficient capacity to suppress the multiples and occasionally oversmoothes the gathers. As a result, metrics and real data underperform when compared with the standard model. (2) The big U-net model is overparametrized, and therefore, the extra layers do not offer any further improvement. Note, however, that this analysis involves a training data set of a constant size and thus, training the big U-net model on a larger data set could yield different results. In summary, our standard model has the optimal trade-off between quality and size.

### Alternative topologies

The attention U-net architecture, proposed by Oktay et al. (2018), enhances the standard U-net model by incorporating self-attention mechanisms (Jetley et al., 2018). These mechanisms, such as channel and spatial attention, allow the model to adaptively emphasize relevant features during both the encoding and decoding stages. By selectively highlighting informative regions and suppressing noise or irrelevant details, the attention U-net improves its overall performance. In contrast, the MultiResUNet architecture introduced by Ibtehaz and Rahman (2020) introduces the concept of multiresolution residual blocks within the U-net structure. The main idea is that the incorporation of multiple resolution paths will help the architecture to effectively capture local and global contextual information. The fusion of information from different resolution levels enables MultiResUNet to learn intricate details and capture a broader context, enhancing its segmentation capabilities. In terms of architecture details, attention U-net and MultiResUNet consist of nine layers with max-pooling operations at resolutions of 2 × 2, 2 × 2, 2 × 2, 2 × 2. Attention U-net has a parameter count of 34.9 million and uses a combination of max-pooling and bilinear interpolation for down-sampling and up-sampling. MultiResUNet has 7.2 million parameters and uses max-pooling for down-sampling and transposed convolution for up-sampling. In Figure 9a, we show the evaluation scores from topology analysis. Twenty-five models have been trained with each topology and tested on synthetic testing data. Based on the depicted curves, the performance of the attention U-net is comparable with that of the proposed U-net architecture, whereas the MultiResUNet demonstrates noticeably inferior results. Figure 9c shows the results of U-net, MultiResUNet, and attention U-net and amplitudes extracted along two selected reflectors, which are plotted above the gathers. Based on these plots, it becomes evident that the MultiResUNet affects the absolute amplitudes of primaries, whereas the U-net and attention U-net output primaries with an overall similar amplitude intercept and gradient. Moreover, the MultiResUNet has not successfully suppressed the multiple crossing of the red reflector. Figure 9b shows another comparison of these three topologies, this time, however, on numerous gathers from the Norwegian Sea. A comparable observation can be made based on this figure.

In summary, both the attention U-net and the MultiResUNet introduce modifications to the standard U-net architecture to address specific limitations and potentially enhance performance. For our use case, the MultiResUNet demonstrates a tendency to diminish the absolute amplitude across the gathers rather than solely addressing the presence of multiples. In contrast, the U-net and attention U-net exhibit similar outcomes, although the attention U-net occasionally exhibits too severe of an effect. It is important to note that the attention U-net, as opposed to standard U-net, is not fully convolutional and thus is input-shape dependent. Despite its competitive performance, the constraint of this topology to a specific input shape is too limiting for our use case.

## BAYESIAN INVESTIGATION

Once we have analyzed the role of the different parameters, it is important to quantify the uncertainty of the model, i.e., epistemic uncertainty. In this manner, one can determine how reliable the actual predictions are, avoiding miscalibrated models. To that end, we need to move from a deterministic approach, where we solely rely on a point estimator, to a probabilistic approach, where we leverage Bayesian probabilities via Bayesian neural networks (BNNs). Although traditionally BNNs have been computationally expensive and difficult to train, recent approximations, such as deep ensembles (Lakshminarayanan et al., 2017), concrete dropout (Gal et al., 2017), and stochastic weight averaging Gaussian (Maddox et al., 2019), have eased these constraints.

In this work, we have implemented deep-ensemble learning, which can be considered a special case of BNNs (Wilson et al., 2022). The idea behind ensemble learning comes from the observation that aggregating the predictions of a large set of average-performing but independent predictors can lead to better predictions than a single well-performing expert predictor (Breiman, 1996). In our case, however, we prefer to use such a method to obtain the uncertainty associated with the underlying processes. This is achieved by normalizing and then computing the standard deviation of the predictions of numerous sampled model parameterizations. Notice that the resulting range of values indicates the percentage with respect to the output signal amplitude. As a result, if the different models agree on the multiple-free solutions and their absolute amplitudes, then the uncertainty is low. Otherwise, the uncertainty is high.

Figure 10a–10d shows four prestack gathers from a real data set and their associated uncertainties for a set of experiments (see Table 3). These uncertainty figures show the areas of the prestack gather where the models have a lack of knowledge, resulting in a certain ambiguity within the multiple removal process. In practice, this manifests itself as variations in the amplitude or shape of the removed events across parameterized models. Low uncertainty is displayed in black or dark purple, high uncertainty is displayed in pink and yellow. Given that the demultiple model is not perfect and hence its epistemic uncertainties are not zero, one has to target a model that does not exhibit high amplitude uncertainties that align with primaries. Otherwise, this would suggest that some realizations of the model remove or suppress primary energy, which is highly undesirable. However, uncertainties following a parabolic or a linear moveout are tolerated, as they potentially belong to a multiple. Such uncertainty suggests that the model is not certain about whether the event is a multiple or there is a mismatch in the amplitude. We observe that Models B, C, E, and H exhibit a clear increased uncertainty throughout the entire gather, hinting that some of these model realizations do affect the amplitudes of primaries. On the contrary, Models A, D, F, G, and I only produce uncertainties with significant values that follow parabolic events which we presume to be multiples. Finally, although these five models provide similar uncertainty maps, Models A, F, and I achieve the best qualitative performance (see the previous section). Therefore, as already mentioned, we prefer Model A because it offers a better trade-off between quality, size, and flexibility.

## SYNTHETIC EXAMPLE

Figure 11a shows the outcomes of our method and compares them to the results obtained from the Radon-based demultiple technique. The assessment is carried out on gathers obtained from a synthetic data set. This data set is created using a 3D finite-difference method that incorporates a free-surface boundary condition. The gathers are represented in the depth-offset domain, and our deep-learning approach was directly applied in this domain. Both our method and the Radon-based demultiple technique successfully eliminate the clearly defined parabolic events within a depth range of 3–5 km. However, in the far-offset shallow section, our deep-learning approach exhibits superior performance in removing steeply dipping linear noise when compared with the Radon-based demultiple method. For deep-learning approaches, which take seismic data as input and produce seismic data as output, amplitude preservation of the primaries is of utmost importance. Figure 11b shows the amplitude preservation capabilities of the U-net (our deep-learning model) and the Radon-based demultiple results. Displayed amplitudes are extracted along the red and blue lines from the raw gather and plotted above their respective gathers. The red line follows a potential phase-reversal event with a positive intercept, whereas the blue line traces an event with a negative intercept and a positive gradient. Both the deep-learning approach and the Radon-based method preserve the overall amplitude trend. In addition to multiple removal, an AVO-preserving denoising effect of the deep-learning approach can be observed. In the difference plots, marked by arrows, we observe how high amplitude events in the removed energy along the lines align closely for both approaches.

## FIELD EXAMPLES

In addition to tests on synthetic data, the trained model has been tested on numerous real postmigration data sets without any additional fine-tuning. Figure 12 shows the results of our method as compared with a traditional Radon-based demultiple approach on a real data set from the Norwegian Sea, subsequently referred to as Field Data A. In addition, Figure 13 shows a similar comparison using prestack gathers of the Volve field data set (Equinor, 2018) from the Norwegian North Sea, subsequently referred to as Field Data B. The deep-learning approach was applied directly in the depth-angle domain, whereas the PRT application involved depth-to-time and angle-to-offset conversions. For both real data examples, we plot the removed multiples for the proposed and the traditional methods to help visualize the main discrepancies between the two systems. From this visualization, we observe that PRT predominantly removes events along idealized parabolas, which are unlikely to closely represent multiples in real data CDP gathers. Complex overburden causes deviations from the parabolic shape, thus, the mapping of such multiples to clusters of points in $tau$-*p* space is in disagreement with the attempt of modern high-resolution PRTs to achieve a sparse $tau$-*p* representation. Our deep-learning approach, in contrast, does not make use of any specific path of the multiples and is able — based on what was shown to the network during training — to remove along its full path any given event that intersects other events with smaller RMO. In this way, it also removes converted wave energy and steeply down-dipping linear noise, and it is better suited to remove residuals of a demultiple process of the premigration steps, i.e., which appear only in the far stack. Such events can be seen in the far stack of the first three gathers in Figure 12. We also provide the results of the same data sets as full-stack sections in Figures 14 and 15. Herein, we can see how the lateral coherency of the removed events is consistent in both approaches. For Field Data A, the removed multiples by the U-net model appear to align better with the overlaying stratigraphy, resulting in sharper results.

## DISCUSSION

Given postmigration prestack gathers, our deep-learning approach identifies the multiples and cancels them out from the output result based on their moveout and geometric interference with primaries in a parameter-free manner. The main success of our implementation is not only the ability to remove multiples, but to do it while preserving the high-frequency components that characterize the data, and to generalize to different data scenarios without the need to retrain. Although denoising is a common postprocessing step targeting these frequency components, a noncontrolled application of it can lead to smoothing of the data, resulting in a loss of relevant features. We believe that seismic interpretation is a challenging task, therefore, any processing method needs to guarantee the preservation of these characteristics.

Despite the fact that in the past years CNNs have been extensively used in seismic applications, there is still a lack of rigorous explanation of hyperparameters choice. Thus, we think that the geophysics community would benefit from our approach to unbox neural networks to establish the relationship between the neural network parameters and their effects on the demultiple task from a deterministic and probabilistic perspective. In particular, our extensive set of experiments has determined that for multiple removal (1) the SGD optimizer is a better candidate than Adam, as it leads to more stable, consistent results; (2) the choice of the sampling operations seems to play a minor role and thus, we prefer to keep the model simpler with less demanding up- and down-sampling operators; (3) as for the kernel size, we empirically have found that square- and small-sized kernels do consistently outperform other kernel shapes when applied on real data; (4) both direct and inverse loss functions provide similar results; and finally (5) the depth of the network has a dramatic effect on the performance and consequently, one needs to determine the correct trade-off between network capacity and inference time. Although the results are encouraging, the empirical assessment only represents a subset of the total number of possible hyperparameter configurations. Nonetheless, they are sufficient to decide which hyperparameters play an important role in improving the transferability of features learned from synthetic to real data applications. As demonstrated by Breuer et al. (2020), similar neural network topologies can be used for different gather-to-gather processing steps, such as trim-statics. Hence, our hyperparameter analysis should also be of value for other seismic gather-to-gather approaches based on the U-net architecture.

In general, it is relatively trivial to train a neural network that can yield accurate results on a synthetic data set. However, it is highly challenging to obtain similar performance on unseen real data with potentially very different acquisition, geology, and processing settings. For this reason, producing synthetic data that realistically mimic subsurface events is crucial ongoing research (Durall et al., 2021). During our experimental evaluation, we have iteratively modeled different synthetic training data to investigate the effects on real data. This data-driven methodology has allowed us to generate a concise multiple-oriented data set, with high generalization properties. Instead of focusing on the large-scale periodic relationships between primary and multiple events, in our approach, we use their geometric shapes and localized interactions. Counterintuitively, the proposed approach does not require a global view of the gather to complete the task. As a result, training the model with larger or elongated max-pooling kernels to increase the receptive field size does not enhance performance; instead, it introduces unwanted compression-decompression artifacts on the primary features (Figure 10). Nonetheless, for tasks where a global view of the gather is of critical importance (e.g. approaches using periodic relationships between events), elongated max-pooling kernels might prove beneficial. Moreover, the main objective of our hyperparameter study was the ability of the model to generalize. To that end, we test the intermediate models on numerous data sets and evaluate their performance qualitatively, as opposed to solely benchmarking using quantitative metrics on synthetic testing data. This fact, together with a feature-rich training data set containing primaries and multiples of various frequencies, moveouts, densities, and noise levels, allows us to reliably process data sets of various characteristics.

The model is applicable to both offset and angle gathers in the time and depth domains, using a parameter-free approach. In this way, our approach can expedite interpretation tasks, providing human experts with assistance in managing extensive volumes of real data.

## CONCLUSION

In this work, we propose a demultiple model that can be interpreted as an image-to-image transformation system in the category of separation-based multiple removal approaches. Thanks to elaborate hyperparameter analysis using ensemble methods and iterative synthetic training data generation, our approach has proven to generalize well when applied to various synthetic and real field data without the necessity to retrain the model. The events removed by our method and PRT are mostly similar, with occasional advantages for the proposed methodology. This advantage is pronounced in cases where the remnant multiple energy is concentrated in the far stack. Due to its parameter-free nature and independence of the CDP gather domain (i.e., offset, angle, depth, and time), this approach has the potential to drastically reduce the turn-over time for postmigration gather conditioning.

## ACKNOWLEDGMENTS

The first and second authors contributed equally to this paper. The authors wish to express their gratitude to the members of the Fraunhofer ITWM DLSeis consortium (http://dlseis.org) for their generous financial support. Additionally, we extend our appreciation to Equinor ASA, Vår Energy ASA, Petoro AS, and ConocoPhillips Skandinavia AS for granting us permission to utilize their Field Data A, and to ExxonMobil for providing the synthetic data set featured in this paper. Furthermore, we acknowledge Equinor and the Volve License partners for making the Volve seismic field data (Field Data B) available under an Equinor Open Data Licence.

## DATA AND MATERIALS AVAILABILITY

Data associated with this research are confidential and cannot be released.

Biographies and photographs of the authors are not available.