## ABSTRACT

Building realistic and reliable models of the subsurface is the primary goal of seismic imaging. We have constructed an ensemble of convolutional neural networks (CNNs) to build velocity models directly from the data. Most other approaches attempt to map full data into 2D labels. We exploit the regularity of seismic acquisition and train CNNs to map gathers of neighboring common midpoints (CMPs) to vertical 1D velocity logs. This allows us to integrate well-log data into the inversion, simplify the mapping by using the 1D labels, and accommodate larger dips relative to using single CMP inputs. We dynamically generate the training data in parallel with training the CNNs, which reduces overfitting. Data generation and training of CNNs is more computationally expensive than conventional full-waveform inversion (FWI). However, once the network is trained, data sets with similar acquisition parameters can be inverted much faster than with FWI. The multiCMP CNN ensemble is tested on multiple realistic synthetic models, performs well, and was combined with FWI for even better performance.

## INTRODUCTION

Seismic imaging and inversion suffer from fundamental ambiguities, such as a lack of ultralow frequencies in the data and ultralong offsets. This lack of critical data results in the well-known gap in the intermediate wavenumber illumination of the subsurface models (Claerbout, 1985; Mora, 1989; Sirgue and Pratt, 2004; Alkhalifah, 2016; Kazei et al., 2016; Kazei and Alkhalifah, 2018; Yao et al., 2019). In most cases, this gap is significant and causes difficulties when building smooth background models. The low frequencies can in principle be acquired together in a long-offset acquisition (e.g., Ten Kroode et al., 2013), yet that is a very expensive solution to the problem of insufficient middle wavenumber information.

Numerous modifications have been proposed to make full-waveform inversion (FWI) work without low-frequency data. Conventionally, the lack of low-wavenumber information is addressed by changing the data misfit functionals in FWI (e.g., Luo and Schuster, 1991; Bozdăg et al., 2011; Choi and Alkhalifah, 2012; van Leeuwen and Herrmann, 2013; Sun and Alkhalifah, 2019). Introducing advanced model regularization techniques, such as total variation or, more generally, Sobolev space norms (e.g., Esser et al., 2016; Kazei et al., 2017; Kalita et al., 2019; Skopintseva et al., 2019), into the inversion is another approach that improves low-wavenumber coverage in FWI. Gradient filtering and conditioning (Ravaut et al., 2004; Alkhalifah, 2016; Kazei et al., 2016; Ovcharenko et al., 2018a; Ruan et al., 2018) is easier to tune and achieves similar results in FWI. All these methods are computationally intensive and require parameter tuning.

Alternatively, low-wavenumber information can, to some extent, be inferred directly from the data in the form of artificial low frequencies (Ovcharenko et al., 2017, 2018b, 2019a; Jin et al., 2018; Sun and Demanet, 2018, 2019; Kazei et al., 2019). Low-dimensional models (e.g., Araya-Polo et al., 2018; Wu et al., 2018) can be directly inferred from the data. Moreover, for models similar to the training data set (e.g., those generated with the same model generator), deep learning can provide resolution comparable to conventional FWI (Farris et al., 2018; Araya-Polo et al., 2020). A step forward in terms of the estimated model complexity and generalization is, however, not as easy; the labels are large-dimensional images. The beauty of deep learning and so on applications in seismic inversion and velocity-model building is that, although training can be very computationally intensive, the application of the trained neural network is computationally very cheap. With the growth of computational power, acoustic FWI has become a prominent tool in high-resolution velocity model building (e.g., Warner et al., 2013). However, anisotropic elastic FWI is still not widely applied due to the large computational cost and large null-spaces associated with multiparameter inversion (Köhn et al., 2015; Kazei and Alkhalifah, 2018, 2019; Podgornova et al., 2018). Deep-learning applications to the multiparameter inversion of seismic data (Ivanov and Stovas, 2017; Dramsch et al., 2019; Zhang and Alkhalifah, 2019) can help address the issue of null space. The ability to apply the trained neural network quickly can also help to analyze the uncertainty in FWI coming from the initial model.

The full-waveform-based methods listed rarely exploit the regularity of active seismic-exploration sampling. Deep learning can infer mapping for seismic data into the subsurface at a given location and then reapply it at another location. Most attempts are made to infer models as a whole (Farris et al., 2018; Richardson, 2018; Wu et al., 2018; Zhang et al., 2018; Araya-Polo et al., 2019, 2020; Øye and Dahl, 2019; Yang and Ma, 2019). Here, we propose using only relevant data from neighboring common-midpoint (CMP) gathers to estimate a single vertical velocity log at a given location. This setup effectively reduces the dimensionality of the target model space and simplifies the learning procedure. We also expect the trained neural network to be versatile for applications to other models because it corresponds to estimating a log, not a full model.

The paper is organized as follows. First, we discuss the regularity of sampling and seismic data relevance. Then, we construct synthetic subsurface models for training purposes by using elastic and affine transforms of an existing model. After that, we numerically explore a single-CMP versus a multiple-CMP training of a convolutional neural network (CNN) and its application on the constructed synthetic models. Later, we apply our “multiCMP” CNN ensemble to the data modeled in realistic velocity models (Marmousi II, SEG Advanced Modeling (SEAM) Phase I, and Overthrust). We also explore a combination of the multiCMP CNN with FWI on all of these models. Finally, we discuss the implications of further applications of the method and draw conclusions.

## THE REGULARITY AND RELEVANCE OF SEISMIC DATA

Equidistant placement of active seismic sources and receivers in active seismic acquisition helps to balance the illumination of the subsurface, thus making it easier to process with conventional stacking procedures. For this reason, regular sampling is typical in active seismic exploration, and the set of available offsets typically is the same for the CMPs in the middle of the region of interest. This means that the setup for estimating the velocity profile can be the same for different CMPs. The last fact is acknowledged by conventional velocity analysis such as Dix conversion and advanced stacking procedures (Mann et al., 1999). However, to the best of our knowledge, these procedures rely on significantly simplified assumptions about the subsurface and do not perform well in complicated geologic scenarios. However, FWI can accommodate arbitrary model complexity, yet it forgets about the regularity of the sampling, and spatial relations between the model and the data are typically handled implicitly. Data-driven inversion allows us to construct a CNN that maps relevant data to relevant locations and disregards the irrelevant data (Figure 1).

### Relevant data

First, let us examine the potential contribution expected from seismic data to a particular subsurface location illumination.

Standard velocity analysis uses CMP stacking along hyperbolas to extract velocities, and then the stacked data are mapped to depth. This is good enough for horizontally layered media in most cases. More advanced velocity analysis techniques, such as common-reflection surface (Mann et al., 1999) or multifocusing (Gelchinsky et al., 1999), take care of mild horizontal velocity variations and curved reflectors relying on certain assumptions about the subsurface. We take this concept to its extreme, by relying on geologically plausible models as realistic scenarios and replacing the conventional stacking analysis with deep-learning inference.

Relevant observed data $uobs(xCMP\u2212\epsilon :xCMP+\epsilon ,hmin:hmax,t)$ are mapped to an estimate of the vertical seismic velocity profile at a target location $v(xCMP,z)$. Here, $xCMP$ is the central midpoint, $uobs$ is the observed wavefield (the acoustic pressure in our case), $hmin$ and $hmax$ are the shortest and longest offsets that are usable from the data, and $t$ and $z$ stand for time and depth, respectively.

In the next subsection, we discuss why the mapping defined by equation 1 is a candidate for an optimal way to cast the seismic exploration inverse problem in the deep learning setup.

### Regularity of seismic data and deep-learning inversion of full waveforms

Conventional active seismic acquisition aims at providing equal illumination to all target areas of interest. Because the earth’s subsurface parameters are not known, we often accomplish this objective by setting up a survey that is regularly sampled in all available dimensions. Therefore, the problem of vertical velocity profile estimation is exactly the same regardless of the location for a given exploration field. Taking this fact into consideration, we set up a deep-learning problem that takes advantage of the similar data coverage for all locations.

The resemblance between different locations in the field of exploration is well understood and is taken into account in seismic imaging methods that rely on simplified media assumptions, such as stacking. However, it cannot be easily incorporated into methods based on full-waveform modeling such as FWI and reverse time migration. Artificial neural networks (ANNs) can serve as universal estimators, and they have no principal constraints on which data space to map; therefore, ANNs can be used to infer the mapping (equation 1) from data-model pairs created by full-waveform modeling.

Finally, training deep neural networks is often computationally expensive. From a mathematical point of view, the mapping in equation 1 is a mapping of 3D functions (5D for 3D data acquisition) to 1D functions, which should theoretically be much easier to estimate than mapping the full data to full models, which would need inference between 3D and 2D functions for 2D problems, or even in 5D to 3D for 3D problems. This makes the mapping to 1D logs an affordable and sufficient option.

## DATA

Data-driven applications heavily depend on the quantity, features, and variability of samples in the data set, which makes data collection and selection crucial. We intend to produce virtual well logs from seismic data arranged into CMP gathers. The data set for such applications should consist of input seismic data and the respective target logs. However, there is a very limited amount of field samples of well logs because the drilling and collection of cores is a costly task. To overcome this limitation, we generate a synthetic data set representing real-world geologic features. First, we generate a set of pseudorandom subsurface models and then numerically propagate seismic waves in them. Later, recorded wavefields are assembled into CMPs and the random velocity models are decomposed into a set of well logs.

### Pseudorandom models

Despite the clarity of intention, there is still no recognized way to generate realistic, textured all-purpose subsurface models with proper geologic features. Relatively realistic subsurface models could potentially be delivered by using neural networks (Ovcharenko et al., 2019b; Wu et al., 2020) and wavelet packets (Kazei et al., 2019). However, to reduce ambiguity in the model generation, multiple approaches and setups need to be further tested.

To simplify the model generation process for the current application, we essentially combine elastic image transformations commonly used in text-recognition applications (Simard et al., 2003) and cropping with stretches (Sun and Demanet, 2018) from an existing model. In particular, we empirically generate pseudorandom subsurface models in three steps from the guiding model — the Marmousi benchmark model (Bourgeois et al., 1991).

First, we build a guiding geologic model by flipping around the vertical axis and replicating the Marmousi model (Figure 3a). Second, we crop a random rectangular patch from the deformed model and resize it back to the size of the target subsurface model (Figure 3b). Then, we produce a distortion map (Figure 3b) from a random Gaussian field and map vertical coordinates in the prior according to this field (Figure 3c). Finally, we add a smooth 20% distortion over the velocity to produce various random models (Figure 3d).

The generator described previously allows us to generate a set of models that expand on the layered geologic structure from the Marmousi model. Despite using a specific velocity reference, the generator produces a diverse range of subsurface models that automatically satisfy the desired range of velocities. The main part of the transformation is the vertical repositioning of the local textures in the guiding model. However, depending on the smoothness of the coordinate transformation, new horizontal layers can be generated and old layers may collapse (Figure 3c).

### Seismic-wave propagation

There are no principal limitations to the type of wave equation or solver we may use. Yet, thousands of shots are necessary to generate statistically significant input for deep learning. To reduce the computational cost of modeling, acoustic 2D (Araya-Polo et al., 2018; Ovcharenko et al., 2018a, 2019a; Li et al., 2019) or elastic 1D (e.g., Zheng, 2019) media assumptions are typically used. We use acoustic modeling.

Should the model be horizontally invariant, we can try to reconstruct its elastic layers from a single shot gather (Röth and Tarantola, 1994) or a single CMP gather (Zheng, 2019). Although there is limited applicability of deep-learning models to laterally inhomogeneous models (Zheng, 2019), to the best of our knowledge, training was performed on vertically variant models to speed up the data generation. Here, we use conventional finite-difference 2D wave propagation and investigate different options for laterally variant media. The data are generated with equidistant sources and receivers at a spacing of 100 m. We model the data with a 7 Hz Ricker wavelet and then band-pass them to 2–4 Hz. The selection of the lower end of the spectrum is inspired by a typical multiscale FWI (MSFWI) strategy (e.g., Bunks et al., 1995), in which a low-resolution model is retrieved from lower frequencies. Also, processing the lower frequencies in the data allows us to reduce the input size by decimating modeled data in space and time. The maximum offset is limited to 4 km, and the data below 2 Hz are filtered out to present a decent challenge to conventional FWI. We symmetrize the data and use only the positive offsets (Figure 4) based on reciprocity theory.

To generate seismic data in each random model, we integrate the Madagascar package (Fomel et al., 2013) into the workflow. We use a compute unified device architecture (CUDA)-accelerated acoustic finite-difference solver in the time domain to numerically simulate seismic wave propagation and to record seismograms at each receiver for every source in the acquisition.

## DEEP LEARNING: SETUP

The general idea of deep learning is to build a mathematical model, which would derive a desired nonlinear relation directly from the data. Selection of a particular deep-learning model is heavily motivated by the attributes of the available data.

We map multiple-CMP gathers (Figure 4) to their respective target well logs. The inputs and the outputs are known arrays of real numbers; therefore, the problem reduces to a supervised task of multivariate regression. To train the neural network, we use the standard mean squared error between estimated $Vestimate$ (predicted) and ground truth $Vtrue$ (label) velocities.

At the training stage, a supervised model derives a relation, which links the input and target variables for each pair in the training partition of the data set. Whereas at the inference stage, the model infers target variables when given a sample from the previously unseen test partition of the data.

Regardless of the type of deep-learning model used in the application, a proper normalization is typically required for each sample of the input and target data. Normalization makes the data fit a range matching the bounds of the activation functions. It also enforces more even contributions of the data features into the training. This leads to weight optimization (training) for a better convergence in a shorter time.

Seismic data are naturally centered approximately zero if there is sufficient variation in the data. Instead of typical data standardization, we use batch normalization as a kind of scalar scaling. After that, we add random noise with 0.1 standard deviation to the input of the neural network to reduce the sensitivity of the network to low-amplitude parts of the CMP gathers. We observed that the procedure works similarly to data standardization, but it allows for more robustness in the generalization of the network.

The target data are seismic velocity profiles, which naturally span a narrow range of values and have similar variability. We experimented with standardization of the training data set and the common Min-Max scaling to the range $[\u22121,1]$. We observed that, although Min-Max scaling leads to faster convergence on the training data set, standardization offers better generalizability.

### Metrics

The normalized root-mean-square (NRMS) error provides a relative measure of cumulative mismatch between the predicted velocity profiles and the known target profile, with an exact match for NRMS = 0%. The coefficient of determination (R2) reflects the fraction of the variance in the data that the model fits (Araya-Polo et al., 2018; Kazei et al., 2020a, 2020b). A model is scored higher than zero when its predictions are more accurate than the mean value. A perfect match is at $R2=1$. The R2 value reflects the quality of the model comparable to the other models, whereas the mean-square-error loss is the actual optimization target.

### CNNs

Regular feedforward neural networks, such as a multilayer perceptron, are suitable for problems with a relatively small size of input data as the number of parameters in them grows quickly with the input size. When the input volume is an image, networks with convolutional layers come into play. Convolutional layers perform convolution of the input volume with a set of filters, which results in a set of feature maps, one corresponding to each filter. The key feature of the convolutional layer is that it admits local connectivity, which means that only a small set of neighboring points contribute to a particular point on the feature map when the filter slides over the input volume. This feature allows the network to learn spatial patterns in the data and use them at the inference stage.

### Architecture

We initially considered two CNNs with exactly the same architecture apart from their first layers, which accept input data of different shapes. First, we construct a 2D CNN that shifts its filters along the offset and time for individual CMPs. This neural network takes a single CMP gather as the input. The second neural network takes multiple CMPs as multiple channels in the first layer and then follows exactly the same architecture as the first neural network (shown in Figure 5). This multiCMP CNN gradually reduces the size of the input, similarly to an encoder. Every other layer replicates the previous one and reduces the size of its input twice following the same flow as in AlexNet (Krizhevsky et al., 2012) and VGG (Simonyan and Zisserman, 2014) neural networks. The input in our case is a multichannel image of size ($noffsets,ntimesteps$, and $nCMPs$). Because $ntimesteps\u226bnoffsets$, it makes sense to consider filters that are stretched along the time axis.

*x*except the last one . ELU is an upgrade to the rectified linear unit activation that improves on the dead neurons problem (Clevert et al., 2015). The last layer of the CNN features a linear activation function. This configuration is often used in deep-learning applications because it presents a computationally cheap way to introduce nonlinearity into the network. Finally, we apply batch normalization after every layer to regularize the training process and prevent overfitting. Though Clevert et al. (2015) observe that using ELU reduces the need for batch normalization, we observe that the weights sometimes collapse without it.

## DEEP LEARNING: TRAINING

Fitting within CNNs happens through the optimization of multiple filters. First, the CNN filters are initialized with random values, and then those are optimized to match the output. The optimization is typically executed with a local search method through backpropagation of errors of the neural network on the training data set. We use one of the popular optimization algorithms, Nadam (Dozat, 2016), which is an adaptive moment estimation (Adam) enhanced by the Nesterov accelerated momentum.

### Standard training

The classic deep-learning process starts with splitting all available data into training, validation, and testing subsets. The goal of training is to infer the dependence between the input and the target data such that it can further be applied to other data. Therefore, the performance of the CNN is evaluated throughout the process of training by applying the neural network to validation data. Once the CNN stops improving its performance on the validation data set, the training stops. The validation data are used for monitoring the training process and should not be used to evaluate the model performance. Thus, a small portion of the whole data is isolated from the training procedure to form the test data set.

Often, the three subsets of the data are determined randomly. However, it is not fair in our case because the samples are spatially correlated. If the CNN sees neighboring logs in the training, all it would learn would be the simple interpolation between them. To mitigate this problem when testing the neural network performance and to examine its performance on training samples, we split the data set in the following way: 77% for training, 18% for validation, and 5% for testing. The total number of models generated is 1000; we observe that using 50 of them is relatively stable for testing.

Once the training set is organized, we need to choose the optimization strategy. This typically includes an optimization algorithm and its parameters. We empirically find that a batch size of 64 provides high yet stable performance over epochs in our case for approximately 57,000 samples. Nadam provides a rapid decrease in the loss function. We note that the validation loss essentially stops decaying after approximately 20 epochs for both neural networks. Then, the learning rate is decreased to further fit the data. Both CNNs tend to overfit the data, which to some extent is prevented by batch normalization and by early stopping on the condition of no validation loss decay over seven epochs. Training takes approximately 60 s per epoch for the single-CMP neural network and approximately 120 s per epoch for the multiCMP CNN. Both networks are trained on a single graphic processing unit (GPU) with 48 GB video memory capacity. Neither network gains any performance by moving to a multi-GPU data-parallel training; we associate this with the relatively small batch size and poor parallelizability of the batch normalization layers. Also, the multiCMP CNN has a larger number of input channels/data per sample; thus, the throughput of the samples provides an additional constraint on the parallelization. From the classic training procedure, we compiled Table 1, which suggests significant overfitting in the learning process.

As expected from the physics of wave propagation in laterally heterogeneous media, the multiCMP CNN is more suitable for inversion. Namely, the ratio of the NRMS error on training and testing data is higher and, most importantly, the NRMS error is lower for the validation data set. We therefore discard the single-CMP neural network from further analysis. We also tested including a wider range of CMPs into multiple-CMP gathers, which we then used as individual training samples. There was no gain in performance of the network on this particular data set, so we fixed the architecture for further analysis.

### Dynamic learning

The standard practice of learning with static data sets suffers from overfitting. To overcome the issue, data augmentations are typically introduced. To help the network be more robust to noise in the data, we add random Gaussian noise to the inputs of all convolutional layers of the network, which could be considered the most generic data augmentation. More interestingly, the process of generating the data set could be considered as an augmentation itself. We therefore develop the following approach: 100 models are generated, and the respective data samples are loaded as the static validation data set. However, the training data set is considered dynamic. We generate 100 random models in the training process and model the data. When the new data are ready, they are fed into the CNN. Thus, new models and data are generated asynchronously with training. We observe that generation of the data related to 100 random models with three GPUs takes approximately 90 s. Training the neural network for a single epoch takes approximately 60 s. Therefore, if we distribute data generation to three GPUs and train the model on the left one, we can keep training the network at the same speed. About every 1.5 epochs, we replace the training data. This way, the training data set is effectively infinite and the overfitting is reduced. We sequentially train an ensemble of five multiCMP neural networks and observe gradual reduction in the overfitting (Figure 6). Note that for training and validation purposes, only the random models originating from the Marmousi model are used and no extra training is performed before testing on other synthetic data and models. We also observe that after 200 epochs of training the R2-scores are almost the same for all ensemble members and average approximately 0.91 and 0.88 on training and testing data, respectively.

### Reproducibility

One of the main features of the neural networks is their stochasticity. There are multiple reasonable mappings that the neural network could learn in a multidimensional regression setting. Leaving the randomness of our training data set out of scope (we fix the seed for reproducibility of the data set), stochasticity of the neural networks comes from several factors: random weight initiation, random training set shuffling, and random parallel arithmetic operations order inside the GPU. Although the first two factors could in principle be isolated by random seed selection, the third one is very hard to deal with. Apart from that, particular choices of random initial weights should not significantly affect the inversion results.

Therefore, instead of going for exact numerical reproducibility, we try to provide a clue about the randomness of our estimates by training multiple instances of our neural networks on the same data set. To keep the computational time within a few hours, we take only five instances of each network, which is not enough to get a reasonable uncertainty estimate, but it can provide us with some insights into it. In the next section, we test our trained ensemble of five CNNs. Note that flipping the input along the CMP axis should not change the output, yet it effectively leads to coupled CNNs. Therefore, we can increase the ensemble to 10 different estimates.

## DEEP LEARNING: TESTS ON OTHER MODELS

In most deep-learning applications to seismic inverse problems, the models are significantly simplified. Here, we go for the full model complexity. The section consists of three subsections. First, we examine the performance of the ensemble of five multiCMP CNNs on the Marmousi II model, which is very similar to the guiding Marmousi model used for the training data set generation. Then, we examine an application to two rescaled crops from the SEAM Phase I model (Fehler and Keliher, 2011). Finally, we show an application on the overthrust model of the multiCMP CNNs and subsequent FWI. For each example, we compute the average and the standard deviation of the outputs within the ensemble.

### Marmousi II

Marmousi II model (Martin et al., 2006; Figure 7a) is an updated version of the guiding Marmousi model (Bourgeois et al., 1991; Figure 3a). Therefore, this model is very similar to those in the training data set and can easily be inferred by the trained network. In the shallow part of the model, the multiCMP CNN retrieves a low-velocity anomaly (Figure 7b) that is not present in the Marmousi model used for the training data set generation. In the deeper part, fine layering is not recovered. Standard deviation maps describe inconsistency between the velocity estimates from the five independently trained CNNs. These were trained by starting from different weight initializations and with different pseudorandom model sets, and they converged to slightly distinct final results. The number of ensemble members is obviously not sufficient to properly estimate the variance, yet we note that the variance increases with depth and is focused around the deeper high-velocity layer location (Figure 7c). Also, it suggests that the deeper part of the model under the strong reflectors is poorly resolved. High-velocity salt intrusions contribute the most into the mismatch distribution. Large-contrast interfaces are quite often associated with larger uncertainty because minor shifts in the interfaces lead to significant changes in the velocity (e.g., Galetti et al., 2015).

### SEAM Phase I

The SEAM Phase I model exhibits sedimentary layering with large embedded salt formations. The depth of the model is much larger than 3 km; therefore, we need to either rescale the model or crop a chunk of it. We consider two geologic cases extracted from the same benchmark model. First, we apply the ensemble of CNNs on a crop with sedimentary layering. Second, we examine the network capability on a rescaled section with a salt body.

#### SEAM I: Layering without salt

We crop a chunk of the sediments from the SEAM Phase I model next to the ocean bottom (Figure 8a). The model is layered with increasing velocity with depth, so FWI-like methods are likely to succeed. We model the data as if we had redatumed the data from air guns to the ocean-bottom data. Therefore, we do not take into account the wave propagation through the upper water layer. The multiCMP estimate of the velocity results in a smooth model that is nearly indistinguishable from the true model (Figure 8b). However, the standard deviation map suggests that there is an increase in the variability of the estimates at depth (Figure 8c). This means that the resulting subsurface model is likely to be sufficient for conventional imaging, but it might need to be updated for time-lapse FWI, in which small perturbations matter.

#### SEAM I: Salt body

The salt-affected part of the SEAM Phase I model features strong reflections from the salt flanks. This type of event is poorly represented by the CMPs in the training data (Figure 9d). However, the inversion recovers the velocity trends next to the salt and the salt boundary (Figure 9b). The CNNs were not given a sufficient amount of data/models featuring salt bodies. Therefore, they are more likely to produce layers rather than large homogeneous high-velocity objects. The central part of the deviation map inside the actual salt position dominates over the neighboring regions. This could be used to potentially find the areas that are not properly inverted and alter them with model interpolation, similarly to variance-based model interpolation for conventional FWI (Ovcharenko et al., 2018a). The sedimentary sections on the sides of the salt body and the top of the salt body are retrieved; therefore, the estimated model could also be useful for salt flooding and subsequent FWI (Kalita et al., 2019).

### Overthrust 2D

To further explore the generalization power of the trained CNNs, we test them on data modeled in a slice of the overthrust model (Figure 10a). This model is challenging for FWI due to its low-velocity layers, which are squeezed between high-velocity reflectors. It is also challenging for conventional layer stripping due to the presence of uplifts and faults. The CNNs recover most of the large-scale features of the model. For the most difficult to resolve part of the model (near faults), the deep-learning (DL) estimate is not perfect (Figure 10b). To improve the model estimate in that region, we run FWI of the full-band data with the source central frequency of 5 Hz. After 200 iterations of the preconditioned conjugate-gradient optimization of full-band, full-offset data, the predicted model is improved (Figure 10b). Each iteration of FWI takes approximately 4.5 s. The improvement of the inversion result is the most significant for this model visually and according to the metrics. Therefore, we attempt to refine the model by running higher frequency (7 Hz) FWI leading to an even better result (Figure 11a).

We compare the 1D vertical velocity profiles estimated by the CNNs (DL), updated by FWI (DL + FWI), MSFWI (DL + MSFWI), and the ground truth profiles at 6, 10, and 12 km positions. The vertical profile at the 6 km horizontal position is in the horizontally flat part of the model. Therefore, it is estimated well by all methods (Figure 11b). The multiCMP CNN (DL) already reveals most of the model features correctly except for the very deep part. FWI improves the fit and provides almost the same result as MSFWI at this location. The second profile (10 km) is located in between the complex (right) and simple (left) parts of the model. The deep-learning estimate is good for the shallower reflector at 1 km and then degrades in the poorly illuminated zone (Figure 11c). FWI and MSFWI coincide except for the very deep reflector. The most complicated part of the model is represented by the third profile (12 km, Figure 11d) with lots of inclinations and faults. The deep-learning estimate suffers from significant underfit in the middepth region, whereas MSFWI here clearly helps the standard FWI starting from the depth of approximately 1.5 km.

In all cases, the general low-wavenumber trend of the model is captured by the multiCMP CNNs (DL) and then improved upon by FWI. Further iterations of FWI would probably lead to an even better model estimate.

Standard deviation (Figure 10c) of the multiCMP estimates correlates well with the errors in the predicted logs. At the 6 and 12 km positions, the deviation is relatively low, whereas at 10 km, we see a larger deviation.

### Computations: DL versus FWI

The efforts put into general training result in an ensemble of CNNs that can be directly and instantly used on different data sets. The application of trained CNNs presented here takes less than 1 ms for inference per sample (CMP gather). Conventional FWI takes 10–300 iterations. Each iteration of conventional FWI requires at least two modeling runs for each shot. The training of the network for one epoch on a single GPU takes less time than generating a single data chunk on three GPUs. Training is at least twice slower than inference because backpropagation is required. Therefore, we can expect the inference with a trained network to be at the very least 10 × 2 × 2 = 40 times faster than conventional FWI. In our particular application, FWI took approximately $4.5\xd7200s\u224315$ min on a single GPU, and the inference happens in less than 1 s. However, training of a single CNN takes approximately 4 h, which is equivalent to approximately 16 FWI applications. Training the whole ensemble of five CNNs is computationally similar to 80 FWI applications. These comparisons are heavily dependent on the implementation and are more of a guideline than a strict comparison of the computational efficiency of the methods.

## DISCUSSION

For regularly sampled seismic-data acquisition, which is typical in seismic exploration, the problem of full-waveform-based velocity-model building can be reduced to the search for mapping from 3D data (relevant traces) to 1D data (single velocity logs). This formulation is not a viable option for conventional FWI implementations because the velocity model needs to be optimized as a whole. However, deep neural networks are universal estimators and can be trained to map arbitrary input to target data samples as long as there is such a mapping. Therefore, training for the inference of particular velocity logs from data sets is a viable option for deep-learning-based velocity-model inference.

For a deep-learning application, it is beneficial to split the whole data set into relevant and irrelevant data to speed up training. Therefore, we extracted the data (CMP gathers) from the neighboring area to the target vertical velocity profile location. We seek to infer the nonlinear relations between the raw seismic data, arranged into CMP gathers, and the respective velocity profiles by using the ANN. In particular, we constructed two CNNs for this purpose and analyzed their capabilities: The first CNN accepts the single-CMP data as input, and the second CNN accepts multiple neighboring CMP gathers as input. We trained both neural networks on data modeled in augmentations of the Marmousi model, produced by elastic deformations and cropping. The multiCMP CNN had a better learning capacity and performed better on models with significant lateral variations. For this reason, throughout the study we focused on multiCMP CNN applications.

An ideal general-purpose neural network should produce plausible results when applied to an arbitrary data set. Obviously, building such a model is a nontrivial task that requires advances from algorithmic and training data sides. For the multiCMP CNNs presented here, low-relief structures are the easiest target. Salt bodies cannot yet be inverted, although inclined reflecting layers can be recovered. Higher frequencies in the input seismic data can in principle be used in the inversion. We observed some deterioration in the results when increasing the low-cut frequency in the input of the multiCMP CNN. To further examine inversion of higher frequencies, the input size and the size of the modeling grid would have to be increased to avoid spatial aliasing and grid dispersion, respectively, which would increase the computational requirements.

The method can directly be applied to field data as described using known vertical profiles from the field, when a lot of locations with known velocity profiles obtained from well logs or check shots exist with the same acquisition setup. Alternatively, the DNN can be trained on synthetic data with acquisition parameters extracted from a field data set. We acknowledge that there are several challenges on the path to a successful field data application of the method and that further research is still necessary.

In the described deep-learning approach, the velocity models are built automatically from raw recorded waveforms. The resolution of these models is comparable to that of traveltime tomographic models rather than FWI models. However, the models built here can be useful as starting models for FWI. Trained neural networks could provide an ensemble of starting models for an FWI uncertainty analysis.

## CONCLUSION

We have proposed and constructed a mapping between raw seismic data (multiple-CMP gathers) and vertical velocity profiles by deep learning. This mapping is represented by trained purely convolutional neural networks. Namely, we trained CNNs to map the relevant data cubes to 1D velocity profiles, rather than to full velocity models. This is a key feature of our method that reduces the complexity of the training process and opens up an opportunity for direct use of the sonic logs for velocity-model building with deep learning.

Just like FWI, our method relies on full-waveform modeling and uses all available data. Therefore, there are no principal limitations to the complexity of models that can be inverted. Just like in FWI, some prior assumptions about the velocity model are necessary to mitigate nonuniqueness in the seismic inversion. These assumptions can easily be incorporated into the training set. Every application of conventional FWI requires numerous forward problem solutions to perform the model optimization. Once the network is trained on a data set, other data sets with similar acquisition parameters and geologic features can be inverted much faster than with conventional FWI. If necessary, the quality of the estimated model can also be subsequently improved by FWI.

## ACKNOWLEDGMENTS

We thank the editors of Geophysics, four anonymous reviewers, and J. Walda from the University of Hamburg for their comments and suggestions that improved the manuscript. We also thank A. Grzywaczewski, A. Baumstein, and H. Denli, and members of the Seismic Modeling and Inversion group and the Seismic Wave Analysis Group at KAUST for constructive discussions on deep learning. The research reported in this publication was supported by funding from KAUST, Thuwal, 23955-6900, Saudi Arabia and Saudi Aramco.

## DATA AND MATERIALS AVAILABILITY

Presented here and additional reproducible examples are available at https://github.com/vkazei/deeplogs.

Biographies and photographs of the authors are not available.