## ABSTRACT

We present a wave-equation inversion method that inverts skeletonized seismic data for the subsurface velocity model. The skeletonized representation of the seismic traces consists of the low-rank latent-space variables predicted by a well-trained autoencoder neural network. The input to the autoencoder consists of seismic traces, and the implicit function theorem is used to determine the Fréchet derivative, i.e., the perturbation of the skeletonized data with respect to the velocity perturbation. The gradient is computed by migrating the shifted observed traces weighted by the skeletonized data residual, and the final velocity model is the one that best predicts the observed latent-space parameters. We denote this as inversion by Newtonian machine learning (NML) because it inverts for the model parameters by combining the forward and backward modeling of Newtonian wave propagation with the dimensional reduction capability of machine learning. Empirical results suggest that inversion by NML can sometimes mitigate the cycle-skipping problem of conventional full-waveform inversion (FWI). Numerical tests with synthetic and field data demonstrate the success of NML inversion in recovering a low-wavenumber approximation to the subsurface velocity model. The advantage of this method over other skeletonized data methods is that no manual picking of important features is required because the skeletal data are automatically selected by the autoencoder. The disadvantage is that the inverted velocity model has less resolution compared with the FWI result, but it can serve as a good initial model for FWI. Our most significant contribution is that we provide a general framework for using wave-equation inversion to invert skeletal data generated by any type of neural network. In other words, we have combined the deterministic modeling of Newtonian physics and the pattern matching capabilities of machine learning to invert seismic data by NML.

## INTRODUCTION

Full-waveform inversion (FWI) has been shown to accurately invert seismic data for high-resolution velocity models (Lailly and Bednar, 1983; Tarantola, 1984; Virieux and Operto, 2009). However, the success of FWI heavily relies on an initial model that is close to the true model; otherwise, cycle-skipping problems will trap the FWI in a local minimum (Bunks et al., 1995).

To mitigate the cycle-skipping problem, Bunks et al. (1995) propose a multiscale inversion approach that initially inverts low-pass-filtered seismic data and then gradually admits higher frequencies as the iterations proceed. AlTheyab and Schuster (2015) remove the mid- and far-offset cycle-skipped seismic traces before inversion and gradually incorporate them into the iterative solutions as the inverted velocity model become closer to the true model. Alternatively, Wu et al. (2014) use the envelope of the seismic traces to invert for the subsurface model because they claim that the envelope carries the ultra-low-frequency information of the seismic data. Ha and Shin (2012) invert the data in the Laplace-domain, which is less sensitive to the lack of low frequencies than conventional FWI. Sun and Schuster (1993), Fu et al. (2018), and Chen et al. (2019) use an amplitude replacement method to focus the inversion on reducing the phase mismatch instead of the waveform mismatch. In addition, they use a multiscale approach by temporally integrating the traces to boost the low frequencies and mitigate cycle-skipping problems, and then they gradually introduce the higher frequencies as the iterations proceed.

Nonlinear inversion often gets stuck in a local minimum, which means that the objective function is very complex and is characterized by many local minima. To avoid this problem, Luo and Schuster (1991a, 1991b) suggest a skeletonized inversion method that combines the skeletonized representation of seismic data with the implicit function theorem to accelerate convergence to the vicinity of the global minimum (Lu et al., 2017). Simplification of the data by skeletonization reduces the complexity of the misfit function as well as the number of local minima. Examples of wave-equation inversion of skeletonized data include the following. Luo and Schuster (1991a, 1991b) use the solutions to the wave equation to invert the first-arrival traveltimes for the low-to-intermediate wavenumber details of the background velocity model. Feng and Schuster (2019) use the traveltime misfit function to invert for the subsurface velocity and anisotropic parameters in a vertical transverse isotropic medium. Instead of minimizing the traveltime misfit function, Li et al. (2016) find the optimal S-velocity model that minimizes the differences between the observed and predicted dispersion curves associated with surface waves. Liu et al. (2018) extend 2D dispersion inversion of surface waves to the 3D case. Li et al. (2018) invert the data recorded over near-surface waveguides using the dispersion curve misfit function. Instead of inverting for the velocity model, Dutta and Schuster (2016) develop a wave-equation inversion method that inverts for the subsurface $Qp$ distribution. Here, they find the optimal $Qp$ model by minimizing the misfit between the observed and the predicted peak/centroid-frequency shifts of the early arrivals. Similarly, Li et al. (2017) use the peak-frequency shift of the surface waves to invert for the $Qs$ model. A tutorial for skeletonized inversion is in Lu et al. (2017).

One of the key problems with skeletonized inversion is that the skeletonized data must be picked from the original data, which can be labor intensive for large data sets. To overcome this problem, we propose computing the skeletonized data by an autoencoder and then use solutions to the wave equation to invert such data for the model of interest (Schuster, 2018). The skeletonized data correspond to the latent codes in the latent space of the autoencoder, which has a reduced dimension and retains significant information related to the model.

The autoencoder neural network is an unsupervised machine learning method that is trained for dimensionality reduction (Schmidhuber, 2015). An autoencoder maps the data into a lower dimensional space by extracting the data’s most important features. It encodes the original data into a condensed representation, also denoted as the skeletonized representation, of the input data. The input data can be reconstructed by a decoder applied to the encoded latent-space vector. In this paper, we first use the observed seismic traces as the training set to train the autoencoder neural network. Once the autoencoder is well trained, we feed the observed and synthetic traces into the autoencoder to get the corresponding low-dimension representations of the seismic data. We compute the misfit function as the sum of the squared differences between the observed and the predicted encoded values. To calculate the gradient with respect to the model parameters such as the velocity in each pixel, we use the implicit function theorem to compute the perturbation of the skeletonized information with respect to the velocity. The high-level strategy for inverting the skeletonized latent variables is summarized in Figure ^{1}, where $L$ corresponds to the forward modeling operator of the governing equations, such as the wave equation. Any machine-learning method, such as principal component analysis (PCA), variational autoencoder (VAE), or a regularized autoencoder, can be used to approximate the original data by a lower dimensional representation.

This paper is organized into four sections. After the introduction, we explain the theory of the Newtonian machine-learning (NML) inversion method. This theory includes the formulation first presented in Luo and Schuster (1991a, 1991b) where the implicit function theorem is used to employ numerical solutions to the wave equation for generating the Fréchet derivative of the skeletal data. Then, we present the numerical results for the synthetic data and field data. The last section provides a discussion and a summary of our work and its significance.

## THEORY

Conventional FWI inverts for the subsurface velocity distribution by minimizing the $l2$ norm of the waveform difference between the observed and synthetic data. However, this misfit function is highly nonlinear and the iterative solution often gets stuck in a local minimum (Bunks et al., 1995). To mitigate this problem, skeletonized inversion methods simplify the objective function by combining the skeletonized representation of data, such as the traveltimes, with the implicit function theorem, to give a gradient optimization method that quickly converges to the vicinity of the global minimum. Instead of manually picking the skeletonized data, we allow the unsupervised autoencoder to generate such picks.

### Theory of the autoencoder

An autoencoder is an unsupervised neural network in which the target output data is the same as the input data, as illustrated in Figure ^{2}. An autoencoder is trained to learn the extremely low-dimensional representation of the input data, also denoted as the skeletonized representation, in an unsupervised manner (Valentine and Trampert, 2012). It is similar to PCA, which is generally used to represent input data using a smaller dimensional space than is originally present (Hotelling, 1933). However, PCA is a linear operation that is restricted to finding the optimal rotation of the original data axes that maximizes its projections to the principal component axes. In comparison, the autoencoder with a sufficient number of layers can find almost any nonlinear sparse mapping between the input and output images. A typical autoencoder architecture is shown in Figure ^{2} that generally includes three parts: the encoder, the latent space, and the decoder.

- •
Encoder: Unsupervised learning by an autoencoder uses a set of training data consisting of $N$ training samples ${x(1),x(2),\u2026,x(N)}$, where $x(i)$ is the $ith$ feature vector with dimension $D\xd71$ and $D$ represents the number of features for each feature vector. The encoder network indicated by the pink box in Figure

^{2}encodes the high-dimension input data $x(i)$ into a low-dimension latent space with dimension $C\xd71$ using a series of neural layers with a decreasing number of neurons; here, $C$ is typically much smaller than $D$. This encoding operation at the first hidden layer can be mathematically described as $z(i)=g(W1x(i)+b1)$, where $W1$ and $b1$ represent the network parameters and the vector of bias terms for the first layer and $g()$ indicates the activation function such as a sigmoid, ReLU, or tanh. - •
Latent space: The compressed data $z(i)$ with dimension $C\xd71$ in the latent space layer (denoted by the green box in Figure

^{2}) is the lowest dimensional space in which the input data are reduced and the key information about the data is preserved. The latent space usually has a few neurons, which forces the autoencoder neural network to create effective low-dimensional representations of the high-dimensional input data. These low-dimensional attributes can be used by the decoder to reconstruct the original input. - •
Decoder: The decoder portion of the neural network represented by the purple box reconstructs the input data from the latent space representation $z(i)$ by a series of neural network layers with an increasing number of neurons. For a decoder with one hidden layer, the reconstructed data $x\u02dc(i)$ are calculated by $x\u02dc(i)=W2z(i)+b2$, where the coefficients of the matrix $W2$ and vector $b2$ represent the network parameters and the bias term of the decoder network, respectively.

In practice, training by a preconditioned steepest-descent method is employed with minibatch inputs. The above equations are for a two-layer autoencoder only; however, this representation can be easily extended to the *N*-layer case.

### Skeletonized representation of seismic data by autoencoder

In this section, we show how the autoencoder computes the low-dimensional skeletonized representation of seismic data. The input data consist of seismic traces, each represented by an $nt\xd71$ vector. Each seismic trace represents one training example in the training set. For the crosswell experiment, there are $Ns$ sources in the source well and $Nr$ receivers in the receiver well. We mainly focus on the inversion of the transmitted arrivals by windowing the input data around the early arrivals.

Figure ^{3} shows a homogeneous velocity model with a Gaussian anomaly in the center. Figure ^{3} is the initial velocity model having the same background velocity as the true velocity model. A crosswell acquisition system with two 1570 m deep cased wells separated by 1350 m describes the source and receiver wells. The finite-difference method is used to compute 77 acoustic shot gathers for the observed and synthetic data with a 20 m shot interval. Each shot is recorded with 156 receivers that are evenly distributed along the depth at a spacing of 10 m. To train the autoencoder network, we use the following workflow.

- 1)
Construct the training set. For every five observed shots, we randomly select one shot gather as part of the training set that consists of a total of 2496 training examples, or seismic traces. We did not use all of the shot gathers for training because of the increase in computation cost.

- 2)
Data processing. Each seismic trace is Hilbert transformed to get its envelope, and then the transformed data are subtracted by its mean and divided by its variance. Figure

^{4}and^{4}shows a seismic trace before and after processing, respectively. We use the signal envelope instead of the original seismic trace because it is less complicated than the latter. According to our tests, the signal envelope leads to faster convergence compared to the original seismic signal. - 3)
Training the autoencoder. We feed the processed training set into an autoencoder network in which the dimension of its latent space is equal to 1. In other words, each training example with a dimension of $nt\xd71$ will be encoded as a smaller number of latent variables by the encoder. The autoencoder parameters are updated by iteratively minimizing equation 1. The Adam and minibatch gradient-descent methods are used to train this network. Figure

^{5}and^{5}shows an input training example and its corresponding reconstructed signal by the autoencoder, respectively, and their differences are shown in Figure^{5}.

After training is finished, we input all the observed and predicted seismic traces into the well-trained autoencoder network to get their skeletonized low-dimensional representations. Of course, each input seismic trace requires the same data processing procedure as we performed for the training set. Figure ^{6}–^{6} shows three observed shot gathers, which are not included in the training set, and their encoded values are shown in Figure ^{6}–^{6}, which are the skeletonized representations of the input seismic traces. The encoded values do not have any units and can be considered as a skeletonized attribute of the data.

We compare the traveltime differences and differences of the latent variables for the observed and synthetic data in Figure ^{7}. The black and red curves represent the observed and synthetic data, respectively. Figure ^{7} shows larger traveltime differences than Figure ^{7} and ^{7}, as its propagating waves are affected more by the Gaussian velocity anomaly than the other two shots. However, the misfit functions for the low-dimensional representations of the seismic data exhibit a pattern similar to that of the traveltime misfit functions. Both reveal a large misfit at the traces affected by the velocity anomaly. Similar to the traveltime misfit values, the encoded values are also sensitive to the velocity changes. In this case, we can conclude that the (1) autoencoder network is able to estimate the effective low-dimensional representations of the input data and (2) the encoded low-dimensional representations can be used as skeletonized features sensitive to changes in the velocity model.

### Theory of the NML inversion

To invert for the velocity model from the skeletonized data, we use the implicit function theorem to compute the perturbation of the skeletonized data with respect to the velocity.

#### Connective function

#### Misfit function

^{8}shows the encoded misfit values versus different values of velocity, which clearly shows that the misfit function monotonically decreases as the velocity values approach the true velocity value ($vtrue=2200\u2009\u2009m/s$). Therefore, the skeletonized misfit function in equation 5 is able to quickly converge to the global minimum when using the gradient optimization method. Using equation 3 and the implicit function theorem, we can get

The Fréchet derivative $\u2202pz(xr,t;xs)/\u2202v(x)$ is derived in the next section.

#### Fréchet derivative

## NUMERICAL TEST

The effectiveness of the NML inversion method is now demonstrated with three synthetic data sets and with a crosswell data set recorded by Exxon in Texas (Chen et al., 1990).

### Checkerboard tests

We first test the NML method on data generated for checkerboard models with three different acquisition geometries.

#### Crosswell checkerboard test

The crosswell checkerboard model is shown in Figure ^{9}. A source well is located at $x=10\u2009\u2009m$, which includes 89 shots evenly distributed along the well with a shot interval of 20 m. Each shot gather is recorded by 179 receivers evenly deployed along the receiver well located at $x=590\u2009\u2009m$. A 15-Hz Ricker wavelet is used as the source wavelet. The initial model is the homogeneous model shown in Figure ^{9}. Figure ^{9} shows the first iteration of the NML gradient, which accurately recovers the checkerboard velocity perturbations.

#### Early arrival surface geometry checkerboard test

In the second checkerboard test, we apply the NML inversion method to early-arrival data recorded by a surface-acquisition geometry. Figure ^{10} and ^{10} shows the true velocity model and true velocity perturbations, respectively. There are 124 shots evenly spaced on the surface at 20 m intervals, and each shot is recorded using 249 receivers placed on the surface with an interval of 10 m. The source function is a Ricker wavelet with a peak frequency of 15 Hz. A time domain 2–4 acoustic finite-difference modeling algorithm is used for data simulation and inversion. The initial model shown in Figure ^{10} is a linear increasing model that is the same as the background velocity of the true model. The first iteration gradient of the NML inversion method is shown in Figure ^{10}, which successfully recovers the shallow velocity perturbations visited by the diving waves.

#### Reflection energy surface geometry checkerboard test

The Born modeling method is used to compute 119 shot gathers on the surface using the true velocity and true reflectivity models shown in Figure ^{11} and ^{11}, respectively. Each shot is recorded with 239 receivers that are evenly distributed on the surface at a spacing of 10 m. An 18-Hz Ricker wavelet is used as the source wavelet, and the initial velocity model is shown in Figure ^{11}. The NML gradient after the first iteration is shown in Figure ^{11}, where part of the velocity perturbations is well recovered in the gradient. However, the nature of reflection inversion with a limited source-receiver aperture only allowed the velocity perturbations near the layer interface to be accurately represented in the gradient.

### Crosswell layer model

The NML inversion method is now tested on synthetic data generated for a layered model with a crosswell acquisition geometry. Figure ^{12} shows the true velocity model, which has three horizontal layers and a linear increasing background velocity. A Ricker wavelet with a peak frequency of 15 Hz is used as the source wavelet. A fixed-spread crosswell acquisition geometry is deployed, where 119 shots at a source interval of 20 m are evenly distributed along the vertical well located at $x=10$. The data are recorded by 239 receivers for each shot, where the receivers are uniformly distributed every 10 m in depth in the receiver well located 990 m away from the source well. The simulation time of the seismic data is 2 s with a time sampling interval of 1 ms.

The training set includes 15,296 observed seismic traces because we only use one of every three shot gathers for training. After data processing, we feed the training data into the autoencoder network shown in Figure ^{13}. The number below each layer indicates the dimension of that layer. The boxes with the pink, green, and blue colors represent the encoder network, latent space layer, and the decoder network, respectively. The autoencoder network is trained with minibatches of 50 traces and a tanh activation function. The training time takes approximately 20 min using one graphics processing unit.

After the autoencoder neural network is well trained, we can input the synthetic traces generated at each iteration of the inversion to get their encoded values. This allows us to calculate the skeletonized misfit and gradient functions to update the velocity model. Figure ^{12} shows the initial model that is homogeneous in the effective inversion area (0.2–2.2 km in depth) with the background velocity equal to 3535 m/s. For comparison, the inverted tomograms from the FWI, wave-equation traveltime inversion (WT), and NML methods are shown in Figure ^{14}–^{14}, respectively. It clearly shows that the FWI tomogram is far from the true model so that FWI will have a cycle-skipping issue. However, the WT and NML inverted models are similar to each other and both tomograms have successfully reconstructed the three velocity layers and the background velocity. The similarity between these two tomograms is because the single latent space neuron mainly preserves the kinematic information of different seismic traces. Figure ^{14} shows the FWI inverted tomogram that used the NML inverted model as the initial model, which further recovered high-wavenumber details of the velocity model.

The comparisons of the velocity profiles at $x=0.3$, $x=0.5$, and $x=0.8\u2009\u2009km$ in the velocity models in Figures ^{12} and ^{14} are shown in Figure ^{15}–^{15}, respectively. The true and initial velocity profiles are represented by the blue and gray solid lines, respectively. The FWI profiles are indicated by the black solid lines, whereas the red and green lines show the NML and WT velocity profiles that are close to the true result. The velocity profile extracted from the FWI + NML tomogram is shown as the magenta dotted line, which almost perfectly matches the true velocity profile.

### Crosswell Marmousi model

Data computed from a part of the Marmousi model are used to test the NML inversion method. We select the upper-right region of the Marmousi model shown in Figure ^{16} with $157\xd7135$ grid points. The finite-difference method is used to compute 77 acoustic shot gathers with a 20 m source interval along the depth of the well located at 10 m. Each shot contains 156 receivers that are evenly distributed at a spacing of 10 m along the vertical receiver well, which is offset 1340 m away from the vertical source well. The data simulation time is 2 s with a time interval of 1 ms. The source wavelet is a 15-Hz Ricker wavelet, and the initial velocity model is shown in Figure ^{16}. The inverted velocity model is shown in Figure ^{17}, and the comparison of their vertical profiles at $x=0.5$ and $x=0.8$ is shown in Figure ^{17} and ^{17}, respectively. The blue, red, and black curves represent the velocity profiles of the initial, true, and inverted velocity models, respectively. It shows that the inverted model can only reconstruct the low-wavenumber information in the true velocity model. To get a high-resolution inversion result, a hybrid approach such as the skeletonized inversion + FWI approach can be used (Luo and Schuster, 1991a, 1991b).

### Friendswood crosswell field data

We now test our method on the Friendswood crosswell field data set. Two 305 m deep cased wells separated by 183 m are used as the source and receiver wells. Downhole explosive charges are fired at intervals of 3 m from 9 to 305 m in the source well, and the receiver well has 96 receivers placed at depths ranging from 3 to 293 m. The seismic data are recorded with a sampling interval of 0.25 ms for a total recording time of 0.375 s. We apply the following processing steps to the raw data, which is similar to the processing workflow in Dutta and Schuster (2014) and Cai and Schuster (1993):

The raw data are scaled by $(t)$ to correct the 3D geometric spreading effects. We multiply the data spectrum with the filter $i/\omega $ to correct the phase.

A band-pass filter of 80–400 Hz is applied to the observed data to remove the noise shown in Figure ^{18}. The filtered data have a peak frequency of 190 Hz.

To remove the tube waves shown in Figure ^{18}, we first flatten the tube waves and then apply a nine-point median filter along the horizontal direction to remove any other arrivals except the tube waves. The filtered tube waves are then shifted back to their original time positions and subtracted from the original data. Figure ^{18} shows the data after tube wave removal.

Because our goal is velocity inversion rather than imaging, we use an FK method to separate the upgoing waves from the downgoing waves. Figure ^{18} shows the pure upgoing waves after wavefield separation, and Figure ^{18} shows the data that contain the downgoing waves only. We interpolate the data to a 0.1 ms sampling interval to ensure numerically stable solutions. A final processed shot gather is shown in Figure ^{18}.

The autoencoder architecture we used here is almost the same as the previous two cases, except the dimensions of the input and output layers are changed to $3750\xd71$. A linearly increasing velocity model is used as the initial model and is shown in Figure ^{19}. Figure ^{19} shows the inverted velocity model with 10 iterations. Two high-velocity zones at the depth range between 85–115 and 170–300 m appear in the inverted result. However, some source artifacts appear near the source well. Figure ^{20} shows the encoded value map of the observed data, where the vertical and horizontal axes represent the source and receivers indexes, respectively. This shows that the near-offset traces have large positive values and the encoded values decrease as the offset increases.

Figure ^{20} and ^{20} shows the encoded value maps of the seismic data generated from the initial and inverted velocity models, respectively, where the latter map is much more similar to the encoded value map of the observed data. To measure the distance between the true and the initial models, we plot the values of the encoded residuals in Figure ^{20}. It shows that there are relatively larger residuals at the near-offset traces than at the far-offset traces. However, these residuals are largely reduced with the inverted tomogram that is shown in Figure ^{20}. This clearly demonstrates that our inverted tomogram is much closer to the true velocity model compared to the initial model.

## DISCUSSION

Tests on the synthetic and observed data demonstrate that wave-equation inversion of seismic data skeletonized by an autoencoder can invert for the low- to intermediate-wavenumber details of the subsurface velocity model. We next test the method’s sensitivity to noisy data and discuss overfitting problem.

### Noise sensitivity tests

In the previous synthetic tests, we assumed that the seismic data are noise free. We now repeat the synthetic tests associated with Figure ^{7}, except we add random noise to the input data. Different levels of noise are added to the observed and synthetic data. Figure ^{21}, ^{21}, ^{21}, and ^{21} shows four shot gathers, and their 80th traces are displayed in Figure ^{21}, ^{21}, ^{21}, and ^{21}. Their encoded results are shown in Figure ^{21}, ^{21}, ^{21}, and ^{21}, where the black and red curves represent the encoded values from the observed and synthetic data, respectively. It appears that the range of encoded values decreases as the noise level increases. Moreover, the encoded residual also decreases, which indicates that the encoded values become less sensitive to the velocity changes as the data noise level increases.

Figure ^{22} shows magnified views of the encoded values in Figure ^{21}, where some oscillations appear in the noisy data. These oscillations could further affect the accuracy of the inverted result, especially if the small velocity perturbation is omitted. Therefore, good data quality with less noise is preferred for the autoencoder method to recover an accurate subsurface velocity model.

### Overfitting problem

In our examples, the number of seismic traces in the training set is usually smaller than the number of parameters in the autoencoder, which might result in an overfitting problem. If the data are overfitted, the network learns the intricacies of the training data set at the expense of its ability to represent unseen examples (in the test data) (Valentine and Trampert, 2012). In other words, at some point during training, the reconstruction error of the training set keeps decreasing, while the reconstruction error of the testing set is either stable or becomes worse. Figure ^{23} and ^{23} shows the reconstruction errors of the training and testing sets versus the iteration number, respectively. It clearly shows that the reconstruction errors of both data sets decrease rapidly within the first 10 iterations and then gradually become stable. A similar pattern with our training and testing sets demonstrates that we do not suffer from the overfitting problem during training.

### The connection between the encoded value and the decoded waveform

An ideal autoencoder neural network seeks to identify the common features in the training set and encapsulates these within the encoder and decoder functions. The latent variables contain the information that distinguishes between each individual example in the data set (Valentine and Trampert, 2012). To illustrate this point, we perturb the encoded values in the latent space and see how the decoded waveform changes. Figure ^{24} displays the changes in the encoded values and the decoded waveforms. It clearly shows that the changes in the encoded values result in temporal shifts in the waveforms, but the shape of the waveform barely changes. Therefore, in this case, the latent space information is mainly related to the traveltimes, which are necessary to distinguish different examples in the data set.

### Multidimensional NML inversion

We apply the multidimensional NML inversion method to the same data that were generated in the crosswell Marmousi model. The same data set is used for training except that there are two latent space neurons in the autoencoder. The autoencoder with two latent space neurons converges to a smaller residual compared to the single-latent space neuron autoencoder, which means that more waveform information is preserved in the latent space. The initial model is a linear increasing model shown in Figure ^{16}. Figure ^{25} and ^{25} shows the single-neuron and double-neuron NML tomograms, where the latter recovers more detail especially at depths between $z=0.1$ and $z=0.8\u2009\u2009km$. The velocity profile comparisons at $x=0.5$ and $x=0.9\u2009\u2009km$ are shown in Figure ^{25} and ^{25}, respectively, which shows that the double-neuron NML profile (the red solid line) more closely agrees with the true velocity profile (the blue solid line). This improvement suggests that two latent space neurons contain more information about the subsurface than does a single neuron. However, inverting a greater number of neurons will likely lead to a greater chance of getting stuck in a local minima. To avoid this we suggest a multiscale strategy that initially inverts for the velocity model from a low-dimensional latent space representation. This low-wavenumber model is then used as the starting velocity model for inverting latent-space variables with a higher-dimension.

## CONCLUSION

We presented a wave-equation method that finds the velocity model which minimizes the misfit function in the autoencoder’s latent space. The autoencoder can compress a high-dimensional seismic trace to a smaller dimension that best represents the original data in the latent space. In this case, measuring the encoded residuals largely reduces the nonlinearity when compared with measuring their waveform differences. Therefore, the inverted result will be less prone to getting stuck in a local minimum. The implicit function theorem is used to connect the perturbation of the encoded values with the velocity perturbations to calculate the gradient. Numerical results with synthetic and field data demonstrate that skeletonized inversion with the autoencoder network can accurately estimate the background velocity model. The inverted result can be used as a good initial model for FWI.

The most significant contribution of this paper is that it provides a general framework for using solutions to the governing partial differential equation in order to skeletal data generated by any type of neural network. The governing equation can be that for gravity, seismic waves, electromagnetic fields, and magnetic fields. The input data can be the records from different types of data, as long as the model parameters are sensitive to the model perturbations. The skeletal data can be the latent space variables of a regularized autoencoder, a VAE, or a feature map from a CNN, or PCA features. That is, we have combined the deterministic features of forward and backward modeling in Newtonian physics with the dimensional reduction capabilities of machine learning to invert seismic data by NML.

## ACKNOWLEDGMENTS

The research reported in this paper was supported by the King Abdullah University of Science and Technology (KAUST) in Thuwal, Saudi Arabia. We are grateful to the sponsors of the Center for Subsurface Imaging and Modeling (CSIM) Consortium for their financial support. For computer time, this research used the resources of the Supercomputing Laboratory at KAUST. We thank them for providing the computational resources required for carrying out this work. We also thank Exxon for the Friendswood cross-well data.

## DATA AND MATERIALS AVAILABILITY

Data associated with this research are confidential and cannot be released.