ABSTRACT

Full-waveform inversion (FWI) is a powerful method for providing a high-resolution description of the subsurface. However, the misfit function of the conventional FWI method (based on the l2-norm metric) is usually dominated by spurious local minima owing to its nonlinearity and ill-posedness. In addition, FWI requires intensive wavefield computation to evaluate the gradient and step length. We have considered a general inversion method using a deep neural network (DNN) for the FWI problem. This deep-learning inversion method reparameterizes physical parameters using the weights of a DNN, such that the inversion amounts to reconstructing these weights. One advantage of this deep-learning inversion method is that it can serve as an iterative regularization method, benefiting from the representation of the network. Thus, it is suitable for solving ill-posed nonlinear inverse problems. Furthermore, this method possesses good computational efficiency because it only requires first-order derivatives. In addition, it can easily be accelerated by using multiple graphics processing units and central processing units for weight updating and forward modeling. Synthetic experiments, based on the Marmousi2 model, the 2004 BP model, and a metal ore model, are used to show the numerical performance of the deep-learning inversion method. Comprehensive comparisons with a total-variation regularized FWI are presented to show the ability of our method to recover sharp boundaries. Our numerical results indicate that this deep-learning inversion approach is effective, efficient, and can capture salient features of the model.

INTRODUCTION

Full-waveform inversion (FWI) is a powerful method for reconstructing high-resolution subsurface parameters (Tarantola, 1984; Pratt and Worthington, 1990). This method consists of minimizing a misfit function measuring the discrepancy between observed and synthetic data. This minimization process is often realized by adopting either local or global optimization strategies. In practice, local optimization methods are preferred to global optimization methods because global strategies require more computations. Over the past few decades, many local optimization methods have been introduced to minimize the FWI objective function, including the steepest-descent method (Tarantola, 1984; Choi et al., 2005) and its acceleration (Yan and Wang, 2018), the nonlinear conjugate gradient method (Pratt, 1999; Kamei and Pratt, 2013; He and Han, 2018), the quasi-Newton method (Brossier et al., 2009), and the truncated Newton method (Epanomeritakis et al., 2008; Métivier et al., 2013).

Applying local minimization methods to the conventional FWI objective function remains challenging because they are usually hindered by multiple local minima, resulting from its high nonlinearity and ill-posedness (Virieux and Operto, 2009), and by large-scale computations in 3D. This means that low-frequency information or a good initial model is required when using local minimization methods. Several strategies have been developed to address the multiple minima issue, including regularization methods (Lin and Huang, 2015; Esser et al., 2018), methods based on Wasserstein measures (Métivier et al., 2016, 2018; Yang et al., 2018), wavefield reconstruction inversion (Leeuwen and Herrmann, 2013), adaptive waveform inversion (Warner and Guasch, 2014, 2016), and extended modeling methods (van den Berg and Kleinman, 1997; Symes, 2008; Abubakar et al., 2009; He et al., 2016; Fu and Symes, 2017a, 2017b). For large-scale computations, one can adopt a fast and parallel forward problem solver to perform wavefield modeling, including parallel direct and iterative solvers (Operto et al., 2007; Li et al., 2015; Ghysels et al., 2016) and develop an efficient method to estimate the gradient of the misfit function, such as the adjoint-state method (Plessix, 2006).

Nowadays, deep neural networks (DNNs) are powerful machine learning architectures with many applications. Neural networks were revived in the machine-learning community around 2006 (Hinton et al., 2006) and have since been successfully used in computer vision, speech recognition, and many other fields (LeCun et al., 2015). The success of a DNN is mainly supported by the continuous increase in modern computing capacity and the efficient implementation of back-propagation procedures (Goodfellow et al., 2016). Recently, DNNs and other data science techniques have attracted much attention for solving inverse problems (Jin et al., 2017; Schwab et al., 2019). In geophysical applications, DNNs have gained interest for fault detection, low-frequency reconstruction, and velocity model building (Araya-Polo et al., 2017; Lewis and Vigh, 2017; Araya-Polo et al., 2018b; Farris et al., 2018; Jin et al., 2018; Sun and Demanet, 2018; Wang et al., 2018; Wu et al., 2018; Adler et al., 2019; Gabriel and Rahul, 2020). DNNs can generate acceptable results for these problems, provided that many high-quality labeled training data sets exist and that DNNs are correctly trained. Generative adversarial networks (GAN-FWI), recurrent neural networks (RNN-FWI), and convolutional neural networks (CNN-FWI) have recently been applied to seismic FWI problems (Richardson, 2018a, 2018b; Wu and McMechan, 2018, 2019). Compared with GAN-FWI, the RNN-FWI and CNN-FWI are efficient and do not require hundreds of thousands of data sets during training (Richardson, 2018a), which makes them competitive for solving large-scale FWI inverse problems.

In this study, we present a general inversion method, using DNNs for the FWI problem, called DNN-FWI. The main idea is motivated by GAN-FWI, RNN-FWI, CNN-FWI, and the universal approximation property of neural networks (Cybenko, 1989; Hornik, 1989). For this deep-learning inversion method, the physical parameters are first parameterized with the weights of a DNN. Thus, the unknown parameters become the neural network weights associated with the unknown physical parameters. The neural network weights can be efficiently determined using the Adam optimization method widely used in the field of DNNs (Kingma and Ba, 2014). DNN-FWI can serve as an iterative regularization method by incorporating specific neural network layers. For example, the convolution layer can extract boundary-feature information. Specifically, the deep feedforward neural network can be regarded as a type of iterative regularization method (Benning and Burger, 2018). Furthermore, unlike other methods, the DNN-FWI iterative framework performs repeated forward simulations, which can increase data consistency and improve the quality of the inversion results (Li et al., 2020). Compared with the work of Wu and McMechan (2018, 2019), we present detailed theoretical considerations to support this deep-learning inversion method. The parameterization is usually a sparse representation of the physical model of interest. We comprehensively compare DNN-FWI with conventional total variation (TV)-regularized FWI to demonstrate the ability of DNN-FWI to recover sharp boundaries.

In the following sections, we first present the methodology of the DNN-FWI method, including the pretraining and inversion training stages. We then review some efficient forward solvers and develop a specific deep learning model. Experimental examples are presented based on the Marmousi2 model (Martin et al., 2006), the 2004 BP model (Billette and Brandsberg-Dahl, 2005), and a synthetic metal ore model, to exemplify the effectiveness and efficiency of the DNN-FWI inversion scheme.

THEORY

The FWI method

Mathematically, the FWI method consists of minimizing a misfit function f(m) that measures the discrepancy between recorded and simulated data. For simplicity, we assume that only one source is applied; it is straightforward to sum over multiple source locations. Under this assumption, the conventional FWI misfit function is formulated as
$f(m) = \frac{1}{2} \left\| R u(m) - d_{\mathrm{obs}} \right\|_{2}^{2},$
(1)
where $\|\cdot\|_{2}$ denotes the l2-norm, R is a restriction operator that extracts the wavefield at the receiver positions for each source, m represents the physical parameters that describe the subsurface, $d_{\mathrm{obs}}$ are the observed data, and u(m) represents the simulated data, which are solutions of the following equation:
F(u,m)=s.
(2)
In equation 2, s denotes the source term and F represents a partial differential operator of the wave equation.
Minimizing the constrained problem in equations 1 and 2 can be achieved by adopting gradient- or Newton-type methods. For an iterative method, at the (k+1)th iteration, the parameter m can be generally formulated as
$m_{k+1} = m_{k} + \alpha_{k} p_{k},$
(3)
where $p_{k}$ denotes the descent direction of the misfit function in equation 1, and $\alpha_{k}$ is the step length. In equation 3, choosing $p_{k} = -\nabla f(m_{k})$ leads to the steepest-descent method and setting $p_{k} = -B_{k}^{-1} \nabla f(m_{k})$ yields a Newton-type method, where $B_{k}$ is a symmetric positive definite approximation of the Hessian matrix, or it is the true Hessian matrix $H_{k}$.

Reparameterized FWI under DNNs

Fundamental theory of DNNs

DNNs are widely used for machine learning tasks, including signal processing, image recognition, and language processing. The fundamental principle of a DNN is the universal approximation theorem. Mathematically, this theorem is a specific application of the function approximation theorem. For simplicity of the network architecture and convenience of training, most DNNs consist of piecewise linear layers. This simple architecture is underpinned by the fact that piecewise linear polynomials are dense in the functional space composed of all Borel-measurable functions. Note that the piecewise polynomial approximation property is widely applied in mathematics, for example, in the finite-element method, which approximates a function with piecewise polynomials on each grid cell. From the big data science point of view, the success of the universal approximator is supported by the fact that high-dimensional data concentrate close to a low-dimensional manifold (Lei et al., 2020). Deep feedforward networks (DFNs) or feedforward neural networks are the quintessential deep learning architecture of DNNs. We focus only on the DFN in the following context.

In mathematical analysis, for any Borel measurable function g(x), there is a piecewise linear polynomial p(x) that can approximate g(x) with any desired degree of accuracy, denoted by
$g(x) \approx p(x) = \sum_{n=0}^{N} e_{n} \phi_{n}(x),$
(4)
where $e_{n}$ is the coefficient with respect to the piecewise linear polynomial $\phi_{n}$. In DNN terminology, there exists a network G, denoted by
$p(x) = G(w)(x) \equiv G_{L}\left[ w_{L}, G_{L-1}\left( w_{L-1}, \ldots, G_{1}(w_{1}) \right) \right](x),$
(5)
where $G_{1}$, $G_{L-1}$, and $G_{L}$ are the first, (L−1)th, and Lth layers of the network, respectively; L denotes the depth of the deep learning model; and $w = (w_{1}, \ldots, w_{L-1}, w_{L})$ is the collection of weights that the neural network should learn to approximate a given function g(x). The simple layer $G_{k}$, $k \in \{1, \ldots, L\}$, with the activation function $\sigma_{k}$ is defined by
$G_{k}(W_{k})(x) = \sigma_{k}(W_{k} x + b_{k}),$
(6)
where $W_{k}$ is the affine mapping and $b_{k}$ denotes the bias term. The parameters $W_{k}$ and $b_{k}$ constitute the learnable parameters, which are initialized randomly for the kth layer. The element-wise nonlinear function $\sigma_{k}$ is continuous, such as the rectified linear unit ($\mathrm{ReLU}(x) = \max\{0, x\}$; Jarrett et al., 2009) and its variants, Leaky ReLU (Maas et al., 2013) and PReLU (He et al., 2015).
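As a concrete illustration of equation 6, one such layer can be written in a few lines of PyTorch, the library used in our implementation. This is only a minimal sketch; the layer sizes are arbitrary placeholders and do not correspond to the network used in this paper.

```python
import torch

# Minimal sketch of one layer G_k(W_k)(x) = sigma_k(W_k x + b_k) from equation 6.
# The sizes (8 inputs, 16 outputs) are arbitrary placeholders.
layer = torch.nn.Sequential(
    torch.nn.Linear(8, 16),  # affine mapping W_k x + b_k (weights initialized randomly)
    torch.nn.ReLU(),         # element-wise nonlinearity sigma_k
)

x = torch.randn(1, 8)        # dummy input vector
y = layer(x)                 # layer output, shape (1, 16)
```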

Note that the mathematical approximation theorem only guarantees the existence of the piecewise polynomial for a given function. It does not provide a method for building this polynomial. In contrast, the DNN framework provides a systematic and efficient method to find an optimal piecewise polynomial through solving an optimization problem. This is one of the reasons why we consider the FWI problem within a DNN framework.

Model reparameterization with DNNs and inversion

Based on the previous theory, for the model vector m of the FWI problem, we can learn a network G(w) that can properly approximate the parameter m to a given accuracy. For simplicity, we use the following equation to represent this approximation:
m=G(w).
(7)
Equation 7 can also be denoted as m(w)=G(w) to explicitly illustrate the functional dependency on the weight vector w. With this reparameterization of the parameter m, the misfit function in equation 1 becomes
$J(w) \equiv f(m(w)) = \frac{1}{2} \left\| R u(G(w)) - d_{\mathrm{obs}} \right\|_{2}^{2}.$
(8)

In equation 8, the FWI problem amounts to building the weight vector w of the network G. One advantage of this reparameterization strategy is that it provides a sparse representation of the model m by introducing special layers (such as convolutional layers), thereby regularizing FWI and mitigating local minima problems. Furthermore, by including a priori information regarding the physical parameters, specific features can be extracted by building specific layers of the network. In other words, inversion under the DNN framework can be regarded as an implicit regularization method, which is essential for tackling the ill-posedness of the FWI problem. In addition, inversion in the context of the DNN can be expedited with graphics processing units (GPUs) through a neural network library. Hereafter, we refer to this inversion method as the DNN-FWI method.

To minimize equation 8 using a gradient-type method, the gradient of the misfit function with respect to the learnable parameters w should be calculated efficiently. Differentiating the objective function in equation 8 and applying the chain rule, we obtain
$\nabla J(w) = \left( \frac{\partial m(w)}{\partial w} \right)^{T} \frac{\partial f(m(w))}{\partial m(w)} = \left( \frac{\partial G(w)}{\partial w} \right)^{T} \nabla f(m(w)).$
(9)
The first term $\partial G(w)/\partial w$ on the right side of equation 9 is the Jacobian matrix of the network with respect to the weight w, which can be efficiently approximated using the back-propagation algorithm (Rumelhart et al., 1986; Goodfellow et al., 2016). The factor $\nabla f(m(w))$ is the gradient of the conventional FWI misfit function with respect to the parameter m. It can also be efficiently computed using the adjoint-state method (Plessix, 2006).
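In a PyTorch implementation, the chain rule in equation 9 need not be coded explicitly: the externally computed gradient ∇f(m) can be supplied as the upstream gradient of the network output, and automatic differentiation accumulates $(\partial G/\partial w)^{T} \nabla f(m)$ into the weight gradients. The following is a minimal, self-contained sketch of this mechanism; the tiny network, its input, and the random array standing in for ∇f(m) are placeholders, not the actual components of our implementation.

```python
import numpy as np
import torch

# Toy stand-ins: a tiny network G and a random "FWI gradient" grad_f, purely to
# illustrate the mechanics; in practice grad_f is the adjoint-state gradient
# returned by the external forward-modeling code.
net = torch.nn.Sequential(torch.nn.Linear(3, 128 * 384), torch.nn.Tanh())
a = torch.tensor([[-0.5, 1.0, 0.5]])
grad_f = np.random.randn(1, 128 * 384)

m = net(a)                                                    # m = G(w)
upstream = torch.as_tensor(grad_f, dtype=m.dtype).reshape(m.shape)

# Back-propagating grad_f through G accumulates (dG/dw)^T grad_f, i.e., the
# right-hand side of equation 9, into the .grad attributes of the weights.
m.backward(gradient=upstream)
```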
To run DNN-FWI, we must parameterize the given initial model $m_{0}$ to let the network learn some a priori information about the initial model. We call this parameterization process “pretraining.” To parameterize or learn the initial model, we need to minimize the following misfit function $J_{\mathrm{pre}}$, which measures the distance between G(w) and $m_{0}$:
$w^{*} = \arg\min_{w} J_{\mathrm{pre}}(w), \qquad J_{\mathrm{pre}}(w) = \left\| G(w) - m_{0} \right\|_{1},$
(10)
where $\|\cdot\|_{1}$ denotes the l1-norm. The l1-norm can capture the main features of the initial model, and it is more robust with respect to outliers than the l2-norm.
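The pretraining in equation 10 is a standard supervised fit of the network output to the initial model under the l1-norm. A minimal sketch, reusing the placeholder network and input from the previous snippet, might look as follows; here the initial model m0 is a random placeholder and, in practice, must be rescaled into [−1, 1] because the network ends with a Tanh layer.

```python
import torch

# Pretraining (equation 10): fit G(w) to the (normalized) initial model m0
# under the l1-norm using Adam.  'net' and 'a' are the placeholder network and
# input from the previous sketch; m0 below is a random stand-in for the
# initial model, already scaled into [-1, 1].
m0 = torch.rand(1, 128 * 384) * 2.0 - 1.0

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = torch.nn.L1Loss(reduction="sum")    # J_pre(w) = ||G(w) - m0||_1

for it in range(30000):
    optimizer.zero_grad()
    loss = loss_fn(net(a), m0)
    loss.backward()
    optimizer.step()
```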

After pretraining, we start the reparameterization of the FWI with a DNN. The main difference between conventional training and the reparameterized FWI training is the computation of the gradient with respect to the weight vector w. For conventional training, only the discrepancy between the output of the network and the given labeled data is back-propagated from the output layer to the input layer to evaluate the gradient, whereas for DNN-FWI, the gradient $\nabla f(m)$ is back-propagated to evaluate the gradient with respect to the weight vector w. Therefore, the DNN-FWI approach can take advantage of the constrained forward model by synthesizing the recorded data to ensure data consistency. Accelerating the computation of the gradient $\nabla f(m)$ and of the partial derivative of the network G(w) can be achieved through parallelization on central processing units and GPUs. Algorithm 1 describes the framework of the reparameterized FWI algorithm in detail.
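For reference, the overall structure of the inversion training stage summarized in Algorithm 1 can be sketched as follows. The helpers forward_modeling, fwi_gradient, and denormalize, as well as the variables frequencies and d_obs, are hypothetical placeholders for the external frequency-domain solver, the adjoint-state gradient computation, the rescaling from the Tanh range to physical units, and the acquisition data; this is a sketch of the workflow, not our production code.

```python
import torch

# Sketch of the inversion training stage (compare Algorithm 1).
# 'net' and 'a' are the reparameterization network and its fixed input;
# forward_modeling(), fwi_gradient(), denormalize(), frequencies, and d_obs
# are hypothetical placeholders for components described in the text.
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for freq in frequencies:                 # low-to-high frequency continuation
    for it in range(50):                 # fixed iterations per frequency (e.g., 50)
        optimizer.zero_grad()
        m = net(a)                       # network output in [-1, 1]
        velocity = denormalize(m)        # differentiable rescaling to physical units

        v_np = velocity.detach().cpu().numpy()
        d_syn = forward_modeling(v_np, freq)                   # external solve of eq. 2
        grad_f = fwi_gradient(v_np, d_syn, d_obs[freq], freq)  # adjoint-state gradient

        # Inject grad_f through the network (equation 9) and update the weights.
        upstream = torch.as_tensor(grad_f, dtype=velocity.dtype).reshape(velocity.shape)
        velocity.backward(gradient=upstream)
        optimizer.step()
```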

Note that the parameterized network G(w) can be regarded as the generative part of a generative adversarial network (GAN), a special deep network architecture consisting of a generator network and a discriminator network (Goodfellow et al., 2014). GANs have achieved significant successes in image inpainting (Pathak et al., 2016), superresolution (Iizuka et al., 2017), 3D vision (Wu et al., 2016), face editing (Shen and Liu, 2017), and FWI (Araya-Polo et al., 2018a; Richardson, 2018a).

EXPERIMENTAL TESTS

Forward problem

The method presented in this study is general, and it can be applied to acoustic and viscoelastic media, in the time and frequency domains. For illustration, we focus on the 2D frequency-domain acoustic FWI problem to show the computational efficiency of this method. We assume that the density is constant and that the model m is parameterized in terms of the compressional-wave velocity. In this context, the acoustic wave equation can be described by the following Helmholtz equation:
$F(u, m) \equiv \Delta u + \omega^{2} m^{2} u = s,$
(11)
where Δ denotes the Laplacian operator, ω is the angular frequency, m is the velocity, and s is the source function. Denote the differential operator as $A(m) \equiv \Delta + \omega^{2} m^{2}$. Then, equation 11 can be rewritten as
A(m)u=s.
(12)

To solve the forward problem numerically, equation 11 is discretized with a fourth-order finite-difference stencil (Hustedt et al., 2004). Perfectly matched layers (PMLs) are used to reduce artificial reflections from the boundaries (Berenger, 1994; Komatitsch and Martin, 2007; Pan et al., 2012). After discretization, we obtain a large sparse linear system whose solution is the numerical solution of equation 11. This large linear system can be efficiently solved using the hierarchically semiseparable structured (HSS-structured) multifrontal solver (Ghysels et al., 2016; Rouet et al., 2016). Compared to standard direct solvers, the HSS-structured multifrontal solver achieves high performance and good scalability in terms of computational costs and storage. This solver is implemented under the distributed-memory MPI and shared-memory OpenMP parallel programming frameworks. For the DNN-FWI method, the evaluation of f(m) and its gradient is implemented in C under the MPI framework, and the network update is implemented in Python with PyTorch.
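To make the structure of equation 12 concrete, the following is a toy sketch of a frequency-domain forward solve in Python with SciPy. It uses only a second-order stencil, homogeneous Dirichlet boundaries, and a generic sparse solver, so it is a structural illustration rather than the fourth-order, PML-bounded, HSS-accelerated solver described above.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def helmholtz_solve(m, omega, h, s):
    """Toy solver for A(m) u = s (equation 12), with A(m) = Laplacian + omega^2 m^2.

    Structural sketch only: second-order stencil, homogeneous Dirichlet
    boundaries, and a generic sparse LU solve.  The solver used in the paper is
    a fourth-order finite-difference scheme with PML boundaries, factorized by
    an HSS-structured multifrontal method and parallelized with MPI/OpenMP.
    """
    nz, nx = m.shape
    d2 = lambda n: sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n))
    # 2D Laplacian assembled from 1D second-difference operators.
    lap = (sp.kron(sp.eye(nz), d2(nx)) + sp.kron(d2(nz), sp.eye(nx))) / h**2
    A = (lap + sp.diags(omega**2 * m.ravel()**2)).tocsc()
    return spla.spsolve(A, s.ravel()).reshape(nz, nx)
```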

In the following section, we provide insights into the design of a deep network for a specific inverse problem and detail the specific deep network used for the FWI problem. Synthetic examples are then presented to test the performance of DNN-FWI. Comprehensive comparisons with the conventional FWI, using the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm, are also provided (Liu and Nocedal, 1989). The Adam optimization method is used for the pretraining and DNN-FWI inversion stages (Kingma and Ba, 2014).

Network design and setting

The universal approximation theorem ensures that, for each continuous function, a DNN approximating this function to arbitrary accuracy must exist (Cybenko, 1989; Hornik, 1989). However, from a geometric view of deep learning theory, for any DNN with a fixed architecture, there exists a manifold that cannot be learned by the network (Lei et al., 2020). Therefore, the number of units (or neurons) and layers that should be used for the deep network, and how these layers should be connected to each other, are challenges for the design of any neural network. The rule of thumb in the design of deep networks is that a deeper model can greatly reduce the number of units and the generalization error, although it may also suffer from vanishing and exploding gradients (Goodfellow et al., 2016). We also observe that one fixed network may represent several models, with different weights, although it may not be the optimal network for these models. We refer to Radford et al. (2015) for guidelines on designing networks.

Here, we explain the neural network used in our synthetic experiments. The network consists of 32 layers that perform alternating convolution (Conv) and deconvolution (DeConv) operations, and a hyperbolic tangent (Tanh) function forms the output layer. The Conv operations are used to capture abstract features at a high level, and the DeConv layers double the resolution of the outputs of the previous Conv layers. The nonlinear activation function ReLU is added to each hidden layer. A batch normalization (BatchNorm) layer is introduced after each Conv and DeConv layer to remove the mean and normalize the variance of the outputs of the previous layer. Figure 1 illustrates the specific network applied in our numerical experiments. In this network, a CellLayer that consists of one DeConv layer and four Conv layers is introduced to simplify the description. Arguments I and O of the CellLayer represent the number of input and output channels, respectively.
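A CellLayer of this kind might be assembled in PyTorch as sketched below. The kernel sizes, strides, and padding are illustrative assumptions, not the exact hyperparameters of Figure 1.

```python
import torch.nn as nn

class CellLayer(nn.Module):
    """Illustrative CellLayer: one DeConv (upsampling) block followed by four
    Conv blocks, each with BatchNorm and ReLU.  Kernel sizes, strides, and
    padding are placeholder choices, not the exact values of Figure 1."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        blocks = [
            # DeConv: doubles the spatial resolution of the previous output.
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4,
                               stride=2, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        ]
        for _ in range(4):  # four Conv blocks that refine the upsampled features
            blocks += [
                nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            ]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):
        return self.net(x)
```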

Marmousi2 model

The Marmousi2 model contains complex geologic structures, was created to simulate seismic data, and is used as a benchmark for testing seismic imaging methods (Martin et al., 2006). The original Marmousi2 model has a width of 17 km and a depth of 3.5 km. A modified model, with a grid size of 384 × 128 and a spatial sampling of Δx = Δz = 20 m, is used here to test the performance of the DNN-FWI method. Figure 2a shows the true model. The initial model (or starting model) is obtained by smoothing the true model using a Gaussian filter with a standard deviation equal to 160 m and is shown in Figure 2b. Note that 10 cells are added on each side of the rectangular domain for the PML boundary conditions. The acquisition is composed of 116 sources, uniformly located at a depth of 40 m, every 60 m. Each shot is recorded by 350 receivers located 40 m below the surface. The receivers are distributed uniformly at a 20 m interval.

As mentioned previously, pretraining is a key point for inversion under the DNN framework. At this stage, the initial model is used as the learning target for our network. The input tensor a in Figure 1 is fixed as [−0.5, 1, 0.5]. Note that the elements of the tensor a can be any real numbers in the interval [−1, 1], but the same values must be used for the pretraining and inversion stages to maintain consistency. The pretraining is carried out on a GPU, and the Adam optimization method, with a learning rate of 0.001, is used to accelerate convergence. After 30,000 iterations, the output model of the pretraining network is an accurate approximation of the initial model. The traveltimes through the initial and output models are calculated. The root-mean-square (rms) error of the traveltimes is approximately 0.94 ms. Note that, because the output of the Tanh layer is within [−1, 1], the smooth model should be rescaled so that each element lies in the same interval.

In total, 20 frequencies are used to calculate the synthetic data sets using an impulsive source wavelet. The frequencies are obtained by discretizing from 3.00 to 12.00 Hz with a constant sampling interval of 0.45 Hz. The DNN-FWI method is conducted from low to high frequencies; that is, we invert the monochromatic data, and after 50 iterations the inversion moves to the next frequency. This inversion strategy can mitigate the nonlinearity and ill-posedness of this inverse problem and has been widely applied in conventional FWI (Bunks et al., 1995; Sirgue and Pratt, 2004). The learning rate is set to 0.001 for this inversion stage. After 1000 iterations, the final output model of the network is shown in Figure 3a, and the vertical profiles at horizontal positions of 3.0, 5.0, and 6.0 km are illustrated in Figure 4.

To further improve the resolution of the reconstructed model, we undertake a new inversion (called stage II) under the framework of the conventional FWI, with the reconstruction of the DNN-FWI as the initial model. To accelerate this stage, an efficient truncated trust-region method is used (Wang, 2003). In contrast to the DNN-FWI procedure, in stage II, only frequencies of 3.00, 4.35, 5.70, 7.05, 8.40, 9.75, and 12.00 Hz are selected, and the inversion is again conducted from low to high frequencies. The maximum number of iterations for each frequency is 20. Figure 3b shows the final reconstruction results, and vertical profiles at horizontal positions of 3.0, 5.0, and 6.0 km are illustrated in Figure 4. Throughout the numerical experiments, we denote the model reconstructed using DNN-FWI and the model reconstructed using the conventional FWI in stage II as the DNN model and the DNN + FWI model, respectively.

Figures 3 and 4 show that DNN-FWI can successfully and accurately reconstruct the complex Marmousi2 model. In addition, the vertical profiles of the reconstructed model match well with the true model, which indicates that the DNN-FWI method is effective and can provide high-resolution models. However, Figure 4 shows that the improvement in stage II is limited and the results are degraded with the increase in depth, as indicated by the blue ellipses. This degradation may be attributed to the multiple minima of the misfit function and the lack of illumination.

The 2004 BP model

Another benchmark for the velocity estimation problem is the 2004 BP model (Billette and Brandsberg-Dahl, 2005). The main challenge for FWI with this model is the high velocity contrast between the sediments and the salt bodies. In our numerical experiment, the original velocity model is resampled with a symmetric box-like K-domain filter provided by the Seismic Unix toolbox (Stockwell and Cohen, 2008). The modified model, with a grid size of 384 × 128 and a spatial sampling step of Δx = Δz = 25 m, is shown in Figure 5a. The initial model is again obtained by smoothing the true model using a Gaussian filter with a standard deviation equal to 200 m, and it is shown in Figure 5b. The acquisition is composed of 116 impulsive sources located 25 m below the surface every 75 m. Each shot is recorded by 350 receivers located 50 m below the surface every 25 m.

During the pretraining stage, the input tensor a in Figure 1b for the DNN is again set as [−0.5, 1, 0.5] and the Adam optimization solver is set with a learning rate equal to 0.001. After 30,000 iterations, the rms error of the traveltimes is approximately 0.96 ms, indicating that the DNNs can parameterize the initial model reasonably well.

In the DNN-FWI stage, we first simulate 10 synthetic data sets with 10 frequencies: 3.00, 3.45, 3.90, 4.35, 4.80, 5.25, 5.70, 6.15, 6.60, and 7.05 Hz. The DNN-FWI method is applied from low to high frequencies. The maximum number of iterations for each frequency is fixed to 50. Figure 6a shows the final output model of our learned network after 500 iterations. We also present vertical profiles at horizontal positions of 3.57, 6.25, and 7.50 km in Figure 7. Figures 6 and 7 show that the DNN-FWI method successfully reconstructed high-contrast salt structures, in terms of magnitude and shape.

Similar to the previous example, we perform a second inversion (stage II) by applying conventional FWI to further improve the resolution. Frequencies at 3.00, 4.35, 5.70, 7.05, 8.40, 9.75, and 11.55 Hz are used for this inversion, and the maximum number of iterations for each frequency is set to 20. Figures 6b and 7 illustrate the reconstructed model and vertical profiles using the trust region method. The inversion results are slightly better than those of the DNN-FWI inversion in terms of magnitude.

A metal ore model

To further investigate the performance of DNN-FWI, in terms of feature-capturing and high-resolution capability, we use a metal ore model built from the geologic knowledge acquired from well-logging information in South China. The velocity ranges from 5000 to 7500 m/s. The model, with 512 × 256 cells and a spatial sampling of Δx = Δz = 10 m, is shown in Figure 8a. The small-scale, high-velocity structures of this model challenge the inversion method to provide a high-resolution result. With steps similar to those used in the previous examples, we smooth the true model for pretraining using a Gaussian filter with a standard deviation equal to 80 m (see Figure 8b). The acquisition is composed of 119 sources uniformly located 50 m below the surface every 40 m. Each shot is recorded by 472 receivers located 50 m below the surface every 10 m.

For pretraining, the input tensor a is [−0.50, 0.50, −0.50, 0.50; −0.50, 0.50, −0.50, 0.50]. As with previous experiments, elements of the tensor a can be any real number, as long as they are within [−1, 1], and are fixed during the pretraining and DNN-FWI stages. The maximum number of iterations for the pretraining is set to 30,000, and the optimization solver is set with a learning rate equal to 0.001. The rms error of the traveltimes is approximately 0.14 ms, which demonstrates that the network represents the initial model accurately.

For the DNN-FWI inversion stage, 50 discretized frequencies are used to compute the synthetic data sets. These frequencies are uniformly sampled between 3.00 and 45.00 Hz with a constant sampling step of 0.86 Hz. In this stage, the training of the DNN-FWI is conducted with a hierarchical multiscale strategy from 3.00 to 45.00 Hz. The maximum number of iterations for each frequency is set to 30. To further improve the resolution, a conventional FWI inversion is conducted again (stage II) with six frequencies: 3.00, 11.57, 20.14, 28.71, 37.28, and 45.00 Hz. The maximum number of iterations for each frequency is fixed to 15. Figures 9a and 9b show the reconstructed results for the DNN-FWI method after 1500 iterations and for the conventional FWI method after 90 iterations in stage II, respectively. Figure 10 illustrates the vertical profiles (x = 1.20, 2.80, and 3.80 km). Figures 9 and 10 demonstrate that the DNN-FWI method can accurately reconstruct the high-contrast model. They also indicate that DNN-FWI possesses the regularization property of preserving sharp structures. From Figures 9 and 10, we observe that the improvement in stage II is also limited and some outliers appear in the deep part. These outliers may be attributed to the multiple minima of the misfit function and the lack of illumination.

Comparisons with the TV-regularized conventional FWI method

To demonstrate the benefits of the DNN-FWI method compared to a conventional FWI method, an FWI scheme equipped with the well-known L-BFGS solver is selected (Hager and Zhang, 2013). To preserve the boundary information and mitigate the multiple local minima issue of the FWI problem, TV regularization is applied to the FWI objective function (Guitton, 2012):
$f_{\mathrm{reg}}(m) = f(m) + \lambda \left\| m \right\|_{\mathrm{TV}},$
(13)
where λ is a regularization parameter that controls the trade-off between the data fitting term f(m) and the TV-regularized term $\|m\|_{\mathrm{TV}}$. The TV-regularized term is defined as
$\left\| m \right\|_{\mathrm{TV}} = \sum_{i,j} \psi_{i,j}(m),$
(14)
$\psi_{i,j}(m) = \sqrt{\left( m^{x}_{i,j} \right)^{2} + \left( m^{z}_{i,j} \right)^{2} + \beta}, \qquad \beta > 0,$
(15)
$m^{x}_{i,j} \equiv \frac{m_{i+1,j} - m_{i,j}}{\Delta x}, \qquad m^{z}_{i,j} \equiv \frac{m_{i,j+1} - m_{i,j}}{\Delta z}.$
(16)
Note that in the discretization formulation of the TV term, we omit a factor of $\Delta x \Delta z$ from the right side of equation 14 because it can be absorbed in the regularization parameter λ. A small positive real number β is introduced to overcome the nondifferentiability at the origin, and it is set to $1.0 \times 10^{-4}$ in our implementation. We refer to the TV-regularized FWI method as TV-FWI. The derivative of the TV term $\|m\|_{\mathrm{TV}}$ with respect to the parameter $m_{i,j}$ can be approximated by
$\frac{\partial \left\| m \right\|_{\mathrm{TV}}}{\partial m_{i,j}} = \frac{1}{\Delta x} \left( \frac{m^{x}_{i-1,j}}{\psi_{i-1,j}(m)} - \frac{m^{x}_{i,j}}{\psi_{i,j}(m)} \right) + \frac{1}{\Delta z} \left( \frac{m^{z}_{i,j-1}}{\psi_{i,j-1}(m)} - \frac{m^{z}_{i,j}}{\psi_{i,j}(m)} \right).$
(17)
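For completeness, equations 14–17 translate directly into a few lines of NumPy. The sketch below is a minimal implementation of the smoothed TV term and its gradient under one simple boundary convention (forward differences set to zero on the last row and column); it is illustrative and not the production code used for the comparison.

```python
import numpy as np

def tv_value_and_grad(m, dx, dz, beta=1.0e-4):
    """Smoothed TV term (equations 14-16) and its gradient (equation 17).

    m is indexed as m[i, j], with i along x and j along z.  Forward differences
    are set to zero on the last row/column, which also makes the wrap-around
    terms of np.roll vanish at the boundaries (one simple boundary choice)."""
    mx = (np.roll(m, -1, axis=0) - m) / dx          # m^x_{i,j}
    mz = (np.roll(m, -1, axis=1) - m) / dz          # m^z_{i,j}
    mx[-1, :] = 0.0
    mz[:, -1] = 0.0

    psi = np.sqrt(mx**2 + mz**2 + beta)             # psi_{i,j}(m)
    tv = psi.sum()                                  # ||m||_TV (equation 14)

    gx, gz = mx / psi, mz / psi
    # Equation 17: (gx_{i-1,j} - gx_{i,j})/dx + (gz_{i,j-1} - gz_{i,j})/dz.
    grad = (np.roll(gx, 1, axis=0) - gx) / dx + (np.roll(gz, 1, axis=1) - gz) / dz
    return tv, grad
```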
There is another effective choice for TV regularization that imposes the TV term as a constraint. We refer to Peters and Herrmann (2017) and Esser et al. (2018) for details on the TV-constrained regularization method.
The peak signal-to-noise ratio (PS/N) index is used to quantify the inversion results for different methods. The higher the PS/N, the better the quality of the reconstructed result. The PS/N is defined as
$\mathrm{PS/N}(m^{t}, m^{r}) = 20.0 \, \log_{10} \left( \frac{\mathrm{MAX}(m^{t})}{\sqrt{\mathrm{MSE}(m^{t}, m^{r})}} \right),$
(18)
where
$\mathrm{MSE}(m^{t}, m^{r}) = \frac{1}{NM} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left( m^{t}_{i,j} - m^{r}_{i,j} \right)^{2},$
(19)
$\mathrm{MAX}(m^{t}) = \max_{i,j} \, m^{t}_{i,j}.$
(20)
Here, $m^{t}$ denotes the true model and $m^{r}$ is the inversion result (Yoo and Ahn, 2012).
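In NumPy, the PS/N of equations 18–20 reduces to a few lines; the sketch below follows the standard peak signal-to-noise definition, with the square root of the MSE in the denominator as reconstructed in equation 18.

```python
import numpy as np

def psnr(m_true, m_rec):
    """PS/N of equations 18-20 (in dB): 20 log10(MAX(m^t) / sqrt(MSE))."""
    mse = np.mean((m_true - m_rec) ** 2)
    return 20.0 * np.log10(m_true.max() / np.sqrt(mse))
```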

We independently carry out DNN-FWI and TV-FWI to reconstruct the metal ore model. Here, the initial model is much smoother than in the previous experiment, as shown in Figure 11. Figures 12a and 12b illustrate the shot records generated by the true and initial models, respectively. These records are computed using a Ricker wavelet with a peak frequency of 25.00 Hz at z = 10.00 m. Figure 12 shows that it may be difficult to reconstruct this model from this initial model without low-frequency information or a proper regularization strategy. For the DNN-FWI method, the maximum number of iterations for pretraining is also set to 30,000. The reconstructed results, after 1500 iterations, are illustrated in Figures 13a and 13b for the DNN-FWI and TV-FWI methods, respectively. The regularization parameter λ in equation 13 is set to 2.0 × 10³. Table 1 presents the computational performance of these methods. The total computational time is 48,439 s for the DNN-FWI and 78,124 s for the L-BFGS method. The vertical profiles for x = 1.20, 2.80, and 3.80 km are shown in Figure 14. Figures 13 and 14 demonstrate that the DNN-FWI method possesses better performance than the TV-FWI method. The PS/N is 31.20 for the DNN-FWI method and 30.19 for the L-BFGS method. We believe that the better results for DNN-FWI may be attributed to the implicit regularization and feature-capturing properties of the convolution layers of the neural network. The DNN-FWI method exhibits good computational efficiency owing to the hierarchical structure of the algorithm. More computation time is required for the L-BFGS method because more line-search steps are needed for some iterations, resulting in more function and gradient estimations. Specifically, 6408 matrix factorizations are conducted for L-BFGS, whereas only 3000 are used for DNN-FWI. In addition, from Figure 13, the geometric shape is correctly recovered in the deeper part of the reconstructed result of DNN-FWI, but the magnitude is larger than in the true model. We think that these anomalies result from the multiple local minima of the misfit function. The local minimum problem can be mitigated by using a proper regularization or by providing a good initial model.

DISCUSSION

As illustrated by our numerical tests, our DNN can properly parameterize different velocity models. However, some challenges require further investigation, including the design of the network architecture, the choice of the optimization algorithm, and the dependency on the initial model.

Challenges for network architecture and input

A crucial consideration for designing a network for a specific problem is its architecture, that is, how many units it should have and how these units should connect to each other. For chain-based architectures, an empirical rule is that a deeper network, with few units per layer, is often preferred to a shallower network with many units. In other words, a deeper network can greatly reduce the number of units compared with a shallow network, while still representing the desired function (Goodfellow et al., 2016). The main result for the representation capacity of a given network is estimated by the number of linear regions carved out by a deep rectifier network (Montúfar et al., 2014)
$O\left( \binom{n}{d}^{d(l-1)} n^{d} \right),$
(21)
where d is the dimension of the inputs, l is the depth of the network, and n denotes the number of units per hidden layer. Equation 21 indicates that the representation capability grows exponentially with the depth l. However, determining the optimal depth of a network is difficult.

Another challenge is the connection between layers. The choice of basic layers is also a challenge for a specific problem. For our tests, the network of DNN-FWI with a simple chain of layers works well. We believe that other architectures widely applied in computer vision could provide flexible choices for the DNN-FWI method. In addition to these issues, the size of the input to the network should be further studied.

Challenges for the optimization methods

The universal approximation theorem guarantees that a network representing a given Borel measurable function to any degree of accuracy must exist. However, even if the network can represent the model, we may fail to obtain a reasonable specific network because the optimization algorithm may not be able to find weight values that properly parameterize the desired model. Similar to conventional deep learning tasks, DNN-FWI with the Adam solver performs well for the FWI problem in terms of computational efficiency and stability. Unlike in conventional learning tasks, applying the conjugate-gradient (CG) method or L-BFGS may further accelerate DNN-FWI; CG and L-BFGS are seldom used for conventional learning tasks because of the specific structure of those tasks. Furthermore, a proper regularization may improve the numerical performance of DNN-FWI, which will be further discussed in our future research.

Challenges for the initial model

The DNN-FWI still depends on a given initial model, owing to the intrinsically limited capacity of the l2-norm metric and the local convergence of the optimization algorithm. However, because DNN-FWI serves as a regularization method, it can alleviate this dependency owing to the strong a priori information carried by the network structure. In any case, understanding how the network architecture affects the dependency on the initial model should be further studied. In addition, applying the Wasserstein metric, which measures the distance between probability density distributions on a given space, to the DNN-FWI method may be another interesting research avenue (Métivier et al., 2016; Martin et al., 2017; Yang et al., 2018).

CONCLUSION

We have considered a general inversion method, under the DNN framework, for solving ill-posed FWI problems. DNN-FWI can be regarded as a reparameterization method and has several advantages compared to conventional FWI methods. One advantage is that the representation of the model with piecewise polynomials can be considered as a priori information to guide the inversion method to select the solution in a desired function space. Thus, it is an automatic regularization method suitable for tackling ill-posed problems. The DNN-FWI also has good computational efficiency because it only involves the computation of first-order derivatives. Furthermore, DNN-FWI can easily be implemented with existing deep learning toolboxes, including PyTorch and TensorFlow.

We build a DNN containing 32 layers to evaluate the effectiveness of DNN-FWI. The network mainly consists of convolution and deconvolution operations. With this learning model, synthetic tests, based on the Marmousi2 model, 2004 BP model, and metal ore model, are used to demonstrate its feasibility and efficiency. Our numerical results indicate that DNN-FWI is efficient and can invert complex subsurface physical parameters with relatively high accuracy. Therefore, we conclude that DNN-FWI may be competitive in solving large-scale inverse problems in other science and engineering fields.

ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China, under grants nos. 11801111 and 91630202, the National Key R&D Program of the Ministry of Science and Technology of China with the Project Integration Platform Construction for Joint Inversion and Interpretation of Integrated Geophysics (grant no. 2018YFC0603500), and the China Postdoctoral Science Foundation (grant no. 2019M650831). This work is also funded by Guizhou Science and Technology Plan Project [2019]1122 and Guizhou Science and Technology Platform talents [2018]5781. We would like to thank the editors J. Shragge, A. Guitton, and A. Baumstein and the anonymous reviewers for their helpful comments and for the time they have spent on reviewing our manuscript. These comments helped to greatly improve the quality of this article. We thank the developers of PyTorch for providing a high-performance machine learning framework to implement our method (https://pytorch.org/).

DATA AND MATERIALS AVAILABILITY

Data associated with this research are available and can be obtained by contacting the corresponding author.
