## Abstract

Seismic wave propagation forms the basis for most aspects of seismological research, yet solving the wave equation is a major computational burden that inhibits the progress of research. This is exacerbated by the fact that new simulations must be performed whenever the velocity structure or source location is perturbed. Here, we explore a prototype framework for learning general solutions using a recently developed machine learning paradigm called the neural operator. A trained neural operator can compute a solution in negligible time for any velocity structure or source location. We develop a scheme to train neural operators on an ensemble of simulations performed with random velocity models and source locations. Because neural operators are grid free, it is possible to evaluate solutions on velocity models of higher resolution than those used in training, providing additional computational efficiency. We illustrate the method with the 2D acoustic wave equation and demonstrate its applicability to seismic tomography, using reverse‐mode automatic differentiation to compute gradients of the wavefield with respect to the velocity structure. The developed procedure is nearly an order of magnitude faster than conventional numerical methods for full waveform inversion.

## Introduction

The simulation of seismic wave propagation through Earth’s interior underlies most aspects of seismological research, from the simulation of strong ground shaking due to large earthquakes (Graves and Pitarka, 2016; Rodgers *et al.*, 2019), to imaging the subsurface velocity structure (Fichtner *et al.*, 2009; Tape *et al.*, 2009; Virieux and Operto, 2009; Lee *et al.*, 2014; Gebraad *et al.*, 2020), to the derivation of earthquake source properties (Duputel *et al.*, 2015; Ye *et al.*, 2016; Wang and Zhan, 2020). The compute costs associated with these wavefield simulations are substantial, and for reasons of computational efficiency, 1D models are often used even when better 3D velocity models are available. As a result, seismic wave simulations are often the limiting factor in the pace of geophysical research.

Recently, deep learning approaches have been explored with the goal of solving various geophysical partial differential equations (Moseley, Markham, and Nissen‐Meyer, 2020; Moseley, Nissen‐Meyer, and Markham, 2020; Smith *et al.*, 2020; Moseley *et al.*, 2021). Beyond accelerating computation, such physics‐informed neural networks may offer other advantages, such as grid independence, low memory overhead, differentiability, and on‐demand solutions. These properties make deep learning well suited to solving geophysical inverse problems (Zhu *et al.*, 2020; Smith *et al.*, 2021; Xiao *et al.*, 2021; Zhang and Gao, 2021), because a wider selection of algorithms and frameworks becomes available, such as approximate Bayesian inference techniques like variational inference.

One of the major challenges associated with wave propagation is that a new simulation must be performed whenever the properties of the source or velocity structure are perturbed in any way. This alone substantially increases the necessary compute costs, making some problems prohibitively expensive even when they are mathematically or physically tractable. For the most part, these limitations have been accepted as an inevitable part of seismology, but physics‐informed machine learning approaches have started to offer pathways around this issue. For example, Smith *et al.* (2020) use a deep neural network to solve the eikonal equation for any source–receiver pair by taking these locations as input. This can then be exploited for hypocenter inversion by computing gradients of the travel‐time field with respect to the source location (Smith *et al.*, 2021). However, these models are relatively inefficient to train and, even then, learn only approximate solution operators to the differential equations.

These limitations may seem surprising, but they result from a basic attribute of neural networks that makes them ill‐suited for solving differential equations: neural networks are designed to learn maps between two finite‐dimensional spaces, whereas learning a general solution operator for a differential equation requires the ability to map between two infinite‐dimensional spaces (i.e., function spaces). A paradigm for learning maps between function spaces was recently developed (Li *et al.*, 2020a,b, 2021) and has been termed the neural operator. The general idea behind these models is that their parameters are shared over all possible functions describing the initial conditions, which allows them to operate on functions even when the inputs are numerically discretized representations of them.

Here, we explore the potential of neural operators in improving seismic wave propagation and inversion. We develop a prototype framework for training neural operators on the 2D acoustic wave equation and show that this approach provides a suite of tantalizing new advantages over conventional numerical methods for seismic wave propagation. This study provides a proof of concept of this technology and its application to seismology.

## Preliminaries

Given an input function $A$ and a Green’s function $G$, let $U$ denote the solution to a linear partial differential equation (PDE); that is, the solution operator is

$$U(x)=\int G(x,y)\,A(y)\,dy,\tag{1}$$

in which $A$ could describe the velocity structure as well as the initial conditions. Neural operator generalizes this formulation to the nonlinear setting, in which a set of linear operators is sequentially applied to construct a general nonlinear solution operator. In its basic form, an $\ell$‐layered neural operator is constructed as follows: the input $A$ is lifted to a hidden representation $V_1=P(A)$, passed through $\ell$ layers of the form

$$V_{i+1}(x)=\sigma\big(W_i\,V_i(x)+(\mathcal{K}_i V_i)(x)\big),\tag{2}$$

and projected to the output $U=Q(V_{\ell+1})$, in which $\sigma$ is a nonlinear activation function, the $W_i$ are pointwise linear transformations, and the $\mathcal{K}_i$ are integral operators. For each hidden representation $V$, we have,

$$(\mathcal{K}_i V)(x)=\int K_i(x,y)\,V(y)\,dy,\tag{3}$$

in which the kernels $K_i$ are learned from training data and shared across all input functions $A$. In a limited sense, neural operators can be viewed as generalized Green’s functions.

## Methods

We designed a framework that applies neural operators to the 2D acoustic wave equation. The basic idea is outlined schematically in Figure 1. A specific type of neural operator called a Fourier neural operator (FNO; Li *et al.*, 2021) receives a velocity model specified on an arbitrary, possibly irregular mesh, along with the coordinates of a point source. One of the main features of the FNO is that the major calculations are performed in the frequency domain, which allows the convolutions in equation (3) to be computed rapidly. The output of the FNO is the complete wavefield solution, which can be queried anywhere within the medium, regardless of whether the query points lie on the input mesh.

The most basic component of the FNO is a Fourier block (Fig. 1), which first transforms an input function (*V*) to the Fourier domain. In the first layer of the network, *V* is equal to the initial conditions *A*. Then, a kernel ($K_i$) is computed specifically for this function and truncated at low order, before the integration is performed via multiplication. Finally, the result is transformed back and a nonlinear activation function is applied, which concludes the Fourier block. For this study, the FNO architecture contains four sequential Fourier blocks and applies a Rectified Linear Unit (ReLU) activation to the output of each (Fig. 1). We note that the truncation of the Fourier modes is performed on the function values after lifting them to a higher‐dimensional space, rather than on the raw input function, so the truncation does not compress the input. We refer interested readers to Li *et al.* (2021) for more mathematical details about the FNO.
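To make the Fourier block concrete, the following is a minimal NumPy sketch of the operations described above: FFT, truncation to low-order modes, spectral multiplication by a learned weight tensor, inverse FFT, a pointwise linear term, and a ReLU. The shapes, the mode cutoff `k_max`, and the weight tensors `R` and `W` are illustrative assumptions, not the trained model's actual configuration.

```python
import numpy as np

def fourier_layer(v, R, W, k_max=12):
    """One Fourier block: FFT -> truncate modes -> multiply by spectral
    weights R -> inverse FFT -> add pointwise linear term -> ReLU.

    v: (n, n, c) real-valued hidden representation on an n x n grid.
    R: (k_max, k_max, c, c) complex spectral weights (learned in practice).
    W: (c, c) pointwise linear weights (learned in practice).
    """
    v_hat = np.fft.fft2(v, axes=(0, 1))            # to the Fourier domain
    out_hat = np.zeros_like(v_hat)
    # keep only the lowest k_max modes and apply the spectral weights
    out_hat[:k_max, :k_max] = np.einsum(
        "xyij,xyj->xyi", R, v_hat[:k_max, :k_max]
    )
    spectral = np.fft.ifft2(out_hat, axes=(0, 1)).real
    pointwise = v @ W                              # the W v(x) term
    return np.maximum(spectral + pointwise, 0.0)   # ReLU activation
```

In the full model, four such blocks are stacked after lifting the input to a higher-dimensional channel space, as described in the text.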

We constructed a training dataset of simulations by first generating random velocity models. We set up a 64 × 64 grid with 0.16 km spacing to define the velocity model. Then, we created 5000 random velocity models by sampling random fields having a von Karman covariance function with the following parameters: Hurst exponent $\kappa=0.5$, correlation lengths $a=[a_x,a_y]=[1.6\ \mathrm{km},1.6\ \mathrm{km}]$, $\epsilon=0.1$, $\mu=3\ \mathrm{km/s}$, and $\sigma=0.15\ \mathrm{km/s}$. Then, for each of these velocity models, we apply a Ricker wavelet source at a random point and solve the acoustic wave equation using a spectral element method (SEM; Afanasiev *et al.*, 2019). Note that the sources are restricted to a grid, which is a requirement of the SEM. Because Gaussian random fields (GRFs) can represent arbitrary continuous functions, the purpose of these steps is to create a suite of simulations that span a wide range of possible conditions. We show later that they can even approximate strongly discontinuous velocity models well. An example velocity model and simulation is shown in Figure 1. Applying this procedure results in a training dataset of 5000 samples, each of which is a different simulation.
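As an illustration of the velocity-model generation step, the sketch below samples an approximate von Karman random field by spectral filtering of white noise, standardized to the stated mean and standard deviation. The exact power-spectrum normalization used in the published setup (and the role of the $\epsilon$ parameter) are assumptions here, not taken from the article.

```python
import numpy as np

def von_karman_field(n=64, dx=0.16, a=1.6, kappa=0.5,
                     mu=3.0, sigma=0.15, seed=0):
    """Sample a random velocity model (km/s) with an approximate
    von Karman covariance by filtering white noise in the wavenumber
    domain, then rescaling to mean `mu` and standard deviation `sigma`."""
    rng = np.random.default_rng(seed)
    k1d = 2 * np.pi * np.fft.fftfreq(n, d=dx)
    kx, ky = np.meshgrid(k1d, k1d, indexing="ij")
    k2 = kx**2 + ky**2
    # 2D von Karman power spectrum ~ (1 + k^2 a^2)^-(kappa + E/2), E = 2
    psd = (1.0 + k2 * a**2) ** -(kappa + 1.0)
    noise = np.fft.fft2(rng.standard_normal((n, n)))
    field = np.fft.ifft2(noise * np.sqrt(psd)).real
    field = (field - field.mean()) / field.std()   # standardize
    return mu + sigma * field                      # e.g., 3 +/- 0.15 km/s
```

Each training sample would pair one such model with a SEM simulation of the resulting wavefield.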

Given the simulation dataset, we can proceed to train an FNO model in a supervised manner, in which the goal is to learn a model that reliably outputs a solution to the wave equation for arbitrary input conditions. The training is performed using batch gradient descent, in which the parameters of the FNO are updated to minimize a mean‐square error loss against the “true” spectral element solutions. We use a batch size of 30 simulations and train the model for a maximum of 300 epochs. We use all but 200 of the simulations for training and set aside the remainder for cross validation of the model. The time required to train the model from scratch is approximately 18 hr using six NVIDIA Tesla V100 graphical processing units (GPUs).
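The supervised training loop can be sketched as follows, with a single linear layer standing in for the FNO so that the example stays self-contained; the real model was trained in a deep learning framework on GPUs. The loop structure (shuffle, minibatch, mean-square-error gradient step) mirrors the procedure in the text, but the model and hyperparameter values here are illustrative stand-ins.

```python
import numpy as np

def train_mse(X, Y, epochs=300, batch=30, lr=1e-2, seed=0):
    """Minibatch gradient descent on a mean-square-error loss.

    X: (n_samples, d_in) inputs; Y: (n_samples, d_out) targets.
    A single linear layer W maps inputs to predictions; the FNO would
    take this layer's place in the actual training procedure.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle each epoch
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            pred = X[idx] @ W
            grad = 2.0 * X[idx].T @ (pred - Y[idx]) / len(idx)  # d(MSE)/dW
            W -= lr * grad                         # gradient step
    return W
```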

## Experiments

### Initial wavefield demonstration

Figure 2 shows two example wavefields corresponding to two different velocity models, each with a different source. The spectral element solution is shown alongside the wavefield predicted by the FNO for the eight different receivers (blue triangles). For these examples, the input velocity model is 64 × 64. The relative $\ell_2$ losses of the FNO wavefields are 0.180 and 0.363. These examples are representative of the entire validation dataset, which has a loss of 0.273 relative to the spectral element solutions.
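Assuming the standard definition (the ratio of the residual norm to the reference norm; the article does not spell out its normalization), the relative loss quoted above can be computed as:

```python
import numpy as np

def relative_l2(pred, true):
    """Relative L2 misfit ||pred - true||_2 / ||true||_2 between a
    predicted wavefield and a reference (spectral element) wavefield."""
    return np.linalg.norm(pred - true) / np.linalg.norm(true)
```

A prediction that is uniformly 10% too large, for instance, yields a relative loss of 0.1 under this definition.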

### The number of simulations needed for training

Once fully trained, the FNO can evaluate a new solution in a fraction of a second; thus, training accounts for the vast majority of the needed compute time. A primary concern about the computational demands of the FNO approach is, therefore, the number of simulations needed for training. Here, we examine how the number of training simulations influences the accuracy of the solution. We create a series of tests in which the number of training simulations is varied from 1200 at the fewest to 4800 at the most. The results are shown in Figure 3, in which we show the FNO wavefield predictions for each dataset. Even with 1200 training samples, there is no indication of overfitting, because the training waveform error is similar across the different models (Fig. 3a,b). Training with just 1200 simulations is enough to predict the major arrival; increasing the number of training samples improves the fit to the reflections (e.g., 3.2 s in Fig. 3c).

### Generalization to arbitrary velocity models

The FNO was trained only on velocity models drawn from Gaussian random fields; although this family of functions is broad, it does not include some types of functions that exist in the Earth, such as discontinuous functions. This raises the question of whether the FNO can still generalize to these cases. Figure 4a–c shows an example of a predicted wavefield for a velocity model containing a constant‐velocity square embedded within a homogeneous medium. Although the velocity model itself is rather simple, it is far removed from the characteristics of the random fields that the FNO was trained on and represents a challenging example. The predicted wavefield nonetheless closely approximates the ground truth. We believe that the small residual errors can be reduced with better hyperparameter selection.

### Generalization to higher resolution grids

The FNO can be viewed, in some sense, as a method for learning generalized Green’s functions valid for arbitrary boundary conditions. Because it intrinsically learns a mapping between function spaces, the FNO is theoretically independent of the resolution at which the functions are discretized (discretization is only a requirement for evaluation on a computer). One important consequence is that the FNO can be trained on velocity models with a certain grid spacing and then be evaluated on velocity models with a different grid spacing at inference time. This is not simply interpolating the wavefield after solving the PDE; rather, the solution to the PDE is actually evaluated on a higher resolution velocity model with negligible extra compute cost. To demonstrate this, Figure 4d–f shows the FNO prediction for a random velocity field with 2× higher resolution (128 × 128) than the models used during training, alongside the spectral element solution. The FNO solution closely approximates the spectral element solution. The velocity models with different meshes have the same roughness as the training dataset; resolving rougher structure with denser spacing could be achieved by training with many more GRFs with varying correlation length scales and variances.
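The resolution independence rests on the FNO retaining only low-order Fourier modes: a band-limited field can be evaluated exactly on a finer grid by zero-padding its spectrum. The following is a minimal sketch of that mechanism (the interpolation identity, not the FNO itself):

```python
import numpy as np

def spectral_upsample(u, n_out):
    """Evaluate a band-limited n x n field on a finer n_out x n_out grid
    by zero-padding its (centered) Fourier spectrum. This is the reason a
    model that keeps only low-order modes can be queried at resolutions
    finer than it was trained on."""
    n = u.shape[0]
    u_hat = np.fft.fftshift(np.fft.fft2(u))        # center the spectrum
    pad = (n_out - n) // 2
    big = np.zeros((n_out, n_out), dtype=complex)
    big[pad:pad + n, pad:pad + n] = u_hat          # embed low modes
    # rescale so amplitudes survive the larger inverse FFT normalization
    return np.fft.ifft2(np.fft.ifftshift(big)).real * (n_out / n) ** 2
```

For a field containing only low-wavenumber content, the upsampled result matches the field sampled directly on the fine grid.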

### Full waveform inversion with neural operators

One of the most useful applications of wavefield simulations is in inversion, to image the Earth’s interior. The adjoint‐state method is an approach to efficiently compute the gradients of an objective function with respect to parameters of interest and can be used for seismic tomography (e.g., Tape *et al.*, 2009; Gebraad *et al.*, 2020). Neural operators are differentiable by design, which enables gradient computation with reverse‐mode automatic differentiation. Automatic differentiation has been shown to be mathematically equivalent to the adjoint‐state method (Zhu *et al.*, 2021). This allows for the gradients of the wavefield to be determined with respect to the inputs (velocity model and source location).

Figure 5 demonstrates our full waveform inversion (FWI) performance. For each case, we compute synthetic observations using the source distribution as shown (red circles), taking every point in the 64 × 64 grid as a receiver. The observations are noise‐free for this experiment. Then, we perform tomography by starting with a homogeneous initial velocity model and forward propagating a wavefield with the FNO for each source. We calculate the loss $L=\sum_i\sum_j\left[u_{\mathrm{obs}}(x_i,x_j)-u_{\mathrm{pred}}(x_i,x_j)\right]^2$ and compute $\nabla L$ with automatic differentiation. The velocity model is then iteratively updated with gradient descent for 1000 iterations using the Adam optimizer (Kingma and Ba, 2014) and a learning rate of 0.01. For comparison, Figure 5a,b shows the imaging result using the SEM and adjoint‐state method, with a relative $\ell_2$ misfit between the inverted and true velocity models of 0.0289. Figure 5c,d shows the result for the same velocity structure using the FNO and automatic differentiation, with a misfit of 0.0319. Figure 5e,f demonstrates recovery of sharp, discontinuous changes with a short wavelength. The results demonstrate the remarkable capability of the FNO to learn a general solution operator.
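The inversion loop described above can be sketched as follows: a NumPy implementation of the Adam update driving gradient descent on a data misfit, with the analytic gradient of a toy linear forward model standing in for automatic differentiation through the FNO. The function names and the toy forward model are illustrative assumptions, not the article's implementation.

```python
import numpy as np

def adam_invert(grad_fn, m0, steps=1000, lr=0.01):
    """Iteratively update a model vector m with the Adam optimizer to
    reduce a misfit whose gradient dL/dm is supplied by grad_fn
    (autodiff through the FNO in the real pipeline; analytic here)."""
    m = m0.copy()
    mom, vel = np.zeros_like(m), np.zeros_like(m)
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        g = grad_fn(m)
        mom = b1 * mom + (1 - b1) * g              # first-moment estimate
        vel = b2 * vel + (1 - b2) * g**2           # second-moment estimate
        m_hat = mom / (1 - b1**t)                  # bias correction
        v_hat = vel / (1 - b2**t)
        m -= lr * m_hat / (np.sqrt(v_hat) + eps)   # Adam update
    return m
```

With a least-squares misfit $L=\|Gm-d\|^2$, the gradient passed in is $2G^{\mathsf T}(Gm-d)$; in the FNO setting, $G$ is replaced by the learned wave-equation solver and the gradient comes from reverse-mode automatic differentiation.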

We note that our FWI approach requires neither an adjoint wavefield to be computed nor a cross correlation; the gradients can be rapidly computed on GPUs using automatic differentiation. The rapid simulation makes it substantially more efficient than adjoint methods. For these experiments, 20 sources take ∼1 s for one tomographic iteration, including the cost of computing the forward model, whereas the spectral element method with adjoint methods takes ∼100 s per iteration. These timings were obtained using only a single NVIDIA Tesla V100 GPU.

## Discussion

This study presents a prototype framework for applying neural operators to the 2D acoustic wave equation. We anticipate that the general framework would also be suitable for the 3D elastic wave equation with relatively little modification. Indeed, the FNO method has been applied successfully to the Navier–Stokes equations (Li *et al.*, 2021), which can be more challenging to solve than the elastic wave equation. In our tests, we found that only a few thousand simulations were needed to train an FNO model, and, once trained, the model requires negligible time to compute a new solution. Because the FNO can be trained on lower resolution simulations and then generalize to higher resolution solutions, the computations are substantially faster than using traditional numerical methods at the full resolution.

One of the limitations of the approach is that the solutions are approximate, as seen in several of the figures. However, because this is a learning‐based approach, the performance can be improved in the future with a better model architecture, thorough hyperparameter tuning, a larger training dataset, a more appropriate objective function, and various other factors. In addition, new developments in this area of machine learning can be incorporated as they emerge. Thus, these performance metrics should be viewed only as a starting point. For some applications, the error may be too large, and traditional numerical methods may be preferable; for many other situations in geophysics, however, a reasonably accurate solution may be acceptable.

Among the most exciting benefits of our approach is that, by training the FNO on random velocity models, the FNO is able to produce solutions for arbitrary velocity models. This is because the FNO learns a general solution operator to the PDE, not a solution for one specific velocity model. The model therefore does not need to be retrained for each region. Thus, the approach offers the potential for a single FNO model to be used by the entire seismology community for any region of a similar size. Although the initial cost of performing the training simulations and training an FNO may be substantial, it only needs to be paid once for the community as a whole.

## Data and Resources

All the data presented in this study are synthetic and available upon request. The supplemental material for this article demonstrates that the misfit between the simulations using spectral element method (SEM) and the Fourier neural operator (FNO) is minimal.

## Declaration of Competing Interests

The authors declare that there are no competing interests.

## Acknowledgments

The authors thank Jack Muir for helpful comments on an early version of the article.