## ABSTRACT

Eikonal models have been widely used for traveltime computations in seismic imaging, but they are often criticized for low accuracy and poor resolution of the output image. Including amplitude information can provide higher model resolution and accuracy of the images. We have developed a new approach for computing eikonal traveltimes and amplitudes, and we implemented it for multicore central processing unit (CPU) and graphics processing unit (GPU) architectures. Traveltimes and amplitudes are computed simultaneously in iterative sweeps over the 3D velocity model. This is achieved by extending a fast sweeping method to compute amplitudes directly after the traveltimes with upwind numerical difference schemes. Because the extra computations are performed together with the traveltimes, the additional cost for the amplitudes and raypaths is low. We tested our method on synthetic 3D data sets to compute traveltimes, amplitudes, and raypaths, from which the Helmholtz Green’s function was assembled. Using a grid of $124^3$ nodes, the computations were performed in less than 1 s. Because it requires computing times that are several orders of magnitude shorter, the proposed method could serve as a feasible alternative to the computationally demanding full waveform modeling in seismic applications.

## INTRODUCTION

Wave propagation phenomena play a central role in the fields of physics, environmental research, and medical imaging. The fast and accurate numerical modeling of these phenomena is hence an important task. In the field of geophysics, the availability of an accurate and fast wave modeling method can result in higher resolved images of the subsurface in shorter computing time, therefore allowing for lower costs and higher confidence for oil and gas exploration and new possibilities in geological research. Furthermore, a faster and more accurate wave modeling method can improve risk assessment methodologies for earthquakes and volcanoes.

Computing the direct solution of the wave equation is a difficult and time-consuming process (Mehra et al., 2012). Nonreflecting boundary conditions and imposed time stepping further increase the required computational time. Taking the Helmholtz equation as the static part of the wave equation is a frequently used simplification (Haber and MacLachlan, 2011). However, using the Helmholtz equation does not sufficiently decrease the computational time, and the solution process remains complicated (Haber and MacLachlan, 2011). A widely used approach to avoid these drawbacks is to use the Helmholtz Green’s function for a point source (Luo et al., 2012). First-arrival traveltimes and their amplitudes are needed to estimate the Helmholtz Green’s function. These traveltimes and amplitudes are computed as solutions of the eikonal and transport equations, respectively. We introduce methods that solve both equations fast and accurately.

## BACKGROUND

Several finite-difference schemes have been introduced in past years to solve the eikonal equation. Vidale (1988) proposes the expanding square method in 2D and extends it later for applications in 3D with the expanding box method (Vidale, 1990). In this method, traveltimes are updated on the surface of a box around the source. When all nodes in this structure are updated, the adjacent nodes are used to build a larger box. The expanding box method is still widely used, but it sometimes suffers in accuracy for complex velocity models due to the fact that the box shape does not in general resemble a wavefront (Rawlinson and Sambridge, 2005).

The fast marching method (Sethian and Popovici, 1999) instead tracks a general wavefront as it moves through the domain. It is more stable, more accurate, and faster than the expanding box method. To track a general wave shape, nodes are divided into three groups: alive, close, and far (Rawlinson and Sambridge, 2005). All nodes in the wavefront, called *close* nodes, are sorted by their traveltimes. The close node with the smallest traveltime is relabeled as alive, thereby advancing the front. Traveltimes of nodes in the vicinity of the relabeled node that are not alive are recomputed and labeled as close. The procedure continues until all nodes are alive. The method handles sharp velocity contrasts well, and its accuracy is sufficient for most applications (Sethian and Popovici, 1999). The computational cost is of order $N \log N$, where the $\log N$ term stems from the ordering. The feasibility of on-the-fly modeling therefore decreases with increasing data set size. Even though there have been some attempts at parallelization (Herrmann, 2003), the method remains sequential by nature because the front passes only one node at a time. This is a drawback because processing units are becoming cheaper rather than faster (Sutter, 2005).

The fast sweeping method (Zhao, 2004) sweeps through the 3D domain in eight alternating directions. When sweeping through the grid in one direction, traveltimes are computed for wavefronts traveling through the grid in that direction. It is an iterative method that sweeps through the domain until the solution converges. The method has a computational cost of $O(N)$ and can be parallelized to some degree, which makes it appropriate for large data sets (Zhao, 2007). Using the fast sweeping method for the amplitudes amounts to sweeping separately for the traveltimes and amplitudes until convergence (Luo et al., 2012).

A recently developed method, called the *3D parallel marching method* (3DPMM) (Gillberg et al., 2012b), is a parallel sweeping method that extends the parallel marching method (Weber et al., 2008) to three dimensions. In 3DPMM, a stencil shaped like a tetrahedron is used instead of the standard Godunov “box” stencil. The sweeps in 3DPMM are in different directions compared with traditional fast sweeping method sweeps, which yields a significant gain in parallelization opportunity. On a grid with $N^3$ nodes, $N^2$ nodes can be computed in parallel. If these parallel computing opportunities are exploited, the computational time can be decreased by several orders of magnitude. Diagonal stencils also increase the accuracy compared with “box” stencils (Jeong and Whitaker, 2008; Gillberg et al., 2012a), independent of the algorithm used. Therefore, 3DPMM offers a very fast and accurate solution of the eikonal equation.

Some methods try to combine aspects of sweeping and front tracking algorithms. This marriage creates very efficient, but also more complicated, algorithms that allow for parallel implementations. For instance, the semiordered list of active subdomains (SOLAS) method (Gillberg et al., 2014) applies 3DPMM to selected parts of the full domain. We expect the methodology introduced in this paper to be applicable to algorithms other than pure 3DPMM.

## A NEW METHOD FOR TRAVELTIME AND AMPLITUDE COMPUTATION

We propose a solver for the Helmholtz Green’s function that takes full advantage of parallel computer architectures by extending the 3DPMM method. The main idea is to solve the transport equation simultaneously with the eikonal equation. Our key observation is that for any given node, the upwind amplitude data are in the vicinity of the upwind traveltime data. The data used to compute new traveltimes can be reused immediately to compute amplitudes. In our novel approach, when the traveltime of one node is updated, we immediately estimate the amplitude for the same node. Therefore both amplitude and time values are updated in the same sweep through the domain. Because we only use data from the last update planes, the computations of all nodes in the current update plane are independent. In this way, entire planes can be updated in parallel. To estimate new traveltimes, the entry location of the ray yielding the new traveltime is computed directly. The result is a method that calculates first-arrival traveltimes, amplitudes, and all rays simultaneously and efficiently on multiprocessor architectures.

The remainder of the paper is organized as follows: The theory section gives an overview of the basic methods and the main principles of the algorithm, including a summary of the 3DPMM method and a section detailing the computation of raypaths and entrance locations. This is followed by a section describing two stencils for the amplitude estimation. The proposed method was applied to three synthetic examples to show the functionality of the novel solver for the Helmholtz Green’s function as described in the results section.

## THEORY

## COMPUTING RAYPATHS AND ENTRANCE LOCATIONS

Characteristic curves of the eikonal equation are known as *rays* in seismology. Characteristic curves of the isotropic eikonal equation trace the fastest path between two points given the velocity (Červený, 2005). They are often defined in terms of the Hamiltonian $H$ of the equation. For the eikonal equation 2, the Hamiltonian is defined as

### Estimating the Laplacian for slowly changing velocity fields

The first method of estimating the Laplacian uses a wide area upwind of the stencil and is therefore applicable only to slowly changing velocity fields. A single assigned node in the stencil configuration in Figure 1a is sufficient for creating a new time estimate, but the amplitude calculation needs at least seven nodes with traveltime values to estimate the Laplacian. We therefore propose to use two more stencil planes in the upwind direction, as shown in Figure 3a. Already estimated nodes in a 147-point cuboid behind the stencil are used to approximate the second derivatives required for the Laplacian. The pseudocode for the algorithm using this stencil extension is given in Algorithm 1. Provided a source that is sufficiently large with respect to the spatial dimensions, this stencil extension manages to create an amplitude estimate even during the first subsweep, when most of the nodes within the stencil do not yet have a finite time estimate (see Figure 3b). However, the method lacks accuracy because the estimation of the Laplacian is not sufficiently local when a large set of nodes is used. The fraction of nodes that are not upwind of the update node can be high, especially for rapidly varying velocity fields, which leads to poor accuracy.

**Algorithm 1** One sweep with the wide-area estimation of the Laplacian.

    for i = 4, …, nx do
        for j = 4, …, ny − 3 do
            for k = 4, …, nz − 3 do
                t_old = T_ijk
                for all of the eight tetrahedron stencils do
                    compute a t_new estimate of T_ijk using values on the tetrahedron base
                    compute ∇T(x)
                    estimate entrance location x_E = (x_E, y_E)
                T_ijk = min(T_ijk, all t_new estimates)
                estimate ∇²T(x) using T_{i−a, j±b, k±c}, a, b, c ∈ {0, 1, 2, 3}
                create new amplitude estimate a_new
                A_ijk = a_new
                select angle of incidence
                update the previous estimates


### A local approach for the estimation of the Laplacian

Provided a sufficiently large source, the method introduced in the previous section results in a new amplitude estimate for all numerical experiments we tested. This reduces the number of iterations needed for convergence, but the computational cost of performing an update is high. Two key points must be addressed: First, the number of sweeps can only be reduced by one with the stencil described in the previous section. Because the fast sweeping approach is an iterative method, and therefore needs many sweeps to converge, a reduction of the needed sweeps by one is only a minor improvement. Second, the computational cost of finding the Laplacian in a 147-point cuboid is high considering that it has to be done for every node in the grid.

With these considerations in mind, we have designed a method that uses data from a smaller area behind the stencil. The Laplacian is estimated faster and more accurately at the expense of an increased number of sweeps needed for convergence. The idea is a method that does not always create an amplitude estimate for nodes at which the time value has been updated. Rather, the algorithm assigns an amplitude value only if a Laplacian can be calculated from local data. Different ways of estimating the Laplacian are possible. It may be tempting to directly use the seven-point stencil (see Figure 4) behind the pyramid stencil for the calculation of the Laplacian; however, the accuracy is poor in this case. The key to higher accuracy is to vary the position of the three-point stencils for the second derivatives, depending on the entry of the ray into the pyramid base. The three-point stencils closest to the raypath are used for the estimation of the Laplacian. In cases in which the rays are outside the stencils, the shortest path along the stencil boundary is used (see Figure 4). Amplitudes and traveltimes have to converge independently; therefore, at least three sweeps are necessary. The algorithm using the local estimation of the Laplacian is presented in Algorithm 2.

Due to the special stencil shape, entire 2D planes can be updated in parallel. The simultaneous computation of amplitudes and traveltimes and the abundant parallelization make the method very efficient.

**Algorithm 2** One sweep with the local estimation of the Laplacian.

    for i = 3, …, nx do
        for j = 3, …, ny − 2 do
            for k = 3, …, nz − 2 do
                t_old = T_ijk
                for all of the eight tetrahedron stencils do
                    calculate ∇T(x)
                    compute T_ijk using T_{i−1, j±a, k±b}, a, b ∈ {0, 1}
                select t_new = min(eight time estimates), update T_ijk
                calculate corresponding entrance location x_E = (x_E, y_E)
                calculate ∇²T(x) using T_{i−a, j±b, k±c}, a, b, c ∈ {0, 1, 2}
                compute A_ijk using A_{i−1, j±a, k±b}, a, b ∈ {0, 1}
                if amplitude computation possible and t_new ≤ t_old then
                    select a_new
                else if amplitude computation possible and t_new > t_old and amplitude is unassigned then
                    select a_new
                select the corresponding angle of incidence from ∇T(x)


## IMPLEMENTATION DETAILS

The current implementation, using the local approach for the estimation of the Laplacian, was written in C++ and compiled with the CUDA compiler nvcc. Because of the special stencil shape, some of the $T$-nodes were placed in shared memory. *Shared memory* is a user-managed cache on graphics processing units (GPUs). Because the computation of one node is quite demanding, the use of CUDA shared memory only leads to a speedup of approximately 10% compared with the use of global memory. Our code was also parallelized for multicore CPU architectures using OpenMP. The code was compiled with full optimization (the -O3 flag). The numerical experiments were performed on two different machines.

The first machine is equipped with 32 GB of DDR3 RAM, a four-core Intel Core i7-4800MQ CPU (2.70 GHz), and a GeForce GTX 770M GPU. The second machine is a Linux server with 64 GB of DDR3 RAM, two eight-core Intel E5-2670 CPUs (2.6 GHz) with a total of 16 physical compute cores, and a Tesla K20x GPU. The computing times presented are averages over four repeated measurements; the variation between runs is negligible.

## RESULTS

In this section, we present the computational times needed for three different velocity models. All results were computed with a spherical source region covering 1% of the volume of the domain, initialized by first-order ray tracing. Different source sizes can be used depending on the demanded accuracy and the complexity of the velocity model. For every experiment, computing times for all sweeps required until convergence are presented. Convergence is reached when no amplitude or traveltime estimate changes by more than 0.1% compared with its value after the previous sweep. For each experiment, CPU computing times for a sequential computation, as well as using multiple cores, are presented, together with the computational times when using the mentioned GPUs. The CPU and GPU measurements use data stored in float precision.

The runs were performed on two different grid sizes, consisting of $124^3$ and $248^3$ nodes. The grid sizes were chosen to satisfy the constraints of the specific GPU used for the experiments. Other grid sizes are possible as long as they do not exceed the GPU’s memory limitations.

### Homogeneous velocity field

The simplest case, using a homogeneous velocity field, is a special challenge for the method. This is due to straight rays traveling along the stencil boundaries. Our estimation of the Laplacian suffers slightly in accuracy in these regions, as seen in Figure 5. This error does not emerge when rays are bent. Convergence was reached after three iterations. Computing times are presented in Table 1.

Table 1. Computing times in seconds for the homogeneous velocity field (CPU columns give the processor and the number of cores used).

| Grid | i7, 1 | i7, 4 | E5, 1 | E5, 4 | E5, 8 | E5, 16 | 770M | K20x |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $124^3$ | 11.18 | 3.84 | 14.39 | 8.40 | 7.20 | 5.91 | 0.90 | 0.71 |
| $248^3$ | 94.49 | 30.83 | 147.42 | 75.44 | 57.09 | 53.13 | 6.21 | 4.49 |


### Two homogeneous half-spaces

The method was also applied to two homogeneous half-spaces with a smooth transition zone (see Figure 6). Once again, we see inaccuracies in the 45° regions due to the inaccuracy of the stencils along the boundaries of the tetrahedrons. We can see a rise in amplitude where rays collide (Figure 6), a natural behavior of linear systems known as *superposition* or, in the case of wave phenomena, *interference*. By reducing the width of the transition zone, we can model interfaces in the domain. The algorithm also works for noncontinuous interfaces; however, the accuracy improves if interfaces are treated explicitly by creating new wavefronts along them. Convergence was reached after three iterations. The computing times are presented in Table 2.

Table 2. Computing times in seconds for the two homogeneous half-spaces (CPU columns give the processor and the number of cores used).

| Grid | i7, 1 | i7, 4 | E5, 1 | E5, 4 | E5, 8 | E5, 16 | 770M | K20x |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $124^3$ | 10.58 | 4.06 | 14.75 | 8.33 | 7.01 | 5.48 | 0.92 | 0.80 |
| $248^3$ | 94.37 | 29.90 | 144.81 | 62.79 | 56.09 | 50.57 | 6.23 | 4.08 |


### The random velocity field

The third example is a random velocity field (see Figure 7) containing velocities in the range of $1000$–$1500\ \mathrm{m/s}$. It was created by assigning random values in this range to uniformly distributed nodes and interpolating linearly between them. The highest velocity gradients are $50\ \mathrm{s^{-1}}$. Once again, we can see the superposition principle of waves as amplitudes rise where rays collide. In this example, rays avoid slow velocity areas. Notice that we have full ray coverage also in the slow velocity areas, and any “shadow zone” problems are avoided with the suggested method (Červený, 2005). Convergence is reached after four iterations. The computing times are presented in Table 3.

Table 3. Computing times in seconds for the random velocity field (CPU columns give the processor and the number of cores used).

| Grid | i7, 1 | i7, 4 | E5, 1 | E5, 4 | E5, 8 | E5, 16 | 770M | K20x |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $124^3$ | 14.56 | 5.95 | 20.14 | 11.22 | 9.22 | 7.68 | 1.21 | 1.01 |
| $248^3$ | 116.17 | 39.25 | 179.91 | 71.80 | 64.72 | 53.60 | 8.25 | 7.94 |


### Comments on performance

The GPUs consistently outperform the tested CPUs. This is expected because 3DPMM allows many nodes (a minimum of $124^2 = 15376$ nodes in the presented examples) to be updated in parallel at any time. The CPU implementation scales well up to four cores but flattens out thereafter. A significant amount of data is needed to update a node, which seems to limit the performance of the multicore implementation.

## DISCUSSION AND PERSPECTIVES

The introduced method is able to compute the Helmholtz Green’s function in a stable, fast, and accurate manner; however, it suffers from some restrictions. As mentioned earlier, the current implementation is built for acoustic waves in isotropic media. The adjustment for elastic waves is straightforward, although it requires some modifications. Because the entrance location of the ray into the stencil is already used, the algorithm exhibits a natural ability for anisotropy. Adapting the algorithm to anisotropic problems is possible but not trivial because the gradient of the wavefront no longer coincides with the raypath. However, raypaths can be found in the time estimation process (Gillberg et al., 2012b), and our approach of estimating the Laplacian of the time field close to the ray origin can be used. The associated computational costs are expected to be higher in such an extension. In our current implementation, we use first-order time stencils. An extension to second-order stencils is straightforward. In a second-order framework, the estimation of the Laplacian would be more natural. Future work will include a special treatment of interfaces in the domain and a multi-GPU implementation for larger velocity models. When solving for traveltimes, the direction of the ray (angle of incidence) and the ray entrance into the stencil element are implicitly computed. Given the entrance location, rays can be computed by additionally solving for the shoot-out angle that is constant along the rays. Rays allow for illumination studies and an analytical inverse step during tomography.

## CONCLUSION

We have introduced a methodology to compute first-arrival traveltimes and amplitudes simultaneously by extending the 3DPMM method. The method was implemented for multicore CPUs and for GPUs, resulting in a fast and efficient solver for the Green’s functions of the Helmholtz equation. When solving for traveltimes, the direction of the ray is implicitly computed, giving us the angle of incidence and the raypaths. The proposed method was presented for first arrivals of acoustic waves, but extensions to anisotropic velocity fields are possible. The numerical experiments showed that the accuracy of the proposed method is highly dependent on the initialization step and the estimation of the Laplacian of the time field. The method may therefore perform best in conjunction with accurate ray tracing methods in the vicinity of the source. The presented algorithm takes only first arrivals into account, and it is therefore a high-frequency approximation. As shown in our paper, it is possible to estimate several wave phenomena components in combination with first-arrival traveltime solvers. In fact, computing the amplitudes together with the traveltimes increases the accuracy of the computed amplitude field compared with postprocessing of the time field. Accurate amplitudes give rise to an additional source of information when imaging the subsurface. The methodology is well suited for parallel implementations, resulting in very fast solvers for Helmholtz Green’s functions. The proposed method can be applied to different components of a wave and presents a first step toward a faster tool to model wave motion. An accurate and fast tool for forward modeling of wave phenomena could work as a viable alternative to full-waveform modeling in several seismic applications. Our methodology is especially appealing considering that the computational time is on the order of seconds.

## ACKNOWLEDGMENTS

The presented work was funded by Kalkulo AS and the Research Council of Norway under grant 238346. The work has been conducted at Kalkulo AS, a subsidiary of Simula Research Laboratory. We would like to thank S. Clark, A. Magnus Bruaset, and C. Tarrou for helpful comments and support.