The conjugate gradient method can be used to solve many large linear geophysical problems — for example, least-squares parabolic and hyperbolic Radon transform, traveltime tomography, least-squares migration, and full-waveform inversion (FWI) (e.g., Witte et al., 2018). This tutorial revisits the “Linear inversion tutorial” (Hall, 2016) that estimated reflectivity by deconvolving a known wavelet from a seismic trace using least squares. This tutorial solves the same problem using the conjugate gradient method. This problem is easy to understand, and the concepts apply to other applications. The conjugate gradient method is often used to solve large problems because the least-squares algorithm is much more expensive — that is, even a large computer may not be able to find a useful solution in a reasonable amount of time.

## Introduction

The conjugate gradient method was originally proposed by Hestenes (1952) and extended to handle rectangular matrices by Paige and Saunders (1982). Claerbout (2012) demonstrates its application to geophysical problems. It is an iterative method. Each iteration applies the linear operator and its adjoint. The initial guess is often the zero vector, and computation may stop after very few iterations.

The adjoint of the operator A, denoted as AH, is defined as the operator that satisfies 〈Ax, y〉 = 〈x, AHy〉 for all vectors x and y (where 〈u, v〉 represents the inner product between vectors u and v). For a given matrix, the adjoint is simply the complex conjugate of the transpose of the matrix; this is also sometimes known as the Hermitian transpose and is sometimes written as A or A. Just to muddy the notation water even further, the complex conjugate transpose is denoted by A.H in NumPy and A′ in Matlab or Octave. However, we will implement the adjoint operator without forming any matrices.

Many linear operators can be programmed as functions that are more intuitive and efficient than matrix multiplication. The matrices for operators like migration and FWI would be huge, but we avoid this problem because once you have the program for the linear operator, you can write the adjoint operator without computing matrices. Implementing the conjugate gradient algorithm using functions to apply linear operators and their adjoints is practical and efficient. It is wonderful to see programs that implement linear algorithms without matrices, and the programming technique is a key theme in Claerbout's 2012 book.

This tutorial provides a quick start to the conjugate gradient method based on Guo's pseudocode (2002). Those interested in more depth can read Claerbout (2012) and Shewchuk (1994). A Jupyter Notebook with Python code to reproduce the figures in this tutorial is at https://github.com/seg/tutorials.

## The forward and adjoint operators

Starting with a known reflectivity model m, we create synthetic seismic data d = Fm, where F is the linear operator that performs the function “convolve with a Ricker wavelet.” Given such a trace and the operator, the conjugate gradient method can be used to estimate the original reflectivity.

Hall (2016) calls the operator G instead of F, and he creates a matrix by shifting the wavelet and padding with zeros. In contrast, I implement the operator using the NumPy function convolve(). This is advantageous because it allows us to solve the linear equation $m∘=F−1d$ without ever having to construct (or invert) this matrix, which can become very large. This matrix-free approach is faster and uses less memory than the matrix implementation.

We can add one more feature to the operator and implement it with its adjoint. A convenient way to combine the two operations is to use a so-called “object-oriented programming” approach and define a Python class. Then we can have two methods (i.e., functions) defined on the class: forward, implementing the forward operator, and adjoint for the adjoint operator, which in this case is correlation.

Claerbout (2012) teaches how to write this kind of symmetrical code and provides many examples of geophysical operators with adjoints (e.g., derivative versus negative derivative, causal integration versus anticausal integration, stretch versus squeeze, truncate versus zero pad). Writing functions to apply operators is more efficient than computing matrices.

Now that we have the operator, we can instantiate the class with a wavelet. This wavelet will be “built in” to the instance F.

Now we can compute d = Fm simply by passing the model m to the method F.forward(), which already has the wavelet:

This results in the synthetic seismogram shown in Figure 1.

Now that we have a synthetic, we wish to solve the linear inverse problem and estimate the model m from the data d. The model can not be completely recovered because the Ricker wavelet is band limited, so some information is lost.

One way to solve linear problems is to start with an initial guess and iteratively improve the solution. The next few paragraphs derive an iterative method. You do not need to understand all the derivation, so you might want to lightly read it and move to the section about the pseudocode.

We start with an initial estimate for the model, $m∘0=0$ (the zero vector) and compute the residual $r0=d−Fm∘0$ (i.e., the difference between the data and the action of the forward operator on the model estimate). A good measure of the error in the initial solution is the inner product 〈r0, r0〉 or np.dot(r0, r0) in code. This is equivalent to the squared norm (length) of the residual vector r0, and constitutes our cost function. If the cost is 0, or within some small tolerance, then we have a solution.

We can improve the estimate, $m∘0$ by selecting a direction s0 and a scale α0 to move $m∘0$ to a new guess, $m∘1=m∘0+α0s0$. You can compute the direction in model space that most rapidly decreases the error. This is the gradient of 〈$d−Fm∘,d−Fm∘$〉 or 〈r0, r0〉. If you grind through the mathematics, the gradient g0 turns out to be given by the action of the adjoint operator on the residual: g0 = FHr0.

The scalar α0 is computed to minimize the residual, $r1=d−Fm∘=d−F(m∘+α0s0)=r0−α0Fs0$. We can then compute α0 by taking the derivative of 〈r1, r1〉 with respect to α0, setting the derivative to 0, and solving for α0. The result is:

These equations define a reasonable approach to iteratively improve an estimated solution, and it is called “the steepest descent method.” The conjugate gradient algorithm builds on this. It is not much harder to implement, has similar cost per iteration, and faster convergence.

The first iteration of the conjugate gradient method is the same as the steepest descent method. The second (and later) iterations compute the gradient g1 and Fg1. With s0 and Fs0 from the previous iteration we can then compute the step direction. Scalars α and β are computed to minimize

Some mathematical manipulations determine the best direction is s1 = g1 + β s0 where β = 〈g1, g1〉 / 〈g0, g0〉. The conjugate gradient algorithm is guaranteed to converge when the number of iterations is equal to the dimension of $m∘$, but only a few iterations often give sufficient accuracy. For our implementation, we'll start with a simplifed version of the pseudocode provided by Guo (2002), with its implementation in Python on the right:

## Results

The Python code in the previous section was used to invert for reflectivity. Figure 2 shows the five iterations of the conjugate gradient method. The conjugate gradient method converged in only four iterations; the results of the fourth and fifth iteration almost exactly overlay on the plot. Fast convergence is important for a practical algorithm. Convergence is guaranteed in 50 iterations (the dimension of the model).

Figure 3 compares the original model and the model estimated using conjugate gradient inversion. Conjugate gradient inversion does not completely recover the model because the Ricker wavelet is band limited, but side lobes are reduced compared to the data.

Figure 4 compares the predicted data with the original data to show that we have done a good job of the estimation. This demonstrates more than one reflectivity sequence when convolved with the Ricker wavelet fits the data, in particular the original model and the model estimated by the conjugate gradient method. It may be interesting to explore preconditioning operators that promote a sparse solution.

The Jupyter notebook provided with this tutorial further explores finding least-squares solutions using the conjugate gradient method. The notebook demonstrates how preconditioning can be used to promote a sparse solution. It also provides examples using the solver provided in the SciPy package.

## Conclusions

I described the conjugate gradient algorithm and presented an implementation. This is an iterative method that requires functions to apply the linear operator and its adjoint. Many linear operators that are familiar geophysical operations like convolution are more efficiently implemented without matrices. The reflectivity estimation problem described in Hall (2016) was solved using the conjugate gradient method. Convergence only took four iterations. The conjugate gradient method is often used to solve large problems because well-known solvers like least squares are much more expensive.

## Acknowledgments

The SEG Seismic Working Workshop on Reproducible Tutorials held 9–13 August 2017 in Houston inspired this tutorial. For more information, visit http://ahay.org/wiki/Houston_2017.