## ABSTRACT

The determination of background velocity by full-waveform inversion (FWI) is known to be hampered by the local minima of the data misfit caused by the phase shifts associated with background perturbations. Attraction basins for the underlying optimization problems can be computed around any nominal velocity model, and they guarantee that the misfit functional has only one (global) minimum. The attraction basins are further associated with tolerable error levels representing the maximal allowed distance between the (observed) data and the simulations (i.e., the acceptable noise level). The estimates are defined a priori, and they only require the computation of (possibly many) the first- and second-order directional derivatives of the (model to synthetic) forward map. The geometry of the search direction and the frequency influence the size of the attraction basins, and the complex frequency can be used to enlarge the basins. The size of the attraction basins for the perturbation of background velocities in classic FWI (global model parameterization) and the data-space reflectivity reformulation (migration-based traveltime [MBTT]) are compared: The MBTT reformulation substantially increases the size of the attraction basins (by a factor of 4–15). Practically, this reformulation compensates for the lack of low-frequency data. Our analysis provides guidelines for a successful implementation of the MBTT reformulation.

## INTRODUCTION

The full-waveform inversion (FWI) approach to seismic inversion consists of minimizing the misfit between the recorded and synthetic data with respect to the earth parameters. It uses the full waveform of the recorded waves as data, which makes it free of horizon picking, and hence a good candidate for automatic inversion. It was first introduced by Bamberger et al. (1977, 1979) for the 1D wave equation, and it was followed, in two dimensions, by the work of Lailly (1983) and Tarantola (1984). In particular, Lailly (1983) links the gradient of the misfit functional, computed by the adjoint-state method, with the migration imaging operator used in geophysics. Initially carried out in the time domain, FWI can also be used in the frequency domain, as promoted by Pratt and Worthington (1990), Pratt and Goulty (1991), and Pratt et al. (1996). The use of complex frequencies was later introduced by Shin and Cha (2008, 2009), giving rise to the so-called Laplace and Laplace-Fourier domain FWI. Early numerical experimentations by Tarantola (1987) and Gauthier et al. (1986) already show the strength and the weakness of FWI: It can produce very clear and detailed images of the subsubsurface (strength), under the imperative condition that a good background velocity model is available, i.e., a priori information (weakness). In fact, the misfit functional with respect to background parameters shows local minima due to phase shift/cycle skipping, which hamper the reconstruction by local algorithms, unless the minimization starts inside the attraction basin.

In recent years, with the increase in computational power, the solution of the FWI approach to seismic imaging by local optimization algorithms has been extensively developed. The iterative minimization is usually conducted with Newton-type algorithms (Pratt et al., 1998). To avoid the formation of the dense Hessian (and its inversion), multiscale Newton methods are studied by Akcelik et al. (2002). For similar reasons, the pseudo-Hessian is studied by Shin et al. (2001) and the iterative conjugate gradient approach is implemented by Métivier et al. (2013). Estimating the step length is another challenge for the iterative minimization problem, and it is often conducted using the line search method (Nocedal and Wright, 2006). Alternatives for a more precise step exist, such as the trust region method (Eisenstat and Walker, 1994) or the maximum projected curvature method of Chavent (2010), but with increased computational complexity. The performance of FWI is further compared in the context of inverse scattering by Barucq et al. (2018), depending on the choice of search direction and the line search method.

However, none of these local optimization methods can avoid local minima. The determination of the full velocity model, including the low spatial frequency background, requires very low-frequency data (as unrealistic as 1 Hz) in marine seismic applications, coupled with a resolution from low- to high-frequency content to avoid “cycle skipping” (Bunks et al., 1995; Sirgue and Pratt, 2004). To overcome the presence of local minima when solving the FWI problem by local optimization algorithms, alternative parameterizations have been proposed. They are based on a decomposition of the earth parameter into background and reflectivity parameters, resulting in an increase of the computational complexity. The differential semblance optimization by Symes and Carazzone (1991) extends the depth reflectivity model to account for the various illuminations in the data, defining a semblance criterion whose minimization produces the desired background model; the migration-based traveltime (MBTT) approach of Chavent and Clément (1992), Clément et al. (2001), and Chavent et al. (2015) parameterizes the depth reflectivity by a data-space reflectivity, which is used to store the reflectivity during the update of the background. Note that the so-called reflection FWI (Xu et al., 2013) retains from MBTT the migration-demigration process, but it differs in that it does not use the data-space reflectivity concept and it uses the gradient of the classical FWI data misfit as the search direction.

In this paper, we investigate the optimizability of FWI in the frequency domain, using the geometric tools set up by Barucq et al. (2019b). We analyze the size of the attraction basin, in which the misfit functional is guaranteed to have no local minimum, and the associated maximal distance allowed between the data and the attainable set (i.e., the allowed noise level).

We first present the FWI minimization problem for a time-harmonic acoustic model, and its MBTT reformulation. The second section analyzes the influence of (complex) frequency and search geometry on the size of attraction basins and tolerable error levels for the classical FWI formulation. We also demonstrate the accuracy with experiments of FWI reconstruction. Then, we compare the optimizability of classic FWI and the MBTT reformulation. We quantify the improvement obtained with the MBTT approach which, thanks to the migration-demigration that eliminates phase shift/cycle-skipping problems, leads to larger attraction basins. We further indicate the guidelines for an efficient implementation of MBTT.

## TIME-HARMONIC INVERSE WAVE PROBLEM

### Propagation of time-harmonic acoustic waves

### Reconstruction via classic FWI

- •
The term $M$ is the model space, which is a subset of the parameter space $E=Rn$. The model is represented via $n$ coefficients (dimension of the unknown). In our application, it has a piecewise constant representation that follows the space discretization.

- •
The term $D$ is the data space. The data consist of a vector of $q$ complex numbers and $D=Cq$, with $q=nfrequency\xd7nsource\xd7nreceiver$, using the number of frequencies, sources, and receivers, respectively.

The main difficulty with FWI is the determination of the background velocity part of $m$ because the misfit functional suffers from many local minima caused by the phase shifts between the observed and simulated signals. These can only be avoided if the data contain extremely low frequencies, as illustrated in our experiments below. As an alternative, the following MBTT reformulation of the problem is designed to alleviate these difficulties.

### Background/data-space reflectivity decomposition for MBTT-FWI

We shall refer to $s$ as the data-space reflectivity, in opposition to the depth reflectivity $r$. The data-space reflectivity $s$ is one of the unknowns that we search for, together with $p$, in the MBTT formulation. At the optimum, $s$ can be understood as a demultiplied and deconvolved version of the data themselves, which hence provides a first approximation of $s$.

As indicated above, the motivation for parameterizing the reflectivity by an element $s$ of the data space is to eliminate the phase shifts induced in the simulations by changes in the background $p$: The events in the synthetics are obtained from the data-space reflectivity $s$ by migration, followed by simulation with the same kinematics. Hence, they are expected to have the same phase as those of $s$, as illustrated in our experiments. In addition to controlling the phase, the stack involved in the migration turns, for a fixed data-space reflectivity, the data misfit into a coherency measure for the current background (Chavent, 2017). On the other hand, the MBTT model decomposition induces greater computational cost because of the dependency on $DF(p)$ in equation 9. Evaluations are derived from adjoint-state formula, and we refer to Barucq et al. (2019b) in their Appendix C for more details.

The iterative reconstruction has been shown successful by Chavent et al. (2015), starting with 5 Hz data, and it follows alternative updates of the two model components:

- 1)
Start the reconstruction with the recovery of the reflectivity part $s$, using a fixed $p$.

- 2)
Fix the recovered reflectivity $s$, and iterate to reconstruct the background part of the model $p$.

- 3)
Continue with alternate updates, in a succession of increasing frequency.

The optimizability study that we provide below quantifies the enlargement of the attraction basins provided by the MBTT decomposition, in particular for the second step.

### Estimates of attraction basins and tolerable data error, methodology

We review the properties that allow us to represent the optimizability of least-squares problems such as in equations 5 and 12. We refer to optimizability as the absence of local minimum in the misfit functional, which ensures the convergence of the minimization algorithm toward the appropriate (hence global) minimum, provided that the data (observations) are “not too far” from the attainable set $F(M)$. The precise analysis can be found in Barucq et al. (2019b) (section 2). Here, we want to give a comprehensive overview of the tools that we use to analyze least-squares problems. It is important to note that the estimates to quantify optimizability are defined a priori; i.e., they only depend on the forward problem (not on the data) and can be computed without — and preferably before — performing any inversion.

In seismic applications, the model space $M$ does not usually contain much information because subsurface characterization is mostly unknown. In particular, the data phase shifts resulting from an error in the long spatial wavelength of the initial guess produce local minima for the problem stated in equation 5, hampering the search for the global minimum. Therefore, our idea is to analyze, from an initial model $m0$, what is, for a given direction $u$, the largest interval in which the misfit functional is guaranteed to contain no parasitic stationary points — and hence no parasitic local minima — beside the global minimum, provided that the error in the data is smaller than some tolerable level. We refer to such an interval $[m0\u2212\Delta m0uu,m0+\Delta m0uu]$*as* a directional attraction basin of size $\Delta m0u$ centered around $m0$ in the normalized direction $u$, i.e., with $\Vert u\Vert =1$.

^{1}, we give an intuitive presentation, in the case of a 2D data space, of the results of Chavent (2010) on strictly quasiconvex sets for the determination of attraction basins. Consider the curve or path $P$ of the data space, which is the image, by the forward map $F$, of the interval $[m0\u2212\Delta u,m0+\Delta u]$ in the model parameter space:

^{1}with the global minimum in $t=1$ and a parasitic minimum for a value $t\u2032\u2260t$. The same reasoning works if $t$ and $t\u2032$ are interior or endpoints of $P$. Of course, the distance to $d$ also has the same global and parasitic local minima over the subpath $Pt\u2032\u219dt$. Hence, $d$ necessarily belongs to normal half-spaces $N(t)$ and $N(t\u2032)$ at the endpoints of $Pt\u2032\u219dt$ (the dashed areas). The following inequalities can be observed in Figure

^{1}:

The numerical estimation of $RG(P)$ will be the basis for the a priori determination of attraction basins in the next sections.

^{1})

*sharp*: Equality holds as soon as $P$ is a plane curve that turns always in the same direction, which corresponds to the fact that all $d\Theta $ in equation 25 have the same sign. But it can also be very pessimistic; for example, when $P$ is a plane curve such that the $d\Theta $ in equation 25 changes sign steadily. Both situations will be met in the seismic applications below.

All the above formulas have been illustrated with a plane curve $P$. They remain true when $P$ is a curve of a Hilbert data space (at the exception of equation 25).

### Computation of estimates of attraction basins and tolerable data error

We apply the methodology to the a priori determination of directional attraction basins $[m0\u2212\Delta m0uu,m0+\Delta m0uu]$, which will allow us to analyze the behavior of seismic inverse problems and to compare formulations: The size $\Delta $ of a directional attraction basin in a descent direction tells us how far we can move away in this direction without being stopped by parasitic local minima. Our objective is hence to determine (see the illustration in Figure ^{2})

- 1)
The

*size*$\Delta m0u$ of the directional attraction basin centered at $m0$. The larger $\Delta m0u$ is, the better the least-squares problem is amenable to minimization by local algorithm because we allow a larger area for investigation. In our numerical experiments, we shall scale the estimate with the norm of the nominal model $\Vert m0\Vert E$ to provide the relative (to the model) quantity. - 2)The associated tolerable error level $RG,m0u$. It is the largest tolerable error on the data $d$, which ensures the absence of parasitic local minima for the least-squares objective functionover $[0,1]$. The larger $RG,m0u$ is, the better is the robustness of the minimization procedure to noise in the data. In our numerical experiments, we divide the estimates with the norm of the synthetic data $d0=F(m0)$ to provide relative (to the data) quantity.$t\u2208[0,1]\u219d12\Vert F(m0+(2t\u22121)\Delta m0uu)\u2212d\Vert D2,$(27)

We shall use two types of estimates:

- 1)
$\Theta $ estimates of $\Delta m0u$, based on equation 23, where optimizability is obtained by ensuring the sufficient condition $\Theta (P)\u2264\pi /2$. In this case, $RG(P)=R(P)$, so the tolerable error level is given by the minimum over the $[0,1]$ interval of $R(t)$ given in equation 18.

- 2)
$RG$ estimates of $\Delta m0u$, based on equation 17, where optimizability is obtained by satisfying the necessary and sufficient condition $RG>0$. In this case, the associated tolerable error level $RG$ has to be computed by numerically evaluating the infimum of $RG(t,t\u2032)$ in equation 15.

#### Exact $\Theta $ and $RG$ estimates of $\Delta m0u$ and associated tolerable error levels

To compute the exact estimates, deflection maps and global radius maps are computed, which display the values of $\Theta (t,t\u2032)$ and $RG(t,t\u2032)$ between the points $t,t\u2032$ of $[\u2212\Delta ,+\Delta ]\xd7[\u2212\Delta ,+\Delta ]$. On the diagonal of the maps, where $t=t\u2032$, $RG$ is not defined in equations 30 and 31, and we indicate instead the values of $R(t)$ given in equations 18 and 19, which represent the limits of $RG(t,t\u2032)$ when $t\u2032\u2192t$. We review the different steps for the computation of the maps in Algorithm 1.

One can then read on these maps (see the numerical experiments below):

- 1)
The exact $\Theta $ estimate of the attraction basin size $\Delta m0u$, given by the largest square centered at $(0,0)$ where $\Theta (t,t\u2032)\u2264\pi /2$ for all $t$, $t\u2032$.

- 2)
The exact $RG$ estimate of the attraction basin size $\Delta m0u$, given by the largest square centered at $(0,0)$ where $RG(t,t\u2032)>0$ for all $t$, $t\u2032$.

During this process, when the size $\Delta $ of the investigated square increases from zero to the exact $\Theta $ estimate, the associated exact tolerable error $Rm0u=inf\u2212\Delta \u2264t\u2264\Delta R(t)$ decreases from its value $R0$ at $m0$ to the tolerable error $Rm0u$ of the $\Theta $-attraction basin. When the size $\Delta $ of the square increases further to the exact $RG$ estimate, the tolerable error $RG=inf\u2212\Delta \u2264t,t\u2032\u2264\Delta RG(t,t\u2032)$ continues to decrease, until it reaches the value zero, which gives the limit of the $RG$-attraction basins. The exact estimates are relatively demanding because they require evaluation of $RG$ and $\Theta $ for many couples $(t,t\u2032)$, and we now introduce the cheaper local estimates.

#### Local $\Theta $ estimate for $\Delta m0u$ and tolerable error level $Rm0u$

*at*$m0$ in the direction $u$. We use for that a rectangle approximation of the integral in equation 26, which gives the approximate upper bound $\Theta m0u$ to the deflection $\Theta (P)$:

## FWI: NUMERICAL DETERMINATION OF CONVERGENCE BASINS AND TOLERABLE ERROR LEVELS

Our objective in this section is to give a comprehensive understanding of the behavior of the seismic FWI problem equation 5 with respect to parameters such as the frequency, the geometry of the target, and the parameterization. We select a nominal model $m0$ and compute for different (normalized) directions $u$:

- 1)
The size $\Delta m0u$ of an attraction basin, which represents the interval, in the model space, where the misfit functional is guaranteed to have only one minimum.

- 2)
The associated tolerable error level $Rm0u$, which is the maximal distance allowed between the data and the attainable set (i.e., the maximal tolerable error level).

When $m0$ is smooth, the direction $u$ can be seen as the search direction or the geometry of the unknown parameter to be reconstructed in the framework of inversion.

We consider the 2D geophysical setup for the acoustic Helmholtz equation 1, and we take the squared slowness $m\u2254c\u22122$ as unknown parameter, expressed in $s2m\u22122$. In our experiments, we consider a $9.2\xd73\u2009\u2009km$ domain, and we take $nsrc=19$ sources and $nrcv=183$ receivers associated with each source. Both are located near the surface: The sources are at 10 m depth, and the receivers, which remain in the same position for all sources, are at 40 m in depth.

### Influence of frequency $\omega $ and search direction $u$

We first investigate individual frequency, where the forward map $F\omega $ at frequency $\omega $ for parameter $m$ is a vector of $C(nrcv\xd7nsrc)$. The nominal model $m0$ is the smooth background velocity pictured in Figure ^{3}, in which the velocity varies linearly with depth, according to the fact that no information is known on the subsurface structures to reconstruct.

We investigate the behavior of attraction basins for different directions $u$, which are chosen to be based on straight reflectors, or extracted from the Marmousi model, a synthetic wavespeed model designed by the Institut Français du Pétrole in the late 1980s; we refer to Versteeg (1994) for additional details. They are shown in Figure ^{4}: $u1$ has a single, slim reflector (Figure ^{4}), $u2$ has two slim reflectors (Figure ^{4}), $u3$ has a single large reflector (Figure ^{4}), and $um$ is obtained by high-pass filtering from the Marmousi model of Figure ^{4}. The amplitude for the directions is selected such that $uk$ is normalized: $\Vert uk\Vert =1$, for all $k$.

Given a perturbation size $\Delta $, one can think of $m1=m0+\Delta u$ as being the target velocity, which one tries to retrieve starting from the smooth model $m0$, by minimization of the FWI misfit function in direction $u$ by a local algorithm (it is analogous to the update formula given in equation 6 in Newton-type algorithm). The question is: How large can $\Delta $ be while still guaranteeing that the algorithm will not stop before the global minimum, provided that the error in the data is smaller than some $R$ or $RG$?

#### Local $\Theta $ estimates of $\Delta m0u$ and $Rm0u$

The determination of $\Delta m0u$ and $Rm0u$ with the local $\Theta $ estimate, equations 33 and 34, and $Rm0u$ with equation 34, only requires the computation of the first- and second-order derivatives in the direction $u$, obtained by differentiation of the Helmholtz equation (Barucq et al., 2019b, Appendix A). Figure ^{5} shows the evolution of $\Delta m0u$ (Figure ^{5}) and $Rm0u$ (Figure ^{5}) for frequencies between 0.5 and 15.0 Hz, for the four directions $u$ of Figure ^{4}.

We observe that, in all cases, the size $\Delta m0u$ of the directional attraction basin decreases with increasing frequency, meaning that the largest basin of attraction is obtained for the lowest frequencies. This is the expected behavior that indicates that lower frequencies give a better chance for the iterative minimization to converge, especially when no prior information is known for the initial model.

Concerning the effect of the perturbation direction on attraction basins, Figure ^{5} shows that the size of the basins decreases in the order $u1\u2192u2\u2192um\u2192u3$. This can be understood as follows: Directions $u1$, $u2$, and $um$ are essentially reflective and contain no low-frequency components. Hence, increasing $\Delta $ will not change the phase of backscattered data, but because of multiple reflections, some nonlinearity will occur, which limits the size of the attraction basins. The directions $u1$ and $u2$ with fewer multiple reflections have the largest basins of attraction. Then, comes $um$, which has many reflectors and hence many multiples. On the contrary, direction $u3$ has a strong low-frequency component besides its reflectivity. Hence, increasing $\Delta $ will generate a phase shift in the backscattered data, which is likely to create local minima more quickly, and one expects the size $\Delta m0u$ of the attraction basin in the direction $u3$ to be much smaller than in the three other directions.

Concerning the largest tolerable error, Figure ^{5} shows that $Rm0u$ increases with the frequency, meaning that the low frequencies are more likely to be affected in the presence of noise. We notice the exception of $u3$ for which the allowed error remains relatively similar (and worse than the others) with frequency.

These results clearly show that it is harder to converge toward the salt dome than it is to recover Marmousi-like structures. Because of its low spatial frequency component, the former reduces the size of the attraction basin and is less tolerant to noise in the data. This situation is further illustrated in the context of model reconstruction in the subsection below. Note that the order of magnitude in Figure ^{5} and ^{5} is approximately $10\u22121$ of the norms of the model and data, respectively.

#### Exact $\Theta $ and $RG$ estimates of $\Delta m0u$ and $Rm0u$

As an alternative to confirm these preliminary observations, we compute $\Delta m0u$ using the exact $\Theta $ and $RG$ estimate. It requires the computation of deflection and global radius maps on the investigated interval, from which the estimate is extracted. In Figures ^{6} and ^{7}, we show the resulting maps in the direction of the Marmousi reflectors $um$ of Figure ^{4} and the single large reflector $u3$ of Figure ^{4}, using frequencies of 4 and 7 Hz. In Table 1, we show the numerical values of the interval size centered around $m0$, depending on the estimation.

The maps confirm the results of the previous experiment: The exact $\Theta $ estimates of $\Delta m0u$ (Figure ^{6} and ^{6}) are of the same order of magnitude, but they are slightly larger than the local $\Theta $ estimates of Figure ^{5} (see Table 1). Similarly, the exact $RG$ estimates of $\Delta m0u$ (Figure ^{6} and ^{6}) are larger (Table 1), but at the price of a smaller tolerable error on the data. We also note a much more consistent pattern in the deflection for $um$ compared to $u3$. Comparing the formulas, we observe that the inexpensive local $\Theta $ estimates already provide accurate values for classic FWI problems, close to the more refined (and computationally more intensive) exact $\Theta $ and $RG$ estimates. Therefore, we shall only use the local $\Theta $ estimate in the remainder of this section.

### Influence of the initial model

We now investigate the influence of the initial model $m0$ on the size of attraction basins. We consider two initial models: the acoustic Marmousi medium $mm$ and a model $mc$ encompassing objects of high contrast. These are pictured, together with their respective local $\Theta $ estimates $\Delta mju$ given in equation 33, in Figure ^{8}, for the two directions $u1$ (one reflector) and $u3$ (one large reflector). We see that $\Delta mju$ is barely affected by the choice of initial model for the computation. We still observe the decrease in the size of the interval with increasing frequency, and the deterioration in the direction of the large reflector $u3$. Clearly, when comparing amplitudes of our estimates, the direction $u$ has much more influence than the initial model $m0$.

From this first set of experiments, we clearly see that the frequency should evolve from low to high values to facilitate the convergence. When no prior information is known for the initial model, one should take advantage of the larger interval size $\Delta m0u$ given by low frequencies. Here, we motivate the well-known frequency progression of, e.g., Bunks et al. (1995) and Sirgue and Pratt (2004), from quantitative estimates of the size of the basin of attraction, which allows the quantification of the benefits. However, the low frequencies require accurate data, i.e., close to the $F(M)$, as illustrated with the estimates of $Rm0u$. In addition, we mention the fact that seismic data are particularly affected by noise at low frequencies, which are sometimes unusable. This highlights where resides the difficulty: The low frequencies that are required for convergence may be unusable due to the noise, or the inaccuracy may violate the curvature condition.

### Frequency bandwidth data

We have seen that sequential progression in frequency must be conducted from the low to the high regime. We now investigate the case of frequency bandwidth: when the forward problem encompasses not only one, but several frequencies. This approach is particularly natural for the time domain inverse problem in which several frequencies are simultaneously contained in the data. The forward problem is now a vector of $C(nrcv\xd7nsrc\xd7n\omega )$, where $n\omega $ is the number of frequencies taken in the subgroup.

We remain with the smooth background of Figure ^{3} for $m0$. We select groups of 10 consecutive frequencies. The initial group encompasses the frequencies from 0.1 to 1.0 Hz using 0.1 Hz increment. Similarly, we design 15 groups so that the last one is from 14.1 to 15.0 Hz, using an identical increment. The local $\Theta $ estimate of the attraction basin size is shown in Figure ^{9}, together with the largest tolerable error $Rm0u$. We investigate two directions: the Marmousi one of Figure ^{4} and the large reflector of Figure ^{4}. We picture curves and histograms that correspond to sequential or group of frequencies, respectively. Numerical values for the attraction basis are prescribed in Table 2.

The comparisons provided in Figure ^{9} and ^{9} and Table 2 highlight that the size of the attraction basin is dictated by the largest frequency in the group when using the group of frequencies. Therefore, it is better to use the lowest frequency sequentially to have the largest size possible. For instance, using the single frequency 0.1 Hz provides a size of interval one order of magnitude larger than when including frequencies up to 1.0 Hz. Then, at a fixed frequency, taking into account lower frequencies does not affect the size of the interval.

This preference on sequential over multiple frequency is actually advocated by Brenders et al. (2012), motivated, as the frequency progression (Bunks et al., 1995), by the intuition that it reduces the cycle-skipping effect. Here, we justify this choice by a quantitative estimation of the size of the attraction basin. Moreover, we show that once the lowest frequencies are taken individually, incorporating them in higher frequency content does not modify the size of the attraction basin.

Regarding the largest tolerable error on data, Figure ^{9} shows that adding frequency increases the robustness in the direction $um$, except for low frequency. Then, for the direction of the large contrasting object $u3$, we do not observe any improvement, and the allowed distance remains relatively stable, as it was observed in Figure ^{5}.

### Selection of complex frequencies

We compute the estimation of $\Delta m0u$ (using the local $\Theta $ estimate) and $Rm0u$ with complex frequencies to see how it affects the basin of attraction and robustness to noise. We consider $m0$ the smooth background velocity. The estimates are shown in Figures ^{10} and ^{11}, in which we compare the behaviors on a logarithmic scale for the 2D complex frequency plane. Here, $f$ and $\sigma $ vary from 0 to 15, with the frequency given in Hz and the Laplace component in $s\u22121$, excluding the case in which $f$ and $\sigma $ are zero.

It is clear that the incorporation of a damping coefficient in the frequency is beneficial for the convergence because it increases the size of the attraction basin of several orders of magnitude. We observe

- 1)
For a fixed $f$, the interval size increases with increasing damping, with the exception of the case $f=0\u2009\u2009Hz$ (see Figure

^{10}and^{10}). - 2)
At fixed damping $\sigma $, the interval size decreases with increasing $f$.

- 3)
Following a selection from the large to small attraction basin gives

- •
Initial iterations should use $f=0\u2009\u2009Hz$ (the largest interval), and a damping coefficient progression from low to high (see Figure

^{10}and^{10}). - •
Fix $f\u22600\u2009\u2009Hz$ as small as possible, the damping coefficient should evolve from high to low (see Figure

^{10}and^{10}), and then $f$ increases progressively. - •
Finally, one can take $\sigma =0s\u22121$ and $f$ evolves from low to high values.

Regarding the largest tolerable data error $Rm0u$, we see that at fixed $\sigma $, increasing $f$ allows an increase in the distance between the data and the attainable set. Then, the pattern strongly depends on the type of the search direction geometry because it has already been observed in the previous experiment (and contrary to $\Delta m0u$). The distance is allowed to increase with $\sigma $ for the direction $um$, but it decreases in the direction of the strong reflection, $u3$. Also in the direction of the Marmousi reflectors, Figure ^{11}, we observe a band in which the allowed distance decreases, around $\sigma =1s\u22121$. On the other hand, for both directions, the case in which $\sigma =0s\u22121$ appears the most robust (i.e., where the allowed distance between the data and the attainable set is maximal). In terms of the magnitude, we clearly see that the high contrast object has globally much less allowance, except when the damping is close to zero.

It is important to notice that the use of complex frequencies, especially when $f=0\u2009\u2009Hz$, improves the convergence by increasing notably the size of the attraction basin. Therefore, it provides a fundamental benefit for the reconstruction, in particular when no information is initially known. Note that one can also use very low frequencies (actually, the magnitude of the estimates for complex frequencies eventually matches the very low-frequency ones; we can see a continuation in the magnitudes). On the other hand, such frequencies are more sensitive to noise and require accuracy in the data. We refer to Shin and Cha (2008), Shin et al. (2010), Petrov and Newman (2014), and Faucher (2017) for illustrations of the reconstruction.

### Experiments of FWI reconstruction

We illustrate the influence of the subsurface geometry with an acoustic FWI experiment, in which we demonstrate the behavior observed in the previous analysis (Figures ^{5} and ^{10}). In particular, we have shown that a high contrast object complicates the process of reconstruction compared to a Marmousi-like subsurface, by reducing the size of the attraction basin. In addition, we have seen that the very low or complex frequencies, by increasing the size of the basin of attraction, provide benefits on the reconstruction.

Therefore, we perform FWI for 2D geophysical experiments in the acoustic settings. We consider a domain of size $9.2\xd73\u2009\u2009km$, which is the size of the models presented in Figure ^{8}. We use the algorithm to recover the wavespeed $c$ in the Helmholtz equation 1, from measurements of pressure fields (see equation 3). For the seismic acquisition, we consider 91 sources and 183 receivers per source, which are both located at 20 m depth. The measurements consist of synthetic frequency-domain data with added noise.

We apply the subsurface reconstruction FWI procedure, for two targeted subsurface media of different geometries. We consider the reconstruction of the Marmousi model of Figure ^{8}, and the reconstruction of the medium encompassing salt domes of Figure ^{8}. For both experiments, we take the same initial (starting) model, which does not assume any information on the subsurface (see Figure ^{12}).

#### Comparison between Marmousi and salt geometries

The iterative minimization is conducted using frequencies from 2 to 10 Hz, with 1 Hz increment. There are 20 iterations per frequency, which are taken sequentially (from low to high). In Figure ^{12}, we compare the reconstructions of the Marmousi model and the salt dome, using a similar setup: the same model size, the same frequency progression, and the same initial model with no a priori information (actually, the starting model shown in Figure ^{12} is relatively close to the background of the salt medium shown in Figure ^{8} in terms of amplitude).

We observe that only the experiment with the Marmousi medium is able to accurately capture the subsurface structures, in which only the deepest parts are missing in Figure ^{12}. On the other hand, none of the contrasting salt domes are retrieved, see Figure ^{12}, with only artifacts in the reconstruction. It demonstrates the results of the previous analysis that, by reducing the size of the basin of attraction, high-contrast objects are more complicated to reconstruct than Marmousi-like media when using a similar frequency range in the data. Note that to obtain a better reconstruction, one could use an initial model less challenging than the one of Figure ^{12}, i.e., by incorporating a priori information.

#### Low or complex frequencies compensation

It has been shown in Figure ^{5} that low frequencies enlarge the size of the basin. As an alternative, complex frequencies produce the same behavior (see Figure ^{10}). To demonstrate the validity of these observations, we repeat the experiment of salt dome reconstruction, incorporating low or complex frequencies:

- 1)
In Figure

^{13}, we show the reconstruction that starts with lower frequencies, using the sequence ${0.5,0.75,1,2,3,4,5,6,7,8,9,10}\u2009Hz$, hence incorporating three (unrealistically) low frequencies compared to the reconstruction of Figure^{12}. - 2)
In Figure

^{13}, we first use the following set of complex frequencies (labeled as $(\sigma ,f)$ in equation 35): ${(7,2),(5,2),(2,2),(1,2),(7,3),(5,3),(2,3),(1,3)}$, with $f$ given in Hz and $\sigma $ in $s\u22121$, before using the same sequence as for Figure^{12}: from 2 to 10 Hz (where $\sigma =0s\u22121$). Note that our selection of complex frequencies follows our deduction (low to high for $f$, high to low for $\sigma $).

The FWI algorithm benefits from the incorporation of low or complex frequencies, and it is now able to accurately recover the contrasting objects, as expected. We note that the complex frequency experiment produces less accurate results; in particular, it misses the deepest boundary of the salt dome. However, it does not use a frequency of less than than 2 Hz. Then, both experiments show drastic improvement compared with the results of Figure ^{12}.

We have used synthetic experiments to validate the basin of attraction estimates and its dependency on the geometry of the target and on the frequency. The use of very low frequencies (as in the reconstruction of Figure ^{13}) is efficient but unrealistic in seismic application. The complex frequency alternative is probably a better candidate, but also relies on efficient data preprocessing (e.g., noise removal).

To overcome the issue of low-frequency requirement, the MBTT model decomposition has been introduced. The decomposition is expected to ameliorate the robustness with respect to missing low frequencies, in particular for the recovery of the background velocity. In the next section, we investigate its performance, by the means of attraction basin size, and we quantify the benefits of this approach.

## OPTIMIZABILITY: MBTT VERSUS CLASSIC FWI

In this section, we apply the optimizability analysis tools to compare the MBTT formulation of the FWI presented in equation 12 (where the unknown model is parameterized by a smooth background $p$ and a data-space reflectivity $s$), with the original FWI problem stated in equation 5 (where the unknown model is the squared slowness $m$). The MBTT approach was originally designed to overcome the phase shift problem that hampers classic FWI, so the optimizability study of this section is expected to quantify the gain — if any — of the MBTT formulation over classic FWI in this respect.

### Choice of the model

^{3}and $mm$ is the Marmousi model of Figure

^{8}. Note that if we assume that $m0=p0$ generates only the first arrivals as shown in equation 11, we have $s0(\omega )=F\omega (mm)$, so that in our computations, the nominal $s0(\omega )$ coincides with the Marmousi data — but the nominal depth reflectivity $r0(\omega )$ differs from the true Marmousi reflectivity.

^{14}at 4 and 7 Hz.

### Choice of perturbation directions for MBTT decomposition

For the perturbation of the background part of the velocity model, we consider the direction of a 1D ramp $u$, given in Figure ^{15}. In Figure ^{16}, we show the forward problem (i.e., the solution of the Helmholtz equation at the receivers’ location, deprived from the direct arrivals) corresponding to unperturbed and perturbed models, for a single, centrally located source at 4 Hz. The figure shows the effect of applying the perturbation onto the global model (as used in classic FWI) or onto the background unknown $p$ of the MBTT formulation: When the perturbation is applied to the global model $m$, the phases of the signal are shifted, and the amplitudes are incorrect, but when the perturbation is applied to the background MBTT variable $p$, the phases of the signal are preserved, and only the amplitudes are modified.

For the perturbation direction $us$ associated with the data-space reflectivity $s$, we select a random vector. Note that we expect large basins of attraction with respect to $s$ in the MBTT formulation because the FWI misfit functional is known to be nearly quadratic with respect to the reflectivity, i.e., to the high spatial frequency part, and consequently for the MBTT misfit functional with respect to the data-space reflectivity $s$.

### Comparison of local $\Theta $ estimates

^{17}.

From Figure ^{17}, we observe that the size of the attraction basins decreases with frequency, and it is only slightly larger when the perturbation is applied onto $p$ (MBTT) instead of $m$ (classic FWI). The gain appears minimal, and it is not up to the expectations raised as the MBTT parameterization allows us to overcome the phase-shift problem, as shown in Figure ^{16}. Similarly, the attraction basin for $s$ is relatively small compared to what is expected. However, we shall recall that these are only local $\Theta $ estimates that can represent a coarse approximation. Therefore, we compute exact $\Theta $ estimates in the following, to investigate if those preliminary observations still hold.

### Comparison of exact $\Theta $ and $RG$ estimates

To compute the exact $\Theta $ and $RG$ estimates, the perturbation of the model is either applied onto the global model $m$ (FWI) or onto the parameters $p$, $s$ of the MBTT decomposition. Therefore, the deflection and global radius of the curvature maps are computed between the following points, for all $\u2212\Delta \u2264t,t\u2032\u2264\Delta $:

- 1)
FWI (attraction basin for $m$): $F(m0+tu)$ and $F(m0+t\u2032u)$;

- 2)
MBTT (attraction basin for $p$): $F(p0+tu,s0)$ and $F(p0+t\u2032u,s0)$;

- 3)
MBTT (attraction basin for $s$): $F(p0,s0+tus)$ and $F(p0,s0+t\u2032us)$.

In Figures ^{18} and ^{19}, we show the deflection and global radius of curvature maps at 7 Hz, for $m$, $p$, and $s$. The corresponding values for the estimates are provided in Table 3, in which we also give the results for the frequency 4 Hz.

We now observe a clear difference between the MBTT parameterization and the global model one, and it shows the gain (in terms of attraction basin size) of the former approach (see Table 3). In the context of global model FWI (Figures ^{17} and ^{18}) the local and exact estimates are of the same size, as observed in the above section. In sight of the upper bound equation 32, on which the local $\Theta $ estimate is based, one can think that the FWI formulation corresponds to the “worst” situation, in which the image of a segment in the background space is a curve close to an arc of circle in the data space. However, in the MBTT formulation (Figures ^{17} and ^{18}), the exact $\Theta $ estimate is approximately 10 times larger than its local $\Theta $ estimate. Therefore, the exact estimates are needed for a proper comparison of the FWI and MBTT approaches. Then, we observe that the values of deflection are much lower when the background perturbation $u$ is applied to $p$ (MBTT) rather than to $m$ (FWI): Few portions reach $\pi /2$ with MBTT (Figure ^{18}), whereas it is rapidly attained with classic FWI (Figure ^{18}). Consequently, the MBTT formulation produces larger $\Theta $-attraction basins than the standard FWI formulation, roughly by a factor of 10 (Table 3). Regarding the global radius of curvature $RG$, it is strictly positive for the complete investigated interval in the case of MBTT, whereas, in classic FWI, it rapidly decreases to zero outside the diagonal. Therefore, the $RG$-attraction basins are much larger for MBTT, by a factor of 4–8 (Table 3). In addition, the amplitude of $RG$, whose minimum over the attraction basin gives a tolerable error level, is, near the diagonal, larger for FWI than for MBTT.

Regarding the data-space reflectivity $s$, Figure ^{19}, the exact $\Theta $-attraction basin is particularly large (23–54 times the norm of $s0$ depending on frequency), and it is $105$ times larger than its local estimate. We observe the expected behavior because the forward map $F$ is nearly linear with respect to $s$, hence resulting in large attraction basins. It also confirms the necessity of exact estimates for accurate characterization of attraction basins in the context of MBTT parameterization.

To summarize, MBTT significantly extends the size of attraction basins with respect to background perturbations, explaining its success with relatively high-frequency content in the data (Chavent et al., 2015). However, it costs a reduction in the admissible error level in the data.

### Influence of the reflectivity level

The previous comparisons have been performed on a nominal model that has a reflectivity level $\beta $ of 1% (see equation 38). We repeat the same computations, at 7 Hz, changing only the reflectivity of the model, which is fixed to a low level, 0.2% in Figure ^{20} (close to the Born approximation), and to a stronger value, 5% (close to Marmousi reflectivity) in Figure ^{21}. Table 4 summarizes the sizes and corresponding tolerable error of the attraction basins.

Increasing the reflectivity increases the relative importance of multiple reflections — whose phases are *not* controlled by the migration/demigration included in the MBTT approach — with respect to primary reflections, whose phases *are* controlled. It is then expected that the performance of MBTT, when applied to a full-wave model as is the case here, will deteriorate when the reflectivity level increases. This is confirmed in Table 4: The MBTT attraction basin for a reflectivity level of 5% is three times smaller than at 1%, which itself is two times smaller than at 0.2%. On the other hand, the FWI attraction basin remains essentially unaffected by the reflectivity level, and the MBTT basins remain always larger, by a factor of 4–15 with respect to the $\Theta $ estimates and a factor of 8 for the $RG$ estimates.

### Influence of background smoothness

Here, we illustrate an important condition that has to be satisfied for the MBTT decomposition to work. When a background perturbation is applied to $p0$, whereas $s0$ is maintained fixed, the phase shifts of the reflections are controlled, provided these reflections are generated *only* by $s0$. Any backscattered energy generated by the updated background $p0+tu$ — or worse, by $p0$ itself — will lead to reflections whose phase is *not* controlled by the migration-demigration process included in MBTT, thus annihilating its benefits. This appears already in Figure ^{18}, in which one sees that the deflection reaches $\pi /2$ (respectively, the global radius of curvature equates zero) much closer to the diagonal at the top-right corner than at the bottom-left one. This zone (upper right corner of Figure ^{18}) corresponds to large positive values of $t$, $t\u2032$, which lead to velocity models ($p0+tu$ and $p0+t\u2032u$) made of strongly increasing background, consequently generating backscattered energy. At the opposite corner of the diagrams (bottom left), which corresponds to mildly increasing velocities, the amount of scattered energy is smaller and the limiting values are further apart from the diagonal. We have considered for simplicity only attraction basins that are centered at the nominal model $m0$, but on most of the deflection and curvature maps, the attraction basin could, in fact, be extended on the side of negative $t$ (where backgrounds are less steep).

In the case in which the unperturbed background scatters back some energy, as the one of Figure ^{22}, the situation is even worse, as illustrated in the corresponding deflection and curvature maps of Figure ^{22}, computed at 4 Hz. These maps show no enlargement of the attraction basin in favor of MBTT; in fact, it is even smaller than for FWI. That is why it is fundamental to ensure the smoothness of the background in the MBTT decomposition by using an adapted parameterization, e.g., spline functions are used by Chavent et al. (2015) to represent $p$.

### Behavior of the data misfit

We illustrate in this section the behavior of the FWI and MBTT misfit functions in the model space direction $u$ at the nominal model $m0=m(p0,s0)$ for some seismic data $d$, where $p0$ is the smooth background of Figure ^{3}. The cost functions are, respectively, $J(m0+tu)$ and $J(p0+tu,s0)$ for FWI and MBTT, according to equations 5 and 12. This will allow us to verify the theoretical analysis: If the minimum of the misfit function is smaller than the tolerable error $RGu$ associated to an attraction basin, the misfit functional should not have any local minimum over the basin. However, when the data exhibit an error larger than the tolerable error, local minima *may* but do not necessarily appear.

^{8}) in such a way that $\Vert dm\Vert =\Vert d0\Vert $. Then, the data that enter in the FWI and MBTT misfit functionals are defined by

When $\eta $ increases from zero to one, the distance of $d$ to the attainable set increases from zero to $\Vert dm\u2212d0\Vert $, which is larger than the tolerable error associated to the attraction basin of interest (FWI or MBTT).

The computations are done for the nominal model with a 1% reflectivity level, and the background perturbation direction $u$ of Figure ^{15}. Figure ^{23} shows the cost functions for FWI and MBTT at 7 Hz for different values of $\eta $. We also show by vertical lines the exact $RG$-attraction basin associated to zero tolerable error, and we show by horizontal lines the levels $1/2RG2$ corresponding to the tolerable error level associated to the exact $\Theta $-attraction basin. In Figure ^{23}, we observe that the FWI and MBTT attraction basins are essentially independent of the error level $\eta $ in the data. The size of the FWI $RG$-attraction basins is slightly larger than the one anticipated by $\Delta m0u$. When $\eta =0$, $d=d0$ is attainable, with a zero error level, and one sees that the first local minimum is just after the right limit of the exact $RG$-attraction basin; hence, it is consistent with our analysis. Note that the tolerable error level $RG,p0u$ on $d$ for the $\Theta $-attraction basin is attained at $\eta =0.1$. Clearly, when one moves away from the nominal value $t=0$, the local minima appear much earlier with FWI than with MBTT and the size of intervals in which there is no local minimum is well anticipated by the estimates $\Delta u$. In the numerical setting considered here, the MBTT approach appears to be robust with respect to data error levels much larger than the tolerable level because we do not see the local minimum appearing inside the attraction basin.

### Strategy for MBTT inversion

The numerical experiments lead to the following recommendations, for implementing the MBTT reformulation of FWI to obtain large attraction basins for the background reconstruction.

First, the background space for $p$ has to be chosen smooth enough and the reflectivity level $\beta =\Vert r\Vert /\Vert p\Vert $ large enough so that the energy backscattered by $p$ is negligible compared to that backscattered by $r$. On the other hand, the reflectivity level $\beta $ has to be chosen small enough so that the importance of multiples remains small compared to primary reflections in the synthetics. This can be achieved by using an adapted parameterization for the background $p$ (e.g., spline functions) and by tuning the reflectivity level $\beta $ to adjust the migration weight $W$ in equation 37. We are reminded that the relevance of the chosen settings can be checked before performing any iterative optimization algorithm, by computing the size of the $\Theta $ and $RG$-attraction basins and the associated tolerable error. In addition, we have noted that the attraction basins tend to be larger in the direction of slowly increasing (with depth) backgrounds, which suggests underestimating the initial slope of the background velocity with respect to depth.

The dependency of the size of attraction basins with the reflectivity level suggests the following organization for MBTT waveform inversion. First, we need to scale down the given seismic data so that they can be resimulated by a reflectivity level $\beta $ of approximately 0.2%–1.0%, which amounts to bring the forward model close to the Born approximation. Second, we perform alternate minimization with respect to $s$ and $p$, starting with $s$ first. Third, one can rescale (up) the processed seismic data to get closer to the original given data. Then, we pursue the alternate updates of $s$ and $p$. Eventually, we continue until the processed and given original data coincide. Note that all of the difficulties related to the choice of the background and the reflectivity level disappear if one uses a linearized model (the Born approximation) as the forward model.

## SUMMARY OF RESULTS

We have first studied the classic FWI problem for the acoustic wave equation in the time-frequency domain, and we computed directional attraction basins that give quantitative insight into optimizability of FWI. Unsurprisingly, the size of attraction basins becomes smaller when the frequency increases. In our tests, it is only of $\xb12%$ of the norm of the background at 4 Hz and $\xb11%$ at 7 Hz. Clearly, the use of complex frequencies including damping enlarges the attraction basin. However, the largest attraction basins obtained for low (or complex) frequency usually go with smaller tolerable error levels. Therefore, low and complex frequencies suffer more from noise.

We have observed that solving the problem for a sequence of increasing single frequency data is better in terms of the attraction basin than using an increasing sequence of frequency-bandwidth data. As expected, media with salt domes produce smaller attraction basins, and hence they are harder to recover.

We have then studied the MBTT approach, which uses a background/data-space reflectivity parameterization of the velocity model. In a direction of background perturbation, the largest attraction basin for MBTT shows an increase in size by a factor of 4–15 compared to FWI, and it is only barely reduced when the frequency increases. Correlatively, the tolerable error for MBTT is divided approximately by 10 compared to FWI. Nonetheless, the comparison of misfit functions has shown that the local minimum does not appear when the tolerable error is exceeded. In the direction of data-space reflectivity perturbations, in which the forward map is linear up to the multiple reflections, the attraction basin is very large, more than $\xb120$ times the norm of the nominal data-space reflectivity. Furthermore, we have identified the critical parameters for an efficient implementation of MBTT: the background smoothness and the reflectivity level.

## CONCLUSION

In this paper, we provide an a priori analysis of the optimizability of nonlinear least-squares minimization problems. For this purpose, we compute the size of the attraction basin around nominal models and the associated tolerable error level, such that the misfit functional is guaranteed to have only one (hence global) minimum in the attraction basin, provided the noise on the data is below the tolerable error level. These estimates are defined a priori, and they depend only on the first- and second-order directional derivatives of the forward problem.

A first application is carried out with the seismic FWI problem, in which the misfit functional is known to have local minima in the directions associated with low spatial frequencies perturbation (background velocity). Yet, note that our methodology is totally applicable for other least-squares minimization problems. Our findings confirm the inherent difficulty of FWI, which requires unrealistically low-frequency data for the determination of the full velocity model and which we illustrate with experiments of reconstruction.

To see if these limitations could be overcome, the methodology is then applied to compare with the data-space reflectivity MBTT reformulation, originally designed to alleviate the phase shift problem. The results show a substantial increase of the attraction basins for MBTT, which can indeed overcome the requirement for low frequencies in the data.

As a side product of this analysis, we finally provide some guidelines for an efficient implementation of the MBTT reformulation of FWI.

## ACKNOWLEDGMENTS

The authors would like to thank the editor, associate editor, and the three anonymous referees for their valuable comments that have contributed to improve the quality of the paper. F. Faucher acknowledges funding from the Austrian Science Fund (FWF) under the Lise Meitner fellowship M 2791-N. The research of F. Faucher is also supported by the Inria-TOTAL strategic action DIP. The research of H. Barucq has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement no. 777778.

## DATA AND MATERIALS AVAILABILITY

Data associated with this research are confidential and cannot be released.