Oklahoma and Kansas experienced unprecedented seismic activity over the past decade due to earthquakes associated with unconventional hydrocarbon development. The modest natural seismicity and incomplete knowledge of the fault network in the region made it difficult to anticipate the locations of earthquakes with larger magnitudes (Mw4). Here, we show that monitoring of microearthquakes at regional scale using a pretrained neural phase picker and an earthquake relocation algorithm can illuminate unknown fault structures, and deliver information that can be synthesized for earthquake forecasting. We found that 80% of the larger earthquakes that occurred in the past decade could have been anticipated based on the spatial extent of the seismicity clusters that were formed before these earthquakes occurred. We also found that once a seismicity cluster with a length scale enough to host a larger earthquake was formed, there was a ∼5% chance that it would host one or more larger earthquakes within a year. This probability is nearly an order of magnitude higher than one based on Gutenberg–Richter statistics and preceding seismicity. Applying our approach in practice can provide critical information on seismic hazards for risk management and regulatory decision making.

Oklahoma and Kansas experienced unprecedented seismic activity over the past decade due to earthquakes triggered by unconventional hydrocarbon development (Ellsworth, 2013; Keranen et al., 2014). The region did not have a history of frequent earthquake activity, and little was known about the location of the faults that came to host it. For that reason, seismic hazard forecasts relied on the seismicity rate of the previous years without information on the distribution of regional faults (Petersen et al., 2016, 2017, 2018).

Microearthquake activity can illuminate otherwise unknown faults, particularly when events are detected down to low magnitudes and are located precisely. The resulting seismicity can be treated as a sample of the underlying fault geometry and used to estimate important fault attributes, such as length, segmentation, and orientation. In this study, we apply machine learning algorithms to reanalyze publicly available continuous waveform data in Oklahoma and Kansas from the decade 2010 to 2019, to develop a comprehensive regional earthquake catalog. From this improved catalog, we demonstrate that most of the faults that hosted earthquakes with moment magnitude of 4 or greater (Mw4) were illuminated by detectable seismicity in advance, such that they could have been better anticipated had real‐time microearthquake monitoring been operational.

We used continuous waveforms from 17 publicly available seismic networks in Oklahoma and Kansas region (codes 4H, 9L, GS, N4, NP, NQ, NX, O2, OK, TA, US, XR, Y7, Y9, ZD, ZP, and ZQ), comprising 420 stations, from January 2010 to December 2019, which totaled 796 station‐years of data. The locations of the stations are shown in Figure S1, available in the supplemental material to this article, and their details are listed in Table S1. We used a pretrained neural network (PhaseNet; Zhu and Beroza, 2019) to detect earthquakes, and pick P‐ and S‐phase arrival times. Because the picks and the classification scores vary depending on where the arrival is in the 30 s input window, we adopted a small stride of 2 s and used only the picks that were consistently predicted in at least five windows with a prediction score of 0.5 or greater. These picks were then associated using a grid search algorithm (Johnson et al., 1997; Zhang et al., 2019) in which theoretical travel times were computed from the velocity model from Darold et al. (2015). We used the theoretical travel times to constrain further the possible time window ranges between a pair of phases. This allows us to restrict the length of the time window between the first phase pick and the following phase picks during association, which removes potential ambiguities that can be introduced with longer time windows. To declare an earthquake, we not only required the association results to have phase picks from at least three stations in which two of them have both P and S phases picked but also used an adaptive score defined as
(1)Associationscore=i=1Nw(ri)[1{ϕiA}]i=1Nw(ri),
to address seismic network variability, which is important because the extent and density of the network varies strongly with time (Table S1). Here, N is the maximum number of phases that can be observed, that is, twice the number of stations that were operating at the time of the earthquake, and w(ri) is a weighting function. The indicator in the numerator, that is, 1 evaluates to 1 if the ith phase (ϕi) is in the association result (A). To downweight stations at greater distance from the epicenter of the grid search solution, we used the following weighting function:
(2)w(ri)=min(1ri,1R),
in which we set the cutoff radius (R) for constant weight to 10 km. This makes the association criteria more (or less) conservative when there are more (or less) stations near the source. We required the association results to have an association score of 0.3 or greater. We determined initial hypocentral locations with HYPOINVERSE‐2000 (Klein, 2002) and refined them with hypoDD (Waldhauser, 2011) using the same velocity model from Darold et al. (2015). For the final locations, we used the differential travel times calculated from the phase picks supplemented with time‐domain cross‐correlation measurements. For the latter, we detrended and band‐pass filtered the data to 1–20 Hz and used a window length of 1 s (0.2 s before and 0.8 s after each phase pick). We also used the three‐point quadratic interpolation on the cross‐correlation function to obtain subsample precision, and a weighting function that considers both the largest and the second largest cross‐correlation coefficient to capture confidence in the measurements, following Shelly et al. (2013, 2016). We determined local earthquake magnitudes using the procedure and distance correction function developed by Walter et al. (2019) but adopted moment magnitudes from Herrmann et al. (2011) when available. The resulting 310,046 earthquakes are shown in Figure 1 along with mapped faults from Marsh and Holland (2016) and Mw4 earthquakes from this period. The Mw4 earthquakes were chosen based on the adopted moment magnitudes, and the total count was 60. The earthquake catalog is provided in Table S2, and the details of the 60 Mw4 earthquakes are provided in Table S3. We chose deliberately conservative association parameters to avoid false detections, because the quality of the individual earthquakes could not be manually checked due to the large number. In addition, only the magnitudes of the 60 Mw4 earthquakes are used in our analysis, whereas others are only used for plotting the magnitude–frequency distributions in Figure 2 and Figures S5–S11.

Earthquakes form clusters that illuminate the activated segments of underlying fault structures. However, clustering the earthquakes is a challenging task due to the nonuniformity in their spatial distribution and density, location uncertainty, and faults that intersect. Clustering algorithms that are solely based on point coordinates are not robust against these conditions. Because most faults in the area are vertical strike slip, and because focal depths are less certain than the epicenter, we ignored the focal depths and used a ridge detection algorithm (Chen et al., 2015) to map the epicenters into ridges. We then estimated the normal vector at each point along the ridge using principal component analysis on the neighboring ridge points within 50 m. Using both the coordinates and the estimated normal vectors of the ridge points, we segmented the ridges and assigned nearby earthquakes to their closest ridge segment. We did not use the earthquakes that could not be confidently relocated using hypoDD (Waldhauser, 2011), because our analysis more strongly depends on precise locations rather than catalog completeness. Finally, we postprocessed the clustered earthquakes by manually joining clusters that were well aligned and removing the clusters with low aspect ratios that did not look like a planar fault or those with less than 30 earthquakes. Every step, especially the last step, involves some level of subjectivity when choosing parameters for the algorithms and deciding on which clusters to merge. However, the statistics we discuss here were counted after finalizing the clustering and thus were not subjectively biased. Figure 3 demonstrates the workflow applied to one of the most challenging areas to cluster (highlighted in Fig. 1).

For each cluster, we used the minimum bounding box, as shown in Figures 1 and 3e, to measure the length by taking the longest side and the strike angle by taking the slope. We then used the circular fault model (Eshelby, 1957), that is,
(3)r=(716M0Δτ)1/3,
with a constant stress drop (Δτ) of 3 MPa, which is a typical value for induced earthquakes (Huang et al., 2016), and defined the length scale L of an earthquake with a given moment magnitude (Mw) as twice the radius (r), which yields L = 1.14 km for an Mw 4 earthquake (referred to as LM4 hereafter). Based on these assumptions, 430 identified clusters had length scales of at least LM4. The details of these clusters are listed in Table S4, and the distribution of measured strike angles is shown in Figure 1. We also measured the length scales of these clusters in time by repeatedly applying the clustering approach that was used for a sparse catalog (Schoenball and Ellsworth, 2017b) after every added earthquake on each of the 430 clusters. In each iteration, we ignored the isolated earthquakes and measured the length scale of the largest subcluster using a minimum bounding box to ensure that only the dense part contributes to the length.

Illumination of unknown structures

The seismicity in our catalog reveals widespread and previously unknown fault structures in Oklahoma and Kansas, including those that hosted the Mw4 earthquakes. The trend of seismicity shows clear preferred orientations that are close to optimally oriented with respect to the maximum horizontal stress direction (Alt and Zoback, 2016). Some sequences, particularly the Pawnee sequence in Figure 1, display faulting on multiple structures and at all resolvable scales. Another aspect that emerges are clear seismicity boundaries. The Nemaha ridge (Fig. 1a) has been hypothesized as a barrier to flow (Weingarten et al., 2015), and our result strengthens the case. Specifically, the seismicity in eastern Oklahoma tends to be constrained to the east of the ridge south of 36.25° N but to the west of the ridge farther north. Another example is shown in Figure 1b in which the seismicity is confined to the west of the highlighted line, which is nearly parallel to the mapped faults in the vicinity.

Our catalog shows substantial improvement in the earthquake count and spatial resolution compared to the previous studies that used other workflows for improving upon the routine catalog (Schoenball and Ellsworth, 2017a), including template matching (Skoumal et al., 2019). Figure 2 shows an example of the seismicity from these previous studies and from our catalog in the area highlighted in Figure 1 as well as their cumulative magnitude–frequency distributions. The added microearthquakes in our catalog effectively connect the dots between scattered earthquakes and small clusters formed by larger earthquakes, highlighting more extensive and continuous structures. More zoomed‐in examples are shown in Figures S5–S11. However, the magnitudes in the three catalogs were taken from different sources and/or calculated using different correlations, and thus should not be directly compared.

Preillumination of structures with sufficient length scales

Figure 4 shows the seismicity before and after each of the 60 Mw4 earthquakes, and compares their moment magnitudes to those calculated from the measured length of the hosting clusters prior to their occurrence. Based on our analysis, 48 of the 60 Mw4 earthquakes (80%) occurred on clusters that were large enough to host events of at least their size given antecedent seismicity such that they may not have come as a complete surprise. On the other hand, 12 of the 60 Mw4 earthquakes (20%) occurred on clusters that would not have been flagged as large enough, as no clear cluster with sufficient length scale was identified in advance. This could reflect either a genuine lack of seismic activity before these Mw4 earthquakes, or it could be due to insufficient station coverage for seismicity monitoring (Fig. S12), which is most likely to be relevant for earthquakes 1 and 2 in Figure 4. By capturing the aftershocks of these “surprising” earthquakes, however, the hosting clusters became more definite over time such that the measured length scales exceeded those of the following Mw4 earthquakes on those faults. For example, earthquake 34 in Figure 4 was a surprise, but the next seven Mw4 earthquakes (earthquakes 40, 41, 42, 44, 45, 46, and 51), including the 2016 Mw 5.0 Fairview mainshock (earthquake 42), occurred on a clearly defined structure with a sufficient length scale. Such persistently active features may be especially useful for hazard analysis at long‐term injection sites where operating time scales can be years to decades.

Chances for an Mw4 cluster hosting one or more Mw4 earthquakes

Among the 430 clusters we identified, only 32 of them hosted one or more Mw4 earthquakes in the past decade. If we define the clusters that hosted one or more Mw4 earthquakes within a future time interval after reaching a length scale of at least LM4 as true positives (TP) and those that did not in the same time interval as false positives (FP), we can calculate the precision score, that is,
(4)Precision=TPTP+FP.

Similar to the exceedance probabilities used in seismic hazard analysis, this accounting could be interpreted as an empirical probability of having at least one Mw4 earthquake within the set time interval once a fault activates over a length scale of at least LM4. Based on our results, the precision scores were evaluated as 4.8%, 5.5%, 5.7%, and 6.2% for time intervals of 1, 2, 3, and 4 yr, respectively, in which 4 yr was enough to contain all the 26 clusters that produced at least one Mw4 earthquake in the past decade. The TP and FP clusters classified based on the one‐year time interval are shown in Figure 5. We are not counting the false negative cases in which a cluster was either unidentified or had a length scale shorter than LM4 before hosting the first Mw4 earthquake. The result shows that most of the clusters that hosted an Mw4 earthquake did so within the first year after formation (20 out of 26), and the number of those that took longer was relatively small. It is not clear, however, if this is a feature of cluster evolution, a result of changes in oil and gas operations or a consequence of regulatory restrictions on wastewater disposal.

For comparison, the probability of having one or more Mw4 earthquakes in the next year can be calculated under the Poisson assumption, that is,
(5)P(Mw4)=1exp(λ),
in which the rate parameter (λ) can be derived from the Gutenberg–Richter statistics (Gutenberg and Richter, 1944). If we calculate the rate parameters at the midpoint coordinates of the 430 clusters by interpolating the values reported by Petersen et al. (2016, 2017, 2018) and use equation (5), the probabilities of having one or more Mw4 earthquakes in 2016, 2017, and 2018 evaluates to an average of 0.50%, 0.59%, and 0.78%, respectively, with a standard deviation of 0.27%, 0.21%, and 0.35%, respectively (Table S4). Thus, even though only a fraction of the clusters with length scale of at least LM4 produced one or more Mw4 earthquakes within a year, the empirical probability we defined is comparable to those found for time‐dependent forecasting of tectonic earthquakes (Jones, 1985; Marzocchi and Lombardi, 2009) and represents a probability gain (Aki, 1981) of 9.6, 8.1, and 6.2 relative to the already elevated seismicity rates used to produce the one‐year forecasts for 2016, 2017, and 2018, respectively.

These statistics were derived from a simple scaling relationship (equation 3), and the clusters we identified do not necessarily represent a simple, planar fault. However, these approaches could still be used to support decision making when structures with concerning length scales are revealed by antecedent seismicity, such as taking steps for risk reduction (Langenbruch and Zoback, 2016). It could also be used to motivate enhanced geophysical monitoring such as monitoring the pressure in the disposal zone or deploying more seismometers around concerning clusters to improve microearthquake monitoring.

Enhanced microearthquake monitoring has become straightforward due to the advances in automated phase picking algorithms and other well‐established methods of precision seismology. We developed a comprehensive regional earthquake catalog for a decade for Oklahoma and Kansas that can be thought of as retrospective monitoring using a workflow suitable for real‐time operations. We used this catalog to demonstrate that the faults that hosted induced seismicity were distinct from the previously mapped faults, and that locations of most of the Mw4 earthquakes should not have been a complete surprise. Using the correspondence between preilluminated structures and earthquake potential to quantify earthquake probabilities would support improved near real‐time operational and regulatory decision making.

Continuous waveform data for the 17 public seismic networks were downloaded through https://www.iris.edu/ (last accessed May 2022). The links to each seismic network data are 4H: doi: 10.7914/SN/4H_2014, 9L: doi: 10.7914/SN/9L_2013, GS: doi: 10.7914/SN/GS, N4: doi: 10.7914/SN/N4, NP: doi: 10.7914/SN/NP, NQ: doi: 10.7914/SN/NQ, NX: doi: 10.7914/SN/NX, O2: doi: 10.7914/SN/O2, OK: doi: 10.7914/SN/OK, TA: doi: 10.7914/SN/TA, US: doi: 10.7914/SN/US, XR: doi: 10.7914/SN/XR_2016, Y7: doi: 10.7914/SN/Y7_2016, Y9: doi: 10.7914/SN/Y9_2016, ZD: doi: 10.7914/SN/ZD_2014, ZP: doi: 10.7914/SN/ZP_2016, and ZQ: doi: 10.7914/SN/ZQ_2011. The supplemental material contains the details of the stations used in this study are available in Table S1, the earthquake catalog produced from this study is available in Table S2, the details of the 60 Mw4 earthquakes are available in Table S3, and details of the 430 identified clusters are available in Table S4.

The authors declare that there are no conflicts of interest recorded.

This work was supported by the Stanford Center for Induced and Triggered Seismicity, and the Department of Energy (Basic Energy Sciences; Award Number DE‐SC0020445). The authors thank Daniel Trugman and an anonymous reviewer for their constructive feedback.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data