Abstract
This study evaluates EQTransformer, a deep learning model, for earthquake detection and phase picking using seismic data from the Southern Alps, New Zealand. Using a robust, independent dataset containing more than 85,000 manual picks from 13 stations spanning almost nine years, we assess EQTransformer’s performance and limitations in a practical application scenario. We investigate two key parameters, overlap length and probability threshold, and their respective influences on detection consistency and false positives. EQTransformer’s probability outputs show a limited correlation with pick accuracy, emphasizing the need for careful interpretation. Our analysis of illustrative signals from three seismic networks highlights the challenge of consistently picking first arrivals when reflected or refracted phases are present. We find that an overlap length of 55 s balances detection consistency and computational efficiency, and that a probability threshold of 0.1 balances detection rate and false positives. Our study thus offers insights into EQTransformer’s capabilities and limitations, highlighting the importance of parameter selection for optimal results.
Introduction
Creating a seismicity catalog from raw, three‐component, continuous seismic data involves several steps to ensure accurate and reliable results. In a typical workflow, earthquake waveforms are recorded and picked in seismograms from individual stations. Phase arrivals from multiple stations are then associated with unique seismic sources. In this workflow, reliable earthquake detection and accurate phase arrival estimation are essential for obtaining an accurate catalog of earthquake hypocenters. Comprehensive seismicity catalogs provide first‐order constraints on the geometry, segmentation, seismogenic extent, and kinematics of active faults (e.g., Waldhauser and Schaff, 2008; Shelly et al., 2015; Woo et al., 2019; Liu et al., 2022; Warren‐Smith et al., 2022), and thus on their mechanical behavior and the hazards they pose.
The increased availability of large seismological datasets has intensified the use of automated methods for catalog creation (e.g., Kong et al., 2019; Arrowsmith et al., 2022), especially for time-consuming event identification and phase picking. Recent advances in machine learning models greatly aid these procedures, with performance that can rival manual approaches (see, e.g., Cianetti et al., 2021; Scotto di Uccio et al., 2023, for recent reviews).
Machine learning algorithms are commonly trained using large, labeled datasets (e.g., STanford EArthquake Dataset [STEAD]; Mousavi et al., 2019) and the open‐source distribution of some of these tools facilitates their application to other datasets. One such tool is EQTransformer—a deep learning model used for earthquake signal detection and seismic phase picking (Mousavi et al., 2020). EQTransformer incorporates a neural network with a multitask structure containing recurrent and convolutional layers with attention mechanisms at both global (full earthquake waveform) and local (P and S arrival) levels. Other promising automatic methods exist (e.g., Ross et al., 2018; Zhu and Beroza, 2019), but we focus here on EQTransformer due to its training on global data from diverse tectonic settings, widespread adoption, relatively low false detection rates, and rapid execution times (Münchmeyer et al., 2022).
Previous studies have compared EQTransformer against traditional and other deep learning models (Cianetti et al., 2021; García et al., 2022; Münchmeyer et al., 2022; Zhu et al., 2023). Conversely, testing with new continuous data on which EQTransformer has not been trained has been limited, with results varying across studies (Cianetti et al., 2021; García et al., 2022; Scotto di Uccio et al., 2023). Few parameter-testing studies have been conducted on continuous data: Park et al. (2023) remains the sole study analyzing the cause of prediction inconsistencies (see Fig. 1 for an example), attributing them to poor overlap parameter selection, and García et al. (2022) tested the probability threshold while using the default overlap parameter value. It is crucial to understand the relationship between EQTransformer’s output probability and pick accuracy (i.e., timing uncertainty or residuals), because probability outputs from this and other machine learning pickers are being used with phase association algorithms (e.g., GaMMA, Zhu et al., 2022; and REAL, Zhang et al., 2019).
Before using EQTransformer to assemble earthquake catalogs from continuous data it has not been trained on, its reliability and accuracy require further validation. Our study addresses this using a manually picked, multiyear seismicity catalog from the Southern Alps, New Zealand (Michailos et al., 2019) that was not included in EQTransformer’s training. This approach allows an unbiased evaluation of EQTransformer’s performance against manually picked detections, the optimization of parameters, and evaluation of the algorithm’s applicability to new datasets.
Data
The Southern Alps of New Zealand and the Alpine fault that bounds them have been the focus of several recent microearthquake studies using temporary networks (e.g., Boese et al., 2012; Bourguignon et al., 2015; Michailos et al., 2019; Warren-Smith et al., 2022; and references therein). The Southern Alps Micro-earthquake Borehole Array (SAMBA; Boese et al., 2012) is one such network adjacent to the central Alpine fault. It comprises 11 borehole sensors and six surface sensors, most of which have been in continuous operation since 2008 and 13 of which remain operational in 2022 (see Fig. 2 for station locations). SAMBA data vary in quality due to equipment upgrades, station outages, and diverse natural, anthropogenic, and equipment noise sources (Boese et al., 2012). We do not specifically address these data inconsistencies, both because doing so would be time consuming and because we aim to assess EQTransformer’s performance on complex raw data.
Michailos et al. (2019) used data from SAMBA and nearby networks to construct a microseismicity catalog for the central Alpine fault. This catalog includes 9111 earthquakes from late 2008 to early 2017. From 150,257 manually reviewed picks in the catalog, we use the 85,733 picks from the SAMBA network to evaluate EQTransformer’s detection accuracy in comparison to manual approaches.
Method: Phase Detection and Picking Using EQTransformer
We use the SeisBench implementation (v.0.2.3) of EQTransformer (Woollam et al., 2022) due to its straightforward installation, ongoing development, extensive documentation, and developer support. Specifically, we focus our testing on the “original” EQTransformer weights deployed in SeisBench. Our testing (code link in Data and Resources) reveals that the native EQTransformer and the SeisBench implementation yield consistent results, with numerical outputs differing only negligibly for 5000 random STEAD (Mousavi et al., 2019) cases.
EQTransformer was trained on the STEAD dataset (Mousavi et al., 2019), which includes earthquake and nonearthquake (noise) signals from global recordings. Seismograms in STEAD start 5–10 s prior to the P-arrival time. Despite the data augmentation techniques employed by Mousavi et al. (2020), EQTransformer performs best on new data when phase arrivals fall at positions within each analyzed window that resemble those in the training data. As shown in Figure 1, the detection of P and S arrivals depends strongly on where the phases arrive within the 60 s window. This detection inconsistency, also reported by Park et al. (2023), is present in both the SeisBench and native EQTransformer implementations, and affects the algorithm’s detection capability on continuous data.
In the following sections, we detail the results of tests conducted to determine optimal parameter values for applying EQTransformer to continuous data.
Results
Overlap length parameter
EQTransformer processes 60 s long data segments (Mousavi et al., 2020). To handle continuous data, EQTransformer segments the data into 60 s lengths, using moving windows with adjustable overlaps. The default overlap length in EQTransformer version 0.1.58 and SeisBench version 0.2.3 (“original” pretrained weights) is 30%, or 18 s. The native EQTransformer implementation makes independent picks in each 60 s window, which can lead to multiple picks for a single arrival in overlapping regions.
We have implemented a maximum-stacking approach in SeisBench to combine the probability time series from individual windows (see Data and Resources), avoiding duplicate picks while retaining all peaks that the native version of EQTransformer would identify. This process combines the 50 s long output probability time series from individual windows by selecting the maximum sample across overlaps, yielding one continuous probability time series within which unique picks can be made. However, this feature is only useful when prediction inconsistencies due to poor overlap values are minimized.
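The stacking logic can be summarized in a few lines. The sketch below is a minimal NumPy illustration of the idea, assuming the per-window probability arrays and the spacing between window starts are already known; it is not the SeisBench implementation itself (see Data and Resources for that), and the function and variable names are ours.

```python
import numpy as np

def max_stack(window_probs, step_samples, total_samples):
    """Combine per-window probability series into one continuous series by
    keeping the sample-wise maximum wherever windows overlap.

    window_probs  : list of 1-D arrays, one per prediction window, in order
    step_samples  : spacing between consecutive window starts, in samples
    total_samples : length of the continuous output series, in samples
    """
    stacked = np.zeros(total_samples)
    for i, probs in enumerate(window_probs):
        start = i * step_samples
        stop = min(start + len(probs), total_samples)
        # sample-wise maximum preserves every peak seen in any window
        stacked[start:stop] = np.maximum(stacked[start:stop], probs[: stop - start])
    return stacked
```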
To identify an optimal overlap length, we tested the EQTransformer “original” model on 24 hr long seismic waveforms for two days in 2009 and 2016, representing days with optimal station operations and data availability. A sufficient overlap length ensures that P-phase arrivals fall within the window position best suited for detection and picking (the first 10 s) in at least one window. With sufficient overlap, when the windows are combined by taking the maximum probabilities, the combined probabilities are representative of the best-case window even for randomly distributed phase arrivals.
We tested overlap lengths of 10–59 s. Because EQTransformer processes the input waveforms in 60 s windows at 100 Hz, the maximum possible overlap length is 59.99 s. To ensure that phase arrivals occurred at different positions within the 60 s window, we used Python’s (v3.7.6) “random” library to draw 15 random integer shifts from the range 0–59 s and applied them to the start times of the 24 hr long input waveforms. We processed each 24 hr long waveform with these 15 random start-time shifts for each overlap length.
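As an illustration of this procedure, the sketch below runs one overlap value over a single day with 15 random start-time shifts. It assumes SeisBench’s annotate interface (with the overlap given in samples at 100 Hz) and uses hypothetical file and channel names; the exact scripts we used are linked in Data and Resources.

```python
import random
import seisbench.models as sbm
from obspy import read

model = sbm.EQTransformer.from_pretrained("original")

day_stream = read("samba_day.mseed")          # hypothetical 24 hr, 3-component file
day_start = min(tr.stats.starttime for tr in day_stream)

random.seed(42)
shifts = random.sample(range(60), 15)         # 15 random start-time shifts, 0-59 s

overlap_s = 55                                # overlap length under test
prob_traces = []
for shift in shifts:
    shifted = day_stream.slice(starttime=day_start + shift)
    # SeisBench expects the overlap in samples (100 Hz data here)
    annotations = model.annotate(shifted, overlap=overlap_s * 100)
    # keep the detection-probability trace (channel naming may vary by version)
    det = annotations.select(channel="EQTransformer_Detection")[0]
    prob_traces.append(det)
```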
To quantitatively assess the dependence of detection performance on overlap length values, we cross correlated the detection probability time series obtained with the 15 different start times, for each overlap, using EQcorrscan (Chamberlain et al., 2018). The mean value of the 105 correlations was used to quantify the performance of the overlap length parameter. Low or zero overlap lengths cause noticeable changes in the probability time series when shifting the traces, indicating inconsistent detection performance. In contrast, high overlap lengths produce consistent probability time series, resulting in high correlations regardless of shift time.
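A simplified version of this consistency metric is sketched below. We used EQcorrscan in the study; here, for illustration only, a plain normalized zero-lag correlation is computed after aligning the traces on a common absolute time window, taking a list of probability traces such as the one built in the previous sketch.

```python
from itertools import combinations
import numpy as np

def consistency_score(prob_traces):
    """Mean pairwise correlation of detection-probability traces obtained with
    different start-time shifts (15 traces give 105 unique pairs).

    prob_traces : list of ObsPy Trace objects holding the detection
                  probabilities for the same day, one per random shift.
    """
    # align all traces on their common absolute time window before comparing
    start = max(tr.stats.starttime for tr in prob_traces)
    end = min(tr.stats.endtime for tr in prob_traces)
    data = [tr.slice(start, end).data for tr in prob_traces]
    n = min(len(d) for d in data)

    def corr(a, b):
        a = a[:n] - a[:n].mean()
        b = b[:n] - b[:n].mean()
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pairs = [corr(a, b) for a, b in combinations(data, 2)]
    return float(np.mean(pairs))   # one mean score per overlap length tested
```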
Figure 2 shows the results from the cross correlations for the two days tested. The plot indicates that overlap lengths of 50 s or more provide consistent detection probability outputs, evidenced by high cross-correlation means. This result is observed across all stations and in stations from the Swiss Seismological Service (SED; Swiss Seismological Service [SED] at ETH Zurich, 1983; see Text S1 and Fig. S1, available in the supplemental material to this article). However, EQTransformer’s computational run time must be considered: higher overlap lengths lead to longer run times because more windows are analyzed. For instance, processing a 24 hr long waveform from a single station (using a 12th Gen Intel Core i5-12500 with 12 threads) took 9 s with the default overlap, but this time increased to 56 s when using a 55 s overlap.
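The increase in run time follows directly from the number of windows processed. A rough count for one day of data, ignoring edge effects (our own arithmetic, not a benchmark), is:

```python
def n_windows(day_seconds=86_400, window=60, overlap=18):
    """Approximate number of 60 s prediction windows for one day of data."""
    step = window - overlap          # spacing between consecutive window starts
    return day_seconds // step

n_windows(overlap=18)   # about 2057 windows at the 18 s default overlap
n_windows(overlap=55)   # 17,280 windows at a 55 s overlap, roughly 8x more
```

The roughly eightfold increase in window count is broadly consistent with the observed run-time increase from 9 s to 56 s, with the difference plausibly reflecting fixed per-run overheads.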
Probability threshold
The primary output of EQTransformer is a time series of detection probabilities. A trigger then converts this time series into discrete detections; which detections are saved depends on the type of trigger used and on the minimum probability threshold. The default trigger used in SeisBench (but not in the native EQTransformer) must return to a base level (the trigger-off threshold) before another detection can be saved. The default probability threshold in SeisBench (v.0.2.3) for the “original” pretrained weights is 0.1 for both P and S picks.
To determine an optimal probability threshold, we first examined the relationship between the output probability and pick quality (accuracy). We ran EQTransformer for the entire catalog duration for all available SAMBA stations, processing nearly 9 yr of continuous data as 24 hr long waveforms with an overlap length of 55 s, using the default SeisBench trigger and a probability threshold of 0.01 for both P and S detections. We opted for a low probability threshold so that we could analyze how pick accuracy varies down to these minimal probability levels. To evaluate the impact of higher thresholds, we subsequently discarded detections whose probabilities fell below the higher values tested.
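The sketch below illustrates this processing for a single station-day, assuming SeisBench’s classify interface and its P_threshold/S_threshold keyword names; the network code (9F) and station name come from the SAMBA metadata, and the request itself is illustrative rather than the exact code we ran.

```python
import seisbench.models as sbm
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

model = sbm.EQTransformer.from_pretrained("original")
client = Client("IRIS")

day = UTCDateTime("2016-04-01")
# one SAMBA station-day; in practice this is looped over all stations and days
st = client.get_waveforms("9F", "EORO", "*", "*", day, day + 86_400)

# overlap in samples (55 s at 100 Hz); deliberately low 0.01 thresholds so that
# higher thresholds can be evaluated later by filtering the saved picks
output = model.classify(st, overlap=55 * 100,
                        P_threshold=0.01, S_threshold=0.01)
# depending on the SeisBench version, `output` is a (picks, detections) tuple
# or a ClassifyOutput object exposing a .picks attribute
```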
We compared all of the detections made by EQTransformer against the 85,733 manual picks from Michailos et al. (2019) to determine how many of the earthquakes in the catalog were also detected by EQTransformer. A manual pick was considered recalled if the closest EQTransformer pick fell within 2 s of it. At the lowest probability threshold of 0.01, EQTransformer matched 91% of the manual picks; when filtering picks to a threshold of 0.1 (10 times higher), 86% of the manual picks had a match. Stations REYN and MTBA exhibited the highest percentages of missed detections, with rates of 49% and 19%, respectively, at the 0.01 threshold. REYN suffered from noncontinuous data and channel issues, whereas MTBA had initial configuration errors. For higher-quality stations, missed detections mainly stemmed from earthquakes with low signal-to-noise ratios.
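The matching rule reduces to a nearest-neighbour search with a 2 s tolerance; a minimal sketch (our own helper, with hypothetical argument names) is:

```python
import numpy as np

def recall(manual_times, auto_times, tolerance=2.0):
    """Fraction of manual picks that have an EQTransformer pick within
    `tolerance` seconds; times are given as floats (e.g., epoch seconds)."""
    auto = np.sort(np.asarray(auto_times, dtype=float))
    matched = 0
    for t in manual_times:
        i = np.searchsorted(auto, t)
        # the nearest automatic pick is one of the two neighbours of the
        # insertion point
        candidates = auto[max(i - 1, 0): i + 1]
        if candidates.size and np.min(np.abs(candidates - t)) <= tolerance:
            matched += 1
    return matched / len(manual_times)
```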
Figure 3 illustrates the relationship between the probability given by EQTransformer and the pick accuracy (the time difference between the EQTransformer and manual picks). As shown in panels (a) and (c) of Figure 3, the probability assigned by EQTransformer does not exhibit a direct linear correlation with pick accuracy across the analyzed statistics, suggesting that the assigned probability is not a definitive indicator of pick accuracy. In fact, high pick accuracy is consistently observed regardless of the associated probability. The statistics also show that P picks are more accurate than S picks. As illustrated in the histograms in Figure 3, the distribution of time differences is generally symmetrical. Both the statistical distributions and the histograms indicate that a significant majority of the picks have a high probability (85% of P-phase picks and 56% of S-phase picks are above a 0.7 probability). Furthermore, 87% of P-phase picks and 86% of S-phase picks differ from the manual picks by less than 0.25 s.
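The probability-interval statistics behind panels (a) and (c) can be reproduced with a simple grouping of matched picks by probability bin. The sketch below assumes a hypothetical table with one row per matched pick and columns named probability and residual (EQTransformer minus manual time, in seconds).

```python
import numpy as np
import pandas as pd

picks = pd.read_csv("matched_picks_P.csv")   # hypothetical file of matched P picks

bins = np.arange(0.0, 1.01, 0.1)             # 0.1-wide probability intervals
by_prob = picks.groupby(pd.cut(picks["probability"], bins))["residual"]
summary = by_prob.agg(["count", "mean", "std", lambda r: (r.abs() < 0.25).mean()])
summary.columns = ["n", "mean_residual", "std_residual", "frac_within_0.25s"]
print(summary)
```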
We used the same EQTransformer parameters in a similar process with 1369 picks from the SED, and the results were comparable to those from the central Alpine fault (see Text S2 and Fig. S2). However, the SED dataset contains many fewer measurements, and there is even less indication of a monotonic relationship between probability and pick accuracy. Remarkably, 96% of the P and S picks were recalled using a probability threshold of 0.1. Of these, 79% had a high probability (above 0.7), and 96% of P-phase picks and 94% of S-phase picks differ from the manual picks by less than 0.25 s. Additional details and visualizations can be found in the supplemental material. In conclusion, although the output probability is not always indicative of pick quality, the chosen threshold likely plays a significant role in influencing the false detection rate.
True positive and false positive detections
To assess the proportion of EQTransformer detections that correspond to actual earthquakes, as inferred by Michailos et al. (2019), we manually inspected random detections from two test periods, in July 2009 and April 2016. We tested three different picking thresholds: 0.7, 0.1, and 0.01. For each threshold, we randomly selected 100 detections from each period for both P- and S-labeled detections, so that we visually inspected and categorized a total of 400 detections (200 P and 200 S) per threshold as “True,” “False,” or “Undecided.” The “Undecided” category became necessary because many detections had not been previously labeled and had low signal-to-noise ratios, making it challenging to determine conclusively whether they corresponded to seismic events.
As depicted in Figure 4, employing a lower threshold resulted in a notable increase in false positive picks. A significant number of these false positives were digital spikes attributed to the solar controller, or real signals related to quarry activity, glacier movement, or landsliding. For a threshold of 0.01, more than 50% of detections were confirmed to be false positives for P phases, and an even higher percentage for S phases. Although more detections were recalled at the 0.01 threshold, the very high number of false positives is likely to impact the performance of associators, as discussed by Zhu et al. (2022). To optimize the results of association, false positives from known sources could be removed using matched-filtering techniques. In our case, the digital spikes from the solar controller all have the same appearance and can be readily identified and removed.
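As a sketch of how such known false positives could be screened out, the example below correlates one channel of continuous data against a stored spike template, assuming ObsPy’s correlate_template utility and hypothetical file names; in practice a dedicated matched-filter package such as EQcorrscan would be used.

```python
import numpy as np
from obspy import read
from obspy.signal.cross_correlation import correlate_template

st = read("samba_day.mseed")                       # hypothetical continuous data
spike = read("solar_controller_spike.mseed")[0]    # hypothetical spike template

tr = st.select(channel="*Z")[0]
cc = correlate_template(tr.data, spike.data)       # normalized, 'valid' mode
hits = np.flatnonzero(cc > 0.8)                    # samples where the template matches

# contiguous runs of indices correspond to a single match; any EQTransformer
# detection within ~1 s of a match time can then be discarded
dt = tr.stats.delta
match_times = [tr.stats.starttime + int(i) * dt for i in hits]
```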
Secondary phases
Our final test with EQTransformer aimed to determine its ability to reliably detect first arrivals when multiple phase arrivals are evident, for example a Pn arrival followed by the direct Pg phase.
To investigate this, we analyzed picks from the central Alpine fault catalog, picks made by the SED (Swiss Seismological Service [SED] at ETH Zurich, 1983), and picks from the Southern California Earthquake Data Center (Southern California Earthquake Data Center [SCEDC], 2013). Figure 5 displays four distinct examples for which SeisBench’s EQTransformer implementation detects two P arrivals but only one is retained: because the trigger-off threshold is not met between the two detections, only the highest peak above the trigger threshold is saved. This observation remained consistent when using both maximum and average stacking (whereby the mean of overlapping windows is returned rather than the maximum) to process the overlapping windows.
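This behavior can be reproduced with a short synthetic example. The sketch below uses ObsPy’s trigger_onset as a stand-in for the SeisBench trigger (an assumption for illustration; the thresholds and the synthetic probability trace are ours).

```python
import numpy as np
from obspy.signal.trigger import trigger_onset

# synthetic P-probability trace with two peaks (e.g., Pn followed by Pg)
prob = np.zeros(600)
prob[100:140] = 0.5     # first arrival: lower probability peak
prob[140:180] = 0.3     # probability stays above the trigger-off level
prob[180:220] = 0.9     # secondary arrival: higher probability peak

# trigger-on and trigger-off thresholds both at 0.1
onsets = trigger_onset(prob, 0.1, 0.1)
# a single [on, off] interval spans both peaks, so only one pick is saved
for on, off in onsets:
    pick_idx = on + int(np.argmax(prob[on:off]))
    # pick_idx lands on the higher, secondary peak (~index 180), not the first arrival
```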
We observed variable results for secondary phases, as demonstrated in Figure 5. In some cases (e.g., Fig. 5a,b), the second-arriving phase has the higher probability and the incorrect phase is returned. In other cases (e.g., Fig. 5c), the first arrival is correctly returned because it is assigned the higher probability, despite the secondary arrival having a higher amplitude in the raw seismogram. These issues are not restricted to real seismic phases: as shown in Figure 5d, pre-event digital noise occurring close in time to the first arrival can likewise obscure real phase arrivals.
These results indicate that the standard trigger routine in SeisBench is not fully appropriate for parsing the probability time series. Although automated picking and detection methods have advanced significantly in recent years, they can still fail in specific scenarios. Attributing higher probability to the wrong arrival can create challenges in associating detections and diminish the accuracy of event locations. In addition, EQTransformer’s assignment of probabilities is not systematic in these cases; for example, it does not simply assign higher probability to the phase with higher amplitude. Nevertheless, these results demonstrate the possibility of using such machine learning pickers to detect phases beyond the first P and S arrivals. Future models may allow the labeling of phases other than P and S, although this would introduce substantial complexity.
Discussion and Conclusions
To thoroughly assess the effects of key parameters on EQTransformer’s performance, we tested the algorithm against 85,733 manual picks from an extensive 9 yr long catalog from the central Alpine fault. These picks were not part of EQTransformer’s training set, making them independent observations and thereby enabling an unbiased evaluation of EQTransformer’s performance. We also tested 1369 picks from the SED to corroborate the results. We recognize that the SAMBA data used in these tests vary significantly in quality, which poses challenges not only for EQTransformer but also for analysts manually detecting and picking earthquakes.
Our testing corroborates the finding of Park et al. (2023) that EQTransformer’s detection performance varies noticeably with the waveform overlap length. Longer overlap lengths gave more consistent results, evidenced by high cross-correlation values irrespective of shift time. However, longer overlap lengths also produce correspondingly longer EQTransformer run times because more windows are analyzed. Taking into account both computational time and our quantitative analysis, we consider an overlap length of 55 s adequate for detecting seismicity in continuous data. Should additional computational resources and time be available, an overlap length of 59 s would provide even more consistent performance.
We analyzed the relationship between the probability given by EQTransformer and the pick accuracy. Our analysis, segmented by probability intervals, reveals that the probability assigned by EQTransformer does not consistently reflect pick accuracy. Pick accuracy was consistently high for both the Alpine fault and SED datasets, despite their different sample sizes. Consequently, using the EQTransformer-generated probability as a measure of pick uncertainty, or as pick weights in downstream processes (such as association or event location), requires careful consideration.
A particular challenge we identified was high-amplitude secondary arrivals and noise, such as digital spikes, being misidentified as first-arriving P phases. The probabilities assigned by EQTransformer can cause secondary arrivals or digital noise to be treated as first arrivals. However, this is not always the case: EQTransformer sometimes correctly assigns higher probabilities to first arrivals despite their lower amplitude in raw seismograms, indicating that this is not a systematic error. We found an increased frequency of this issue at stations where reflections and/or refractions are more common, such as station EORO in the SAMBA network. Recognizing the specific issues or tendencies in earthquakes recorded at these stations is crucial for optimizing the use of EQTransformer’s output. Such misclassifications can complicate the association of detections and reduce the accuracy of event locations. For such scenarios, we recommend implementing a trigger that records all peaks, bypassing the need to reach a trigger-off threshold, as sketched below.
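A minimal sketch of such an all-peaks trigger, using SciPy’s find_peaks on the stacked probability trace (an illustration of the recommendation rather than an implemented SeisBench feature), is:

```python
import numpy as np
from scipy.signal import find_peaks

def all_peak_picks(prob, threshold=0.1, min_separation=100):
    """Return every local maximum of a probability trace above `threshold`,
    without requiring the trace to drop below a trigger-off level in between.

    min_separation is the minimum spacing between picks in samples
    (here 1 s at 100 Hz) to avoid duplicate picks on a single arrival.
    """
    peaks, props = find_peaks(prob, height=threshold, distance=min_separation)
    return peaks, props["peak_heights"]
```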
Our evaluation of EQTransformer has reinforced its potential as a robust and efficient tool for seismic event detection and phase picking on datasets it has not previously encountered. With a recall of 91% at a probability threshold of 0.01, the algorithm identifies the large majority of catalogued events. However, low probability thresholds also produce a significant increase in false positive detections, which can lead to incorrect pick associations. Using a threshold of 0.1 yields an 86% recall for our dataset. Selecting a probability threshold much higher than 0.1 would further reduce false positives but would also compromise the detection rate when using the original EQTransformer model. Subsequent processes such as association or relative relocation can help discard false positives.
EQTransformer presents an efficient solution for seismic event detection and phase picking, but care must be taken in choosing appropriate parameters. Users should be aware of the importance of selecting an optimal overlap length, the possibility of erroneous picks in the presence of secondary phases, the likelihood of false positives at lower thresholds, and the question of which output parameter (probability or otherwise) best indicates pick accuracy.
Data and Resources
Data from the Southern Alps Micro-earthquake Borehole Array (SAMBA) network can be accessed through the Incorporated Research Institutions for Seismology Data Management Center (IRIS-DMC) at https://www.fdsn.org/networks/detail/9F_2008/. Pick catalogs and waveforms from the Swiss Seismological Service (SED; Swiss Seismological Service [SED] at ETH Zurich, 1983) and Southern California Earthquake Data Center (SCEDC; SCEDC, 2013) agencies were accessed through the International Federation of Digital Seismograph Networks (FDSN) using ObsPy. The codes used for comparing the SeisBench implementation with the native EQTransformer, as well as those for testing overlap and probability threshold in the SED data, can be downloaded at doi: 10.5281/zenodo.8271943. The catalogs containing the time differences can be found in the same location. Information about maximum-stacking in SeisBench is available at https://github.com/seisbench/seisbench/pull/99. All websites were last accessed in November 2023. The supplemental material provides plots and results derived from the SED data.
Declaration of Competing Interests
The authors acknowledge that there are no conflicts of interest recorded.
Acknowledgments
The authors gratefully acknowledge financial support from the Marsden Fund of the Royal Society Te Apārangi (Project Number VUW2008), Toka Tū Ake EQC, via the EQC Programme in Earthquake Seismology and Tectonic Geodesy, and a Victoria University of Wellington—Te Herenga Waka Doctoral Scholarship. The authors thank the Editor, Associate Editor, Jannes Münchmeyer, Stephen Hicks, and an anonymous reviewer for comments, which helped improve the article.