Machine learning (ML) earthquake phase detection algorithms continue to gain popularity and are routinely used to generate research catalogs with thousands of previously uncatalogued events. Many ML algorithms are available pretrained on large, global data sets. These pretrained models promise regional transferability and application within real‐time monitoring organizations. However, the adoption of these ML algorithms by monitoring agencies requires trusted performance across a wide range of seismic monitoring challenges. We apply a pretrained algorithm to four characteristic studies representing a range of network, tectonic, and environmental challenges. We establish a three‐catalog comparison framework between our ML catalog, a real‐time catalog, and an analyst reviewed catalog. We visually assess and label all ML and real‐time events not found in our analyst reviewed catalog. Finally, we subset all additional events that match our catalog standards, establishing a one‐to‐one performance comparison between our ML and real‐time algorithmic catalogs. For each study, we find our ML catalog provides a consistently higher match to our analyst reviewed catalog than the real‐time catalog. However, the ML catalog from each study introduces additional complexities, including the addition of many poorly constrained, smaller magnitude events; the misidentification of nonearthquake signals; and the missed detection of large magnitude, felt earthquakes. These discrepancies warrant further training data set scrutiny and suggest that the establishment of a location‐based training data set is necessary for consistent and reliable ML performance.

KEY POINTS

  • Globally pretrained machine learning (ML) phase detection algorithms claim broad transferability.

  • ML shortfalls in replicating analyst reviewed catalogs may be addressable with location‐based training data.

  • Incorporating ML approaches at monitoring agencies could reduce personnel time for routine review tasks.

Advancing research demonstrates the potential to improve real‐time earthquake monitoring by incorporating machine learning (ML) methods (Yeck et al., 2021; Zhu, Hou, et al., 2022). Combining traditional and ML approaches promises to improve the quality and completeness of hypocenter solutions while reducing manual workloads (Mousavi and Beroza, 2022a,b; Kubo et al., 2024), even as seismic networks expand to acquire increasingly larger data volumes (Pankow et al., 2020; Ebel et al., 2020; Ruppert and West, 2020). These data volumes provide the foundation for more complete seismic catalogs that are supported by continuously increasing numbers of seismic phases. This catalog completeness even incorporates exotic events (e.g., glacial quakes, landslides, mining blasts, and explosions), representing significant progress for the field of seismology. However, increased data volumes heighten demands placed on seismic analysts. The same technologies that enable larger data volumes and ML advancements have also proliferated in consumer devices. Continually connected smartphones have shifted societal expectations toward instantaneous information. The confluence of increasing populations in earthquake‐prone regions, expanding seismic data volumes, and public intolerance for delayed information highlights the need for innovation in routine seismic data processing.

Workflows for cataloging seismicity generally rely on three tasks: detection, association, and location. Monitoring organizations approach these tasks with amplitude‐ratio and short‐time average over long‐time average (STA/LTA) sliding‐window detection methods, predefined travel‐time pattern association methods, and inversion‐based location methods. Within the past decade, the application of ML algorithms retrospectively to individual tasks (e.g., detection: Ross et al., 2018; Mousavi, Zhu, et al., 2019; Zhu and Beroza, 2019; Mousavi et al., 2020; association: McBrearty et al., 2019; Ross et al., 2019; Yu and Wang, 2022; location: Perol et al., 2018; Saad et al., 2022), and as a cohesive workflow (e.g., Zhang et al., 2022; Zhu, Tai, et al., 2022; Si et al., 2024), has shown promising automated catalog improvements. Many of these studies highlight larger earthquake catalogs with lower magnitudes of completeness (e.g., Mousavi et al., 2020). In contrast, our objective is to produce automated earthquake solutions that meet or exceed human performance for cataloged events.
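
To make the traditional detection baseline concrete, the short sketch below applies a classic STA/LTA trigger with ObsPy. The window lengths, thresholds, and demo data are illustrative assumptions, not AEC operational parameters.

```python
# Minimal STA/LTA detection sketch (illustrative parameters, not AEC's).
import obspy
from obspy.signal.trigger import classic_sta_lta, trigger_onset

tr = obspy.read()[0]                 # ObsPy's bundled example trace
df = tr.stats.sampling_rate
# Characteristic function: 1 s short-term vs. 10 s long-term average.
cft = classic_sta_lta(tr.data, int(1 * df), int(10 * df))
# Declare a detection when STA/LTA exceeds 3.5; release below 1.0.
onsets = trigger_onset(cft, 3.5, 1.0)
for on, off in onsets:
    print(tr.stats.starttime + on / df, "->", tr.stats.starttime + off / df)
```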

Analyst derived catalogs are known to contain errors. In fact, ML algorithms have proven to be powerful in identifying errors (Kharita et al., 2024). However, for all but the smallest of events, analyst reviewed catalogs are the standard and preferred product from major seismic networks. Therefore, our study benchmark is an analyst reviewed catalog. We compare two automated, algorithmic catalogs against this benchmark: the current real‐time catalog produced by the Alaska Earthquake Center (AEC) and an ML‐based catalog that uses an ML method for phase detection while holding the association and location steps fixed. Phase detection and labeling is, by far, the weakest link in our current real‐time workflow and the step most likely to be improved by ML approaches. For this reason, we focus the current study on the expected contributions from the integration of ML‐based phase detection. We quantify each algorithmic catalog through comparison with the published analyst reviewed catalog. We then measure algorithmic catalog success by reproduction of phase arrivals and hypocenters in the published catalog.

We apply our three‐catalog comparison framework to study regions selected to represent diverse monitoring tasks found globally. An ideal station network configuration records events with full azimuthal coverage and no extraneous noise. However, many regional seismic networks are faced with offshore seismicity monitored from distant landmasses or isolated island networks with high environmental noise. Therefore, we select four study regions from diverse tectonic, geographic, and network configurations with a focus on times of elevated seismicity, volcanism, or glacial activity. Each study region highlights a different seismic monitoring challenge, ranging from high‐quality data and excellent station coverage, to data heavily impacted by environmental noise and geographically restricted station coverage. These study regions, monitored by AEC, are the Purcell Mountains in northwestern Alaska, Icy Bay in southeastern Alaska, a segment of the Aleutian Island Arc between the Andreanof and Fox Islands, and the vigorous 24 hr aftershock sequence beginning with the 30 November 2018 M 7.1 Anchorage earthquake (Fig. 1). To each study region, we apply a globally pretrained ML earthquake phase detection algorithm and analyze how closely each algorithmic (real time, ML) catalog replicates our analyst benchmark catalog. Our three‐catalog approach provides a framework to quantify the performance benefit of ML‐based phase detection and assess the trade‐off of additional, algorithm‐introduced, poorly constrained earthquake solutions, and nonearthquake events.

For each study region, we establish three earthquake catalogs—an analyst reviewed catalog (Ac), a real‐time catalog (RTc), and a machine learning catalog (MLc). A comparison of the Ac to the RTc (or MLc) provides a direct comparison of events and phases added (or deleted) by analysts during catalog preparation. This RTc‐to‐Ac (or MLc‐to‐Ac) comparison is a measure of human‐to‐algorithm performance. High human‐to‐algorithm performance means the algorithm closely reproduces the Ac, including all earthquakes and phases. Comparison between the RTc‐to‐Ac and MLc‐to‐Ac results provides a direct evaluation of the current real‐time algorithm against the ML method. We pair this algorithm‐to‐algorithm comparison with a visual review and labeling of all additional RTc and MLc events not included in the Ac.

For each catalog, our unified data source is the in‐house AEC waveform archives, which comprise data from the AK, AT, AV, CN, II, IM, and IU networks (IM: Various Institutions, 1965; AT: National Oceanic and Atmospheric Administration, 1967; CN: Natural Resources Canada, 1975; II: Scripps Institution of Oceanography, 1986; AK: Alaska Earthquake Center, 1987; AV: Alaska Volcano Observatory, 1988; IU: Albuquerque Seismological Laboratory/USGS, 1988). We tackle catalog creation through three tasks of detection, association, and location (Fig. 2). Detections for the RTc and MLc are two distinct workflows. The RTc uses the Antelope detection tool, orbdetect, and its offline companion, dbdetect, on 1 Hz high‐pass‐filtered waveforms (Antelope version 5.11; Boulder Real Time Technologies, Inc. [BRTT], 2024). Detections for the RTc represent impulsive signals from any seismogenic source and are not phase labeled. In contrast, detections for the MLc target earthquake signals and designate P‐ and S‐phase labels. We compile MLc detections with a standalone EQTransformer installation (Mousavi et al., 2020) using 1 Hz high‐pass filtered, 100 Hz up‐sampled waveforms broken into 60 s segments with 30 s of overlap. We choose EQTransformer trained on global STEAD (Mousavi, Sheng, et al., 2019) for its pretrained availability, high performance in available benchmarking studies (Münchmeyer et al., 2022), and wide use for retrospective seismic catalog generation and reprocessing (e.g., Jiang et al., 2022; Kapetanidis et al., 2024; Peña Castro et al., 2025). Although there are other ML phase detection and labeling algorithms (e.g., Ross et al., 2018; Zhu and Beroza, 2019; Saad et al., 2022) and labeled earthquake data sets available (e.g., Magrini et al., 2020; Michelini et al., 2021; Cole et al., 2023; Aguilar Suarez and Beroza, 2024), we leverage EQTransformer and STEAD as a proxy for this broader class of models and training data sets. Creating only one ML catalog allows for a detailed visual review of all events and phases. A comprehensive visual review would not be feasible with many large ML‐generated catalogs. Finally, although benchmarked ML algorithm performance is variable, other authors have demonstrated that high‐performing ML algorithms produce reasonably consistent results across the broad‐scale monitoring challenges represented by each study. Münchmeyer et al. (2022) demonstrates remarkably consistent performance between EQTransformer (Mousavi et al., 2020), generalized phase detection (Ross et al., 2018), and PhaseNet (Zhu and Beroza, 2019) on both in‐domain and cross‐domain data sets.
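
For illustration, the sketch below reproduces this detection configuration using the SeisBench packaging of the pretrained EQTransformer weights. The study itself used a standalone EQTransformer installation, so treat the SeisBench API (a recent release, where classify returns a ClassifyOutput), the "original" weight name, and the file path as assumptions.

```python
# Hedged sketch of the MLc detection step via SeisBench's EQTransformer
# packaging ("original" = the STEAD-trained weights of Mousavi et al., 2020).
import obspy
import seisbench.models as sbm

model = sbm.EQTransformer.from_pretrained("original")

st = obspy.read("station_day.mseed")   # hypothetical waveform file
st.detrend("demean")
st.filter("highpass", freq=1.0)        # 1 Hz high-pass, as in the text
st.resample(100.0)                     # up-sample to 100 Hz

# 60 s analysis frames are built internally; overlap is given in samples,
# so 3000 samples at 100 Hz reproduces the 30 s overlap described above.
output = model.classify(st, overlap=3000)
for pick in output.picks:              # P- and S-labeled detections
    print(pick.trace_id, pick.phase, pick.peak_time, pick.peak_value)
```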

Detections for both algorithmic catalogs, either real‐time unlabeled or ML phase labeled, are fed into Antelope’s grid association algorithm, which is set up with AEC operational parameters. The association algorithm sorts detections into possible travel‐time patterns defined by a grid of hypocenter locations. Unlabeled real‐time detections can be sorted as either P or S phases, whereas ML phase labels are honored. If a group of detections is matched with a grid hypocenter, that hypocenter is automatically relocated off grid with genloc (Pavlis et al., 2004). After detection, association, and location tasks are finished, both the RTc and MLc are complete. The RTc is then reviewed by seismic data analysts to produce the Ac, following AEC’s manual review and quality control procedures. Manual review procedures include visual scans of waveform data to incorporate undetected events, the addition and correction of phase arrivals, relocation of hypocenter solutions, and recalculation of magnitudes. AEC quality control checks impose restrictions on hypocenter solutions, including minimum numbers of P‐ and S‐phase arrivals and allowable location and depth errors. The published AEC Ac does not comprehensively catalog all events and phases. Ac catalog completeness is approximately M 1.0 for mainland Alaska and M 1.5 for the Aleutian arc. However, all published events and phases have been extensively reviewed, providing a quality, standardized benchmark for catalog comparison.
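
A toy version of this association logic is sketched below: each candidate grid node predicts a travel‐time pattern, ML phase labels are honored while unlabeled real‐time detections may count as either phase, and a node that collects enough consistent detections yields a provisional hypocenter. The velocities, tolerances, and data structures are illustrative assumptions; the operational system is Antelope’s grid associator followed by off‐grid relocation with genloc.

```python
# Toy grid-association sketch (assumed straight-line velocities; the
# operational associator uses precomputed travel-time grids).
import math

VP, VS = 6.0, 3.5  # km/s, illustrative crustal velocities


def travel_times(node, stations):
    """Predicted P and S travel times from a grid node to each station."""
    tts = {}
    for name, (sx, sy) in stations.items():
        dist = math.hypot(sx - node[0], sy - node[1])
        tts[name] = {"P": dist / VP, "S": dist / VS}
    return tts


def associate(detections, grid, stations, tol=1.5, min_hits=4):
    """Find a grid node whose arrival pattern explains enough detections.

    detections: (station, time, label) triples; label is "P", "S", or None
    for unlabeled real-time picks, which may count as either phase.
    """
    for node in grid:
        tts = travel_times(node, stations)
        # Trial origin times: treat each detection as a P arrival at this node.
        for sta0, t_det, _ in detections:
            t0 = t_det - tts[sta0]["P"]
            hits = 0
            for sta, t, label in detections:
                phases = [label] if label else ["P", "S"]  # honor ML labels
                if any(abs(t - (t0 + tts[sta][ph])) <= tol for ph in phases):
                    hits += 1
            if hits >= min_hits:
                # A real system would now relocate off-grid (genloc).
                return node, t0
    return None
```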

After all catalogs are established, we match events between each algorithmic catalog (RTc, MLc) and the Ac. Events are matched by hypocenter location with a latitude and longitude spatial tolerance, time tolerance, and magnitude tolerance. Tolerance values, presented in Table 1, were determined to allow as many matched events as possible while remaining restrictive enough not to match multiple algorithmically detected events to a single analyst solution. We note that our event comparison criteria are more restrictive than what is used in the Advanced National Seismic System (ANSS) Comprehensive Catalog (ComCat). For each comparison—RTc‐to‐Ac and MLc‐to‐Ac—the Ac is our benchmark, and all events are labeled as matched, missed, or other. Matched events exist in both the Ac and algorithmic catalog. Missed events exist only in the Ac and have no algorithmic catalog solution. Other events exist only in the algorithmic catalog and have no Ac solution. Other events present a unique challenge. These events fall into several categories encompassing earthquakes not included in the Ac; volcanic, glacial, or landslide signals; teleseismic signals; and seismic noise. All other events are further labeled through visual review, performed by a trained seismic data analyst, as earthquake or nonearthquake signals. Finally, on all RTc/MLc other earthquakes, we impose minimum catalog inclusion criteria derived from AEC’s manual review and quality control procedures. These criteria require an earthquake to have at least six P‐ and two S‐phase arrivals, a calculable magnitude, hypocenter error tolerances within the maximum error range for the study region Ac, and a maximum azimuthal gap of less than 180° where geographically applicable. The subset of other earthquakes passing minimum inclusion criteria represents events that could be included in the Ac if identified through visual scans. The nonpassing subset likely represents small‐magnitude events with limited station coverage. Through comprehensive visual review and by imposing minimum catalog inclusion criteria, we mimic current manual seismic data analyst workflows and develop a complete understanding of both algorithmic catalogs.
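
The sketch below shows one way to implement the matched/missed/other labeling and the minimum inclusion check. The tolerance values are placeholders (the values actually used are given in Table 1), and the event fields are illustrative.

```python
# Sketch of the three-way catalog comparison and inclusion check.
# Tolerances are placeholders; see Table 1 for the values actually used.
from dataclasses import dataclass


@dataclass
class Event:
    time: float            # origin time, epoch seconds
    lat: float
    lon: float
    mag: float
    n_p: int = 0           # number of P-phase arrivals
    n_s: int = 0           # number of S-phase arrivals
    az_gap: float = 360.0  # azimuthal gap, degrees


TIME_TOL, SPACE_TOL, MAG_TOL = 10.0, 0.5, 1.0  # s, deg, mag units (assumed)


def is_match(a: Event, b: Event) -> bool:
    return (abs(a.time - b.time) <= TIME_TOL
            and abs(a.lat - b.lat) <= SPACE_TOL
            and abs(a.lon - b.lon) <= SPACE_TOL
            and abs(a.mag - b.mag) <= MAG_TOL)


def compare(ac, algc):
    """Label algorithmic events against the analyst benchmark (Ac)."""
    matched, other, used = [], [], set()
    for ev in algc:
        # One-to-one pairing: an Ac solution may be claimed only once, so
        # duplicate algorithmic detections fall into "other".
        hit = next((i for i, ref in enumerate(ac)
                    if i not in used and is_match(ev, ref)), None)
        if hit is None:
            other.append(ev)
        else:
            used.add(hit)
            matched.append(ev)
    missed = [ref for i, ref in enumerate(ac) if i not in used]
    return matched, missed, other


def passes_inclusion(ev: Event) -> bool:
    # Minimum criteria from the text: >= 6 P and >= 2 S arrivals and an
    # azimuthal gap under 180 deg (error tolerances omitted for brevity).
    return ev.n_p >= 6 and ev.n_s >= 2 and ev.az_gap < 180.0
```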

Our result is two labeled catalog comparisons per study region. These RTc‐to‐Ac and MLc‐to‐Ac labeled comparisons capture the events that analysts must manually add to, or remove from, each algorithmic catalog. This workload is directly measurable from our missed, matched, and other results. Missed events highlight algorithmic shortcomings. RTc missed events showcase earthquake signals that are either not sufficiently impulsive or have insufficient signal‐to‐noise for STA/LTA style detection. In contrast, MLc missed events represent a statistical departure from the ML‐learned earthquake and phase definitions. Nonearthquake other events showcase areas for targeted algorithmic improvements. In the RTc, these events are generally impulsive, high‐signal‐to‐noise recordings that are prolific at stations with high environmental noise. However, in the MLc, these signals sufficiently match the ML‐learned earthquake and phase definitions. Finally, visually confirmed other earthquakes that meet minimum catalog inclusion criteria imply room for network catalog improvements that may be achievable with ML incorporation.

Each study region represents unique monitoring challenges and seismic signals. For each, we summarize tectonic structure, network configuration, chosen data range, and seismicity characteristics that make each region a valuable case study. Full results for each study region are presented in Table 2.

Purcell Mountains, northwestern Alaska

The Purcell Mountains are located in northwestern Alaska in a region of comparatively low seismicity (Fig. 1). The Purcell region has historically produced earthquakes above magnitude 4 and exhibits periods of pronounced activity (Estabrook et al., 1988). In 2019, activity increased significantly following a magnitude 4.8 event on 10 February and has since produced over 9600 cataloged events (Ruppert, 2023). These events are shallow, crustal earthquakes with impulsive phase arrivals that are clearly discernible. A combination of low environmental noise; clear, impulsive phase arrivals; and full station azimuthal coverage makes Purcell the ideal case for ML applications in Alaska. We present results for the Purcell study, which spans one month of observations from February 2019, encompassing the beginning of the Purcell Mountains swarm (Table 2).

The Purcell Ac contains 263 events with magnitudes ranging from 0.7 to 4.8. The RTc has 535 events, with RTc‐to‐Ac comparison resulting in 245 matched events, 18 missed events, and 290 other events. Visual review of RTc other events identifies 224 events as earthquakes, 101 of which pass catalog inclusion criteria. Our MLc includes 979 events. MLc‐to‐Ac comparison returns 258 matched events, 5 missed events, and 721 other events. Of the MLc other events, 716 are earthquakes with 194 passing minimum catalog inclusion criteria.

2018 M 7.1 Anchorage earthquake, southcentral Alaska

In late November 2018, a magnitude 7.1 earthquake occurred 50 km beneath Anchorage (Fig. 1). The mainshock was followed by a vigorous aftershock sequence with over 300 felt events in the first 6 months. Within the first 8 months, more than 10,000 events were cataloged, including over 200 events of magnitude 4 or greater (Ruppert et al., 2020). The 2018 Anchorage mainshock was a normal faulting, intraslab earthquake (West et al., 2020), a source mechanism that Powers et al. (2024) find to be one of the greatest contributors to seismic hazard in southcentral Alaska. The Anchorage earthquake sequence had full station azimuthal coverage; however, station quality varies, with both anthropogenic and site‐specific environmental noise. We present results from 24 hr of seismic activity beginning with the 2018 M 7.1 Anchorage mainshock and including 4 M 5+ events, 19 M 4–5 events, and 166 M 3–4 events (Table 2).

The Anchorage Ac encompasses 1274 events ranging in magnitude from 1.2 to 7.1. The RTc contains 839 events, with RTc‐to‐Ac comparison resulting in 743 matched events, 531 missed events, and 96 other events. Visual review of RTc other events returns 48 earthquakes, 21 of which meet catalog criteria. Our Anchorage MLc includes 1263 events. MLc‐to‐Ac comparison results in 1015 matched events, 259 missed events, and 248 other events. Review of MLc other events identifies 144 as earthquakes, with 47 meeting catalog criteria.

Icy Bay, southeastern Alaska

The Icy Bay study region, located in southeastern Alaska, is situated at a tectonic hinge point along the North American and Pacific plate boundary (Fig. 1). This region is a transition zone between the Fairweather fault transform boundary to the east and subduction zone to the west (Elliott et al., 2013). The Icy Bay study region is part of the Chugach‐St. Elias fold‐and‐thrust belt (Pavlis et al., 2012), with seismicity observed on both strike‐slip and thrust faults at shallow crustal depths. These signals are complex, ranging from impulsive to emergent phase arrivals, and recorded with limited station azimuthal coverage. True to its name, Icy Bay has abundant environmental noise originating from tidewater glaciers (O’Neel et al., 2010). With higher environmental noise, restricted station coverage, and complex signals, Icy Bay represents a unique monitoring challenge. We present Icy Bay results spanning September and October 2022 near the annual peak of glacial activity (Table 2) (Ruppert, 2023).

The Icy Bay Ac reports 330 events—including 44 glacial quakes—with magnitudes ranging from 0.5 to 5.2. The RTc detects 445 events. RTc‐to‐Ac comparison results in 158 matched events, including 13 glacial quakes; 178 missed events; and 293 other events. Visual review of RTc other events finds 137 earthquakes and 22 glacial quakes. Of the 137 RTc other earthquakes, 3 meet catalog criteria. Our Icy Bay MLc includes 753 events. MLc‐to‐Ac comparison produces 224 matched events, including 11 glacial quakes; 106 missed events; and 529 other events. Visual review labels 348 other events as earthquakes and 43 as glacial quakes. Of all MLc other labeled earthquakes, 76 meet catalog criteria.

Andreanof to Fox Islands, Aleutian Islands

The 2000 km volcanic Aleutian Island Arc is the surface expression of the North American/Pacific plate convergent boundary. This subduction zone produces seismicity within the subducting Pacific plate, at the plate interface, within the overriding North American plate, and from volcanic processes (Ruppert et al., 2012). Stations along the arc record high environmental noise from strong wind and wave conditions, in addition to nonearthquake volcanic signals such as tremor and long‐period volcanic events (McNutt and Roman, 2015). Reliably detecting the highly active and routinely large‐magnitude seismicity of the Aleutian subduction zone is arguably the most challenging seismic monitoring task in Alaska, and analogous to island nations around the Pacific Rim.

We select a portion of the arc between the Andreanof and Fox Islands where a pronounced interisland gap inhibits seismic station coverage (Fig. 1). This region includes seismic coverage at Akutan, Makushin, Bogoslof, Okmok, Cleveland, Atka Complex/Korovin, Great Sitkin, and Kanaga volcanoes. Makushin, Cleveland, Atka Complex/Korovin, and Great Sitkin volcanoes experienced isolated periods of elevated unrest between January and October 2020 (Orr et al., 2024). We present results from two months of seismic activity between Andreanof and Fox Islands, incorporating elevated volcanic unrest sequences, from August to September 2020 (Table 2).

The Aleutian Ac reports 263 events with magnitudes ranging from 0.6 to 5.3. The RTc contains 254 events, with an RTc‐to‐Ac comparison returning 115 matched events, 148 missed events, and 139 other events. RTc visual review returns 95 other earthquakes, with 5 meeting catalog criteria. Our Aleutian Island MLc contains 526 events. MLc‐to‐Ac comparison results in 184 matched events, 79 missed events, and 342 other events. Visual review finds 279 MLc other earthquakes, with 36 meeting catalog criteria.

Our three‐catalog framework compares two algorithmic workflows. The first is representative of AEC’s real‐time process that has been in place since the late 1990s. The second is an ML‐assisted workflow chosen to evaluate overall performance. We acknowledge that EQTransformer (Mousavi et al., 2020) is one of many ML phase detection options (e.g., Ross et al., 2018; Zhu and Beroza, 2019), and we present results from a single set of model weights. Like all algorithms, both EQTransformer and our real‐time workflow are tunable. Our intent is not to provide an algorithm performance benchmark but instead to use EQTransformer as a representative of the larger class of ML phase detection approaches. We use our catalog comparison framework to highlight the necessary evolution of expectations, resources, and workloads to meet the potential afforded by machine learning.

For earthquakes of magnitude 1 and greater, we find strong ML performance in all study regions. The Purcell study, our ideal region, has exceptional performance from both the RTc (93% match) and MLc (98% match). The Purcell study’s strong performance allows for a meaningful P‐ and S‐phase comparison, presented in Figure 3. We find the Ac includes significantly more S phases than the RTc, and generally more P phases than the MLc. There is a substantial increase in S phases in the Purcell MLc, which is promising for providing more complete algorithmic solutions. S phases are a challenge in catalog building. Adding S phases requires discerning an onset commonly masked by the P coda. Yet, the presence of S phases is essential for constraining the trade‐off between depth and origin time. When S phases are incorrect, they can introduce significant error. Figure 3b illustrates that S phases added by ML fall within time residuals similar to those of RTc S phases when compared with Ac analyst phases. This indicates that added ML S phases are of acceptable quality. Therefore, the ML solutions require fewer phases to be added manually, thereby reducing analyst time.

In our nonideal studies (Anchorage, Icy Bay, Aleutian), we find the MLc captures 20%–30% more final solutions and over 50% more S phases than the current real‐time system. We can extrapolate this performance to a typical AEC year. An AEC annual catalog contains ∼50,000 events described by roughly 1,000,000 P phases and 500,000 S phases. Our analysis suggests that 10,000–15,000 events identified through visual review and over 250,000 manually added S phases could instead be delivered through an ML‐assisted workflow. Anecdotally, we estimate this could reduce personnel workload by 500–1000 hr annually. An ML‐assisted, real‐time catalog has the potential to both reduce personnel workloads and provide a more detailed catalog. This is a remarkable level of improvement.
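
The annual extrapolation above is simple proportional arithmetic; the sketch below reproduces it from the quoted baseline figures.

```python
# Back-of-envelope extrapolation to a typical AEC year.
annual_events = 50_000       # ~events per AEC annual catalog
annual_s_phases = 500_000    # ~manually reviewed S phases per year

more_events = [round(f * annual_events) for f in (0.20, 0.30)]  # 20%-30% gain
more_s = round(0.50 * annual_s_phases)                          # >50% gain

print(f"{more_events[0]:,}-{more_events[1]:,} events")  # 10,000-15,000
print(f">{more_s:,} S phases")                          # >250,000
```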

Beyond just time savings, a more detailed and accurate real‐time catalog improves our ability to interpret unfolding hazards. Vigorous swarm and aftershock sequences represent a particular challenge for monitoring agencies. The increase in detected events strains manual processing, commonly causing the catalog to remain incomplete at a time of scientific and monitoring need. Swarms may represent renewed activity on a fault or portend volcanic unrest. Alaska has a rich history of eruptions preceded by swarms. The surprise eruptions of Fourpeaked (Gardine et al., 2011), Kasatochi (Ruppert et al., 2011), and Bogoslof (Tepp et al., 2020) volcanoes were all preceded by swarm activity. These swarms were observable on the regional network, but outside of any volcano monitoring regimen. Likewise, the 2022 unrest at then‐dormant Edgecumbe volcano was first noticed by a nearby resident examining publicly available, algorithmic earthquake solutions posted to the AEC webpage (Grapenthin et al., 2022). Such hazard identification is only as good as the real‐time catalog is complete.

Similarly, in the aftermath of a damaging earthquake, it is largely the algorithmic catalog that is used for the most urgent analyses. Aftershock forecasting relies heavily on seismic productivity in the hours and days after a major earthquake, when manual review processes commonly lag. A more complete real‐time algorithmic catalog leads directly to better aftershock forecasting. To evaluate algorithmic performance on aftershocks, our Anchorage study considers 24 hr of seismicity, including, and following, the M 7.1 Anchorage mainshock. Our Anchorage MLc captures 21% more events than the RTc, including 41 M 3+ earthquakes, and provides 144 additional cataloged aftershocks.

Each M 3+ event is detectable across most of the Alaska seismic network; however, the phase distribution for M 3+ solutions varies between catalogs. Figure 4 shows the station distribution used to locate the mainshock in each catalog. We observe that the MLc provides more S phases than the RTc. The MLc mainshock solution includes 13 S phases, a vast improvement over the RTc, which includes no S phases. However, MLc phases are generally limited to radial epicentral distances of 350 km or less, whereas the Ac includes P phases to ∼1100 km and S phases to ∼330 km. The MLc phase distribution over epicentral distance likely reflects the training data set. Our ML algorithm is globally pretrained on STEAD (Mousavi, Sheng, et al., 2019). STEAD is largely constrained within 110 km epicentral distance (92% of training data), with only 8% of the available training data between 110 and 350 km epicentral distance. The difference in geographical phase distribution between our catalogs highlights our need for a training data set that includes larger epicentral distances.
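
The 92%/8% split quoted above can be checked directly against STEAD’s published metadata. The sketch below assumes a local copy of a STEAD chunk CSV; the source_distance_km column follows the published STEAD schema, and the file path is hypothetical.

```python
# Hedged check of the STEAD epicentral-distance distribution.
import pandas as pd

meta = pd.read_csv("stead_metadata.csv", low_memory=False)
d = pd.to_numeric(meta["source_distance_km"], errors="coerce").dropna()

near = (d <= 110).mean()            # fraction within 110 km
mid = d.between(110, 350).mean()    # fraction from 110 to 350 km
print(f"<=110 km: {near:.0%}, 110-350 km: {mid:.0%}")  # ~92% / ~8%
```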

All else being equal, a more complete algorithmic catalog is more valuable. Historically, the public record produced by monitoring organizations is visually reviewed. These extensive review procedures provide users (e.g., the public, scientists, and policy makers) reasonable assurance that included events are real rather than false or duplicate. However, the personnel hours necessary for current workflows preclude a more complete catalog. Relaxing the standards for inclusion, such as reducing signal‐to‐noise thresholds, decreasing the number of required stations and/or phases, or increasing the allowable travel‐time misfit for a hypocenter, can provide more detailed catalogs. Adjusting these thresholds creates more inclusive catalogs; however, these changes also invite errors in the form of egregious mislocations, erroneous magnitudes, and false events.

Our Purcell study showcases ML potential to deliver a more complete algorithmic catalog with very few erroneous events. Our Purcell MLc is a more complete record than the RTc, containing nearly twice as many earthquakes. In addition to matching 98% of the published catalog, our visual review confirms that 716 of 721 (99%) MLc other events are real earthquakes. Anecdotally, reviewing these 716 additional events would take an estimated 48 personnel hours. However, if algorithmic catalog quality is high enough, manual review could be omitted, and the smallest magnitude events could be included at little‐to‐no additional personnel cost. Arguably, the Purcell MLc 99% earthquake success rate meets a standard that could forgo human review. Smaller magnitude events from an ML‐assisted catalog could then be published to the public record “as is,” accepting a small fraction of false or poorly located events. We propose this tier of catalog be publicly labeled as algorithmic, automatic, or unreviewed, enabling users to choose whether to include these events in analyses.

Our Purcell study, however, is the performance exception. In our Anchorage, Icy Bay, and Aleutian studies, 30%–40% of ML‐added other events are nonearthquake or erroneous. These events include the association of noise, false local events from teleseismic signals, glacial quakes, and duplicate events of real earthquakes. Our Icy Bay study highlights both a nonearthquake and erroneous example. The nonearthquake example is provided by glacial quakes, generally understood to be seismic recordings of glacial calving (O’Neel et al., 2007; Aster and Winberry, 2017). These events, illustrated in Figure 5, are lower frequency and impulsive. For a trained human observer, glacial quakes are distinct from earthquakes. Yet our Icy Bay algorithmic catalogs include 35 (RTc) and 54 (MLc) confirmed glacial quakes. The RTc has no ability to distinguish event types; however, the MLc is trained to explicitly recognize earthquake phases. STEAD (Mousavi, Sheng, et al., 2019), our underlying training data set, has few signals from ice‐bearing regions of the world. Therefore, our STEAD‐trained ML model is not robust enough to adequately distinguish between earthquake and glacial quake signals. Kharita et al. (2024) have demonstrated the relative ease of distinguishing glacial quakes from earthquakes in this region, indicating that enriching a training data set could address ML glacial quake performance.

Whether nonearthquake seismic events—like glacial quakes, mining blasts, landslides, and volcanic signals—should be considered target or noise remains an important, but open, discussion. An ideal training data set includes these signals because they are frequent in the seismic record. However, curating a training data set inclusive of exotic signals provides an opportunity to re‐examine what should be included by monitoring agencies in public catalogs. This is a decision that should be driven by mission and stakeholder needs. The AEC catalog has long included additional, exotic source types as a natural product from an indiscriminate detection system. Once in the RTc, there has been little reason to remove exotic events, and they are instead labeled by source type. In fact, over the years, the nonearthquake portion of the AEC catalog has evolved into a substantial scientific resource (e.g., Wiemer and Baer, 2000; Kharita et al., 2024). We embrace the idea of richer catalogs that serve diverse scientific interests and note that a view of seismic monitoring beyond earthquakes is within the stated goals of the ANSS (USGS, 2017). However, any specific decisions are beyond the scope of this article.

A common class of erroneous events highlighted in our Icy Bay study involves mantle and crustal phases from earthquakes a few hundred kilometers away. Both algorithmic catalogs frequently confuse the mantle Pn and crustal Pg phases for pairs of P and S arrivals. The same is true for Sn and Sg. This results in a single distant earthquake being falsely interpreted as two smaller local events. Example seismograms from a magnitude 4.8 earthquake in interior Alaska, at 450 km epicentral distance, are presented in Figure 6. Although both algorithmic catalogs place detections on the impulsive mantle and crustal arrivals, it is of particular note that the MLc labels these phases. These incorrect labels are passed through the association algorithm, which honors the phase designations. An ML algorithm trained on similar events should designate single P and S phases (or perhaps explicitly designate both mantle and crustal P and S phases), providing the association algorithm with only phases for the single, more distant, larger event. Although STEAD (Mousavi, Sheng, et al., 2019) does not include signals beyond 350 km epicentral distance, it is also worth noting that out‐of‐the‐box EQTransformer (Mousavi et al., 2020) examines seismograms in 60 s windows, with 30 s overlap. The signal presented in Figure 6 is ∼120 s in duration. The 60 s window prevents Pn/Pg and Sn/Sg phases from being analyzed in a single frame. The MLc mislabeling of mantle and crustal phases from distant events in the Icy Bay region may therefore stem from ML architecture constraints, and may not be correctable through training data improvements alone.
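
This framing constraint can be illustrated with a toy calculation: with hypothetical arrival offsets spaced roughly as in the Figure 6 example, no 60 s frame contains both the mantle/crustal P pair and the corresponding S pair.

```python
# Toy windowing check: 60 s frames with a 30 s hop over a ~120 s wavetrain.
# Arrival offsets (s) are hypothetical, spaced roughly as in Figure 6.
arrivals = {"Pn": 0.0, "Pg": 15.0, "Sn": 62.0, "Sg": 80.0}

for start in (0.0, 30.0, 60.0):
    frame = (start, start + 60.0)
    inside = [ph for ph, t in arrivals.items() if frame[0] <= t < frame[1]]
    print(f"frame {frame[0]:>3.0f}-{frame[1]:<3.0f} s:", inside)
# No frame sees all four phases, so the P and S pairs from this distant
# event are never evaluated together by the 60 s architecture.
```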

In all study regions except Purcell, 60%–70% of the MLc other events are unpublished hypocenters. The majority of these unpublished hypocenters fall below magnitude 1.5, with a few larger magnitude exceptions from the Anchorage study (Fig. 7). Although the addition of earthquake hypocenters from an automatic source is valuable, we find that only 24% of MLc other hypocenters would be considered for inclusion into our current catalog. In addition, although the MLc surpasses the RTc in matching the published catalog by 22%–26%, the MLc still misses 20%, 30%, and 32% of the published catalog for the Anchorage, Aleutian, and Icy Bay studies, respectively. These catalogs highlight areas where the current training data set underrepresents the region’s active seismicity. We note that in a few cases associated with the Anchorage study, larger magnitude events are detected algorithmically, but the resulting hypocenter fails one or more of our constrained matching criteria. However, we also find instances in which large magnitude events remain undetected. For example, in the Aleutian study, a notable omission from the MLc is a magnitude 5.3 earthquake occurring in the middle of the study region (Fig. 8). Waveforms from this event are more complex than those in the Purcell study and push the boundary of a 60 s analysis window. However, most of the observable phases are impulsive, which is not always true in the Aleutian subduction zone. We propose that these MLc omissions can guide how to curate a training data set designed for maximum performance in Alaska.

Like any model or algorithm, ML architectures can be adjusted. Hyperparameters can be tuned, and training data sets can be reconfigured to produce different results. Our study‐based approach provides a systematic framework to identify and quantify algorithmic underperformance. Our Purcell, Icy Bay, Aleutian, and Anchorage studies were chosen as representative monitoring challenges in Alaska; however, these challenges are not unique to Alaska. Results from our curated studies inform how to build an optimal training data set. Relative to tests here, an Alaska training data set requires greater epicentral distances, a more representative distribution of nonearthquake signals and noise, and greater representation of complex earthquake signals across all magnitudes.

Our approach provides a comparison of an ML‐derived catalog with the current real‐time catalog by measuring performance against an analyst‐reviewed catalog. Across all studies, we find that our MLc matches our Ac more closely than our RTc does (Table 2). With the exception of the Purcell study, in which both algorithmic catalogs have a match success exceeding 90%, we find our MLc includes over 20% more analyst reviewed events than our RTc. Our MLc also contains more S phases than our RTc, which is a notable step toward reproducing analyst performance in the Ac. However, we find that the majority of algorithmic other events visually identified as earthquakes do not pass our current criteria for inclusion in the published catalog, regardless of study region. These real, but ill‐defined, events would require additional personnel time to manually improve, or remove, under current catalog preparation steps to meet reporting standards.

Undeniably, ML is capable of producing a more complete algorithmic catalog. However, this catalog comes at a cost. In the current operational paradigm in which all published earthquakes are examined by eye, the larger catalogs available from ML approaches simply require more personnel time to review, typically a few minutes per event. We find considerable trade‐offs among trust in ML algorithms to infallibly detect societally relevant, large earthquakes; catalog completeness at small magnitudes; and the personnel time needed to curate the published catalog. As methods and training data sets evolve, these trade‐offs will surely decrease. And yet, part of the solution may require adapting our expectations for public catalogs of record. If we can tolerate a modest, nonzero, error rate at small magnitudes, it is likely possible to forgo human review for small events. Monitoring agencies could then publish catalogs with lower magnitudes of completeness with little‐to‐no additional personnel costs. These additional, low‐magnitude events may be of little interest to the public; however, the benefits for scientific applications are unequivocal. The adoption of ML approaches by monitoring agencies requires algorithms to have repeatable and high‐quality performance. Once these performance conditions are met, however, ML‐augmented published catalogs have the potential to be richer at lower magnitudes and incorporate exotic events.

Our analyst reviewed earthquake catalog was produced by the Alaska Earthquake Center and retrieved from the Advanced National Seismic System (ANSS) Comprehensive Catalog (https://earthquake.usgs.gov/data/comcat/), operated by the U.S. Geological Survey. Seismic waveform data are available via the National Science Foundation (NSF) SAGE Data Management Center (https://service.iris.edu). The pretrained EQTransformer model from Mousavi et al. (2020) is available through listed references. All study catalogs and Antelope parameter files are available on GitHub (https://github.com/aknoel/study_catalogs.git). All websites were last accessed in February 2025.

The authors declare that there are no conflicts of interest.

This project was made possible in part by U.S. Geological Survey (USGS) Advanced National Seismic System (ANSS) cooperative agreements G20AC00032 and G24AC00003 and support from the Office of the Alaska State Seismologist. The authors thank Heather McFarlin and Natalia Ruppert for insights and discussion, as well as two anonymous reviewers for their comments and suggestions.