Earthquake forecasting models express hypotheses about seismogenesis that underpin global and regional probabilistic seismic hazard assessments (PSHAs). An implicit assumption is that the comparatively higher spatiotemporal resolution datasets from which regional models are generated lead to more informative seismicity forecasts than global models, which are however calibrated on greater datasets of large earthquakes. Here, we prospectively assess the ability of the Global Earthquake Activity Rate (GEAR1) model and 19 time‐independent regional models to forecast M 4.95+ seismicity in California, New Zealand, and Italy from 2014 through 2021, using metrics developed by the Collaboratory for the Study of Earthquake Predictability (CSEP). Our results show that regional models that adaptively smooth small earthquake locations perform best in California and Italy during the evaluation period; however, GEAR1, based on global seismicity and geodesy datasets, performs surprisingly well across all testing regions, ranking first in New Zealand, second in California, and third in Italy. Furthermore, the performance of the models is highly sensitive to spatial smoothing, and the optimal smoothing likely depends on the regional tectonic setting. Acknowledging the limited prospective test data, these results provide preliminary support for using GEAR1 as a global reference M 4.95+ seismicity model that could inform eight‐year regional and global PSHAs.

The increasing availability and quality of geophysical datasets, including earthquake catalogs, geological records, and interseismic strain rates, has enabled the creation of regional and global earthquake forecasting models, some of which underpin probabilistic seismic hazard assessments (e.g., Pagani et al., 2020; Danciu et al., 2021). Regional models provide detailed approximations of seismic sources and sophisticated earthquake forecasts, given the comparatively high spatiotemporal resolution of the datasets on which they are based (e.g., the Uniform California Earthquake Rupture Forecast, Version 3; Field et al., 2017). However, the relative lack of large earthquakes makes it difficult to assess whether regional models can translate the greater data availability and resolution into more skillful forecasts, or whether they might be overfitting their calibration datasets and providing lower skill than expected. Global models, in contrast, offer greater testability against the large damaging earthquakes due to the more frequent seismic activity observed around the world, but are neccesarily coarser, and information coverage can vary considerably in space. Thus, regional models are largely expected to be far superior; however, model evaluations must be done prospectively, that is, against future earthquakes, as regional models may have the advantage of better fitting the past. In this study, we test the assumption that regionally calibrated seismicity models are more informative than global models by benchmarking the predictive skills of a set of regional and global stationary earthquake‐rate models that were submitted to the Collaboratory for the Study of Earthquake Predictability (CSEP; e.g., Michael and Werner, 2018) for truly prospective evaluation.

Starting in 2007, the Working Group of the Regional Earthquake Likelihood Models (Field, 2007) launched a prospective earthquake forecasting experiment in California that involved multiple datasets, models, and tests formalizing different hypotheses about seismogenesis. This experiment sparked a series of subsequent forecast experiments in other seismically active regions, including Japan (Tsuruoka et al., 2012), New Zealand (Gerstenberger and Rhoades, 2010), and Italy (Marzocchi et al., 2010), with the goal of developing and prospectively evaluating regional earthquake forecasting models. On a global scale, Strader et al. (2018) evaluated four time‐invariant seismicity models against two years of prospective data, finding that the hybrid Global Earthquake Activity Rate (GEAR1; Bird et al., 2015) model was the most informative. GEAR1 results from a multiplicative combination of smoothed seismicity and interseismic strain rates, globally calibrated to Mw 5.767+, shallow (≤70 km) earthquakes reported from 1977 to 2013 in the Global Centroid Moment Tensor catalog (i.e., Ekström et al., 2012). Compared to GEAR1, regional models that participated in the aforementioned CSEP experiments rely on parameters that might be better constrained by higher‐resolution regional information, or indeed they are based on datasets that may never be uniformly available on a global scale, such as historical earthquake catalogs (e.g., the EBEL, HAZGRIDX, and TRIPLES‐HYBRID models), adaptive smoothed M 2+ seismicity (e.g., HKJ, HRSS‐CSI), detailed geological fault maps (e.g., NZHM), spatial variations in the Gutenberg–Richter b‐value (e.g., Asperity Likelihood Model [ALM], HALM, and ALM‐IT), tectonic zonations (e.g., MPS04‐AFTER), and kinematic models of surface velocities (e.g., NEOKINEMA). In Table 1 and Figure 1, we provide a brief description of the main features of the time‐independent mainshock + aftershock models (i.e., models forecasting all earthquakes) that we consider in this study. For more details about these models, see Field (2007), Gerstenberger and Rhoades (2010), Marzocchi et al. (2010), and the references therein.

Prior to evaluating any seismicity model, CSEP provides strict definitions of the testing region, period, magnitude range, target earthquakes, and testing methods (e.g., Schorlemmer et al., 2010). Accordingly, we prospectively evaluate the forecasting skills of regional and global models using M 4.95+ earthquakes reported in California, New Zealand, and Italy from 1 January 2014 to 1 January 2022 in the Comprehensive Earthquake Catalog of the Advanced National Seismic System (ANSS; Guy et al., 2015), the GeoNet catalog (Ristau, 2013), and the Italian seismic bulletin (Bollettino Sismico Italiano; Amato et al., 2006), respectively. In California and Italy, the target earthquakes are within the 0–30 km depth range, whereas, in New Zealand, these are confined to the upper 40 km of the crust (see Fig. 2).

We project GEAR1 forecasts onto the CSEP‐California, New Zealand, and Italy testing regions to create three globally calibrated regional seismicity models (see Fig. 1). For the sake of simplicity, we extrapolate the original GEAR1 Mw 5.95+ earthquake rates to a magnitude threshold of 4.95, assuming a global b‐value of 1 (e.g., Petruccelli et al., 2019). In the case of Italy, we divide GEAR1 rates by a factor of 1.602 to correct for differences in magnitude scales used in model calibration (moment magnitude) and model testing (local magnitude; Schorlemmer et al., 2010). Furthermore, we assume that, at a magnitude threshold of 4.95, the contribution of earthquakes from within the depth ranges of 30–70 km in California and Italy, and 40–70 km in New Zealand to the long‐term seismicity rate is minimal, so that the comparison between GEAR1 (which extends to a depth of 70 km) and the regional models is more compatible. In the specific case of New Zealand, only 6% of the M 4.95+ seismicity recorded instrumentally from 2003 to 2021 is within the 40–70 km depth range (Ristau, 2013), whereas none of the M 4.95+ earthquakes observed between 1985 and 2021 in Italy are located deeper than 30 km.

Using the CSEP’s pyCSEP toolkit (Savran, Bayona, et al., 2022; Savran, Werner, et al., 2022), we then conduct the paired T‐test of Rhoades et al. (2011) to comparatively evaluate the performance of the regional models with that of GEAR1. The T‐test is based on the Information Gain per Earthquake (IGPE), here obtained by a (regional) model A over a (global) model B, and its 95% confidence intervals estimated from the Student’s t‐distribution (see equations 17 and 18 in Rhoades et al., 2011). Each IGPE can be used as a measure of the relative amount of information about future target earthquakes that two competing models can provide. Thus, a regional model can be considered statistically more informative than GEAR1 if the IGPE is positive and its confidence interval does not include zero; whereas, if the IGPE is negative and the confidence interval does not include zero, it can be considered that the regional model is statistically less informative than the global model. Alternatively, if the confidence interval does contain zero, the two models can be considered statistically equally informative, that is, the null hypothesis of equal forecasting performance cannot be rejected at a 0.05 significance level.

We also evaluate the spatial performance of the models by conducting the spatial S‐test of Zechar et al. (2010) and the binary S‐test of Bayona et al. (2022). Similar to the S‐test, the binary S‐test assumes that earthquakes in individual cells are independent; however, this test is based on the likelihood of observing either 0 or 1+ earthquakes in each cell, rather than observing a specific number of events ω, which makes the assumption by Zechar et al. (2010) that earthquakes are spatially Poisson distributed less critical. Given the differences in the number of 0.1° × 0.1° cells in each testing region (7682 in California, 6343 in New Zealand, and 8993 in Italy) and the number of target events observed during the 8 yr evaluation period (38 in California, 47 in New Zealand, and 11 in Italy), we calculate joint log‐likelihood scores per earthquake and active cell, that is, the sum of log‐likelihood scores obtained by each forecast in each individual cell divided by the total number of observed earthquakes and activated cells, respectively, to analyze and compare the spatial forecasting capabilities of all the models (see equations 5 and 12 in Bayona et al., 2022). In addition, we compute Gini coefficients (Gc; Breiman et al., 1984) for each model to measure how localized or smooth their predicted earthquake rates are and thus better interpret our spatial test results. The Gc is equal to twice the area between the receiver operating characteristic (ROC; e.g., Molchan and Keilis‐Borok, 2008) curve and the diagonal line, with the ROC curve being the normalized cumulative forecast rate in cells sorted in decreasing order, and takes values in the interval [0,1). A Gc that tends to zero indicates that the earthquakes are expected to be homogeneously distributed throughout the testing region, whereas a Gc that tends to one denotes very localized seismicity (see the TRIPLE_S‐CPTI [Gc0.5] and ALM‐IT [Gc0.8] models for Italy in Fig. 1 as two contrasting examples). Finally, we apply the number N‐test (i.e., equations 5–8 in Zechar et al., 2010) to assess the consistency between the observed and expected number of earthquakes and share these results in the supplemental material to this article (see Data and Resources).

Prospective T‐test results for California show that the regional HKJ model is the most informative, obtaining an IGPE of approximately 1.0 over GEAR1, whereas GEAR1, ranking second, exceeds the predictive skill of the NEOKINEMA, Pattern Informatics (PI), and EBEL models (see Fig. 3). These results provide additional evidence that small earthquakes and geodetic data are useful for delineating areas in California where moderate‐to‐large events may occur (e.g., Kafka and Levin, 2000; Zeng et al., 2018), as HKJ and GEAR1 use M 2+ earthquakes and interseismic strain rates to forecast seismicity, respectively. Furthermore, these results suggest that California is currently in a period of relatively low on‐fault seismicity, as NEOKINEMA—a kinematic model of surface velocities based on fault information, markedly overestimates the observed seismicity rate and poorly explains the spatial distribution of earthquakes (see Fig. S1, available in the supplemental material to this article and the additional resources referenced in Data and Resources). Similarly, PI and EBEL perform poorly, because the EBEL model significantly overpredicts the number of M 4.95+ earthquakes, and because PI is particularly sensitive to the unlikely occurrence of the 2019 Ridgecrest sequence (Bayona et al., 2022).

In New Zealand, T‐test results show that GEAR1 is the most skillful model to forecast M 4.95+ seismicity during the target period, obtaining IGPEs of about 0.5 over the 2002 NZHM (Stirling et al., 2002) and Proximity to Past Earthquakes (PPE) (see Fig. 3). These results are due in part to the fact that the NZHM was designed to forecast mainshocks only; nonetheless, the model provides a number of target earthquakes that is consistent with the observations (see Fig. S1). A more important factor influencing these results is the spatial dimension of the models. Residual analyses of (Poisson) spatial‐likelihood scores obtained by GEAR1 and NZHM in individual cells show that the NZHM outperforms GEAR1 in forecasting the occurrence of the Mw 7.8 Kaikōura earthquake and some of its surrounding aftershocks, whereas GEAR1 outperforms the NZHM in forecasting the M ∼ 5 events that nucleated about 170 km northeast from the mainshock (see Fig. 2b and Figs. S2a,b). These differences can be explained considering that the NZHM contains fault information near the Kaikōura rupture area that adds predictive skill, whereas GEAR1, through its seismicity model component (i.e., the KJSS model), incorporates earthquake information derived from the 2013 Cook Strait sequence (Hamling et al., 2014), which is not present in the regional models, because they were trained with data through 2006. In addition, GEAR1, through its geodesy model component (i.e., the SHIFT_GSRM2f model), includes interseismic strain rates, which appear to be useful for mapping the location of the “closest” aftershocks of the Mw 7.8 Kaikōura earthquake (see how the relative performance of GEAR1 to NZHM is superior to that of KJSS in Fig. S2).

The results of the T-test for Italy show that only relative intensity (RI) and the adaptively smoothed HRSS‐CSI seismicity model can be considered statistically more informative than GEAR1, as they obtain IGPEs of about 0.7 and 0.6 over the global model, respectively (see Fig. 3). As in California, these results suggest that small (M 3+) earthquakes might be useful for forecasting the occurrence of larger events in Italy. GEAR1, on the other hand, outperforms the ALM and TRIPLE_S model families, and can be considered as informative as the HZATI, HAZGRIDX, and MPS04‐AFTER models, the latter being the official time-independent model of the 2004 National Seismic Hazard Model for Italy (MPS Working Group, 2004). These results differ in part from the results obtained by Taroni et al. (2018), who prospectively found that HZATI, MPS04‐AFTER, and TRIPLE_S-CPTI were the best-performing time-invariant models during the 2009–2014 period. These discrepancies are mainly due to the absence of the 2012 Emilia earthquake sequence in our testing catalog, which, with 11 M 4.95+ events, heavily influenced the results of the 5 yr experiment. Thus, these results reflect the variability in model performance that stems from a relatively limited number of prospective target earthquakes, so we strongly recommend further prospective evaluations in these regions to investigate the timescale over which test results are reasonably stable.

Nonetheless, the prospective performance of GEAR1 is surprising, considering that California, New Zealand, and Italy are some of the best‐instrumented and studied earthquake‐prone regions in the world; so one might expect regionally calibrated models to be better informed and therefore more informative than GEAR1. Yet, GEAR1, using global datasets compiled no earlier than 1977, performs just as well (e.g., California and Italy) and sometimes better (e.g., New Zealand) than regional models, raising questions around which regional datasets make forecast models truly informative, and to what extent other factors, such as spatial smoothing or calibration period, add predictive skill. Overall, our comparative test results indicate that smoothing the locations of small earthquakes and combining geodetic data with smoothed M 5.767+ seismicity in a multiplicative fashion are two useful methods for forecasting moderate‐to‐large (M 4.95+) earthquakes in California, New Zealand, and Italy over a period of eight years. Furthermore, our analysis of the spatial dimension of the models shows that the regional models that provide forecasts that are too smooth (low Gc’s) or too localized (high Gc’s) perform poorly (see Fig. 4). The spatially uniform Poisson (SUP) and TRIPLE_S models, for example, indicate when more smoothing leads to systematic information loss, whereas the PI and ALM models, based on great specificity, illustrate overconfident localized forecasts. In contrast, models with Gc’s between ∼0.6 and 0.75, such as HKJ, GEAR1 (New Zealand), and NZHM, obtain the highest likelihood scores, showing the range of smoothing in which spatial forecasting, in the context of time‐independent forecasts over a period of eight years, appears optimal. Thus, we recommend earthquake forecasters consider “intermediate” smoothing procedures, for example, adaptive smoothing, to add predictive skill to their time‐invariant seismicity models.

Finally, we note that the 47 earthquakes in New Zealand appear to be spatially more predictable than the 38 target events in California, and the 11 prospective earthquakes in Italy, respectively, as the New Zealand models obtain the highest spatial probability scores per earthquake (see Fig. 4a). In fact, some of the target earthquakes in New Zealand nucleated in the Cook Strait region in 2016, only a few years after the occurrence of three M 5.7+ earthquakes (Hamling et al., 2014). Interestingly, this pattern is not observed when estimating spatial‐likelihood scores per active cell, as the California and Italian models appear to “benefit” more than the New Zealand models from the reduced sensitivity of the binary S‐test to clustering (see Fig. 4b). These results may stem from temporary fluctuations in seismic activity, as various target earthquakes in New Zealand nucleated on known faults, but not many on‐fault events occurred in California and Italy during the testing period. Alternatively, these observations may be due to fundamental regional differences, likely related to the frequency and localization of earthquakes in each tectonic setting (e.g., Petruccelli et al., 2019).

We evaluate and compare the performance of regional and global time‐independent seismicity models for California, New Zealand, and Italy, using eight years of prospective data. Our comparative test results show that a regional model based on adaptive smoothed M 2+ seismicity is the most informative in California, whereas GEAR1, based on global datasets of M 5.767+ earthquakes and interseismic strain rates, ranks second. In New Zealand, GEAR1 outperforms all the models, including the former 2002 New Zealand National Seismic Hazard Model of Stirling et al. (2002), mainly due to the incorporation of seismicity information that is not present in the regional models. In Italy, a regional model based on relative seismic intensities and a model using adaptively smoothed M 3+ seismicity are the most informative, whereas the globally calibrated model, ranking third, can be considered as informative as the model informing the Italian National Seismic Hazard Model of 2004 (i.e., MPS Working Group, 2004). The performance of GEAR1 is unexpected, considering the comparatively lower resolution of its global training datasets. Thus, these results suggest that the adaptive smoothing of small earthquake locations and the multiplicative blend of geodetic data with smoothed seismicity are useful methods for forecasting the occurrence of moderate earthquakes in these regions over a period of nearly a decade.

Upon analyzing differences in model performance, we find that some regional models appear to underfit or overfit their calibration datasets, which may result in uniformative and overconfident forecasts, respectively. Specifically, models that are too smooth or too localized perform poorly, whereas models that apply “intermediate” smoothing procedures, such as adaptive smoothing, show better performance in forecasting the spatial distribution of observed seismicity. Hence, we recommend earthquake forecasters consider adaptive smoothing techniques to add spatial predictive skill to their time‐invariant models. In addition, we find that the global and regional models for New Zealand obtain higher spatial‐likelihood scores per earthquake than the California and Italy models, respectively, meaning that some target events in New Zealand, past and prospective, occurred in similar locations. These observations might be due to temporal variations in on‐fault and off‐fault seismicity, or could be related to the frequency and spatial localization of earthquakes that are characteristic of each tectonic environment.

Acknowledging that the number of M 4.95+ earthquakes in the evaluation datasets is limited, and that earthquake clustering is likely to influence the model ranking, these prospective test results, together with the results of prospective global earthquake forecasting experiments (e.g., Strader et al., 2018; Bayona et al., 2021), provide preliminary support for the use of GEAR1 as a reference model forecasting moderate‐to‐large seismicity in California, New Zealand, Italy, and presumably elsewhere around the globe over an eight‐year period. GEAR1 could be used as a benchmark against seismicity models in future CSEP forecast experiments in these and other testing regions, or against earthquake models informing global and regional seismic hazard assessments, including the 2020 European Seismic Hazard Model (Danciu et al., 2021) and the 2018 Global Earthquake Model (Pagani et al., 2020). All forecasts, data, and tests in this article are openly available for anyone to do so (see Data and Resources).

The Comprehensive Earthquake Catalog of the Advanced National Seismic System (ANSS; Guy et al., 2015) is available at, the GeoNet catalog (Ristau, 2013) is available at, and the Italian seismic bulletin (Amato et al., 2006) can be found at The forecast files are openly accessible on Zenodo at, whereas the reproducibility package for this article, which also contains consistency test results for all earthquake forecasting models, can be found on GitHub at The pyCSEP software is freely available on GitHub at, and the documentation can be found at Two additional figures showing prospective number test results for all the models and residuals between spatial log‐likelihood scores obtained by the Global Earthquake Activity Rate (GEAR1) model in New Zealand and New Zealand Hazard Model (NZHM), and between the seismicity model component of GEAR1 and NZHM, are provided in the supplemental material to this article. All websites were last accessed in January 2023.

The authors acknowledge that there are no conflicts of interest recorded.

The authors sincerely thank editor Keith D. Koper and two anonymous reviewers for their thorough reading and constructive comments to improve the quality of this work. The authors also thank David Rhoades for valuable comments that helped them improve this article. José A. Bayona, Maximilian J. Werner, Danijel Schorlemmer, and Warner Marzocchi were supported by the European Union H2020 program (Grant Number 821115, Real‐time earthquake rIsk reduction for a reSilient Europe [RISE]). Maximilian J. Werner also received support from the Natural Environment Research Council (NERC; NE/R017956/1, “EQUIPT4RISK”). In addition, this research was supported by the Southern California Earthquake Center (SCEC, Contribution Number 12706). The SCEC is funded by the National Science Foundation (NSF) Cooperative Agreement EAR‐1600087 and U.S. Geological Survey (USGS) Cooperative Agreement G17AC00047.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data