The five-year experiment of the Regional Earthquake Likelihood Models (RELM) working group was designed to compare several prospective forecasts of earthquake rates in latitude–longitude–magnitude bins in and around California. This forecast format is being used as a blueprint for many other earthquake predictability experiments around the world, and it is therefore important to consider how to evaluate the performance of such forecasts. Two tests currently in use are based on the likelihood of the observed distribution of earthquakes given a forecast: one compares the binned space–rate–magnitude observation and forecast, and the other compares only the rate forecast and the number of observed earthquakes. In this article, we discuss a subtle flaw in the current test of rate forecasts, and we propose two new tests that isolate, respectively, the spatial and magnitude components of a space–rate–magnitude forecast. For illustration, we consider the RELM forecasts and the distribution of earthquakes observed during the first half of the ongoing RELM experiment. We show that a space–rate–magnitude forecast may appear to be consistent with the distribution of observed earthquakes even though its spatial component is inconsistent with the spatial distribution of observed earthquakes, and we suggest that these new tests be used to provide increased detail in earthquake forecast evaluation. We also discuss the statistical power of each of the likelihood-based tests and the stability of their results with respect to earthquake catalog uncertainties.
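To make the two kinds of comparison concrete, the following is a minimal sketch, not the procedure defined in the article. It assumes that earthquake counts are Poisson distributed, and it assumes that the spatial component can be isolated by rescaling the forecast so that its total matches the observed number of events. The function names (`poisson_cdf`, `number_test`, `spatial_log_likelihood`) and the toy values in the usage note are hypothetical and chosen only for illustration.

```python
import math


def poisson_cdf(k, mean):
    """Return P(X <= k) for X ~ Poisson(mean); 0.0 when k is negative."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i  # P(X = i) from P(X = i - 1)
        total += term
    return min(total, 1.0)


def number_test(n_observed, n_forecast):
    """Compare an observed event count with a forecast rate.

    Assumption: the count is Poisson with mean n_forecast.
    Returns (delta1, delta2):
      delta1 = probability of at least n_observed events
      delta2 = probability of at most n_observed events
    A very small delta1 suggests underprediction; a very small
    delta2 suggests overprediction.
    """
    delta1 = 1.0 - poisson_cdf(n_observed - 1, n_forecast)
    delta2 = poisson_cdf(n_observed, n_forecast)
    return delta1, delta2


def spatial_log_likelihood(forecast_cells, observed_cells):
    """Joint Poisson log-likelihood of observed counts per spatial cell.

    Assumption: the forecast has already been summed over magnitude bins,
    every cell has a positive rate, and the forecast is rescaled so that
    its total equals the number of observed events. The rescaling removes
    the rate component, so only the spatial pattern is compared.
    """
    scale = sum(observed_cells) / sum(forecast_cells)
    log_likelihood = 0.0
    for rate, count in zip(forecast_cells, observed_cells):
        expected = rate * scale
        log_likelihood += (
            -expected + count * math.log(expected) - math.lgamma(count + 1)
        )
    return log_likelihood
```

For example, `number_test(3, 3.0)` returns the two quantiles for three observed events against a forecast of three, and `spatial_log_likelihood([0.9, 0.1], [9, 1])` exceeds `spatial_log_likelihood([0.1, 0.9], [9, 1])` because the first forecast concentrates its rate where the events occurred. In practice, the observed log-likelihood would be compared with values obtained from catalogs simulated under the forecast; that step is omitted here.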