Evaluating the performance of earthquake forecasting/prediction models is the main rationale behind recent international efforts such as the Regional Earthquake Likelihood Models (RELM) project and the Collaboratory for the Study of Earthquake Predictability (CSEP). Basically, the evaluation process consists of two steps: (1) run all forecast codes simultaneously over well-defined testing regions; (2) compare the resulting forecasts through a suite of statistical tests. The tests are based on the likelihood score and assess performance in both space and time. All of these tests rely on basic assumptions that have never been deeply discussed and analyzed: in particular, models are required to specify an event rate in each space-time-magnitude bin, and these rates are assumed to be independent and subject to Poisson uncertainty. In this work we explore these assumptions in detail, together with their impact on the CSEP testing procedures, when applied to a widely used class of models, the Epidemic-Type Aftershock Sequence (ETAS) models. Our results show that, even if an ETAS model is an accurate representation of seismicity, that same correct model is rejected by the current CSEP testing procedures significantly more often than expected. We show that this deficiency arises because ETAS models produce forecasts whose variability is significantly higher than that of a Poisson process, invalidating one of the main assumptions underlying the CSEP/RELM evaluation process. This shortcoming certainly does not negate the paramount importance of the CSEP experiments as a whole, but it does call for a specific revision of the testing procedures to allow a better understanding of the results of such experiments.
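The over-dispersion at the heart of this argument can be sketched with a toy simulation (this is an illustration, not the paper's actual experiment; the parameter values `mu` and `branching` are hypothetical). A minimal ETAS-like branching cascade — background events drawn from a Poisson distribution, each event independently triggering a Poisson number of offspring — yields per-window event counts whose variance greatly exceeds the mean, whereas a pure Poisson process has a variance-to-mean ratio of 1:

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw a Poisson(lam) variate via Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def etas_like_counts(mu, branching, n_windows, rng):
    """Total event count per time window from a simple branching cascade:
    background events ~ Poisson(mu); each event, in turn, triggers
    Poisson(branching) direct aftershocks, which trigger their own, etc.
    Subcritical regime requires branching < 1."""
    counts = []
    for _ in range(n_windows):
        total = 0
        generation = sample_poisson(mu, rng)  # background events
        while generation > 0:
            total += generation
            # offspring of the current generation
            generation = sum(sample_poisson(branching, rng)
                             for _ in range(generation))
        counts.append(total)
    return counts

rng = random.Random(42)
counts = etas_like_counts(mu=5.0, branching=0.8, n_windows=20000, rng=rng)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
print(f"mean = {mean:.2f}, variance = {var:.2f}, var/mean = {var/mean:.2f}")
```

With these assumed parameters the expected count per window is mu/(1 - branching) = 25, but the variance-to-mean ratio comes out far above 1, which is why a Poisson-based likelihood test can reject even the model that generated the data.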