Summarizing model performance in a single numeric score is one of the most popular approaches to the empirical evaluation of ground‐motion models (GMMs), and it has the advantage of simplifying model comparison. We study how data correlation and score variability affect the evaluation of GMMs. Most modern GMMs are hierarchical: ground motions from the same earthquake are modeled as correlated. We demonstrate, with examples, that evaluating such hierarchical GMMs with a score that does not account for this correlation can produce incorrect results. To score hierarchical GMMs correctly, we propose the multivariate logarithmic score, a natural extension of the widely used univariate logarithmic score (referred to as LLH in the seismological literature). Score variability, in turn, affects how a model ranking should be interpreted. We demonstrate that the cluster bootstrap is better suited to studying score variability than the other bootstrap strategies proposed in the literature. The bootstrap yields two useful quantities: the distinctness index, which indicates whether two models are truly different given the score variability, and the frequency weight, a data‐driven weighting scheme that represents the frequentist’s interpretation of the weight of a logic‐tree branch. The frequency weight links directly to the current practice of using multiple GMMs in probabilistic seismic hazard assessment.
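As a rough illustration of the two central ideas, the following Python sketch contrasts a univariate logarithmic score with a multivariate one and then resamples whole earthquakes in a cluster bootstrap. It rests on assumptions not stated in this section: residuals are taken to be normally distributed with a between‐event standard deviation `tau` and a within‐event standard deviation `phi`, the score is written as a summed negative natural‐log density (sign, log base, and normalization are choices of this sketch), and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def uni_log_score(resid, tau, phi):
    """Negative log-density treating every record as independent N(0, tau^2 + phi^2)."""
    sigma2 = tau**2 + phi**2
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + resid**2 / sigma2)

def mv_log_score(resid, event_ids, tau, phi):
    """Negative multivariate normal log-density, summed over earthquakes.

    Assumed covariance within one earthquake: phi^2 * I + tau^2 * (all-ones matrix),
    so records sharing an event are correlated; different events are independent.
    """
    total = 0.0
    for eid in np.unique(event_ids):
        r = resid[event_ids == eid]
        n = r.size
        cov = phi**2 * np.eye(n) + tau**2 * np.ones((n, n))
        _, logdet = np.linalg.slogdet(cov)
        quad = r @ np.linalg.solve(cov, r)
        total += 0.5 * (n * np.log(2 * np.pi) + logdet + quad)
    return total

def cluster_bootstrap_scores(resid, event_ids, tau, phi, n_boot, rng):
    """Resample earthquakes (not individual records) with replacement and rescore."""
    events = np.unique(event_ids)
    groups = {e: resid[event_ids == e] for e in events}
    scores = np.empty(n_boot)
    for b in range(n_boot):
        drawn = rng.choice(events, size=events.size, replace=True)
        r = np.concatenate([groups[e] for e in drawn])
        # Relabel by draw position so a repeated earthquake counts as a separate cluster.
        ids = np.concatenate([np.full(groups[e].size, k) for k, e in enumerate(drawn)])
        scores[b] = mv_log_score(r, ids, tau, phi)
    return scores

# Synthetic residuals for illustration only (not data from the study).
rng = np.random.default_rng(0)
n_events, n_rec = 20, 8
tau, phi = 0.4, 0.6
event_ids = np.repeat(np.arange(n_events), n_rec)
between = np.repeat(rng.normal(0.0, tau, n_events), n_rec)  # shared per-event term
resid = between + rng.normal(0.0, phi, n_events * n_rec)

score_uni = uni_log_score(resid, tau, phi)
score_mv = mv_log_score(resid, event_ids, tau, phi)
boot = cluster_bootstrap_scores(resid, event_ids, tau, phi, n_boot=200, rng=rng)
```

In this sketch the two scores coincide only when `tau` is zero, that is, when records from the same earthquake are uncorrelated. The spread of `boot` across resamples is one way to inspect the score variability; resampling whole earthquakes keeps the within‐event correlation intact in every replicate.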