# Interpretable Meta-Score for Model Performance: Extended Abstract

*ECMLPKDD Workshop on Meta-Knowledge Transfer*, PMLR 191:75-77, 2022.

#### Abstract

In benchmarking different machine learning models, we are seeing a real boom: in many applications, such as computer vision or natural language processing, more and more challenges are being created. We are therefore collecting more and more data on the performance of individual models or architectures, but the question remains how to decide which model gives the best results and whether one model is significantly better than another. The most commonly used measures, such as AUC, accuracy, or RMSE, return a numerical assessment of how well the predictions of the selected model satisfy specific properties: whether they correctly assign the probability of belonging to the chosen class, whether they assign the correct predicted class, or whether the difference between the predictions and the true values is small. From an application point of view, however, we lack information on: (i) the probability that a given model achieves better performance than another; (ii) whether the differences we observe between models are statistically significant; (iii) how to compare values of the selected performance metrics between datasets, where in most cases they are incomparable, e.g., how to compare a model’s AUC improvement of $0.01$ if for one dataset the best achieved AUC is of the order of $0.9$ and for the other $0.7$. To address these shortcomings, in an earlier paper we introduced a new meta-measure of model performance, EPP. It is inspired by the Elo rating used in chess and other sports. By comparing the ratings of two players and transforming them accordingly, we obtain information on the probability that one player is better than the other. EPP adapts this property to the specific conditions of benchmarks in machine learning while allowing for universal application in many benchmarking schemes. We emphasize this by introducing a unified terminology, the Unified Benchmark Ontology, and we describe the new measure in these terms. Hence, models are referred to as players and model performance as score.
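
The idea of turning a difference between two players' ratings into a probability can be illustrated with a small sketch. The abstract does not state the transformation, so the logistic link below is an assumption made for illustration (Elo-style systems use a link of this general shape), and the function name and player scores are hypothetical rather than taken from the paper.

```python
import math


def win_probability(score_a: float, score_b: float) -> float:
    """Probability that player A beats player B.

    Assumption: the probability depends only on the score difference,
    through a logistic link. The abstract does not give the exact form.
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Hypothetical scores for two players (models); not results from the paper.
scores = {"model_a": 1.2, "model_b": 0.4}

p_a_beats_b = win_probability(scores["model_a"], scores["model_b"])
p_b_beats_a = win_probability(scores["model_b"], scores["model_a"])

print(f"P(model_a beats model_b) = {p_a_beats_b:.3f}")
print(f"P(model_b beats model_a) = {p_b_beats_a:.3f}")
```

Under this assumed link, equal scores give a probability of 0.5, and the same score difference means the same probability whatever the absolute level of the scores. That is the kind of cross-dataset comparability the abstract says raw AUC differences lack.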