Position: All Current Generative Fidelity and Diversity Metrics are Flawed

Ossi Räisä, Boris Van Breugel, Mihaela Van Der Schaar
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:82016-82050, 2025.

Abstract

Any method’s development and practical application are limited by our ability to measure its reliability. The popularity of generative modeling underscores the importance of good synthetic data metrics. Unfortunately, previous works have found many failure cases in current metrics, for example, a lack of outlier robustness and unclear lower and upper bounds. We propose a list of desiderata for synthetic data metrics, and a suite of sanity checks: carefully chosen simple experiments that aim to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders the practical use of synthetic data. Our aim is to convince the research community to spend more effort on developing metrics instead of models. Additionally, by analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (not) be used.
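To make the idea of a sanity check concrete, below is a minimal, hypothetical sketch (not the authors' code; the function name, constants, and data are ours) of one such experiment, targeting the outlier-robustness failure the abstract mentions: for a k-NN-based precision metric, a single extreme point in the real data can mask an obvious fidelity problem.

```python
# A sketch of an outlier-robustness sanity check, assuming a simple
# k-NN precision estimator (fraction of synthetic points inside the
# union of k-NN balls around the real points). Illustrative only.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors

def knn_precision(real, synth, k=5):
    """Fraction of synthetic points covered by some real point's k-NN ball."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(real)
    # k-th neighbor distance per real point (column 0 is the point itself).
    radii = nn.kneighbors(real)[0][:, -1]
    d = cdist(synth, real)  # pairwise synthetic-to-real distances
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 2))         # real data: N(0, I)
synth = rng.normal(size=(1000, 2)) + 5.0  # synthetic data: shifted, poor fidelity

print(knn_precision(real, synth))         # ~0.0: the shift is correctly flagged

# One extreme real outlier near the synthetic mode acquires a huge k-NN
# radius whose ball swallows the synthetic data; the metric fails the
# check, jumping toward 1.0 even though fidelity is unchanged and poor.
real_outlier = np.vstack([real, [[5.0, 5.0]]])
print(knn_precision(real_outlier, synth))
```

A metric that passes this check would barely move when the outlier is added; a large jump signals the kind of flaw the paper's suite is designed to detect.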

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-raisa25a,
  title     = {Position: All Current Generative Fidelity and Diversity Metrics are Flawed},
  author    = {R\"{a}is\"{a}, Ossi and Van Breugel, Boris and Van Der Schaar, Mihaela},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {82016--82050},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/raisa25a/raisa25a.pdf},
  url       = {https://proceedings.mlr.press/v267/raisa25a.html},
  abstract  = {Any method’s development and practical application are limited by our ability to measure its reliability. The popularity of generative modeling underscores the importance of good synthetic data metrics. Unfortunately, previous works have found many failure cases in current metrics, for example, a lack of outlier robustness and unclear lower and upper bounds. We propose a list of desiderata for synthetic data metrics, and a suite of sanity checks: carefully chosen simple experiments that aim to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders the practical use of synthetic data. Our aim is to convince the research community to spend more effort on developing metrics instead of models. Additionally, by analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (not) be used.}
}
EndNote
%0 Conference Paper
%T Position: All Current Generative Fidelity and Diversity Metrics are Flawed
%A Ossi Räisä
%A Boris Van Breugel
%A Mihaela Van Der Schaar
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-raisa25a
%I PMLR
%P 82016--82050
%U https://proceedings.mlr.press/v267/raisa25a.html
%V 267
%X Any method’s development and practical application are limited by our ability to measure its reliability. The popularity of generative modeling underscores the importance of good synthetic data metrics. Unfortunately, previous works have found many failure cases in current metrics, for example, a lack of outlier robustness and unclear lower and upper bounds. We propose a list of desiderata for synthetic data metrics, and a suite of sanity checks: carefully chosen simple experiments that aim to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders the practical use of synthetic data. Our aim is to convince the research community to spend more effort on developing metrics instead of models. Additionally, by analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (not) be used.
APA
Räisä, O., Van Breugel, B. & Van Der Schaar, M. (2025). Position: All Current Generative Fidelity and Diversity Metrics are Flawed. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:82016-82050. Available from https://proceedings.mlr.press/v267/raisa25a.html.