Reliable Fidelity and Diversity Metrics for Generative Models

Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, Jaejun Yoo
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:7176-7185, 2020.

Abstract

Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Frechet Inception Distance (FID) score. Since it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest version of the precision and recall metrics are not reliable yet. For example, they fail to detect the match between two identical distributions, they are not robust against outliers, and the evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics.
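The density and coverage metrics described above compare fake samples against k-nearest-neighbor balls around real samples: density counts how often fake samples fall inside those balls (normalized by k), while coverage measures the fraction of real balls that contain at least one fake sample. The sketch below is an illustrative NumPy implementation of that idea, not the authors' official code (the paper's released `prdc` package operates on feature embeddings); the function name, array shapes, and default `k` are assumptions for this example.

```python
import numpy as np

def density_coverage(real, fake, k=5):
    """Illustrative sketch of density and coverage over k-NN balls.

    real: (N, d) array of real-sample features.
    fake: (M, d) array of generated-sample features.
    """
    # Pairwise distances among real samples; for each real point, take
    # the distance to its k-th nearest real neighbor (sorted column 0 is
    # the point itself, at distance zero).
    rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    kth_radii = np.sort(rr, axis=1)[:, k]

    # Distances from each real sample to each fake sample.
    rf = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    # inside[i, j] is True when fake j lies in the k-NN ball of real i.
    inside = rf < kth_radii[:, None]

    # Density: ball memberships per fake sample, normalized by k.
    density = inside.sum() / (k * fake.shape[0])
    # Coverage: fraction of real balls containing at least one fake.
    coverage = inside.any(axis=1).mean()
    return density, coverage
```

A useful sanity check, which motivates the metrics in the first place, is that feeding two identical sample sets yields density 1 and coverage 1, whereas widely separated sets yield 0 for both.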

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-naeem20a,
  title     = {Reliable Fidelity and Diversity Metrics for Generative Models},
  author    = {Naeem, Muhammad Ferjad and Oh, Seong Joon and Uh, Youngjung and Choi, Yunjey and Yoo, Jaejun},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {7176--7185},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/naeem20a/naeem20a.pdf},
  url       = {https://proceedings.mlr.press/v119/naeem20a.html},
  abstract  = {Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Frechet Inception Distance (FID) score. Since it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest version of the precision and recall metrics are not reliable yet. For example, they fail to detect the match between two identical distributions, they are not robust against outliers, and the evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics.}
}
Endnote
%0 Conference Paper
%T Reliable Fidelity and Diversity Metrics for Generative Models
%A Muhammad Ferjad Naeem
%A Seong Joon Oh
%A Youngjung Uh
%A Yunjey Choi
%A Jaejun Yoo
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-naeem20a
%I PMLR
%P 7176--7185
%U https://proceedings.mlr.press/v119/naeem20a.html
%V 119
%X Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Frechet Inception Distance (FID) score. Since it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest version of the precision and recall metrics are not reliable yet. For example, they fail to detect the match between two identical distributions, they are not robust against outliers, and the evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics.
APA
Naeem, M.F., Oh, S.J., Uh, Y., Choi, Y. & Yoo, J. (2020). Reliable Fidelity and Diversity Metrics for Generative Models. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:7176-7185. Available from https://proceedings.mlr.press/v119/naeem20a.html.