Towards Reliable Dermatology Evaluation Benchmarks

Fabian Gröger; Simone Lionetti; Philippe Gottfrois; Alvaro Gonzalez-Jimenez; Matthew Groh; Roxana Daneshjou; Labelling Consortium; Alexander A. Navarini; Marc Pouly

Proceedings of the 3rd Machine Learning for Health Symposium, PMLR 225:101-128, 2023.

Abstract

Benchmark datasets for digital dermatology unwittingly contain inaccuracies that reduce trust in model performance estimates. We propose a resource-efficient data-cleaning protocol to identify issues that escaped previous curation. The protocol leverages an existing algorithmic cleaning strategy and is followed by a confirmation process terminated by an intuitive stopping criterion. Based on confirmation by multiple dermatologists, we remove irrelevant samples and near duplicates and estimate the percentage of label errors in six dermatology image datasets for model evaluation promoted by the International Skin Imaging Collaboration (ISIC). Along with this paper, we publish revised file lists for each dataset, which should be used for model evaluation, at https://github.com/Digital-Dermatology/SelfClean-Revised-Benchmarks. Our work paves the way for more trustworthy performance assessment in digital dermatology.

Cite this Paper

BibTeX


@InProceedings{pmlr-v225-groger23a,
  title = 	 {Towards Reliable Dermatology Evaluation Benchmarks},
  author =       {Gr\"oger, Fabian and Lionetti, Simone and Gottfrois, Philippe and Gonzalez-Jimenez, Alvaro and Groh, Matthew and Daneshjou, Roxana and {Labelling Consortium} and Navarini, Alexander A. and Pouly, Marc},
  booktitle = 	 {Proceedings of the 3rd Machine Learning for Health Symposium},
  pages = 	 {101--128},
  year = 	 {2023},
  editor = 	 {Hegselmann, Stefan and Parziale, Antonio and Shanmugam, Divya and Tang, Shengpu and Asiedu, Mercy Nyamewaa and Chang, Serina and Hartvigsen, Tom and Singh, Harvineet},
  volume = 	 {225},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v225/groger23a/groger23a.pdf},
  url = 	 {https://proceedings.mlr.press/v225/groger23a.html},
  abstract = 	 {Benchmark datasets for digital dermatology unwittingly contain inaccuracies that reduce trust in model performance estimates. We propose a resource-efficient data-cleaning protocol to identify issues that escaped previous curation. The protocol leverages an existing algorithmic cleaning strategy and is followed by a confirmation process terminated by an intuitive stopping criterion. Based on confirmation by multiple dermatologists, we remove irrelevant samples and near duplicates and estimate the percentage of label errors in six dermatology image datasets for model evaluation promoted by the International Skin Imaging Collaboration (ISIC). Along with this paper, we publish revised file lists for each dataset, which should be used for model evaluation, at https://github.com/Digital-Dermatology/SelfClean-Revised-Benchmarks. Our work paves the way for more trustworthy performance assessment in digital dermatology.}
}

Endnote

%0 Conference Paper
%T Towards Reliable Dermatology Evaluation Benchmarks
%A Fabian Gröger
%A Simone Lionetti
%A Philippe Gottfrois
%A Alvaro Gonzalez-Jimenez
%A Matthew Groh
%A Roxana Daneshjou
%A Labelling Consortium
%A Alexander A. Navarini
%A Marc Pouly
%B Proceedings of the 3rd Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2023
%E Stefan Hegselmann
%E Antonio Parziale
%E Divya Shanmugam
%E Shengpu Tang
%E Mercy Nyamewaa Asiedu
%E Serina Chang
%E Tom Hartvigsen
%E Harvineet Singh
%F pmlr-v225-groger23a
%I PMLR
%P 101--128
%U https://proceedings.mlr.press/v225/groger23a.html
%V 225
%X Benchmark datasets for digital dermatology unwittingly contain inaccuracies that reduce trust in model performance estimates. We propose a resource-efficient data-cleaning protocol to identify issues that escaped previous curation. The protocol leverages an existing algorithmic cleaning strategy and is followed by a confirmation process terminated by an intuitive stopping criterion. Based on confirmation by multiple dermatologists, we remove irrelevant samples and near duplicates and estimate the percentage of label errors in six dermatology image datasets for model evaluation promoted by the International Skin Imaging Collaboration (ISIC). Along with this paper, we publish revised file lists for each dataset, which should be used for model evaluation, at https://github.com/Digital-Dermatology/SelfClean-Revised-Benchmarks. Our work paves the way for more trustworthy performance assessment in digital dermatology.

APA


Gröger, F., Lionetti, S., Gottfrois, P., Gonzalez-Jimenez, A., Groh, M., Daneshjou, R., Labelling Consortium, Navarini, A. A., & Pouly, M. (2023). Towards Reliable Dermatology Evaluation Benchmarks. Proceedings of the 3rd Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 225:101-128. Available from https://proceedings.mlr.press/v225/groger23a.html.
