Appropriate Evaluation of Diagnostic Utility of Machine Learning Algorithm Generated Images

Young Joon Kwon, Danielle Toussie, Lea Azour, Jose Concepcion, Corey Eber, G. Anthony Reina, Ping Tak Peter Tang, Amish H. Doshi, Eric K. Oermann, Anthony B. Costa
Proceedings of the Machine Learning for Health NeurIPS Workshop, PMLR 136:179-193, 2020.

Abstract

Generative machine learning (ML) methods can reduce time, cost, and radiation associated with medical image acquisition, compression, or generation techniques. While quantitative metrics are commonly used in the evaluation of ML-generated images, it is unknown how well these quantitative metrics relate to the diagnostic utility of images. Here, fellowship-trained radiologists provided diagnoses and qualitative evaluations on chest radiographs reconstructed with either the current standard JPEG2000 or variational autoencoder (VAE) techniques. Cohen’s kappa coefficient measured the agreement of diagnoses based on different reconstructions. Methods that produced similar Fréchet inception distance (FID) showed similar diagnostic performance. Thus, in place of time-intensive expert radiologist verification, an appropriate target FID – an objective quantitative metric – can evaluate the clinical utility of ML-generated medical images.
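For readers unfamiliar with the agreement statistic named in the abstract, the following is a minimal sketch of Cohen's kappa for two sets of categorical diagnoses. It is not the authors' code: the abstract does not say how kappa was computed or which diagnostic labels were used, so the function name `cohens_kappa`, the label strings, and both example reading lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of cases where both readings match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each reading's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both readings use one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical diagnoses of the same six radiographs, read once from the
# original images and once from a reconstruction (illustrative data only).
readings_original = ["normal", "abnormal", "abnormal", "normal", "abnormal", "normal"]
readings_reconstructed = ["normal", "abnormal", "normal", "normal", "abnormal", "normal"]

kappa = cohens_kappa(readings_original, readings_reconstructed)
print(f"kappa = {kappa:.3f}")
```

In this toy case the readings agree on 5 of 6 images (observed agreement 0.833) while chance agreement is 0.5, so kappa is about 0.667. A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.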

Cite this Paper


BibTeX
@InProceedings{pmlr-v136-kwon20a,
  title     = {Appropriate Evaluation of Diagnostic Utility of Machine Learning Algorithm Generated Images},
  author    = {Kwon, Young Joon and Toussie, Danielle and Azour, Lea and Concepcion, Jose and Eber, Corey and Reina, G. Anthony and Tang, Ping Tak Peter and Doshi, Amish H. and Oermann, Eric K. and Costa, Anthony B.},
  booktitle = {Proceedings of the Machine Learning for Health NeurIPS Workshop},
  pages     = {179--193},
  year      = {2020},
  editor    = {Emily Alsentzer and Matthew B. A. McDermott and Fabian Falck and Suproteem K. Sarkar and Subhrajit Roy and Stephanie L. Hyland},
  volume    = {136},
  series    = {Proceedings of Machine Learning Research},
  month     = {11 Dec},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v136/kwon20a/kwon20a.pdf},
  url       = {http://proceedings.mlr.press/v136/kwon20a.html},
  abstract  = {Generative machine learning (ML) methods can reduce time, cost, and radiation associated with medical image acquisition, compression, or generation techniques. While quantitative metrics are commonly used in the evaluation of ML generated images, it is unknown how well these quantitative metrics relate to the diagnostic utility of images. Here, fellowship-trained radiologists provided diagnoses and qualitative evaluations on chest radiographs reconstructed from the current standard JPEG2000 or variational autoencoder (VAE) techniques. Cohen’s kappa coefficient measured the agreement of diagnoses based on different reconstructions. Methods that produced similar Fréchet inception distance (FID) showed similar diagnostic performances. Thus in place of time-intensive expert radiologist verification, an appropriate target FID – an objective quantitative metric – can evaluate the clinical utility of ML generated medical images.}
}
Endnote
%0 Conference Paper
%T Appropriate Evaluation of Diagnostic Utility of Machine Learning Algorithm Generated Images
%A Young Joon Kwon
%A Danielle Toussie
%A Lea Azour
%A Jose Concepcion
%A Corey Eber
%A G. Anthony Reina
%A Ping Tak Peter Tang
%A Amish H. Doshi
%A Eric K. Oermann
%A Anthony B. Costa
%B Proceedings of the Machine Learning for Health NeurIPS Workshop
%C Proceedings of Machine Learning Research
%D 2020
%E Emily Alsentzer
%E Matthew B. A. McDermott
%E Fabian Falck
%E Suproteem K. Sarkar
%E Subhrajit Roy
%E Stephanie L. Hyland
%F pmlr-v136-kwon20a
%I PMLR
%P 179--193
%U http://proceedings.mlr.press/v136/kwon20a.html
%V 136
%X Generative machine learning (ML) methods can reduce time, cost, and radiation associated with medical image acquisition, compression, or generation techniques. While quantitative metrics are commonly used in the evaluation of ML generated images, it is unknown how well these quantitative metrics relate to the diagnostic utility of images. Here, fellowship-trained radiologists provided diagnoses and qualitative evaluations on chest radiographs reconstructed from the current standard JPEG2000 or variational autoencoder (VAE) techniques. Cohen’s kappa coefficient measured the agreement of diagnoses based on different reconstructions. Methods that produced similar Fréchet inception distance (FID) showed similar diagnostic performances. Thus in place of time-intensive expert radiologist verification, an appropriate target FID – an objective quantitative metric – can evaluate the clinical utility of ML generated medical images.
APA
Kwon, Y.J., Toussie, D., Azour, L., Concepcion, J., Eber, C., Reina, G.A., Tang, P.T.P., Doshi, A.H., Oermann, E.K. & Costa, A.B. (2020). Appropriate Evaluation of Diagnostic Utility of Machine Learning Algorithm Generated Images. Proceedings of the Machine Learning for Health NeurIPS Workshop, in Proceedings of Machine Learning Research 136:179-193. Available from http://proceedings.mlr.press/v136/kwon20a.html.
