Baselines for Chest X-Ray Report Generation

William Boag, Tzu-Ming Harry Hsu, Matthew McDermott, Gabriela Berner, Emily Alsentzer, Peter Szolovits
Proceedings of the Machine Learning for Health NeurIPS Workshop, PMLR 116:126-140, 2020.

Abstract

With advances in deep learning and image captioning over the past few years, researchers have recently begun applying computer vision methods to radiology report generation. Typically, these generated reports have been evaluated using general-domain natural language generation (NLG) metrics like CIDEr and BLEU. However, there is little work assessing how appropriate these metrics are for healthcare, where correctness is critically important. In this work, we profile a number of models for automatic report generation on the MIMIC-CXR dataset, including random report retrieval, nearest-neighbor report retrieval, n-gram language models, and neural network approaches. These models serve to calibrate our understanding of what the opaque general-domain NLG metrics mean. In particular, we find that the standard NLG metrics (e.g., BLEU, CIDEr) actually assign higher scores to random (but grammatical) clinical sentences than to n-gram-derived sentences, despite the n-gram sentences achieving higher clinical accuracy. This casts doubt on the usefulness of these domain-agnostic metrics, though, unsurprisingly, we find that the best performance on both CIDEr/BLEU and clinical correctness was achieved by more sophisticated models.
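To make the metric-calibration point concrete, the following is a minimal sketch of scoring a generated report sentence against a reference with BLEU, using NLTK. This is an illustration under assumptions, not the paper's actual evaluation pipeline, and the example sentences are invented rather than drawn from MIMIC-CXR.

```python
# Minimal sketch: BLEU-4 between a generated and a reference report sentence.
# Not the paper's evaluation code; the sentences below are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the lungs are clear . there is no pleural effusion or pneumothorax .".split()
generated = "lungs are clear . no pleural effusion is seen .".split()

# Smoothing matters here: short clinical sentences often share no 3- or
# 4-grams with the reference, and unsmoothed BLEU-4 then collapses to 0.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], generated,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")
```

Note that a score like this rewards surface n-gram overlap only: it cannot tell whether the generated sentence asserts the same findings as the reference, which is precisely the gap between NLG metrics and clinical correctness that the paper highlights.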

Cite this Paper


BibTeX
@InProceedings{pmlr-v116-boag20a,
  title     = {{Baselines for Chest X-Ray Report Generation}},
  author    = {Boag, William and Hsu, Tzu-Ming Harry and McDermott, Matthew and Berner, Gabriela and Alsentzer, Emily and Szolovits, Peter},
  booktitle = {Proceedings of the Machine Learning for Health NeurIPS Workshop},
  pages     = {126--140},
  year      = {2020},
  editor    = {Dalca, Adrian V. and McDermott, Matthew B.A. and Alsentzer, Emily and Finlayson, Samuel G. and Oberst, Michael and Falck, Fabian and Beaulieu-Jones, Brett},
  volume    = {116},
  series    = {Proceedings of Machine Learning Research},
  month     = {13 Dec},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v116/boag20a/boag20a.pdf},
  url       = {https://proceedings.mlr.press/v116/boag20a.html}
}
Endnote
%0 Conference Paper
%T Baselines for Chest X-Ray Report Generation
%A William Boag
%A Tzu-Ming Harry Hsu
%A Matthew McDermott
%A Gabriela Berner
%A Emily Alsentzer
%A Peter Szolovits
%B Proceedings of the Machine Learning for Health NeurIPS Workshop
%C Proceedings of Machine Learning Research
%D 2020
%E Adrian V. Dalca
%E Matthew B.A. McDermott
%E Emily Alsentzer
%E Samuel G. Finlayson
%E Michael Oberst
%E Fabian Falck
%E Brett Beaulieu-Jones
%F pmlr-v116-boag20a
%I PMLR
%P 126--140
%U https://proceedings.mlr.press/v116/boag20a.html
%V 116
APA
Boag, W., Hsu, T.H., McDermott, M., Berner, G., Alsentzer, E. & Szolovits, P. (2020). Baselines for Chest X-Ray Report Generation. Proceedings of the Machine Learning for Health NeurIPS Workshop, in Proceedings of Machine Learning Research 116:126-140. Available from https://proceedings.mlr.press/v116/boag20a.html.