Multimodal Machine Learning for Automated ICD Coding

Keyang Xu, Mike Lam, Jingzhi Pang, Xin Gao, Charlotte Band, Piyush Mathur, Frank Papay, Ashish K. Khanna, Jacek B. Cywinski, Kamal Maheshwari, Pengtao Xie, Eric P. Xing
Proceedings of the 4th Machine Learning for Healthcare Conference, PMLR 106:197-215, 2019.

Abstract

This study presents a multimodal machine learning model to predict ICD-10 diagnostic codes. We developed separate machine learning models that can handle data from different modalities, including unstructured text, semi-structured text and structured tabular data. We further employed an ensemble method to integrate all modality-specific models to generate ICD codes. Key evidence was also extracted to make our prediction more convincing and explainable. We used the Medical Information Mart for Intensive Care III (MIMIC-III) dataset to validate our approach. For ICD code prediction, our best-performing model (micro-F1 = 0.7633, micro-AUC = 0.9541) significantly outperforms other baseline models including TF-IDF (micro-F1 = 0.6721, micro-AUC = 0.7879) and Text-CNN model (micro-F1 = 0.6569, micro-AUC = 0.9235). For interpretability, our approach achieves a Jaccard Similarity Coecient (JSC) of 0.1806 on text data and 0.3105 on tabular data, where well-trained physicians achieve 0.2780 and 0.5002 respectively.

Cite this Paper


BibTeX
@InProceedings{pmlr-v106-xu19a, title = {Multimodal Machine Learning for Automated ICD Coding}, author = {Xu, Keyang and Lam, Mike and Pang, Jingzhi and Gao, Xin and Band, Charlotte and Mathur, Piyush and Papay, Frank and Khanna, Ashish K. and Cywinski, Jacek B. and Maheshwari, Kamal and Xie, Pengtao and Xing, Eric P.}, booktitle = {Proceedings of the 4th Machine Learning for Healthcare Conference}, pages = {197--215}, year = {2019}, editor = {Doshi-Velez, Finale and Fackler, Jim and Jung, Ken and Kale, David and Ranganath, Rajesh and Wallace, Byron and Wiens, Jenna}, volume = {106}, series = {Proceedings of Machine Learning Research}, month = {09--10 Aug}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v106/xu19a/xu19a.pdf}, url = {https://proceedings.mlr.press/v106/xu19a.html}, abstract = {This study presents a multimodal machine learning model to predict ICD-10 diagnostic codes. We developed separate machine learning models that can handle data from different modalities, including unstructured text, semi-structured text and structured tabular data. We further employed an ensemble method to integrate all modality-specific models to generate ICD codes. Key evidence was also extracted to make our prediction more convincing and explainable. We used the Medical Information Mart for Intensive Care III (MIMIC-III) dataset to validate our approach. For ICD code prediction, our best-performing model (micro-F1 = 0.7633, micro-AUC = 0.9541) significantly outperforms other baseline models including TF-IDF (micro-F1 = 0.6721, micro-AUC = 0.7879) and Text-CNN model (micro-F1 = 0.6569, micro-AUC = 0.9235). For interpretability, our approach achieves a Jaccard Similarity Coecient (JSC) of 0.1806 on text data and 0.3105 on tabular data, where well-trained physicians achieve 0.2780 and 0.5002 respectively.} }
Endnote
%0 Conference Paper %T Multimodal Machine Learning for Automated ICD Coding %A Keyang Xu %A Mike Lam %A Jingzhi Pang %A Xin Gao %A Charlotte Band %A Piyush Mathur %A Frank Papay %A Ashish K. Khanna %A Jacek B. Cywinski %A Kamal Maheshwari %A Pengtao Xie %A Eric P. Xing %B Proceedings of the 4th Machine Learning for Healthcare Conference %C Proceedings of Machine Learning Research %D 2019 %E Finale Doshi-Velez %E Jim Fackler %E Ken Jung %E David Kale %E Rajesh Ranganath %E Byron Wallace %E Jenna Wiens %F pmlr-v106-xu19a %I PMLR %P 197--215 %U https://proceedings.mlr.press/v106/xu19a.html %V 106 %X This study presents a multimodal machine learning model to predict ICD-10 diagnostic codes. We developed separate machine learning models that can handle data from different modalities, including unstructured text, semi-structured text and structured tabular data. We further employed an ensemble method to integrate all modality-specific models to generate ICD codes. Key evidence was also extracted to make our prediction more convincing and explainable. We used the Medical Information Mart for Intensive Care III (MIMIC-III) dataset to validate our approach. For ICD code prediction, our best-performing model (micro-F1 = 0.7633, micro-AUC = 0.9541) significantly outperforms other baseline models including TF-IDF (micro-F1 = 0.6721, micro-AUC = 0.7879) and Text-CNN model (micro-F1 = 0.6569, micro-AUC = 0.9235). For interpretability, our approach achieves a Jaccard Similarity Coecient (JSC) of 0.1806 on text data and 0.3105 on tabular data, where well-trained physicians achieve 0.2780 and 0.5002 respectively.
APA
Xu, K., Lam, M., Pang, J., Gao, X., Band, C., Mathur, P., Papay, F., Khanna, A.K., Cywinski, J.B., Maheshwari, K., Xie, P. & Xing, E.P.. (2019). Multimodal Machine Learning for Automated ICD Coding. Proceedings of the 4th Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 106:197-215 Available from https://proceedings.mlr.press/v106/xu19a.html.

Related Material