Reproducible Survival Prediction with SEER Cancer Data

Stefan Hegselmann, Leonard Gruelich, Julian Varghese, Martin Dugas
Proceedings of the 3rd Machine Learning for Healthcare Conference, PMLR 85:49-66, 2018.

Abstract

Survival prediction for cancer patients can increase the prognostic accuracy and might ultimately lead to better informed decision making. To this end, many studies apply machine learning to cancer data of the Surveillance, Epidemiology, and End Results (SEER) program. The first part of this report contains a literature review to obtain a systematic overview of these studies. We identify 34 publications and extract information about experimental setups and efforts to ensure reproducibility. The review shows that only one of the identified studies mentions reproducibility and no study contains straightforward reproducible results. This motivates the second part of this work. We demonstrate the feasibility of reproducible cohort selection and survival prediction with SEER cancer data. Experiments are performed for 1- and 5-year survival of breast and lung cancer with cases diagnosed between 2004 and 2009. We compare minimal data preprocessing with 1-n encoding of categorical inputs and apply logistic regression and multilayer perceptron (MLP) models. Encoding with 1-n vectors proves beneficial throughout all experiments. For lung cancer, MLP models show a slightly superior performance. Moreover, importance of input attributes is analyzed with logistic regression weights and ablation analysis for MLPs.
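The abstract does not spell out what 1-n encoding means. The following is a minimal sketch, assuming it refers to the usual 1-of-n (one-hot) scheme, in which each categorical value becomes a vector with a single 1. The function names and category labels are hypothetical and are not taken from the paper or from SEER coding.

```python
from typing import Dict, List, Sequence


def fit_categories(values: Sequence[str]) -> Dict[str, int]:
    """Map each distinct category to a vector position (sorted so the mapping is stable)."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def encode_one_of_n(value: str, positions: Dict[str, int]) -> List[int]:
    """Return a 1-of-n vector: all zeros except a single 1 at the category's position."""
    vector = [0] * len(positions)
    # Assumption of this sketch: categories not seen during fitting stay all-zero.
    if value in positions:
        vector[positions[value]] = 1
    return vector


# Hypothetical categorical attribute; the labels are illustrative, not SEER codes.
stages = ["localized", "regional", "distant", "regional"]
positions = fit_categories(stages)
encoded = [encode_one_of_n(stage, positions) for stage in stages]
```

Sorting the categories before assigning positions keeps the encoding identical across runs, which matters when results are meant to be reproducible.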

Cite this Paper


BibTeX
@InProceedings{pmlr-v85-hegselmann18a,
  title = {Reproducible Survival Prediction with SEER Cancer Data},
  author = {Hegselmann, Stefan and Gruelich, Leonard and Varghese, Julian and Dugas, Martin},
  booktitle = {Proceedings of the 3rd Machine Learning for Healthcare Conference},
  pages = {49--66},
  year = {2018},
  editor = {Doshi-Velez, Finale and Fackler, Jim and Jung, Ken and Kale, David and Ranganath, Rajesh and Wallace, Byron and Wiens, Jenna},
  volume = {85},
  series = {Proceedings of Machine Learning Research},
  month = {17--18 Aug},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v85/hegselmann18a/hegselmann18a.pdf},
  url = {https://proceedings.mlr.press/v85/hegselmann18a.html},
  abstract = {Survival prediction for cancer patients can increase the prognostic accuracy and might ultimately lead to better informed decision making. To this end, many studies apply machine learning to cancer data of the Surveillance, Epidemiology, and End Results (SEER) program. The first part of this report contains a literature review to obtain a systematic overview of these studies. We identify 34 publications and extract information about experimental setups and efforts to ensure reproducibility. The review shows that only one of the identified studies mentions reproducibility and no study contains straightforward reproducible results. This motivates the second part of this work. We demonstrate the feasibility of reproducible cohort selection and survival prediction with SEER cancer data. Experiments are performed for 1- and 5-year survival of breast and lung cancer with cases diagnosed between 2004 and 2009. We compare minimal data preprocessing with 1-n encoding of categorical inputs and apply logistic regression and multilayer perceptron (MLP) models. Encoding with 1-n vectors proves beneficial throughout all experiments. For lung cancer, MLP models show a slightly superior performance. Moreover, importance of input attributes is analyzed with logistic regression weights and ablation analysis for MLPs.}
}
Endnote
%0 Conference Paper
%T Reproducible Survival Prediction with SEER Cancer Data
%A Stefan Hegselmann
%A Leonard Gruelich
%A Julian Varghese
%A Martin Dugas
%B Proceedings of the 3rd Machine Learning for Healthcare Conference
%C Proceedings of Machine Learning Research
%D 2018
%E Finale Doshi-Velez
%E Jim Fackler
%E Ken Jung
%E David Kale
%E Rajesh Ranganath
%E Byron Wallace
%E Jenna Wiens
%F pmlr-v85-hegselmann18a
%I PMLR
%P 49--66
%U https://proceedings.mlr.press/v85/hegselmann18a.html
%V 85
%X Survival prediction for cancer patients can increase the prognostic accuracy and might ultimately lead to better informed decision making. To this end, many studies apply machine learning to cancer data of the Surveillance, Epidemiology, and End Results (SEER) program. The first part of this report contains a literature review to obtain a systematic overview of these studies. We identify 34 publications and extract information about experimental setups and efforts to ensure reproducibility. The review shows that only one of the identified studies mentions reproducibility and no study contains straightforward reproducible results. This motivates the second part of this work. We demonstrate the feasibility of reproducible cohort selection and survival prediction with SEER cancer data. Experiments are performed for 1- and 5-year survival of breast and lung cancer with cases diagnosed between 2004 and 2009. We compare minimal data preprocessing with 1-n encoding of categorical inputs and apply logistic regression and multilayer perceptron (MLP) models. Encoding with 1-n vectors proves beneficial throughout all experiments. For lung cancer, MLP models show a slightly superior performance. Moreover, importance of input attributes is analyzed with logistic regression weights and ablation analysis for MLPs.
APA
Hegselmann, S., Gruelich, L., Varghese, J. & Dugas, M. (2018). Reproducible Survival Prediction with SEER Cancer Data. Proceedings of the 3rd Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 85:49-66. Available from https://proceedings.mlr.press/v85/hegselmann18a.html.
