Reproducible Survival Prediction with SEER Cancer Data


Stefan Hegselmann, Leonard Gruelich, Julian Varghese, Martin Dugas
Proceedings of the 3rd Machine Learning for Healthcare Conference, PMLR 85:49-66, 2018.


Survival prediction for cancer patients can increase prognostic accuracy and might ultimately lead to better-informed decision making. To this end, many studies apply machine learning to cancer data from the Surveillance, Epidemiology, and End Results (SEER) program. The first part of this report is a literature review that provides a systematic overview of these studies. We identify 34 publications and extract information about experimental setups and efforts to ensure reproducibility. The review shows that only one of the identified studies mentions reproducibility and that no study contains straightforwardly reproducible results. This motivates the second part of this work, in which we demonstrate the feasibility of reproducible cohort selection and survival prediction with SEER cancer data. Experiments are performed for 1- and 5-year survival of breast and lung cancer with cases diagnosed between 2004 and 2009. We compare minimal data preprocessing with 1-n encoding of categorical inputs and apply logistic regression and multilayer perceptron (MLP) models. Encoding with 1-n vectors proves beneficial in all experiments. For lung cancer, MLP models perform slightly better than logistic regression. Moreover, the importance of input attributes is analyzed using logistic regression weights and, for MLPs, ablation analysis.
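As an illustration of the 1-n encoding of categorical inputs mentioned above, the following is a minimal sketch in plain Python. It assumes that "1-n encoding" denotes the usual one-of-n (one-hot) scheme, in which each category is mapped to a binary vector with a single 1. The function names and the toy attribute values are hypothetical; they are not taken from the paper or from actual SEER codes.

```python
from typing import Dict, List, Sequence


def fit_categories(column: Sequence[str]) -> Dict[str, int]:
    """Assign each distinct category a fixed vector position.

    Sorting makes the mapping deterministic across runs, which matters
    when results are meant to be reproducible.
    """
    return {value: index for index, value in enumerate(sorted(set(column)))}


def encode_one_of_n(column: Sequence[str], categories: Dict[str, int]) -> List[List[int]]:
    """Turn each categorical value into a binary vector with a single 1.

    Values that were not seen when fitting the categories are encoded as
    an all-zero vector (an assumption of this sketch).
    """
    vectors = []
    for value in column:
        vector = [0] * len(categories)
        if value in categories:
            vector[categories[value]] = 1
        vectors.append(vector)
    return vectors


# Hypothetical toy attribute; the values are illustrative, not SEER codes.
laterality = ["left", "right", "left", "bilateral"]
categories = fit_categories(laterality)
encoded = encode_one_of_n(laterality, categories)
```

With this representation, a model such as logistic regression learns a separate weight per category instead of treating the category codes as points on a numeric scale.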
