Minimax Risk Classifiers for Mislabeled Data: a Study on Patient Outcome Prediction Tasks
Proceedings of the 9th Machine Learning for Healthcare Conference, PMLR 252, 2024.
Abstract
Healthcare datasets are often affected by incorrect or mislabeled data, due to imperfect annotations, data collection issues, ambiguity, and subjective interpretations. Such incorrect labels, referred to as "noisy labels", can significantly degrade the performance of supervised learning models: they hinder the algorithm's ability to accurately capture the true underlying patterns in the observed data. More importantly, evaluating the performance of a classifier when only noisy test labels are available is a significant complication. We tackle the challenge of trusting the labeling process both in training and testing, as noisy patient outcome labels in healthcare raise methodological and ethical considerations. We propose a novel adaptation of Minimax Risk Classifiers (MRCs) for data subject to noisy labels, in both training and evaluation. We show that the upper bound of the MRC's expected loss can serve as a useful estimator of the classifier's performance, especially in situations where clean test data are not available. We demonstrate the benefits of the proposed methodology in healthcare tasks where patient outcomes are predicted from mislabeled data. The proposed technique is accurate and stable, avoiding overly optimistic assessments of prediction error, which are particularly harmful in patient outcome prediction tasks in healthcare.
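
The following is a minimal sketch, not the authors' implementation, of the idea that an MRC's upper bound on expected loss can stand in for a performance estimate when only noisy labels are available. It assumes the open-source MRCpy package (which implements standard MRCs with a fit / predict / get_upper_bound interface); the synthetic data, the flip_labels helper, and the 20% noise rate are illustrative assumptions, and the paper's noisy-label adaptation of the MRC is not reproduced here.

```python
# Sketch only: standard MRC from MRCpy trained on synthetically mislabeled
# data; the paper's noisy-label adaptation is not implemented here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from MRCpy import MRC

rng = np.random.default_rng(0)

# Synthetic binary task standing in for a patient-outcome dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def flip_labels(y, rate, rng):
    """Hypothetical helper: symmetric noise, flip each binary label w.p. `rate`."""
    y_noisy = y.copy()
    flip = rng.random(len(y)) < rate
    y_noisy[flip] = 1 - y_noisy[flip]
    return y_noisy

noise_rate = 0.2  # assumed noise level for the simulation
y_tr_noisy = flip_labels(y_tr, noise_rate, rng)
y_te_noisy = flip_labels(y_te, noise_rate, rng)

clf = MRC(loss='0-1')      # 0-1 loss MRC with MRCpy's default feature mapping
clf.fit(X_tr, y_tr_noisy)  # trained on mislabeled data

upper = clf.get_upper_bound()  # minimax upper bound on the expected loss
noisy_err = np.mean(clf.predict(X_te) != y_te_noisy)  # error vs. noisy labels

print(f"MRC upper bound on expected loss: {upper:.3f}")
print(f"Test error against noisy labels:  {noisy_err:.3f}")
```

In this setup the upper bound is computed from the training data alone, so it can be reported even when no clean test labels exist, which is the situation the abstract highlights.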