Why Out-of-Distribution detection experiments are not reliable - subtle experimental details muddle the OOD detector rankings
Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:2078-2088, 2023.
Reliable detection of out-of-distribution (OOD) instances is becoming a critical requirement for machine learning systems deployed in safety-critical applications. Recently, many OOD detectors have been developed in the literature, and their performance has been evaluated using empirical studies based on well-established benchmark datasets. However, these studies do not provide a conclusive recommendation because the performance of OOD detection depends on the benchmark datasets. In this work, we want to question the reliability of the OOD detection performance numbers obtained from many of these empirical experiments. We report several experimental conditions that are not controlled and lead to significant changes in OOD detector performance and rankings of OOD methods. These include the technicalities related to how the DNN was trained (such as seed, train/test split, etc.), which do not change the accuracy of closed-set DNN models but may significantly change the performance of OOD detection methods that rely on representation from these DNNs. We performed extensive sensitivity studies in image and text domains to quantify the instability of OOD performance measures due to unintuitive experimental factors. These factors need to be more rigorously controlled and accounted for in many current OOD experiments. Experimental studies in OOD detection should improve methodological standards regarding experiment control and replication.