Classifier Performance Estimation with Unbalanced, Partially Labeled Data

Benjamin A. Miller, Jeremy Vila, Malina Kirn, Joseph R. Zipkin
Proceedings of The International Workshop on Cost-Sensitive Learning, PMLR 88:4-16, 2018.

Abstract

Class imbalance and lack of ground truth are two significant problems in modern machine learning research. These problems are especially pressing in operational contexts where the total number of data points is extremely large and the cost of obtaining labels is very high. In the face of these issues, accurate estimation of the performance of a detection or classification system is crucial to inform decisions based on the observations. This paper presents a framework for estimating performance of a binary classifier in such a context. We focus on the scenario where each set of measurements has been reduced to a score, and the operator only investigates data when the score exceeds a threshold. The operator is blind to the number of missed detections, so performance estimation targets two quantities: recall and the derivative of precision with respect to recall. Measuring with respect to error in these two metrics, simulations in this context demonstrate that labeling outliers not only outperforms random labeling, but often matches performance of an adaptive method that attempts to choose the optimal data for labeling. Application to real anomaly detection data confirms the utility of the approach, and suggests direction for future work.
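To make the estimation targets concrete, below is a minimal sketch (not the paper's method): given detector scores and a fixed investigation threshold on fully labeled synthetic data, it computes recall at the threshold and a finite-difference approximation of the derivative of precision with respect to recall. The score distributions, class sizes, and threshold are illustrative assumptions.

    import numpy as np

    # Illustrative synthetic setting: heavy class imbalance, Gaussian score
    # distributions. These choices are assumptions for the sketch only.
    rng = np.random.default_rng(0)
    n_neg, n_pos = 100_000, 100
    scores = np.concatenate([rng.normal(0.0, 1.0, n_neg),
                             rng.normal(2.5, 1.0, n_pos)])
    labels = np.concatenate([np.zeros(n_neg, dtype=bool),
                             np.ones(n_pos, dtype=bool)])

    def precision_recall_at(threshold):
        """Precision and recall when all scores above `threshold` are flagged."""
        flagged = scores > threshold
        tp = np.count_nonzero(labels & flagged)
        precision = tp / max(np.count_nonzero(flagged), 1)
        recall = tp / np.count_nonzero(labels)
        return precision, recall

    t = 2.0  # operator investigates only scores above this threshold
    p, r = precision_recall_at(t)

    # Finite-difference estimate of dPrecision/dRecall at the operating point:
    # lower the threshold slightly and measure the change in precision per
    # unit change in recall.
    p2, r2 = precision_recall_at(t - 0.25)
    dp_dr = (p2 - p) / (r2 - r)

    print(f"recall at threshold: {r:.3f}")
    print(f"precision at threshold: {p:.3f}")
    print(f"approx. dPrecision/dRecall: {dp_dr:.3f}")

In the paper's setting, labels below the threshold are unavailable to the operator, which is precisely why these two quantities must be estimated rather than computed directly as done here.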

Cite this Paper


BibTeX
@InProceedings{pmlr-v88-miller18a,
  title     = {Classifier Performance Estimation with Unbalanced, Partially Labeled Data},
  author    = {Miller, Benjamin A. and Vila, Jeremy and Kirn, Malina and Zipkin, Joseph R.},
  booktitle = {Proceedings of The International Workshop on Cost-Sensitive Learning},
  pages     = {4--16},
  year      = {2018},
  editor    = {Torgo, Luís and Matwin, Stan and Weiss, Gary and Moniz, Nuno and Branco, Paula},
  volume    = {88},
  series    = {Proceedings of Machine Learning Research},
  month     = {05 May},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v88/miller18a/miller18a.pdf},
  url       = {https://proceedings.mlr.press/v88/miller18a.html}
}
Endnote
%0 Conference Paper
%T Classifier Performance Estimation with Unbalanced, Partially Labeled Data
%A Benjamin A. Miller
%A Jeremy Vila
%A Malina Kirn
%A Joseph R. Zipkin
%B Proceedings of The International Workshop on Cost-Sensitive Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Luís Torgo
%E Stan Matwin
%E Gary Weiss
%E Nuno Moniz
%E Paula Branco
%F pmlr-v88-miller18a
%I PMLR
%P 4--16
%U https://proceedings.mlr.press/v88/miller18a.html
%V 88
APA
Miller, B.A., Vila, J., Kirn, M. & Zipkin, J.R. (2018). Classifier Performance Estimation with Unbalanced, Partially Labeled Data. Proceedings of The International Workshop on Cost-Sensitive Learning, in Proceedings of Machine Learning Research 88:4-16. Available from https://proceedings.mlr.press/v88/miller18a.html.