Which is Better for Learning with Noisy Labels: The Semi-supervised Method or Modeling Label Noise?

Yu Yao, Mingming Gong, Yuxuan Du, Jun Yu, Bo Han, Kun Zhang, Tongliang Liu
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:39660-39673, 2023.

Abstract

In real life, accurately annotating large-scale datasets is sometimes difficult, so the datasets used to train deep learning models are likely to contain label noise. Two typical approaches have been proposed to make use of such datasets. One employs semi-supervised methods, treating confident examples as labeled data and unconfident examples as unlabeled data. The other models the label noise and designs statistically consistent classifiers. A natural question remains open: which one should be used for a specific real-world application? In this paper, we answer this question from the perspective of the causal data-generating process. Specifically, the performance of semi-supervised methods depends heavily on the data-generating process, whereas methods that model label noise are not influenced by it. For example, if a given dataset has a causal generative structure in which the features cause the label, semi-supervised methods would not be helpful. When the causal structure is unknown, we provide an intuitive method to discover it for a given dataset containing label noise.

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-yao23a,
  title     = {Which is Better for Learning with Noisy Labels: The Semi-supervised Method or Modeling Label Noise?},
  author    = {Yao, Yu and Gong, Mingming and Du, Yuxuan and Yu, Jun and Han, Bo and Zhang, Kun and Liu, Tongliang},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {39660--39673},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/yao23a/yao23a.pdf},
  url       = {https://proceedings.mlr.press/v202/yao23a.html},
  abstract  = {In real life, accurately annotating large-scale datasets is sometimes difficult. Datasets used for training deep learning models are likely to contain label noise. To make use of the dataset containing label noise, two typical methods have been proposed. One is to employ the semi-supervised method by exploiting labeled confident examples and unlabeled unconfident examples. The other one is to model label noise and design statistically consistent classifiers. A natural question remains unsolved: which one should be used for a specific real-world application? In this paper, we answer the question from the perspective of causal data generative process. Specifically, the performance of the semi-supervised based method depends heavily on the data generative process while the method modeling label-noise is not influenced by the generation process. For example, for a given dataset, if it has a causal generative structure that the features cause the label, the semi-supervised based method would not be helpful. When the causal structure is unknown, we provide an intuitive method to discover the causal structure for a given dataset containing label noise.}
}
Endnote
%0 Conference Paper
%T Which is Better for Learning with Noisy Labels: The Semi-supervised Method or Modeling Label Noise?
%A Yu Yao
%A Mingming Gong
%A Yuxuan Du
%A Jun Yu
%A Bo Han
%A Kun Zhang
%A Tongliang Liu
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-yao23a
%I PMLR
%P 39660--39673
%U https://proceedings.mlr.press/v202/yao23a.html
%V 202
%X In real life, accurately annotating large-scale datasets is sometimes difficult. Datasets used for training deep learning models are likely to contain label noise. To make use of the dataset containing label noise, two typical methods have been proposed. One is to employ the semi-supervised method by exploiting labeled confident examples and unlabeled unconfident examples. The other one is to model label noise and design statistically consistent classifiers. A natural question remains unsolved: which one should be used for a specific real-world application? In this paper, we answer the question from the perspective of causal data generative process. Specifically, the performance of the semi-supervised based method depends heavily on the data generative process while the method modeling label-noise is not influenced by the generation process. For example, for a given dataset, if it has a causal generative structure that the features cause the label, the semi-supervised based method would not be helpful. When the causal structure is unknown, we provide an intuitive method to discover the causal structure for a given dataset containing label noise.
APA
Yao, Y., Gong, M., Du, Y., Yu, J., Han, B., Zhang, K. & Liu, T. (2023). Which is Better for Learning with Noisy Labels: The Semi-supervised Method or Modeling Label Noise?. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:39660-39673. Available from https://proceedings.mlr.press/v202/yao23a.html.

Related Material