Deep Clustering with Incomplete Noisy Pairwise Annotations: A Geometric Regularization Approach

Tri Nguyen, Shahana Ibrahim, Xiao Fu
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:25980-26007, 2023.

Abstract

The recent integration of deep learning and pairwise similarity annotation-based constrained clustering—i.e., deep constrained clustering (DCC)—has proven effective for incorporating weak supervision into massive data clustering: Less than 1% of pair similarity annotations can often substantially enhance the clustering accuracy. However, beyond empirical successes, there is a lack of understanding of DCC. In addition, many DCC paradigms are sensitive to annotation noise, but performance-guaranteed noisy DCC methods have been largely elusive. This work first takes a deep look into a recently emerged logistic loss function of DCC, and characterizes its theoretical properties. Our result shows that the logistic DCC loss ensures the identifiability of data membership under reasonable conditions, which may shed light on its effectiveness in practice. Building upon this understanding, a new loss function based on geometric factor analysis is proposed to fend against noisy annotations. It is shown that even under unknown annotation confusions, the data membership can still be provably identified under our proposed learning criterion. The proposed approach is tested over multiple datasets to validate our claims.

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-nguyen23d,
  title     = {Deep Clustering with Incomplete Noisy Pairwise Annotations: A Geometric Regularization Approach},
  author    = {Nguyen, Tri and Ibrahim, Shahana and Fu, Xiao},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {25980--26007},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/nguyen23d/nguyen23d.pdf},
  url       = {https://proceedings.mlr.press/v202/nguyen23d.html},
  abstract  = {The recent integration of deep learning and pairwise similarity annotation-based constrained clustering---i.e., deep constrained clustering (DCC)---has proven effective for incorporating weak supervision into massive data clustering: Less than 1\% of pair similarity annotations can often substantially enhance the clustering accuracy. However, beyond empirical successes, there is a lack of understanding of DCC. In addition, many DCC paradigms are sensitive to annotation noise, but performance-guaranteed noisy DCC methods have been largely elusive. This work first takes a deep look into a recently emerged logistic loss function of DCC, and characterizes its theoretical properties. Our result shows that the logistic DCC loss ensures the identifiability of data membership under reasonable conditions, which may shed light on its effectiveness in practice. Building upon this understanding, a new loss function based on geometric factor analysis is proposed to fend against noisy annotations. It is shown that even under unknown annotation confusions, the data membership can still be provably identified under our proposed learning criterion. The proposed approach is tested over multiple datasets to validate our claims.}
}
Endnote
%0 Conference Paper
%T Deep Clustering with Incomplete Noisy Pairwise Annotations: A Geometric Regularization Approach
%A Tri Nguyen
%A Shahana Ibrahim
%A Xiao Fu
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-nguyen23d
%I PMLR
%P 25980--26007
%U https://proceedings.mlr.press/v202/nguyen23d.html
%V 202
%X The recent integration of deep learning and pairwise similarity annotation-based constrained clustering—i.e., deep constrained clustering (DCC)—has proven effective for incorporating weak supervision into massive data clustering: Less than 1% of pair similarity annotations can often substantially enhance the clustering accuracy. However, beyond empirical successes, there is a lack of understanding of DCC. In addition, many DCC paradigms are sensitive to annotation noise, but performance-guaranteed noisy DCC methods have been largely elusive. This work first takes a deep look into a recently emerged logistic loss function of DCC, and characterizes its theoretical properties. Our result shows that the logistic DCC loss ensures the identifiability of data membership under reasonable conditions, which may shed light on its effectiveness in practice. Building upon this understanding, a new loss function based on geometric factor analysis is proposed to fend against noisy annotations. It is shown that even under unknown annotation confusions, the data membership can still be provably identified under our proposed learning criterion. The proposed approach is tested over multiple datasets to validate our claims.
APA
Nguyen, T., Ibrahim, S., & Fu, X. (2023). Deep Clustering with Incomplete Noisy Pairwise Annotations: A Geometric Regularization Approach. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:25980-26007. Available from https://proceedings.mlr.press/v202/nguyen23d.html.
