Crowdclustering with Partition Labels

Junxiang Chen, Yale Chang, Peter Castaldi, Michael Cho, Brian Hobbs, Jennifer Dy
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR 84:1127-1136, 2018.

Abstract

Crowdclustering is a practical way to incorporate domain knowledge into clustering, by combining opinions from multiple domain experts. Existing crowdclustering methods analyze binary pairwise similarity labels. However, in some applications, experts might provide partition labels. If we convert partition labels into pairwise similarity, then it would be difficult to understand the relationships between clustering solutions from different experts. In this paper, we propose a crowdclustering model that directly analyzes partition labels. The proposed model adopts a novel approach based on a modified multinomial logistic regression model, which simultaneously learns the number of clusters and determines hyper-planes that partition samples into clusters. The proposed model also learns a mapping between the latent clusters and expert labels, revealing the agreements and disagreements between experts. Experiments on benchmark data demonstrate that the proposed model simultaneously learns the number of clusters and discovers the clustering structure. An experiment on disease subtyping problem illustrates that the proposed model helps us understand the agreement and disagreement between experts.

Cite this Paper


BibTeX
@InProceedings{pmlr-v84-chen18c, title = {Crowdclustering with Partition Labels}, author = {Chen, Junxiang and Chang, Yale and Castaldi, Peter and Cho, Michael and Hobbs, Brian and Dy, Jennifer}, booktitle = {Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics}, pages = {1127--1136}, year = {2018}, editor = {Storkey, Amos and Perez-Cruz, Fernando}, volume = {84}, series = {Proceedings of Machine Learning Research}, month = {09--11 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v84/chen18c/chen18c.pdf}, url = {https://proceedings.mlr.press/v84/chen18c.html}, abstract = {Crowdclustering is a practical way to incorporate domain knowledge into clustering, by combining opinions from multiple domain experts. Existing crowdclustering methods analyze binary pairwise similarity labels. However, in some applications, experts might provide partition labels. If we convert partition labels into pairwise similarity, then it would be difficult to understand the relationships between clustering solutions from different experts. In this paper, we propose a crowdclustering model that directly analyzes partition labels. The proposed model adopts a novel approach based on a modified multinomial logistic regression model, which simultaneously learns the number of clusters and determines hyper-planes that partition samples into clusters. The proposed model also learns a mapping between the latent clusters and expert labels, revealing the agreements and disagreements between experts. Experiments on benchmark data demonstrate that the proposed model simultaneously learns the number of clusters and discovers the clustering structure. An experiment on disease subtyping problem illustrates that the proposed model helps us understand the agreement and disagreement between experts.} }
Endnote
%0 Conference Paper %T Crowdclustering with Partition Labels %A Junxiang Chen %A Yale Chang %A Peter Castaldi %A Michael Cho %A Brian Hobbs %A Jennifer Dy %B Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2018 %E Amos Storkey %E Fernando Perez-Cruz %F pmlr-v84-chen18c %I PMLR %P 1127--1136 %U https://proceedings.mlr.press/v84/chen18c.html %V 84 %X Crowdclustering is a practical way to incorporate domain knowledge into clustering, by combining opinions from multiple domain experts. Existing crowdclustering methods analyze binary pairwise similarity labels. However, in some applications, experts might provide partition labels. If we convert partition labels into pairwise similarity, then it would be difficult to understand the relationships between clustering solutions from different experts. In this paper, we propose a crowdclustering model that directly analyzes partition labels. The proposed model adopts a novel approach based on a modified multinomial logistic regression model, which simultaneously learns the number of clusters and determines hyper-planes that partition samples into clusters. The proposed model also learns a mapping between the latent clusters and expert labels, revealing the agreements and disagreements between experts. Experiments on benchmark data demonstrate that the proposed model simultaneously learns the number of clusters and discovers the clustering structure. An experiment on disease subtyping problem illustrates that the proposed model helps us understand the agreement and disagreement between experts.
APA
Chen, J., Chang, Y., Castaldi, P., Cho, M., Hobbs, B. & Dy, J.. (2018). Crowdclustering with Partition Labels. Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 84:1127-1136 Available from https://proceedings.mlr.press/v84/chen18c.html.

Related Material