Crowdclustering with Partition Labels


Junxiang Chen, Yale Chang, Peter Castaldi, Michael Cho, Brian Hobbs, Jennifer Dy
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR 84:1127-1136, 2018.


Crowdclustering is a practical way to incorporate domain knowledge into clustering by combining opinions from multiple domain experts. Existing crowdclustering methods analyze binary pairwise similarity labels. In some applications, however, experts provide partition labels instead. Converting partition labels into pairwise similarity labels makes it difficult to understand the relationships between the clustering solutions of different experts. In this paper, we propose a crowdclustering model that analyzes partition labels directly. The proposed model adopts a novel approach based on a modified multinomial logistic regression model, which simultaneously learns the number of clusters and determines the hyperplanes that partition samples into clusters. The proposed model also learns a mapping between the latent clusters and expert labels, revealing the agreements and disagreements between experts. Experiments on benchmark data demonstrate that the proposed model simultaneously learns the number of clusters and discovers the clustering structure. An experiment on a disease subtyping problem illustrates that the proposed model helps us understand the agreements and disagreements between experts.
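
To make the distinction between the two label types concrete, the following is a minimal sketch of the conversion from partition labels to binary pairwise similarity labels. It is not the paper's model; the function name `partition_to_pairwise`, the two experts, and their label values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def partition_to_pairwise(labels):
    """Map one expert's partition labels to binary pairwise similarity labels.

    Returns a dict {(i, j): 0 or 1} over all pairs i < j, where 1 means the
    expert placed samples i and j in the same group.
    """
    return {
        (i, j): int(labels[i] == labels[j])
        for i, j in combinations(range(len(labels)), 2)
    }


# Hypothetical partition labels from two experts on the same five samples.
expert_a = ["mild", "mild", "severe", "severe", "severe"]
expert_b = ["type1", "type1", "type2", "type3", "type3"]

pairs_a = partition_to_pairwise(expert_a)
pairs_b = partition_to_pairwise(expert_b)

# Sample pairs on which the two experts disagree after conversion.
disagreements = sorted(p for p in pairs_a if pairs_a[p] != pairs_b[p])
print(disagreements)
```

In this toy case the pairwise view only reports that the experts disagree on two sample pairs. Reading the partition labels directly shows something more structured: expert B's "type2" and "type3" groups together cover exactly the samples that expert A labeled "severe". This kind of relationship between expert labelings is what the abstract says is hard to recover once partition labels have been converted.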
