Global-Local Dirichlet Processes for Clustering Grouped Data in the Presence of Group-Specific Idiosyncratic Variables
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:7214-7249, 2025.
Abstract
We consider the problem of clustering grouped data in which the observations may include group-specific variables in addition to the variables shared across groups. This type of data is quite common; for example, in cancer genomic studies, molecular information is available for all cancers, whereas cancer-specific clinical information may be available only for certain cancers. Existing grouped clustering methods consider only the shared variables and ignore valuable information from the group-specific variables. To allow these group-specific variables to aid the clustering, we propose a novel Bayesian nonparametric approach, termed the global-local (GLocal) Dirichlet process, that models the "global-local" structure of the observations across groups. We characterize the GLocal Dirichlet process using its stick-breaking representation and its representation as the limit of a finite mixture model. We theoretically quantify the approximation errors of the truncated prior, the corresponding finite mixture model, and the associated posterior distribution. We develop a fast variational Bayes algorithm for scalable posterior inference, which we illustrate with extensive simulations and a TCGA pan-gastrointestinal cancer dataset.
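For readers unfamiliar with the stick-breaking representation and its truncation, the following is a minimal sketch for a standard (single-level) Dirichlet process only. It is not the GLocal construction, which the abstract does not specify. The function name `truncated_stick_breaking`, the concentration parameter `alpha`, and the truncation level are illustrative assumptions.

```python
import random


def truncated_stick_breaking(alpha, truncation, seed=0):
    """Draw mixture weights from a standard DP stick-breaking prior,
    truncated at `truncation` atoms (illustrative helper, not from the paper)."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for k in range(truncation):
        # Break off a Beta(1, alpha) fraction of what remains; the last
        # fraction is set to 1 so the truncated weights sum to one.
        frac = 1.0 if k == truncation - 1 else rng.betavariate(1.0, alpha)
        weights.append(frac * remaining)
        remaining *= 1.0 - frac
    return weights


weights = truncated_stick_breaking(alpha=2.0, truncation=20)
print(len(weights), sum(weights))
```

Under this sketch, a larger truncation level leaves less mass on the final atom, which is the kind of approximation error that a truncation analysis would quantify.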