A Sparse Representation-Based Approach to Linear Regression with Partially Shuffled Labels
Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, PMLR 115:38-48, 2020.
Several recent papers have discussed a modification of linear regression in which the correspondence between input variables and labels is missing or erroneous, referred to as "Linear Regression with Unknown Permutation", or "Linear Regression with Shuffled Data". Prior studies of this setup have shed light on the associated statistical limits. However, practical and well-founded approaches that overcome the computational challenges of the setup are still scarce. In this paper, we translate the problem of linear regression with unknown permutation to a robust subspace recovery problem, or alternatively, outlier recovery problem. Outliers correspond to mismatched data, and a specific sparse representation of the data is used to detect the outliers. The proposed approach requires at least a reasonably large fraction (but potentially significantly less than 50%) of the response-predictor pairs to be in correct correspondence. This assumption is often justified, e.g., if record linkage based on additional contextual information is sufficiently accurate to enable correct matching of such fraction of the data. It turns out that in this situation, estimation of the regression parameter on the one hand and recovery of the underlying permutation on the other hand can be decoupled so that the computational hardness associated with the latter can be sidestepped.