Sample Complexity Bounds for Differentially Private Learning
Proceedings of the 24th Annual Conference on Learning Theory, PMLR 19:155-186, 2011.
Abstract
This work studies the problem of privacy-preserving classification – namely, learning a classifier from sensitive data while preserving the privacy of individuals in the training set. In particular, the learning algorithm in this problem is required to guarantee differential privacy, a very strong notion of privacy that has gained significant attention in recent years. A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy? We address this question in the context of learning with infinite hypothesis classes when the data is drawn from a continuous distribution. We first show that even for very simple hypothesis classes, any algorithm that uses a finite number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions. This result is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible, as previously shown by Kasiviswanathan et al. (2008). We then consider two approaches to differentially private learning that get around this lower bound. The first approach is to use prior knowledge about the unlabeled data distribution in the form of a reference distribution $\mathcal{U}$ chosen independently of the sensitive data. Given such a reference $\mathcal{U}$, we provide an upper bound on the sample requirement that depends (among other things) on a measure of closeness between $\mathcal{U}$ and the unlabeled data distribution. Our upper bound applies to the non-realizable as well as the realizable case. The second approach is to relax the privacy requirement, by requiring only label privacy – namely, that only the labels (and not the unlabeled parts of the examples) be considered sensitive information. An upper bound on the sample requirement of learning with label privacy was shown by Chaudhuri et al. (2006); in this work, we show a lower bound.
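For context, the standard notion of $\epsilon$-differential privacy referred to above (the paper's exact formalization and parameter names may differ) requires that a randomized algorithm $A$ behave almost identically on neighboring training sets: for any two training sets $D$ and $D'$ that differ in a single example, and any set $S$ of possible outputs,

$$\Pr[A(D) \in S] \;\le\; e^{\epsilon}\,\Pr[A(D') \in S].$$

Smaller $\epsilon$ means stronger privacy, and the sample requirement studied here is, roughly, the number of examples needed to achieve a given accuracy while satisfying such a guarantee.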