Phase transition in PCA with missing data: Reduced signal-to-noise ratio, not sample size!

Niels Ipsen, Lars Kai Hansen
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2951-2960, 2019.

Abstract

How does missing data affect our ability to learn signal structures? It has been shown that learning signal structure in terms of principal components is dependent on the ratio of sample size and dimensionality and that a critical number of observations is needed before learning starts (Biehl and Mietzner, 1993). Here we generalize this analysis to include missing data. Probabilistic principal component analysis is regularly used for estimating signal structures in datasets with missing data. Our analytic result suggests that the effect of missing data is to effectively reduce signal-to-noise ratio rather than, as generally believed, to reduce sample size. The theory predicts a phase transition in the learning curves, and this is indeed found both in simulation data and in real datasets.

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-ipsen19a,
  title = {Phase transition in {PCA} with missing data: Reduced signal-to-noise ratio, not sample size!},
  author = {Ipsen, Niels and Hansen, Lars Kai},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages = {2951--2960},
  year = {2019},
  editor = {Kamalika Chaudhuri and Ruslan Salakhutdinov},
  volume = {97},
  series = {Proceedings of Machine Learning Research},
  month = {09--15 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v97/ipsen19a/ipsen19a.pdf},
  url = {http://proceedings.mlr.press/v97/ipsen19a.html},
  abstract = {How does missing data affect our ability to learn signal structures? It has been shown that learning signal structure in terms of principal components is dependent on the ratio of sample size and dimensionality and that a critical number of observations is needed before learning starts (Biehl and Mietzner, 1993). Here we generalize this analysis to include missing data. Probabilistic principal component analysis is regularly used for estimating signal structures in datasets with missing data. Our analytic result suggests that the effect of missing data is to effectively reduce signal-to-noise ratio rather than, as generally believed, to reduce sample size. The theory predicts a phase transition in the learning curves, and this is indeed found both in simulation data and in real datasets.}
}
Endnote
%0 Conference Paper
%T Phase transition in PCA with missing data: Reduced signal-to-noise ratio, not sample size!
%A Niels Ipsen
%A Lars Kai Hansen
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-ipsen19a
%I PMLR
%P 2951--2960
%U http://proceedings.mlr.press/v97/ipsen19a.html
%V 97
%X How does missing data affect our ability to learn signal structures? It has been shown that learning signal structure in terms of principal components is dependent on the ratio of sample size and dimensionality and that a critical number of observations is needed before learning starts (Biehl and Mietzner, 1993). Here we generalize this analysis to include missing data. Probabilistic principal component analysis is regularly used for estimating signal structures in datasets with missing data. Our analytic result suggests that the effect of missing data is to effectively reduce signal-to-noise ratio rather than, as generally believed, to reduce sample size. The theory predicts a phase transition in the learning curves, and this is indeed found both in simulation data and in real datasets.
APA
Ipsen, N. & Hansen, L.K. (2019). Phase transition in PCA with missing data: Reduced signal-to-noise ratio, not sample size! Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:2951-2960. Available from http://proceedings.mlr.press/v97/ipsen19a.html

Related Material