Guarantees for Nonlinear Representation Learning: Non-identical Covariates, Dependent Data, Fewer Samples

Thomas Tck Zhang, Bruce D Lee, Ingvar Ziemann, George J. Pappas, Nikolai Matni
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:59126-59147, 2024.

Abstract

A driving force behind the diverse applicability of modern machine learning is the ability to extract meaningful features across many sources. However, many practical domains involve data that are non-identically distributed across sources, and possibly statistically dependent within each source, violating vital assumptions in existing theoretical studies of representation learning. Toward addressing these issues, we establish statistical guarantees for learning general nonlinear representations from multiple data sources that admit different input distributions and possibly dependent data. Specifically, we study the sample complexity of learning $T+1$ functions $f_\star^{(t)} \circ g_\star$ from a function class $\mathcal{F} \times \mathcal{G}$, where the $f_\star^{(t)}$ are task-specific linear functions and $g_\star$ is a shared nonlinear representation. An approximate representation $\hat g$ is estimated using $N$ samples from each of $T$ source tasks, and a fine-tuning function $\hat f^{(0)}$ is fit using $N'$ samples from a target task passed through $\hat g$. Our results show that the excess risk of the estimate $\hat f^{(0)} \circ \hat g$ on the target task decays as $\tilde{\mathcal{O}}\Big(\frac{\mathrm{C}(\mathcal{G})}{N T} + \frac{\mathrm{dim}(\mathcal{F})}{N'}\Big)$, where $\mathrm{C}(\mathcal{G})$ denotes the complexity of $\mathcal{G}$. Notably, our rates match those of the iid setting, while requiring fewer samples per task than prior analyses and admitting no dependence on the mixing time. We support our analysis with numerical experiments performing imitation learning over nonlinear dynamical systems.
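To make the two-stage procedure described in the abstract concrete, the following is a minimal sketch: (i) jointly fit a shared nonlinear representation $\hat g$ and per-task linear heads on $N$ samples from each of $T$ source tasks, then (ii) freeze $\hat g$ and fit a new linear head $\hat f^{(0)}$ on $N'$ target samples by least squares. The synthetic data, tanh MLP, scalar outputs, and optimizer here are illustrative assumptions, not the paper's experimental setup (the paper's experiments involve imitation learning over nonlinear dynamical systems).

```python
# Hedged sketch of the two-stage estimator described in the abstract.
# Assumptions (not from the paper): scalar outputs, a tanh MLP for g_hat,
# Gaussian synthetic covariates with per-task scaling, Adam for stage 1.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_x, d_feat, T, N, N_prime = 10, 4, 20, 200, 50

# Ground truth: shared nonlinear representation g_* and task-specific linear heads f_*^(t).
W_star = torch.randn(d_feat, d_x)
def g_star(x):
    return torch.tanh(x @ W_star.T)
F_star = torch.randn(T + 1, d_feat)          # row 0 is the target head f_*^(0)

# Source data: non-identical covariate distributions across the T tasks.
X_src = [torch.randn(N, d_x) * (1.0 + 0.1 * t) for t in range(1, T + 1)]
Y_src = [g_star(X) @ F_star[t + 1] + 0.1 * torch.randn(N) for t, X in enumerate(X_src)]

# Stage 1: jointly fit the shared representation g_hat and per-task linear heads by ERM.
g_hat = nn.Sequential(nn.Linear(d_x, 32), nn.Tanh(), nn.Linear(32, d_feat))
heads = nn.Parameter(0.1 * torch.randn(T, d_feat))
opt = torch.optim.Adam(list(g_hat.parameters()) + [heads], lr=1e-2)
for _ in range(1500):
    opt.zero_grad()
    loss = sum(((g_hat(X_src[t]) @ heads[t] - Y_src[t]) ** 2).mean() for t in range(T)) / T
    loss.backward()
    opt.step()

# Stage 2: freeze g_hat and fit the target head f_hat^(0) on N' samples by least squares.
X_tgt = torch.randn(N_prime, d_x)
Y_tgt = g_star(X_tgt) @ F_star[0] + 0.1 * torch.randn(N_prime)
with torch.no_grad():
    Phi = g_hat(X_tgt)                       # N' x d_feat learned features
    f0_hat = torch.linalg.lstsq(Phi, Y_tgt.unsqueeze(1)).solution.squeeze(1)

# Excess risk of f_hat^(0) composed with g_hat on fresh target covariates (noise-free comparison).
X_test = torch.randn(5000, d_x)
with torch.no_grad():
    excess_risk = ((g_hat(X_test) @ f0_hat - g_star(X_test) @ F_star[0]) ** 2).mean()
print(f"target-task excess risk: {excess_risk.item():.4f}")
```

The per-task source heads are trained only to shape $\hat g$; only $\hat g$ is reused on the target task, mirroring the split of the bound into a $\mathrm{C}(\mathcal{G})/(NT)$ representation-learning term and a $\mathrm{dim}(\mathcal{F})/N'$ fine-tuning term.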

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-zhang24ab, title = {Guarantees for Nonlinear Representation Learning: Non-identical Covariates, Dependent Data, Fewer Samples}, author = {Zhang, Thomas Tck and Lee, Bruce D and Ziemann, Ingvar and Pappas, George J. and Matni, Nikolai}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {59126--59147}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/zhang24ab/zhang24ab.pdf}, url = {https://proceedings.mlr.press/v235/zhang24ab.html}, abstract = {A driving force behind the diverse applicability of modern machine learning is the ability to extract meaningful features across many sources. However, many practical domains involve data that are non-identically distributed across sources, and possibly statistically dependent within its source, violating vital assumptions in existing theoretical studies of representation learning. Toward addressing these issues, we establish statistical guarantees for learning general nonlinear representations from multiple data sources that admit different input distributions and possibly dependent data. Specifically, we study the sample-complexity of learning $T+1$ functions $f_\star^{(t)} \circ g_\star$ from a function class $\mathcal{F} \times \mathcal{G}$, where $f_\star^{(t)}$ are task specific linear functions and $g_\star$ is a shared non-linear representation. An approximate representation $\hat g$ is estimated using $N$ samples from each of $T$ source tasks, and a fine-tuning function $\hat f^{(0)}$ is fit using $N'$ samples from a target task passed through $\hat g$. Our results show that the excess risk of the estimate $\hat f^{(0)} \circ \hat g$ on the target task decays as $\tilde{\mathcal{O}}\Big(\frac{\mathrm{C}(\mathcal{G})}{N T} + \frac{\text{dim}(\mathcal{F})}{N'}\Big)$, where $\mathrm{C}(\mathcal{G})$ denotes the complexity of $\mathcal{G}$. Notably, our rates match that of the iid setting, while requiring fewer samples per task than prior analysis and admitting no dependence on the mixing time. We support our analysis with numerical experiments performing imitation learning over non-linear dynamical systems.} }
Endnote
%0 Conference Paper %T Guarantees for Nonlinear Representation Learning: Non-identical Covariates, Dependent Data, Fewer Samples %A Thomas Tck Zhang %A Bruce D Lee %A Ingvar Ziemann %A George J. Pappas %A Nikolai Matni %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-zhang24ab %I PMLR %P 59126--59147 %U https://proceedings.mlr.press/v235/zhang24ab.html %V 235 %X A driving force behind the diverse applicability of modern machine learning is the ability to extract meaningful features across many sources. However, many practical domains involve data that are non-identically distributed across sources, and possibly statistically dependent within its source, violating vital assumptions in existing theoretical studies of representation learning. Toward addressing these issues, we establish statistical guarantees for learning general nonlinear representations from multiple data sources that admit different input distributions and possibly dependent data. Specifically, we study the sample-complexity of learning $T+1$ functions $f_\star^{(t)} \circ g_\star$ from a function class $\mathcal{F} \times \mathcal{G}$, where $f_\star^{(t)}$ are task specific linear functions and $g_\star$ is a shared non-linear representation. An approximate representation $\hat g$ is estimated using $N$ samples from each of $T$ source tasks, and a fine-tuning function $\hat f^{(0)}$ is fit using $N'$ samples from a target task passed through $\hat g$. Our results show that the excess risk of the estimate $\hat f^{(0)} \circ \hat g$ on the target task decays as $\tilde{\mathcal{O}}\Big(\frac{\mathrm{C}(\mathcal{G})}{N T} + \frac{\text{dim}(\mathcal{F})}{N'}\Big)$, where $\mathrm{C}(\mathcal{G})$ denotes the complexity of $\mathcal{G}$. Notably, our rates match that of the iid setting, while requiring fewer samples per task than prior analysis and admitting no dependence on the mixing time. We support our analysis with numerical experiments performing imitation learning over non-linear dynamical systems.
APA
Zhang, T.T., Lee, B.D., Ziemann, I., Pappas, G.J. & Matni, N. (2024). Guarantees for Nonlinear Representation Learning: Non-identical Covariates, Dependent Data, Fewer Samples. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:59126-59147. Available from https://proceedings.mlr.press/v235/zhang24ab.html.
