Asymptotic Degradation of Linear Regression Estimates with Strategic Data Sources
Proceedings of The 33rd International Conference on Algorithmic Learning Theory, PMLR 167:931-967, 2022.
We consider the problem of linear regression from strategic data sources with a public good component, i.e., when data is provided by strategic agents who seek to minimize an individual provision cost for increasing their data’s precision while benefiting from the model’s overall precision. In contrast to previous works, our model tackles the case where there is uncertainty about the attributes characterizing the agents’ data, a critical aspect of the problem when the number of agents is large. We provide a characterization of the game’s equilibrium, which reveals an interesting connection with optimal design. Subsequently, we focus on the asymptotic behavior of the covariance of the linear regression parameters estimated via generalized least squares as the number of data sources becomes large. We provide upper and lower bounds for this covariance matrix and we show that, when the agents’ provision costs are superlinear, the model’s covariance converges to zero, but at a slower rate than in virtually all learning problems with exogenous data. On the other hand, if the agents’ provision costs are linear, this covariance fails to converge. This shows that even the basic property of consistency of generalized least squares estimators is compromised when the data sources are strategic.
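
To make the quantity under study concrete, the following is a minimal sketch, not the paper's model or equilibrium. It assumes observations y_i = x_i^T beta + eps_i with independent noise of variance 1/lambda_i, in which case the generalized least squares estimator has covariance (sum_i lambda_i x_i x_i^T)^{-1}. The helper names (`gls_covariance`, `alternating_design`, `covariance_trace`), the toy design, and the three per-agent precision schedules are hypothetical and chosen only for illustration. In particular, the schedules n^{-1/2} and 1/n are not derived from any cost function; they only show how the covariance behaves when per-agent precision shrinks as the number of agents grows.

```python
import numpy as np

def gls_covariance(X, precisions):
    """Covariance of the GLS estimator when observation i has noise
    variance 1 / precisions[i]: (sum_i lam_i * x_i x_i^T)^{-1}."""
    X = np.asarray(X, dtype=float)
    lam = np.asarray(precisions, dtype=float)
    info = X.T @ (lam[:, None] * X)  # weighted information matrix
    return np.linalg.inv(info)

def alternating_design(n):
    """Toy design (assumption): rows alternate between e1 and e2, n even."""
    return np.tile(np.eye(2), (n // 2, 1))

def covariance_trace(n, per_agent_precision):
    """Trace of the GLS covariance when all n agents use the same precision."""
    X = alternating_design(n)
    lam = np.full(n, per_agent_precision)
    return float(np.trace(gls_covariance(X, lam)))

for n in (100, 10_000):
    fixed = covariance_trace(n, 1.0)             # precision independent of n
    shrinking = covariance_trace(n, n ** -0.5)   # hypothetical: decays like n^{-1/2}
    vanishing = covariance_trace(n, 1.0 / n)     # hypothetical: decays like 1/n
    print(n, fixed, shrinking, vanishing)
```

For this toy design the information matrix is (n * lambda / 2) times the identity, so the trace equals 4 / (n * lambda): it decays like 1/n under fixed precision, like n^{-1/2} under the second schedule, and stays constant under the third. These mirror, in a purely illustrative way, the three regimes the abstract contrasts: exogenous data, slower convergence, and failure to converge.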