An Instrumental Value for Data Production and its Application to Data Pricing

Rui Ai, Boxiang Lyu, Zhaoran Wang, Zhuoran Yang, Haifeng Xu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:822-856, 2025.

Abstract

We develop a framework for capturing the instrumental value of data production processes, which accounts for two key factors: (a) the context of the agent’s decision-making; (b) how much data or information the buyer already possesses. We "micro-found" our data valuation function by establishing its connection to classic notions of signals and information design in economics. When instantiated in Bayesian linear regression, our value naturally corresponds to information gain. Applying our proposed data value in Bayesian linear regression for monopoly pricing, we show that if the seller can fully customize data production, she can extract the first-best revenue (i.e., full surplus) from any population of buyers, i.e., achieving first-degree price discrimination. If data can only be constructed from an existing data pool, this limits the seller’s ability to customize, and achieving first-best revenue becomes generally impossible. However, we design a mechanism that achieves seller revenue at most $\log(\kappa)$ less than the first-best, where $\kappa$ is the condition number associated with the data matrix. As a corollary, the seller extracts the first-best revenue in the multi-armed bandits special case.
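The abstract's point that a dataset's value depends on what the buyer already holds can be illustrated with the standard information-gain formula for Bayesian linear regression. The sketch below is not taken from the paper. It assumes a Gaussian prior with precision matrix $\Lambda$, known noise variance $\sigma^2$, and two-dimensional features, and computes the gain from new rows $X$ as $\tfrac{1}{2}\left[\log\det(\Lambda + X^\top X/\sigma^2) - \log\det\Lambda\right]$. All function and variable names are hypothetical, and the paper's exact valuation function may differ in its details.

```python
import math

def gram(rows):
    """Return the 2x2 Gram matrix X^T X for a list of 2-d feature rows."""
    a = sum(x[0] * x[0] for x in rows)
    b = sum(x[0] * x[1] for x in rows)
    d = sum(x[1] * x[1] for x in rows)
    return [[a, b], [b, d]]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def add_scaled(p, g, c):
    """Return p + c * g for 2x2 matrices."""
    return [[p[i][j] + c * g[i][j] for j in range(2)] for i in range(2)]

def information_gain(precision, new_rows, noise_var=1.0):
    """Mutual information between the regression weights and the new labels,
    given the buyer's current precision matrix (assumed Gaussian model)."""
    updated = add_scaled(precision, gram(new_rows), 1.0 / noise_var)
    return 0.5 * (math.log(det2(updated)) - math.log(det2(precision)))

# Hypothetical toy data: the buyer already holds two observations along the
# first feature; the seller offers one observation along each feature.
prior_precision = [[1.0, 0.0], [0.0, 1.0]]
existing_rows = [[1.0, 0.0], [1.0, 0.0]]
offered_rows = [[1.0, 0.0], [0.0, 1.0]]

# Value of the offered data to a buyer who holds nothing yet.
gain_uninformed = information_gain(prior_precision, offered_rows)

# Value of the same data to a buyer whose precision reflects existing_rows.
informed_precision = add_scaled(prior_precision, gram(existing_rows), 1.0)
gain_informed = information_gain(informed_precision, offered_rows)

print(gain_uninformed, gain_informed)
```

In this toy setup the same offered rows are worth less to the buyer who already holds data along the first feature, which matches factor (b) in the abstract.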

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-ai25c,
  title = {An Instrumental Value for Data Production and its Application to Data Pricing},
  author = {Ai, Rui and Lyu, Boxiang and Wang, Zhaoran and Yang, Zhuoran and Xu, Haifeng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {822--856},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ai25c/ai25c.pdf},
  url = {https://proceedings.mlr.press/v267/ai25c.html},
  abstract = {We develop a framework for capturing the instrumental value of data production processes, which accounts for two key factors: (a) the context of the agent’s decision-making; (b) how much data or information the buyer already possesses. We "micro-found" our data valuation function by establishing its connection to classic notions of signals and information design in economics. When instantiated in Bayesian linear regression, our value naturally corresponds to information gain. Applying our proposed data value in Bayesian linear regression for monopoly pricing, we show that if the seller can fully customize data production, she can extract the first-best revenue (i.e., full surplus) from any population of buyers, i.e., achieving first-degree price discrimination. If data can only be constructed from an existing data pool, this limits the seller’s ability to customize, and achieving first-best revenue becomes generally impossible. However, we design a mechanism that achieves seller revenue at most $\log(\kappa)$ less than the first-best, where $\kappa$ is the condition number associated with the data matrix. As a corollary, the seller extracts the first-best revenue in the multi-armed bandits special case.}
}
Endnote
%0 Conference Paper
%T An Instrumental Value for Data Production and its Application to Data Pricing
%A Rui Ai
%A Boxiang Lyu
%A Zhaoran Wang
%A Zhuoran Yang
%A Haifeng Xu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-ai25c
%I PMLR
%P 822--856
%U https://proceedings.mlr.press/v267/ai25c.html
%V 267
%X We develop a framework for capturing the instrumental value of data production processes, which accounts for two key factors: (a) the context of the agent’s decision-making; (b) how much data or information the buyer already possesses. We "micro-found" our data valuation function by establishing its connection to classic notions of signals and information design in economics. When instantiated in Bayesian linear regression, our value naturally corresponds to information gain. Applying our proposed data value in Bayesian linear regression for monopoly pricing, we show that if the seller can fully customize data production, she can extract the first-best revenue (i.e., full surplus) from any population of buyers, i.e., achieving first-degree price discrimination. If data can only be constructed from an existing data pool, this limits the seller’s ability to customize, and achieving first-best revenue becomes generally impossible. However, we design a mechanism that achieves seller revenue at most $\log(\kappa)$ less than the first-best, where $\kappa$ is the condition number associated with the data matrix. As a corollary, the seller extracts the first-best revenue in the multi-armed bandits special case.
APA
Ai, R., Lyu, B., Wang, Z., Yang, Z. & Xu, H. (2025). An Instrumental Value for Data Production and its Application to Data Pricing. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:822-856. Available from https://proceedings.mlr.press/v267/ai25c.html.
