Position: When Incentives Backfire, Data Stops Being Human

Sebastin Santy, Prasanta Bhattacharya, Manoel Horta Ribeiro, Kelsey R Allen, Sewoong Oh
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:82151-82165, 2025.

Abstract

Progress in AI has relied on human-generated data, from annotator marketplaces to the wider Internet. However, the widespread use of large language models now threatens the quality and integrity of human-generated data on these very platforms. We argue that this issue goes beyond the immediate challenge of filtering AI-generated content – it reveals deeper flaws in how data collection systems are designed. Existing systems often prioritize speed, scale, and efficiency at the cost of intrinsic human motivation, leading to declining engagement and data quality. We propose that rethinking data collection systems to align with contributors’ intrinsic motivations – rather than relying solely on external incentives – can help sustain high-quality data sourcing at scale while maintaining contributor trust and long-term participation.

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-santy25a,
  title     = {Position: When Incentives Backfire, Data Stops Being Human},
  author    = {Santy, Sebastin and Bhattacharya, Prasanta and Ribeiro, Manoel Horta and Allen, Kelsey R and Oh, Sewoong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {82151--82165},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/santy25a/santy25a.pdf},
  url       = {https://proceedings.mlr.press/v267/santy25a.html}
}
APA
Santy, S., Bhattacharya, P., Ribeiro, M. H., Allen, K. R., & Oh, S. (2025). Position: When Incentives Backfire, Data Stops Being Human. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:82151-82165. Available from https://proceedings.mlr.press/v267/santy25a.html.
