Recovering Labels from Crowdsourced Data: an Optimal and Polynomial-Time Method

Emmanuel Pilliat
Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:4565-4595, 2025.

Abstract

Crowdsourcing involves aggregating meaningful information from partial and noisy data provided by a pool of $n$ workers across $d$ tasks. Traditional models, such as the Dawid-Skene model, assume that workers’ abilities are independent of tasks, limiting their applicability in real-world scenarios where worker ability often varies significantly across tasks. Recent advances have proposed permutation-based models, which relax these assumptions by imposing only isotonicity constraints on worker abilities. In this work, we study a permutation-based model where each worker $i$ has an ability $M_{ik}$ to recover a binary label $x_k^*\in\{-1,1\}$ for task $k$. The ability matrix $M$ is assumed to be isotonic up to a permutation of its rows, and only a fraction $\lambda$ of the worker-task pairs is observed. We focus on three primary objectives: recovering the true labels, ranking the workers, and estimating the ability matrix $M$. We introduce a polynomial-time and minimax optimal procedure to recover the labels, contradicting a conjecture in the literature regarding the existence of a statistical-computational gap for this problem. Additionally, building on the literature on ranking, we further introduce a polynomial-time procedure to rank the workers and to estimate their abilities. Notably, we show that ranking the workers or estimating their abilities is no harder when the true labels are unknown than when they are known, within the main regimes of interest in the isotonic model.

Cite this Paper


BibTeX
@InProceedings{pmlr-v291-pilliat25a, title = {Recovering Labels from Crowdsourced Data: an Optimal and Polynomial-Time Method}, author = {Pilliat, Emmanuel}, booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory}, pages = {4565--4595}, year = {2025}, editor = {Haghtalab, Nika and Moitra, Ankur}, volume = {291}, series = {Proceedings of Machine Learning Research}, month = {30 Jun--04 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v291/main/assets/pilliat25a/pilliat25a.pdf}, url = {https://proceedings.mlr.press/v291/pilliat25a.html}, abstract = {Crowdsourcing involves aggregating meaningful information from partial and noisy data provided by a pool of $n$ workers across $d$ tasks. Traditional models, such as the Dawid-Skene model, assume that workers’ abilities are independent of tasks, limiting their applicability in real-world scenarios where worker ability often varies significantly across tasks. Recent advances have proposed permutation-based models, which relax these assumptions by imposing only isotonicity constraints on worker abilities. In this work, we study a permutation-based model where each worker $i$ has an ability $M_{ik}$ to recover a binary label $x_k^*\in\{-1,1\}$ for task $k$. The ability matrix $M$ is assumed to be isotonic up to a permutation of its rows, and only a fraction $\lambda$ of the worker-task pairs is observed. We focus on three primary objectives: recovering the true labels, ranking the workers, and estimating the ability matrix $M$. We introduce a polynomial-time and minimax optimal procedure to recover the labels, contradicting a conjecture in the literature regarding the existence of a statistical-computational gap for this problem. Additionally, building on the literature on ranking, we further introduce a polynomial-time procedure to rank the workers and to estimate their abilities. Notably, we show that ranking the workers or estimating their abilities is no harder when the true labels are unknown than when they are known, within the main regimes of interest in the isotonic model.} }
Endnote
%0 Conference Paper %T Recovering Labels from Crowdsourced Data: an Optimal and Polynomial-Time Method %A Emmanuel Pilliat %B Proceedings of Thirty Eighth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2025 %E Nika Haghtalab %E Ankur Moitra %F pmlr-v291-pilliat25a %I PMLR %P 4565--4595 %U https://proceedings.mlr.press/v291/pilliat25a.html %V 291 %X Crowdsourcing involves aggregating meaningful information from partial and noisy data provided by a pool of $n$ workers across $d$ tasks. Traditional models, such as the Dawid-Skene model, assume that workers’ abilities are independent of tasks, limiting their applicability in real-world scenarios where worker ability often varies significantly across tasks. Recent advances have proposed permutation-based models, which relax these assumptions by imposing only isotonicity constraints on worker abilities. In this work, we study a permutation-based model where each worker $i$ has an ability $M_{ik}$ to recover a binary label $x_k^*\in\{-1,1\}$ for task $k$. The ability matrix $M$ is assumed to be isotonic up to a permutation of its rows, and only a fraction $\lambda$ of the worker-task pairs is observed. We focus on three primary objectives: recovering the true labels, ranking the workers, and estimating the ability matrix $M$. We introduce a polynomial-time and minimax optimal procedure to recover the labels, contradicting a conjecture in the literature regarding the existence of a statistical-computational gap for this problem. Additionally, building on the literature on ranking, we further introduce a polynomial-time procedure to rank the workers and to estimate their abilities. Notably, we show that ranking the workers or estimating their abilities is no harder when the true labels are unknown than when they are known, within the main regimes of interest in the isotonic model.
APA
Pilliat, E.. (2025). Recovering Labels from Crowdsourced Data: an Optimal and Polynomial-Time Method. Proceedings of Thirty Eighth Conference on Learning Theory, in Proceedings of Machine Learning Research 291:4565-4595 Available from https://proceedings.mlr.press/v291/pilliat25a.html.

Related Material