Sampling Boundary for Causal Effect Estimation

Yue Yin, Jiaoyun Yang, Ning An, Lian Li
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:527-542, 2025.

Abstract

In causal effect estimation, determining the appropriate sampling size is critical for ensuring reliability and validity in both experimental and observational studies, a challenge closely tied to robust model generalization under limited data conditions in machine learning. This paper tackles these challenges by leveraging the Probably Approximately Correct (PAC) theory to establish a theoretically grounded framework for determining sampling boundaries. We utilize Hoeffding’s inequality and Vapnik–Chervonenkis (VC) dimension to set upper boundaries for dataset adequacy in diverse scenarios: no confounders, confounders with a finite hypothesis space, and confounders with an infinite hypothesis space. Our work ensures that if the dataset size exceeds the upper boundary, the error probability for the estimated causal effect stays within a specified threshold at the given confidence level. Additionally, we demonstrate that when the dataset size is inadequate, the error of the estimated average treatment effects is bounded by the estimation of the outcome variable, which forms the theoretical basis for data augmentation strategies to improve the accuracy of causal effect estimation. Extensive experiments on synthetic and semi-synthetic datasets validate the correctness of our presented sampling upper limitations under different error and confidence level constraints. Our findings not only offer a systematic and reliable method for determining sample size in causal effect estimation but also provide actionable guidance for developing causal inference models in data-scarce environments, enhancing their applicability and robustness across fields such as healthcare, social sciences, and policy evaluation.
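The abstract's core tool, using Hoeffding's inequality (and a union bound over a finite hypothesis space) to derive a sample size sufficient for error at most ε with confidence 1 − δ, can be sketched as below. This is an illustrative sketch of the standard PAC bounds, not the paper's exact boundaries, and the function names are hypothetical:

```python
import math

def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n such that 2 * exp(-2 * n * epsilon^2) <= delta.

    Standard Hoeffding bound for a mean of i.i.d. variables in [0, 1]:
    with n samples, the empirical mean deviates from the true mean by
    more than epsilon with probability at most delta.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def finite_hypothesis_sample_size(epsilon: float, delta: float,
                                  num_hypotheses: int) -> int:
    """Union bound over a finite hypothesis class of size |H|:
    n >= ln(2|H| / delta) / (2 * epsilon^2) suffices for all hypotheses
    simultaneously to have estimation error at most epsilon w.p. 1 - delta.
    """
    return math.ceil(
        math.log(2.0 * num_hypotheses / delta) / (2.0 * epsilon ** 2))

# Example: error threshold 0.1 at 95% confidence.
print(hoeffding_sample_size(0.1, 0.05))               # -> 185
print(finite_hypothesis_sample_size(0.1, 0.05, 100))  # -> 415
```

Note how the finite-hypothesis bound grows only logarithmically in |H|; the paper's infinite-hypothesis case replaces ln |H| with a VC-dimension term in the same spirit.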

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-yin25a,
  title     = {Sampling Boundary for Causal Effect Estimation},
  author    = {Yin, Yue and Yang, Jiaoyun and An, Ning and Li, Lian},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {527--542},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/yin25a/yin25a.pdf},
  url       = {https://proceedings.mlr.press/v304/yin25a.html},
  abstract  = {In causal effect estimation, determining the appropriate sampling size is critical for ensuring reliability and validity in both experimental and observational studies, a challenge closely tied to robust model generalization under limited data conditions in machine learning. This paper tackles these challenges by leveraging the Probably Approximately Correct (PAC) theory to establish a theoretically grounded framework for determining sampling boundaries. We utilize Hoeffding’s inequality and Vapnik–Chervonenkis (VC) dimension to set upper boundaries for dataset adequacy in diverse scenarios: no confounders, confounders with a finite hypothesis space, and confounders with an infinite hypothesis space. Our work ensures that if the dataset size exceeds the upper boundary, the error probability for the estimated causal effect stays within a specified threshold at the given confidence level. Additionally, we demonstrate that when the dataset size is inadequate, the error of the estimated average treatment effects is bounded by the estimation of the outcome variable, which forms the theoretical basis for data augmentation strategies to improve the accuracy of causal effect estimation. Extensive experiments on synthetic and semi-synthetic datasets validate the correctness of our presented sampling upper limitations under different error and confidence level constraints. Our findings not only offer a systematic and reliable method for determining sample size in causal effect estimation but also provide actionable guidance for developing causal inference models in data-scarce environments, enhancing their applicability and robustness across fields such as healthcare, social sciences, and policy evaluation.}
}
Endnote
%0 Conference Paper
%T Sampling Boundary for Causal Effect Estimation
%A Yue Yin
%A Jiaoyun Yang
%A Ning An
%A Lian Li
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-yin25a
%I PMLR
%P 527--542
%U https://proceedings.mlr.press/v304/yin25a.html
%V 304
%X In causal effect estimation, determining the appropriate sampling size is critical for ensuring reliability and validity in both experimental and observational studies, a challenge closely tied to robust model generalization under limited data conditions in machine learning. This paper tackles these challenges by leveraging the Probably Approximately Correct (PAC) theory to establish a theoretically grounded framework for determining sampling boundaries. We utilize Hoeffding’s inequality and Vapnik–Chervonenkis (VC) dimension to set upper boundaries for dataset adequacy in diverse scenarios: no confounders, confounders with a finite hypothesis space, and confounders with an infinite hypothesis space. Our work ensures that if the dataset size exceeds the upper boundary, the error probability for the estimated causal effect stays within a specified threshold at the given confidence level. Additionally, we demonstrate that when the dataset size is inadequate, the error of the estimated average treatment effects is bounded by the estimation of the outcome variable, which forms the theoretical basis for data augmentation strategies to improve the accuracy of causal effect estimation. Extensive experiments on synthetic and semi-synthetic datasets validate the correctness of our presented sampling upper limitations under different error and confidence level constraints. Our findings not only offer a systematic and reliable method for determining sample size in causal effect estimation but also provide actionable guidance for developing causal inference models in data-scarce environments, enhancing their applicability and robustness across fields such as healthcare, social sciences, and policy evaluation.
APA
Yin, Y., Yang, J., An, N. & Li, L. (2025). Sampling Boundary for Causal Effect Estimation. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:527-542. Available from https://proceedings.mlr.press/v304/yin25a.html.