An LLM-based Data Augmentation Method for Different Personas to Enhance Alcohol User Prediction at the Population-Level

Doaa Ibrahim Swailum, Ruba Skaik, Diana Inkpen, Hussein AlOsman
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:588-599, 2026.

Abstract

Alcohol is one of the most widely consumed psychoactive substances globally and is associated with considerable health, social, and legal consequences. This study presents an automated framework for the early identification of alcohol users by classifying their social media posts, addressing the substantial class imbalance commonly observed in such data. To mitigate the underrepresentation of alcohol users, our framework employs a dual-phase augmentation strategy: we first utilize classical data augmentation techniques, and then significantly enhance this approach by integrating generative AI models to synthesize realistic user data and achieve near-balanced datasets. As the core methodological innovation, we introduce the Persona-driven Data Augmentation Method (P-DAM). This technique leverages well-established psychological theories to generate diverse personas that closely resemble real individuals, thereby substantially enhancing the quality of synthetic training data. Models trained using P-DAM demonstrate highly accurate prediction of alcohol users from unlabelled X posts representative of the Canadian population and yield population-level estimates that align with Health Canada statistics, with a minimal deviation of 1.72%. This work not only validates the effectiveness of psychologically based data augmentation but also demonstrates the potential of persona-driven, LLM-based predictive models as a robust and cost-effective alternative to traditional population surveys for estimating national alcohol use prevalence and, in the future, could be applied to other national health trends.
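The population-level comparison described in the abstract can be illustrated with a minimal sketch. It assumes that the estimate is the share of users a classifier labels as alcohol users and that the deviation is the absolute difference from a reference survey figure; the abstract does not define either computation. The function names and all numbers below are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def estimated_prevalence(predictions: Sequence[int]) -> float:
    """Return the percentage of users predicted as alcohol users (label 1)."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    return 100.0 * sum(predictions) / len(predictions)


def absolute_deviation(estimate_pct: float, reference_pct: float) -> float:
    """Return the absolute gap, in percentage points, from a reference figure."""
    return abs(estimate_pct - reference_pct)


# Hypothetical toy values for illustration only, not results from the paper.
toy_predictions = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 7 of 10 users predicted positive
toy_reference_pct = 75.0  # stand-in for a survey-based prevalence figure

estimate = estimated_prevalence(toy_predictions)
deviation = absolute_deviation(estimate, toy_reference_pct)
print(f"estimate={estimate:.2f}% deviation={deviation:.2f} points")
```

In practice the predictions would come from a model applied to a sample representative of the target population; the sketch only shows the arithmetic of the final comparison.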

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-swailum26a,
  title = {An LLM-based Data Augmentation Method for Different Personas to Enhance Alcohol User Prediction at the Population-Level},
  author = {Swailum, Doaa Ibrahim and Skaik, Ruba and Inkpen, Diana and AlOsman, Hussein},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {588--599},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/swailum26a/swailum26a.pdf},
  url = {https://proceedings.mlr.press/v318/swailum26a.html},
  abstract = {Alcohol is one of the most widely consumed psychoactive substances globally and is associated with considerable health, social, and legal consequences. This study presents an automated framework for the early identification of alcohol users by classifying their social media posts, addressing the substantial class imbalance commonly observed in such data. To mitigate the underrepresentation of alcohol users, our framework employs a dual-phase augmentation strategy: we first utilize classical data augmentation techniques, and then significantly enhance this approach by integrating generative AI models to synthesize realistic user data and achieve near-balanced datasets. As the core methodological innovation, we introduce the Persona-driven Data Augmentation Method (P-DAM). This technique leverages well-established psychological theories to generate diverse personas that closely resemble real individuals, thereby substantially enhancing the quality of synthetic training data. Models trained using P-DAM demonstrate highly accurate prediction of alcohol users from unlabelled X posts representative of the Canadian population and yield population-level estimates that align with Health Canada statistics, with a minimal deviation of 1.72%. This work not only validates the effectiveness of psychologically based data augmentation but also demonstrates the potential of persona-driven, LLM-based predictive models as a robust and cost-effective alternative to traditional population surveys for estimating national alcohol use prevalence and, in the future, could be applied to other national health trends.}
}
Endnote
%0 Conference Paper
%T An LLM-based Data Augmentation Method for Different Personas to Enhance Alcohol User Prediction at the Population-Level
%A Doaa Ibrahim Swailum
%A Ruba Skaik
%A Diana Inkpen
%A Hussein AlOsman
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-swailum26a
%I PMLR
%P 588--599
%U https://proceedings.mlr.press/v318/swailum26a.html
%V 318
%X Alcohol is one of the most widely consumed psychoactive substances globally and is associated with considerable health, social, and legal consequences. This study presents an automated framework for the early identification of alcohol users by classifying their social media posts, addressing the substantial class imbalance commonly observed in such data. To mitigate the underrepresentation of alcohol users, our framework employs a dual-phase augmentation strategy: we first utilize classical data augmentation techniques, and then significantly enhance this approach by integrating generative AI models to synthesize realistic user data and achieve near-balanced datasets. As the core methodological innovation, we introduce the Persona-driven Data Augmentation Method (P-DAM). This technique leverages well-established psychological theories to generate diverse personas that closely resemble real individuals, thereby substantially enhancing the quality of synthetic training data. Models trained using P-DAM demonstrate highly accurate prediction of alcohol users from unlabelled X posts representative of the Canadian population and yield population-level estimates that align with Health Canada statistics, with a minimal deviation of 1.72%. This work not only validates the effectiveness of psychologically based data augmentation but also demonstrates the potential of persona-driven, LLM-based predictive models as a robust and cost-effective alternative to traditional population surveys for estimating national alcohol use prevalence and, in the future, could be applied to other national health trends.
APA
Swailum, D. I., Skaik, R., Inkpen, D., & AlOsman, H. (2026). An LLM-based Data Augmentation Method for Different Personas to Enhance Alcohol User Prediction at the Population-Level. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:588-599. Available from https://proceedings.mlr.press/v318/swailum26a.html.
