An LLM-based Data Augmentation Method for Different Personas to Enhance Alcohol User Prediction at the Population-Level
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:588-599, 2026.
Abstract
Alcohol is one of the most widely consumed psychoactive substances globally and is associated with considerable health, social, and legal consequences. This study presents an automated framework for the early identification of alcohol users by classifying their social media posts, addressing the substantial class imbalance commonly observed in such data. To mitigate the underrepresentation of alcohol users, our framework employs a dual-phase augmentation strategy: we first apply classical data augmentation techniques and then extend them with generative AI models that synthesize realistic user data to achieve near-balanced datasets. As the core methodological innovation, we introduce the Persona-driven Data Augmentation Method (P-DAM). This technique leverages well-established psychological theories to generate diverse personas that closely resemble real individuals, thereby substantially enhancing the quality of synthetic training data. Models trained using P-DAM demonstrate highly accurate prediction of alcohol users from unlabelled X posts representative of the Canadian population and yield population-level estimates that align with Health Canada statistics, with a minimal deviation of 1.72%. This work not only validates the effectiveness of psychologically based data augmentation but also demonstrates the potential of persona-driven, LLM-based predictive models as a robust and cost-effective alternative to traditional population surveys for estimating national alcohol use prevalence. In the future, the approach could be applied to other national health trends.