CERA: Context-Engineered Reviews Architecture for Synthetic Dataset Generation
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:975-982, 2026.
Abstract
Aspect-Based Sentiment Analysis (ABSA) models require large-scale annotated datasets that are scarce, expensive to create, and suffer from class imbalance. While Large Language Models (LLMs) offer a promising route to synthetic data generation, existing approaches lack factual grounding and provide limited aspect-level control. We present CERA (Context-Engineered Reviews Architecture), a training-free framework that generates realistic, controllable synthetic review text for ABSA through structured context engineering, i.e., carefully composing what an LLM receives as input rather than modifying the model itself. CERA’s three-phase pipeline integrates agentic web-search factual grounding with multi-agent verification, demographic-grounded persona diversity, and configurable polarity balance. Evaluated across three review domains and four architectures, CERA achieves corpus diversity close to that of real data (Distinct-2 of 0.736 vs. 0.776 for real reviews), whereas heuristic prompting collapses to 0.254, and it scales to 8,000 reviews without quality degradation. Human evaluation confirms that CERA reviews approach chance-level detection in a triplet Turing test (30% vs. 33% chance), nearly twice the rate of heuristic prompting (18%).
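For readers unfamiliar with the Distinct-2 metric cited above, the following is a minimal sketch of how a corpus-level Distinct-n score is commonly computed: the number of unique n-grams divided by the total number of n-grams across all texts. The function name `distinct_n`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the abstract does not specify the tokenizer or normalization used in the evaluation, so scores from this sketch are not directly comparable to the reported figures.

```python
from typing import Iterable


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Corpus-level Distinct-n: unique n-grams / total n-grams.

    Assumption: lowercased whitespace tokenization; the paper's
    actual preprocessing is not stated in the abstract.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of size n over the tokens of each text.
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # Return 0.0 for an empty corpus to avoid division by zero.
    return len(unique) / total if total else 0.0


# Toy corpus (hypothetical data, for illustration only).
reviews = [
    "the pasta was great",
    "the pasta was cold",
]
score = distinct_n(reviews, n=2)
print(f"Distinct-2: {score:.3f}")
```

In this toy corpus, four of the six bigrams are unique, so the score is 4/6. A lower score indicates more repeated phrasing across the corpus, which is the kind of collapse the abstract attributes to heuristic prompting.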