CERA: Context-Engineered Reviews Architecture for Synthetic Dataset Generation

Kap Thang, Daniel Ebrat, Luis Rueda
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:975-982, 2026.

Abstract

Aspect-Based Sentiment Analysis (ABSA) models require large-scale annotated datasets that are scarce, expensive to create, and suffer from class imbalance. While Large Language Models (LLMs) offer promising synthetic data generation, existing approaches lack factual grounding and provide limited aspect-level control. We present CERA (Context-Engineered Reviews Architecture), a training-free framework that generates realistic, controllable synthetic review text for ABSA through structured context engineering, i.e., carefully composing what an LLM receives as input rather than modifying the model itself. CERA’s three-phase pipeline integrates agentic web-search factual grounding with multi-agent verification, demographic-grounded persona diversity, and configurable polarity balance. Evaluated across three review domains and four architectures, CERA achieves Real-data-level corpus diversity (Distinct-2 of 0.736 vs. Real’s 0.776) while heuristic prompting collapses to 0.254, and scales to 8,000 reviews without quality degradation. Human evaluation confirms CERA reviews approach chance-level detection in a triplet Turing test (30% vs. 33% chance), nearly twice the rate of heuristic prompting (18%).
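The Distinct-2 figures quoted in the abstract refer to a corpus-level lexical diversity metric, commonly defined as the number of unique bigrams divided by the total number of bigrams. The following is a minimal sketch of that common definition, not the authors' evaluation code: the function name `distinct_n`, the lowercasing, the whitespace tokenization, and the toy review strings are all assumptions made for illustration, and the paper's exact tokenizer and aggregation may differ.

```python
from typing import Iterable


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Corpus-level Distinct-n: unique n-grams divided by total n-grams.

    Assumption: lowercase + whitespace tokenization. The paper's actual
    preprocessing is not stated in the abstract.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of size n over the tokens of this review only,
        # so n-grams never span two different reviews.
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # An empty corpus (or one with no n-grams) is scored 0.0 by convention here.
    return len(unique) / total if total else 0.0


# Hypothetical toy corpora, for illustration only.
varied = ["the battery lasts all day", "screen is bright and sharp"]
repetitive = ["the food was great", "the food was great"]

print(distinct_n(varied))      # every bigram is unique
print(distinct_n(repetitive))  # half of the bigrams are repeats
```

Under this definition a score near 1 means almost no bigram is reused across the corpus, while a low score such as the 0.254 reported for heuristic prompting indicates heavily repeated phrasing.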

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-thang26a,
  title = {CERA: Context-Engineered Reviews Architecture for Synthetic Dataset Generation},
  author = {Thang, Kap and Ebrat, Daniel and Rueda, Luis},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {975--982},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/thang26a/thang26a.pdf},
  url = {https://proceedings.mlr.press/v318/thang26a.html},
  abstract = {Aspect-Based Sentiment Analysis (ABSA) models require large-scale annotated datasets that are scarce, expensive to create, and suffer from class imbalance. While Large Language Models (LLMs) offer promising synthetic data generation, existing approaches lack factual grounding and provide limited aspect-level control. We present CERA (Context-Engineered Reviews Architecture), a training-free framework that generates realistic, controllable synthetic review text for ABSA through structured context engineering, i.e., carefully composing what an LLM receives as input rather than modifying the model itself. CERA’s three-phase pipeline integrates agentic web-search factual grounding with multi-agent verification, demographic-grounded persona diversity, and configurable polarity balance. Evaluated across three review domains and four architectures, CERA achieves Real-data-level corpus diversity (Distinct-2 of 0.736 vs. Real’s 0.776) while heuristic prompting collapses to 0.254, and scales to 8,000 reviews without quality degradation. Human evaluation confirms CERA reviews approach chance-level detection in a triplet Turing test (30% vs. 33% chance), nearly twice the rate of heuristic prompting (18%).}
}
Endnote
%0 Conference Paper
%T CERA: Context-Engineered Reviews Architecture for Synthetic Dataset Generation
%A Kap Thang
%A Daniel Ebrat
%A Luis Rueda
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-thang26a
%I PMLR
%P 975--982
%U https://proceedings.mlr.press/v318/thang26a.html
%V 318
%X Aspect-Based Sentiment Analysis (ABSA) models require large-scale annotated datasets that are scarce, expensive to create, and suffer from class imbalance. While Large Language Models (LLMs) offer promising synthetic data generation, existing approaches lack factual grounding and provide limited aspect-level control. We present CERA (Context-Engineered Reviews Architecture), a training-free framework that generates realistic, controllable synthetic review text for ABSA through structured context engineering, i.e., carefully composing what an LLM receives as input rather than modifying the model itself. CERA’s three-phase pipeline integrates agentic web-search factual grounding with multi-agent verification, demographic-grounded persona diversity, and configurable polarity balance. Evaluated across three review domains and four architectures, CERA achieves Real-data-level corpus diversity (Distinct-2 of 0.736 vs. Real’s 0.776) while heuristic prompting collapses to 0.254, and scales to 8,000 reviews without quality degradation. Human evaluation confirms CERA reviews approach chance-level detection in a triplet Turing test (30% vs. 33% chance), nearly twice the rate of heuristic prompting (18%).
APA
Thang, K., Ebrat, D., & Rueda, L. (2026). CERA: Context-Engineered Reviews Architecture for Synthetic Dataset Generation. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:975-982. Available from https://proceedings.mlr.press/v318/thang26a.html.
