Toward Believable Health & Wellness Conversational Agents: A Post-LLM Turing-like Evaluation Framework (Position Paper)

Bruce Matichuk, Randy Goebel, Osmar Zaiane
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:812-816, 2026.

Abstract

Large language model (LLM) conversational agents can be remarkably fluent yet still fail to feel fully “real” to users, especially in multi-session and higher-stakes interactions. This paper argues that the limiting problem is no longer surface language quality but \emph{believability}: the conditions under which an artificial conversational partner is experienced as a coherent social mind rather than a fluent text generator. We frame believability as an empirical limit case and propose an operational criterion of \emph{bounded practical indistinguishability} relative to an interaction envelope defined by a judge population, interaction contexts, and a time horizon. We then outline a “post-LLM Turing-like” evaluation approach that stress-tests modern detection cues using contextual scenario families, longitudinal re-contact, and multi-signal measurement combining human judgments with behavioral metrics. Finally, we instantiate the framework for a health and wellness agent being developed with an \emph{industry partner} (details anonymized), arguing that wellness settings sharply amplify the importance of epistemic calibration, continuity, and boundary management. The goal is not to advocate deceptive deployment, but to make believability mechanistic and measurable so that both capabilities and risks can be assessed with clarity.

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-matichuk26a,
  title = {Toward Believable Health \& Wellness Conversational Agents: A Post-LLM Turing-like Evaluation Framework (Position Paper)},
  author = {Matichuk, Bruce and Goebel, Randy and Zaiane, Osmar},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {812--816},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/matichuk26a/matichuk26a.pdf},
  url = {https://proceedings.mlr.press/v318/matichuk26a.html},
  abstract = {Large language model (LLM) conversational agents can be remarkably fluent yet still fail to feel fully “real” to users, especially in multi-session and higher-stakes interactions. This paper argues that the limiting problem is no longer surface language quality but \emph{believability}: the conditions under which an artificial conversational partner is experienced as a coherent social mind rather than a fluent text generator. We frame believability as an empirical limit case and propose an operational criterion of \emph{bounded practical indistinguishability} relative to an interaction envelope defined by a judge population, interaction contexts, and a time horizon. We then outline a “post-LLM Turing-like” evaluation approach that stress-tests modern detection cues using contextual scenario families, longitudinal re-contact, and multi-signal measurement combining human judgments with behavioral metrics. Finally, we instantiate the framework for a health and wellness agent being developed with an \emph{industry partner} (details anonymized), arguing that wellness settings sharply amplify the importance of epistemic calibration, continuity, and boundary management. The goal is not to advocate deceptive deployment, but to make believability mechanistic and measurable so that both capabilities and risks can be assessed with clarity.}
}
Endnote
%0 Conference Paper
%T Toward Believable Health & Wellness Conversational Agents: A Post-LLM Turing-like Evaluation Framework (Position Paper)
%A Bruce Matichuk
%A Randy Goebel
%A Osmar Zaiane
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-matichuk26a
%I PMLR
%P 812--816
%U https://proceedings.mlr.press/v318/matichuk26a.html
%V 318
%X Large language model (LLM) conversational agents can be remarkably fluent yet still fail to feel fully “real” to users, especially in multi-session and higher-stakes interactions. This paper argues that the limiting problem is no longer surface language quality but \emph{believability}: the conditions under which an artificial conversational partner is experienced as a coherent social mind rather than a fluent text generator. We frame believability as an empirical limit case and propose an operational criterion of \emph{bounded practical indistinguishability} relative to an interaction envelope defined by a judge population, interaction contexts, and a time horizon. We then outline a “post-LLM Turing-like” evaluation approach that stress-tests modern detection cues using contextual scenario families, longitudinal re-contact, and multi-signal measurement combining human judgments with behavioral metrics. Finally, we instantiate the framework for a health and wellness agent being developed with an \emph{industry partner} (details anonymized), arguing that wellness settings sharply amplify the importance of epistemic calibration, continuity, and boundary management. The goal is not to advocate deceptive deployment, but to make believability mechanistic and measurable so that both capabilities and risks can be assessed with clarity.
APA
Matichuk, B., Goebel, R., & Zaiane, O. (2026). Toward Believable Health & Wellness Conversational Agents: A Post-LLM Turing-like Evaluation Framework (Position Paper). Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:812-816. Available from https://proceedings.mlr.press/v318/matichuk26a.html.
