Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies

Gati V Aher; Rosa I. Arriaga; Adam Tauman Kalai

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:337-371, 2023.

Abstract

We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model’s simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a “hyper-accuracy distortion” present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.

Cite this Paper

BibTeX

@InProceedings{pmlr-v202-aher23a,
  title = 	 {Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies},
  author =       {Aher, Gati V and Arriaga, Rosa I. and Kalai, Adam Tauman},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {337--371},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/aher23a/aher23a.pdf},
  url = 	 {https://proceedings.mlr.press/v202/aher23a.html},
  abstract = 	 {We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model’s simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a “hyper-accuracy distortion” present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.}
}

Endnote

%0 Conference Paper
%T Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
%A Gati V Aher
%A Rosa I. Arriaga
%A Adam Tauman Kalai
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-aher23a
%I PMLR
%P 337--371
%U https://proceedings.mlr.press/v202/aher23a.html
%V 202
%X We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model’s simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a “hyper-accuracy distortion” present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.

APA

Aher, G.V., Arriaga, R.I. & Kalai, A.T. (2023). Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:337-371. Available from https://proceedings.mlr.press/v202/aher23a.html.
