Position: In-House Evaluation Is Not Enough. Towards Robust Third-Party Evaluation and Flaw Disclosure for General-Purpose AI

Shayne Longpre, Kevin Klyman, Ruth Elisabeth Appel, Sayash Kapoor, Rishi Bommasani, Michelle Sahar, Sean McGregor, Avijit Ghosh, Borhane Blili-Hamelin, Nathan Butters, Alondra Nelson, Dr. Amit Elazari, Andrew Sellars, Casey John Ellis, Dane Sherrets, Dawn Song, Harley Geiger, Ilona Cohen, Lauren McIlvenny, Madhulika Srikumar, Mark M. Jaycox, Markus Anderljung, Nadine Farid Johnson, Nicholas Carlini, Nicolas Miailhe, Nik Marda, Peter Henderson, Rebecca S. Portnoff, Rebecca Weiss, Victoria Westerhoff, Yacine Jernite, Rumman Chowdhury, Percy Liang, Arvind Narayanan
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:81728-81758, 2025.

Abstract

The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and policy, we identify key gaps in the evaluation and reporting of flaws in GPAI systems. We call for three interventions to advance system safety. First, we propose using standardized AI flaw reports and rules of engagement for researchers in order to ease the process of submitting, reproducing, and triaging flaws in GPAI systems. Second, we propose GPAI system providers adopt broadly-scoped flaw disclosure programs, borrowing from bug bounties, with legal safe harbors to protect researchers. Third, we advocate for the development of improved infrastructure to coordinate distribution of flaw reports across the many stakeholders who may be impacted. These interventions are increasingly urgent, as evidenced by the prevalence of jailbreaks and other flaws that can transfer across different providers’ GPAI systems. By promoting robust reporting and coordination in the AI ecosystem, these proposals could significantly improve the safety, security, and accountability of GPAI systems.
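The first intervention centers on standardized, machine-readable flaw reports. The abstract does not spell out a concrete format, so the following is a minimal illustrative sketch in Python of the kind of fields such a report might carry; the class name, field names, and example values are assumptions for illustration, not the authors' actual schema.

# Illustrative sketch only: a hypothetical standardized GPAI flaw report.
# All names and fields below are assumptions, not the paper's schema.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class FlawReport:
    system: str                    # affected GPAI system (hypothetical identifier)
    version: str                   # provider-reported system version
    summary: str                   # one-line description of the flaw
    reproduction_steps: list[str]  # prompts or actions that reproduce the flaw
    severity: str                  # e.g. "low" / "medium" / "high"
    transferable: bool             # reproduces on other providers' systems?
    reported_on: date = field(default_factory=date.today)

# Example: a jailbreak that transfers across providers, the scenario
# the abstract flags as making coordinated disclosure urgent.
report = FlawReport(
    system="example-llm",          # hypothetical system name
    version="2025-01",
    summary="Role-play prompt bypasses refusal policy",
    reproduction_steps=["Send prompt X", "Observe disallowed output"],
    severity="high",
    transferable=True,
)
print(report)

A standardized structure of this kind is what would let providers and coordinating bodies triage, deduplicate, and route reports across the many affected stakeholders.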

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-longpre25a,
  title = {Position: In-House Evaluation Is Not Enough. Towards Robust Third-Party Evaluation and Flaw Disclosure for General-Purpose {AI}},
  author = {Longpre, Shayne and Klyman, Kevin and Appel, Ruth Elisabeth and Kapoor, Sayash and Bommasani, Rishi and Sahar, Michelle and McGregor, Sean and Ghosh, Avijit and Blili-Hamelin, Borhane and Butters, Nathan and Nelson, Alondra and Elazari, Dr. Amit and Sellars, Andrew and Ellis, Casey John and Sherrets, Dane and Song, Dawn and Geiger, Harley and Cohen, Ilona and McIlvenny, Lauren and Srikumar, Madhulika and Jaycox, Mark M. and Anderljung, Markus and Johnson, Nadine Farid and Carlini, Nicholas and Miailhe, Nicolas and Marda, Nik and Henderson, Peter and Portnoff, Rebecca S. and Weiss, Rebecca and Westerhoff, Victoria and Jernite, Yacine and Chowdhury, Rumman and Liang, Percy and Narayanan, Arvind},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {81728--81758},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/longpre25a/longpre25a.pdf},
  url = {https://proceedings.mlr.press/v267/longpre25a.html}
}
Endnote
%0 Conference Paper
%T Position: In-House Evaluation Is Not Enough. Towards Robust Third-Party Evaluation and Flaw Disclosure for General-Purpose AI
%A Shayne Longpre
%A Kevin Klyman
%A Ruth Elisabeth Appel
%A Sayash Kapoor
%A Rishi Bommasani
%A Michelle Sahar
%A Sean McGregor
%A Avijit Ghosh
%A Borhane Blili-Hamelin
%A Nathan Butters
%A Alondra Nelson
%A Dr. Amit Elazari
%A Andrew Sellars
%A Casey John Ellis
%A Dane Sherrets
%A Dawn Song
%A Harley Geiger
%A Ilona Cohen
%A Lauren McIlvenny
%A Madhulika Srikumar
%A Mark M. Jaycox
%A Markus Anderljung
%A Nadine Farid Johnson
%A Nicholas Carlini
%A Nicolas Miailhe
%A Nik Marda
%A Peter Henderson
%A Rebecca S. Portnoff
%A Rebecca Weiss
%A Victoria Westerhoff
%A Yacine Jernite
%A Rumman Chowdhury
%A Percy Liang
%A Arvind Narayanan
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-longpre25a
%I PMLR
%P 81728--81758
%U https://proceedings.mlr.press/v267/longpre25a.html
%V 267
APA
Longpre, S., Klyman, K., Appel, R.E., Kapoor, S., Bommasani, R., Sahar, M., McGregor, S., Ghosh, A., Blili-Hamelin, B., Butters, N., Nelson, A., Elazari, D.A., Sellars, A., Ellis, C.J., Sherrets, D., Song, D., Geiger, H., Cohen, I., McIlvenny, L., Srikumar, M., Jaycox, M.M., Anderljung, M., Johnson, N.F., Carlini, N., Miailhe, N., Marda, N., Henderson, P., Portnoff, R.S., Weiss, R., Westerhoff, V., Jernite, Y., Chowdhury, R., Liang, P. & Narayanan, A. (2025). Position: In-House Evaluation Is Not Enough. Towards Robust Third-Party Evaluation and Flaw Disclosure for General-Purpose AI. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:81728-81758. Available from https://proceedings.mlr.press/v267/longpre25a.html.
