Position: Data-driven Discovery with Large Generative Models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Sanchaita Hazra, Ashish Sabharwal, Peter Clark
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:34350-34382, 2024.

Abstract

With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery—a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DataVoyager, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata—a feat previously unattainable—while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-majumder24a, title = {Position: Data-driven Discovery with Large Generative Models}, author = {Majumder, Bodhisattwa Prasad and Surana, Harshit and Agarwal, Dhruv and Hazra, Sanchaita and Sabharwal, Ashish and Clark, Peter}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {34350--34382}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/majumder24a/majumder24a.pdf}, url = {https://proceedings.mlr.press/v235/majumder24a.html}, abstract = {With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery—a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DataVoyager, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata—a feat previously unattainable—while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.} }
Endnote
%0 Conference Paper %T Position: Data-driven Discovery with Large Generative Models %A Bodhisattwa Prasad Majumder %A Harshit Surana %A Dhruv Agarwal %A Sanchaita Hazra %A Ashish Sabharwal %A Peter Clark %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-majumder24a %I PMLR %P 34350--34382 %U https://proceedings.mlr.press/v235/majumder24a.html %V 235 %X With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery—a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DataVoyager, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata—a feat previously unattainable—while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.
APA
Majumder, B.P., Surana, H., Agarwal, D., Hazra, S., Sabharwal, A. & Clark, P.. (2024). Position: Data-driven Discovery with Large Generative Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:34350-34382 Available from https://proceedings.mlr.press/v235/majumder24a.html.

Related Material