Seminar Date: Tuesday, September 23, 2025
Time: 11:00 AM PT
Location: 67-3111 & Zoom
Talk Title: Unlocking unprecedented domains in computational chemistry with massive open datasets and AI models

Abstract:
Computational chemistry and materials science are being revolutionized by machine-learned interatomic potentials (MLIPs), which have the capacity to deliver quantum chemical accuracy at 10,000-fold reduced computational cost. A sufficiently fast and accurate MLIP would enable predictive high-throughput molecular/materials screening campaigns to explore vast regions of chemical space and facilitate ab initio-level simulations at length and time scales that were previously inaccessible. However, a fundamental challenge to creating MLIPs that perform well across molecular chemistry is the lack of comprehensive training data. To address this gap, we introduce Open Molecules 2025 (OMol25), a large-scale dataset composed of more than 100 million density functional theory calculations at the ωB97M-V/def2-TZVPD level of theory, representing billions of CPU core-hours of compute. OMol25 uniquely blends elemental, chemical, and structural diversity, including 83 elements, a wide range of intra- and intermolecular interactions, explicit solvation, variable charge/spin, conformers, and reactive structures, while covering small molecules, biomolecules, metal complexes, and electrolytes, with an imminent extension to polymers. We further develop a comprehensive set of model evaluations that populate a public leaderboard to guide MLIP model development and allow domain scientists to understand where current models are reliable. In this talk, I will discuss dataset construction and composition, currently available models trained on the data – including Meta’s Universal Model of Atoms (UMA) – and how well models perform on our tests and evaluations.
I will further cover how the models are already being put to use by the community, the novel capabilities and opportunities enabled by OMol25/UMA, where next-generation MLIP architectures should seek to improve, and future directions worth pursuing that build on the paradigm shift toward accurate and general large-scale pre-trained MLIPs.
Bio:
Dr. Samuel M. Blau is a Research Scientist at Berkeley Lab working at the intersection of computational chemistry, materials science, high-performance computing, and machine learning. He received his B.S. from Haverford College in 2012 and his Ph.D. in Chemical Physics from Harvard University in 2017. Sam has pioneered the use of self-correcting molecular simulation workflows to enable the construction of chemical reaction networks describing complex reaction cascades, e.g., those responsible for battery interphase formation and photoresist patterning. Sam’s research group also develops novel datasets, representations, and models for machine learning of chemistry and materials, as well as methods that leverage ML model speed and differentiability for accelerated scientific discovery.