LLNL has begun integrating the new AI hardware, SambaNova Systems DataScale™, into the NNSA’s Corona supercomputing cluster, an 11-plus petaFLOP machine that Lab scientists are using to conduct fusion energy research for stockpile stewardship applications, find therapeutics for COVID-19 and perform other unclassified basic science work.
Lab researchers said the upgrade will allow them to run scientific simulations on the Corona system while offloading AI calculations from those simulations to the SambaNova DataScale system, improving overall speed, performance and productivity.
“This integration enables low-latency communication between the two devices allowing them to operate in tandem with greater overall efficiency,” said LLNL computer scientist Ian Karlin, who heads the SambaNova project. “In addition, scientific simulations running on Corona will feed data as they run into the SambaNova DataScale system to train new machine learning models based on their results.”
Once the integration is complete, LLNL researchers plan to use the platform to continue exploring the combination of high performance computing (HPC) and AI, an innovative effort LLNL calls “cognitive simulation” (CogSim). Researchers said the two systems working in tandem will enable more streamlined computation and allow them to move applications into this new paradigm of computing.
“AI accelerators provide the basis for a heterogeneous system architecture that will support efficient cognitive simulation,” said Bronis de Supinski, chief technology officer for Livermore Computing. “Livermore Computing is leading the integration of these subsystems into large-scale resources such as Corona. Our strategy is already demonstrating that this approach will provide more cost-efficient solutions for the workloads of the future.”
The AI system was funded by NNSA’s Advanced Simulation and Computing program and comes to LLNL as part of an agreement between the Department of Energy (DOE) and SambaNova Systems to accelerate AI within the DOE national laboratories. Another such system is being deployed at Los Alamos National Laboratory, where it has been integrated into a heterogeneous system called “Darwin” and will initially be used to model quantum chemistry, according to NNSA.
“SambaNova Systems is enabling next-generation AI applications to be reimagined beyond today’s current infrastructure limitations,” said Marshall Choy, vice president of products, SambaNova Systems. “Working in close partnership with LLNL’s team of researchers to help accelerate world-changing discoveries and experiments is a game changer for science and computing.”
SambaNova DataScale is designed for efficient deep-learning inference and training calculations. It features the SambaFlow™ software stack and the SambaNova Systems Cardinal SN10 RDU™, which the company describes as the world’s first Reconfigurable Dataflow Unit (RDU) chip. The RDU is a next-generation processor designed from the ground up to run dataflow workloads such as AI efficiently. The SambaNova DataScale system contains eight RDUs, each capable of supporting multiple simultaneous jobs or working together with the others to execute large-scale models, according to the company.
An early test for the system at LLNL is a CogSim approach to inertial confinement fusion (ICF) reactions for stockpile stewardship applications. Researchers said the SambaNova DataScale’s ability to run dozens of inference models at once while scientific calculations proceed on the Corona system will aid their effort to use machine learning to achieve higher energy output and create more robust fusion implosions.
“AI acceleration of even one of the many complicated physics packages in an ICF simulation can halve the time to solution,” said LLNL physicist Brian Spears. “This allows us great flexibility – either to explore a wider range of physics hypotheses or to increase the detail of our physics models without costing us more time.”
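The arithmetic behind a claim like this can be illustrated with Amdahl's law. The sketch below is only a back-of-the-envelope illustration: the function name and the numbers (a package taking 60 percent of the runtime, accelerated sixfold) are hypothetical and do not come from LLNL.

```python
def overall_speedup(fraction, package_speedup):
    """Amdahl's law: speedup of the whole run when only one part gets faster.

    fraction        -- share of the original runtime spent in the accelerated package
    package_speedup -- how many times faster that package runs after acceleration
    """
    return 1.0 / ((1.0 - fraction) + fraction / package_speedup)


# Hypothetical numbers, chosen only for illustration: a physics package that
# takes 60% of the runtime and is made 6x faster by an AI surrogate.
example = overall_speedup(0.6, 6.0)
```

With these assumed values the whole simulation runs about twice as fast, which is to say the time to solution is roughly halved. The same formula shows that halving is only possible when the accelerated package accounts for at least half of the original runtime.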
Initial performance results are promising, with early applications running at least 5x faster than on GPUs when normalized by the number of transistors used, researchers said.
LLNL computer scientists added that they will make their applications asynchronous to take advantage of tandem computing and increase efficiency.
“We are redesigning our HPC codes to offload machine learning (ML) calculations,” said LLNL AI researcher Brian Van Essen. “While the ML work is done on accelerators the HPC calculation will continue on GPU machines.”
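The offload pattern Van Essen describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LLNL's code: a thread pool stands in for the accelerator, and `surrogate_inference`, `physics_step` and `run_simulation` are hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def surrogate_inference(state):
    """Hypothetical stand-in for an ML model evaluated on an AI accelerator."""
    return [2.0 * x for x in state]


def physics_step(state):
    """Hypothetical stand-in for the HPC portion of one simulation time step."""
    return [x + 1.0 for x in state]


def run_simulation(steps, state):
    """Overlap ML work with the HPC calculation on every time step."""
    ml_results = []
    # The single worker plays the role of the accelerator (an assumption).
    with ThreadPoolExecutor(max_workers=1) as accelerator:
        for _ in range(steps):
            # Hand a snapshot of the state to the accelerator without waiting.
            future = accelerator.submit(surrogate_inference, list(state))
            # The HPC calculation continues while the ML work is in flight.
            state = physics_step(state)
            # Synchronize only at the point where the ML result is needed.
            ml_results.append(future.result())
    return state, ml_results


final_state, ml_results = run_simulation(3, [0.0, 1.0])
```

The point of the design is that the simulation waits for the ML result only when it actually needs it, so the two kinds of work overlap instead of running one after the other.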
Researchers said the SambaNova DataScale system can also be applied to the small molecule drug design work being done on the Corona system to find therapeutic compounds capable of binding to SARS-CoV-2, the virus that causes COVID-19. This work uses machine learning models to generate new potential compounds that are evaluated for safety and efficacy using HPC simulations on the Corona system.