May 30, 2022 — Across Europe and the United States, HPC developers are supercharging supercomputers with the power of Arm cores and accelerators built into NVIDIA BlueField-2 DPUs.
At Los Alamos National Laboratory (LANL), this work is part of a large, multi-year collaboration with NVIDIA that aims for 30x speedups in multi-physics computing applications.
LANL researchers predict significant performance gains using data processing units (DPUs) running on NVIDIA Quantum InfiniBand networks. They will pioneer techniques for computational storage, pattern matching and more using BlueField and its NVIDIA DOCA software framework.
An open API for DPUs
The efforts will also help better define OpenSNAPI, an application interface anyone can use to leverage DPUs. It is a project of the Unified Communication Framework, a consortium enabling heterogeneous computing for HPC applications whose members include Arm, IBM, NVIDIA, US national laboratories and US universities.
LANL is already feeling the power of networked computing, thanks to a DPU-powered storage system it created.
The Accelerated Box of Flash (ABoF) combines solid-state storage with DPU and InfiniBand accelerators to speed up the performance-critical parts of a Linux file system. It is up to 30 times faster than similar storage systems and is becoming a key part of LANL’s infrastructure.
ABoF places compute near storage to minimize data movement and improve the efficiency of simulation and data analysis pipelines, a researcher said in a recent LANL blog post.
Texas rides a cloud-native super
The Texas Advanced Computing Center (TACC) is the latest to adopt BlueField-2 in Dell PowerEdge servers. It will use DPUs on an InfiniBand network to make its Lonestar6 system a development platform for cloud-native supercomputing.
TACC’s Lonestar6 serves a wide range of HPC developers at Texas A&M University, Texas Tech University, and the University of North Texas, as well as numerous research centers and faculties.
MPI is accelerating
Eight hundred miles to the northeast, researchers at Ohio State University have shown how DPUs can run one of the most popular HPC programming models up to 26% faster.
By offloading critical parts of the Message Passing Interface (MPI), they accelerated P3DFFT, a library used in many large-scale HPC simulations.
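To see why offloading communication can shorten a run, consider a simplified timing model of one simulation step. This is only a back-of-the-envelope sketch, not the Ohio State team's method: the function names, the fixed handoff cost and the millisecond figures are all hypothetical. When the host processor handles communication itself, compute and communication times add up. When another processor progresses the communication in the background, the step lasts roughly as long as the slower of the two.

```python
def step_time_blocking(compute_ms: float, comm_ms: float) -> float:
    """Host waits for communication to finish before computing: costs add up."""
    return compute_ms + comm_ms


def step_time_offloaded(compute_ms: float, comm_ms: float,
                        handoff_ms: float = 0.0) -> float:
    """Communication progresses elsewhere while the host computes.

    The step ends when the slower activity finishes, plus a hypothetical
    fixed cost for handing the communication off.
    """
    return max(compute_ms, comm_ms) + handoff_ms


def percent_time_saved(before_ms: float, after_ms: float) -> float:
    """Reduction in step time, as a percentage of the original step time."""
    return 100.0 * (before_ms - after_ms) / before_ms


if __name__ == "__main__":
    # Hypothetical figures chosen for illustration only, not measurements.
    compute_ms, comm_ms = 8.0, 2.0
    before = step_time_blocking(compute_ms, comm_ms)
    after = step_time_offloaded(compute_ms, comm_ms)
    print(f"blocking: {before} ms, offloaded: {after} ms, "
          f"saved: {percent_time_saved(before, after)}%")
```

In this model the gain is capped by the smaller of the two costs, so applications that spend a larger share of each step communicating stand to benefit more.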
“DPUs are like assistants managing the work of busy executives, and they will become mainstream because they can speed up any workload,” said Dhabaleswar K. (DK) Panda, a professor of computer science and engineering at Ohio State, who led the DPU work with his team’s MVAPICH open source software.
DPUs in HPC centers, clouds
The double-digit increases are huge for supercomputers running HPC simulations like drug discovery or aircraft design. And cloud services can use those gains to boost their customers’ productivity, said Panda, who has received requests from multiple HPC centers for his code.
Quantum InfiniBand networks with features like NVIDIA SHARP help make his work possible.
“Others talk about network computing, but InfiniBand supports it today,” he said.
Durham does load balancing
Several research teams in Europe are accelerating MPI and other HPC workloads with BlueField DPUs.
For example, Durham University in Northern England is developing software for load balancing MPI tasks using BlueField DPUs on a 16-node Dell PowerEdge cluster. The work will pave the way for more efficient processing of better algorithms at HPC installations around the world, said Tobias Weinzierl, principal investigator of the project.
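As a rough illustration of what load balancing means here, the sketch below spreads tasks of known cost across a set of ranks so that no rank ends up far busier than the others. It uses a generic greedy heuristic (largest task first, always onto the least-loaded rank) and is not the Durham project's software. The `assign_tasks` function and the task costs are hypothetical.

```python
import heapq


def assign_tasks(task_costs, num_ranks):
    """Greedily assign tasks to ranks, largest task first.

    Returns (assignment, loads): assignment maps each rank to a list of
    task indices, and loads[rank] is the total cost placed on that rank.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")

    # Min-heap of (current load, rank) so the least-loaded rank pops first.
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(num_ranks)}

    # Visit tasks from most to least expensive.
    by_cost = sorted(enumerate(task_costs), key=lambda pair: pair[1], reverse=True)
    for task_id, cost in by_cost:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(task_id)
        heapq.heappush(heap, (load + cost, rank))

    loads = [0.0] * num_ranks
    for load, rank in heap:
        loads[rank] = load
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical task costs in arbitrary units.
    assignment, loads = assign_tasks([7, 5, 4, 3, 1], num_ranks=2)
    print(assignment, loads)
```

A real balancer would also have to cope with task costs that change at run time, which is where spare processing capacity next to the network could help.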
DPUs in Cambridge, London, Munich
Researchers in Cambridge, London and Munich are also using DPUs.
For its part, University College London is studying how to schedule tasks for a host system on BlueField-2 DPUs. It’s a capability that could be used, for example, to move data between host processors so it’s there when they need it.
BlueField DPUs inside Dell PowerEdge servers in the Cambridge Service for Data Driven Discovery offload security policies, storage frameworks and other tasks from host processors, optimizing system performance.
Meanwhile, researchers from the Computer Architecture and Parallel Systems group at the Technical University of Munich are investigating ways to offload both MPI and operating system tasks with DPUs as part of a EuroHPC project.
Back in the United States, Georgia Tech researchers are collaborating with Sandia National Laboratories to accelerate molecular dynamics work using BlueField-2 DPUs. A paper describing their work so far shows that the algorithms can be sped up by up to 20% without loss of simulation accuracy.
An expanding network
Earlier this month, researchers in Japan announced a system using the latest NVIDIA H100 Tensor Core GPUs on our fastest, smartest network ever, the NVIDIA Quantum-2 InfiniBand platform.
NEC will build an H100-based supercomputer of about 6 PFLOPS for the Center for Computational Sciences at the University of Tsukuba. Researchers will use it for climatology, astrophysics, big data, AI and more.
Meanwhile, researchers like Panda are already thinking about how they will use the cores of BlueField-3 DPUs.
“It will be like hiring executive assistants with college degrees instead of those with high school degrees, so hopefully more offloads will be done,” he joked.
Source: Gilad Shainer, NVIDIA