### Tensor Network Algorithms on Al accelerators

### Andor Menczer Supervisors: Örs Legeza, Tamás Kozsik

AIME 2024

Budapest, 2024.11.22.

### Strong correlations between electrons $\rightarrow$ exotic materials



High  $T_{\rm c}$  superconductors



Nitrogenase Cofactor, FeMoco



Lee, Small & Head-Gordon, JCP, 2018, 149, 244121

### Single Molecular Magnets



Battery Technology

Andor Menczer (ELTE)

# Experimental realizations: optical lattices Numerical simulations: model systems



- Atoms (represented as blue spheres) pictured in a 2D-optical lattice potential
- Potential depth of the optical lattice can be tuned.
- Periodicity of the optical lattice can be tuned.

Hubbard model: lattice model of interacting electron system

$$\mathtt{H} = \mathtt{t} \sum_{\langle \mathtt{i}, \mathtt{j} \rangle, \sigma} \mathtt{c}^{\dagger}_{\mathtt{i}, \sigma} \mathtt{c}_{\mathtt{j}, \sigma} + \frac{\mathtt{U}}{2} \sum_{\sigma \neq \sigma'} \sum_{\mathtt{i}} \mathtt{n}_{\mathtt{i}, \sigma} \mathtt{n}_{\mathtt{i}, \sigma'}$$

t hopping amplitude U on-site Coulomb interaction  $\sigma \in \uparrow, \downarrow$  spin index



## Properties of the TNS/DMRG algorithms

- Key aspect of TNS/DMRG: exponential scaling can be reduced to a polynomial form.
- Underlying tensor and matrix algebra can be organized into several million of independent operations (tasks).
- Dense matrix operations are performed in parallel according to the so-called quantum number decomposed representations (sectors).
- Full matrices, denoted as DMRG bond dimension, D, determines the accuracy of the calculations.
- $\bullet\,$  The overall scaling of the DMRG is  $D^3 \mathbb{N}^4$  where  $\mathbb{N}$  stands for the system size.
- The memory requirement is proportional to  $D^2 N^2$ .
- The iterative diagonalization of the effective Hamiltonian usually accounting for 85% of the total execution time.
- The renormalization step is responsible for 10% of the total execution time.

Andor Menczer (ELTE)

## TNS/DMRG provide state-of-the-art results in many fields

$$\mathcal{H} = \sum_{\mathbf{ij}\alpha\beta} \mathsf{T}^{\alpha\beta}_{\mathbf{ij}} \mathsf{c}^{\dagger}_{\mathbf{i}\alpha} \mathsf{c}_{\mathbf{j}\beta} + \frac{1}{2} \sum_{\mathbf{ijkl}\alpha\beta\gamma\delta} \mathsf{V}^{\alpha\beta\gamma\delta}_{\mathbf{ijkl}} \mathsf{c}^{\dagger}_{\mathbf{i}\alpha} \mathsf{c}^{\dagger}_{\mathbf{j}\beta} \mathsf{c}_{\mathbf{k}\gamma} \mathsf{c}_{\mathbf{l}\delta} + \dots \,,$$

- $\bullet~T_{\texttt{ij}}$  kinetic and on-mode terms,  $\mathtt{V}_{\texttt{ijkl}}$  two-particle scatterings
- We consider usually lattice models in real space (DMRG)
- In quantum chemistry modes are electron orbitals (QC-DMRG)
- In UHF QC spin-dependent inetractions (UHF-QCDMRG)
- In relativistic quantum chemistry modes are spinors (4c-DMRG)
- In nuclear problems modes are proton/neutron orbitals (JDMRG)
- In k-space modes are momentum eigenstates (k-DMRG)
- $\bullet\,$  For particles in confined potential modes  $\,\to\,$  Hermite polynoms
- Major aim: to obtain the desired eigenstates of  $\mathcal{H}$ .

### Quarter petaflops on a single node $\,\sim\,$ 10000x speedup



NVIDIA DGX H100 and Grace Hopper GH200: Testing performance up to  $\sim 250$  TFLOPS in collab with NVIDIA and SandboxAQ, M. van Damme, A. Menczer, M. Ganahl, J. Hammond, S. Xantheas, ÖL

Combination of our MPI and GPU kernels: full replacement of boost library, asynchronous IO, multiNode-multiGPU.

## Wigner, PNNL, NVIDIA, SandboxAQ joint press release

 $|\Psi_{MPS}\rangle = \sum \sum [A_1]_{1\alpha_1}^{i_1} [A_2]_{\alpha_1\alpha_2}^{i_2} ... [A_N]_{\alpha_{N-1}}^{i_N} |i_1...i_k\rangle$  $\{i_k\}\{\alpha_p\}$  $E_{opt} = \min_{|\Psi_{MPS}\rangle} \frac{\langle \Psi_{MPS} | H | \Psi_{MPS} \rangle}{\langle \Psi_{MPS} | \Psi_{MPS} \rangle}$ 

www.pnnl.gov/news-media/collaboration-speeds-complex-chemical-modeling

Andor Menczer (ELTE)

TNS Algorithms on AI accelerators

2024.11.22

### Our TNS/DMRG code will be used as benchmark

### 📀 NVIDIA

### NVIDIA GH200 Grace Hopper Superchip

The breakthrough accelerated CPU for largescale AI and high-performance computing (HPC) applications.

#### The World's Most Versatile Computing Platform

The NVIDIA Grace Hopper<sup>®</sup> architecture brings together the groundbreaking performance of the NVIDIA Hopper<sup>®</sup> GPU with the versatility of the NVIDIA Grace<sup>®</sup> CPU in a single superchip, connected with the high-bandwidth, memory-coherent NVIDIA<sup>®</sup> NVIDIA<sup>®</sup> (NLink<sup>®</sup> Chip-2-Chip (C2C) interconnect.

NVIDIA NVLIM-C2C is a memory-coherent, high-bandwidth, and low-latency interconnect for sequencing. The hard for the GH200 Gase Neperchip, the delivers up to 900 gigabytes per second (GB)(a) of trad bandwidth, which is 7X higher than PCIe Geo Binas commonly used in accelerated systems. NVLin-C2C enables applications to oversubacrite the GPU's memory and directly utilized NVDIA. Gines CPU's memory at high bandwidth, while up to 800G GB to DODGK CPU memory per GIC00 Gines hopper Superchip, the GPU has direct access to 7X more fast memory than hill So almost 10 memory best to memory and activity. Bind and per GIC00 Gines hopper Superchip, the GPU has direct access to 7X more fast memory than hill So almost 30 more fast memory with hall So GB clobal and other compate and memory-intensive werklash. GH200 can alice be combined with the NVDIA MUK soluti, Switch Systems with all GPU thads and GH20 NVLinic-connected GPUs and able to access up to 144 terabytes (TB) of memory at high bandwidth.

#### NVIDIA GH200 Grace Hopper Superchip





#### Key Features

- > 72-core NVIDIA Grace CPU
- NVIDIA H100 Tensor Core GPU
- Up to 480GB of LPDDR5X memory with error-correction code (ECC)
- Supports 96GB of HBM3 or 144GB of HBM3e
- > Up to 624GB of fast-access memory
- NVLink-C2C: 900GB/s of coherent memory

Andor Menczer (ELTE)

TNS Algorithms on AI accelerators

2024.11.22.

### SC24: NVIDIA TOP500 BoF Accuracy of Emulated FP64



Andor Menczer (ELTE)

TNS Algorithms on AI accelerators

2024.11.22

9/19

### Collaborations

- More than 30 research groups worldwide from condensed matter physics, quantum chemistry, nuclear physics, quantum information theory, applied mathematics and computer science
- High-Performance Computing Center Stuttgart, Germany
- Pacific Northwest National Laboratory (PNNL), USA
- National Energy Research Scientific Computing Center (NERSC), USA
- Recently there is also an interest by industrial partners.
  - NVIDIA, USA
  - AMD, USA
  - SandboxAQ, USA (Google startup)
  - Furukawa Electric Institute of Technology, Japan
  - Riverlane LTD, UK
  - Dynaflex LTD, Hungary

### Memory management: Data Dependency Trees

- TTCache is a model for virtual memory addressing designed to vastly reduce redundant IO operations and eliminate memory fragmentation as well as allocation overhead.
- **TTCache** works by factorizing data into attributes, then hierarchically mapping such attributes to execution blocks. Execution is done by traversing a tree-like structure, in which nodes close to each other depend on largely the same set of attributes



- Fragmentation-free, sequential write and read operations.
- Allocations and deallocations are purely virtual.

Andor Menczer (ELTE)

TNS Algorithms on AI accelerators

2024.11.22.

11/19

### Strided Batched Matrix Multiplication for Summation

• **SBMM4S** is a batched type matrix multiplication with inherent zero cost sum reduction. Produces a single result by multiplying an entire batch of matrices with concatenated vector arrays of interleaved matrices. Intermediate results of chained matrix multiplications are reached using strided batched type matrix multiplications with specific offset values to enable interleaving.



• A summation such as  $B := B + \alpha(\sum_{i=1}^{p} L_i * A * R_i^{\top})$  can be executed in parallel by first independently calculating the interleaved vectors of each  $A * R_i^{\top}$ , then multiplying concat $(L_1, ..., L_p)$  with the matrix holding all previously calculated vectors.

### Improved partial execution of SBMM4S

- The multiplication can to be broken up into multiple SBGEMM operations. The leading dimensions and stride values are set in a way that the vectors of the result matrices became interleaved.
- The sequence of such vectors can be viewed as a singular horizontally or vertically very long stripe-like matrix, just as before.



Example for partial SBMM4S

**Partial SBMM4S** works as a zero overhead drop-in replacement for both GEMM and regular SBMM4S:

- No auxiliary data
- No extra calculations
- Results remain monolithic

### Low-latency Self-scheduled threading

- Parallel models relying on inter-thread communication might be ineffective when bombarded with an extreme amount of tiny tasks
- Homogeneous threading with imprinted heuristics as guidance leads to a lightweight, decentralized and communication-free parallel construct.



Maze-Runner threads used previously

**Contractor** threads are selforganized ultra-lightweight constructs:

- No external scheduling
- No locking
- No barriers

### Hash based thread scheduling

- Assigning different groups of tasks to different workers can result in unwanted idle time due to size differences, while flattened task queues can result in high IO overhead due to decreased spatial locality.
- Hashing on the other hand assigns different groups to different workers whenever possible, but at the same time allows multiple devices to work within the same group if necessary.



## Performance up to CAS(113,76) dim $\mathcal{H}=2.88 imes10^{36})$

Performance for the  $F_2$  and FeMoco chemical systems for CAS(18,18), CAS(54,54) and CAS(113,76) orbitals spaces, respectively, as a function of the DMRG bond dimension on a dual Intel(R) Xeon(R) Gold 5318Y CPU.



Andor Menczer (ELTE)

TNS Algorithms on AI accelerators

16 / 19

### SU2 effective performance

Benchmark results obtained via the SU(2) spin adapted hybrid CPU-multiGPU DMRG for the F<sub>2</sub> molecule for a CAS(18,18) orbital space. Calculations have been performed on a dual AMD EPYC 7702 CPUs with  $2\times64$  cores compiled with eight NVIDIA A100-SXM4-40GB devices.



2024.11.22

- The power consumption of the TNS calculations are becoming one of the most important question due to high energy demands and costs.
- The thermal design power (TDP) for 2  $\times$  Intel(R) Xeon Gold 5318Y CPU is 2  $\times$  165 Watts  $\rightarrow$  2.5 TFLOPS would lead to  $\approx$  7.5 GFLOPS/Watt.
- For an NVIDIA A100-PCIE-40GB device the TDP is 250 Watts.
- For our 8 card accelerated hybrid algorithm with 110 TFLOPS performance results in  $\approx 47.2$  GFLOPS/Watt.
- For a given calculation the cost of the energy demand arising from the processors can be reduced to 1/6 of the original consumption.
- The energy consumption of the GPU devices fluctuates significantly, thus even a better ratio can be obtained.

### Ongoing and future work

- **Published:** GPU accelerated DMRG as shown previously, featuring Maze-Runner, TTcache and SBMM4S
- **Published:** Collaboration with NVIDIA, accelerating the simulation of strongly correlated systems using state-of-the-art supercomputing hardware known as DGX-H100. Other prototype super-hardware might also be tested.
- **Published:** Matrix dimensions can be further decreased by exploiting SU(2) symmetries. This leads to higher accuracy at the same dimensions or similar accuracy at much lower dimensions. Improved memory management, thread scheduling and support for a more general SBMM4S with partial execution.
- Published: GPU accelerated simulations of quantum lattices.
- In-progress: Unbounded scalability through multi-node execution using MPI and InfiniBand. Target: 1 PETAFLOPS.
- End Goal: Accurate modelling of strongly correlated subatomic particles at 1+ EXAFLOPS.

Andor Menczer (ELTE)