30–31 May 2024
Wigner Datacenter - HUN-REN Wigner Research Centre for Physics
Europe/Budapest timezone

Tensor Core Computing: an example on Independent Component Analysis

31 May 2024, 09:55
25m
Wigner Datacenter - HUN-REN Wigner Research Centre for Physics

HUN-REN Wigner RCP 1121 Budapest, Konkoly-Thege Miklós rd 29-33, Hungary
Lecture Session V

Speaker

Zeyu Wang (The University of Pannonia)

Description

The tensor core is a hardware unit found in most modern GPUs, first introduced in the NVIDIA Volta architecture. Like the well-known CUDA core (Streaming Processor, SP), the tensor core is a computing unit of the Streaming Multiprocessor (SM), but its input is a set of matrices rather than the single values processed by CUDA cores. Each tensor core provides a 4×4×4 matrix processing array that performs the operation D = A * B + C, where A, B, C, and D are 4×4 matrices. A single tensor core can therefore perform 64 floating-point FMA operations per clock cycle, 64 times the throughput of a traditional CUDA core. As a result, for algorithms dominated by matrix multiplication and addition, tensor cores can deliver several times the performance of CUDA cores.
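
To make the programming model concrete, the following is a minimal, hypothetical sketch of the D = A * B + C operation expressed through CUDA's warp-level WMMA API. Although each tensor core operates on 4×4×4 tiles in hardware, CUDA exposes them through larger warp-level fragments (here the 16×16×16 FP16 shape); the kernel name, tile size, and mixed-precision choice are illustrative and not taken from the talk. The kernel requires compute capability 7.0 or newer (e.g. nvcc -arch=sm_70).

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes D = A * B + C on a single 16x16 tile (illustrative sketch).
    __global__ void tile_mma(const half *A, const half *B, const float *C, float *D) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        // Load the 16x16 operand tiles from global memory into warp-level fragments.
        wmma::load_matrix_sync(a_frag, A, 16);
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

        // Warp-synchronous matrix multiply-accumulate executed on the tensor cores.
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

        // Write the accumulated tile back to global memory.
        wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
    }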

Independent Component Analysis (ICA) is a fundamental tool in EEG data analysis, most frequently used to separate unwanted noise artifacts from neural signals. Despite its advantages and importance, using ICA in a data analysis pipeline is problematic because the algorithm is computationally intensive. ICA is essentially a statistical signal unmixing method (a forward-backward propagation algorithm) that iteratively updates the unmixing/mixing matrix to achieve maximum statistical independence among the components, and each iteration/propagation involves numerous matrix multiplication operations. Although GPUs seem ideal for speeding up ICA, the only known GPU-based implementation is CUDA-ICA, proposed by Raimondo et al. in 2011 using traditional CUDA cores. The introduction of tensor cores could bring new opportunities for further accelerating ICA computation.
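
As an illustration of where those per-iteration matrix multiplications occur, the sketch below shows one plausible Infomax natural-gradient update, W <- W + lr * (I + (1/T) * Y * U^T) * W with U = W * X and Y = 1 - 2*sigmoid(U), expressed as cuBLAS GEMM calls plus small element-wise kernels. This is only an assumed textbook formulation; the implementation discussed in the talk (and CUDA-ICA itself) may organise the computation differently, and all names and the learning-rate convention are illustrative.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Y = 1 - 2*sigmoid(U), element-wise (logistic Infomax nonlinearity).
    __global__ void logistic_score(const float *U, float *Y, int size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < size) Y[i] = 1.0f - 2.0f / (1.0f + expf(-U[i]));
    }

    // M <- M + I (add the identity to the n x n matrix M, column-major).
    __global__ void add_identity(float *M, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) M[i * n + i] += 1.0f;
    }

    // One Infomax update step. X is the n x T data block, W the n x n unmixing
    // matrix; U, Y (n x T) and M, dW (n x n) are preallocated device workspaces.
    // All matrices are stored column-major to match cuBLAS conventions.
    void infomax_step(cublasHandle_t h, int n, int T, float lr,
                      const float *X, float *W,
                      float *U, float *Y, float *M, float *dW) {
        const float one = 1.0f, zero = 0.0f, invT = 1.0f / T;
        const int threads = 256;

        // U = W * X  (first GEMM of the iteration: forward propagation)
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, T, n, &one, W, n, X, n, &zero, U, n);

        // Y = 1 - 2*sigmoid(U)
        logistic_score<<<(n * T + threads - 1) / threads, threads>>>(U, Y, n * T);

        // M = (1/T) * Y * U^T, then M += I  (second GEMM: gradient statistics)
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_T, n, n, T, &invT, Y, n, U, n, &zero, M, n);
        add_identity<<<(n + threads - 1) / threads, threads>>>(M, n);

        // dW = lr * M * W, then W += dW  (third GEMM: natural-gradient update)
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &lr, M, n, W, n, &zero, dW, n);
        cublasSaxpy(h, n * n, &one, dW, 1, W, 1);
    }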

In this talk, we will outline the strategy and implementation details of using tensor cores to accelerate the Infomax ICA algorithm. First, we will introduce the tensor core programming paradigm in CUDA, including its abstraction and hierarchy in software and hardware. Then, we will study how the ICA algorithm can be adapted to tensor cores: we will analyse the steps that can be accelerated by tensor cores and outline the implementation strategy. Finally, we will present our development workflow on the Komondor supercomputer and preliminary results of the implementation.

Primary author

Zeyu Wang (The University of Pannonia)
