14th GPU Day - Massively parallel computing for science and industrial applications
The 14th GPU Day will be organized by the Wigner Scientific Computation Laboratory, HUN-REN Wigner RCP, on May 30-31, 2024, at the Wigner Data Center on the KFKI Campus.
The GPU Day is an international conference series that has been organized annually for a decade, dedicated to massively parallel technologies in scientific and industrial applications. Its goal is to bring together researchers from academia, developers from industry, and interested students to exchange experiences and learn about future massively parallel technologies. The event provides a unique opportunity to exchange knowledge and expertise on GPU (Graphical Processing Unit), FPGA (Field-Programmable Gate Array), and quantum simulations. As in previous years, we expect 80+ participants both in person and online.
For the earlier events see: 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014
Participation is free for students, members of academic institutions, research centers, and universities.
The registration fee is 300 EUR or 120,000 HUF. Participants can pay via bank transfer or by card at the registration desk.
The conference will be held in person, with a limited number of on-site places. We encourage our former, current, and future partners to contribute to the conference; contributions are welcome.
Keynote speakers
Important deadlines
Talk submission deadline: May 24, 2024
Sponsors
More information is available on gpuday.com
Useful information for sponsors.
Organizers:
Gergely Gábor Barnaföldi
Gábor Bíró
Balázs Kacskovics
The CERN Quantum Technology Initiative (QTI) was launched in 2020 with the aim of investigating the role that quantum technologies could have within the High Energy Physics (HEP) research program. During this initial exploratory phase, a set of results was gathered, outlining the benefits, constraints, and limitations of introducing these technologies in different HEP domains, from advanced sensors for next-generation detectors to computing. These findings have been used to define a longer-term research plan, closely aligned with the technological development of quantum infrastructure and the HEP priorities.
The CERN QTI has now entered its second phase, dedicated to extending and sharing technologies uniquely available at CERN, while boosting development and adoption of quantum technologies in HEP and beyond.
This talk will summarize the experience accumulated over the past years, outlining the main QTI research results, focusing in particular on the field of quantum computing, and providing a perspective on future research directions.
Piquasso is a full-stack open-source software platform for simulating and programming photonic quantum computers. It can be employed via a high-level Python programming interface, enabling users to simulate photonic quantum computing efficiently. Piquasso targets numerous applications, ranging from boson sampling to photonic quantum machine learning. The latter is underpinned by the built-in support for ubiquitous machine learning frameworks such as TensorFlow and JAX. Moreover, XLA compilation is also supported, granting a significant speedup.
At the intersection of two rapidly growing fields, generative quantum machine learning research is attracting significant interest. As the field is still in its early days, the proposed algorithms mostly rely on generic quantum models that, while very powerful, face several challenges. This motivates the design of problem-informed models, making assumptions about the data that can be encoded into the quantum circuit. Probabilistic graphical models provide a general mathematical framework to represent structure in learning problems involving random variables. In this work, we introduce a problem-informed quantum model that leverages the Markov network structure of the underlying problem. We further demonstrate the applicability of the Markov network framework in the construction of generative learning benchmarks and compare the performance of our model to previous designs. Finally, we make a distinction between quantum advantage in learning and sampling tasks, and discuss the potential of our model to demonstrate improvement over classical methods in sampling efficiency.
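For context, a Markov network over random variables $X_1,\dots,X_n$ factorizes the joint distribution over the cliques of its graph (a standard definition, recalled here for readability; the notation is ours):
$$ p(x_1,\dots,x_n) \;=\; \frac{1}{Z} \prod_{C \in \mathcal{C}} \phi_C(x_C), $$
where the $\phi_C$ are non-negative potential functions on the cliques $\mathcal{C}$ and $Z$ is the normalizing constant. The problem-informed circuits discussed in the talk are built around this kind of clique structure.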
In recent years, quantum machine learning (QML) has emerged as a rapidly expanding field within quantum algorithms and applications. Current noisy quantum devices enable small-scale experiments on existing quantum hardware, while increasingly powerful classical hardware allows for the simulation of quantum algorithms and the execution of robust classical AI applications.
Recently, hybrid quantum-classical approaches have been explored to utilize both high-performance classical and noisy quantum computing resources. Among these hybrid methods, variational quantum algorithms are the most extensively studied.
We examine the application of quantum reinforcement learning (QRL) in a hybrid quantum-classical system, where the quantum agent is represented by a parametric quantum circuit (PQC), optimized via gradient descent by a classical optimizer. However, many reinforcement learning environments possess high-dimensional observation spaces, such as visual observations, with feature vectors in the thousands. This makes it infeasible to encode these feature vectors into currently available quantum devices, which typically have only a few dozen noisy qubits.
To address these limitations, we investigate the use of classical autoencoders (AEs) to reduce the dimensionality of the original feature spaces and encode the latent feature variables into quantum states. While a similar approach has been tested in a fully classical scenario, to our knowledge it has not yet been applied to quantum agents.
We simulate these experiments using state-of-the-art quantum simulators and optimization frameworks. Our preliminary results indicate that this hybrid approach enables the use of quantum agents in reinforcement learning environments, which would not be feasible without the application of autoencoders.
TBF
To Be Filled
The investigation of the dynamics of a network's stability is considered an important research area, whether it concerns neural networks, power supply networks, or communication and social networks. These networks are usually large graphs, ranging from 10,000 nodes (power grids) to 1 million nodes (connectome). Solving the second-order Kuramoto equations that describe such systems efficiently is usually done on GPUs, making use of the power of parallelization.
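For reference, the second-order Kuramoto (swing) equation mentioned above is typically written as (a standard formulation; the notation is ours, not necessarily that of the cited works):
$$ \ddot{\theta}_i \;=\; -\alpha\,\dot{\theta}_i \;+\; P_i \;+\; K \sum_{j} A_{ij}\,\sin(\theta_j - \theta_i), $$
where $\theta_i$ is the phase at node $i$, $\alpha$ is the damping, $P_i$ is the power injected or consumed at node $i$, $K$ is the global coupling strength, and $A_{ij}$ is the (possibly weighted) adjacency matrix of the grid graph. Each GPU thread can then integrate one or more nodes, with the coupling sum evaluated from the sparse adjacency structure.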
In the case of power supply networks and power grids, dynamic instabilities can lead to cascade failures, where a fault in one component or subsystem triggers a chain reaction of faults in other components or subsystems, resulting in widespread power outages or disruptions. Controlling these failures is crucial from economic, sustainability, and national security perspectives. Our study investigated high-voltage power grids from multiple perspectives. Firstly, the properties of cascade dynamics are examined: after the system thermalizes, we allow the overload of a single line, which leads to its disconnection, and we track subsequent overloads against threshold values. The probability distributions of cascade failures exhibit power-law tails near the synchronization point [1].
We also analyze the weighted European power grids of 2016 and 2022 and identify their sensitive graph elements using two different methods: we determine bridges between connected communities and highlight "weak" nodes that show the smallest local synchronization of the "swing equation", and we strengthen them by creating structures that increase the graph's robustness. We compare the results with network variations where bridges are removed, similar to isolation, and with network variations where edges are randomly added or removed at random locations [2, 3], and we show that random augmentations of the network can lead to the Braess paradox [4].
Lastly, we perform an analysis of the dynamics on the community level, where we discover chimera-like behavior. We try to explain the observed dynamics with spectral analysis and cycle detection.
[1] Ódor, G., Deng, S., Hartmann, B. & Kelling, J. Synchronization dynamics on power grids in Europe and the United States. Phys. Rev. E 106, 034311 (2022). URL https://link.aps.org/doi/10.1103/PhysRevE.106.034311.
[2] Hartmann, B. et al. Dynamical heterogeneity and universality of power-grids (2023). arXiv:2308.15326.
[3] Ódor, G., Papp, I., Benedek, K. & Hartmann, B. Improving power-grid systems via topological changes or how self-organized criticality can help power grids. Phys. Rev. Res. 6, 013194 (2024). URL https://link.aps.org/doi/10.1103/PhysRevResearch.6.013194.
[4] Braess, D., Nagurney, A. & Wakolbinger, T. On a paradox of traffic planning. Transportation Science 39, 446-450 (2005).
In quantum mechanics, the wave function describes the state of a physical system. In the non-relativistic case, the time evolution of the wave function is described by the time-dependent Schrödinger equation. In 1983, D. Kosloff and R. Kosloff proposed a method [1] to solve the time-dependent Schrödinger equation efficiently using Fourier transformation. In 2020, Géza István Márk published a paper [2] describing a computer program for the interactive solution of the time-dependent and stationary two-dimensional (2D) Schrödinger equation. Some details of quantum phenomena are only observable by calculating with all three spatial dimensions. We therefore found it worthwhile to step out of the two-dimensional plane and investigate these phenomena in three dimensions. We implemented this method for the three-dimensional case to simulate the time evolution of the wave function and used our implementation to simulate typical quantum phenomena using wave packet dynamics. First, we tried the method on analytically describable cases, such as the simulation of the double-slit experiment, and then we investigated the operation of flash memory. We used raytraced volumetric visualization to render the resulting probability density. In our work, we introduce the basics of wave packet dynamics in quantum mechanics, describe the method used in detail, and showcase our simulation results.
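For context, the Fourier method referenced above propagates the wave function with a split-operator step that alternates between position and momentum space (a standard textbook form, not taken verbatim from [1] or [2]):
$$ i\hbar\,\frac{\partial \psi}{\partial t} = \Big[-\frac{\hbar^2}{2m}\nabla^2 + V(\mathbf{r})\Big]\psi, \qquad \psi(t+\Delta t) \approx e^{-\frac{i}{\hbar}V\frac{\Delta t}{2}}\;\mathcal{F}^{-1}\,e^{-\frac{i\hbar k^2 \Delta t}{2m}}\,\mathcal{F}\;e^{-\frac{i}{\hbar}V\frac{\Delta t}{2}}\,\psi(t), $$
where $\mathcal{F}$ denotes the (three-dimensional) fast Fourier transform and $k$ is the magnitude of the wave vector. In three dimensions, each time step therefore amounts to two 3D FFTs plus pointwise multiplications, operations that parallelize well.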
For further information and animations, please visit
https://zoltansimon.info/src/content/research/wavepacketsim.html
References
[1] Kosloff, D., & Kosloff, R. (1983). A Fourier method solution for the time dependent Schrödinger equation as a tool in molecular dynamics. Journal of Computational Physics, 52(1), 35-53. https://doi.org/10.1016/0021-9991(83)90015-3
[2] Márk, G. I. (2020). Web-Schrödinger: Program for the interactive solution of the time dependent and stationary two-dimensional (2D) Schrödinger equation. https://doi.org/10.48550/arXiv.2004.10046
The dawn of explicit APIs, and particularly the introduction of Vulkan®, transformed the way we interact with GPU hardware. Despite the steep learning curve, the success and fast evolution of the Vulkan API show that there is room for such a new programming model in the industry, and we expect this model to gain even wider adoption in the API landscape going forward. The goal of our presentation is to provide a glimpse into the rationale behind and the benefits of an explicit API compared to the traditional ones that we are all too familiar with.
We present a novel system that is capable of recording data from vehicle-mounted sensors. The system is very flexible: digital cameras, LiDAR devices, and a GPS receiver are used at the current stage of the project, but new sensors can be added to the setup.
A data visualization system that can cooperate with the recording system has also been completed; its most interesting algorithmic details are also presented here. The Graphics Processing Unit (GPU) is employed for the essential tasks of data processing and visualization, both of which are integral to the system's real-time operation. A significant proportion of the system's features have been implemented in shaders. Both the recorded data and the real-time system are available to the public; therefore, this tool can be freely used by the vision community.
The GigaBit Transceiver (GBT) and the low power GBT (lpGBT) link architecture and protocol have been developed by CERN for physical experiments in the Large Hadron Collider as a radiation-hard optical link between the detectors and the data processing FPGAs (https://gitlab.cern.ch/gbt-fpga/). This presentation shows the details of how to implement a large array of GBT/lpGBT links (up to 48 x 4.8/5.12/10.24 Gbps) on a custom data acquisition board based on the Intel Arria 10 FPGA.
Many engineering applications involve the global dynamical analysis of nonlinear systems to explore their fixed points, periodic orbits, or chaotic behavior. The Simple Cell Mapping (SCM) algorithm is a tool for global dynamical analysis relying on the discretization of the state space, resulting in a finite set of cells corresponding to the possible states of the system and a discrete mapping representing the state transitions.
The computational cost of cell mapping-based methods can increase significantly in cases where a higher resolution is required due to the complex dynamical behavior of the investigated system, or when a higher-dimensional system is considered.
This talk presents a possible solution for accelerating the SCM algorithm using parallel computing techniques. We present the challenges of a GPU-based implementation and other potential alternatives for efficient global dynamical analysis.
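As an illustration of the parallelization pattern only (not the authors' implementation), a minimal CUDA sketch of the cell-mapping step could assign one thread per cell, integrate an example dynamics from the cell center over one mapping period, and record the index of the image cell; the grid bounds, the toy dynamics, and all names below are hypothetical:

```cuda
#include <cuda_runtime.h>

// Hypothetical discretization of a 2D state space into NX x NY cells.
#define NX 512
#define NY 512
#define XMIN (-2.0f)
#define XMAX ( 2.0f)
#define YMIN (-2.0f)
#define YMAX ( 2.0f)

// Example dynamics (a damped oscillator), integrated with explicit Euler steps
// over one mapping period; a real application would plug in its own system.
__device__ void integrate(float &x, float &y)
{
    const float dt = 1e-3f;
    for (int i = 0; i < 1000; ++i) {
        float dx = y;
        float dy = -0.1f * y - x;
        x += dt * dx;
        y += dt * dy;
    }
}

// One thread per cell: map the cell center forward and store the image cell index.
__global__ void scmStep(int *imageCell)
{
    int cx = blockIdx.x * blockDim.x + threadIdx.x;
    int cy = blockIdx.y * blockDim.y + threadIdx.y;
    if (cx >= NX || cy >= NY) return;

    const float hx = (XMAX - XMIN) / NX, hy = (YMAX - YMIN) / NY;
    float x = XMIN + (cx + 0.5f) * hx;
    float y = YMIN + (cy + 0.5f) * hy;

    integrate(x, y);

    int ix = (int)floorf((x - XMIN) / hx);
    int iy = (int)floorf((y - YMIN) / hy);
    bool inside = (ix >= 0 && ix < NX && iy >= 0 && iy < NY);
    imageCell[cy * NX + cx] = inside ? (iy * NX + ix) : -1;  // -1 marks the sink cell
}
```

The resulting image array defines the discrete cell-to-cell mapping from which fixed points, periodic groups, and domains of attraction can then be extracted.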
Supported by the ÚNKP-23-3-1-BME-63 New National Excellence Program of the Ministry for Culture and Innovation from the source of the National Research, Development and Innovation Fund.
Extremal combinatorial structures bear fundamental relevance in coding theory and various other applications. Their study involves solving hard computational problems. Many of these are good candidates to be solved on Ising-based quantum annealers due to their limited size. Compared to other common benchmark problem classes, these are not based on pseudorandomness, and most primal solution improvements can contribute to several disciplines, including mathematics, combinatorics, and physics.
We consider mixed Hamming packings: maximal subsets of a given Hamming space over a mixed alphabet such that every selected pair of codewords is at least a fixed minimum distance apart, studied using mixed integer programming models. There are many known and efficient approaches to the problem in the literature, including clique reformulations, specific exhaustive search algorithms, etc. Our contribution is based on mixed integer programming, which has not been used frequently before, as it was not competitive with other methods. We overcome this issue by introducing a reduction technique and further constraints based on structural properties. In particular, we introduce a presolving column reduction technique based on our idea of adopting the notion of contact graphs, motivated by the Tammes problem.
We prove that for every mixed Hamming space over alphabets of minimal cardinality 3, there is a maximal Hamming packing with a connected contact graph of minimum vertex degree at least 2. As a corollary of this lemma, we can introduce a new type of linear constraint for the MILP model, which significantly simplifies the solution. We present computational results for a number of particular instances: in the case of binary-ternary problems we obtain the best known results in competitive computational time. Preliminary results calculated using quantum annealers will also be presented.
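For concreteness, the basic (pre-reduction) packing model underlying such approaches can be stated as a binary program; this is the generic formulation, not the authors' reduced model:
$$ \max \sum_{c \in \mathcal{C}} x_c \quad \text{s.t.} \quad x_c + x_{c'} \le 1 \;\; \text{for all } c \ne c' \text{ with } d_H(c,c') < d_{\min}, \qquad x_c \in \{0,1\}, $$
where $\mathcal{C}$ is the full mixed Hamming space, $d_H$ is the Hamming distance, and $x_c = 1$ indicates that codeword $c$ is selected for the packing. The presolving and contact-graph constraints described above serve to shrink and strengthen this model.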
We present the results of a novel type of numerical simulation that realizes a rotating universe with a shear-free, rigid body rotation in a Gödel-like metric. We run cosmological simulations with various degrees of rotation and compare the results to the analytical expectations of the Einstein--de Sitter and the $\Lambda$CDM cosmologies. To achieve this, we use the StePS N-body code that is capable of simulating the entire infinite universe, overcoming the technical obstacles of classical toroidal (periodic cubical) topologies that would otherwise prevent us from running such simulations. Results show a clear anisotropy between the polar and equatorial expansion rates, with up to $2.5~\%$ deviation from the isotropic case, a considerable effect in the era of precision cosmology.
In many image processing problems, we need to process polygons, which usually involves rasterization. Also, in many such problems we need to compute reductions over images, such as the average of intensities or other metrics. In some cases, a combination of the two computations is desired: we need to use the area of polygons to restrict which pixels should contribute to a reduction operation. Doing this in two steps, first computing a mask and then using it in a reduction, is expected to be inefficient. We look into the problem of how to efficiently perform these two operations in a single step on the GPU, exploiting parallelism on multiple levels.
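As a hedged sketch of the single-pass idea (illustrative only, not the presented implementation), one CUDA kernel can combine a point-in-polygon test with a warp-level reduction; the ray-crossing test and all names below are our own assumptions:

```cuda
#include <cuda_runtime.h>

// Ray-crossing (even-odd) test: is pixel center (px, py) inside the polygon?
__device__ bool pointInPolygon(float px, float py, const float2 *poly, int nVerts)
{
    bool inside = false;
    for (int i = 0, j = nVerts - 1; i < nVerts; j = i++) {
        bool crosses = ((poly[i].y > py) != (poly[j].y > py)) &&
                       (px < (poly[j].x - poly[i].x) * (py - poly[i].y) /
                             (poly[j].y - poly[i].y) + poly[i].x);
        if (crosses) inside = !inside;
    }
    return inside;
}

// Fused mask + reduction: sum the intensities of pixels covered by the polygon.
// Assumes blockDim.x * blockDim.y is a multiple of 32 (e.g. a 16x16 block).
__global__ void polygonMaskedSum(const float *image, int width, int height,
                                 const float2 *poly, int nVerts,
                                 float *sum, unsigned long long *count)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    float v = 0.0f;   // contribution of this thread (0 if outside image or polygon)
    int   c = 0;
    if (x < width && y < height &&
        pointInPolygon(x + 0.5f, y + 0.5f, poly, nVerts)) {
        v = image[y * width + x];
        c = 1;
    }

    // Warp-level reduction, then one atomic per warp.
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffff, v, offset);
        c += __shfl_down_sync(0xffffffff, c, offset);
    }
    if ((threadIdx.y * blockDim.x + threadIdx.x) % 32 == 0) {
        atomicAdd(sum, v);
        atomicAdd(count, (unsigned long long)c);
    }
}
```

The masked average is then simply sum/count on the host, without ever materializing a mask image.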
The tensor core is a hardware unit in most modern GPUs, first introduced in the NVIDIA Volta architecture. Similarly to the well-known CUDA core (Streaming Processor, SP), the tensor core is also a computing unit of the Streaming Multiprocessor (SM), but the input data to the tensor cores are a set of matrices rather than the single values processed by the CUDA cores. Each tensor core provides a 4x4x4 matrix processing array that performs the operation D = A * B + C, where A, B, C, and D are 4×4 matrices. Each tensor core can perform 64 floating-point FMA operations per clock cycle, 64 times more than a traditional CUDA core. Therefore, for implementing algorithms involving many matrix multiplication and addition operations, tensor cores can provide multiple times the performance of the CUDA cores.
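To make the programming model concrete, here is a minimal sketch using the public CUDA WMMA API; note that the warp-level fragments exposed by CUDA are 16x16x16 tiles built on top of the 4x4x4 hardware operation described above:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of D = A * B + C on tensor cores.
// A and B are half precision, the accumulator is single precision.
__global__ void wmmaTile(const half *A, const half *B, const float *C, float *D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::load_matrix_sync(aFrag, A, 16);                    // leading dimension 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::load_matrix_sync(accFrag, C, 16, wmma::mem_row_major);

    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);          // tile of D = A*B + C

    wmma::store_matrix_sync(D, accFrag, 16, wmma::mem_row_major);
}

// Launch one warp per tile, e.g.: wmmaTile<<<1, 32>>>(dA, dB, dC, dD);
```

Larger matrix products are built by tiling this warp-level operation across blocks and accumulating over the K dimension.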
Independent Component Analysis (ICA) is a fundamental tool in EEG data analysis. Its most frequent use is for separating unwanted noise artifacts from neural signals. Despite its advantages and importance, using ICA in the data analysis pipeline is problematic because the algorithm is very compute-intensive. ICA is essentially a statistical signal unmixing method (a forward-backward propagation algorithm) that performs multiple iterations to update the unmixing/mixing matrix to achieve maximum statistical independence among the components, where each iteration/propagation involves numerous matrix multiplication operations. Even though GPUs seem to be ideal for speeding up ICA computation, the only known GPU-based implementation is CUDA-ICA, proposed by Raimondo et al. in 2011 using traditional GPU cores. The introduction of tensor cores could bring new opportunities for further accelerating ICA computation.
In this talk, we will outline the implementation strategy and details of using the tensor core for accelerating the Infomax ICA algorithm. First, we will give an introduction to the tensor core programming paradigm in CUDA, including its abstraction and hierarchy in software and hardware. Then, we study the adaptation of the ICA algorithm to tensor cores. We will analyse the potential steps/parts that can be accelerated by the tensor core and outline the implementation strategy. Finally, we will present our development workflow on the Komondor supercomputer and preliminary results of the implementation.
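For reference, the natural-gradient Infomax update that dominates the computational cost has the standard form (the talk's exact variant may differ):
$$ \mathbf{u} = W\mathbf{x}, \qquad \Delta W = \eta\,\big(I + (\mathbf{1} - 2\,g(\mathbf{u}))\,\mathbf{u}^{\top}\big)\,W, $$
where $W$ is the unmixing matrix, $g$ is the logistic nonlinearity applied elementwise, and $\eta$ is the learning rate; the repeated matrix products over data blocks are the natural target for tensor-core acceleration.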
We would like to present a short introduction to the ALICE Analysis Facility and WSCLAB (Wigner Scientific Computing Laboratory) projects in our data center and show some key operational and visibility details of their monitoring. Hardware components are aging, so monitoring is an important method to keep the infrastructure healthy and to prolong cluster lifetime.
We created server types (worker node, storage) and defined entities in our monitoring system. Some monitoring checks are basic, others are advanced, and some are even more complex, to make sure we know the most important details in almost real time.
For power consumption, we use a visualization solution that presents power usage statistics per rack.
The Ansible automation tool was used to scale up the monitoring system.
Historical data is also very valuable, so we integrated a database solution (InfluxDB) into our monitoring workflow.
Current milestones and roadmap for monitoring: continuous disk tests (S.M.A.R.T.), smart alerting for complex cases, scheduled backup for monitoring data, proper alerting based on pre-defined warning and critical levels, iterative time-based optimization for running checks, HTCondor service monitoring.
In this talk, I will give a technical overview of the GPU partition of the Komondor supercomputer (No. 300 on the TOP500 list). First, some job statistics will be presented, then the system and node architecture will be described in detail: the CPU and GPU architecture details, the intra- and inter-node interconnects, their key properties, and their performance implications. We will follow with an introduction of the software environment and the module architecture, its properties and configurations for multi-GPU development and execution. We will overview the different MPI implementations available on the system and their behaviour in multi-GPU programs. Finally, the GPU job scheduling mechanism used by the SLURM scheduler will be discussed, with examples of GPU job submission scripts. The key theme throughout the talk is computational performance and scalability, which will appear in the hardware, software, and development sections of the talk.
High-density EEG processing is a time-consuming task due to the large number of electrodes, high sampling rates, and the computational cost of the pre-processing algorithms. Typical pre-processing steps include high-pass and low-pass filtering, line noise removal, detection and interpolation of bad channels, power spectral density calculation, and time-frequency analysis. This talk will present the design and implementation of the multi-GPU convolution operation and the continuous wavelet transform for multi-channel EEG analysis. We will highlight the parallel design strategies and stages of the algorithm using the CUDA programming model, first to create a single-GPU implementation and then to extend it to a multi-node version using data communication schemes based on the Message Passing Interface. In conclusion, multi-GPU runtime measurement results will be presented alongside the performance and scalability of the implementation.
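As a rough, hedged sketch of the distribution strategy (illustrative only, not the presented implementation), EEG channels can be split across MPI ranks, with each rank filtering its share on its local GPU; the naive FIR kernel and every name below are hypothetical:

```cuda
#include <mpi.h>
#include <vector>
#include <cuda_runtime.h>

// Naive 1D FIR filter: one thread per output sample of one channel.
__global__ void firFilter(const float *in, float *out,
                          const float *taps, int nTaps, int nSamples)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nSamples) return;
    float acc = 0.0f;
    for (int k = 0; k < nTaps; ++k)
        if (i - k >= 0) acc += taps[k] * in[i - k];
    out[i] = acc;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nChannels = 256, nSamples = 1 << 18, nTaps = 101;
    const int myChannels = nChannels / size;           // channels handled by this rank

    // ... load this rank's channels and the filter taps into host buffers ...
    std::vector<float> host(myChannels * nSamples, 0.0f), taps(nTaps, 1.0f / nTaps);

    float *dIn, *dOut, *dTaps;
    cudaMalloc(&dIn,  myChannels * nSamples * sizeof(float));
    cudaMalloc(&dOut, myChannels * nSamples * sizeof(float));
    cudaMalloc(&dTaps, nTaps * sizeof(float));
    cudaMemcpy(dIn, host.data(), myChannels * nSamples * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dTaps, taps.data(), nTaps * sizeof(float), cudaMemcpyHostToDevice);

    // Channels are independent, so no inter-rank traffic is needed during filtering.
    for (int c = 0; c < myChannels; ++c)
        firFilter<<<(nSamples + 255) / 256, 256>>>(dIn + c * nSamples,
                                                   dOut + c * nSamples,
                                                   dTaps, nTaps, nSamples);
    cudaDeviceSynchronize();

    // ... copy results back and gather/reduce across ranks as needed (e.g. MPI_Gather) ...
    cudaFree(dIn); cudaFree(dOut); cudaFree(dTaps);
    MPI_Finalize();
    return 0;
}
```

In a production pipeline the per-channel loop would be replaced by batched or FFT-based convolution and the results gathered with MPI collectives, but the partitioning idea is the same.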
The presentation gives a short overview of the newly established partnership programmes introduced by the EU and also discusses the objectives of these programmes. The first partnership was called to action in September 2018: this is the EuroHPC Joint Undertaking, funded by the Commission, the Participating States (35 altogether, from the EU and outside of the EU), and a few organizations of private companies.
The EuroHPC JU has been successful in its operation; in just five years, the EU has become the region with the most developed supercomputing infrastructure and ecosystem in the world. Hungarian users and R&D communities now have direct access to the EuroHPC network of supercomputers.
Hungary has also launched an ambitious HPC development program. The first phase of this program resulted in the “Komondor” HPC with a capacity of 5 PF. It is installed at the Debrecen University Data Center and operated by KIFÜ, the Hungarian Governmental IT Development Agency. HPC development in Hungary is ongoing: with the support of the EuroHPC JU, the next, more powerful HPC project, the 20 PF “Levente” system, has been launched. In the second part of the paper, the “Levente” system and its development are briefly discussed.
Large language models have changed the way we think about language and are fueling the next industrial revolution. However, during their evolution, the focus quickly shifted from language to data. In this talk, I will briefly summarise how this change has impacted linguistics, linguists, and other representatives of related fields in the humanities. Theoretically, the number of word forms per language is not that high, yet dealing with them in their raw form has its challenges. Moreover, in the field of natural language processing and corpus linguistics, the noise picked up when collecting real-life texts into corpora also needs to be managed, not to mention multilingual environments and non-standard language use. Since the embedding and vector representation of words are still inexplicable within the accepted linguistic frameworks, linguists have been averse to these methods. On the other hand, other areas of the humanities have begun to benefit from LLMs, as some of the necessary text processing methods that had previously not performed well enough have improved significantly with the advent of even bigger and smarter language models. This has led to a decline in interest in some widely used standard tasks related to linguistics and will presumably force even the most conservative humanities disciplines to revise their practices. I will give some examples of how this process is going and what problems have arisen along the way.
Large Language Models (LLMs) have been with us for a few years now. Their generalization capabilities are outstanding due to their sheer size; however, they still lack the benefits of information processing grounded in multimodality. In this review, we explore how early forms of this grounding could be achieved by constructing Large World Models (LWMs). We formulate method-agnostic general templates for both LLMs and LWMs and highlight several alternatives and distinct methods that were designed for these large, general models but could be applied to Deep Learning research in other areas as well.
GPUs are increasingly common in scientific high-performance computing; however, their benefits are not uniform across all areas of scientific computing. In certain fields, such as sonochemistry, where delay differential equations can arise, large amounts of data must be accessed based on the current state of the simulation in an unaligned and uncoalesced manner. This usually hinders the applicability of GPUs. The goal is to leverage the advantages of both GPU and CPU architectures by overlapping parallel and serial computations, as well as memory copy and write instructions, in the solution of an ensemble of ordinary differential equations (ODEs). This approach could make certain computing tasks faster and more efficient.
The general idea of a heterogeneous CPU-GPU differential equation solver involves deploying four asynchronous CUDA streams and partitioning the workload into four equal parts, with each stream managing one quarter of the workload. Each stream consists of four stages. In the first stage, data is transferred from the CPU to the GPU; in the second stage, a kernel is executed on the GPU to compute a single Runge-Kutta step. Following this, data is transferred from the GPU to the CPU in the third stage, and finally, in the fourth stage, serial calculations are carried out on the CPU. These serial calculations could involve predicting, accepting, or rejecting the time step, or computing the delayed terms in a delay differential equation. Ideally, these four stages can be overlapped to optimize performance.
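A minimal sketch of the four-stream pipeline described above could look as follows (illustrative only; the toy Runge-Kutta kernel, the host-side step control, and all names are placeholders, and the host buffer is assumed to be pinned so that the copies are truly asynchronous):

```cuda
#include <cuda_runtime.h>

// Placeholder GPU stage: one explicit step per system (the real RK kernel goes here).
__global__ void rungeKuttaStep(double *state, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += 1e-3 * (-state[i]);   // toy dynamics dx/dt = -x
}

// Placeholder for the serial host work: step-size control, delayed terms, etc.
void hostStepControl(double *state, int n) { (void)state; (void)n; }

void heterogeneousSolve(double *hStates /* pinned host memory */, double *dStates,
                        int nSystems)
{
    const int nStreams = 4;
    const int chunk = nSystems / nStreams;
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        double *h = hStates + s * chunk;
        double *d = dStates + s * chunk;

        // Stage 1: host -> device copy of this quarter of the ensemble
        cudaMemcpyAsync(d, h, chunk * sizeof(double), cudaMemcpyHostToDevice, streams[s]);
        // Stage 2: one Runge-Kutta step on the GPU
        rungeKuttaStep<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d, chunk);
        // Stage 3: device -> host copy of the updated states
        cudaMemcpyAsync(h, d, chunk * sizeof(double), cudaMemcpyDeviceToHost, streams[s]);
    }

    // Stage 4: serial CPU work per chunk, started as soon as its stream finishes,
    // overlapping with the copies and kernels still running in the other streams.
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        hostStepControl(hStates + s * chunk, chunk);
    }

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
}
```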
Parameter sensitivity studies were conducted on a first-order Bernoulli-type ODE and the Duffing equation (a second-order ODE) using both a homogeneous GPU solver and the heterogeneous CPU-GPU approach described earlier. GPU profiling was employed to show that an overlap of the four stages is possible. The results demonstrate that, while overlapping CPU, GPU, and memory copy operations using CUDA streams is feasible, achieving an ideal overlap is only possible under specific conditions. For the simple problems investigated, the pure GPU approach proves to be the most effective solution. However, in the future, adaptive delay differential equation solvers may benefit from employing such a heterogeneous approach.
We consider generalization bounds for two types of neural structures: feedforward Rectified Linear Unit (ReLU) networks, and special types of neural Ordinary Differential Equations (ODEs) and State Space Models (SSMs). Calculating the Rademacher complexity of both models involves computationally expensive norm calculations; therefore, we propose techniques to compute them efficiently.
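For reference, the empirical Rademacher complexity that has to be bounded is (standard definition):
$$ \widehat{\mathcal{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\left[\,\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\, f(x_i)\right], $$
where $S=(x_1,\dots,x_n)$ is the sample and the $\sigma_i$ are independent uniform $\pm 1$ variables; for the network classes above, such bounds typically involve products of layer norms, which is where the expensive norm computations arise.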
One of the most successful treatments in cancer therapy is proton therapy, with radiation planning being a key element. Photon CT is commonly used for this purpose; however, it does not provide sufficiently accurate information about the range of protons. Therefore, proton CT imaging is more favorable for radiation planning. Due to the Coulomb scattering of protons, it is important to calculate the Relative Stopping Power at the voxel level (thus, appropriate handling of trajectories is also required), for which several algorithms have been developed. The aim of my research is to test, further develop, and optimize a software package using the Richardson-Lucy algorithm developed in the Bergen Proton-CT Collaboration.
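For reference, the Richardson-Lucy iteration at the core of the reconstruction has the standard multiplicative form (generic statement, independent of the Bergen implementation details):
$$ x_j^{(k+1)} = \frac{x_j^{(k)}}{\sum_i a_{ij}} \sum_i a_{ij}\,\frac{b_i}{\big(A x^{(k)}\big)_i}, $$
where $b$ is the measured projection data, $A=(a_{ij})$ is the system matrix relating voxels to proton paths, and $x^{(k)}$ is the current estimate of the voxel-level Relative Stopping Power map.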
The simulations necessary for the research were performed using the Geant4 and Gate software. I optimized the framework using the Richardson-Lucy algorithm with appropriate methods for faster and more efficient operation. I verified the operation of the algorithm and image reconstruction on phantoms developed to measure the performance of medical imaging systems at different energies.
During my work, I managed to optimize the algorithm, significantly reducing the runtime. Based on the evaluation of phantom reconstruction, I found that the algorithm operates with the desired accuracy.
Among my long-term goals are further optimization and achieving clinical usability (including further reducing runtime).
Hadron therapy is a form of cancer therapy in which we aim to destroy cancerous cells that are hard to reach with surgery. Since this kind of approach differs from classical gamma radiation therapy, the tomography methods used there are not sufficient for hadron therapy. Proton Computed Tomography (pCT) is being developed to achieve more accurate results for this kind of treatment. During pCT, a detector system measures incoming particles that are scattered on a phantom. My research aims to reconstruct the trajectories of the particles that pass through the detector system. For this purpose, I used machine learning algorithms to achieve high accuracy without excessive computational time.
Small square grids stored as bit matrices representing occupied sites, together with some neighborhood definition such as the von Neumann, Moore, or hexagonal neighborhood, arise in various Monte Carlo simulations and connection games. High-speed testing of a very large number of such grids for connection between the opposite edges of the grid under the given neighborhood is often required. Because of the relatively small number of registers in CPUs, the bit matrices are usually stored in memory, reducing the speed. In contrast, as CUDA exposes a large number of processors, each with many more registers, it offers a much higher theoretical speed if the memory bottleneck can be avoided. In our approach, the bit matrices (e.g., 32x32 bits) are fully stored and processed in registers, so the use of shared, local, or global memory is eliminated, except for the initial loading of data and storing the results; thus, a high processing speed is achieved with all shader cores working while thread divergence is kept low.
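A hedged sketch of the register-resident test (illustrative only; von Neumann neighborhood, one 32x32 grid per thread, all names ours):

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// One thread tests one 32x32 grid for top-to-bottom connectivity.
// Each of the 32 rows is a 32-bit word held in registers; occupied sites are 1 bits.
__global__ void testConnectivity(const uint32_t *grids, int nGrids, uint8_t *connected)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= nGrids) return;

    uint32_t occ[32], front[32];
    #pragma unroll
    for (int r = 0; r < 32; ++r) {
        occ[r]   = grids[g * 32 + r];         // single load from global memory
        front[r] = 0u;
    }
    front[0] = occ[0];                         // seed the flood fill from the top edge

    // Iterated dilation restricted to occupied sites (von Neumann neighborhood),
    // repeated until the filled front stops growing.
    for (bool changed = true; changed; ) {
        changed = false;
        #pragma unroll
        for (int r = 0; r < 32; ++r) {
            uint32_t grow = front[r] | (front[r] << 1) | (front[r] >> 1);
            if (r > 0)  grow |= front[r - 1];
            if (r < 31) grow |= front[r + 1];
            grow &= occ[r];
            if (grow != front[r]) { front[r] = grow; changed = true; }
        }
    }
    connected[g] = (front[31] != 0u);          // did the fill reach the bottom edge?
}
```

Provided the compiler fully unrolls the inner loops (hence the #pragma unroll), `occ` and `front` live entirely in registers after the initial load, so the only memory traffic is one read of the grid and one write of the verdict, which is the point of the approach described above.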
Recent advances in laser technology and nanoplasmonics, combined with heavy-ion collisions, lead to a new, previously untamed road towards fusion energy production research. Here we explore recent advances in the theoretical and simulation part of the NAno-Plasmonic Laser Inertial confinement Fusion Experiment (NAPLIFE). We study how gold nanoparticle doping enhances medium absorption under infrared laser pulses at intensities between $10^{15}$ and $10^{18}$ W/cm$^2$. Contrary to tradition, we simulate the nanoparticles with the particle-in-cell method; this also makes it possible to investigate effects that cannot be considered with common methods, such as electron spillout. We compare various shapes and sizes of dopants, focusing on the ejection dynamics of conducting electrons, and track the interaction between the laser radiation and the doped matter, monitoring ionization products and their energies, as well as field intensities around the resonating nanoantennas.