GPU Day 2022

Name: GPU Day 2022
Start: 2022-06-20T08:20:00+02:00
End: 2022-06-21T17:50:00+02:00
Location: Hotel Mercure Budapest Castle Hill

20 Jun 2022, 08:20 → 21 Jun 2022, 17:50 Europe/Budapest

Hotel Mercure Budapest Castle Hill

1013 Budapest, Krisztina Körút 41-43

Description

The 12th GPU Day will organized by the Wigner Scientific Computation Laboratory from June 20-21, 2022 at the Mercure Budapest Castle Hill hotel.

The GPU Day is a annualy organized international conference series dedicated to massively parallel technologies in scientific and industrial applications since a decade. It’s goal is to bring together from academia, developers from industry and interested students to exchange experiences and learn about future massively parallel technologies. The event provides a unique opportunity to exchange knowledge and expertise on GPU (Graphical Processing Unit), FPGA (Field-Programmable Gate Array) and quantum simulations. As in previous years, we are expecting 80+ participants in person and online as well.

For the earlier events see: 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014

Confirmed keynote speakers:The conference will be held in hybrid from, therefore it is possible to participate in online form and there are limited places to participate personally on-site. We encourage our former, current and future partners to contribute on the conference. Contributions to the conference are welcome.

Talk submission deadline: 2022 June 14.

More information is available on gpuday.com

Useful pieces of information for sponsors.

Organizers:
Gergely Gábor Barnaföldi
Gábor Bíró
Balázs Kacskovics
Balázs Endre Szigeti

Participants

68 View full list

Monday 20 June
- Mon 20 Jun
- Tue 21 Jun
- 09:00 → 09:20
  
  Opening: Opening Talk and Welcome by the Director
  
  Conveners: Gergely Barnafoldi (Wigner RCP RMI of the Hungarian Academy of Sciences), Peter Levai (WIGNER RCP)
  
  GGBarnafoldi_WSCLAB_GPUDAY_202206190v1.pdf
- 09:20 → 10:40
  Session I
  - 09:20
    
    Accelerating massively parallel .NET code using FPGAs with Hastlayer 40m
    
    Hastlayer (https://hastlayer.com/) by Lombiq Technologies is a .NET software developer-focused, easy-to-use high-level synthesis tool with the aim of accelerating massively parallel applications. It converts standard .NET Common Intermediate Language (CIL) bytecode into equivalent Very High Speed Integrated Circuit Hardware Description Language (VHDL) constructs which can be implemented in hardware using FPGAs. Cloud-available FPGA platforms are supported for high-performance use-cases, as well as the Zynq 7000 family of FPGA SoC devices for low-power embedded systems like drones and nanosatellites. In this talk, we'll introduce Hastlayer and how it can be used, our results showing up to 2 orders of magnitude speed and power efficiency increases, as well as the collaboration partners we seek from academia and other industry players.
    
    Speaker: Zoltán Lehóczky (Lombiq Technologies Ltd.)
    
    Accelerating massively parallel .NET code using FPGAs with Hastlayer.pptx
  - 10:00
    
    OpenCL Ecosystem Updates 40m
    
    OpenCL has gained a name for being one of the most portable programming models for heterogenous programming. It’s 3.0 release improves upon deployment flexibility and most major vendors already ship 3.0 compliant runtimes. Stream HPC is at the forefront of reinvigorating the ecosystem, demonstrating how to make the best use of 3.0 and the latest features. The OpenCL SDK is a 1st party Khronos SDK aimed at being the „one-stop-shop” for OpenCL developers. A feature that was released close to 3.0 is the ability to register OpenCL Layers between applications and OpenCL drivers. Stream HPC also took part in implementing validation layers for OpenCL to help developers catch erronous code with informative diagnostics before/instead of crashing your application.
    
    Speaker: Máté Nagy-Egri (Stream HPC)
    
    GPU-Day 2022.pptx
- 10:40 → 11:00
  
  Coffee Break 20m
- 11:00 → 12:30
  Session II
  - 11:00
    
    ELKH Cloud vGPU challenges and implementation steps 30m
    
    Speaker: Ádám Pintér
    
    Pintér_Ádám_ELKH_Cloud_GPUDAY.pdf
  - 11:30
    
    Present and Future of GPU HPC in Hungary 20m
    
    Komondor, the newest 5 petaflops GPU supercomputer of Hungary is on its way to both academic and industrial users. The system includes 200 A100 GPU. The talk will focus on the details of the hardware and software stack, and ways to access and use the system, including the option to run containerized jobs by using web interfaces only.
    The HPC Competence Center will offer training on how to use the newest infrastructure efficiently, including GPU programming training courses in English.
    KIFÜ is already planning the upgrade of the infrastructure. The machine in question will offer 20 petaflops of capacity integrated into the EuroHPC ecosystem. KIFÜ intends to benefit from cutting-edge GPU technologies, but would gladly hear feedback from the community, to learn your opinions of future technologies and ways we would help you with your tasks effectively. The system will be ready to be integrated with Quantum computers.
    
    Speaker: Zoltan Kiss (KIFÜ)
    
    KissZ_GPU_2022_06.pdf
  - 11:50
    
    Parallel Implementation of Multivariate Empirical Mode Decomposition on GPU 20m
    
    Empirical Mode Decomposition (EMD) is an effective tool for the analysis of non-linear and non-stationary signals, which has been widely used in various application fields for noise reduction, feature extraction and classification. Due to its adaptive and data-driven nature, it has been introduced to electroencephalography (EEG) analysis to extract more accurate information in time-frequency, phase coherence and brain connectivity analysis. EMD method decomposes signal into several narrow band oscillatory mode components, known as Intrinsic Mode Functions (IMFs). Despite its advantage and importance, using EMD in signal processing is problematic as the algorithm is computationally very expensive. For high-density, high-resolution EEG measurements, the runtime can easily reach several hours.
    
    Over the past decade, several variants of the EMD method have been proposed, including Multivariate Empirical Mode Decomposition (MEMD). MEMD focuses on the empirical mode decomposition process of multi-channel signals, it treats the input signal as a multivariate signal in a high-dimensional space. By projecting the signal onto each direction vector and calculating the multivariate envelopes and IMFs, the synchronous decomposition of the multichannel signal can be realized. However, multi-channel signals will bring a heavier workload which makes MEMD computationally even more expensive.
    
    In this talk, we will describe the implementation strategy and details of a parallel CUDA MEMD algorithm. We will start with an overview of the numerical steps of MEMD, then the details of the parallelization steps, including direction vector generation, signal projection, extrema detection and selection, and the cubic spline interpolation, etc. Compared with MEMD implementation in the MATLAB-based EMDLAB toolbox, our GPU parallel version achieves about 150x performance improvement reducing execution time from hours to minutes.
    
    Speaker: Zeyu Wang (The University of Pannonia)
    
    Presentation_GPUDAY.pdf
    
    Presentation_GPUDAY.pptx
  - 12:10
    
    What makes us humans: Differences in the critical dynamics underlying the human and fruit-fly connectome 20m
    
    Previous simulation studies on human connectomes [1] suggested, that critical dynamics emerge subcrititcally in the so called Griffiths Phases. This is the consequence of the strong heterogeneity of the graphs. Now we investigate this on the largest available brain network, the $21.662$ node fruit-fly connectome, using the Kuramoto synchronizationmodel. As this graph is less heterogeneous, lacking modular structure and exhibit high topological dimension, we expect a difference from the previous results. Indeed, the synchronization transition is mean-field like, and the width of the transition region is larger than in random graphs, but much smaller than as for the KKI-18 human connectome. This demonstrates the effect of modular structure and dimension on the dynamics, providing a basis for better understanding the complex critical dynamics of humans [2].
    I show some numerical results obtained by the Kuramoto-GPU code developed for ODE solution of synchronization phenomena.
    
    [1] G. Odor and J. Kelling, Critical synchronization dynamics of the Kuramoto model on connectome and small world graphs, Scientic Reports 9 (2019) 19621.
    
    [2] Geza Odor, Gustavo Deco and Jeffrey Kelling
    Differences in the critical dynamics underlying the human and fruit-fly
    connectome, Phys. Rev. Res. 4 (2021) 023057.
    
    Speaker: Geza Odor (EK-MFA)
    
    02_04GPU22talk.odp
- 12:30 → 14:00
  
  Lunch break 1h 30m
- 14:00 → 15:40
  Session III
  - 14:00
    
    Online data processing with GPUs in ALICE during LHC Run 3 40m
    
    The ALICE experiment has undergone a major upgrade for LHC Run 3 and will record 50 times more heavy ion collisions than before.
    The new computing scheme for Run 3 replaces the traditionally separate online and offline frameworks by a unified one.
    Processing will happen in two phases.
    During data taking, a synchronous processing phase performs data compression, calibration, and quality control on the online computing farm.
    The output is stored on an on-site disk buffer.
    When there is no beam in the LHC, the same computing farm is used for the asynchronous reprocessing of the data which yields the final reconstruction output.
    ALICE will employ neither hardware nor software triggers for Pb-Pb data taking but instead store all collisions in compressed form.
    This requires full online processing of all recorded data, which is a major change compared to a traditional online systems, which sees only the data selected by a hardware trigger.
    Traditional CPUs are unable to cope with the huge data rate and processing demands of the synchronous phase, therefore ALICE employs GPUs to speed up the processing.
    Since the online computing farm performs a part of the asynchronous processing, ALICE plans to use the GPUs also for this second phase when there is no beam in the LHC.
    The primary goal for the commissioning in 2021 and 2022 was to make those reconstruction steps required for the online phase run on the GPU efficiently.
    The development is now shifting towards moving more computing-intensive steps of the asynchronous reconstruciton to the GPU as well.
    The talk will detail the ALICE Run 3 computing scheme, and outline the hardware architecture and software design for synchronous and asynchronous processing.
    
    Speaker: Dr David Rohr (CERN)
    
    2022-06-20 GPU Day.pdf
  - 14:40
    
    Machine learning based estimator for elliptic flow in heavy-ion collisions 20m
    
    Using the kinematic information of the final state particles produced in heavy-ion collisions at relativistic energies, one tries to probe the properties of the very hot and dense medium formed just after the collision. There have been different probes to study the physics associated with such a medium, and one of them is the elliptic flow ($v_2$). In this study, we have employed a deep neural network (DNN) based estimator in the machine learning framework to estimate $v_2$, using the particle kinematic information as the input. The DNN model is trained with Pb-Pb collisions at $\sqrt{s_{NN}}$ = 5.02 TeV minimum bias data, simulated with AMPT. The trained model is also evaluated for Pb-Pb collisions at $\sqrt{s_{NN}}$ = 2.76, 5.02 TeV and Au-Au collisions at $\sqrt{s_{NN}}$ = 200 GeV, and is compared with ALICE experimental results. The proposed DNN model preserves the centrality, and transverse momentum dependence of the flow coefficient. It is also found to be quite sturdy when subjected to simulated data with the uncorrelated noise as the prediction accuracy of the DNN model remains intact upto a reasonable extent. Such an estimator is yet to be tested with the experimental inputs along with detector level correlations in future with ALICE.
    
    Speaker: Suraj Prasad (Indian Institute of Technology Indore (IN))
    
    SurajPrasad_MLv2_GPUDAY.pdf
  - 15:00
    
    pCT Image Reconstruction -- A Huge Linear Problem 20m
    
    Modern proton Computed Tomography (pCT) images are usually reconstructed by algebraic reconstruction techniques (ART). The Kaczmarz-method and its variations are among the most widely used methods, which are iterative solution techniques for linear problems with sparse matrices. It is an interesting question whether statistically-motivated iterations, which have been successfully used for emission tomography, can be applied to reconstruct the novel technology of pCT images as well.
    
    In my research, I developed a method for pCT image reconstruction, based on the Richardson–Lucy deconvolution. It treats the problem as a statistically-motivated fixed-point iteration. I implemented this algorithm as a parallel code to GPU, with spline-based trajectory calculation and on-the-fly system matrix generation. My results presented that the method works well, and it can be successfully applied in pCT applications, such as in the detector R&D of the Bergen pCT Collaboration.
    
    Speaker: Akos Sudar (MTA Wigner FK)
    
    GPU_Day_2022_SA.pdf
  - 15:20
    
    Representation learning in Artificial Intelligence 20m
    
    In Artificial Intelligence (AI) the success of learning depends crucially on the way we represent the input data. In the talk we overview the criteria of an ideal representation, and associate an entropy formula to them. We also show how these representations work in case of mechanical motion reconstruction from data.
    
    Speaker: Antal Jakovac (Wigner RCP)
    
    Jakovac_GPU_day_2022.pdf
- 15:40 → 16:00
  
  Coffee break 20m
- 16:00 → 18:00
  Session IV
  - 16:00
    Machine learning methods for Schlieren imaging of a plasma channel in tenuous atomic vapor 20m
    
    Inventing and fine-tuning laser and plasma based electron accelerators is a hot topic of contemporary physics, either considering experimental, theoretical or applied physics. One of the most prominent experiments in this field is the CERN-AWAKE experiment [1]. In this experiment, electrons are accelerated by the wakefields generated by a series of proton microbunches in a 10-meter-long rubidium plasma channel. The series of proton microbunches is generated via the self-modulation instability: first, the proton beam, obtained from the SPS experiment, enters the plasma channel, then the head of the proton beam generates plasma wakes that split the proton beam into a series of microbunches with a length of a few tens of micrometers each.
    The plasma itself is generated via photoionisation of rubidium vapour with an ionising laser pulse with $780 \, \mathrm{nm}$ wavelength, $120 \, \mathrm{fs}$ pulse duration and $450 \, \mathrm{mJ}$ pulse energy [2].
    
    The spatial extent of the plasma channel that is generated by the ionizing laser pulse can be investigated using a Schlieren imaging setup. To obtain parameters for the extent of the plasma channel, we assume the plasma density distribution to be of the form:
    \begin{equation}
    \mathcal{N}{plasma} = \left{
    \begin{aligned}
    &\mathcal{N}_0 P{max}, \mathrm{~if~} r\leq r_0\
    &\mathcal{N}0 P{max}\exp\left(-\frac{(r-r_0)^2}{t_0^2}\right) , \mathrm{~if~} r>r_0
    \end{aligned}\right.
    \end{equation}
    with $\mathcal{N}_0$ being the vapour density, $P_{max}$ the maximum of the photoionisation probability, i.e.~ the value measured in the center of the plasma channel, $r=\sqrt{(y-y_0)^2+z^2}$ the distance from the center of the plasma channel that is located at $(y,z)=(y_0,0)$ and $r_{0}$ the radius of the plasma channel. $t_{0}$ characterises the width of the region where the photoionisation probability rises from $0$ to $P_{max}$. The output of the imaging setup can be calculated in a straightforward way with any given plasma density distribution. The reverse is not true, however, as the measured image depends on the plasma channel parameters $P_{max}, y_0, r_0, t_0$ in a complicated way. The task of inverting the problem, that is, to determine the plasma parameters from the calculated image can be attempted by using machine learning methods. Below we shortly summarize the key features of our approach and our recent results. We invite the Reader to look at our paper, currently available on arXiv, for a detailed description [3].
    
    Using computer simulations, a sufficient amount of good quality learning data can be generated with low computational costs. Machine learning methods have the capability of determining the parameters of the plasma density distribution, shown in the Equationgiven above. We applied different deep neural network architectures to achieve the goal, and will present three models that produce the best predictions. Our results show that using a machine learning approach, the plasma parameters can be determined with high accuracy, regardless of the background noise. We also compared the predicted Schlieren signals with the reference signals and experienced that our neural networks predicted the signals themselves accurately, with only a few percents of mean amplitude error and phase error. The calculated probability distributions of these errors also confirm the high accuracy of the predictions. Furthermore, we tested how sensitive our networks are to the uncertainty of the vapour density and the probe laser beam intensity. We found if the actual vapor density or the probe laser intensity differs not more than $\sim 2.5 \, \%$ from the reference value, i.e.~the value for which our networks have been trained, the accuracy of the predictions remains acceptable. This suggests that our approach is a reliable, robust method, with possibly better performance than other, classical methods, and is suitable for the automated evaluation of experimental data.
    
    References
    
    E. Gschwendtner, et al., AWAKE, the Advanced Proton Driven Plasma Wake?eld Acceleration experiment at CERN, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 829 (2016) 76-82. 2nd European Advanced Accelerator Concepts Workshop - EAAC 2015.
    
    E. Adli, A. Ahuja, O. Apsimon, R. Apsimon, A.-M. Bachmann, D. Barrientos, F. Batsch, J. Bauche, V. B. Olsen, M. Bernardini, et al., Acceleration of electrons in the plasma wakefield of a proton bunch, Nature 561 (2018) 363.
    
    G. Bíró, M. A. Pocsai, I. F. Barna, J. T. Moody, G. Demeter, Machine learning methods for schlieren imaging of a plasma channel in tenuous atomic vapor, 2022. URL: https://arxiv.org/abs/2205.12731 . doi:10.48550/arxiv.2205.12731.
    
    Speaker: Mr Mihály András Pocsai (Wigner RCP)
    
    pocsai_mlplasma.pdf
  - 16:20
    
    Strategies for multi-GPU PIC/MCC plasma simulation implementation on pre-exascale supercomputers 20m
    
    In this talk we describe, compare and evaluate various implementation strategies that can be used to implement massively parallel Particle-in-Cell / Monte Carlo collision low-pressure plasma simulations. Building on our earlier single-GPU 1D and 2D plasma implementations that demonstrated two orders of magnitude speedup, our goal is now to utilise the thousands of GPUs found in pre-exascale supercomputers such as MARCONI 100, SUMMIT or LEONARDO. A key performance bottleneck in these distributed memory GPU systems is communication cost. Traditionally, these systems are programmed in a hybrid parallel fashion using a combination of CUDA, OpenMP and MPI for controlling and coordinating different levels of parallelism that results in a complex and architecture dependent simulation code achieving -- at best -- weak scaling only. We demonstrate these traditional multi-GPU programming strategies in the context of our 1D plasma simulation program. We will illustrate the limitation of these approaches and their inability to hide inter-GPU communication efficiently. Then, we overview two alternative approaches, NCCL and NVSHMEM, that provide device-side, kernel-initiated communication operations that provide a more GPU-friendly programming model and improved communication hiding capabilities. Our work is still in progress but we will show preliminary results that can demonstrate the design and implementation challenges of large multi-GPU programs.
    
    Speaker: Zoltan Juhasz (University of Pannonia)
    
    ar-sh-movie.mov
    
    GPU_Day_2022_Juhasz.pdf
    
    GPU_Day_2022_Juhasz.pptx
  - 16:40
    
    Preliminary results of the tuned HIJING++ heavy-ion event generator 20m
    
    Monte Carlo event generators became one of the most important tools of modern high-energy physics. They are widely used in the high-energy community to simulate particle collisions, make predictions, and to design experiments.
    The HIJING++ (Heavy Ion Jet INteraction Generator) is the successor of the 30 year old Fortran HIJING, completely rewritten in C++, providing multi-thread processing and various new modular features. In order to have meaningful data from HIJING++ however, it had to be tuned to reproduce existing experimental data correctly. An important and resource-consuming phase of the development is the tuning of the internal parameters to reproduce the existing experimental data. These parameters cannot be determined by direct methods of calculation, therefore to get the desired, optimal generator responses we have to run the generator in every value combinations of these parameters . This is the most computationally heavy part in the development of the generator. After several months of CPU time and hundreds of terabytes of generated data we settled down on the internal parameters of the HIJING++.
    In this talk I want to highlight the process that was used to tune the HIJING++ and show the results of this process. The current state of HIJING++ can reliably reproduce experimental particle collision data, in various collision systems and energies.
    
    Speaker: Balazs Majoros (Wigner FK)
    
    majoros_balazs_gpuday_2022 (1).pptx
  - 17:00
    
    The highly increased number of protein structures calls for high performance algorithms 20m
    
    The number of unique transmembrane (TM) protein structures doubled in the last four years that can be attributed to the revolution of cryo-electron microscopy. In addition, the AlphaFold2 (AF2) deep learning algorithm also provided a large number of predicted structures with high quality. However, if a specific protein family is the subject of a study, collecting the structures of the family members is highly challenging in spite of existing general and protein domain-specific databases.
    
    We demonstrate this and assess the applicability of automatic collection of protein structures via the ABC protein superfamily. We developed a pipeline to identify and classify transmembrane ABC protein structures and also to determine their conformational states based on special geometric measures, conftors. This and similar processes need alignment of structures with a run time of 1-10s that was feasible on the scale of experimental structures (n<300K). However, the ~100M theoretical, high quality AF2 protein structures renders the calculations challenging and requires reimplementation of various algorithms.
    
    Since the AlphaFold database contains structure predictions only for single chains, we performed AF-Multimer predictions for human ABC half transporters functioning as dimers. Our AF2 predictions warn of possibly ambiguous interpretation of some biochemical data regarding interaction partners and call for further experiments and experimental structure determination. In order to organize structural data and made novel structure predictions and their annotation available for the broader scientific community, we joined the 3D-Beacon Network community to develop data and API standards.
    
    Speaker: Tamas Hegedus (Semmelweis University)
    
    04_04_hegedus_abc.pptx
Tuesday 21 June
- Mon 20 Jun
- Tue 21 Jun
- 09:00 → 10:40
  Session V
  - 09:00
    
    Piquasso, a comprehensive framework for optical quantum computer programming and simulation 30m
    
    In this talk, we introduce Piquasso, a full-stack open source platform for Photonic Quantum Computing built using Python and C++. Piquasso enables users to perform efficient Quantum Computing using continuous variables, which could be used for designing photonic circuits for simulation and machine learning purposes.
    
    Speaker: Zoltan Zimboras (Wigner RCP)
    
    Zimboras_GPU_Days.pdf
  - 09:30
    
    Improving efficiency of non-Gaussian photonic circuit simulations 20m
    
    The simulation of photonic quantum computers with non-Gaussian circuit elements
    has a high memory usage since the quantum state is usually represented as a
    tensor, which scales exponentially in the number of modes in the photonic
    circuit. However, this representation turns out to be slow and overabundant in
    most cases, forcing us to devise a new strategy for simulating general
    non-Gaussian photonic circuits.
    
    In our proposed strategy, the way to cut off the quantum state is more
    economical in terms of data, so that the memory usage of the simulation can be
    significantly reduced. In our recently developed simulator (called Piquasso) we
    implemented this strategy, which enabled us to perform simulations faster than before.
    
    Speaker: Zoltán Kolarovszki
    
    Improving_efficiency_of_non_Gaussian_photonic_circuit_simulations.pdf
  - 09:50
    
    Custom Tailored FPGA Boson Sampling 20m
    
    Computing the permanent of a matrix finds an important application in the context of boson sampling. Using the BB/FG permanent formula with a reflected binary Gray code, we implemented an FPGA design aimed at maximizing the use of logic and DSP resources to increase the parallelism and reducing the time complexity from $\mathcal{O}(n.2^{n-1})$ to $\mathcal{O}(n.2^{n-3})$. This can be furthered reduced by half using both cards of the FPGA in a dual array mode of operation. To allow the design to scale up to 40x40 matrices and achieve a speed of 280MHz, we used a properly pipelined state-of-the-art rectangular multiplier algorithm to match the structure of the DSP units of the FPGA.
    
    In practical boson sampling configurations, there will be photons sharing an optical mode, which mathematically refers to computing the permanent of a matrix with independent repeated rows and repeated columns. This allows for a generalization of BB/FG using binomial coefficients to simplify multiplicities across rows or columns:
    \begin{equation}
    \text{rp}(\boldsymbol{A}, \boldsymbol{M}, \boldsymbol{N})=\frac{1}{2^{n-1}} \sum\limits_{\boldsymbol{\Delta}} \bigg(\prod\limits_{k=1}^{m} (-1)^{\Delta_k} {M_k \choose \Delta_k}\bigg) \prod\limits_{j=1}^{m} \bigg(\sum\limits_{k=1}^{m}
    \left(M_k-2\Delta_k\right)a_{k, j}\bigg)^{N_j}
    \end{equation}
    where $A$ is a square matrix describing the interferometer with $m$ modes, $\boldsymbol{M}$ and $\boldsymbol{N}$ are the row and column multiplicities respectively such that the photon count $n=\sum\limits_{r\in R}=\sum\limits_{c\in C}$ and $\Delta$ is an n-ary Gray code, required for efficient computation. Computing binomial coefficients efficiently presents design challenges on the FPGA. By constructing a large enough loop, this can be resolved but it requires special logic around the n-ary Gray code. Certain techniques like Guan codes are insufficient as they lack the reflection property being computable with simple logic so we based our method on a dynamic programming technique. We extended this approach to stagger the Gray code at precise even intervals based on the loop length, incurring a constant initialization summation cost, and then a streamlined operation. To avoid division, we implemented division via multiplication by "magic numbers". To account for the BB/FG "anchor" row and allow the $4$ parallel operations to proceed with a simultaneous "smooth" same row update, we reduce the smallest multiplicity up to three times to maintain optimal complexity.
    
    All our implementations were designed to automatically reset on completion, providing for batching capability, an important optimization for FPGA designs and fitting perfectly into the context of boson sampling, where many permanents are computed and the batch size equals the matrix size. As we used fixed point arithmetic, we conducted accuracy testing against a CPU infinite precision calculator. We benchmarked against similar maximally efficient implementations on CPU. These are a part of the piquassoboost extension to the piquasso library. We measured two important metrics against the CPU: the matrix size cross-over threshold due to FPGA initialization time delay and performance speed-up. For non-repeated permanents, we achieve an 8.86x speed-up over CPU with a cross-over threshold at 16x16 matrices, which batching reduces it to 9x9 matrices. For repeated row/column permanents, it is dependent upon the number of photons. For 20 photons, the speed up is 5.9x and cross-over 25x25, while with batching the speed-up is 15x and the cross-over is 15x15. Our effective equivalent if long double floating point operations were used is $\frac{(C_A+C_M)*280*10^6}{10^9}$ where $C_A=2A=2*(40+4)$ and $C_M=4M+2A=6*4*39$ represent complex addition and multiplication respectively, yielding $285.5$ GFLOPS.
    
    Speaker: Gregory Morse (Department of Programming Languages and Compilers, Eötvös Loránd University)
    
    CustomTailoredFPGABosonSamplingPresentation.pdf
  - 10:10
    
    Efficient quantum gate decomposition via adaptive circuit compression 30m
    
    In this work, we report on a novel quantum gate approximation algorithm based on the application of parametric two-qubit gates in the synthesis process. The utilization of these parametric two-qubit gates in the circuit design allows us to transform the discrete combinatorial problem of circuit synthesis into an optimization problem over continuous variables. The circuit is then compressed by a sequential removal of two-qubit gates from the design, while the remaining building blocks are continuously adapted to the reduced gate structure by iterated learning cycles. We implemented the developed algorithm in the SQUANDER software package and benchmarked it against several state-of-the-art quantum gate synthesis tools. Our numerical experiments revealed outstanding circuit compression capabilities of our compilation algorithm providing the most optimal gate count in the majority of the addressed quantum circuits.
    
    Speaker: Peter Rakyta (Department of Physics of Complex Systems, Eötvös Loránd University)
    
    Rakyta_GPU_Days.pdf
- 10:40 → 11:00
  
  Coffee Break 20m
- 11:00 → 12:30
  Session VI
  - 11:00
    
    Studying hadronization with Machine Learning techniques 20m
    
    Hadronization is a non-perturbative process, which theoretical description can not be deduced from first principles. Modeling hadron formation, requires several assumptions and various phenomenological approaches. Utilizing state-of-the-art Computer Vision and Deep Learning algorithms, it is eventually possible to train neural networks to learn non-linear and non-perturbative features of the physical processes.
    
    Here, I would like to present the results of two ResNet networks, by investigating global and kinematical quantities, indeed jet- and event-shape variables. The widely used Lund string fragmentation model is applied as a baseline in √s=7 TeV proton-proton collisions to predict the most relevant observables at further LHC energies. Non-liear QCD scaling properties were also identified and validated by experimental data.
    
    [1] G. Bíró, B. Tankó-Bartalis, G.G. Barnaföldi; arXiv:2111.15655
    
    Speaker: Gabor Biro (MTA Wigner FK)
    
    BiroGabor_GPU2022.pdf
  - 11:20
    
    Polynomial speedup in exact Torontonian calculation by a scalable recursive algorithm 20m
    
    Evaluating the Torontonian function is a central computational challenge in the simulation of Gaussian Boson Sampling (GBS) with threshold detection.
    During the calculation of this matrix function exponentially large number of determinants have to be computed.
    We proposed a recursive algorithm providing a polynomial speedup in the exact calculation of the Torontonian compared to state-of-the-art algorithms.
    Our algorithm recursively reuses the data used before to reach the computational advantage.
    According to numerical analysis the complexity of the algorithm is ordo(n^1.06912 * 2^n).
    With our algorithm, one can simulate threshold GBS up to 35-40 photon clicks without the needs of large-scale computational capacities.
    
    Speaker: Ágoston Kaposi
    
    Torontonian_talk.pdf
  - 11:40
    
    Machine learning Hadron Spectral Functions in Lattice QCD 20m
    
    We study the inverse problem of reconstructing spectral functions from Euclidean correlation functions via machine learning. We propose a novel neural network, SVAE, which is based on the variational autoencoder (VAE) and can be naturally applied to the inverse problem. The prominent feature of the SVAE is that a Shannon-Jaynes entropy term having the ground truth values of spectral functions as prior information is included in the loss function to be minimized. We train the network with general spectral functions produced from a Gaussian mixture model. As a test, we use correlators generated from four different types of physically motivated spectral functions made of one resonance peak, a continuum term and perturbative spectral function obtained using non-relativistic QCD. From the mock data test we find that the SVAE in most cases is comparable to the maximum entropy method (MEM) in the quality of reconstructing spectral functions and even outperforms the MEM in the case where the spectral function has sharp peaks with insufficient number of data points in the correlator. By applying to temporal correlation functions of charmonium in the pseudoscalar channel obtained in the quenched lattice QCD at 0.75 $T_c$ on $128^3\times96$ lattices and $1.5$ $T_c$ on $128^3\times48$ lattices, we find that the resonance peak of $\eta_c$ extracted from both the SVAE and MEM has a substantial dependence on the number of points in the temporal direction ($N_\tau$) adopted in the lattice simulation and $N_\tau$ larger than 48 is needed to resolve the fate of $\eta_c$ at 1.5 $T_c$.
    
    Speaker: Feiyi Liu (Eötvös Loránd University & Central China Normal University)
    
    GPU_day_Feiyi_Liu.pdf
  - 12:00
    
    Exploring SARS-CoV-2 receptor binding domain variants 30m
    
    The COVID-19 epidemic created an extraordinary situation for the whole humanity, claiming millions of lives and causing a significant economic setback. At the same time, the international research community has rapidly generated an order of magnitude larger data set than ever before, which can contribute to understanding the evolution and dynamics of the epidemic, to its containment and to the prevention of similar pandemics. Although genetic sequences are available in a never before seen amount (as of April 2022, more than 10 million complete sequences at GISAID) a key question is what kind of phenotypic changes the mutations cause and if we can estimate the virulence or severity of a certain variant solely from the sequence. Recently a so-called "deep mutational scanning" database became available in which receptor binding affinities are measured for tens of thousands of mutations in several variants. We will calculate the 3D structure to these variants with GPU accelerated AlphaFold software and prepare the resulting database for further machine learning analysis.
    
    Speaker: Ákos Gellért
    
    02_04_GPUDAY_ÁkosGellért.pdf
    
    02_04_GPUDAY_ÁkosGellért.pptx
- 12:30 → 14:00
  
  Lunch 1h 30m
- 14:00 → 15:40
  Session VII
  - 14:00
    
    Simulating gold resonant nano-antennas for nano-fusion 30m
    
    Modern theoretical fusion research is powered by plasma simulations, inertial confinement fusion usually involves particle-in-cell (PIC) methods containing lasers interacting with charged particles. Here we show a simple kinetic model of resonant gold nanoantennas both in vacuum and a monomer used as fillings in dentistry. The model manages to describe well qualitatively and quantitatively the behaviours of nanoantennas while taking into account additional existing effects, such as electron spillout and eventually the destruction of the nanorods. We will study the nanoantenna's lifetime and absorption properties.
    
    Speaker: Istvan Papp (Wigner FK)
    
    $RRKIALN.zip
    
    $RYC3H06.zip
    
    $RZI77JR.zip
    
    PICnanorod.pdf
  - 14:30
    
    Numerical Simulation of Mirages Above Water Bodies 20m
    
    When light travels through a medium with a changing refractive index, it gets bent towards its higher values. This can produce upside down "mirror" images of a scenery in deserts, over heated roads or above bodies of water. We built a model for the temperature profile based on measurement data, from which we calculated the refractive index of air using another model. We implemented a computer program, which simulates mirages above water using the method of ray tracing. Rays of light are traced by solving the eikonal equation using different Runge-Kutta methods. We developed a simple extension of the Runge-Kutta method to efficiently check ray intersection with an arbitrary geometry. Given a picture, a physical setup and a value for both the temperature of the water body and that of the ambient air sufficiently far from the surface (where it can be considered constant), our program can realistically reproduce images of photographed mirages.
    
    Speakers: Balázs Bámer (Wigner Research Center for Physics), Anna Horváth (Wigner Research Center for Physics)
    
    AnnaHorvath_Mirage_GPU_20220615.pdf
    
    bamer-mirage_water_presentation.pdf
  - 14:50
    
    Parallel computing for determining stable parameter domain in mechatronic applications 20m
    
    The field of mechatronics engineering integrates mechanical systems and control; therefore, the main challenges of both subfields appear simultaneously. A typical task in mechatronics is position control, where the main goal is to reach the desired position or track a predefined trajectory. The primary design task is to determine the stability domain of the control parameters where the emerging vibrations converge to a stable equilibrium. A mechatronic system consists of a mechanical structure and digitally implemented control, resulting in combined effects of friction and sampling. Both of them can cause non-smooth dynamics, where the forces have discontinuities. The friction force can have discontinuities at velocity reversals, while the control force changes at every sampling instant. This dynamical behaviour makes it extremely hard to make the analysis analytically. Most of the time, approximating models are used where the friction effects are neglected. This approximation results in an inaccurate stability domain, limiting the design process. The stability analysis can be done with numerical simulation at each parameter combination, taking the friction effects into account. Still, this method is rarely used because of its high computational costs when high-resolution stability charts are needed.
    
    In this study, a parallel computing approach is introduced for exploring the stability domain of mechatronics systems with simulations. Parallelization and the general-purpose application of GPUs can radically accelerate computing tasks where partial results can be computed simultaneously. In this specific case, simulations corresponding to different parameter combinations can be run independently from each other, allowing the possibility of parallelizing. GPU programs can have thousands of threads, making the simulations extremely efficient without the accuracy trade-off of different methods.
    
    The study presents a GPU based method for determining the stability domain of a mechatronic system through an example of position control. The basic model of the system with Coulomb friction and the control law is shown, and a discrete-time model is presented. The simulations based on the mapping are implemented in OpenCL and tested on GPU. Results show that the proposed method efficiently produces high-resolution stability charts.
    
    Speaker: Mr Tamás Haba (Budapest University of Technology and Economics)
    
    GPUday 2022 Haba.pdf
  - 15:10
    
    Full Core Pin-Level VVER-440 Simulation of a Rod Drop Experiment with the GPU-Based Monte Carlo Code~GUARDYAN 30m
    
    Targeting ultimate fidelity reactor physics calculations the Dynamic Monte Carlo (DMC) method simulates reactor transients without resorting to static or quasistatic approximations. Due to the capability to harness the computing power of Graphics Processing Units, the GUARDYAN (GpU Assisted Reactor DYnamic ANalysis) code has been recently upscaled to perform pin-by-pin simulations of power plant scale systems as demonstrated in this contribution. A recent rod drop experiment at a VVER-440/213 (vodo-vodyanoi enyergeticheskiy reaktor) type power plant at Paks NPP, Hungary, was considered and signals of ex-core detectors placed at three different positions were simulated successfully by GUARDYAN taking realistic fuel loading, including burn-up data into account. Results were also compared to the time-dependent Paks NPP in-house nodal diffusion code VERETINA (VERONA: VVER Online Analysis and RETINA: Reactor Thermo-hydraulics Interactive). Analysis is given of the temporal and spatial variance distribution of GUARDYAN fuel pin node-wise power estimates. We can conclude that full core, pin-wise DMC power plant simulations using realistic isotope concentrations are feasible in reasonable computing times down to 1--2\% error of ex-core detector signals using current NVIDIA A100 GPU High Performance Computing architectures, thereby demonstrating a technological breakthrough.
    
    Speaker: David Legrady (Dr.)
    
    03_04_Legrady_Guardyan_VVER440_GPU.pptx
    
    full_VVER440_tranzient_v8.avi
    
    full_VVER440_tranzient_v9.avi
    
    vver440_demo_8.avi
    
    VVER440_excore.avi
    
    VVER440_geometry_tranzient_v1.avi
    
    VVER440_geometry_tranzient_v2.avi
    
    VVER440_geometry_tranzient_v3.avi
    
    VVER440_geometry_tranzient_v4.avi
    
    VVER440_geometry_tranzient_v5.avi
    
    VVER440_geometry_tranzient_v6.avi
- 15:40 → 16:00
  
  Coffee break 20m
- 16:00 → 17:50
  Session VIII
  - 16:00
    
    The resonant structure of the trans-Neptunian space 20m
    
    The outer realm of the Solar System, known either as the trans-Neptunian space or the Kuiper belt, is of great interest among celestial mechanical studies. Its dynamical structure is shaped to a large extent by the mean-motion resonances (MMRs) occurring between the trans-Neptunian objects (TNOs) and (mainly) the Neptune.
    In a recent research, we carried out a large-scale survey of the TNOs, with a sample containing more than 4100 small bodies. By applying the FAIR method (Forgács-Dajka, Sándor, and Érdi, 2018), we identified the most important MMRs, and distinguished between the so-called short- and long-term resonances: TNOs in the former category are only temporarily captured in a given MMR, while those of the latter remain resonant throughout the (sufficiently long) integration time. We explored the dynamical properties of such intriguing MMRs through the quantification of the chaotic diffusion. For this purpose, we adopted both classical methods (as the one e.g. based on the computation of the time evolution of the variance of an action variable) and a more recent one (based on the computation of the time derivative of the Shannon entropy).
    Apart from investigating the individual TNOs of our sample, we constructed dynamical maps of fictitious test particles, too. This approach enabled us to analyze the structure of the phase space in the vicinity of the resonances.
    Our findings indicate that a notable percentage of the examined TNOs are engaged in MMRs with Neptune, and that however chaotic the phase space appears to be, the diffusion timescales are remarkably long.
    As for the technical realization of the research, we adopted a barycentric model of the Solar System - containing the four giant planets and either a massless TNO or a test particle -, and integrated the equations of motion on a timescale of million years. Our codes were optimized for GPU computations in order to deal with the significant computational costs of integrating several hundreds of thousands of initial conditions (i.e. test particles).
    
    Speaker: Emese Kővári (Department of Astronomy, Institute of Geography and Earth Sciences, Eötvös Loránd University, H-1117 Budapest, Pázmány Péter sétány 1/A, Hungary; Centre for Astrophysics and Space Science, Eötvös Loránd University, H-1117 Budapest, Pázmány Péter sétány 1/A, Hungary; Wigner Research Centre for Physics, P.O. Box 49, Budapest H-1525, Hungary)
    
    gpuday_ekovari.pdf
  - 16:20
    
    Massively Parallel Tensor Network Algorithms 20m
    
    As the design and mass manufacturing of efficient quantum computers are still subject of intense research, the numerical simulations of quantum systems still rely on classical computation. In this case however the complexity and resource requirements of such algorithms scale exponentially relative to the system size, thus making bigger simulations problematic or even impossible to run.
    
    Our approach focuses on the development of massively parallel algorithms that are not only highly scalable and ideal to use in an HPC environment, but by building on the foundation of theoretical physics and applied mathematics the number of required arithmetic calculations could be reduced by multiple magnitudes. As a result the exponential time cost of the simulations has collapsed into polynomial complexity.
    
    The research program puts an emphasis on one of the subclasses of tensor network state algorithms called density matrix renormalization group, or DMRG for short. In such cases large-scale tensor operations can be substituted with multi-million vector and matrix operations, of which many can be executed independently of one another. Through the exploitation of these (in)dependencies arithmetic operations can be reordered and put into multiple tiers of groups corresponding to specific software and hardware layers ranging from low level CPU and GPU based SIMD execution to high level HPC scheduling. Thanks to the fact that for every tier we can execute all operations contained within the same group independently of all other arithmetics residing outside the group, mass scale parallelism can be achieved at every tier of our multi-tiered grouping. The resulting parallelization is the product of each tier's own massive parallelization, thus with suitable hardware infrastructure exascale computing in the near future might become a reality for DMRG based quantum simulations.
    
    Speaker: Andor Menczer (Eötvös Loránd University (ELTE-IK))
    
    Menczer_gpu_day.pdf
  - 16:40
    
    Critical synchronization dynamics on power grids 20m
    
    Dynamical simulation of the cascade failures on the EU and USA high-voltage power grids has been done via solving the second-order Kuramoto equation. We show that synchronization transition happens by increasing the global coupling parameter $K$ with metastable states depending on the initial conditions so that hysteresis loops occur. We provide analytic results for the time dependence of frequency spread in the large $K$ approximation and by comparing it with numerics of $d=2,3$ lattices, we find agreement in the case of ordered initial conditions. However, different power-law (PL) tails occur, when the fluctuations are strong. After thermalizing the systems we allow a single line cut failure and follow the subsequent overloads with respect to threshold values $T$. The PDFs $p(N_f)$ of the cascade failures exhibit PL tails near the synchronization transition point $K_c$. Near $K_c$ the exponents of the PL-s for the US power grid vary with $T$ as $1.4 \le \tau \le 2.1$, in agreement with the empirical blackout statistics, while on the EU power grid we find somewhat steeper PL-s characterized by $1.4 \le \tau \le 2.4$. Below $K_c$ we find signatures of $T$-dependent PL-s, caused by frustrated synchronization, reminiscent of Griffiths effects. Here we also observe stability growth following the blackout cascades, similar to intentional islanding, but for $K > K_c$ this does not happen. For $T < T_c$, bumps appear in the PDFs with large mean values, known as ``dragon king'' blackout events. We also analyze the delaying/stabilizing effects of instantaneous feedback or increased dissipation and show how local synchronization behaves on geographic maps.
    
    Speaker: Mr Shengfeng Deng (Institute of Technical Physics and Materials Science, Centre for Energy Research)
    
    EU_Chimera_freq_dark_K80_a0.4_T0.1.gif
    
    power_GPUday.pdf
  - 17:00
    
    Application of high-performance computing for bubble simulations in sonochemistry 20m
    
    The objective of sonochemistry is to increase the yield of chemical processes in a fluid with ultrasound excitation. It is based on a special case of cavitation called acoustic cavitation. Because of the ultrasound excitation, several bubbles and bubble-clouds can be formed in a liquid. During the radial oscillation of the bubbles, their compression can be so large that the internal temperature can reach several thousands of Kelvins inducing chemical reactions. The importance of sonochemistry is in its potential applications, e.g., nano-metal particle production, organic synthesis, or water purification.
    Understanding the behaviour of a single bubble in an acoustic field is an important topic of sonochemistry, with many open questions. The presentation focuses on the break-up mechanism of a bubble into several smaller bubbles. To directly observe the behaviour of a single bubble, a computational approach is used (CFD). This requires a multiphase model (gas and liquid). Another difficulty is the differences in spatial scales: the size of a bubble in sonochemistry is usually a few micrometres, while the wavelength of the used ultrasound is a few millimetres. An appropriate spatial resolution of the problem requires a highly resolved adaptive mesh with millions of cells. Furthermore, the problem must be solved in time with an appropriately small step size to correctly simulate the propagation of acoustic waves around the bubble. Due to the high spatial and temporal resolution, the solution must be parallelized, and supercomputers must be used to reduce the runtime of the simulations.
    To simulate a single bubble, the open-source program package called ALPACA is used, which is capable of simulating compressible multiphase flows in 2D or 3D. A multiresolution algorithm is employed to automatically adapt the numerical mesh during the solution process; thus, it is suitable for bubble simulations. Moreover, ALPACA is designed to be run on supercomputers. During the research, the SUPERMUC-NG supercomputer in Germany is used in collaboration with the Nanoshock research group from the Technical University of Munich.
    In this presentation, we demonstrate the basics of a bubble simulation and analyse the strong scaling of such simulations for up to 300 CPU cores. Based on the scaling analysis, an appropriate configuration can be found for efficient simulations with which several different parameters (e.g., bubble radius, acoustic excitation frequency) can be tested. Finally, the results of the bubble simulations are discussed.
    
    Speaker: Dániel Nagy (Budapest University of Technology and Economics, Department of Hydrodynamic systems)
    
    Daniel_Nagy_GPU_day_2022.pdf
  - 17:20
    
    Closing 20m
    
    Speaker: Gergely Barnafoldi (Wigner RCP RMI of the Hungarian Academy of Sciences)