Description
GPUs are increasingly common in scientific high-performance computing; however, their benefits are not uniform across all areas of scientific computing. In certain fields, such as sonochemistry, where delay differential equations can arise, large amounts of data must be accessed according to the current state of the simulation, in an unaligned and uncoalesced manner, which usually limits the applicability of GPUs. The goal of this work is to leverage the advantages of both GPU and CPU architectures by overlapping parallel and serial computations, as well as memory copy and write operations, in the solution of an ensemble of ordinary differential equations (ODEs). Such overlap could shorten the time to solution for problems that need both parallel and serial work in every time step.
The general idea of a heterogeneous CPU-GPU differential equation solver is to deploy four asynchronous CUDA streams and partition the workload into four equal parts, so that each stream manages one quarter of the workload. Each stream runs through four stages. In the first stage, data is transferred from the CPU to the GPU. In the second stage, a kernel is executed on the GPU to compute a single Runge-Kutta step. In the third stage, data is transferred from the GPU back to the CPU. In the fourth stage, serial calculations are carried out on the CPU. These serial calculations could involve predicting, accepting, or rejecting the time step, or computing the delayed terms in a delay differential equation. Ideally, the four stages of different streams overlap, so that the copy engines, the GPU, and the CPU are all busy at the same time.
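The effect of overlapping the four stages can be illustrated with a small timing model. The sketch below is plain Python, not CUDA, and it rests on an assumption that is not stated in the abstract: each stage (host-to-device copy, kernel, device-to-host copy, CPU work) is treated as a separate resource that processes one workload part at a time. The function names and the stage durations are hypothetical and serve only as an illustration.

```python
def pipeline_makespan(stage_times, n_chunks):
    """Total time when chunks flow through the stages as a pipeline.

    Assumption: every stage is its own resource (copy engine, GPU, CPU)
    and can work on only one chunk at a time.
    """
    # finish[s] is the time at which stage s finished its previous chunk
    finish = [0.0] * len(stage_times)
    for _ in range(n_chunks):
        ready = 0.0  # time at which this chunk left the previous stage
        for s, duration in enumerate(stage_times):
            start = max(ready, finish[s])  # wait for the data and for the resource
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


def serial_time(stage_times, n_chunks):
    """Total time when nothing overlaps: every chunk runs all stages in turn."""
    return n_chunks * sum(stage_times)


if __name__ == "__main__":
    balanced = [1.0, 1.0, 1.0, 1.0]    # hypothetical: all stages equally long
    cpu_bound = [1.0, 1.0, 1.0, 5.0]   # hypothetical: slow serial CPU stage
    for times in (balanced, cpu_bound):
        print(times, serial_time(times, 4), pipeline_makespan(times, 4))
```

In this model, four parts with four equally long stages finish in 7 time units instead of 16. When one stage dominates, the pipeline is limited by that stage and most of the benefit disappears, which is one way to read the remark that ideal overlap requires specific conditions.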
Parameter sensitivity studies were conducted on a first-order Bernoulli-type ODE and on the Duffing equation (a second-order ODE), using both a homogeneous GPU solver and the heterogeneous CPU-GPU approach described above. GPU profiling was used to show that the four stages can be overlapped. The profiling demonstrates that, while overlapping CPU, GPU, and memory copy operations using CUDA streams is feasible, an ideal overlap is achieved only under specific conditions. For the simple problems investigated, the pure GPU approach proves to be the most effective solution. However, in the future, adaptive delay differential equation solvers may benefit from such a heterogeneous approach.
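For readers unfamiliar with the test problems, the following minimal sketch integrates an ensemble of Duffing oscillators, one parameter set per ensemble member, which is the kind of work a single GPU thread would perform per system. Several points are assumptions rather than details from the abstract: the equation is taken in its common form x'' + delta x' + alpha x + beta x^3 = gamma cos(omega t), the parameter values are hypothetical, and a fixed-step classical fourth-order Runge-Kutta scheme is used in place of whatever adaptive scheme the actual solver employs.

```python
import math


def duffing_rhs(t, x, v, p):
    """Right-hand side of the Duffing equation written as a first-order system.

    Assumed form: x'' + delta*x' + alpha*x + beta*x**3 = gamma*cos(omega*t)
    """
    delta, alpha, beta, gamma, omega = p
    return v, -delta * v - alpha * x - beta * x**3 + gamma * math.cos(omega * t)


def rk4_step(t, x, v, dt, p):
    """One classical fourth-order Runge-Kutta step (fixed step size)."""
    k1x, k1v = duffing_rhs(t, x, v, p)
    k2x, k2v = duffing_rhs(t + dt / 2, x + dt / 2 * k1x, v + dt / 2 * k1v, p)
    k3x, k3v = duffing_rhs(t + dt / 2, x + dt / 2 * k2x, v + dt / 2 * k2v, p)
    k4x, k4v = duffing_rhs(t + dt, x + dt * k3x, v + dt * k3v, p)
    x_new = x + dt / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
    v_new = v + dt / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
    return x_new, v_new


def integrate_ensemble(params, x0, v0, dt, n_steps):
    """Integrate one independent system per parameter tuple.

    On a GPU each loop iteration over `params` would map to one thread.
    """
    results = []
    for p in params:
        t, x, v = 0.0, x0, v0
        for _ in range(n_steps):
            x, v = rk4_step(t, x, v, dt, p)
            t += dt
        results.append((x, v))
    return results


if __name__ == "__main__":
    # Hypothetical sweep over the forcing amplitude gamma
    params = [(0.3, -1.0, 1.0, g / 10, 1.2) for g in range(1, 6)]
    for p, (x, v) in zip(params, integrate_ensemble(params, 1.0, 0.0, 0.01, 1000)):
        print(p[3], x, v)
```

Because the ensemble members are independent, the loop over `params` is the part that parallelizes naturally, while step-size control or delayed-term evaluation would be the serial work assigned to the CPU in the heterogeneous scheme.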