Speaker
Description
In this talk, we present recent work by our group on the parallel solution of multiple tridiagonal linear systems, which typically arise during the solution of discretised partial differential equations. We briefly introduce the established serial (Thomas) and parallel (Parallel Cyclic Reduction) algorithms for individual systems, then discuss how multiple systems are formed and solved in a high-dimensional system. This covers shared-memory, distributed-memory, and pipeline parallelism, targeting recent many-core CPUs, GPUs and FPGAs. We demonstrate scalability up to 16k CPU cores or 32 GPUs for large systems representative of CFD applications. We also study computational and energy efficiency on GPUs and FPGAs for smaller problems representative of applications in computational finance, demonstrating that a Xilinx Alveo U280 can closely match an NVIDIA V100 GPU in terms of throughput and significantly outperform it in terms of energy efficiency.
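For readers unfamiliar with the two single-system algorithms named above, the following is a minimal textbook-style sketch in plain Python, not the implementation presented in the talk. The function names `thomas_solve` and `pcr_solve` and the array layout (sub-diagonal `a`, diagonal `b`, super-diagonal `c`, right-hand side `d`, all of length n) are assumptions made for illustration. Both routines assume a well-conditioned (for example diagonally dominant) system and do no pivoting.

```python
def thomas_solve(a, b, c, d):
    """Serial Thomas algorithm: forward elimination, then back substitution.

    a[0] and c[-1] lie outside the matrix and are ignored.
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def pcr_solve(a, b, c, d):
    """Parallel Cyclic Reduction, written serially for clarity.

    In each step every equation i eliminates its neighbours at distance
    `stride`; the inner loop over i has no dependencies between iterations,
    so it is the part that would run in parallel.
    """
    n = len(d)
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0   # no coupling to the left of the first row
    c[-1] = 0.0  # no coupling to the right of the last row
    stride = 1
    while stride < n:
        na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
        for i in range(n):
            lo, hi = i - stride, i + stride
            if lo >= 0:
                alpha = -a[i] / b[lo]
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:
                gamma = -c[i] / b[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    # All off-diagonal couplings are now zero: each row is b[i] * x[i] = d[i].
    return [d[i] / b[i] for i in range(n)]


if __name__ == "__main__":
    # 1D Laplacian-like system whose right-hand side was built from x = 1..5.
    n = 5
    a = [-1.0] * n
    b = [2.0] * n
    c = [-1.0] * n
    d = [0.0, 0.0, 0.0, 0.0, 6.0]
    print(thomas_solve(a, b, c, d))
    print(pcr_solve(a, b, c, d))
```

The Thomas algorithm does O(n) work but is inherently sequential within one system, whereas Parallel Cyclic Reduction does O(n log n) work in about log2(n) fully parallel steps. This trade-off is one reason solving many independent systems at once opens up several parallelisation strategies.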