Description
Mixing different floating-point precisions and number representations can be a highly effective tool for tackling some of the main challenges of exascale computing. By lowering precision we can reduce memory and network traffic, shrink the memory footprint, achieve more floating-point operations per second by spending less time on the same operations, and reduce energy consumption. With recently introduced hardware features the benefit can become even larger: NVIDIA Tensor Cores provide a 2.5X speed-up in HPC by enabling mixed-precision computing, and up to a 10X speed-up in AI training with their 32-bit and 16-bit Tensor Float support. With FPGAs, half precision increases the advantage further, since the operating area may decrease and the frequency of the device may increase. On the flip side, changing the representation also degrades accuracy, so mixed representations can only be used with careful consideration, which makes them even more difficult to apply automatically.

In 2017, a group of NVIDIA researchers published a study detailing how to reduce the memory requirements of neural network training with a technique called Mixed Precision Training. Weights, activations, and gradients are stored in IEEE FP16 format, but to match the accuracy of FP32 networks, FP32 master copies of the weights are maintained. During a training step, the forward and backward passes are computed in FP16 arithmetic, while the optimizer step and weight update are computed in FP32 arithmetic. To avoid gradient underflow, they also introduce a loss-scaling scheme, whereby the loss, and therefore the gradient, is scaled up by a constant factor.

GPUMixer (best paper at ISC 2019) is a performance-driven automatic tuner for GPU kernels. It uses static analysis to find sets of operations (FISets) to execute in lower precision, while data entering and leaving those sets remain in high precision.
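The training-step recipe above can be sketched in a few lines. The following toy example (a single linear unit in NumPy, with illustrative names such as `w_master` and `LOSS_SCALE` that are not from the original study) shows the FP32-master-weights and loss-scaling idea; it is an analogy, not the authors' implementation:

```python
import numpy as np

# Toy sketch of one mixed-precision training step for a single linear
# unit with a squared-error loss. Names and values are illustrative.
rng = np.random.default_rng(0)
w_master = rng.standard_normal(4).astype(np.float32)  # FP32 master copy
x = rng.standard_normal(4).astype(np.float32)
y_true = np.float32(1.0)
LOSS_SCALE = np.float32(1024.0)  # constant factor keeping FP16 gradients above underflow
lr = np.float32(0.01)

# Forward pass in FP16: master weights are cast down before use.
w16 = w_master.astype(np.float16)
y_pred = (w16 * x.astype(np.float16)).sum(dtype=np.float16)

# Backward pass in FP16 on the scaled loss:
# d(loss)/dw = 2 * (y_pred - y_true) * x, multiplied by LOSS_SCALE.
err = np.float16(y_pred - np.float16(y_true))
grad16 = np.float16(2.0) * err * x.astype(np.float16) * np.float16(LOSS_SCALE)

# Optimizer step in FP32: unscale the gradient, update the master weights.
grad32 = grad16.astype(np.float32) / LOSS_SCALE
w_master -= lr * grad32
```

Scaling before the backward pass shifts small gradient values into FP16's representable range; dividing by the same factor in FP32 recovers their true magnitude before the weight update.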
It tries to maximize the ratio of low-precision arithmetic to type-casting operations in order to achieve better performance, and it applies "shadow" execution to estimate the error and maintain a prescribed error bound.

In our work, we aim to achieve similar automatic mixed-precision execution on unstructured-mesh computations using the OP2 domain-specific language. The advantage of this approach is that we can exploit further domain knowledge instead of focusing on an individual kernel: if a variable acts as an accumulator, it should be kept in higher precision; if it stores only differences, its precision can be lowered. As an example, we measured mixed-precision execution on the Airfoil application (an industrially representative finite-volume CFD code that solves the 2D Euler equations): on two NVIDIA V100 GPUs the speed-up is 1.11X (an all-FP32 version would give 1.44X), and on 64 Intel Xeon processors the speed-up is 1.13X (all-FP32 would give 1.76X).
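The FISet-plus-shadow-execution idea can be illustrated with a small sketch. This is not GPUMixer's implementation, only a NumPy analogy: a candidate region runs in FP32 with casts at its boundaries, a shadow copy runs in FP64, and the low-precision configuration is accepted only if the observed relative error stays within a bound (`ERROR_BOUND` is an arbitrary illustrative value):

```python
import numpy as np

def region_low(a, b):
    # Candidate set of operations run in low precision: inputs are cast
    # down on entry and the result is cast back up on exit.
    a32, b32 = a.astype(np.float32), b.astype(np.float32)
    return (a32 * a32 + b32 * b32).astype(np.float64)

def region_shadow(a, b):
    # Shadow execution of the same operations, entirely in high precision.
    return a * a + b * b

rng = np.random.default_rng(1)
a = rng.standard_normal(1000)
b = rng.standard_normal(1000)

ERROR_BOUND = 1e-5  # illustrative prescribed bound
ref = region_shadow(a, b)
rel_err = np.max(np.abs(region_low(a, b) - ref) / np.abs(ref))

# Keep the low-precision version of this region only if it stays in bound.
accept = rel_err <= ERROR_BOUND
```

In the real tool the trade-off also accounts for the cost of the boundary casts: a region is only worth lowering if it contains enough arithmetic to amortize the conversions at its edges.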
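The accumulator-versus-difference heuristic can be demonstrated with a toy NumPy experiment (values and array sizes are made up for illustration): a low-precision running sum stagnates once its rounding unit exceeds the addends, while differences stored in low precision only suffer a small relative error per element:

```python
import numpy as np

vals = np.full(20000, 0.1, dtype=np.float64)  # true sum is 2000

# Accumulator in FP16: once the running sum is large enough, the spacing
# between representable FP16 values exceeds 0.1, each addition is rounded
# away, and the total stagnates far below 2000.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + np.float16(v))

# The same accumulation kept in FP64 stays accurate.
acc64 = vals.sum()

# Differences: each stored value is small and independent of the others,
# so FP16 storage costs only a bounded relative rounding error per element.
x = np.linspace(0.0, 1.0, 1001)
diff16 = np.diff(x).astype(np.float16)
diff64 = np.diff(x)
```

This is why the domain-level classification matters: the error of the FP16 accumulator grows with the number of contributions, whereas the error of each stored difference does not, so only the latter is a safe candidate for lowered precision.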