In this talk we describe, compare and evaluate various implementation strategies that can be used to implement massively parallel Particle-in-Cell / Monte Carlo collision low-pressure plasma simulations. Building on our earlier single-GPU 1D and 2D plasma implementations that demonstrated two orders of magnitude speedup, our goal is now to utilise the thousands of GPUs found in pre-exascale supercomputers such as MARCONI 100, SUMMIT or LEONARDO. A key performance bottleneck in these distributed memory GPU systems is communication cost. Traditionally, these systems are programmed in a hybrid parallel fashion using a combination of CUDA, OpenMP and MPI for controlling and coordinating different levels of parallelism that results in a complex and architecture dependent simulation code achieving -- at best -- weak scaling only. We demonstrate these traditional multi-GPU programming strategies in the context of our 1D plasma simulation program. We will illustrate the limitation of these approaches and their inability to hide inter-GPU communication efficiently. Then, we overview two alternative approaches, NCCL and NVSHMEM, that provide device-side, kernel-initiated communication operations that provide a more GPU-friendly programming model and improved communication hiding capabilities. Our work is still in progress but we will show preliminary results that can demonstrate the design and implementation challenges of large multi-GPU programs.