Description
Some applications of Monte Carlo simulation require the generation of a
large number of small grids of bits (e.g. tiles or boards of 32 by 32 bits) in
which a given number of bits are set to one and the rest are set to zero.
Conventional sequential algorithms generate them by randomly selecting empty
sites one at a time until the given number of set sites is reached. This is
inefficient: because a randomly chosen site may already be set, the number of
draws needed is itself random and grows as the grid fills up. In this talk we
present a solution that exploits the special capabilities of either the CPU or
the GPU to generate such grids at high speed. In the CUDA implementation the
tiles are generated warpwise (using warp shuffles), which eliminates the loop
divergence between threads. This design also reduces the use of shared memory:
the computation is performed in registers, and only the PRNG states are stored
in local memory.
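One way to see why the conventional approach is slow: assume, for this illustration, that each draw picks one of the N = 32 × 32 = 1024 sites uniformly at random and is discarded if that site is already set. When i sites are set, a draw succeeds with probability (N − i)/N, so the number of draws needed for that step is geometric with mean N/(N − i). Summing over the k steps gives the expected total, where H_n denotes the n-th harmonic number:

```latex
\mathbb{E}[T_k] \;=\; \sum_{i=0}^{k-1} \frac{N}{N-i} \;=\; N\,\bigl(H_N - H_{N-k}\bigr),
\qquad N = 32 \times 32 = 1024 .
```

For k = 512 this is about 709 draws, and filling the whole tile (k = N) takes N H_N ≈ 7689. Because the count differs from tile to tile, threads that each build their own tile leave the loop at different times, which is the source of the loop divergence.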
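The warpwise idea can be illustrated with a CPU emulation in C++. This is a hypothetical sketch, not the authors' implementation: the names `make_tile`, `shfl`, `count_ones` and `Tile`, the xorshift32 generator, and the propose-then-broadcast scheme are all assumptions. Each of the 32 "lanes" owns one 32-bit row of the tile and its own PRNG state. In every round each lane proposes a random site, and the proposals are broadcast one lane at a time (emulating CUDA's `__shfl_sync`) so that all lanes apply them in the same order. Because every lane holds the same set-bit counter, the loop condition is identical across the warp and no lane exits early.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical CPU emulation of one warp building one 32x32 tile.
constexpr int kWarp = 32;  // lanes per warp; lane r owns row r of the tile

// Assumed per-lane PRNG (Marsaglia's xorshift32); state must be non-zero.
struct XorShift32 {
    std::uint32_t s;
    std::uint32_t next() {
        s ^= s << 13;
        s ^= s >> 17;
        s ^= s << 5;
        return s;
    }
};

using Tile = std::array<std::uint32_t, kWarp>;

// Emulates __shfl_sync(0xffffffff, v, src): every lane reads lane src's value.
std::uint32_t shfl(const Tile& v, int src) { return v[src]; }

// Number of set bits in the whole tile.
int count_ones(const Tile& tile) {
    int ones = 0;
    for (std::uint32_t row : tile) ones += static_cast<int>(std::bitset<32>(row).count());
    return ones;
}

// Returns a 32x32 tile with exactly k bits set (0 <= k <= 1024).
Tile make_tile(int k, std::array<XorShift32, kWarp>& rng) {
    assert(k >= 0 && k <= kWarp * 32);
    Tile row{};     // row[lane] models a register held by that lane
    int count = 0;  // same value in every lane, so the loop is warp-uniform
    while (count < k) {
        Tile proposal;  // on a GPU each lane computes its own entry in parallel
        for (int lane = 0; lane < kWarp; ++lane)
            proposal[lane] = rng[lane].next() & 1023u;  // site index 0..1023
        // Apply proposals in lane order; every lane sees the same sequence.
        for (int src = 0; src < kWarp && count < k; ++src) {
            std::uint32_t site = shfl(proposal, src);
            int r = static_cast<int>(site >> 5);
            int c = static_cast<int>(site & 31u);
            // On a GPU, lane r would test its own register and broadcast the
            // result with another shuffle; here we read row[r] directly.
            if (((row[r] >> c) & 1u) == 0u) {
                row[r] |= 1u << c;
                ++count;
            }
        }
    }
    return row;
}
```

Applying the proposals in a fixed lane order makes the result equivalent to the sequential rejection algorithm, so, assuming ideal uniform draws, every set of k sites is equally likely. What changes is that the loop is warp-uniform and the tile lives entirely in per-lane registers. For example, after seeding each `rng[lane].s` with a distinct non-zero value, `count_ones(make_tile(300, rng))` is exactly 300.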