The limited compute efficiency of Monte Carlo event generators for the Large Hadron Collider is expected to become a major bottleneck for simulations in the high-luminosity phase. With the aim of developing a full-fledged generator for modern GPUs, we study the performance of various recursive strategies for computing multi-gluon tree-level amplitudes. We investigate how the algorithms scale on both CPU and GPU hardware. Finally, we provide practical recommendations and baseline implementations for the development of future simulation programs. The GPU implementations can be found at: https://www.gitlab.com/ebothmann/blockgen-archive.