Simulations of systems with quenched disorder are extremely demanding, suffering from the combined effect of slow relaxation and the need to perform the disorder average. As a consequence, new algorithms, improved implementations, and alternative or even purpose-built hardware are often instrumental for conducting meaningful studies of such systems. The ensuing demands regarding hardware availability and code complexity are substantial and sometimes prohibitive. We demonstrate that a moderate coding effort, which leaves the overall structure of the simulation code unaltered relative to a CPU implementation, yields very significant speed-ups from a parallel code on GPU, mainly by exploiting the trivial parallelism of the disorder samples and the near-trivial parallelism of the parallel tempering replicas. Combining this massively parallel implementation with a careful choice of the temperature protocol for parallel tempering and with efficient cluster updates allows us to equilibrate comparatively large systems with moderate computational resources.
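
The two levels of parallelism named above can be illustrated with a minimal serial sketch in Python. It is not the GPU code discussed here: the toy model (a ring of Ising spins with random ±J bonds), the function names, the temperature set, and the system size are all assumptions chosen for brevity, and cluster updates are omitted. The outer loop over disorder samples has no data dependencies at all, while the loop over replicas is independent except for the occasional exchange step, which uses the standard parallel tempering acceptance probability min{1, exp[(β_i − β_j)(E_i − E_j)]}.

```python
import math
import random


def energy(spins, couplings):
    """Energy of a ring: E = -sum_i J_i s_i s_{i+1} with periodic boundary."""
    n = len(spins)
    return -sum(couplings[i] * spins[i] * spins[(i + 1) % n] for i in range(n))


def metropolis_sweep(spins, couplings, beta, rng):
    """One single-spin-flip Metropolis sweep at inverse temperature beta."""
    n = len(spins)
    for i in range(n):
        # Local field from the two bonds touching site i (index -1 wraps around).
        field = couplings[i - 1] * spins[i - 1] + couplings[i] * spins[(i + 1) % n]
        delta = 2.0 * spins[i] * field  # energy change if spin i is flipped
        if delta <= 0.0 or rng.random() < math.exp(-beta * delta):
            spins[i] = -spins[i]


def swap_accept_probability(beta_i, beta_j, e_i, e_j):
    """Standard replica-exchange acceptance probability."""
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def run_sample(seed, n_spins=16, betas=(0.2, 0.5, 1.0, 2.0), n_steps=200):
    """Parallel tempering for ONE disorder sample (hypothetical toy setup)."""
    rng = random.Random(seed)  # independent random stream per sample
    couplings = [rng.choice((-1.0, 1.0)) for _ in range(n_spins)]
    replicas = [[rng.choice((-1, 1)) for _ in range(n_spins)] for _ in betas]
    for step in range(n_steps):
        # Near-trivial parallelism: replicas evolve independently here.
        for spins, beta in zip(replicas, betas):
            metropolis_sweep(spins, couplings, beta, rng)
        # Exchange step: the only point where replicas communicate.
        # Alternate between even and odd neighbouring temperature pairs.
        for i in range(step % 2, len(betas) - 1, 2):
            e_i = energy(replicas[i], couplings)
            e_j = energy(replicas[i + 1], couplings)
            p = swap_accept_probability(betas[i], betas[i + 1], e_i, e_j)
            if rng.random() < p:
                replicas[i], replicas[i + 1] = replicas[i + 1], replicas[i]
    # Final energy at each temperature (a crude, purely illustrative observable).
    return [energy(spins, couplings) for spins in replicas]


def disorder_average(n_samples=8):
    """Trivial parallelism: every sample could run on its own worker."""
    per_sample = [run_sample(seed) for seed in range(n_samples)]
    n_temps = len(per_sample[0])
    return [sum(s[t] for s in per_sample) / n_samples for t in range(n_temps)]


if __name__ == "__main__":
    print(disorder_average())
```

In a GPU port one would plausibly distribute the loop in `disorder_average` and the replica loop in `run_sample` over thread blocks or threads, leaving the loop structure itself unchanged; that mapping is an assumption made here for illustration and is not specified in the text above.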