Generating visible-like face images from thermal images is essential to perform manual and automatic cross-spectrum face recognition. We successfully propose a solution based on cascaded refinement network that, unlike previous works, produces high quality generated color images without the need for face alignment, large databases, data augmentation, polarimetric sensors, computationally-intense training, or unrealistic restriction on the generated resolution. The training of our solution is based on the contextual loss, making it inherently scale (face area) and rotation invariant. We present generated image samples of unknown individuals under different poses and occlusion conditions.We also prove the high similarity in image quality between ground-truth images and generated ones by comparing seven quality metrics. We compare our results with two state-of-the-art approaches proving the superiority of our proposed approach.