Self-Supervised Contrastive Learning with Adversarial Perturbations for Robust Pretrained Language Models


Abstract in English

This paper improves the robustness of the pretrained language model BERT against word substitution-based adversarial attacks by leveraging self-supervised contrastive learning with adversarial perturbations. One advantage of our method over previous work is that it improves model robustness without using any labels. We also create an adversarial attack for word-level adversarial training on BERT. The attack is efficient, allowing adversarial training for BERT on adversarial examples generated on the fly during training. Experimental results show that our method improves the robustness of BERT against four different word substitution-based adversarial attacks. Moreover, combining our method with adversarial training gives higher robustness than adversarial training alone. To understand why our method improves robustness against adversarial attacks, we study the vector representations of clean examples and their corresponding adversarial examples before and after applying our method. Because our method improves model robustness using unlabeled raw data, it opens up the possibility of using large text datasets to train robust language models.
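As a rough illustration of the idea of contrasting clean examples with their adversarial counterparts, the sketch below computes an InfoNCE-style contrastive loss in plain Python. The abstract does not state the exact objective, so the loss form, the temperature value, the toy 2-dimensional vectors, and the names `cosine` and `contrastive_loss` are all assumptions made for illustration, not the method used in the paper. Each clean vector is pulled toward its own adversarial vector and pushed away from the adversarial vectors of the other examples in the batch, and no labels are involved.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(clean, adversarial, temperature=0.5):
    """InfoNCE-style loss (an assumed objective, for illustration only).

    clean[i] and adversarial[i] are treated as a positive pair;
    adversarial[j] with j != i serve as in-batch negatives.
    """
    total = 0.0
    for i, anchor in enumerate(clean):
        logits = [cosine(anchor, adv) / temperature for adv in adversarial]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / len(clean)


# Hypothetical toy representations (not real BERT outputs).
clean = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # adversarial vectors stay close to their clean ones
swapped = [[0.1, 0.9], [0.9, 0.1]]   # adversarial vectors drift toward the wrong example

print(contrastive_loss(clean, aligned) < contrastive_loss(clean, swapped))
```

In this toy setting the loss is lower when each adversarial vector stays near its clean counterpart, which is the behavior a robustness-oriented contrastive objective would reward.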
