De-STT: De-entanglement of unwanted Nuisances and Biases in Speech to Text System using Adversarial Forgetting


Abstract

Training a robust Speech to Text (STT) system requires tens of thousands of hours of data. Variability in the data, such as unwanted nuisances (e.g., environmental noise) and biases (e.g., accent, gender, age), is the reason large datasets are needed to learn general representations, which is often not feasible for low-resource languages. In many computer vision tasks, a recently proposed adversarial forgetting approach to removing unwanted features has produced good results. This motivates us to study the effect of de-entangling the accent information from the input speech signal while training STT systems. To this end, we use an information bottleneck architecture based on adversarial forgetting. This training scheme aims to force the model to learn general accent-invariant speech representations. Two STT models trained on just 20 hours of audio, with and without adversarial forgetting, are tested on two unseen accents not present in the training set. The results favour the adversarial forgetting scheme with an absolute average improvement of 6% over the standard training scheme. Furthermore, we also observe an absolute improvement of 5.5% when tested on the seen accents present in the training set.
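To make the architecture described above concrete, below is a minimal NumPy sketch of the forward pass of an adversarial-forgetting information bottleneck: an encoder produces a latent code, a forget gate produces a soft mask that filters it, and the filtered code feeds both the transcription head and an accent adversary. All layer names, dimensions, and activations here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; real STT features would be
# e.g. 80-dim filterbanks over time, with recurrent/transformer layers).
feat_dim, hid_dim, n_chars, n_accents = 16, 8, 10, 4

# Encoder, forget gate, task (character) head, and accent adversary head.
W_enc = rng.normal(scale=0.1, size=(hid_dim, feat_dim))
W_gate = rng.normal(scale=0.1, size=(hid_dim, feat_dim))
W_task = rng.normal(scale=0.1, size=(n_chars, hid_dim))
W_adv = rng.normal(scale=0.1, size=(n_accents, hid_dim))

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x):
    z = np.tanh(W_enc @ x)         # latent speech representation
    mask = sigmoid(W_gate @ x)     # forget gate, each entry in (0, 1)
    z_filt = z * mask              # filtered (bottlenecked) code
    task_logits = W_task @ z_filt  # used for transcription
    adv_logits = W_adv @ z_filt    # adversary tries to predict accent
    return z_filt, task_logits, adv_logits

x = rng.normal(size=feat_dim)
z_filt, task_logits, adv_logits = forward(x)
```

During training, the adversary is updated to predict the accent from `z_filt`, while the encoder and gate are updated to maximize the adversary's loss (alongside the transcription loss), so the mask learns to suppress the dimensions carrying accent information.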
