Generating Natural Language Adversarial Examples
In the image domain, adversarial perturbations can be crafted to be virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. In the natural language domain, however, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a black-box population-based optimization algorithm to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models. We additionally show that 20 human annotators classify the successful adversarial examples to the true label, and judge them perceptibly quite similar to the originals. Finally, we attempt to use adversarial training as a defense, but it fails to yield improvement, demonstrating the strength and diversity of the generated examples. We hope our findings encourage researchers to pursue improving robustness in the natural language domain.
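To make the black-box population-based attack concrete, the sketch below shows the general shape of such a search: a population of word-substitution candidates is evolved via crossover and synonym mutation, using only the model's output score as fitness. Everything here is illustrative, not the paper's actual method: `model_positive_score` is a toy stand-in for a trained sentiment model, and `SYNONYMS` is a tiny hand-written table standing in for the embedding-neighbor candidate sets the paper derives.

```python
import random

# Toy "black-box" sentiment model (assumption): the attacker sees only the
# output score, never gradients. Real attacks query a trained neural model.
POSITIVE = {"great", "good", "excellent", "wonderful", "terrific"}

def model_positive_score(words):
    """P(positive) from a toy bag-of-positive-words model."""
    return sum(w in POSITIVE for w in words) / max(len(words), 1)

# Hypothetical synonym table; the paper instead uses nearest neighbors in
# embedding space, filtered for semantic and syntactic fit.
SYNONYMS = {
    "great": ["fine", "decent"],
    "good": ["okay", "fine"],
    "excellent": ["fine", "adequate"],
    "movie": ["film", "picture"],
}

def mutate(words):
    """Replace one randomly chosen replaceable word with a synonym."""
    words = list(words)
    slots = [i for i, w in enumerate(words) if w in SYNONYMS]
    if slots:
        i = random.choice(slots)
        words[i] = random.choice(SYNONYMS[words[i]])
    return words

def crossover(a, b):
    """Child takes each word position independently from either parent."""
    return [random.choice(pair) for pair in zip(a, b)]

def attack(sentence, pop_size=20, generations=50, threshold=0.2):
    """Evolve word substitutions until the positive score drops below threshold."""
    words = sentence.split()
    population = [mutate(words) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=model_positive_score)
        best = scored[0]
        if model_positive_score(best) < threshold:
            return " ".join(best)  # attack succeeded: label flips
        # Keep the elite, breed the rest from the fitter half.
        parents = scored[: pop_size // 2]
        population = [best] + [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - 1)
        ]
    return None  # search budget exhausted

random.seed(0)
adv = attack("a great movie with excellent acting and good pacing")
```

Because every edit is a word-for-word synonym swap, a successful `adv` has the same length and largely the same meaning as the input, mirroring the paper's constraint that perturbations stay semantically and syntactically similar.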
Yash Sharma*, Moustafa Alzantot*, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, Kai-Wei Chang