Abstract:
Adversarial attacks can mislead a Deep Learning (DL) algorithm into generating erroneous predictions by feeding it maliciously perturbed inputs called adversarial examples. DL-based Natural Language Processing (NLP) algorithms are severely threatened by adversarial attacks. In real-world black-box adversarial attacks, the adversary must submit many highly similar queries before crafting an adversarial example. Because of this lengthy process, in-progress attack detection can play a significant role in defending DL-based NLP algorithms. Although several approaches exist for detecting adversarial attacks in NLP, they are reactive in the sense that they can detect adversarial examples only after these have been crafted and fed into the algorithm. In this study, we take a step towards proactive detection of adversarial attacks in NLP systems by proposing a robust, history-based model named Stateful Query Analysis (SQA), which identifies suspiciously similar sequences of queries capable of generating textual adversarial examples; we refer to such sequences as adversarial scenarios. The model achieves a detection rate of over 99.9% in our extensive experimental tests against several state-of-the-art black-box adversarial attack methods.
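The abstract does not specify how SQA is implemented, but as a rough illustration of history-based query analysis the sketch below keeps a rolling history of query embeddings per client and raises a flag when a new query is highly similar to many recent ones, the pattern produced by iterative black-box text attacks. The encoder, model name, thresholds, and history size are all assumptions for illustration, not the authors' method.

```python
# Minimal sketch of a stateful query-analysis detector (not the paper's SQA
# implementation): remember recent query embeddings per client and flag clients
# whose new queries are suspiciously similar to many of their past queries.

from collections import defaultdict, deque

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder choice

ENCODER = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder
HISTORY_SIZE = 50            # hypothetical: queries remembered per client
SIMILARITY_THRESHOLD = 0.9   # hypothetical cosine-similarity cutoff
SUSPICIOUS_COUNT = 10        # hypothetical: similar past queries needed to alert

_history = defaultdict(lambda: deque(maxlen=HISTORY_SIZE))


def is_suspicious(client_id: str, query_text: str) -> bool:
    """Return True if the client's recent queries look like an in-progress attack."""
    embedding = ENCODER.encode(query_text, normalize_embeddings=True)
    past = _history[client_id]
    # Cosine similarity against each remembered query (embeddings are unit-norm).
    similar = sum(1 for prev in past
                  if float(np.dot(embedding, prev)) >= SIMILARITY_THRESHOLD)
    past.append(embedding)
    return similar >= SUSPICIOUS_COUNT
```

In this toy version, a client iteratively perturbing the same sentence would quickly accumulate enough near-duplicate queries to cross the threshold, whereas ordinary, diverse traffic would not.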