Using null samples to shape decision spaces and defend against adversarial attacks

This post is a summary of a technical report we recently posted on arXiv.org. Read the full report here: https://arxiv.org/abs/2002.10084. A copy of this post is also available on Medium.
Introduction
Shortly after the arrival of the new wave of neural network models for computer vision in roughly 2012 [1, 2], it was discovered that such networks could be fooled by cleverly designed images that, to humans, look minimally altered from an original image, if at all [3]. Such so-called adversarial examples pose a problem for many computer vision applications that need to defend against hostile actors.

Clearly the convolutional neural network (CNN) model makes decisions in a manner very different from that of humans, or any other mammal, for that matter.
One likely source of the problem is that the model is architected and trained to classify an image as belonging to one of N classes (objects), regardless of the image. For example, if a model is trained to classify images of isolated digits, from 0 to 9, and then shown a “T” or a “$” or a donkey, it will classify that image as one of the digits — that’s all it was trained to do.
What that means is that within the high-dimensional input space of possible images, every point is assigned to an object class. Decision spaces for individual classes jointly fill the entire space, despite the fact that for a given object class, the true samples lie within only a small sub-region of the learned decision space. Decision spaces abut one another, even when true samples within the spaces are well separated.
Presented below are a toy model and task that allow us to visualize this condition and the manner in which it makes conventional models vulnerable to adversarial attacks, along with a strategy for mitigating the problem.
A demonstrative example

In our toy environment, an input “image” is a linear array of three pixels in which one of two objects may reside, each object being a pair of adjacent pixels with particular values. The figure above shows the possible points at which objects of the two classes (colored magenta and cyan) may exist. Spatial location is irrelevant to class identity, and thus object samples of a single class exist in two isolated regions of the 3D input space. The colored regions represent the “optimal” decision spaces for the two classes; that is, no training samples exist outside of the magenta and cyan areas.
We trained a simple, conventional CNN on this classification task and then probed the model with samples that spanned the entire input space. Resulting decision spaces are shown below.

The decision spaces of the conventional CNN completely fill the input space, as they must for the given architecture and training, and the spaces for individual classes (middle and right panels of Figure 2) are far larger than the regions occupied by the true object samples (Figure 1). It is not difficult to imagine that a small amount of noise, added to a true object sample, could push the image across a decision boundary.
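As a rough sketch of how such a probe can be done (the classifier `model` and the grid resolution below are illustrative placeholders, not the report's exact setup), one can sweep a grid over the cube of possible three-pixel inputs and record the predicted class at each point:

```python
import numpy as np
import torch

def probe_decision_space(model, resolution=50):
    """Sweep a grid over the 3D cube of three-pixel inputs and record the
    class predicted at each point, approximating the decision spaces."""
    axis = np.linspace(0.0, 1.0, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    points = torch.tensor(grid.reshape(-1, 3), dtype=torch.float32)
    with torch.no_grad():
        labels = model(points).argmax(dim=1).numpy()
    # Each point can then be scatter-plotted in 3D, colored by predicted label.
    return points.numpy(), labels
```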
One approach to mitigating this problem is to (1) add an (N+1)ᵗʰ output to the model, an output that indicates a null class rather than one of the N object classes, and (2) use null samples as well as object samples during training. For the toy example, we use three-pixel images of uniform noise as the null samples.
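A minimal sketch of this recipe for the toy task, assuming a PyTorch classifier over three-pixel inputs (layer sizes and training details here are our own placeholders, not the report's configuration):

```python
import torch
import torch.nn as nn

N_CLASSES = 2           # the two toy object classes
NULL_LABEL = N_CLASSES  # index of the extra (N+1)-th "null" output

# A small classifier over three-pixel inputs, with one extra null output.
model = nn.Sequential(
    nn.Linear(3, 32), nn.ReLU(),
    nn.Linear(32, N_CLASSES + 1),
)

def add_null_samples(images, labels, n_null):
    """Append uniform-noise images, each labeled with the null class."""
    noise = torch.rand(n_null, images.shape[1])
    null_labels = torch.full((n_null,), NULL_LABEL, dtype=labels.dtype)
    return torch.cat([images, noise]), torch.cat([labels, null_labels])

# Training then uses an ordinary cross-entropy loss over N + 1 classes.
loss_fn = nn.CrossEntropyLoss()
```

Because the null output is just an ordinary (N+1)ᵗʰ class, no special loss function or architectural change is required.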

After training this null model, we visualized the decision spaces as we did for the conventional model. As shown above, the decision spaces have tightened considerably. The figure displays only the decision spaces for the two object classes; the visually empty space is the decision space for the null class. In other words, the model was allowed to leave an input image “unclassified” if it did not fall into the decision space of one of the object classes.
We speculated that as the dimensionality of an input space increases, the situation demonstrated by the conventional model in the toy example is only made worse. Learned decision spaces are convoluted, and true object samples may lie near decision boundaries that have no relevance to the causal features of the object — that is, the features which make it a member of its class.
A null-trained model for MNIST digit recognition
Although MNIST digit classification is generally considered an “easy” benchmark task, CNN models remain susceptible to adversarial attacks. There has been progress [5], but even the best conventional CNN models are still vulnerable [6]. One might argue that the problem has been “solved” for other model types, such as the analysis-by-synthesis model of Schott et al. [6]. We gain helpful insights from such models, but they are computationally expensive and, with 2020 hardware, are unlikely to be viable for applications that work on photo-realistic imagery.
Thus, we were motivated to test our approach on the MNIST benchmark. We first trained a conventional (or “baseline”) CNN, achieving an accuracy above 99% on the unperturbed MNIST test set. We then used the fast gradient sign method (FGSM) [4] to create adversarial images with just enough noise (the smallest value of the noise multiplier, epsilon) for the baseline model to classify each digit as one other than that of the unperturbed source image. Examples are shown below.
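FGSM itself is straightforward to sketch; the helper below follows the standard formulation from [4], assuming a PyTorch model and pixel values in the range [0, 1]:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon):
    """Perturb each image by epsilon in the direction of the sign of the
    gradient of the classification loss with respect to the input."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```

The smallest effective epsilon for a given image can then be found by increasing epsilon in small steps until the baseline model's prediction changes.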

We then trained a null model, using three types of null samples — uniform noise, mixed-digit images, and tiled-and-shuffled images. During training, all null images were given the null target label.
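The precise recipes for these null samples are given in the full report; the sketch below shows one plausible construction, where the pixel-wise averaging for mixed digits and the 7×7 tile size for shuffling are our own assumptions:

```python
import torch

def uniform_noise_nulls(n, shape=(1, 28, 28)):
    """Null samples of pure uniform pixel noise."""
    return torch.rand(n, *shape)

def mixed_digit_nulls(digits_a, digits_b):
    """Null samples that blend pairs of digit images (a simple
    pixel-wise average here; the report's blend may differ)."""
    return 0.5 * (digits_a + digits_b)

def shuffled_tile_nulls(digits, tile=7):
    """Null samples made by cutting each digit into tiles and shuffling
    the tiles, which destroys the global digit structure."""
    n, c, h, w = digits.shape
    tiles = digits.unfold(2, tile, tile).unfold(3, tile, tile)  # n, c, h/t, w/t, t, t
    tiles = tiles.contiguous().view(n, c, -1, tile, tile)
    perm = torch.randperm(tiles.shape[2])  # one shared shuffle for the batch
    tiles = tiles[:, :, perm]
    tiles = tiles.view(n, c, h // tile, w // tile, tile, tile)
    return tiles.permute(0, 1, 2, 4, 3, 5).contiguous().view(n, c, h, w)
```

During training, each of these null images simply receives the (N+1)ᵗʰ target label, as in the toy sketch above.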

After training the null model, we tested it on the adversarial images created from the baseline model. As seen in the figure below, most of the adversarial images are classified as null images (i.e., “unclassified”) rather than mistakenly classified as a different digit.

Finally, we assessed the performance of a set of baseline models and several sets of null models (each set trained with a different combination of null-sample types). Rather than simply testing on a fixed set of adversarial examples, we varied the degree of perturbation by sweeping the noise multiplier, epsilon, from 0 to 1.
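A sketch of such a sweep, reusing the hypothetical `fgsm_attack` helper above; whether the attack gradients come from the baseline model or the model being evaluated is a detail covered in the report, so the two roles are kept as separate arguments here:

```python
import torch

def sweep_epsilon(attack_model, eval_model, images, labels, null_label, epsilons):
    """For each epsilon, build FGSM adversarial images and tally how often
    the evaluated model picks a wrong digit versus the null class."""
    results = []
    for eps in epsilons:
        adversarial = fgsm_attack(attack_model, images, labels, eps)
        with torch.no_grad():
            preds = eval_model(adversarial).argmax(dim=1)
        wrong_digit = ((preds != labels) & (preds != null_label)).float().mean().item()
        nulled = (preds == null_label).float().mean().item()
        results.append({"epsilon": eps, "misclassified": wrong_digit, "null": nulled})
    return results

# Example usage: sweep_epsilon(baseline, null_model, x_test, y_test,
#                              null_label=10, epsilons=[i / 20 for i in range(21)])
```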

As seen above, null models trained with the mixed-digit null samples (left panel) are effective at preventing misclassifications on images with low perturbation, preferentially classifying such images as nulls. Models trained with shuffled-digit null samples (middle panel), in contrast, are effective at preventing misclassifications on images with high perturbation. When trained on both types of null samples (right panel), null models rarely make misclassifications, regardless of the magnitude of the perturbation.
Conclusion
MNIST null models trained on both mixed-digit and shuffled-digit null samples rarely make digit misclassification mistakes on perturbed images, and they accurately classify unperturbed digit images. Furthermore, our null-training approach is computationally efficient during both training and inference.
It should be emphasized, however, that the null models do not correctly classify images with modest perturbation, even though humans can easily identify the digits in those images. Instead, the models trade misclassifications for null classifications. This trade-off may be perfectly acceptable for many applications, in which a misclassification might be disastrous whereas a null classification could be ignored or dealt with in some other manner.
Additional research is needed in order to develop models that both generalize well and fend off adversarial attacks. One strategy may be to combine our null-training approach with corrupted image training, which was recently shown to improve generalization if meta-parameter values are properly tuned [7].
References
1. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
2. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
3. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv:1312.6199, 2013.
4. Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv:1412.6572, 2014.
5. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083, 2017.
6. Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. arXiv:1805.09190, 2018.
7. Evgenia Rusak, Lukas Schott, Roland S. Zimmermann, Julian Bitterwolf, Oliver Bringmann, Matthias Bethge, and Wieland Brendel. Increasing the robustness of DNNs against image corruptions by playing the game of noise. arXiv:2001.06057, 2020.