An adversarial example is a slightly modified input designed to mislead a machine learning model. In this project we focus on images as inputs and deep neural networks as models; however, adversarial examples exist in other domains as well (e.g. audio) [1]. Such deceptive inputs are typically crafted by taking a clean input and modifying it in a way that reduces the target network's confidence in the correct label. The induced perturbations are designed to be so subtle that they are barely perceptible to a human.

These manipulations can even occur in the physical world by modifying the appearance of an object [2]. With the adoption of neural networks in autonomous vehicles, for example, the existence of adversarial examples raises serious safety concerns, such as misread road markings or stop signs [3].

Deep neural networks can be attacked in either the training or the inference phase [4]. We focus on the latter.

Why adversarial examples exist

Since the discovery of adversarial examples targeting neural network classifiers [5], various explanations for this phenomenon have been proposed. The first hypothesis was that they are caused by the highly non-linear nature of neural networks, which produces predictions for inputs that have no nearby training examples [5]. The authors also found that adversarial examples transfer between different models, a property referred to as transferability.

In 2015, this hypothesis was overturned in favour of another explanation: it is the largely linear behaviour of networks (the use of Rectified Linear Units, the near-linear regime of sigmoids around 0, etc.) that allows adversarial examples to exist. The same work introduced the Fast Gradient Sign Method (FGSM) to generate them [6].
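
As a concrete illustration, here is a minimal FGSM sketch on a toy binary logistic-regression model, chosen because its input gradient is analytic; the model, names and numbers are our assumptions, not taken from [6]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One FGSM step against a binary logistic-regression model.

    For the cross-entropy loss of this model, the gradient with
    respect to the input x is (p - y) * w, so no autodiff is needed.
    """
    p = sigmoid(w @ x + b)           # model's confidence in class 1
    grad = (p - y) * w               # dLoss/dx, analytic for this model
    return x + eps * np.sign(grad)   # move every pixel by +/- eps

# A clean input the model classifies correctly with high confidence.
rng = np.random.default_rng(0)
w, b = rng.normal(size=16), 0.0
x = w / np.linalg.norm(w)            # aligned with w -> confident class 1
x_adv = fgsm(x, y=1.0, w=w, b=b, eps=0.1)
```

Even though every pixel moves by at most eps, the many small per-pixel changes accumulate in the dot product between w and the input, which is exactly the linearity argument of [6].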

Since then, multiple attack methods based on this linearity assumption have been developed. We present some of them below.

Recently, the search for the origin of adversarial examples has shifted away from the networks and toward the data. [7, 8] claim that the existence of non-robust features in the data enables adversarial examples. These non-robust features are input features that help correctly classify an image in a standard setting but hurt accuracy under adversarial attack.

It is possible to train a robust classifier using only robust features, i.e. features that help correctly classify an image in both the standard and the adversarial setting. The price for relying exclusively on robust features is lower standard accuracy: intuitively, the classifier has fewer features at its disposal to classify an image. Interestingly, Tsipras et al. [7] show that these robust features tend to be ones that humans themselves use to classify images, making robust models more interpretable to humans and useful to GANs, which we cover in the next project.

Ilyas et al. [8] also show that a model trained only on adversarially perturbed data with the corresponding incorrect labels can still achieve good accuracy in the standard setting, supporting the claim that neural networks rely on imperceptible features. In a second experiment they show that a dataset restricted to robust features can transfer robustness to different networks. Their findings are discussed in this article.

Existing attack methods and how to generate them

Existing attack methods can be grouped into the following categories.

White box: The attacker has full access to the model with all its parameters.

Black box with probing: The attacker has no access to the model's parameters; however, the model can be queried to approximate the gradients.

Black box without probing: Here, the attacker has no access to the model and cannot query it either.

Digital attack: An attacker has direct access to digital data fed into the model.

Moreover, attacks can be targeted or untargeted. An untargeted attack is successful if any wrong class is predicted; a targeted attack has to make the model predict one specific class. Most existing attack methods require a gradient to work with. Consequently, most black-box attacks take advantage of transferability: Papernot et al. have shown that adversarial examples transfer not only between different models but also between different machine learning techniques and training datasets [9, 10]. Recently, [11] have shown that network parameters can also be extracted by analyzing the power consumption of the model during inference.

The following are the methods that we explore in this project.

  • Fast Gradient Sign Method (FGSM) by [6]
  • Basic Iterative Method (BIM) by [2]
  • Iterative Least Likely Class Method (ILLM) by [2]
  • DeepFool [12]
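
To illustrate the difference between the untargeted and the targeted iterative variants, here is a minimal sketch of BIM and ILLM on a toy linear softmax classifier; the model and all names are our assumptions, not taken from [2]:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_grad(x, W, label):
    """Gradient of the cross-entropy loss w.r.t. the input x,
    analytic for a linear softmax model with logits W @ x."""
    p = softmax(W @ x)
    onehot = np.zeros_like(p)
    onehot[label] = 1.0
    return W.T @ (p - onehot)

def bim(x, y, W, eps, alpha, steps):
    """Basic Iterative Method: repeated small FGSM steps on the true
    label y, clipped so the total perturbation stays in the eps-ball."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(input_grad(x_adv, W, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

def illm(x, W, eps, alpha, steps):
    """Iterative Least-Likely Class Method: like BIM, but *descends*
    the loss toward the class the model finds least likely."""
    target = int(np.argmin(softmax(W @ x)))
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv - alpha * np.sign(input_grad(x_adv, W, target))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

# A tiny 3-class model and an input confidently classified as class 0.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
x = np.array([2.0, 0.0])
x_bim = bim(x, y=0, W=W, eps=0.5, alpha=0.05, steps=10)
x_illm = illm(x, W=W, eps=0.5, alpha=0.05, steps=10)
```

BIM lowers the model's confidence in the true class, while ILLM actively raises the probability of the least likely class; both keep every coordinate within eps of the original input.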

Other methods

The methods above modify all pixels slightly to increase the loss. [13] propose instead to modify only one (or a few) pixels with greater magnitude, which yields an overall less perturbed image; the modification is found by differential evolution, so no gradient is required. [14] introduce the Jacobian-based Saliency Map Attack (JSMA), which crafts adversaries from the model's forward derivative rather than by backpropagating a loss.

The strongest attacks to date have been proposed by Carlini and Wagner [15]. Their attacks are based on three distance metrics (L0, L2 and L∞) and produce adversaries that are strong, imperceptible and classified with high confidence. In contrast to BIM or DeepFool, these attacks can target any desired class.
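
As a rough sketch of the misclassification term these attacks optimize (variable names are ours; the full attack additionally minimizes the perturbation norm):

```python
import numpy as np

def cw_margin(logits, target, kappa=0.0):
    """Logit-margin term f used in the Carlini-Wagner targeted attack:
    f(x') = max(max_{i != t} Z(x')_i - Z(x')_t, -kappa).
    It is positive while the target class t loses, and bottoms out at
    -kappa once the target logit beats every other logit by kappa,
    i.e. once the attack has succeeded with the desired confidence."""
    other = np.max(np.delete(logits, target))
    return max(other - logits[target], -kappa)
```

The full L2 attack minimizes the squared perturbation norm plus c times this term via gradient descent (with a change of variables that keeps pixels in range); only the misclassification term is sketched here.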

How to defend against them

As of this writing, the best defense against adversarial examples is to include them in the model's training data (adversarial training [6]). This allows the model to learn robust decision boundaries, which works better for models with large capacity [16]. Papernot et al. [17] created the CleverHans library to support adversarial training by providing implementations of the common attack methods.
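
A minimal sketch of such a training loop, using FGSM as the inner attack on a toy logistic-regression model (all names and hyperparameters are our assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_training(X, y, eps=0.1, lr=0.5, epochs=200, seed=0):
    """Adversarial training sketch for logistic regression: each epoch,
    craft FGSM examples against the *current* weights, then take a
    gradient step on the clean and the adversarial batch together."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        X_adv = X + eps * np.sign((p - y)[:, None] * w)  # FGSM on current w
        for Xb in (X, X_adv):                            # fit both batches
            pb = sigmoid(Xb @ w)
            w -= lr * Xb.T @ (pb - y) / len(y)           # cross-entropy step
    return w
```

The key design point is that the adversarial batch is regenerated every epoch against the current weights, so the model keeps training on examples that currently fool it rather than on a fixed set of stale adversaries.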

A second approach is to use another model that is specialized in detecting adversaries [18]. A third approach is defensive distillation [19]. However, Carlini and Wagner have shown that defensive distillation cannot withstand their strong attacks [15].

Beyond classification

Adversarial examples also exist for semantic segmentation, object detection and pose estimation tasks. Two common algorithms to generate them are Dense Adversary Generation [20] and Houdini [21]. Xiao et al. [22] analyze these and find that, for semantic segmentation, adversarial examples do not transfer between models. Moreover, they introduce a spatial consistency check as a promising detection mechanism for the segmentation task.


[1]   Qin, Y., Carlini, N., Goodfellow, I., Cottrell, G., & Raffel, C. (n.d.). Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition.

[2]   Kurakin, A., Goodfellow, I., & Bengio, S. (2017). Adversarial examples in the physical world. ArXiv:1607.02533 [Cs, Stat].

[3]   Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., & Song, D. (2018). Robust Physical-World Attacks on Deep Learning Models. ArXiv:1707.08945 [Cs].

[4]   Huang, L., Joseph, A. D., Nelson, B., Rubinstein, B. I. P., & Tygar, J. D. (n.d.). Adversarial Machine Learning.

[5]   Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. ArXiv:1312.6199 [Cs].

[6]   Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. ArXiv:1412.6572 [Cs, Stat].

[7]   Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., & Madry, A. (2019). Robustness May Be at Odds with Accuracy.

[8]   Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., & Madry, A. (2019). Adversarial Examples Are Not Bugs, They Are Features. ArXiv:1905.02175 [Cs, Stat].

[9]   Papernot, N., McDaniel, P., & Goodfellow, I. (2016). Transferability in Machine Learning: From Phenomena to Black-Box Attacks using Adversarial Samples. ArXiv:1605.07277 [Cs].

[10]   Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017). Practical Black-Box Attacks against Machine Learning. ArXiv:1602.02697 [Cs].

[11]  Dubey, A., Cammarota, R., & Aysu, A. (2019). MaskedNet: The First Hardware Inference Engine Aiming Power Side-Channel Protection. ArXiv:1910.13063 [Cs].

[12]   Moosavi-Dezfooli, S.-M., Fawzi, A., & Frossard, P. (2016). DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]   Su, J., Vargas, D. V., & Sakurai, K. (2019). One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 23(5), 828–841.

[14]   Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., & Swami, A. (2015). The Limitations of Deep Learning in Adversarial Settings. ArXiv:1511.07528 [Cs, Stat].

[15]   Carlini, N., & Wagner, D. (2017). Towards Evaluating the Robustness of Neural Networks. ArXiv:1608.04644 [Cs].

[16]   Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2017). Towards Deep Learning Models Resistant to Adversarial Attacks. ArXiv:1706.06083 [Cs, Stat].

[17]   Papernot, N., Faghri, F., Carlini, N., Goodfellow, I., Feinman, R., Kurakin, A., Xie, C., Sharma, Y., Brown, T., Roy, A., Matyasko, A., Behzadan, V., Hambardzumyan, K., Zhang, Z., Juang, Y.-L., Li, Z., Sheatsley, R., Garg, A., Uesato, J., … McDaniel, P. (2018). Technical Report on the CleverHans v2.1.0 Adversarial Examples Library. ArXiv:1610.00768 [Cs, Stat].

[18]   Lu, J., Issaranon, T., & Forsyth, D. (2017). SafetyNet: Detecting and Rejecting Adversarial Examples Robustly. ArXiv:1704.00103 [Cs].

[19]   Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. ArXiv:1503.02531 [Cs, Stat].

[20]   Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., & Yuille, A. (2017). Adversarial Examples for Semantic Segmentation and Object Detection. ArXiv:1703.08603 [Cs].

[21]   Cisse, M., Adi, Y., Neverova, N., & Keshet, J. (2017). Houdini: Fooling Deep Structured Prediction Models. ArXiv:1707.05373 [Cs, Stat].

[22]   Xiao, C., Deng, R., Li, B., Yu, F., Liu, M., & Song, D. (2018). Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation. ArXiv:1810.05162 [Cs].