Neural Cleanse

This is the documentation for the TensorFlow/Keras implementation of Neural Cleanse, a defense against backdoor attacks on deep neural networks.

For details, please see the paper: Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks, IEEE S&P'19. The presentation slides are also available.

What to know before you start

At a minimum, you need a DNN model that you would like to test and some computational power (CPU or GPU). Note that you don't need a clean dataset (see FAQ). If you don't have a model to test and want to train an infected model yourself instead, see here.
The most important component of Neural Cleanse is trigger reverse-engineering (see Section IV of the paper). It has several configuration parameters, whose defaults are the values that gave successful and stable results in our experiments. As with any machine learning problem, though, optimization complexity and differences across application domains are unavoidable, so if your results don't look good, tune the parameters for your application. We list some of the important ones in the FAQ.
Because the optimization in trigger reverse-engineering is randomized, the final outlier detection result may vary between runs. The difference should not be large enough to flip the outlier detection decision. We recommend reporting average results over multiple runs. If some runs behave radically differently from others, it is worth checking their trigger reverse-engineering steps.
The devil is in the details. If you want to adapt Neural Cleanse to your data and model, make sure you understand your scenario and adjust the configuration accordingly. One example is the input preprocessing method, namely INTENSITY_RANGE in the configuration, which varies from model to model. You should also make sure the configuration is consistent between trigger reverse-engineering and outlier detection. If they are configured improperly, no error is raised, but your results can be affected significantly.
Please email us if you would like some help.

Dependencies

keras==2.2.2
numpy==1.14.0
tensorflow-gpu==1.10.1
h5py==2.6.0

Our code has been tested on Python 2.7.12 and Python 3.6.8.

Source Code

Download here.

Directory Layout

utils_backdoor.py               # Utility functions.
visualizer.py                   # Trigger reverse-engineering functions.
mad_outlier_detection.py        # Outlier detection via MAD.
gtsrb_visualize_example.py      # An example script to reverse and detect backdoors.
injection/                      # Directory for backdoor injection.
    injection_utils.py          # Utility functions for backdoor injection.
    gtsrb_injection_example.py  # An example script to inject backdoors.
data/                           # Directory to store clean data.
    gtsrb_dataset_int.h5        # GTSRB data in .h5 format.
models/                         # Directory to store testing models.
    gtsrb_bottom_right_white_4_target_33.h5    # DNN model in h5 format.

Usage

Backdoor Injection

python injection/gtsrb_injection_example.py

This script injects a white square trigger at the bottom right of inputs from the GTSRB dataset.
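As a rough illustration of what such a trigger looks like, the following NumPy sketch stamps a white square onto the bottom-right corner of a batch of images and relabels them to a target class. It is not the code in injection/gtsrb_injection_example.py. The function names, the 4-pixel size and target label 33 (suggested only by the sample model's filename), the 0-255 pixel range, the absence of any margin, and the integer label format are all assumptions made for this example.

```python
import numpy as np

def stamp_white_square(images, size=4, white_value=255):
    """Return a copy of `images` (N, H, W, C) with a white square at the bottom right."""
    stamped = np.array(images, copy=True)
    # Overwrite the last `size` rows and columns of every image and channel.
    stamped[:, -size:, -size:, :] = white_value
    return stamped

def poison(images, labels, target_label=33, size=4):
    """Stamp the trigger on every image and relabel all samples to `target_label`."""
    poisoned_x = stamp_white_square(images, size=size)
    poisoned_y = np.full_like(labels, target_label)
    return poisoned_x, poisoned_y

# Hypothetical all-black inputs with integer labels, used only to show the effect.
clean_x = np.zeros((2, 32, 32, 3), dtype=np.uint8)
clean_y = np.array([1, 7])
bad_x, bad_y = poison(clean_x, clean_y)
```

In a real injection run only a fraction of the training set would typically be poisoned and mixed with clean samples; that ratio is not specified here.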

Trigger Reverse-engineering

We include a script, gtsrb_visualize_example.py, to show how to perform trigger reverse-engineering on a DNN model. If you download the zip file, it already contains an infected model trained on a traffic sign recognition task, along with the clean testing data (used for reverse-engineering). The script uses this model and dataset.

To reverse-engineer triggers on the infected GTSRB model, run

python gtsrb_visualize_example.py

It takes about 10 minutes. All reverse-engineered triggers will be saved to RESULT_DIR. You need to modify several parameters if you would like to run it on different data or a different model:

GPU device: If you use a GPU, specify which one in the DEVICE variable.
Directories: If you use the code on your own models and datasets, place them under the data/ and models/ directories and specify their paths, and create a directory called result/ to save the reversed triggers.
Model-specific information: If you test your own model, specify the correct meta information about the task, including input size (IMG_ROWS, IMG_COLS, IMG_COLOR), preprocessing method (INTENSITY_RANGE), and total number of labels (NUM_CLASSES).
Optimization configurations: There are several parameters you can configure for the optimization process, including learning rate, batch size, number of samples per iteration, total number of iterations, and the initial value of the balancing weight. We did not observe strong sensitivity to these settings in our experiments, but it is worth trying different configurations for your application if the results don't look good.
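The settings above could be gathered in one place along the following lines. This is only a sketch: the variable names DEVICE, RESULT_DIR, IMG_ROWS, IMG_COLS, IMG_COLOR, INTENSITY_RANGE and NUM_CLASSES come from this document, while DATA_DIR, MODEL_DIR, INPUT_SHAPE, VALID_INTENSITY_RANGES and every value shown are placeholders chosen for illustration, not necessarily the defaults shipped in the script.

```python
# Illustrative placeholders only; replace every value with the ones for your model.
DEVICE = "0"                 # which GPU to use (assumed to be a device id string)

DATA_DIR = "data"            # hypothetical name: where the dataset lives
MODEL_DIR = "models"         # hypothetical name: where the model under test lives
RESULT_DIR = "result"        # where reversed triggers are saved

IMG_ROWS, IMG_COLS, IMG_COLOR = 32, 32, 3   # input size of the model (assumed)
INTENSITY_RANGE = "raw"                     # preprocessing expected by the model
NUM_CLASSES = 43                            # GTSRB has 43 classes

# Derived value that both reverse-engineering and outlier detection must agree on.
INPUT_SHAPE = (IMG_ROWS, IMG_COLS, IMG_COLOR)

# Fail early on a mistyped preprocessing mode instead of silently
# producing misleading results.
VALID_INTENSITY_RANGES = ("imagenet", "inception", "mnist", "raw")
assert INTENSITY_RANGE in VALID_INTENSITY_RANGES
```

Keeping the model-specific values identical in the reverse-engineering script and in mad_outlier_detection.py avoids the silent mismatch described earlier.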

Anomaly Detection

We use an anomaly detection algorithm called MAD (Median Absolute Deviation). A useful explanation of MAD can be found here. Our implementation loads all reversed triggers and flags any outlier whose trigger is unusually small. Before you execute the code, make sure the following configuration is correct in the script mad_outlier_detection.py:

Path to reversed triggers: Specify the location of all reversed triggers in RESULT_DIR. The filename format in the sample code is consistent with the trigger reverse-engineering code. Our code only checks whether there is an anomaly among the reversed triggers in the specified folder, so be sure that the folder contains all the triggers you would like to analyze.
Model-specific information: Set the model-specific information correctly so that the anomaly detection code can load reversed triggers with the correct shape. You need to specify the input shape and the total number of labels in the model.

To execute the sample code, run

python mad_outlier_detection.py

Below is a snippet of the outlier detection output for the infected GTSRB model:

median: 64.466667, MAD: 13.238736
anomaly index: 3.652087
flagged label list: 33: 16.117647

The second line shows that the final anomaly index is 3.652, which suggests the model is infected (the paper uses a threshold of 2). The third line shows that the outlier detection algorithm flags only one label (label 33), whose trigger has an L1 norm of 16.1. Note that the absolute value of the anomaly index may differ on your side due to randomness in the trigger reverse-engineering optimization, but the variation should not be large enough to flip the anomaly detection decision (see FAQ).
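The numbers in this output are consistent with the usual MAD computation: (64.466667 - 16.117647) / 13.238736 ≈ 3.652. The sketch below shows that computation on a dictionary of per-label L1 norms. It is a minimal illustration, not the contents of mad_outlier_detection.py. The function name, the consistency constant 1.4826 (the standard choice for normally distributed data), and the example norms are assumptions.

```python
import numpy as np

def mad_anomaly_index(l1_norms, threshold=2.0):
    """Compute median, MAD, anomaly index and flagged labels from {label: L1 norm}."""
    labels = list(l1_norms.keys())
    norms = np.array([l1_norms[k] for k in labels], dtype=float)

    consistency_constant = 1.4826  # assumed: scales MAD to match std for normal data
    median = float(np.median(norms))
    mad = float(consistency_constant * np.median(np.abs(norms - median)))

    # Anomaly index of the smallest trigger, the one most likely to be a backdoor.
    anomaly_index = float(abs(norms.min() - median) / mad)

    # Only unusually small triggers are flagged; large ones are ignored.
    flagged = [(lab, float(n)) for lab, n in zip(labels, norms)
               if n < median and abs(n - median) / mad > threshold]
    return median, mad, anomaly_index, flagged

# Hypothetical L1 norms for seven labels; label 4 has a much smaller trigger.
example_norms = {0: 60, 1: 70, 2: 65, 3: 62, 4: 16, 5: 68, 6: 66}
median, mad, index, flagged = mad_anomaly_index(example_norms)
print("median: %f, MAD: %f" % (median, mad))
print("anomaly index: %f" % index)
print("flagged label list:", flagged)
```

With these made-up norms only label 4 exceeds the threshold of 2, mirroring how label 33 stands out in the sample output.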

FAQ

Q: Do I need clean data to run Neural Cleanse?
A: No. Although the paper and these instructions use clean samples, we have also tested our algorithm using only random noise instead of clean data samples, and the trigger reverse-engineering algorithm still successfully reverses the trigger. This is because the optimization only needs to find the "shortcut" between classes, regardless of what the data distribution looks like in each class.
Q: Which parameters in trigger reverse-engineering deserve special attention?
A: The following parameters matter most:
- INTENSITY_RANGE: The preprocessing method according to your model. One of "imagenet", "inception", "mnist" or "raw" (no preprocessing).
- LR: The optimization learning rate.
- INIT_COST: The initial weight used for balancing two objectives.
- STEPS: The total optimization iterations.
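To make the INTENSITY_RANGE options concrete, here is a sketch of what each mode might do to raw 0-255 pixel values. The mappings are assumptions based on common conventions ("mnist" scales to [0, 1], "inception" scales to [-1, 1], "imagenet" uses Caffe-style BGR mean subtraction), and the function name is hypothetical. Check them against your model's actual training pipeline and against the code before relying on them.

```python
import numpy as np

# Caffe-style per-channel means in BGR order, as used by Keras' "caffe" mode.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68])

def preprocess(x_raw, intensity_range):
    """Map raw 0-255 images (N, H, W, C) into the range a model expects (assumed)."""
    x = np.asarray(x_raw, dtype=np.float64)
    if intensity_range == "raw":
        return x                                  # no preprocessing
    if intensity_range == "mnist":
        return x / 255.0                          # scale to [0, 1]
    if intensity_range == "inception":
        return (x / 255.0 - 0.5) * 2.0            # scale to [-1, 1]
    if intensity_range == "imagenet":
        return x[..., ::-1] - IMAGENET_BGR_MEAN   # RGB -> BGR, subtract means
    # Raise instead of silently passing data through with the wrong scale.
    raise ValueError("unknown INTENSITY_RANGE: %r" % (intensity_range,))

white = np.full((1, 2, 2, 3), 255.0)
black = np.zeros((1, 2, 2, 3))
```

Raising on an unknown mode is a deliberate choice here, since a wrong preprocessing setting otherwise degrades results without any error.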
Q: What if the results are slightly different across different runs?
A: Like all optimization-based approaches in machine learning, there'll be some small variations across different runs. But usually it won't affect the major conclusion. If you experience a radically different result, please go back to double check the trigger reverse-engineering step.
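One simple way to follow the advice about multiple runs is to collect the anomaly index from each run and check whether every run reaches the same decision. The sketch below assumes you have recorded those indices yourself. The function name and the example values are hypothetical, and the threshold of 2 is the one mentioned above.

```python
import numpy as np

def summarize_runs(anomaly_indices, threshold=2.0):
    """Summarize anomaly indices from several independent runs."""
    idx = np.asarray(anomaly_indices, dtype=float)
    decisions = idx > threshold  # True means the run says "infected"
    return {
        "mean": float(idx.mean()),
        "std": float(idx.std()),
        "infected_votes": int(decisions.sum()),
        # All runs agree if they are all above or all below the threshold.
        "consistent": bool(decisions.all() or not decisions.any()),
    }

# Hypothetical anomaly indices from three runs, for illustration only.
summary = summarize_runs([3.65, 3.40, 3.90])
print(summary)
```

If "consistent" is False, the runs disagree about the decision, and the trigger reverse-engineering step of the outlying runs is the first thing to inspect.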

Citation

Please cite the paper as follows

@inproceedings{wang2019neural,
title={Neural cleanse: Identifying and mitigating backdoor attacks in neural networks},
author={Wang, Bolun and Yao, Yuanshun and Shan, Shawn and Li, Huiying and Viswanath, Bimal and Zheng, Haitao and Zhao, Ben Y},
booktitle={Proceedings of IEEE S\&P},
year={2019}
}

Contact

Bolun Wang (bolunwang@cs.ucsb.edu)
Kevin Yao (ysyao@cs.uchicago.edu)
Shawn Shan (shansixioing@uchicago.edu)