Counterfactual Explanations and Model Extraction Attacks

Explored how the additional information provided by counterfactual explanations can be exploited by an adversary in order to improve model extraction attacks.

Problem setting

Counterfactual explanations provide guidance on achieving a favorable outcome from a model with minimal input perturbation. However, they can also be exploited to leak information about the underlying model, raising privacy concerns. Prior work has shown that an adversary can query the model for counterfactual explanations on several input instances and train a surrogate model on the queries together with their counterfactual explanations. In this project, we investigate how model extraction attacks can be improved by further leveraging the fact that counterfactual explanations also lie close to the decision boundary.
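A minimal sketch of that baseline idea, assuming a binary target model and hypothetical query_target and query_counterfactual interfaces (the exact procedure in [1] may differ):

    # Baseline extraction sketch: each query yields a labeled point, and its
    # counterfactual yields a point with the flipped label. The surrogate is
    # then trained on both sets of points.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def baseline_extraction(query_target, query_counterfactual, queries):
        """query_target(x) -> 0/1 label, query_counterfactual(x) -> CF point."""
        X, y = [], []
        for x in queries:
            label = query_target(x)
            cf = query_counterfactual(x)      # counterfactual crosses the boundary
            X.extend([x, cf])
            y.extend([label, 1 - label])      # counterfactual gets the flipped label
        surrogate = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000)
        surrogate.fit(np.array(X), np.array(y))
        return surrogate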

The proposed attack mitigates decision boundary shift, an issue that arises in systems that provide counterfactual explanations only for queries originating from one side of the target model's decision boundary.
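One plausible way to exploit the boundary proximity of counterfactuals, shown purely as an illustration (the attack proposed in this project may formulate it differently), is to add a penalty that pulls the surrogate's decision boundary onto the counterfactual points by driving its logit toward zero there:

    # Illustrative boundary-aware surrogate training (not the project's exact method):
    # standard fidelity loss on the queried points, plus a term encouraging the
    # surrogate's output probability to be near 0.5 at the counterfactual points.
    import torch
    import torch.nn as nn

    def train_boundary_aware_surrogate(X_q, y_q, X_cf, epochs=500, lam=1.0):
        """X_q: queried points, y_q: target-model labels (float 0/1), X_cf: counterfactuals."""
        model = nn.Sequential(nn.Linear(X_q.shape[1], 32), nn.ReLU(), nn.Linear(32, 1))
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        bce = nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            opt.zero_grad()
            # Agreement with the target model on the queried points.
            loss = bce(model(X_q).squeeze(1), y_q)
            # Counterfactuals lie close to the boundary, so push the surrogate's
            # logit toward zero (probability toward 0.5) at those points.
            loss = loss + lam * model(X_cf).squeeze(1).pow(2).mean()
            loss.backward()
            opt.step()
        return model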

Figure: The decision boundary shift issue.

Consequently, the proposed model extraction attack surpassed the baseline attack of [1] across a wide range of experiments.

Figure: 2D demonstration of the proposed attack. Notice how the decision boundary of the surrogate model obtained by the baseline attack has shifted, whereas the proposed attack mitigates this shift.
Figure: Performance (measured as fidelity) of the proposed and baseline attacks on real-world data.
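Fidelity here refers to the standard extraction metric: the fraction of evaluation points on which the surrogate's prediction agrees with the target model's prediction. A minimal sketch, assuming hypothetical target_predict and surrogate_predict functions:

    # Fidelity: agreement rate between the surrogate and the target model
    # on an evaluation set X_eval.
    import numpy as np

    def fidelity(target_predict, surrogate_predict, X_eval):
        return float(np.mean(target_predict(X_eval) == surrogate_predict(X_eval)))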

This project was supported by Northrop Grumman Corporation.

GitHub: Link will appear here soon.

[1] Ulrich Aïvodji, Alexandre Bolot, and Sébastien Gambs, “Model extraction from counterfactual explanations,” arXiv preprint arXiv:2009.01884, 2020.

References

1. Pasan Dissanayake and Sanghamitra Dutta, "The Role of Counterfactual Explanations in Model Extraction Attacks," 2023. Under review.