Counterfactual Explanations and Model Reconstruction

This project explores how the additional information provided by counterfactual explanations can be exploited to reconstruct a surrogate model. The work has since been extended in multiple directions, including private retrieval of counterfactual explanations and data-efficient LLM distillation using counterfactual explanations.

Counterfactual explanations and model extraction attacks

Problem setting

Counterfactual explanations provide guidance on achieving a favorable outcome from a model with a minimal perturbation of the input. However, they can also be exploited to leak information about the underlying model, raising privacy concerns. Prior work has shown that one can query for counterfactual explanations with several input instances and train a surrogate model on the queries together with their counterfactual explanations. In this project, we investigate how such model extraction attacks can be improved by further leveraging the fact that counterfactual explanations lie close to the decision boundary.
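The baseline attack of [1] can be summarized in a short sketch. The snippet below is illustrative rather than a faithful reproduction: `query_target` is a hypothetical stand-in for the deployed system's API, assumed to return the target model's label and one counterfactual per query. Both the query and its counterfactual (labeled with the opposite class) are used to train the surrogate.

```python
# Hedged sketch of a baseline extraction attack using counterfactuals:
# every query x yields the target's label y and a counterfactual x_cf,
# and both pairs are used as training data for the surrogate model.
import numpy as np
from sklearn.neural_network import MLPClassifier

def baseline_extraction(query_target, queries):
    """queries: (n, d) array of attacker-chosen inputs.
    query_target(x) -> (label in {0, 1}, counterfactual): assumed API."""
    X, y = [], []
    for x in queries:
        label, x_cf = query_target(x)
        X.append(x)
        y.append(label)          # query labeled by the target model
        X.append(x_cf)
        y.append(1 - label)      # counterfactual lies in the opposite class
    surrogate = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000)
    surrogate.fit(np.array(X), np.array(y))
    return surrogate
```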

The proposed attack mitigates the decision boundary shift issue, which arises in systems that provide counterfactual explanations only for queries originating from one side of the target model's decision boundary.

Decision boundary shift issue
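One way to exploit the boundary proximity of counterfactuals, shown here as an illustrative sketch rather than the exact loss used in the project, is to stop treating counterfactuals as fully confident opposite-class samples and instead penalize the surrogate only when a counterfactual falls on the wrong side of its boundary:

```python
# Hedged illustration: ordinary BCE on the queries, plus a one-sided penalty
# that fires only if a counterfactual is misclassified by the surrogate.
# Once a counterfactual is barely on its correct side, the penalty vanishes,
# so the surrogate's boundary is not pushed far past the counterfactuals.
import torch
import torch.nn as nn

surrogate = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

def attack_loss(x_query, y_query, x_cf):
    """x_query, x_cf: (n, 2) tensors; y_query: (n, 1) target labels in {0, 1}."""
    bce = nn.BCEWithLogitsLoss()
    loss_query = bce(surrogate(x_query), y_query)      # match target labels on queries
    p_cf = torch.sigmoid(surrogate(x_cf))               # surrogate prob. of class 1 at CFs
    p_cf_class = torch.where(y_query > 0.5, 1.0 - p_cf, p_cf)  # prob. of the CF's flipped class
    loss_cf = torch.relu(0.5 - p_cf_class).mean()        # penalize only misclassified CFs
    return loss_query + loss_cf
```

Minimizing this loss over batches of queries and their counterfactuals trains the surrogate while keeping its boundary anchored near the counterfactuals, instead of shifting it toward the query side.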

As a result, the proposed model extraction attack outperforms the baseline attack of [1] across a wide range of experiments.

2D demonstration of the proposed attack. Notice how the decision boundary of the surrogate model obtained by the baseline attack has shifted, but the proposed attack has mitigated this shift.
Performance (measured as fidelity) of the proposed and baseline attacks on real-world datasets.

This project was supported by Northrop Grumman Corporation.

GitHub: https://github.com/pasandissanayake/model-reconstruction-using-counterfactuals

Counterfactual explanations for data-efficient LLM distillation

Analogous to counterfactual explanations on a tabular data manifold, a counterfactual can be defined for natural language classification settings such as sentiment analysis: the minimal, natural-looking perturbation of a sentence that causes a classifier (e.g., an LLM) to assign the perturbed sentence to a different class than the original.
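As a concrete, hand-constructed illustration (the sentences and the one-word edit below are made up for this example, not output of the method), a minimal edit can flip a sentiment classifier's decision; here the off-the-shelf Hugging Face `sentiment-analysis` pipeline stands in for the black-box classifier:

```python
# Illustrative text counterfactual for sentiment classification. In practice
# the counterfactual is searched for so that the edit is minimal and the
# perturbed sentence remains fluent; here the edit is hand-picked.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

original = "The movie was absolutely wonderful."
counterfactual = "The movie was absolutely dreadful."  # one-word edit

print(classifier(original))        # e.g. [{'label': 'POSITIVE', ...}]
print(classifier(counterfactual))  # e.g. [{'label': 'NEGATIVE', ...}]
```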

[1] Ulrich Aïvodji, Alexandre Bolot, and Sébastien Gambs, “Model extraction from counterfactual explanations,” arXiv preprint arXiv:2009.01884, 2020.

References

2025

  1. Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations
    F. Hamman, P. Dissanayake, Y. Fu, and 1 more author
    In NeurIPS, 2025
  2. Private Counterfactual Retrieval With Immutable Features
    Shreya Meel, Pasan Dissanayake, Mohamed Nomeir, and 2 more authors
    In ISIT, 2025
  3. Counterfactual Explanations for Model Ensembles Using Entropic Risk Measures
    Erfaun Noorani, Pasan Dissanayake, Faisal Hamman, and 1 more author
    In AAMAS, 2025
  4. Private Counterfactual Retrieval
    Mohamed Nomeir, Pasan Dissanayake, Shreya Meel, and 2 more authors
    In Allerton, 2025

2024

  1. Model Reconstruction Using Counterfactual Explanations: A Perspective From Polytope Theory
    Pasan Dissanayake, and Sanghamitra Dutta
    In NeurIPS, 2024