Counterfactual Explanations and Model Extraction Attacks
Explored how the additional information provided by counterfactual explanations can be exploited by an adversary to improve model extraction attacks.

Counterfactual explanations provide guidance on achieving a favorable outcome from a model with minimal input perturbation. However, they can also be exploited to leak information about the underlying model, raising privacy concerns. Prior work has shown that an adversary can query a model for counterfactual explanations on several input instances and then train a surrogate model on the queries together with their counterfactual explanations. In this project, we investigate how model extraction attacks can be improved by further leveraging the fact that counterfactual explanations lie close to the decision boundary.
The proposed attack mitigates decision boundary shift, a problem that arises when a system provides counterfactual explanations only for queries originating from one side of the target model's decision boundary.
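The core idea can be sketched in a few lines: an attacker who only holds unfavorably classified queries labels each query 0 and its returned counterfactual 1, obtaining training points on both sides of the boundary. This is a minimal illustration under simplifying assumptions, not the paper's algorithm: the target is a hypothetical linear classifier, the `counterfactual` helper merely simulates the explanation API by projecting onto the known hyperplane, and the surrogate is a plain perceptron.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target model: linear rule w.x + b > 0 -> favorable (label 1).
w_true, b_true = np.array([1.0, -2.0]), 0.5
target = lambda X: (X @ w_true + b_true > 0).astype(int)

def counterfactual(x, eps=0.01):
    """Simulated explanation API: minimal-perturbation counterfactual for a
    rejected point, obtained by projecting onto the target's hyperplane and
    stepping just across it. A real attacker would receive this from the
    system without knowing w_true or b_true."""
    margin = x @ w_true + b_true          # negative for rejected points
    return x - (margin - eps) * w_true / (w_true @ w_true)

# Attacker's queries: only rejected instances, i.e. one side of the boundary.
X = rng.normal(size=(200, 2))
X_q = X[target(X) == 0]
X_cf = np.array([counterfactual(x) for x in X_q])

# Surrogate training set: queries labeled 0, counterfactuals labeled 1,
# so the surrogate sees examples hugging both sides of the true boundary.
X_tr = np.vstack([X_q, X_cf])
y_tr = np.concatenate([np.zeros(len(X_q)), np.ones(len(X_cf))])

# Simple perceptron surrogate (a stand-in for the attacker's model class).
w, b = np.zeros(2), 0.0
for _ in range(100):
    for x, y in zip(X_tr, y_tr):
        pred = 1.0 if x @ w + b > 0 else 0.0
        w += (y - pred) * x
        b += (y - pred)

# Fidelity: agreement between surrogate and target on fresh inputs.
X_test = rng.normal(size=(2000, 2))
agree = np.mean((X_test @ w + b > 0).astype(int) == target(X_test))
print(f"fidelity: {agree:.2f}")
```

Because the counterfactuals sit just across the true boundary, the separator learned from this two-sided set tracks the target's boundary far more tightly than one trained on one-sided queries alone, which is the intuition behind mitigating decision boundary shift.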

Consequently, the proposed model extraction attack surpassed the performance of the baseline attack of [1] across a wide array of experiments.


This project was supported by Northrop Grumman Corporation.
Github: https://github.com/pasandissanayake/model-reconstruction-using-counterfactuals
[1] Ulrich Aïvodji, Alexandre Bolot, and Sébastien Gambs, “Model extraction from counterfactual explanations,” arXiv preprint arXiv:2009.01884, 2020.
References
- Model Reconstruction Using Counterfactual Explanations: A Perspective From Polytope Theory. In NeurIPS, 2024.