Counterfactual Explanations and Model Reconstruction

This project explores how the additional information provided by counterfactual explanations can be exploited to reconstruct a surrogate model. The work has since been extended in multiple directions, including private retrieval of counterfactual explanations and data-efficient LLM distillation using counterfactual explanations.

Counterfactual explanations and model extraction attacks

Problem setting

Counterfactual explanations provide guidance on achieving a favorable outcome from a model with a minimal perturbation of the input. However, they can also be exploited to leak information about the underlying model, raising privacy concerns. Prior work has shown that one can query for counterfactual explanations with several input instances and train a surrogate model on the queries together with their counterfactual explanations. In this project, we investigate how such model extraction attacks can be improved by further leveraging the fact that counterfactual explanations lie close to the decision boundary.
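The baseline attack of [1] can be summarized in a short sketch. The snippet below is illustrative rather than a faithful reproduction: `query_target` is a hypothetical stand-in for the deployed system's API, assumed to return the target model's label and one counterfactual per query. Both the query and its counterfactual (labeled with the opposite class) are used to train the surrogate.

```python
# Hedged sketch of a baseline extraction attack using counterfactuals:
# every query x yields the target's label y and a counterfactual x_cf,
# and both pairs are used as training data for the surrogate model.
import numpy as np
from sklearn.neural_network import MLPClassifier

def baseline_extraction(query_target, queries):
    """queries: (n, d) array of attacker-chosen inputs.
    query_target(x) -> (label in {0, 1}, counterfactual): assumed API."""
    X, y = [], []
    for x in queries:
        label, x_cf = query_target(x)
        X.append(x)
        y.append(label)          # query labeled by the target model
        X.append(x_cf)
        y.append(1 - label)      # counterfactual lies in the opposite class
    surrogate = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000)
    surrogate.fit(np.array(X), np.array(y))
    return surrogate
```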

The proposed attack mitigates the decision boundary shift issue, which arises in systems that provide counterfactual explanations only for queries originating from one side of the target model's decision boundary.

Decision boundary shift issue
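One way to exploit the boundary proximity of counterfactuals, shown here as an illustrative sketch rather than the exact loss used in the project, is to stop treating counterfactuals as fully confident opposite-class samples and instead penalize the surrogate only when a counterfactual falls on the wrong side of its boundary:

```python
# Hedged illustration: ordinary BCE on the queries, plus a one-sided penalty
# that fires only if a counterfactual is misclassified by the surrogate.
# Once a counterfactual is barely on its correct side, the penalty vanishes,
# so the surrogate's boundary is not pushed far past the counterfactuals.
import torch
import torch.nn as nn

surrogate = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

def attack_loss(x_query, y_query, x_cf):
    """x_query, x_cf: (n, 2) tensors; y_query: (n, 1) target labels in {0, 1}."""
    bce = nn.BCEWithLogitsLoss()
    loss_query = bce(surrogate(x_query), y_query)      # match target labels on queries
    p_cf = torch.sigmoid(surrogate(x_cf))               # surrogate prob. of class 1 at CFs
    p_cf_class = torch.where(y_query > 0.5, 1.0 - p_cf, p_cf)  # prob. of the CF's flipped class
    loss_cf = torch.relu(0.5 - p_cf_class).mean()        # penalize only misclassified CFs
    return loss_query + loss_cf
```

Minimizing this loss over batches of queries and their counterfactuals trains the surrogate while keeping its boundary anchored near the counterfactuals, instead of shifting it toward the query side.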

As a result, the proposed model extraction attack outperforms the baseline attack of [1] across a wide range of experiments.

2D demonstration of the proposed attack. Notice how the decision boundary of the surrogate model obtained by the baseline attack has shifted, but the proposed attack has mitigated this shift.
Performance (measured as fidelity) of the proposed and baseline attacks on real-world datasets.

This project was supported by Northrop Grumman Corporation.

GitHub: https://github.com/pasandissanayake/model-reconstruction-using-counterfactuals

Counterfactual explanations for data-efficient LLM distillation

Analogous to counterfactual explanations on a tabular data manifold, a counterfactual can be defined for natural language classification settings such as sentiment analysis: the minimal, natural-looking perturbation of a sentence that causes a classifier (e.g., an LLM) to assign the perturbed sentence to a different class than the original.
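As a concrete, hand-constructed illustration (the sentences and the one-word edit below are made up for this example, not output of the method), a minimal edit can flip a sentiment classifier's decision; here the off-the-shelf Hugging Face `sentiment-analysis` pipeline stands in for the black-box classifier:

```python
# Illustrative text counterfactual for sentiment classification. In practice
# the counterfactual is searched for so that the edit is minimal and the
# perturbed sentence remains fluent; here the edit is hand-picked.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

original = "The movie was absolutely wonderful."
counterfactual = "The movie was absolutely dreadful."  # one-word edit

print(classifier(original))        # e.g. [{'label': 'POSITIVE', ...}]
print(classifier(counterfactual))  # e.g. [{'label': 'NEGATIVE', ...}]
```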

[1] Ulrich Aïvodji, Alexandre Bolot, and Sébastien Gambs, “Model extraction from counterfactual explanations,” arXiv preprint arXiv:2009.01884, 2020.

References

2025

  1. Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations
    F. Hamman, P. Dissanayake, Y. Fu, and 1 more author
    In NeurIPS, 2025
  2. Private Counterfactual Retrieval With Immutable Features
    Shreya Meel, Pasan Dissanayake, Mohamed Nomeir, and 2 more authors
    In ISIT, 2025
  3. Counterfactual Explanations for Model Ensembles Using Entropic Risk Measures
    Erfaun Noorani, Pasan Dissanayake, Faisal Hamman, and 1 more author
    In AAMAS, 2025
  4. Private Counterfactual Retrieval
    Mohamed Nomeir, Pasan Dissanayake, Shreya Meel, and 2 more authors
    In Allerton, 2025

2024

  1. Model Reconstruction Using Counterfactual Explanations: A Perspective From Polytope Theory
    Pasan Dissanayake, and Sanghamitra Dutta
    In NeurIPS, 2024