Quantifying Knowledge Distillation Using Partial Information Decomposition
We propose information-theoretic metrics based on Partial Information Decomposition (PID) to quantify and explain knowledge transfer in distillation. These metrics motivate the Redundant Information Distillation (RID) framework, which filters out task-irrelevant information and improves distillation under nuisance teachers.
Knowledge distillation compresses complex machine learning models by training a smaller student model to emulate the representations of a larger teacher model. However, the teacher’s representations may also encode nuisance information irrelevant to the downstream task. Distilling such irrelevant information can impede the performance of a capacity-limited student. This raises a fundamental question: What are the information-theoretic limits of knowledge distillation?
We leverage Partial Information Decomposition (PID) to formally define and quantify:
- Knowledge to distill (unique information): the task-relevant information in the teacher that is not yet captured by the student.
- Transferred knowledge (redundant information): the task-relevant information that is common between the teacher and the student.
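For orientation, the standard PID identities can be written as follows. The notation is an assumption made here for illustration and may differ from the paper's: \(Y\) is the task label, \(T\) the teacher representation, \(S\) the student representation, and \(\mathrm{Red}\), \(\mathrm{Uni}\), \(\mathrm{Syn}\) denote the redundant, unique, and synergistic terms.

\[
I(Y; T, S) = \mathrm{Red}(Y : T, S) + \mathrm{Uni}(Y : T \setminus S) + \mathrm{Uni}(Y : S \setminus T) + \mathrm{Syn}(Y : T, S)
\]

\[
I(Y; T) = \mathrm{Red}(Y : T, S) + \mathrm{Uni}(Y : T \setminus S), \qquad I(Y; S) = \mathrm{Red}(Y : T, S) + \mathrm{Uni}(Y : S \setminus T)
\]

Under this reading, the knowledge to distill is \(\mathrm{Uni}(Y : T \setminus S)\) and the transferred knowledge is \(\mathrm{Red}(Y : T, S)\). Together they account for all of the teacher's task-relevant information \(I(Y; T)\).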
We show theoretically that the task-relevant transferred knowledge is succinctly captured by the redundant information about the task shared between the teacher and the student. We also show through examples that existing frameworks based on maximizing the mutual information \(I(T;S)\) between teacher and student representations have a fundamental limitation: they push the student to mimic the teacher blindly, regardless of task relevance.
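The limitation of \(I(T;S)\) can be illustrated with a small toy calculation. This is a minimal sketch constructed here, not an example taken from the paper. It assumes a binary task label \(Y\), an independent binary nuisance variable \(N\), and a teacher that encodes both, \(T = (Y, N)\). The helper `mutual_information` and all variable names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(pairs):
    """Return I(A;B) in bits, treating each (a, b) in `pairs` as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((marg_a[a] / n) * (marg_b[b] / n)))
        for (a, b), c in joint.items()
    )


# Assumed toy world: Y (task label) and N (nuisance) are independent fair bits.
outcomes = list(product([0, 1], repeat=2))  # each outcome is (y, n)

labels = [y for y, _ in outcomes]
teacher = [(y, n) for y, n in outcomes]  # T = (Y, N): task signal plus nuisance

# Two hypothetical one-bit students.
students = {
    "copies nuisance (S = N)": [n for _, n in outcomes],
    "copies task (S = Y)": [y for y, _ in outcomes],
}

results = {}
for name, student in students.items():
    i_ts = mutual_information(list(zip(teacher, student)))  # I(T;S)
    i_ys = mutual_information(list(zip(labels, student)))   # I(Y;S)
    results[name] = (i_ts, i_ys)
    print(f"{name}: I(T;S) = {i_ts:.1f} bits, I(Y;S) = {i_ys:.1f} bits")
```

Both students reach the same \(I(T;S)\) of 1 bit, yet only the second carries any information about \(Y\). An objective that looks only at \(I(T;S)\) cannot tell them apart, which is the kind of gap a task-aware measure is meant to close.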
Based on these insights, we propose Redundant Information Distillation (RID), a novel multi-level optimization framework that maximizes redundant information as a regularizer. RID captures the task-relevant knowledge and filters out the task-irrelevant information from the teacher. Unlike prior methods, RID is resilient to nuisance teachers (untrained or otherwise uninformative teachers), for which conventional distillation methods degrade student performance.
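Read schematically, using redundant information as a regularizer could take the form below. This is only a sketch of the general shape, not the paper's actual objective: the symbols \(\theta\) (student parameters), \(\mathcal{L}_{\text{task}}\) (the student's ordinary task loss), and \(\lambda \ge 0\) (regularization weight) are assumptions introduced here, and neither the multi-level structure nor the way the redundancy term is estimated is shown.

\[
\min_{\theta} \; \mathcal{L}_{\text{task}}(\theta) \; - \; \lambda \, \mathrm{Red}(Y : T, S_{\theta})
\]

With an uninformative teacher, \(\mathrm{Red}(Y : T, S_{\theta}) \le I(Y; T)\) is close to zero for every student, so a term of this kind has little to pull the student toward and training is driven mainly by the task loss.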
Our experiments on CIFAR-10, CIFAR-100, and a transfer learning setup (ImageNet → CUB-200-2011) confirm that RID outperforms existing approaches, particularly in scenarios where the teacher encodes substantial nuisance information.