Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

Department of Computer Science, Boston University

The code repository for "Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models" in PyTorch.

News

[06/2025] 🌟 arXiv paper has been released.

[06/2025] 🌟 The code repository of the case study has been released. [Project Page]

Abstract

Machine unlearning removes certain training data points and their influence on AI models (e.g. when a data owner revokes their decision to allow models to learn from the data). In this position paper, we propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may be from regulators, enterprise users, product teams, etc., having no access to FMs' massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs (should not) possess. Cognitively, knowledge-tracing unlearning aligns with how the human brain forgets more closely than tracing individual training data points. Finally, we provide a concrete case study about a vision-language FM to illustrate how an unlearner might instantiate the knowledge-tracing machine unlearning paradigm.

Case Study

Following this work's position, we provide a concrete case study on Contrastive Language-Image Pretraining (CLIP) to bridge the position with real-world applications and, in return, to explore the position in depth across multiple factors and perspectives. We envision that Oudi Inc., a car manufacturer and an enterprise user of the CLIP model, has retired its O1 sedan for some reason. Accordingly, Oudi's product team requests that the Oudi O1 concept be unlearned from CLIP. An unlearner is equipped with existing machine unlearning (MU) methods developed in the research community but realizes that they all operate on training data points. The unlearner cannot access CLIP's training data; instead, they assemble a set of exemplar Oudi O1 images as a proxy forgetting set (and, for convenience, no retention set).

Results

The following table shows the main results of our proposed method and other SOTA methods. Please note that there might be slight variations in results based on the type and quantity of NVIDIA GPUs.

Requirements

Dependencies

For more information about the environment requirements, please see ./requirements.txt.

Datasets

We provide the original datasets as follows:

ImageNet-1k: Reference ImageNet-1k
CompCars: Reference CompCars

Our training sets are subsets of the original datasets. Please modify the dataset paths in ./data/.../train.jsonl and ./data/.../test.jsonl to match your local setup.
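If the paths stored in the JSONL files share a common root, a small helper can rewrite them in bulk instead of editing each line by hand. The sketch below is only an illustration under assumptions: it assumes each line is a JSON object and that image paths are stored as string values beginning with a common prefix. The function name `rewrite_path_prefix` and the example prefixes are hypothetical and are not part of this repository; check the actual schema of train.jsonl and test.jsonl before using it.

```python
import json
from pathlib import Path


def rewrite_path_prefix(jsonl_file, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in string values that start with it.

    Assumes one JSON object per line (an assumption about the file schema).
    Rewrites the file in place and returns the number of values changed.
    """
    jsonl_file = Path(jsonl_file)
    changed = 0
    out_lines = []
    for line in jsonl_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if isinstance(record, dict):
            for key, value in record.items():
                if isinstance(value, str) and value.startswith(old_prefix):
                    record[key] = new_prefix + value[len(old_prefix):]
                    changed += 1
        out_lines.append(json.dumps(record, ensure_ascii=False))
    jsonl_file.write_text("\n".join(out_lines) + "\n", encoding="utf-8")
    return changed
```

For example, `rewrite_path_prefix("train.jsonl", "/old/root", "/my/datasets")` would update every matching path in that file. Keep a backup of the original file, since the helper overwrites it.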

For general capability evaluation, we include the following datasets: CIFAR-100, Stanford Cars, Food-101, Flowers-102, Caltech-101, and Oxford-IIIT Pet. Please download these datasets before running the evaluation.

Training Scripts

You can specify your own unlearning targets, datasets, and training parameters to train a customized unlearning model using train.py.

In our setup, we provide ready-to-use scripts under the scripts/ directory for all compared methods. These scripts define the unlearning targets and hyperparameters used throughout the paper.

python Unlearning/train.py 

for ImgnetDogs:
bash ./scripts/train/Subsetdog_train.sh

for CompCars-S:
bash ./scripts/train/Compcars_train.sh

Evaluation Scripts

Before running the evaluation scripts, specify the saved checkpoint for each MU method.
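A quick check that every checkpoint path exists can catch a missing file before a long evaluation run starts. The following is a minimal sketch, not part of this repository: the function name `find_missing_checkpoints` is hypothetical, and the mapping from method name to path is something you would fill in yourself.

```python
from pathlib import Path


def find_missing_checkpoints(checkpoints):
    """Return the sorted method names whose checkpoint file does not exist.

    `checkpoints` maps a method name to the path of its saved checkpoint.
    """
    return sorted(
        name for name, path in checkpoints.items()
        if not Path(path).is_file()
    )
```

For instance, calling it with `{"NPO": "path/to/npo_checkpoint", "SaLUN": "path/to/salun_checkpoint"}` (placeholder paths) returns the names of the methods whose files are absent, or an empty list when everything is in place.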

Evaluate the forgetting quality and model utility:

python Unlearning/evaluate_retain.py 

for ImgnetDogs:
bash ./scripts/evaluate_retain/evaluate_retain_Subsetdog.sh

for CompCars-S:
bash ./scripts/evaluate_retain/evaluate_retain_Compcars.sh
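To make the two quantities concrete, the sketch below splits top-1 accuracy into a forget part and a retained part. It rests on an assumption: forgetting quality is read as accuracy on the classes to be forgotten (lower is better after unlearning), and model utility as accuracy on all other classes. The scripts in this repository may define or compute these metrics differently, and the function name `forget_and_retain_accuracy` is hypothetical.

```python
def forget_and_retain_accuracy(labels, predictions, forget_classes):
    """Split top-1 accuracy into forget-class and retained-class parts.

    Returns (forget_acc, retain_acc). A part with no samples is None.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    forget_classes = set(forget_classes)
    hits = {True: 0, False: 0}    # correct predictions, keyed by "is forget class"
    totals = {True: 0, False: 0}  # sample counts, keyed the same way
    for y, p in zip(labels, predictions):
        is_forget = y in forget_classes
        totals[is_forget] += 1
        hits[is_forget] += int(y == p)

    def ratio(key):
        return hits[key] / totals[key] if totals[key] else None

    return ratio(True), ratio(False)
```

With this reading, a successful unlearning run shows a low first value and a second value close to that of the original model.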

Evaluate model generalizability:

python Unlearning/evaluate.py 

for ImgnetDogs:
bash ./scripts/evaluate_retain/evaluate_Subsetdog.sh

for CompCars-S:
bash ./scripts/evaluate_retain/evaluate_Compcars.sh

Citation

If you find this useful in your research, please consider citing:

@misc{tan2025liftingdatatracingmachineunlearning,
      title={Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models}, 
      author={Yuwen Tan and Boqing Gong},
      year={2025},
      eprint={2506.11253},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.11253}, 
}

Acknowledgment

This repo is based on CLIP, NPO, and SaLUN.

Thanks for their wonderful work!!!

Correspondence

If you have any questions about this project, please contact yuwentan@bu.edu.
