ImageNet-RIB Benchmark:
Large Pre-Training Datasets Don't Guarantee Robustness after Fine-Tuning

Massachusetts Institute of Technology
NeurIPS 2024
Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability (FITML)

Illustration of the ImageNet-RIB benchmark.

Abstract

Highly performant large-scale pre-trained models promise to provide a valuable foundation for learning specialized tasks by fine-tuning the model on the desired task. By starting from a good general-purpose model, the goal is to achieve both specialization on the target task and retained robustness. To assess the robustness of models to out-of-distribution samples after fine-tuning on downstream datasets, we introduce a new robust fine-tuning benchmark, ImageNet-RIB (Robustness Inheritance Benchmark). The benchmark consists of a set of related but distinct specialized (downstream) tasks; pre-trained models are fine-tuned on one task in the set, and their robustness is assessed on the rest, iterating across all tasks for fine-tuning and assessment. We find that continual learning methods such as EWC and LwF help maintain robustness after fine-tuning, although fine-tuning generally reduces generalization to related downstream tasks across models. Not surprisingly, models pre-trained on large and rich datasets exhibit higher initial robustness across datasets and suffer more pronounced degradation during fine-tuning. The distance between the pre-training and downstream datasets, measured by optimal transport, predicts this performance degradation on the pre-training dataset. Counterintuitively, however, model robustness after fine-tuning on related downstream tasks is worst when the pre-training dataset is the richest and most diverse. This suggests that starting with the strongest foundation model is not necessarily the best approach for performance on specialist tasks. The benchmark thus offers key insights for developing more resilient fine-tuning strategies and building robust machine learning models.
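The abstract's optimal-transport distance between datasets is not spelled out in code here. As a rough illustration only, the sketch below estimates an entropy-regularized (Sinkhorn) OT cost between two datasets represented by pre-extracted feature vectors; the feature representation, sampling, squared-Euclidean ground cost, and regularization strength are all assumptions of ours, not the paper's exact formulation.

    import numpy as np

    def sinkhorn_ot_cost(X, Y, reg=0.05, n_iter=500):
        """Entropy-regularized optimal transport cost between two point clouds.

        X : (m, d) feature vectors sampled from the pre-training dataset.
        Y : (k, d) feature vectors sampled from the downstream dataset.
        Uses uniform marginals and a squared Euclidean ground cost.
        """
        m, k = X.shape[0], Y.shape[0]
        # Pairwise squared Euclidean costs, rescaled so the Gibbs
        # kernel exp(-C/reg) stays numerically well-behaved.
        C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        C = C / C.max()
        K = np.exp(-C / reg)
        a, b = np.full(m, 1.0 / m), np.full(k, 1.0 / k)
        u, v = np.ones(m), np.ones(k)
        for _ in range(n_iter):          # Sinkhorn fixed-point iterations
            u = a / (K @ v)
            v = b / (K.T @ u)
        P = u[:, None] * K * v[None, :]  # approximate transport plan
        return (P * C).sum()             # transport cost under plan P

Because the cost matrix is rescaled, the returned value is best read as a relative distance for comparing different pre-training/downstream dataset pairs rather than an absolute quantity.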

Evaluation Metrics

The robustness improvement (RI) on the $i$-th downstream dataset is defined as the average accuracy difference between the fine-tuned and pre-trained models over the other downstream datasets:
$$\mathrm{RI}_i = \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} \left( A^{(j)}_i - A^{(j)}_\text{pre} \right),$$
where $A^{(j)}_i$ denotes the accuracy on the $j$-th dataset of the model fine-tuned on the $i$-th dataset, and $A^{(j)}_\text{pre}$ denotes the pre-trained model's accuracy on the $j$-th dataset.
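For concreteness, here is a minimal sketch of how $\mathrm{RI}_i$, and its mean over all downstream datasets (the mRI reported below), could be computed from an accuracy matrix. The array layout and function name are our own illustration, not code from the benchmark.

    import numpy as np

    def robustness_improvement(acc, acc_pre):
        """Compute RI_i for each downstream dataset i.

        acc     : (n, n) array; acc[i, j] is the accuracy on dataset j
                  of the model fine-tuned on dataset i.
        acc_pre : (n,) array; acc_pre[j] is the pre-trained model's
                  accuracy on dataset j.
        Returns : (n,) array of RI_i values.
        """
        n = acc.shape[0]
        diff = acc - acc_pre[None, :]      # A_i^(j) - A_pre^(j)
        np.fill_diagonal(diff, 0.0)        # exclude the fine-tuning dataset j == i
        return diff.sum(axis=1) / (n - 1)  # average over the other n-1 datasets

    # mean Robustness Improvement (mRI): average RI_i over all n datasets
    # mri = robustness_improvement(acc, acc_pre).mean()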

Experiments on ImageNet-RIB

Mean robustness improvement (mRI) of each method with different architectures and pre-training datasets.

Fine-tuning on downstream datasets causes large performance drops for models pre-trained on OpenAI or LAION-2B data.

Models pre-trained on IN-21K with AugReg overfit to the downstream dataset, and OpenAI pre-trained models learn the slowest. However, only the OpenAI and LAION-2B pre-trained models suffer severe robustness degradation.
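Among the fine-tuning methods compared above, the continual-learning regularizers EWC and LwF best preserve robustness. As a minimal sketch, EWC augments the task loss with a quadratic penalty anchoring weights to their pre-trained values; the diagonal Fisher estimates and penalty strength below are assumed inputs, and the benchmark's exact implementation may differ.

    import torch

    def ewc_penalty(model, pre_params, fisher, lam=1.0):
        """Elastic Weight Consolidation regularizer.

        pre_params : dict of parameter tensors from the pre-trained model.
        fisher     : dict of diagonal Fisher information estimates,
                     one tensor per parameter (same shapes as pre_params).
        lam        : strength of the penalty added to the task loss.
        """
        penalty = 0.0
        for name, p in model.named_parameters():
            # Penalize movement away from the pre-trained weights,
            # weighted by each weight's estimated importance.
            penalty = penalty + (fisher[name] * (p - pre_params[name]) ** 2).sum()
        return lam / 2.0 * penalty

    # total_loss = task_loss + ewc_penalty(model, pre_params, fisher)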

BibTeX

@inproceedings{hwang2024imagenet,
  title={ImageNet-RIB Benchmark: Large Pre-Training Datasets Don't Guarantee Robustness after Fine-Tuning},
  author={Hwang, Jaedong and Cheung, Brian and Hong, Zhang-Wei and Boopathy, Akhilan and Agrawal, Pulkit and Fiete, Ila R},
  booktitle={NeurIPSW on Fine-Tuning in Modern Machine Learning: Principles and Scalability},
  year={2024}
}