Carnegie Mellon University

Cross-modal few-shot adaptation with CLIP

Teaser: a comic strip in which a character decides between easily confused photos. First panel: "Cat vs. Dog?" then "Hat vs. Beanie?"; next panel: "Pond vs. Grass?" then "Frog vs. Goat?", each with corresponding sounds.

We propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. One can indeed build a better visual dog classifier by reading about dogs and listening to them bark, because recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space.

Abstract

The ability to quickly learn a new task with minimal instruction -- known as few-shot learning -- is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
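To make the idea concrete, below is a minimal sketch of cross-modal few-shot adaptation with a linear classifier, not the authors' released implementation. It assumes OpenAI's `clip` package is installed and that `few_shot_images` (preprocessed image tensor), `few_shot_labels` (integer labels), and `class_names` (one string per class) are provided by the user; the prompt template "a photo of a {class}" is likewise an illustrative choice. Class-name text embeddings are repurposed as extra one-shot training samples alongside the image embeddings.

```python
# A minimal sketch of cross-modal few-shot adaptation (not the authors' released code).
# Assumptions: OpenAI's `clip` package is installed; `few_shot_images` is a preprocessed
# image tensor [N, 3, 224, 224], `few_shot_labels` is an integer tensor [N], and
# `class_names` lists one string per class.

import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_images(images):
    # CLIP image encoder -> L2-normalized features in the shared embedding space.
    with torch.no_grad():
        feats = model.encode_image(images.to(device)).float()
    return F.normalize(feats, dim=-1)

def embed_class_names(class_names):
    # Repurpose each class name as an additional one-shot "training sample":
    # encode a simple prompt with CLIP's text encoder into the same space.
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens).float()
    return F.normalize(feats, dim=-1)

def fit_cross_modal_linear_classifier(image_feats, image_labels, text_feats,
                                      num_classes, epochs=100, lr=1e-3):
    # Pool image and text features into one training set and fit a single
    # linear classifier over the shared CLIP embedding space.
    x = torch.cat([image_feats, text_feats], dim=0)
    y = torch.cat([image_labels.to(device),
                   torch.arange(num_classes, device=device)], dim=0)
    clf = torch.nn.Linear(x.shape[1], num_classes).to(device)
    opt = torch.optim.AdamW(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(clf(x), y)
        loss.backward()
        opt.step()
    return clf

# Usage with hypothetical few-shot data:
# img_feats = embed_images(few_shot_images)
# txt_feats = embed_class_names(class_names)
# clf = fit_cross_modal_linear_classifier(img_feats, few_shot_labels,
#                                          txt_feats, len(class_names))
# preds = clf(embed_images(test_images)).argmax(dim=-1)
```

Because both encoders map into the same space, the text embeddings act like extra labeled examples, so even this plain linear probe benefits from the additional modality.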

CVPR 2023 Presentation
