
Researchers from the University of Michigan and collaborators have introduced open ad-hoc categorization (OAK), a vision-language technique that lets AI models invent and switch visual categories on demand rather than being locked into a fixed label set, Tech Xplore reports. OAK adds a handful of context tokens to a pretrained CLIP backbone; these learned tokens act like mini "instruction sets," steering the model to reinterpret the same image differently—say, by action, location, or mood—according to the user's goal.
Because CLIP itself stays frozen, the system gains new skills without losing old ones. During inference, OAK automatically shifts its visual attention to the most relevant image region (hands for “drinking,” background for “in-store,” etc.), a behavior that emerges from the context-conditioned training rather than manual rules.
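The core idea—keep the backbone frozen, learn only small context vectors that steer how its features are read—can be illustrated with a toy numpy sketch. Everything here (the random stand-in encoder, the dimensions, the elementwise steering) is a hypothetical simplification for illustration, not OAK's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy embedding dimension

# Stand-in for a frozen CLIP image encoder: a fixed random projection.
# In OAK this would be the pretrained backbone, whose weights never change.
W_frozen = rng.normal(size=(32, D))

def encode_image(pixels, context):
    """Frozen features, re-weighted by a learned context vector.

    `context` plays the role of OAK's learned context tokens: only it
    would be trained, so the frozen backbone keeps all its old skills.
    """
    feats = pixels @ W_frozen          # frozen backbone pass
    steered = feats * context          # context steers the reading
    return steered / np.linalg.norm(steered)

image = rng.normal(size=32)

# Two hypothetical context vectors for two ad-hoc taxonomies.
ctx_action = rng.normal(size=D)   # e.g. "categorize by action"
ctx_place  = rng.normal(size=D)   # e.g. "categorize by location"

emb_a = encode_image(image, ctx_action)
emb_p = encode_image(image, ctx_place)

# Same image, same frozen weights, different embedding per context.
print(float(emb_a @ emb_p))  # < 1.0: the two readings differ
```

The point of the sketch is the parameter budget: the frozen projection never updates, so adding a new taxonomy means learning one small vector, not retraining the model.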
To discover unseen categories, OAK combines top-down semantic prompts (language-driven guesses such as “hats” based on knowing “shoes”) with bottom-up visual clustering (finding frequently co-occurring patterns like suitcases). The two processes iteratively confirm each other, allowing the model to propose and verify entirely new classes with only a few labeled examples.
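The top-down/bottom-up handshake can be sketched as follows: cluster unlabeled image embeddings (bottom-up), then name each cluster with the language-proposed candidate whose text embedding lies closest to its centroid (top-down). The data, labels, and tiny k-means below are toy stand-ins, not OAK's pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8

# Hypothetical CLIP-style text embeddings for language-proposed candidates.
candidate_text = {"hats": rng.normal(size=D), "suitcases": rng.normal(size=D)}

# Bottom-up signal: unlabeled image embeddings forming two visual clusters
# (here synthesized near the candidate text vectors for illustration).
cluster_a = candidate_text["hats"] + 0.1 * rng.normal(size=(20, D))
cluster_b = candidate_text["suitcases"] + 0.1 * rng.normal(size=(20, D))
images = np.vstack([cluster_a, cluster_b])

def kmeans(X, k, iters=20):
    centers = X[[0, -1]].copy()  # deterministic init for the toy example
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.stack([X[assign == j].mean(0) for j in range(k)])
    return centers

centers = kmeans(images, k=2)

# Top-down meets bottom-up: name each visual cluster with the candidate
# label whose text embedding is nearest to its centroid.
names = []
for c in centers:
    dists = {n: np.linalg.norm(c - t) for n, t in candidate_text.items()}
    names.append(min(dists, key=dists.get))
print(sorted(names))  # ['hats', 'suitcases']
```

In the real system the two directions iterate—confirmed clusters sharpen the language proposals, which in turn refine the clustering—whereas this sketch shows a single round.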
Benchmark tests on Stanford and CLEVR-4 datasets show OAK outperforming extended-vocabulary CLIP and Generalized Category Discovery by wide margins; it achieved 87.4% novel-class accuracy for mood recognition, more than 50 percentage points better than the strongest baseline.
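"Novel-class accuracy" in category-discovery work is typically a clustering accuracy: predicted cluster ids are arbitrary, so accuracy is computed under the best one-to-one matching between clusters and ground-truth classes. The article does not spell out OAK's evaluation, but the standard metric looks like this (brute-force matching here; real evaluations with many classes use the Hungarian algorithm):

```python
from itertools import permutations

def cluster_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one mappings of cluster ids to class ids."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for p, t in zip(y_pred, y_true))
        best = max(best, hits / len(y_true))
    return best

# Cluster ids are arbitrary: raw accuracy here would be 0, but the
# optimal matching reveals perfect agreement.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 1]
print(cluster_accuracy(y_true, y_pred))  # 1.0
```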
The authors presented the work at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2025 and argue that context-aware categorization will be crucial for robotics, scientific exploration, and any setting where an AI must flexibly interpret the same scene for different tasks in real time.