
Researchers from the University of Michigan and collaborators have introduced open ad-hoc categorization (OAK), a vision-language technique that lets AI models invent and switch visual categories on demand rather than being locked into a fixed label set, Tech Xplore reports. OAK adds a handful of context tokens to a pretrained CLIP backbone; these learned tokens act like mini "instruction sets," steering the model to reinterpret the same image differently—say, by action, location, or mood—according to the user's goal.
Because CLIP itself stays frozen, the system gains new skills without losing old ones. During inference, OAK automatically shifts its visual attention to the most relevant image region (hands for “drinking,” background for “in-store,” etc.), a behavior that emerges from the context-conditioned training rather than manual rules.
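The core idea—keep the backbone frozen, learn only small context vectors that steer how its features are read—can be illustrated with a toy numpy sketch. Everything here (the random stand-in encoder, the dimensions, the elementwise steering) is a hypothetical simplification for illustration, not OAK's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy embedding dimension

# Stand-in for a frozen CLIP image encoder: a fixed random projection.
# In OAK this would be the pretrained backbone, whose weights never change.
W_frozen = rng.normal(size=(32, D))

def encode_image(pixels, context):
    """Frozen features, re-weighted by a learned context vector.

    `context` plays the role of OAK's learned context tokens: only it
    would be trained, so the frozen backbone keeps all its old skills.
    """
    feats = pixels @ W_frozen          # frozen backbone pass
    steered = feats * context          # context steers the reading
    return steered / np.linalg.norm(steered)

image = rng.normal(size=32)

# Two hypothetical context vectors for two ad-hoc taxonomies.
ctx_action = rng.normal(size=D)   # e.g. "categorize by action"
ctx_place  = rng.normal(size=D)   # e.g. "categorize by location"

emb_a = encode_image(image, ctx_action)
emb_p = encode_image(image, ctx_place)

# Same image, same frozen weights, different embedding per context.
print(float(emb_a @ emb_p))  # < 1.0: the two readings differ
```

The point of the sketch is the parameter budget: the frozen projection never updates, so adding a new taxonomy means learning one small vector, not retraining the model.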
To discover unseen categories, OAK combines top-down semantic prompts (language-driven guesses such as “hats” based on knowing “shoes”) with bottom-up visual clustering (finding frequently co-occurring patterns like suitcases). The two processes iteratively confirm each other, allowing the model to propose and verify entirely new classes with only a few labeled examples.
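The top-down/bottom-up handshake can be sketched as follows: cluster unlabeled image embeddings (bottom-up), then name each cluster with the language-proposed candidate whose text embedding lies closest to its centroid (top-down). The data, labels, and tiny k-means below are toy stand-ins, not OAK's pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8

# Hypothetical CLIP-style text embeddings for language-proposed candidates.
candidate_text = {"hats": rng.normal(size=D), "suitcases": rng.normal(size=D)}

# Bottom-up signal: unlabeled image embeddings forming two visual clusters
# (here synthesized near the candidate text vectors for illustration).
cluster_a = candidate_text["hats"] + 0.1 * rng.normal(size=(20, D))
cluster_b = candidate_text["suitcases"] + 0.1 * rng.normal(size=(20, D))
images = np.vstack([cluster_a, cluster_b])

def kmeans(X, k, iters=20):
    centers = X[[0, -1]].copy()  # deterministic init for the toy example
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.stack([X[assign == j].mean(0) for j in range(k)])
    return centers

centers = kmeans(images, k=2)

# Top-down meets bottom-up: name each visual cluster with the candidate
# label whose text embedding is nearest to its centroid.
names = []
for c in centers:
    dists = {n: np.linalg.norm(c - t) for n, t in candidate_text.items()}
    names.append(min(dists, key=dists.get))
print(sorted(names))  # ['hats', 'suitcases']
```

In the real system the two directions iterate—confirmed clusters sharpen the language proposals, which in turn refine the clustering—whereas this sketch shows a single round.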
Benchmark tests on Stanford and CLEVR-4 datasets show OAK outperforming extended-vocabulary CLIP and Generalized Category Discovery by wide margins; it achieved 87.4% novel-class accuracy for mood recognition, more than 50 percentage points better than the strongest baseline.
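"Novel-class accuracy" in category-discovery work is typically a clustering accuracy: predicted cluster ids are arbitrary, so accuracy is computed under the best one-to-one matching between clusters and ground-truth classes. The article does not spell out OAK's evaluation, but the standard metric looks like this (brute-force matching here; real evaluations with many classes use the Hungarian algorithm):

```python
from itertools import permutations

def cluster_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one mappings of cluster ids to class ids."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for p, t in zip(y_pred, y_true))
        best = max(best, hits / len(y_true))
    return best

# Cluster ids are arbitrary: raw accuracy here would be 0, but the
# optimal matching reveals perfect agreement.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 1]
print(cluster_accuracy(y_true, y_pred))  # 1.0
```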
The authors presented the work at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2025 and argue that context-aware categorization will be crucial for robotics, scientific exploration, and any setting where an AI must flexibly interpret the same scene for different tasks in real time.