Classic vision ZSL: (1) Each class (seen and unseen) is assigned a semantic vector a_c ∈ ℝ^d: binary attributes, class-name embeddings (Word2Vec, GloVe), or descriptive sentences. (2) A mapping function f: X → ℝ^d from images into the semantic space is trained using only seen classes S, typically by minimizing a compatibility loss (cosine, ranking) between f(x) and a_y for ground-truth y, or by training an attribute classifier directly (Lampert's DAP/IAP). (3) At inference, a test image x is assigned ĉ = argmax_{c ∈ U} sim(f(x), a_c).

Modern CLIP-style zero-shot: a contrastive learner trains a joint image-text embedding space on a huge corpus of pairs (Conceptual Captions, LAION, WIT). Zero-shot classification compares the image embedding with the text-prompt embedding of each class; no fine-tuning is needed.

Zero-shot prompting in LLMs: the model is given a task instruction in natural language ("Translate this sentence into French:") and performs it because pretraining on a massive corpus has covered similar patterns. The absence of demonstrations distinguishes zero-shot from few-shot / in-context learning.
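Step (3) of the classic recipe can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the attribute names, the two unseen classes, and the vector standing in for f(x) are all made up for the example, and cosine similarity is only one possible choice of sim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_unseen(fx, class_vectors):
    """Return the class c whose semantic vector a_c is most similar to f(x)."""
    return max(class_vectors, key=lambda c: cosine(fx, class_vectors[c]))

# Hypothetical binary attributes: [has_stripes, has_hooves, lives_in_water]
unseen = {
    "zebra": [1.0, 1.0, 0.0],
    "dolphin": [0.0, 0.0, 1.0],
}

# Assumed output of the trained mapping f for one test image.
fx = [0.9, 0.7, 0.1]

prediction = predict_unseen(fx, unseen)
print(prediction)
```

Neither unseen class needs a training image here: only its semantic vector enters the argmax.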
How to recognize classes or perform tasks for which labeled data cannot be collected: because examples are scarce (rare species, rare diseases), because new ones keep emerging (new products, new user intents), or because annotation is prohibitively expensive. ZSL transfers knowledge from seen to unseen classes/tasks via a shared semantic representation.
Shared vector space in which each class can be described independently of labeled images — attributes, word embeddings, embeddings of textual descriptions, or text-encoder outputs.
Vector describing each class c — in classical ZSL handcrafted attributes (Animals with Attributes, CUB); in modern ZSL the embedding of a prompt such as "a photo of a {class}".
Function s(x, c) measuring the fit of input x to class c via similarity in the semantic space (cosine, dot product, ranking) — used during both training and inference.
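To make the training-time use of s(x, c) concrete, the sketch below computes one common form of hinge ranking loss on top of a dot-product compatibility score. The margin value, the two seen classes, and the example vectors are illustrative assumptions, not values from any particular paper.

```python
def dot(u, v):
    """Dot-product compatibility between an input embedding and a class vector."""
    return sum(a * b for a, b in zip(u, v))

def ranking_loss(fx, class_vectors, true_class, margin=0.1):
    """Sum of hinge penalties for wrong classes scoring within `margin` of the true class."""
    s_true = dot(fx, class_vectors[true_class])
    return sum(
        max(0.0, margin - s_true + dot(fx, a_c))
        for c, a_c in class_vectors.items()
        if c != true_class
    )

# Hypothetical 2-d semantic vectors for two seen classes.
seen = {"horse": [0.0, 1.0], "tiger": [1.0, 0.0]}

loss_good = ranking_loss([0.1, 0.9], seen, "horse")  # embedding close to "horse"
loss_bad = ranking_loss([0.8, 0.2], seen, "horse")   # embedding close to "tiger"
print(loss_good, loss_bad)
```

The loss is zero once the true class outscores every other class by at least the margin. At inference the same `dot` score is reused, with an argmax in place of the loss.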
Network encoding the input (image, audio, text) into a vector comparable with class semantic vectors — ResNet/ViT in CLIP, an LLM encoder for zero-shot NLP.
A very common ZSL pitfall: classes in U appear in pretraining data (e.g. an ImageNet-pretrained backbone where U ⊂ ImageNet). Reported numbers are then inflated.
A model trained only on seen classes is strongly biased toward them: its scores (for example softmax outputs) are systematically higher for S, so in GZSL almost everything is classified into S and almost nothing into U.
In high-dimensional spaces a few classes become "hubs" — the nearest neighbor of a disproportionate fraction of queries — degrading nearest-neighbor classification.
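A simple diagnostic for hubness is to count how often each class is the nearest neighbor across a set of queries. The sketch below does this with cosine similarity. The data is a hand-made 2-d toy, so it shows only the counting procedure, not the high-dimensional effect itself, and the class named "hub" is hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_class(query, class_vectors):
    """Name of the class vector most similar to the query."""
    return max(class_vectors, key=lambda c: cosine(query, class_vectors[c]))

def hub_counts(queries, class_vectors):
    """How many queries have each class as their nearest neighbor."""
    return Counter(nearest_class(q, class_vectors) for q in queries)

# Toy class bank: "hub" lies between the other two class directions.
classes = {"a": [1.0, 0.0], "b": [0.0, 1.0], "hub": [1.0, 1.0]}
queries = [[1.0, 0.5], [0.5, 1.0], [1.0, 0.8], [0.7, 1.0], [1.0, 0.1]]

counts = hub_counts(queries, classes)
print(counts)
```

A heavily skewed count distribution, with one class absorbing most queries, is the symptom to look for.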
In CLIP-style ZSL, small changes to the prompt template ("a photo of a {class}" vs "{class}") shift accuracy by several percentage points.
Names of rare classes (e.g. obscure species) occur too rarely in the text corpus for their embeddings to be learned well, so ZSL cannot represent them reliably.
First explicit formulation of ZSL — learning new classification tasks without any examples of the target class, via task descriptors.
Two parallel papers grounding ZSL in attribute-based image classification; AwA becomes the standard benchmark.
Frome et al. (Google) replace hand-crafted attributes with Word2Vec word embeddings, opening ZSL to ImageNet-scale settings.
Standardization of ZSL evaluation and introduction of the generalized zero-shot protocol, revealing strong bias toward seen classes.
Brown et al. show that large language models perform tasks without fine-tuning, from prompt instructions alone; ZSL moves from vision into mainstream NLP.
Radford et al. (OpenAI) establish contrastive pretraining on 400M image-text pairs as the standard for zero-shot classification; zero-shot CLIP matches the ImageNet accuracy of a supervised ResNet-50 without using a single ImageNet label.
ZSL extends from classification to detection, segmentation, generation, and robot control; "open-vocabulary" becomes the practical synonym for zero-shot.
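Time complexity: O(|U| · d) per prediction once the input embedding is computed (O(|C| · d) in generalized ZSL, where C = S ∪ U is the full class set). Space complexity: O(|C| · d) for the class-embedding bank.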
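A quick back-of-the-envelope calculation shows how small these costs are next to the encoder forward pass. The class count, embedding dimension, and float32 storage below are illustrative assumptions.

```python
num_classes = 1000    # assumed size of the class bank
dim = 512             # assumed embedding dimension d
bytes_per_float = 4   # float32 storage

# One multiply-add per (class, dimension) pair for a single prediction.
mults_per_prediction = num_classes * dim

# Memory needed to hold the whole class-embedding bank.
bank_bytes = num_classes * dim * bytes_per_float

print(mults_per_prediction, "multiply-adds per prediction")
print(bank_bytes / 1e6, "MB for the class bank")
```

Under these assumptions the similarity step costs about half a million multiply-adds and the bank occupies roughly 2 MB.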
Hand-crafted attributes, word embeddings (Word2Vec/GloVe), LLM text embeddings, Wikipedia descriptions — fundamentally affects transfer quality.
CLIP-style prompts such as "a photo of a {class}" vs "a satellite image of a {class}". Prompt ensembling raises zero-shot accuracy by several percentage points.
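Prompt ensembling can be sketched as: embed the class name under several templates, normalize each embedding, average them, and renormalize. In the code below, `toy_text_encoder` is a hypothetical stand-in that only buckets character codes. A real system would call an actual text encoder, and the template list is likewise an assumption.

```python
import math

# Assumed templates; real ensembles are usually larger and dataset-specific.
TEMPLATES = ["a photo of a {}", "a blurry photo of a {}", "a drawing of a {}"]

def toy_text_encoder(text, dim=8):
    """Hypothetical stand-in for a text encoder: buckets character codes into `dim` bins."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[(ord(ch) + i) % dim] += 1.0
    return vec

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def ensemble_class_embedding(class_name, encode=toy_text_encoder, templates=TEMPLATES):
    """Average the normalized prompt embeddings of one class, then renormalize."""
    embs = [l2_normalize(encode(t.format(class_name))) for t in templates]
    mean = [sum(col) / len(embs) for col in zip(*embs)]
    return l2_normalize(mean)

bank = {name: ensemble_class_embedding(name) for name in ["dog", "cat"]}
print({name: len(vec) for name, vec in bank.items()})
```

The resulting bank has one vector per class, so inference cost is the same as with a single template.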
Bilinear (DeViSE, ALE), ranking, cosine + temperature (CLIP). Affects scaling with the number of classes.
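Two of these options are easy to compare side by side. The sketch below shows a bilinear score s(x, c) = xᵀ W a_c and a temperature-scaled softmax over similarities. The matrix W, the vectors, and the temperature values are illustrative assumptions; in practice W and the temperature are learned.

```python
import math

def bilinear_score(x, W, a):
    """s(x, c) = x^T W a_c, with W of shape (len(x), len(a))."""
    Wa = [sum(w * aj for w, aj in zip(row, a)) for row in W]
    return sum(xi * wi for xi, wi in zip(x, Wa))

def softmax_with_temperature(sims, temperature):
    """Turn similarities into probabilities; a lower temperature gives a sharper distribution."""
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Bilinear: 3-d input features, 2-d semantic vectors (toy values).
W = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
x = [1.0, 0.0, 2.0]
a = [1.0, 1.0]
score = bilinear_score(x, W, a)

# Cosine + temperature: the same two similarities at two temperatures.
sims = [0.30, 0.25]
probs_soft = softmax_with_temperature(sims, 1.0)
probs_sharp = softmax_with_temperature(sims, 0.01)
print(score, probs_soft, probs_sharp)
```

Cosine similarities lie in [-1, 1], so a small temperature is what lets the softmax separate classes whose similarities differ only slightly.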
In generalized ZSL it is critical to mitigate the bias toward seen classes (calibrated stacking, softmax calibration).
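Calibrated stacking amounts to subtracting a constant γ from the scores of seen classes before taking the argmax. A minimal sketch follows. The scores, the class names, and the γ values are made up for illustration, and in practice γ is tuned on held-out data.

```python
def calibrated_predict(scores, seen_classes, gamma):
    """Argmax over all classes after penalizing seen-class scores by gamma."""
    adjusted = {
        c: (s - gamma if c in seen_classes else s)
        for c, s in scores.items()
    }
    return max(adjusted, key=adjusted.get)

# Hypothetical compatibility scores for one image of an unseen class.
scores = {"horse": 0.62, "tiger": 0.30, "zebra": 0.55}
seen = {"horse", "tiger"}

uncalibrated = calibrated_predict(scores, seen, gamma=0.0)
calibrated = calibrated_predict(scores, seen, gamma=0.1)
print(uncalibrated, calibrated)
```

With γ = 0 the seen class wins, and a small penalty is enough to flip the decision to the unseen class. Setting γ too large pushes errors the other way, toward U.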
ZSL does not introduce its own execution paradigm — it inherits one from the host architecture (dense Transformer, conditional MoE, etc.). Similarity-based classification is a dense matmul.
Zero-shot classification itself is a matrix operation (matmul of image embedding with class-embedding bank), ideally parallelized on GPU. Encoder training (e.g. CLIP) is massively parallel over image-text batches.
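The matrix view is easiest to see for a batch: N image embeddings against a bank of K class embeddings give an N × K score matrix, followed by a row-wise argmax. The sketch below spells this out with plain lists so it runs without dependencies. On a GPU the nested loop would be a single matmul, and all values are toy assumptions.

```python
def score_batch(image_embs, class_bank):
    """Return an N x K score matrix: one row per image, one column per class."""
    return [
        [sum(x * a for x, a in zip(img, cls)) for cls in class_bank]
        for img in image_embs
    ]

def argmax_rows(scores):
    """Index of the best-scoring class for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Toy bank of K = 3 class embeddings and a batch of N = 2 image embeddings.
class_bank = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
image_embs = [[0.9, 0.1], [0.2, 0.95]]

scores = score_batch(image_embs, class_bank)
predictions = argmax_rows(scores)
print(predictions)
```

Each row is independent of the others, which is why the operation parallelizes so well.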
Zero-shot classification is encoder forward + matmul of the embedding with the class bank — operations ideal for tensor cores. CLIP/SigLIP pretraining requires GPU clusters.
Google trains its multimodal foundation models (ALIGN, PaLI, Gemini) on TPU — contrastive pretraining maps well to systolic arrays.
Zero-shot inference with quantized CLIP (e.g. ONNX Runtime, GGML) is feasible on CPUs with AVX2/AVX-512 for smaller models (ViT-B).
The ZSL algorithm itself (compare embedding with class bank) is hardware-agnostic — the encoder choice determines the cost.