Ai-engineering · June 7, 2026

Vision Language Models (VLMs): How GPT-4o, Claude, and LLaVA Understand Images

CLIP encoders, cross-attention, visual tokens, and the full pipeline from pixels to language

by Perivitta 45 mins read Advanced


  • What you will learn: How VLMs bridge pixels and language tokens, covering CLIP encoders, patch tokenisation, projection layers, and the three dominant architectures used in production.
  • Why it matters: VLMs power GPT-4o, Claude Vision, Gemini, and the open-source LLaVA family. Understanding their internals is now a core skill for ML engineers building multimodal applications.
  • Architecture: Three paradigms dominate: encoder-projector-LLM (LLaVA-style), cross-attention fusion (Flamingo-style), and native multimodal (GPT-4o-style). Each comes with distinct trade-offs.
  • Key insight: A 336x336 image becomes 576 visual tokens via patch tokenisation, each carrying rich spatial semantics that the language model attends to alongside text tokens.
  • Watch out for: Hallucination on fine-grained spatial details, high per-image token cost, and resolution limits that cause failures on small text or dense diagrams.

When you send a photo of a handwritten invoice to GPT-4o and ask it to extract the line items, or when you upload a chart to Claude and it summarises the trend, something extraordinary is happening under the hood. A model that was built around sequences of text tokens is somehow processing the continuous, high-dimensional signal of an image and integrating that information into its reasoning chain. How?

Vision Language Models (VLMs) are the class of architectures that make this possible. They bridge the gap between the continuous world of pixels and the discrete world of language tokens, enabling a new generation of applications: visual question answering (VQA), image captioning, optical character recognition at scale, chart and table understanding, document parsing, medical image analysis, and multimodal agents that can see and act on the world.

In 2026, VLMs have moved from research curiosity to production infrastructure. GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and the open-source LLaVA family are all deployed at scale. Understanding how they work, not just that they work, is now a core competency for ML engineers. This post gives you that understanding in full technical depth.


The Problem VLMs Solve

A pure language model operates in token space. Its input is a sequence of integers (token IDs), each mapped to a vector via an embedding table, and its output is a probability distribution over the vocabulary. Everything is discrete and one-dimensional. Images are neither of those things.

A 336 x 336 pixel RGB image contains 338,688 raw numerical values. Even at reduced resolution, the raw pixel array is a dense, spatially structured, continuous signal. Feeding raw pixels directly into a transformer would require attention over hundreds of thousands of positions, making computation prohibitively expensive. More fundamentally, raw pixel values carry no semantic structure: the number 127 in position (42, 83, 2) tells the model nothing useful by itself.
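To make the scale concrete, here is a small sketch of the arithmetic. The function name and the choice to treat each pixel as one attention position are illustrative assumptions, not how any production model is configured.

```python
def sequence_stats(height, width, channels=3, patch=1):
    """Count raw values, sequence positions, and pairwise attention scores.

    patch=1 treats every pixel as its own position;
    patch=14 mimics the patch grid of a ViT-L/14 encoder.
    """
    raw_values = height * width * channels
    positions = (height // patch) * (width // patch)
    return {
        "raw_values": raw_values,
        "positions": positions,
        "attention_pairs": positions ** 2,  # self-attention is quadratic in length
    }

pixels = sequence_stats(336, 336)             # one position per pixel
patches = sequence_stats(336, 336, patch=14)  # one position per 14x14 patch

print(pixels)
print(patches)
print("attention cost ratio:", pixels["attention_pairs"] // patches["attention_pairs"])
```

Grouping pixels into 14x14 patches shrinks the sequence by a factor of 196, and the quadratic attention cost by 196 squared.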

The core challenge of VLMs is therefore a representation mismatch: the language model expects semantically rich, fixed-dimensional vectors arranged in a short sequence. Images are high-dimensional, spatially structured, and continuous. Bridging this gap requires three things: (1) a vision encoder that converts raw pixels into compact, semantically meaningful representations; (2) a projection mechanism that maps those representations into the language model's embedding space; and (3) a training procedure that teaches the combined system to align visual and linguistic meaning.

Getting this wrong in any of the three places produces a model that confidently hallucinates image content, fails on spatially precise questions, or cannot generalise to images outside its training distribution.


Core Concepts and Terminology

Term | Definition | Why It Matters
Vision Encoder | A neural network (typically a Vision Transformer or CNN) that converts raw pixel data into a grid of feature vectors. | Determines the quality and type of visual representations available to the language model.
Language Model Backbone | The pretrained LLM (e.g., LLaMA, Vicuna, Mistral) that receives visual and text tokens and generates output. | Provides all the reasoning, instruction-following, and language generation capability.
Visual Tokens | The sequence of vectors produced by the vision encoder (and optionally compressed by a projector) that are fed into the LLM alongside text tokens. | Each visual token represents a region of the image in the language model's embedding space.
Image Patches | Non-overlapping rectangular regions of the input image (e.g., 14x14 or 16x16 pixels) that the ViT processes independently before applying self-attention. | The patch size directly controls how many visual tokens are produced per image.
CLIP | Contrastive Language-Image Pretraining (OpenAI, 2021). A dual-encoder model trained to align image and text representations in a shared embedding space. | CLIP's vision encoder is the most widely used backbone for VLMs because its representations are semantically aligned with language.
ViT (Vision Transformer) | An image encoder that divides an image into fixed-size patches, linearly embeds each patch, and applies transformer self-attention over the resulting sequence. | ViTs produce the per-patch token sequences that VLMs consume. CLIP uses a ViT as its image encoder.
Cross-Attention | An attention mechanism in which queries come from one modality (e.g., text) and keys/values come from another (e.g., image features). | Used in Flamingo-style architectures to let the language model attend to image regions at every transformer layer.
Projection Layer | A trainable module (linear, MLP, or Q-Former) that maps vision encoder output vectors into the LLM's embedding dimensionality. | The projection layer is the primary trainable interface between the two modalities in many VLMs.
Multimodal Alignment | The process of training or fine-tuning the combined system so that visual and language representations are compatible in a shared semantic space. | Without alignment, the LLM cannot interpret visual tokens and produces incoherent outputs.
Instruction Tuning | Fine-tuning a pretrained model on (instruction, response) pairs so it learns to follow natural language instructions, including multimodal ones. | Converts a pretrained VLM into a useful assistant that responds correctly to "describe this image" or "what is the trend in this chart?"

Architecture Overview

Three dominant paradigms have emerged for building VLMs, each making different trade-offs between flexibility, training cost, and performance.

Architecture 1: Encoder + Projector + LLM (LLaVA-Style)

This is the simplest and most widely used open-source architecture. The data flow is:

Stage | Component | What Happens
1 | Input Image | Raw pixels (H x W x 3) fed into the vision encoder
2 | CLIP ViT-L/14 | Image divided into patches; each patch becomes a D_vision-dimensional embedding vector
3 | Projection Layer | Linear or MLP maps patch embeddings from vision space into the LLM's embedding dimension
4 | Token Concatenation | Visual tokens are prepended to the text token sequence to form a single combined input
5 | LLM (LLaMA / Vicuna) | Processes the full combined sequence; self-attention spans both visual and text tokens
6 | Output | Autoregressive text generation conditioned on both image and prompt

LLaVA-style architecture: the vision encoder and LLM are coupled through a lightweight projection layer. Every transformer layer in the LLM can attend to every visual token.

The vision encoder (typically CLIP ViT-L/14 or ViT-L/14@336px) processes the image and produces a sequence of patch embeddings. These are passed through a projection layer that maps them from the vision encoder's hidden dimension (e.g., 1024) to the LLM's embedding dimension (e.g., 4096). The resulting visual tokens are then prepended to the text token sequence, and the LLM processes the combined sequence autoregressively.

Trade-offs: Simple to implement and train. The entire image is visible to every layer of the LLM via self-attention. However, the visual token count can be large (576 tokens for a 336x336 image with 14x14 patches), consuming a significant portion of the context window. In the simplest setup, the projection layer is the only component trained to learn the cross-modal mapping; the vision encoder and the LLM can each be frozen or fine-tuned depending on compute budget.

Example models: LLaVA-1.5, LLaVA-NeXT, BakLLaVA, MoE-LLaVA, ShareGPT4V.

Architecture 2: Cross-Attention Fusion (Flamingo-Style)

In Flamingo (DeepMind, 2022), the image and text modalities are kept separate. The language model backbone is frozen, and new cross-attention layers are interleaved between its existing transformer layers. These cross-attention layers receive queries from the text stream and keys/values from a pooled representation of image features.

Component | Role | Key Detail
Vision Encoder (NFNet or ViT) | Extracts visual features from the input image | Produces a variable-length sequence of patch embeddings
Perceiver Resampler | Compresses visual features to a fixed token count | Learnable query vectors pool patch embeddings down to 64 tokens regardless of image size
Cross-Attention Layers | Inserted between frozen LLM blocks | Text hidden states act as queries; image features are keys and values
Frozen LLM Backbone | Language generation | Original weights unchanged; only the cross-attention layers and Perceiver are trained
Output | Text response | Generated autoregressively, informed by image features at every layer depth

Flamingo-style architecture: cross-attention layers injected between frozen LLM blocks allow the language model to attend to compressed image features at every depth, without disturbing the pretrained text weights.

A key component is the Perceiver Resampler, which uses a small set of learnable query vectors to compress the variable-length patch sequence from the vision encoder down to a fixed number of tokens (e.g., 64). This keeps the cross-attention computation tractable regardless of image resolution.
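The pooling idea can be sketched in a few lines of plain Python. This is a deliberately stripped-down illustration, assuming a single attention head, no learned key/value projections, and random vectors standing in for trained latents; the real Perceiver Resampler stacks several such layers with feed-forward blocks. The function names are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_pool(queries, features):
    """Pool a variable number of feature vectors into len(queries) vectors.

    queries:  list of latent query vectors, each of length d
    features: list of patch vectors, each of length d
    """
    d = len(queries[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between this query and every patch
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in features]
        weights = softmax(scores)
        # Output is the attention-weighted average of the patch vectors
        pooled.append([sum(w * f[k] for w, f in zip(weights, features)) for k in range(d)])
    return pooled

random.seed(0)
d = 8
latents = [[random.gauss(0, 1) for _ in range(d)] for _ in range(4)]  # Flamingo uses 64

for num_patches in (16, 100):
    patches = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_patches)]
    out = cross_attention_pool(latents, patches)
    print(num_patches, "patches ->", len(out), "tokens of dim", len(out[0]))
```

Whatever the number of input patches, the output length is fixed by the number of latent queries, which is the property that keeps downstream cross-attention cost constant.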

Trade-offs: The frozen LLM backbone is protected from catastrophic forgetting. Cross-attention at every layer gives the model fine-grained control over when and how it uses image information. However, the architecture is more complex to implement, and the cross-attention adds inference latency at every layer.

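Example models: Flamingo, OpenFlamingo, IDEFICS. (IDEFICS2 later moved away from cross-attention to a fully autoregressive design that concatenates pooled visual tokens with the text.)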

Architecture 3: Native Multimodal (GPT-4o-Style)

The most capable but least open architecture trains a single unified model end-to-end on interleaved image, text, and audio data from the start. Rather than adapting a pretrained LLM to accept images, the model is pretrained jointly across modalities, allowing every layer to develop natively multimodal representations.

GPT-4o is believed to tokenise images into discrete visual tokens using a learned tokeniser, producing image tokens that live in the same vocabulary as text tokens, though the exact architecture has not been publicly disclosed by OpenAI. The model then processes these as a unified sequence.

Trade-offs: No modality boundary means the model can reason more deeply about relationships between text and image at every layer. End-to-end training allows the vision and language representations to co-evolve. The cost is enormous: joint pretraining requires vastly more compute, data, and engineering complexity. The architectural details of GPT-4o and Claude's vision system are not publicly disclosed.

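Example models: Chameleon (Meta) is openly documented as an early-fusion, token-based model. GPT-4V, GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro are usually grouped here as well, although their architectures have not been published.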


How CLIP Works

CLIP (Contrastive Language-Image Pretraining) is foundational to understanding most open-source VLMs. Published by OpenAI in 2021, CLIP trains two encoders simultaneously: an image encoder (typically a ViT) and a text encoder (a transformer). The training signal is contrastive: for a batch of N (image, caption) pairs, the model is trained to maximise the cosine similarity of the N matching pairs and minimise the similarity of the N^2 - N non-matching pairs.

Put differently, CLIP replaces single-label prediction with a matching task: pick the correct caption for each image out of the entire batch. Trained this way on 400 million (image, text) pairs scraped from the internet, the image encoder learns to produce representations that are semantically aligned with language. A CLIP embedding of a photo of a golden retriever will be close to the text embedding of "a golden retriever" and far from "a sports car". This alignment is exactly what makes CLIP representations useful as a visual backbone for VLMs: the image features are already in a language-compatible semantic space.
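The contrastive objective can be written out as a symmetric cross-entropy over the N x N similarity matrix. The sketch below uses plain Python lists and a fixed temperature of 0.07 for readability; CLIP itself learns the temperature during training and runs on large batched tensors, so treat this as an illustration of the loss shape only.

```python
import math

def l2_normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k])."""
    imgs = [l2_normalise(v) for v in image_embs]
    txts = [l2_normalise(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Image-to-text: each row should put its mass on the diagonal entry
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image: same for each column
    loss_t2i = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]           # caption k matches image k
aligned = clip_loss(images, captions)
shuffled = clip_loss(images, captions[::-1])  # captions swapped
print(f"aligned: {aligned:.4f}  shuffled: {shuffled:.4f}")
```

The loss is low when matching pairs sit on the diagonal of the similarity matrix and high when they do not, which is the pressure that pulls images and their captions together in the shared space.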

ViT Patch Tokenisation

CLIP's image encoder is a Vision Transformer (ViT). The ViT processes images as follows:

  1. Divide the image into a grid of non-overlapping patches. For ViT-L/14, each patch is 14x14 pixels. A 224x224 image produces 16x16 = 256 patches. A 336x336 image produces 24x24 = 576 patches.
  2. Flatten each patch into a 1D vector of length 14*14*3 = 588, then project it to the model's hidden dimension D (e.g., 1024) via a learned linear layer. This is the "patch embedding".
  3. Prepend a learnable [CLS] token to the sequence, then add learnable positional embeddings (one per sequence position) to every token.
  4. Pass the resulting sequence (length 577 for 336x336 with ViT-L/14) through L transformer layers with multi-head self-attention.
  5. The [CLS] token output is typically used as the global image representation for CLIP's contrastive loss. LLaVA instead uses the patch token sequence (without [CLS]) as its visual tokens; LLaVA-1.5 takes these from the penultimate layer rather than the last.
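Steps 1 and 2 above (splitting and flattening) can be sketched without any libraries. The nested-list image format and the function name are assumptions made for illustration; real implementations do this with a strided convolution or a tensor reshape.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches.

    Returns (H/patch * W/patch) vectors of length patch*patch*C,
    ordered row by row across the patch grid.
    """
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values
            patches.append(flat)
    return patches

# Dummy all-black 336x336 RGB image
image = [[[0.0, 0.0, 0.0] for _ in range(336)] for _ in range(336)]
tokens = patchify(image, 14)
print(len(tokens), len(tokens[0]))  # 576 588
```

Each 588-dimensional vector would then pass through the learned linear layer to reach the hidden dimension, and the [CLS] token brings the sequence length to 577.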

The critical insight is that each patch token in the final layer's output corresponds to a specific spatial region of the image. Self-attention allows patches to attend to each other, so a patch token representing the sky can incorporate information from the horizon patches. But the spatial correspondence is preserved: visual token 42 always corresponds to the same 14x14 region.
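As a quick illustration of that correspondence, the helper below maps a visual token index back to its pixel box. It assumes row-major patch order and an index that does not count the [CLS] token; the function name is hypothetical.

```python
def token_to_region(index, image_size=336, patch=14):
    """Return the (left, top, right, bottom) pixel box covered by a patch token."""
    grid = image_size // patch
    if not 0 <= index < grid * grid:
        raise IndexError(f"token index {index} outside 0..{grid * grid - 1}")
    row, col = divmod(index, grid)
    return (col * patch, row * patch, (col + 1) * patch, (row + 1) * patch)

print(token_to_region(42))   # (252, 14, 266, 28)
print(token_to_region(575))  # (322, 322, 336, 336)
```

Token 42 sits in the second row of the 24x24 grid, nineteenth column, which is handy when overlaying attention maps on the original image.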


Visual Token Projection

The vision encoder produces patch embeddings in its own hidden space (e.g., D_vision = 1024 for ViT-L/14). The LLM operates in its own embedding space (e.g., D_llm = 5120 for LLaMA-2-13B). These spaces are not compatible: a vector from CLIP cannot be directly inserted into LLaMA's residual stream and produce meaningful computation.

The projection layer solves this by learning a mapping from D_vision to D_llm. Three main approaches are used:

Linear and MLP Projection (LLaVA, LLaVA-1.5)

The original LLaVA used a single linear layer, W ∈ R^(D_llm x D_vision), applied independently to each patch token: fast, simple, and surprisingly effective. LLaVA-1.5 found that a two-layer MLP with a GELU activation outperformed the single linear layer, and that is the variant shown here.

# Simplified two-layer MLP projector (LLaVA-1.5 style)
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_dim)
        return self.proj(image_features)
        # output: (batch, num_patches, llm_dim)

Q-Former (BLIP-2, InstructBLIP)

The Querying Transformer (Q-Former) is a more sophisticated bottleneck. It contains a fixed set of N learnable query vectors (e.g., 32 queries) that cross-attend to the full patch sequence from the frozen vision encoder. The Q-Former emits N output vectors regardless of the original image resolution, and a final linear layer maps them into D_llm space. This dramatically compresses the visual token count (from hundreds of patch tokens down to 32) at the cost of some spatial detail, and adds a trained interface that can be pretrained on image-text tasks independently.

Token Count: Why 576 Tokens

With ViT-L/14@336px, the image is 336x336 pixels and the patch size is 14x14, so the number of patches is (336/14) x (336/14) = 24 x 24 = 576. Each patch becomes one visual token in the LLM's input sequence. This is why a single 336x336 image in LLaVA-1.5 consumes 576 context tokens, roughly 14% of a 4096-token context window. LLaVA-NeXT does not shrink this budget: its dynamic-resolution strategy tiles the image to support higher resolutions, which multiplies the visual token count per image.
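A small budget calculator makes the trade-off tangible. The `tiles` parameter is a rough assumption that every extra tile costs the same as the base view; actual tiled models add separator tokens and may drop padding, so read the tiled figure as an approximation.

```python
def visual_token_count(image_size=336, patch=14, tiles=0):
    """Tokens for one image: a base view plus `tiles` extra views of the same size."""
    per_view = (image_size // patch) ** 2
    return per_view * (1 + tiles)

def context_share(num_images, context_window=4096, **kwargs):
    """Return (visual tokens used, fraction of the context window)."""
    used = num_images * visual_token_count(**kwargs)
    return used, used / context_window

print(context_share(1))           # (576, 0.140625)
print(context_share(7))           # seven images nearly fill a 4096-token window
print(visual_token_count(tiles=4))  # base view + 2x2 tile grid: 2880
```

Seven plain 336x336 images already use 4,032 of 4,096 tokens, and an eighth does not fit.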


Training Pipeline

Most open-source VLMs follow a two-stage training pipeline, pioneered by LLaVA and refined by subsequent work.

Stage 1: Projection Pretraining (Feature Alignment)

Goal: Teach the projection layer to map vision encoder outputs into vectors that the frozen LLM can interpret.

Data: Large-scale image-caption pairs (e.g., CC3M, LAION-CC-SBU, approximately 558K pairs in LLaVA-1.5 Stage 1).

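Setup: Both the vision encoder and the LLM are frozen. Only the projection layer weights are updated. The loss is the LLM's usual next-token prediction loss on the caption tokens, conditioned on the projected visual tokens; gradients pass through the frozen LLM but update only the projector.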

Why freeze the LLM? The LLM already has strong language priors. Updating it on caption pairs alone could cause catastrophic forgetting of its broader language capabilities. Stage 1 focuses exclusively on teaching the projection layer to speak the LLM's language.

Duration: Typically 1 epoch on the alignment dataset. Computationally cheap compared to Stage 2.

Stage 2: Visual Instruction Tuning

Goal: Teach the model to follow multimodal instructions, answer questions about images, and engage in visual dialogue.

Data: Multimodal instruction-following datasets: LLaVA-Instruct-150K, ShareGPT4V, VQA datasets, TextVQA, GQA, OCR-VQA, and document understanding datasets. LLaVA-1.5 uses approximately 665K instruction samples.

Setup: The vision encoder is frozen. The projection layer and the LLM are both trained (or the LLM is trained with LoRA adapters to reduce compute). The model is trained to generate correct responses to instructions like "Describe this image in detail", "What is the text in this sign?", "How many people are in the image?".

Why curriculum matters: Stage 1 must complete before Stage 2. If both are run together, the untrained projection layer produces garbage vectors, and the LLM's updates will attempt to compensate, degrading its language quality. The sequential curriculum cleanly separates the alignment problem from the instruction-following problem.
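The freeze/unfreeze pattern for the two stages can be sketched with a toy PyTorch model. `ToyVLM` and `configure_stage` are hypothetical names, and the three tiny linear modules are only stand-ins for a real CLIP encoder, projector, and LLM.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Tiny stand-in with the same three-part structure as a LLaVA-style model."""
    def __init__(self, patch_dim=12, vision_dim=8, llm_dim=16, vocab=32):
        super().__init__()
        self.vision_encoder = nn.Linear(patch_dim, vision_dim)  # stand-in for CLIP ViT
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = nn.Linear(llm_dim, vocab)                     # stand-in for the LLM

    def forward(self, patches):
        return self.llm(self.projector(self.vision_encoder(patches)))

def configure_stage(model, stage):
    """Stage 1 trains only the projector; stage 2 trains projector and LLM."""
    trainable = {1: ("projector.",), 2: ("projector.", "llm.")}[stage]
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable)
    return [n for n, p in model.named_parameters() if p.requires_grad]

model = ToyVLM()
print("Stage 1:", configure_stage(model, 1))
print("Stage 2:", configure_stage(model, 2))

# Only parameters with requires_grad=True are handed to the optimizer
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```

In both stages the vision encoder stays frozen; the learning rate here is a placeholder, not a recommended value.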

Data quality matters more than quantity: LLaVA-1.5 reported state-of-the-art results on a range of benchmarks with only about 665K instruction samples, combining GPT-4-generated conversation data with academic VQA datasets, while competing models were trained on far larger but noisier datasets.


Practical Example: "What Is in This Image?"

Let's trace exactly what happens when a user sends a photo of a busy street with the question "What is in this image?" to a LLaVA-1.5 (13B) model.

Step 1: Image Preprocessing

The image is resized and center-cropped to 336x336 pixels. Pixel values are normalised using CLIP's mean and std. The image tensor has shape (3, 336, 336).
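The geometry and normalisation of this step can be sketched as follows. The mean and standard deviation are the per-channel constants used by OpenAI's CLIP preprocessing; the helper names and the rounding behaviour are assumptions, and real image processors may differ in small details.

```python
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def resize_and_crop_box(width, height, target=336):
    """Resize so the shorter side equals `target`, then locate the centre crop."""
    scale = target / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

def normalise_pixel(rgb):
    """Map 0-255 RGB values to CLIP's normalised input range."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))

size, box = resize_and_crop_box(1920, 1080)
print(size, box)  # (597, 336) (130, 0, 466, 336)
print(normalise_pixel((255, 255, 255)))
```

For a 1920x1080 photo, the centre crop discards roughly 130 pixels from each side of the resized image, so content near the left and right edges never reaches the encoder.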

Step 2: Patch Tokenisation

The ViT-L/14@336px divides the image into 24x24 = 576 patches, each 14x14 pixels. Each patch is linearly embedded to a 1024-dimensional vector. A [CLS] token is prepended, giving sequence length 577.

Step 3: Vision Encoder Processing

The 577-token sequence passes through 24 transformer layers (ViT-L configuration). Each layer applies multi-head self-attention (16 heads, dim 64 each) and an MLP. Patches corresponding to buildings, cars, people, and traffic lights develop specialised representations as higher layers encode increasingly abstract features. The output is 576 patch embeddings, each of shape (1024,). (The [CLS] token is discarded for LLaVA; some models use it for global context.)

Step 4: Projection to Language Space

The two-layer MLP projector maps each of the 576 patch embeddings from (1024,) to (5120,), matching LLaMA-2-13B's hidden dimension. Output: 576 visual tokens in LLM embedding space.

Step 5: Token Sequence Construction

The text question "What is in this image?" is tokenised to approximately 7 text tokens. A special <image> placeholder in the prompt template is replaced by the 576 visual tokens, so the sequence fed to the LLM consists of 576 visual tokens (one per image patch) followed by the 7 question tokens. Ignoring the handful of extra tokens contributed by the chat template, the total context length is approximately 583 tokens. Every transformer layer in the LLM can attend across this full sequence, meaning each word the model generates can be influenced by any image patch.
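The splice itself is simple. In this sketch the placeholder id, the `embed` callback, and the tuple "embeddings" are all illustrative stand-ins; a real model would concatenate tensors of shape (tokens, D_llm).

```python
IMAGE_PLACEHOLDER = -200  # hypothetical sentinel id marking where the image goes

def splice_visual_tokens(text_ids, visual_tokens, embed):
    """Build the LLM input: embed text ids, expand the placeholder into visual tokens."""
    sequence = []
    for tok in text_ids:
        if tok == IMAGE_PLACEHOLDER:
            sequence.extend(visual_tokens)  # one placeholder becomes many tokens
        else:
            sequence.append(embed(tok))
    return sequence

visual_tokens = [("img", i) for i in range(576)]   # stand-ins for projected patches
question_ids = [IMAGE_PLACEHOLDER] + list(range(7))  # placeholder + 7 text tokens
sequence = splice_visual_tokens(question_ids, visual_tokens, embed=lambda t: ("txt", t))

print(len(sequence))               # 583
print(sequence[0], sequence[576])  # ('img', 0) ('txt', 0)
```

A single placeholder position in the prompt expands into 576 entries, which is why the token count reported by a tokenizer alone can understate the real sequence length.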

Step 6: LLM Autoregressive Generation

LLaMA-2-13B processes the 583-token sequence. All 40 transformer layers apply causal self-attention over the full sequence, so every text token can attend to every visual token that precedes it. Intuitively, the model can draw on the spatial regions relevant to each generated word: "street" may lean on road-patch tokens and "buildings" on upper-image patches, although real attention maps are rarely this clean.

The model generates tokens one at a time: "The", "image", "shows", "a", "busy", "city", "street", "with", ... until an end-of-sequence token is produced.
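The decoding loop can be sketched independently of any model. Here `next_token_logits` is a hypothetical callback standing in for a full forward pass over the visual and text tokens, and the toy model simply counts upward until it hits the end-of-sequence id.

```python
def greedy_generate(next_token_logits, prompt_ids, eos_id, max_new_tokens=16):
    """Append the highest-scoring token until EOS or the length limit is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)  # one score per vocabulary entry
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids[len(prompt_ids):]         # only the newly generated tokens

VOCAB_SIZE = 6

def toy_model(ids):
    """Toy stand-in: always prefers (last id + 1), capped at the final id."""
    target = min(ids[-1] + 1, VOCAB_SIZE - 1)
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]

print(greedy_generate(toy_model, [0, 1], eos_id=5))  # [2, 3, 4, 5]
```

In a real VLM the prompt would already contain the spliced visual tokens, and a key-value cache would avoid recomputing them at every step.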


Python Implementation

The following example shows how to load LLaVA-NeXT (LLaVA-1.6) using the HuggingFace transformers library and run visual inference.

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

# Load model and processor
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # LLaVA-NeXT, HF-compatible
processor = LlavaNextProcessor.from_pretrained(model_id)

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"          # automatically distributes across available GPUs/CPU
)

# Load an image from URL (or use PIL.Image.open for local files)
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/240px-PNG_transparency_demonstration_1.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # this PNG is RGBA; drop the alpha channel

# Build the conversation prompt using the LLaVA chat template
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image? Describe it in detail."},
        ],
    },
]

# Apply the processor's chat template to format the prompt
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Process inputs: tokenises text + encodes image into visual tokens
inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt"
).to(model.device)

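# Print token counts
# Depending on the transformers version, <image> appears once or is already
# expanded to one placeholder per visual token.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
num_image_tokens = (inputs["input_ids"] == image_token_id).sum().item()
print(f"Total input tokens: {inputs['input_ids'].shape[1]}")
print(f"Image placeholder tokens: {num_image_tokens}")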

# Generate response
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,           # greedy decoding for determinism (temperature is ignored)
    )

# Decode only the newly generated tokens (not the prompt)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("Model response:")
print(response)

For LLaVA-1.5 specifically, using the HF-converted checkpoint:

from transformers import AutoProcessor, LlavaForConditionalGeneration
import torch
from PIL import Image

# The original liuhaotian/llava-v1.5-7b weights are meant to be loaded with the
# LLaVA repository's own code. With transformers, use the converted checkpoint.
model_id = "llava-hf/llava-1.5-7b-hf"

# The processor bundles CLIP's image preprocessor and the tokenizer
processor = AutoProcessor.from_pretrained(model_id)

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

def run_llava_inference(image: Image.Image, question: str) -> str:
    """Run LLaVA-1.5 inference on a single image and question."""

    # Format prompt with LLaVA's special image token
    # The model expects an <image> placeholder where visual tokens will be inserted
    prompt = f"USER: <image>\n{question} ASSISTANT:"

    # Preprocess image (resize to 336x336, normalise with CLIP stats) and tokenise text
    inputs = processor(
        images=image.convert("RGB"),
        text=prompt,
        return_tensors="pt"
    ).to(model.device, torch.float16)

    with torch.inference_mode():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,
        )

    # Decode only new tokens
    output_text = processor.decode(
        output_ids[0, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    ).strip()

    return output_text

# Example usage
image = Image.open("street.jpg")
answer = run_llava_inference(image, "How many cars are in this image?")
print(answer)

For batch inference with multiple images (important for production throughput):

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

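processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
# Batched generation with a decoder-only model needs left padding, so that new
# tokens continue directly from each prompt rather than from pad tokens.
processor.tokenizer.padding_side = "left"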
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

def batch_inference(images: list, questions: list, batch_size: int = 4):
    """Process multiple image-question pairs in batches."""
    results = []

    for i in range(0, len(images), batch_size):
        batch_images = images[i:i + batch_size]
        batch_questions = questions[i:i + batch_size]

        conversations = [
            [{"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": q}
            ]}]
            for q in batch_questions
        ]

        prompts = [
            processor.apply_chat_template(conv, add_generation_prompt=True)
            for conv in conversations
        ]

        # Padding is required for batch processing
        inputs = processor(
            images=batch_images,
            text=prompts,
            return_tensors="pt",
            padding=True
        ).to(model.device)

        with torch.inference_mode():
            output_ids = model.generate(**inputs, max_new_tokens=256)

        for j, out in enumerate(output_ids):
            prompt_len = inputs["input_ids"][j].shape[0]
            response = processor.decode(out[prompt_len:], skip_special_tokens=True)
            results.append(response)

    return results

Comparison of Major VLMs

Model | Architecture Type | Vision Encoder | LLM Backbone | Open / Closed | Best Use Case
LLaVA-1.5 | Encoder + MLP Projector + LLM | CLIP ViT-L/14@336px | Vicuna-7B / 13B | Open (weights) | General VQA, baseline research, self-hosted deployment
LLaVA-NeXT | Encoder + MLP Projector + LLM (dynamic resolution) | CLIP ViT-L/14@336px (tiled) | Mistral-7B / LLaMA-3-8B / 70B | Open (weights) | High-res documents, OCR, chart understanding
BLIP-2 | Encoder + Q-Former + LLM | CLIP ViT-L or EVA-ViT-G | OPT-2.7B / FlanT5-XXL | Open (weights) | Image captioning, zero-shot VQA
InstructBLIP | Encoder + Q-Former + LLM (instruction-tuned) | CLIP ViT-L or EVA-ViT-G | Vicuna-7B / 13B, FlanT5 | Open (weights) | Instruction-following VQA, science diagrams
Flamingo | Cross-attention fusion (Perceiver Resampler) | NFNet-F6 | Chinchilla-70B (frozen) | Closed (weights not released) | Few-shot multimodal reasoning, interleaved image-text
GPT-4V / GPT-4o | Native multimodal (unified tokenisation) | Undisclosed | GPT-4 class (undisclosed) | Closed (API only) | Complex visual reasoning, multimodal agents, fine-grained OCR
Claude 3 Vision | Native multimodal (undisclosed) | Undisclosed | Claude 3 class (undisclosed) | Closed (API only) | Document analysis, chart interpretation, long-form visual reasoning
Gemini Vision | Native multimodal (interleaved tokens) | Undisclosed (likely SigLIP-based) | Gemini 1.5 Pro / Flash | Closed (API only) | Long-context video understanding, document OCR at scale

Advantages of VLMs

  • Visual reasoning: VLMs can answer complex questions that require integrating visual evidence with world knowledge. "Is the food in this image appropriate for someone with celiac disease?" requires recognising food items, knowing their ingredients, and understanding dietary restrictions simultaneously.
  • Zero-shot generalisation: CLIP-pretrained VLMs generalise to visual concepts not explicitly seen in instruction tuning, because the vision encoder's representations already cover a vast range of visual categories.
  • Document understanding: Combining OCR capability with language understanding, VLMs can process contracts, forms, invoices, and research papers in a single pass, extracting structured information without explicit layout parsing.
  • Chart and table parsing: VLMs understand the visual grammar of charts (axes, legends, bars, lines) and can extract data, identify trends, and answer quantitative questions about plotted data.
  • Accessibility applications: Image captioning and visual question answering enable screen readers and assistive tools that describe images to visually impaired users in rich, contextual language.
  • Unified pipeline: A single VLM replaces a pipeline of specialised models (object detector, OCR engine, caption model, VQA model), reducing inference infrastructure complexity and the error propagation that occurs when chaining separate models.

Limitations and Trade-offs

  • Hallucination on fine-grained visual details: VLMs frequently hallucinate object attributes, counts, and spatial relationships. Asking "How many red cars are in the parking lot?" often yields plausible but incorrect numbers. The language model's priors about what is likely to appear in a scene can dominate over actual visual evidence.
  • Poor spatial reasoning: Tasks requiring precise spatial understanding ("Is the red ball to the left or right of the blue cube?") are systematically difficult because the patch tokenisation and self-attention mechanism do not preserve strong spatial inductive biases.
  • High token cost per image: A single 336x336 image consumes 576 context tokens in LLaVA-1.5. Processing 10 images in a conversation consumes 5,760 tokens before any text. This limits the number of images per conversation and drives up inference cost significantly.
  • Resolution constraints: CLIP ViT-L/14 was pretrained at 224x224. Fine-tuning at 336x336 helps but images with small text or fine detail (e.g., PCB diagrams, microscopy) still lose information. LLaVA-NeXT's dynamic tiling partially addresses this.
  • Text in images: While VLMs can read printed text in images, they struggle with handwriting, dense text layouts, non-Latin scripts, and low-contrast text. Dedicated OCR systems like Tesseract or cloud vision APIs still outperform general VLMs on heavy-OCR tasks.
  • Limited auditability of closed models: Proprietary VLMs cannot be inspected to see what they actually "see". Their visual capabilities are characterised only through benchmarks and empirical testing, not by examining internal representations.

Common Mistakes

  • Over-relying on VLMs for precise measurements: VLMs cannot reliably read exact numerical values from charts, measurements from photos, or precise coordinates. If your application requires precise numerical extraction, combine the VLM with specialised computer vision tools.
  • Ignoring resolution limits: Sending a 4000x3000 pixel image to LLaVA-1.5 will downsample it to 336x336 before processing, discarding most of the detail. If the task requires reading small text or detecting small objects, use a model with dynamic high-resolution support (LLaVA-NeXT, GPT-4o) or pre-crop the region of interest.
  • Not providing sufficient text context: VLMs perform significantly better when the text prompt provides context about the task. "What do you see?" is worse than "You are analysing a medical X-ray. Describe any abnormalities in the lung region." The instruction-tuned LLM backbone benefits from context just as it does in text-only tasks.
  • Using the wrong model for the task: A general VQA model is not the right tool for production OCR at scale. If you need to extract all text from thousands of scanned documents, use a document-specific model or service (PaddleOCR, AWS Textract, Azure Form Recognizer, since renamed Azure AI Document Intelligence). Use VLMs where visual reasoning, not just text extraction, is needed.
  • Forgetting to benchmark on your specific distribution: A VLM that achieves 80% on VQAv2 may perform far worse on your domain-specific images (medical scans, satellite imagery, engineering drawings). Always evaluate on representative samples from your target distribution before production deployment.
  • Processing images sequentially when batching is available: For offline processing tasks, batching images together (with padding) achieves significantly higher GPU utilisation than one-at-a-time inference.
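
To see why ignoring resolution limits hurts, it helps to estimate how large a feature remains after downsampling. The sketch below assumes an aspect-preserving resize in which the longer side is mapped to the model's input size; that is an assumption for illustration, so check how your preprocessing library actually resizes and crops. The helper name and the 40-pixel text height are hypothetical.

```python
def feature_size_after_resize(feature_px: float, image_w: int, image_h: int,
                              target: int = 336) -> float:
    """Approximate size in pixels of a feature once the image is resized.

    Assumes the longer image side is scaled to `target` pixels and the
    aspect ratio is preserved (an illustrative assumption).
    """
    scale = target / max(image_w, image_h)
    return feature_px * scale


# A line of text 40 px tall in a 4000x3000 photo.
text_height = feature_size_after_resize(40, 4000, 3000)
patch_size = 14
print(f"text height after resize: {text_height:.2f} px")
print(f"fraction of one {patch_size}px patch: {text_height / patch_size:.2f}")
```

When the result is only a few pixels, the text is unlikely to survive the resize, which is the signal to pre-crop the region of interest or switch to a model with dynamic high-resolution support.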

Best Practices

Image Resolution Selection

Match resolution to task requirements. For general scene understanding and conversational QA, 336x336 (LLaVA-1.5) is sufficient. For document parsing, dense text, or fine-grained recognition, use models with higher native resolution or dynamic tiling (LLaVA-NeXT supports inputs up to 672x672, 336x1344, or 1344x336 via tiling). Never send images larger than the model's native resolution without checking how the library handles resizing.
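
As a rough illustration of how dynamic tiling chooses a layout, the sketch below picks the candidate canvas that keeps the most image pixels and wastes the least padding. It is modelled loosely on the selection idea used by tiling VLMs, not copied from any library. The candidate list, the function names, and the "tiles plus one overview image" token count are assumptions; real implementations also add or remove a few tokens for padding and row separators.

```python
def fit_inside(orig_w, orig_h, cand_w, cand_h):
    """Image size after an aspect-preserving resize into the candidate canvas."""
    if cand_w * orig_h <= cand_h * orig_w:  # width is the binding constraint
        return cand_w, orig_h * cand_w // orig_w
    return orig_w * cand_h // orig_h, cand_h


def pick_tiling(orig_w, orig_h, candidates, tile=336, tokens_per_tile=576):
    """Return (chosen canvas, approximate upper bound on visual tokens)."""
    best_key, best_canvas = None, None
    for cand_w, cand_h in candidates:
        fit_w, fit_h = fit_inside(orig_w, orig_h, cand_w, cand_h)
        effective = min(fit_w * fit_h, orig_w * orig_h)  # useful pixels kept
        wasted = cand_w * cand_h - effective             # padding pixels
        key = (effective, -wasted)
        if best_key is None or key > best_key:
            best_key, best_canvas = key, (cand_w, cand_h)
    tiles = (best_canvas[0] // tile) * (best_canvas[1] // tile)
    # One extra tile for the downscaled overview of the whole image.
    return best_canvas, (tiles + 1) * tokens_per_tile


# Assumed candidate canvases (width, height), for illustration only.
CANDIDATES = [(672, 672), (1344, 336), (336, 1344)]
print(pick_tiling(4000, 3000, CANDIDATES))  # landscape photo
print(pick_tiling(2000, 500, CANDIDATES))   # wide banner
```

A near-square photo lands on the square canvas, while a wide banner lands on the wide one, so the fixed token budget is spent where the image actually has content.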

Prompt Engineering for Visual Tasks

Structure prompts to specify: (1) what the image contains or what type it is, (2) what specific information you need, (3) the format of the answer. Example: "This image is a bar chart. Extract the numerical value for each bar and return them as a Python dictionary with bar labels as keys." is far more effective than "Read the chart."
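
The three-part structure can be captured in a small helper so that every request states the image type, the information needed, and the answer format. This is a minimal sketch; `build_visual_prompt` and its parameter names are hypothetical, and the wording template is only one reasonable choice.

```python
def build_visual_prompt(image_type: str, request: str, answer_format: str) -> str:
    """Assemble a prompt from (1) image type, (2) request, (3) answer format."""
    for name, value in (("image_type", image_type), ("request", request),
                        ("answer_format", answer_format)):
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    return " ".join([
        f"This image is {image_type.strip()}.",
        request.strip(),
        f"Return the answer as {answer_format.strip()}.",
    ])


prompt = build_visual_prompt(
    image_type="a bar chart",
    request="Extract the numerical value for each bar.",
    answer_format="a Python dictionary with bar labels as keys",
)
print(prompt)
```

Rejecting empty fields up front keeps vague prompts such as "Read the chart." from slipping into a production pipeline.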

When to Use VLMs vs Dedicated Models

Use VLMs when: the task requires combining visual evidence with reasoning or world knowledge, the task is too varied for a specialised model, or you need a conversational interface over visual content. Use dedicated models when: you need maximum accuracy on a well-defined narrow task (face detection, license plate OCR, medical image segmentation), latency is critical, or cost per image must be minimised.

Evaluation with Visual Benchmarks

Standard benchmarks for VLM evaluation:

  • MMBench: Multi-task visual understanding benchmark with objective multiple-choice questions across 20 ability dimensions.
  • MMMU: Massive Multidisciplinary Multimodal Understanding. College-level questions across 30 subjects requiring domain expertise and visual reasoning.
  • TextVQA: Questions that require reading and reasoning about text within images. Specifically targets OCR capability integrated with language understanding.
  • GQA: Real-world visual reasoning with compositional questions and scene graphs for structural evaluation.
  • MME: Perception and cognition benchmarks with binary yes/no answers, measuring specific fine-grained capabilities.
  • POPE: Polling-based Object Probing Evaluation, specifically designed to measure object hallucination rates.
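
Because POPE and MME reduce to binary yes/no answers, scoring them on your own data is simple. The sketch below computes the usual summary statistics for a yes/no object-probing run, treating "yes" as the positive class. The function name and the sample answers are illustrative, not taken from any benchmark release.

```python
def yes_no_metrics(predictions, labels):
    """Accuracy, precision, recall, F1 and yes-ratio for yes/no answers."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A yes-ratio far above the true rate suggests the model says
        # "yes" to objects that are not in the image.
        "yes_ratio": (tp + fp) / len(pairs),
    }


# Made-up answers to "Is there a <object> in the image?" questions.
preds = ["yes", "yes", "no", "yes", "no", "no"]
truth = ["yes", "no", "no", "yes", "yes", "no"]
print(yes_no_metrics(preds, truth))
```

Tracking the yes-ratio alongside accuracy helps separate a model that hallucinates objects from one that is merely uncertain.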

Frequently Asked Questions

How is GPT-4o different from GPT-4V?

GPT-4V (the visual capability of GPT-4 Turbo) was an adaptation of GPT-4 to accept images, likely using a connector-based approach. GPT-4o was trained natively as a multimodal model from pretraining, processing images, text, and audio in a unified architecture. The key practical differences: GPT-4o is significantly faster (optimised for real-time use), has lower per-token cost, handles higher-resolution images more effectively, and supports native audio input/output in addition to vision. GPT-4o also reportedly tokenises images into discrete visual tokens natively rather than mapping through a separate encoder, enabling tighter integration between modalities.

Why do VLMs hallucinate about images?

VLM hallucination has several root causes. First, the language model backbone has strong prior distributions over co-occurring concepts: if the visual context suggests a kitchen, the model's language priors strongly prefer "refrigerator", "sink", "counter" over unusual objects even if they are not present. Second, the vision encoder produces continuous, compressed representations that lose fine-grained detail: two objects that look different to a human may produce similar patch embeddings. Third, training data often contains noisy or incomplete image-caption pairs, so the model learns to generate plausible descriptions rather than accurate ones. Fourth, the projection layer may not perfectly convey spatial and attribute information from vision to language space. Addressing hallucination requires dedicated training techniques (such as RLHF-V, which fine-tunes on human corrections of hallucinated outputs) and careful evaluation with hallucination benchmarks such as POPE.

Can VLMs understand video?

Yes, with varying approaches. The simplest method is to sample N frames from a video and concatenate their visual tokens, treating the video as a long image sequence. This is the approach used by Video-LLaVA, Video-ChatGPT, and similar models. The limitation is context length: even at 1 frame per second, a 30-second video produces 17,280 visual tokens at LLaVA's standard token count (30 frames x 576 tokens). Long-context models (Gemini 1.5 Pro with 1M token context) handle this better. More specialised video VLMs use temporal encoding mechanisms or hierarchical frame sampling to handle longer videos efficiently. Gemini 1.5 Pro accepts video files directly through its API; with GPT-4o's API, the usual approach is to sample frames and send them as a sequence of images.
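
The frame-sampling cost is easy to estimate before committing to a sampling rate. The sketch below assumes every sampled frame costs the same fixed number of visual tokens (576, as for LLaVA-1.5) and ignores text tokens. Both function names are illustrative.

```python
def video_visual_tokens(duration_s: float, fps: float,
                        tokens_per_frame: int = 576) -> int:
    """Visual tokens for a clip sampled at `fps` frames per second."""
    frames = int(duration_s * fps)
    return frames * tokens_per_frame


def max_frames_for_budget(token_budget: int, tokens_per_frame: int = 576) -> int:
    """Largest number of frames that fits in a given visual-token budget."""
    return token_budget // tokens_per_frame


print(video_visual_tokens(30, 1))        # 30 frames * 576 = 17280
print(max_frames_for_budget(1_000_000))  # frames that fit in a 1M-token context
```

Working backwards from the token budget to a frame count is usually the more practical direction, since the context window is fixed and the sampling rate is the free parameter.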

How many tokens does one image use?

It depends on the model and resolution. Reference values: LLaVA-1.5 at 336x336 uses 576 tokens (24x24 patches). LLaVA-NeXT with dynamic tiling at high resolution can use up to 2880 tokens per image (5 tiles of 576 each). BLIP-2 with Q-Former uses 32 tokens regardless of resolution. GPT-4V/4o charges a flat 85 tokens in low-detail mode; in high-detail mode it charges 85 base tokens plus 170 tokens per 512x512 tile after resizing the image so that its shortest side is 768 pixels (so a 1024x1024 image becomes 768x768, covers 4 tiles, and costs 4 x 170 + 85 = 765 tokens). For Claude, Anthropic's documentation gives an approximate estimate of (width x height) / 750 tokens, so cost scales with image area; the API accepts images up to 8000x8000 pixels, and large images are downscaled before processing.
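
The 765-token figure for a 1024x1024 image can be reproduced with a short calculation. The sketch below follows the tile-based rule as OpenAI has documented it for GPT-4V/4o high-detail mode: fit the image within 2048x2048, shrink it so the shortest side is at most 768 pixels, then charge 170 tokens per 512x512 tile plus 85 base tokens. The function name is hypothetical, the rule that small images are never upscaled is an assumption, and pricing rules differ between models and change over time, so treat the output as an estimate.

```python
import math


def high_detail_image_tokens(width: int, height: int, base: int = 85,
                             per_tile: int = 170, tile: int = 512) -> int:
    """Estimate high-detail image tokens under the tile-based pricing rule."""
    w, h = width, height
    # Step 1: fit within a 2048x2048 square, preserving aspect ratio.
    if max(w, h) > 2048:
        scale = 2048 / max(w, h)
        w, h = round(w * scale), round(h * scale)
    # Step 2: shrink so the shortest side is 768 px (assumed: never upscale).
    if min(w, h) > 768:
        scale = 768 / min(w, h)
        w, h = round(w * scale), round(h * scale)
    # Step 3: count the 512x512 tiles needed to cover the resized image.
    tiles = math.ceil(w / tile) * math.ceil(h / tile)
    return base + per_tile * tiles


print(high_detail_image_tokens(1024, 1024))  # 4 tiles -> 765
print(high_detail_image_tokens(2048, 4096))  # 6 tiles -> 1105
```

Because the cost depends on tile count rather than raw pixel count, cropping an image so it falls just under a tile boundary can noticeably reduce the per-image cost.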

Is CLIP the best vision encoder?

CLIP ViT-L/14 is the most commonly used encoder for open-source VLMs due to its strong semantic alignment and wide availability, but it is not universally the best. EVA-CLIP (from BAAI) is a stronger encoder with better performance on dense prediction tasks and is used in InstructBLIP and some LLaVA-NeXT variants. SigLIP (Google, sigmoid loss variant of CLIP) shows better performance on image-text retrieval and is used in PaliGemma. For domain-specific applications, specialised encoders (medical image encoders, satellite imagery encoders) will outperform general-purpose CLIP on their target domain. The trend in 2025-2026 is toward larger encoders (ViT-G, ViT-H class) trained on more diverse data.


References

  • Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning (LLaVA). NeurIPS 2023. arXiv:2304.08485
  • Liu, H., Li, C., Li, Y., & Lee, Y. J. (2024). Improved Baselines with Visual Instruction Tuning (LLaVA-1.5). CVPR 2024. arXiv:2310.03744
  • Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021. arXiv:2103.00020
  • Alayrac, J. B., Donahue, J., Luc, P., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022. arXiv:2204.14198
  • Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023. arXiv:2301.12597
  • Dai, W., Li, J., Li, D., et al. (2023). InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. NeurIPS 2023. arXiv:2305.06500
  • OpenAI. (2023). GPT-4V(ision) System Card. openai.com
  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). ICLR 2021. arXiv:2010.11929
  • Liu, H., Li, C., Li, Y., et al. (2024). LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. llava-vl.github.io

Key Takeaways

  • VLMs require three components: a vision encoder (converts pixels to semantic patch embeddings), a projection layer (maps vision space to language space), and an LLM backbone (reasons over the combined token sequence). Each component's quality limits overall performance.
  • CLIP is the dominant vision backbone for open-source VLMs because its contrastive training produces image representations that are already semantically aligned with language, making the projection learning task tractable.
  • A 336x336 image becomes 576 visual tokens in LLaVA-1.5's pipeline. This token cost is a first-class engineering concern: it determines context window usage, inference latency, and API cost. Dynamic tiling (LLaVA-NeXT) and Q-Former compression (BLIP-2) are the two main strategies for managing it.
  • Two-stage training is the standard recipe: Stage 1 aligns the projection layer by training on image-caption pairs with the LLM frozen; Stage 2 instills instruction-following via multimodal conversation data with the full model (or LoRA adapters) unfrozen. Skipping Stage 1 leads to poor alignment.
  • Hallucination is structural, not accidental. The language model's strong priors over plausible visual scenes can override actual visual evidence, especially for fine-grained counts, attributes, and spatial relationships. POPE is the standard benchmark for measuring hallucination rates.
  • The right tool for the task matters: Use VLMs for tasks requiring visual reasoning plus language understanding. Use dedicated OCR, detection, or segmentation models for tasks requiring maximum precision on narrow, well-defined visual subtasks. The best production systems often combine both.
