AI/ML/NLP · December 12, 2025 · 12 min read

Vision-Language Models Explained: GPT-4V, LLaVA, Claude & Multimodal AI Architecture

Understand how GPT-4V, LLaVA, and Claude see images. Learn VLM architecture: vision encoders, multimodal fusion, and how to reduce hallucination in production.


The convergence of computer vision and natural language processing has produced one of the most transformative developments in artificial intelligence: vision-language models (VLMs). These systems can describe images, answer questions about visual content, and reason across both modalities simultaneously. As we close out 2025, VLMs have become foundational to applications ranging from accessibility tools to scientific discovery.

This article provides a technical deep-dive into VLM architectures, examining the design choices that matter, the mathematics that power them, and the challenges that remain—particularly the stubborn problem of hallucination.

[Interactive demo: Vision Transformer patch visualization, showing how images are divided into patches for transformer processing]

The Architecture of Vision-Language Models

A vision-language model typically consists of three core components: a vision encoder that processes images into meaningful representations, a language model that handles text generation and understanding, and a projection mechanism that bridges these two modalities. Recent systematic studies on what matters when building vision-language models have revealed that careful design choices at each stage significantly impact downstream performance.
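
To make this three-part structure concrete, here is a minimal, illustrative PyTorch sketch of the pipeline: a stand-in vision encoder, a linear projector, and a stand-in language model. The module names, dimensions, and layer counts are placeholders, not any particular model's architecture.

```python
# Illustrative three-part VLM pipeline: vision encoder -> projector -> language model.
# All components are stand-ins with toy sizes; this is a sketch, not a real model.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, d_vision=768, d_lang=1024, vocab_size=32000):
        super().__init__()
        # 1) Vision encoder: stands in for a ViT that outputs patch features.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_vision, nhead=8, batch_first=True),
            num_layers=2,
        )
        # 2) Projection: maps visual features into the language embedding space.
        self.projector = nn.Linear(d_vision, d_lang)
        # 3) Language model: stands in for a decoder-only LLM (no causal mask here).
        self.text_embed = nn.Embedding(vocab_size, d_lang)
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_lang, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_lang, vocab_size)

    def forward(self, patch_embeddings, text_tokens):
        vision_feats = self.vision_encoder(patch_embeddings)    # (B, N_v, d_vision)
        vision_tokens = self.projector(vision_feats)            # (B, N_v, d_lang)
        text_embeds = self.text_embed(text_tokens)              # (B, N_t, d_lang)
        fused = torch.cat([vision_tokens, text_embeds], dim=1)  # prepend image tokens
        hidden = self.language_model(fused)                     # (B, N_v + N_t, d_lang)
        text_hidden = hidden[:, vision_tokens.shape[1]:, :]     # keep text positions
        return self.lm_head(text_hidden)                        # next-token logits

# Usage with dummy inputs
model = TinyVLM()
patches = torch.randn(2, 196, 768)           # 14x14 patches from a 224x224 image
tokens = torch.randint(0, 32000, (2, 16))
logits = model(patches, tokens)              # (2, 16, 32000)
```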

Vision Encoders: From Pixels to Patches

The dominant approach for visual encoding in modern VLMs is the Vision Transformer (ViT). Unlike convolutional neural networks that process images through hierarchical local filters, ViT treats an image as a sequence of patches, applying the same self-attention mechanism that revolutionized NLP.

Given an input image $I \in \mathbb{R}^{H \times W \times C}$ with height $H$, width $W$, and $C$ color channels, the image is first divided into $N$ non-overlapping patches of size $P \times P$. Each patch is flattened into a vector and linearly projected into an embedding space:

$$z_i = W_p \cdot \text{flatten}(x_i) + b_p$$

where $W_p \in \mathbb{R}^{D \times (P^2 \cdot C)}$ is the patch embedding projection matrix, $D$ is the embedding dimension, and $x_i$ represents the $i$-th image patch. The number of patches is $N = \frac{H \times W}{P^2}$.

Position information is injected through learnable position embeddings added to each patch embedding:

$$z_i^{(0)} = z_i + E_{pos}[i]$$

where $E_{pos} \in \mathbb{R}^{N \times D}$ contains the position embeddings. A special classification token $[\text{CLS}]$ is prepended to the sequence, and the final representation at this position serves as the global image embedding.
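
The patchification and embedding steps above can be written compactly in PyTorch. The following is a minimal sketch with illustrative hyperparameters (224x224 input, 16x16 patches, 768-dimensional embeddings); it is not the code of any specific ViT implementation.

```python
# Minimal ViT-style patch embedding following the equations above.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (image_size // patch_size) ** 2            # N = H*W / P^2
        # W_p: linear projection of each flattened P*P*C patch to D dimensions
        self.proj = nn.Linear(patch_size * patch_size * in_channels, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # [CLS] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # E_pos

    def forward(self, images):                                   # (B, C, H, W)
        B, C, H, W = images.shape
        P = self.patch_size
        # Split into non-overlapping P x P patches and flatten each one
        patches = images.unfold(2, P, P).unfold(3, P, P)         # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        z = self.proj(patches)                                   # z_i = W_p * flatten(x_i) + b_p
        cls = self.cls_token.expand(B, -1, -1)
        z = torch.cat([cls, z], dim=1) + self.pos_embed          # add position embeddings
        return z                                                 # (B, N+1, D)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 197, 768])
```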

Dynamic Tiling: Handling Variable Resolutions

A significant limitation of vanilla ViT is its fixed input resolution. Real-world images vary dramatically in size and aspect ratio, and naive resizing can destroy fine-grained details crucial for tasks like document understanding or small object recognition.

DeepSeek-VL2 introduces dynamic tiling as an elegant solution. Rather than forcing all images to a fixed resolution, the approach tiles high-resolution images into multiple sub-images, processes each tile independently through the vision encoder, and then combines the resulting representations.

For an image with dimensions $H \times W$, the dynamic tiling strategy determines an optimal grid configuration $(n_h, n_w)$ that minimizes aspect ratio distortion while respecting computational constraints:

$$\text{minimize} \quad \left| \frac{H}{W} - \frac{n_h}{n_w} \right| \quad \text{subject to} \quad n_h \cdot n_w \leq N_{max}$$

Each tile is processed independently, yielding representations $\{z^{(1)}, z^{(2)}, \ldots, z^{(n_h \cdot n_w)}\}$, which are concatenated with positional information encoding their spatial arrangement in the original image.
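
A simplified version of this grid selection can be expressed as a brute-force search over tile layouts. The sketch below assumes a hypothetical tile budget `max_tiles` and only illustrates the objective above; it is not DeepSeek-VL2's exact tiling rule.

```python
# Illustrative dynamic-tiling grid search: pick (n_h, n_w) that best matches
# the image aspect ratio under a tile budget.
def choose_tile_grid(height, width, max_tiles=9):
    best, best_cost = (1, 1), float("inf")
    for n_h in range(1, max_tiles + 1):
        for n_w in range(1, max_tiles + 1):
            if n_h * n_w > max_tiles:                    # respect the compute budget
                continue
            cost = abs(height / width - n_h / n_w)       # aspect-ratio distortion
            if cost < best_cost:
                best, best_cost = (n_h, n_w), cost
    return best

# A tall 1344x896 document page prefers a 3x2 grid over naive square tiling.
print(choose_tile_grid(1344, 896, max_tiles=9))   # (3, 2)
```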

Multimodal Fusion: Bridging Vision and Language

The critical challenge in VLM design is effectively combining visual and textual representations. Several fusion strategies have emerged, each with distinct trade-offs.

Projection-Based Fusion

The simplest approach uses a learned projection layer to map visual features into the language model's embedding space. Given visual features $V \in \mathbb{R}^{N_v \times D_v}$ from the vision encoder:

$$V' = W_{proj} \cdot V + b_{proj}$$

where $W_{proj} \in \mathbb{R}^{D_l \times D_v}$ projects from the visual dimension $D_v$ to the language model dimension $D_l$. The projected visual tokens are then concatenated with text embeddings and processed by the language model.
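
In code, projection-based fusion amounts to a small projection module followed by token concatenation. The sketch below uses a two-layer MLP projector (a common choice in LLaVA-style models) with illustrative dimensions; it is not the implementation of any specific system.

```python
# Sketch of projection-based fusion: project visual features into the LLM
# embedding space and prepend them to the text embeddings.
import torch
import torch.nn as nn

d_vision, d_lang = 1024, 4096                     # illustrative dimensions
projector = nn.Sequential(                        # W_proj (here a 2-layer MLP)
    nn.Linear(d_vision, d_lang),
    nn.GELU(),
    nn.Linear(d_lang, d_lang),
)

visual_feats = torch.randn(1, 576, d_vision)      # N_v visual tokens from the encoder
text_embeds = torch.randn(1, 32, d_lang)          # text token embeddings from the LLM

visual_tokens = projector(visual_feats)           # V' = proj(V), shape (1, 576, 4096)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)   # fed to the language model
print(llm_input.shape)                            # torch.Size([1, 608, 4096])
```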

This approach is computationally efficient but limits the depth of cross-modal interaction. The language model must learn to interpret visual information through a relatively shallow transformation.

Cross-Attention Mechanisms

More sophisticated architectures introduce explicit cross-attention layers that allow the language model to selectively attend to relevant visual features. In cross-attention, the language model's hidden states serve as queries, while visual features provide keys and values:

$$\text{CrossAttn}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

where $Q = W_Q \cdot H_{\text{text}}$, $K = W_K \cdot H_{\text{vision}}$, and $V = W_V \cdot H_{\text{vision}}$. This allows each generated token to dynamically weight which visual regions are most relevant, enabling fine-grained grounding of language in visual content.
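
A minimal cross-attention layer of this kind can be built directly from a standard multi-head attention module, with text hidden states as queries and visual features as keys and values. The dimensions below are illustrative.

```python
# Cross-attention sketch matching the equations above: Q from text, K/V from vision.
import torch
import torch.nn as nn

class VisionTextCrossAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h_text, h_vision):
        # softmax(Q K^T / sqrt(d_k)) V, with Q from text and K, V from vision
        out, weights = self.attn(query=h_text, key=h_vision, value=h_vision)
        return out, weights   # weights show which visual tokens each text token attends to

h_text = torch.randn(1, 20, 512)       # 20 text positions
h_vision = torch.randn(1, 196, 512)    # 196 visual patch features
out, w = VisionTextCrossAttention()(h_text, h_vision)
print(out.shape, w.shape)              # (1, 20, 512) (1, 20, 196)
```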

Contrastive Pre-training: The CLIP Foundation

Many VLMs build upon representations learned through contrastive pre-training, most notably CLIP (Contrastive Language-Image Pre-training). The contrastive objective learns to align visual and textual representations in a shared embedding space.

Given a batch of $N$ image-text pairs, let $v_i$ and $t_i$ denote the normalized embeddings of the $i$-th image and text respectively. The contrastive loss encourages matching pairs to have high similarity while pushing non-matching pairs apart:

$$\mathcal{L}_i = -\log \frac{e^{s(v_i, t_i)/\tau}}{\sum_j e^{s(v_i, t_j)/\tau}}$$

where $s(v, t) = v^T t$ is the cosine similarity, and $\tau$ is a learned temperature parameter that controls the sharpness of the distribution. The symmetric loss is computed for both image-to-text and text-to-image directions.
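
Written in code, the symmetric contrastive objective is a pair of cross-entropy losses over a similarity matrix. The sketch below follows the formula above, with an illustrative fixed temperature rather than a learned one.

```python
# CLIP-style symmetric contrastive (InfoNCE) loss over a batch of pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot product is cosine similarity s(v, t) = v^T t
    v = F.normalize(image_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = v @ t.T / temperature                     # (N, N) similarity matrix
    targets = torch.arange(len(v))                     # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)      # text -> image direction
    return (loss_i2t + loss_t2i) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```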

This pre-training creates a foundation where semantically similar images and texts cluster together, providing strong initialization for downstream VLM training.

Mixture-of-Experts for Efficient Scaling

Recent work has demonstrated that Mixture-of-Experts (MoE) architectures can dramatically improve the capacity of VLMs without proportional increases in computational cost. DeepSeek-VL2 combines dynamic tiling vision encoding with an MoE language model, where different experts specialize in different types of visual-linguistic reasoning.

In an MoE layer, a gating network $G(x)$ routes each token to a subset of $k$ experts from a pool of $E$ total experts:

$$\text{MoE}(x) = \sum_{i \in \text{TopK}(G(x))} G(x)_i \cdot E_i(x)$$

This sparse activation allows the model to maintain a large parameter count while keeping inference costs tractable—a crucial consideration for deploying VLMs in production.
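
The routing equation translates into a small amount of code: score every expert, keep the top-k, and combine their outputs with the renormalized gate weights. The sketch below is loop-based for readability and is purely illustrative; production MoE layers (including DeepSeek-VL2's) use sparse, batched expert dispatch.

```python
# Minimal top-k MoE layer matching the routing equation above.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)            # gating network G(x)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                     # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)                 # G(x)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)     # TopK(G(x))
        topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):                            # dense loops for clarity
            idx = topk_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e                               # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_vals[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 256)
print(MoELayer()(tokens).shape)    # torch.Size([16, 256])
```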

The Hallucination Problem

Despite remarkable progress, VLMs suffer from a persistent and troubling failure mode: hallucination. Models confidently describe objects, attributes, or relationships that simply do not exist in the input image. This is not merely an academic concern—hallucination undermines trust and limits deployment in high-stakes applications.

Understanding Hallucination

A comprehensive evaluation of object hallucination in large vision-language models systematically categorized this phenomenon. Hallucinations can be classified into several types:

Object hallucination: The model describes objects not present in the image. When asked to describe a bedroom, a model might mention a television that does not exist.

Attribute hallucination: The model assigns incorrect properties to real objects. A red car might be described as blue, or a standing person as sitting.

Relationship hallucination: The model fabricates spatial or semantic relationships. Two unrelated objects might be described as interacting.

The study found that hallucination rates vary significantly across object categories, with models more likely to hallucinate objects that frequently co-occur in training data. If bedrooms typically contain televisions in the training set, the model develops a statistical prior that can override visual evidence.
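
A simple way to quantify object hallucination, in the spirit of CHAIR-style metrics, is to compare the objects a caption mentions against a ground-truth object list. The function below is an illustrative sketch, not the evaluation protocol of the study above; the object vocabulary and string matching are deliberately naive.

```python
# Illustrative object-hallucination rate: fraction of mentioned objects that
# do not appear in the image's ground-truth object list.
def object_hallucination_rate(caption, ground_truth_objects, vocabulary):
    mentioned = {obj for obj in vocabulary if obj in caption.lower()}
    if not mentioned:
        return 0.0
    hallucinated = mentioned - set(ground_truth_objects)
    return len(hallucinated) / len(mentioned)

caption = "A bedroom with a bed, a lamp, and a television."
rate = object_hallucination_rate(
    caption,
    ground_truth_objects=["bed", "lamp", "window"],
    vocabulary=["bed", "lamp", "television", "window", "chair"],
)
print(rate)   # 0.33... since "television" is hallucinated
```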

Sources of Hallucination

Several factors contribute to hallucination in VLMs:

Language model priors: Pre-trained language models carry strong statistical associations learned from text corpora. These priors can dominate when visual evidence is ambiguous or when the model is uncertain.

Training data biases: If certain object combinations appear frequently in training, the model learns spurious correlations. The presence of object A makes the model predict object B, regardless of whether B is actually visible.

Vision encoder limitations: When the vision encoder fails to capture fine-grained details—due to resolution limits, occlusion, or poor feature extraction—the language model fills gaps with plausible but incorrect content.

Instruction tuning artifacts: The fine-tuning process that makes models helpful can inadvertently encourage confident generation even when appropriate epistemic humility would be better.

Mitigation Strategies

Research has produced several approaches to reduce hallucination:

Improved visual grounding: Training objectives that explicitly require models to ground generated text in specific image regions. This creates accountability for each generated claim.

Contrastive decoding: Comparing model outputs with and without visual input, downweighting tokens that would be generated based on language priors alone (a minimal sketch appears after these strategies).

Calibration training: Teaching models to express uncertainty appropriately, saying "I cannot determine" when visual evidence is insufficient rather than guessing.

RLHF with hallucination penalties: Reinforcement learning from human feedback specifically targeting hallucinated content, training models to prefer accurate descriptions over confident fabrications.
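
As an illustration of the contrastive decoding idea, the sketch below adjusts next-token logits by subtracting a weighted copy of the logits obtained without the image, so tokens favored purely by language priors are down-weighted. The logit values and the weight `alpha` are hypothetical; this is a simplification of published contrastive decoding methods, not any model's actual decoder.

```python
# Contrastive decoding sketch: penalize tokens whose probability comes from
# language priors (logits without the image) rather than visual evidence.
import torch

def contrastive_decode_step(logits_with_image, logits_text_only, alpha=1.0):
    adjusted = (1 + alpha) * logits_with_image - alpha * logits_text_only
    return adjusted.argmax(dim=-1)                          # next-token choice

# Dummy logits over a 5-token vocabulary for one generation step.
with_image = torch.tensor([2.0, 0.5, 1.0, 0.1, 0.0])        # from the full VLM
text_only  = torch.tensor([0.2, 0.4, 2.5, 0.1, 0.0])        # image withheld: prior alone
print(contrastive_decode_step(with_image, text_only))       # picks token 0, not prior-driven token 2
```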

Universal Multimodal Embeddings

Beyond generation tasks, VLMs enable powerful retrieval and similarity search across modalities. VLM2Vec extends this capability by training vision-language models to produce universal multimodal embeddings suitable for massive embedding tasks.

The approach converts any multimodal input—images, text, or combinations—into a single dense vector representation. These embeddings support:

  • Cross-modal retrieval: Finding images given text queries, or finding relevant text given an image
  • Multimodal similarity: Comparing image-text pairs to other image-text pairs
  • Zero-shot classification: Classifying images by comparing embeddings to textual class descriptions

The training objective combines multiple embedding tasks, creating representations that transfer across domains and applications without task-specific fine-tuning.
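
The sketch below illustrates how such embeddings are used for cross-modal retrieval and zero-shot classification: every input is mapped into one shared space and compared by cosine similarity. The `embed_image` and `embed_text` functions are random stubs standing in for a VLM2Vec-style encoder, so the example runs end to end without model weights.

```python
# Zero-shot classification via a shared embedding space (stub encoders).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
def embed_image(images): return F.normalize(torch.randn(len(images), 512), dim=-1)
def embed_text(texts): return F.normalize(torch.randn(len(texts), 512), dim=-1)

# Compare one image embedding against textual class descriptions.
class_names = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image_vec = embed_image(["query.jpg"])           # (1, 512)
class_vecs = embed_text(class_names)             # (3, 512)
scores = image_vec @ class_vecs.T                # cosine similarities
print(class_names[scores.argmax().item()])       # highest-similarity class wins
```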

Design Choices That Matter

The systematic study on building vision-language models identified several critical design decisions:

Vision encoder selection: Larger vision encoders consistently improve performance, but the relationship is not linear. Careful selection of encoder architecture and pre-training data matters as much as scale.

Resolution and token count: Higher resolution images provide more information but increase computational cost quadratically. Dynamic tiling offers a practical compromise.

Fusion depth: Deeper integration between vision and language (more cross-attention layers) improves performance on tasks requiring fine-grained reasoning but increases training complexity.

Training data mixture: The balance between image-text pairs, interleaved documents, and pure text data significantly affects both capability and robustness.

Instruction tuning format: How visual inputs are referenced in instructions—and the diversity of instruction formats during training—strongly influences generalization.

Looking Forward

As 2025 draws to a close, vision-language models have established themselves as a cornerstone of modern AI systems. The combination of dynamic tiling for resolution flexibility, mixture-of-experts for efficient scaling, and sophisticated fusion mechanisms has produced models capable of remarkable multimodal reasoning.

Yet challenges remain. Hallucination continues to limit reliability in high-stakes applications. Computational costs, while improved through sparse architectures, still restrict deployment. And the field is only beginning to explore three-way fusion with additional modalities like audio and video.

The mathematics of VLMs—from patch embeddings to contrastive losses to cross-attention—provides the foundation for understanding both current capabilities and future directions. As researchers continue to refine these architectures, we move closer to AI systems that perceive and reason about the world with the integrated multimodal understanding that comes naturally to humans.

The pixels and words are learning to speak the same language.

