The convergence of computer vision and natural language processing has produced one of the most transformative developments in artificial intelligence: vision-language models (VLMs). These systems can describe images, answer questions about visual content, and reason across both modalities simultaneously. As we close out 2025, VLMs have become foundational to applications ranging from accessibility tools to scientific discovery.
This article provides a technical deep-dive into VLM architectures, examining the design choices that matter, the mathematics that power them, and the challenges that remain—particularly the stubborn problem of hallucination.
The Architecture of Vision-Language Models
A vision-language model typically consists of three core components: a vision encoder that processes images into meaningful representations, a language model that handles text generation and understanding, and a projection mechanism that bridges these two modalities. Recent systematic studies on what matters when building vision-language models have revealed that careful design choices at each stage significantly impact downstream performance.
Vision Encoders: From Pixels to Patches
The dominant approach for visual encoding in modern VLMs is the Vision Transformer (ViT). Unlike convolutional neural networks that process images through hierarchical local filters, ViT treats an image as a sequence of patches, applying the same self-attention mechanism that revolutionized NLP.
Given an input image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ with height $H$, width $W$, and $C$ color channels, the image is first divided into non-overlapping patches of size $P \times P$. Each patch is flattened into a vector and linearly projected into an embedding space:

$$\mathbf{z}_i = \mathbf{E}\,\mathbf{x}_i, \qquad \mathbf{x}_i \in \mathbb{R}^{P^2 C},$$

where $\mathbf{E} \in \mathbb{R}^{D \times P^2 C}$ is the patch embedding projection matrix, $D$ is the embedding dimension, and $\mathbf{x}_i$ represents the $i$-th image patch. The number of patches is $N = HW/P^2$.
Position information is injected through learnable position embeddings added to each patch embedding:

$$\mathbf{Z}_0 = [\mathbf{z}_{\mathrm{cls}};\, \mathbf{z}_1;\, \dots;\, \mathbf{z}_N] + \mathbf{P},$$

where $\mathbf{P} \in \mathbb{R}^{(N+1) \times D}$ contains the position embeddings. A special classification token $\mathbf{z}_{\mathrm{cls}}$ is prepended to the sequence, and the final representation at this position serves as the global image embedding.
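As a concrete illustration, here is a minimal PyTorch sketch of the patchify-and-project step described above; the 224-pixel input, 16-pixel patches, and 768-dimensional embeddings are illustrative defaults rather than values from any particular model.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # N = HW / P^2
        # A stride-P convolution is equivalent to flattening each P x P patch
        # and applying the shared linear projection E.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                     # prepended [CLS] token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # learnable positions

    def forward(self, images):                     # images: (B, C, H, W)
        x = self.proj(images)                      # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)           # (B, N, dim) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)             # prepend classification token
        return x + self.pos_embed                  # add position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```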
Dynamic Tiling: Handling Variable Resolutions
A significant limitation of vanilla ViT is its fixed input resolution. Real-world images vary dramatically in size and aspect ratio, and naive resizing can destroy fine-grained details crucial for tasks like document understanding or small object recognition.
DeepSeek-VL2 introduces dynamic tiling as an elegant solution. Rather than forcing all images to a fixed resolution, the approach tiles high-resolution images into multiple sub-images, processes each tile independently through the vision encoder, and then combines the resulting representations.
For an image with dimensions $H \times W$, the dynamic tiling strategy determines an optimal grid configuration $(n_h^*, n_w^*)$ that minimizes aspect ratio distortion while respecting computational constraints:

$$(n_h^*, n_w^*) = \arg\min_{(n_h,\, n_w)} \left| \frac{W}{H} - \frac{n_w}{n_h} \right| \quad \text{subject to} \quad n_h\, n_w \le T_{\max},$$

where $T_{\max}$ is the maximum number of tiles permitted by the computational budget.
Each tile is processed independently, yielding representations $\{\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_{n_h n_w}\}$, which are concatenated with positional information encoding their spatial arrangement in the original image.
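A rough sketch of this kind of tiling logic is below; the tile size, candidate grids, and tile budget are assumptions for illustration, not the exact procedure used by DeepSeek-VL2.

```python
from PIL import Image

def choose_grid(height, width, max_tiles=9):
    """Pick the tile grid (rows, cols) whose aspect ratio best matches the image."""
    target = width / height
    candidates = [(r, c) for r in range(1, max_tiles + 1)
                  for c in range(1, max_tiles + 1) if r * c <= max_tiles]
    return min(candidates, key=lambda rc: abs(target - rc[1] / rc[0]))

def tile_image(img: Image.Image, tile_size=384, max_tiles=9):
    rows, cols = choose_grid(img.height, img.width, max_tiles)
    resized = img.resize((cols * tile_size, rows * tile_size))   # minimal aspect-ratio distortion
    tiles = [resized.crop((c * tile_size, r * tile_size,
                           (c + 1) * tile_size, (r + 1) * tile_size))
             for r in range(rows) for c in range(cols)]
    return tiles, (rows, cols)   # each tile goes through the vision encoder independently
```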
Multimodal Fusion: Bridging Vision and Language
The critical challenge in VLM design is effectively combining visual and textual representations. Several fusion strategies have emerged, each with distinct trade-offs.
Projection-Based Fusion
The simplest approach uses a learned projection layer to map visual features into the language model's embedding space. Given visual features $\mathbf{V} \in \mathbb{R}^{N \times D_v}$ from the vision encoder:

$$\mathbf{H}_v = \mathbf{V}\,\mathbf{W}_p,$$

where $\mathbf{W}_p \in \mathbb{R}^{D_v \times D_t}$ projects from the visual dimension $D_v$ to the language model dimension $D_t$. The projected visual tokens are then concatenated with text embeddings and processed by the language model.
This approach is computationally efficient but limits the depth of cross-modal interaction. The language model must learn to interpret visual information through a relatively shallow transformation.
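In code, projection-based fusion amounts to a single linear map followed by sequence concatenation; the dimensions and token counts below are illustrative.

```python
import torch
import torch.nn as nn

d_vision, d_lm = 1024, 4096
proj = nn.Linear(d_vision, d_lm)               # W_p: maps visual features into the LM embedding space

visual_feats = torch.randn(1, 576, d_vision)   # (batch, num_visual_tokens, D_v) from the vision encoder
text_embeds  = torch.randn(1, 32, d_lm)        # (batch, num_text_tokens, D_t) from the LM's embedding table

visual_tokens = proj(visual_feats)                          # (1, 576, 4096)
lm_input = torch.cat([visual_tokens, text_embeds], dim=1)   # concatenated sequence fed to the language model
print(lm_input.shape)  # torch.Size([1, 608, 4096])
```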
Cross-Attention Mechanisms
More sophisticated architectures introduce explicit cross-attention layers that allow the language model to selectively attend to relevant visual features. In cross-attention, the language model's hidden states $\mathbf{H}_t$ serve as queries, while visual features $\mathbf{H}_v$ provide keys and values:

$$\mathrm{CrossAttn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V},$$

where $\mathbf{Q} = \mathbf{H}_t\mathbf{W}_Q$, $\mathbf{K} = \mathbf{H}_v\mathbf{W}_K$, and $\mathbf{V} = \mathbf{H}_v\mathbf{W}_V$. This allows each generated token to dynamically weight which visual regions are most relevant, enabling fine-grained grounding of language in visual content.
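A single-head sketch of this cross-attention step follows; real models use multiple heads and interleave these layers with self-attention, and the shapes here are illustrative.

```python
import math
import torch
import torch.nn as nn

d_model, d_k = 4096, 128
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

H_text   = torch.randn(1, 32, d_model)    # language model hidden states (queries)
H_visual = torch.randn(1, 576, d_model)   # visual features (keys and values)

Q, K, V = W_q(H_text), W_k(H_visual), W_v(H_visual)
attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)  # (1, 32, 576)
out = attn @ V          # each text token becomes a weighted mix of visual features
```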
Contrastive Pre-training: The CLIP Foundation
Many VLMs build upon representations learned through contrastive pre-training, most notably CLIP (Contrastive Language-Image Pre-training). The contrastive objective learns to align visual and textual representations in a shared embedding space.
Given a batch of $N$ image-text pairs, let $\mathbf{u}_i$ and $\mathbf{t}_i$ denote the normalized embeddings of the $i$-th image and text respectively. The contrastive loss encourages matching pairs to have high similarity while pushing non-matching pairs apart:

$$\mathcal{L}_{i \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(\mathbf{u}_i, \mathbf{t}_i)/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(\mathbf{u}_i, \mathbf{t}_j)/\tau\big)},$$

where $\mathrm{sim}(\mathbf{u}, \mathbf{t}) = \mathbf{u}^\top\mathbf{t}$ is the cosine similarity of the normalized embeddings, and $\tau$ is a learned temperature parameter that controls the sharpness of the distribution. The symmetric loss $\mathcal{L} = \tfrac{1}{2}\,(\mathcal{L}_{i \to t} + \mathcal{L}_{t \to i})$ is computed for both image-to-text and text-to-image directions.
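A compact sketch of the symmetric contrastive loss, assuming a fixed rather than learned temperature for simplicity:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so that dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (N, N) similarity matrix
    targets = torch.arange(image_emb.size(0))           # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i2t + loss_t2i) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```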
This pre-training creates a foundation where semantically similar images and texts cluster together, providing strong initialization for downstream VLM training.
Mixture-of-Experts for Efficient Scaling
Recent work has demonstrated that Mixture-of-Experts (MoE) architectures can dramatically improve the capacity of VLMs without proportional increases in computational cost. DeepSeek-VL2 combines dynamic tiling vision encoding with an MoE language model, where different experts specialize in different types of visual-linguistic reasoning.
In an MoE layer, a gating network routes each token to a subset of $k$ experts from a pool of $E$ total experts:

$$\mathbf{y} = \sum_{e \,\in\, \mathrm{TopK}(g(\mathbf{x}),\, k)} g_e(\mathbf{x})\, f_e(\mathbf{x}),$$

where $g(\mathbf{x})$ produces the gating weights and $f_e$ denotes the $e$-th expert network.
This sparse activation allows the model to maintain a large parameter count while keeping inference costs tractable—a crucial consideration for deploying VLMs in production.
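To make the routing concrete, here is a simplified top-k MoE layer; the expert count, expert width, and $k$ are illustrative, and production implementations add load-balancing losses and capacity limits omitted here.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim=1024, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)          # gating network g(x)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)                   # expert networks f_e
        )

    def forward(self, x):                                 # x: (num_tokens, dim)
        scores = self.gate(x)                             # (num_tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)        # route each token to its top-k experts
        weights = torch.softmax(weights, dim=-1)          # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens assigned to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask][:, slot:slot + 1] * expert(x[mask])
        return out

y = MoELayer()(torch.randn(16, 1024))
```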
The Hallucination Problem
Despite remarkable progress, VLMs suffer from a persistent and troubling failure mode: hallucination. Models confidently describe objects, attributes, or relationships that simply do not exist in the input image. This is not merely an academic concern—hallucination undermines trust and limits deployment in high-stakes applications.
Understanding Hallucination
A comprehensive evaluation of object hallucination in large vision-language models systematically categorized this phenomenon. Hallucinations can be classified into several types:
Object hallucination: The model describes objects not present in the image. When asked to describe a bedroom, a model might mention a television that does not exist.
Attribute hallucination: The model assigns incorrect properties to real objects. A red car might be described as blue, or a standing person as sitting.
Relationship hallucination: The model fabricates spatial or semantic relationships. Two unrelated objects might be described as interacting.
The study found that hallucination rates vary significantly across object categories, with models more likely to hallucinate objects that frequently co-occur in training data. If bedrooms typically contain televisions in the training set, the model develops a statistical prior that can override visual evidence.
Sources of Hallucination
Several factors contribute to hallucination in VLMs:
Language model priors: Pre-trained language models carry strong statistical associations learned from text corpora. These priors can dominate when visual evidence is ambiguous or when the model is uncertain.
Training data biases: If certain object combinations appear frequently in training, the model learns spurious correlations. The presence of object A makes the model predict object B, regardless of whether B is actually visible.
Vision encoder limitations: When the vision encoder fails to capture fine-grained details—due to resolution limits, occlusion, or poor feature extraction—the language model fills gaps with plausible but incorrect content.
Instruction tuning artifacts: The fine-tuning process that makes models helpful can inadvertently encourage confident generation even when appropriate epistemic humility would be better.
Mitigation Strategies
Research has produced several approaches to reduce hallucination:
Improved visual grounding: Training objectives that explicitly require models to ground generated text in specific image regions. This creates accountability for each generated claim.
Contrastive decoding: Comparing model outputs with and without visual input, downweighting tokens that would be generated based on language priors alone (a minimal sketch follows this list).
Calibration training: Teaching models to express uncertainty appropriately, saying "I cannot determine" when visual evidence is insufficient rather than guessing.
RLHF with hallucination penalties: Reinforcement learning from human feedback specifically targeting hallucinated content, training models to prefer accurate descriptions over confident fabrications.
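As a rough illustration of the contrastive decoding idea mentioned above, the sketch below adjusts next-token logits by subtracting a score from an image-free pass; the alpha weight and greedy selection are simplifying assumptions rather than a specific published recipe.

```python
import torch

def contrastive_decode_step(logits_with_image, logits_text_only, alpha=1.0):
    """Adjust next-token logits by penalizing tokens favored by the language prior alone."""
    log_p_vision = torch.log_softmax(logits_with_image, dim=-1)  # pass conditioned on the image
    log_p_prior = torch.log_softmax(logits_text_only, dim=-1)    # pass with the image removed
    adjusted = log_p_vision - alpha * log_p_prior                # downweight prior-driven tokens
    return adjusted.argmax(dim=-1)                               # greedy choice over adjusted scores

next_token = contrastive_decode_step(torch.randn(1, 32000), torch.randn(1, 32000))
```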
Universal Multimodal Embeddings
Beyond generation tasks, VLMs enable powerful retrieval and similarity search across modalities. VLM2Vec extends this capability by training vision-language models to produce universal multimodal embeddings suitable for massive embedding tasks.
The approach converts any multimodal input—images, text, or combinations—into a single dense vector representation. These embeddings support:
- Cross-modal retrieval: Finding images given text queries, or finding relevant text given an image
- Multimodal similarity: Comparing image-text pairs to other image-text pairs
- Zero-shot classification: Classifying images by comparing embeddings to textual class descriptions
The training objective combines multiple embedding tasks, creating representations that transfer across domains and applications without task-specific fine-tuning.
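A minimal sketch of cross-modal retrieval over such embeddings follows; the random vectors stand in for the outputs of a VLM2Vec-style encoder and are not produced by any real model here.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, candidate_embs, top_k=5):
    """Rank candidates (e.g. images) by cosine similarity to a query (e.g. text)."""
    query_emb = F.normalize(query_emb, dim=-1)
    candidate_embs = F.normalize(candidate_embs, dim=-1)
    scores = candidate_embs @ query_emb             # cosine similarity against every candidate
    return scores.topk(min(top_k, scores.numel()))  # (values, indices) of the best matches

# Stand-in embeddings; in practice both come from the shared multimodal encoder.
text_query = torch.randn(512)
image_corpus = torch.randn(1000, 512)
values, indices = retrieve(text_query, image_corpus)
```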
Design Choices That Matter
The systematic study on building vision-language models identified several critical design decisions:
Vision encoder selection: Larger vision encoders consistently improve performance, but the relationship is not linear. Careful selection of encoder architecture and pre-training data matters as much as scale.
Resolution and token count: Higher resolution images provide more information but increase computational cost quadratically. Dynamic tiling offers a practical compromise.
Fusion depth: Deeper integration between vision and language (more cross-attention layers) improves performance on tasks requiring fine-grained reasoning but increases training complexity.
Training data mixture: The balance between image-text pairs, interleaved documents, and pure text data significantly affects both capability and robustness.
Instruction tuning format: How visual inputs are referenced in instructions—and the diversity of instruction formats during training—strongly influences generalization.
Looking Forward
As 2025 draws to a close, vision-language models have established themselves as a cornerstone of modern AI systems. The combination of dynamic tiling for resolution flexibility, mixture-of-experts for efficient scaling, and sophisticated fusion mechanisms has produced models capable of remarkable multimodal reasoning.
Yet challenges remain. Hallucination continues to limit reliability in high-stakes applications. Computational costs, while improved through sparse architectures, still restrict deployment. And the field is only beginning to explore three-way fusion with additional modalities like audio and video.
The mathematics of VLMs—from patch embeddings to contrastive losses to cross-attention—provides the foundation for understanding both current capabilities and future directions. As researchers continue to refine these architectures, we move closer to AI systems that perceive and reason about the world with the integrated multimodal understanding that comes naturally to humans.
The pixels and words are learning to speak the same language.
