Multimodal Models

All sixteen sections are in draft status. Open problems are flagged inline and consolidated in §14.

This chapter is new; it has no direct AIMA 4e antecedent. AIMA’s treatment of perception (Ch 24-25) covered classical computer vision and speech processing as separate topics; the modern multimodal model unifies these (and more) into a single architecture that handles text, images, audio, video, code, structured data, and (in some cases) physical action. The shift is qualitative: a single foundation model handles cross-modal queries (“describe this image”, “convert this audio to text”, “generate code for this diagram”, “translate the speech in this video”) that previously required separate specialized systems.

The chapter consolidates multimodal-specific material that appears in many other chapters: cross-modal SSL (SSL §7), CLIP and contrastive multimodal training (SSL §5, Generative Models §3), multimodal generative models (Generative Models §9), computer-use agents (AI Agents §7), and the perception-side of robotic systems. This chapter develops the unified treatment.

The chapter assumes the Foundation Models chapter, the Large Language Models chapter (especially LLM §3 for Transformer architecture and §8 for tool use), the Self-Supervised Learning chapter (especially §7 for cross-modal SSL), and the Generative Models chapter for modality-specific generation.


Scope and What This Chapter Is About

The chapter develops multimodal models - neural networks that process and/or generate content across multiple modalities (text, images, audio, video, code, structured data, action). We cover the conceptual framing (cross-modal vs unimodal models; alignment vs fusion; native multimodality vs late fusion), the dominant architectures (CLIP-style contrastive models; encoder-fusion patterns; decoder-only multimodal Transformers; diffusion-based multimodal generators), the modality-specific patterns (vision-language, audio-language, video, vision-language-action), the training methodologies (multimodal pretraining, contrastive learning at scale, native interleaved training), the evaluation methodology, and the production landscape.

Approximate length target: 18,000–25,000 words (a major chapter - multimodal models are central to modern AI deployment).


§1. Motivation and Scope

Three worked instances

To anchor the chapter, three concrete instances spanning the modern multimodal landscape.

Instance 1: GPT-4o describes an image. A user uploads a photograph of a kitchen and asks “What’s wrong with this kitchen layout?”. GPT-4o (OpenAI, May 2024) - a native multimodal model - receives the image and the text query together, processes both jointly, and produces a text response identifying issues: “The refrigerator is positioned too far from the prep area; the sink-to-stove-to-fridge work triangle is broken; the island blocks workflow between the cooking and serving areas.” The response is grounded in the specific image, not generic kitchen advice.

Instance 2: Whisper transcribes a multilingual meeting. A user provides an audio recording of a meeting with speakers in English, Spanish, and Mandarin. Whisper (OpenAI, 2022) - a speech-to-text model trained on roughly 680,000 hours of multilingual audio - detects the spoken language, produces a timestamped transcript, and can translate the speech into English if requested. Whisper itself does not label speakers; identifying speaker turns requires a separate diarization step. The model handles the multilingual input without explicit per-language configuration.

Instance 3: π0 controls a robot arm. A user places a basket of laundry and a folding table in front of a robot arm controlled by π0 (Physical Intelligence, 2024-2025) and says “fold this laundry.” The model takes camera images of the scene, processes them jointly with the verbal instruction, and produces a sequence of motor commands. The robot picks up shirts, folds them, and stacks them. The model - a vision-language-action (VLA) model - bridges perception (vision), language understanding (instruction), and physical action (motor control).

These three instances span the multimodal landscape: vision-language understanding (GPT-4o), audio-language (Whisper), and vision-language-action (π0). They share architectural patterns (large Transformer models processing tokenized inputs from multiple modalities); they differ in modality combinations, training data, and deployment.

What multimodal models are

A working definition. Multimodal models are neural networks that process and/or generate content across more than one modality - text, images, audio, video, code, structured data, action commands, sensor readings.

The crucial property: the modalities are processed jointly, not by separate specialized models stitched together. A multimodal model can answer questions about an image because the image and the question are processed in the same architecture; the answer reflects joint understanding, not pipeline composition.

The contrast with classical computer vision. Pre-deep-learning systems for “image understanding plus text generation” used separate components: a vision system extracted features; an NLP system generated text. The systems were trained separately; interfaces were narrow. Modern multimodal models are single architectures with joint training.

The crucial enabler. The success of multimodal models depends on unifying representations across modalities. The dominant approach in 2026: tokenize all modalities into discrete (or continuous-but-structured) tokens; process the unified token sequence with a Transformer. The model learns cross-modal relationships through joint pretraining on multimodal data.
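The tokenize-then-unify idea can be sketched in a few lines. The example below is a toy illustration, not any specific model's pipeline: it assumes ViT-style non-overlapping patches, and the names `patchify`, `build_sequence`, and the `<image>` boundary markers are hypothetical.

```python
# Illustrative sketch only: turn an image into patch "tokens" and splice them
# into a text token sequence. All names here are hypothetical.

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def build_sequence(text_ids, image, patch=2):
    """Build one sequence of (modality, payload) items from text and an image."""
    seq = [("text", t) for t in text_ids]
    seq.append(("special", "<image>"))      # hypothetical boundary marker
    seq.extend(("image", p) for p in patchify(image, patch))
    seq.append(("special", "</image>"))     # hypothetical boundary marker
    return seq


image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 image
sequence = build_sequence([101, 7, 42], image, patch=2)
print(len(sequence))   # 3 text tokens + 2 markers + 4 patches = 9
print(sequence[4])     # ('image', [0, 1, 4, 5]), the top-left patch
```

In a real model each item would then be mapped to an embedding of the same width (a lookup table for text ids, a learned projection or codebook for patches) before the Transformer sees it; the sketch stops at the point where the modalities share one sequence.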

What multimodal models are not

Several boundaries.

Multimodal is not just “text plus images.” Audio, video, code, structured data, sensor data, action commands are all relevant modalities. The “vision-language model” subset has gotten most attention but is one part of a broader landscape.

Multimodal is not the same as multi-task. A model that does many tasks within a single modality (e.g., translation, summarization, classification - all text) is not multimodal. The defining feature is cross-modality, not task diversity.

Multimodal is not necessarily “native” multimodal. Some multimodal systems are late-fusion: separate per-modality encoders, then a joint head. Others are native multimodal: tokens from all modalities flow through the same Transformer from the start. Both qualify as “multimodal”; the architectural difference matters.

Multimodal is not a single architecture. CLIP-style contrastive models, encoder-fusion patterns, decoder-only multimodal Transformers, diffusion-based multimodal generators - all are multimodal models with different architectures. The chapter develops the range.

The shift from specialized to unified models

A specific industry transition. Pre-2021, computer vision and NLP were largely separate fields with separate conferences (CVPR vs ACL), separate benchmarks, separate communities. AI deployment combined them through pipelines - a vision component plus a text component.

Post-2021, the boundaries dissolved rapidly. CLIP (2021) trained vision and language jointly. Flamingo (2022) handled image+text in a single network. GPT-4V (2023) and GPT-4o (2024) brought multimodal capabilities to deployed frontier LLMs. By 2026, every major frontier model has substantial multimodal capability.

The implications.

Model unification. A single deployed model handles many modalities. Reduces engineering complexity (no separate vision + text pipelines); improves cross-modal capability (joint training enables emergent abilities single-modality models lack).

Capability emergence. Some multimodal capabilities emerge from joint training rather than being explicitly trained. GPT-4o can interpret diagrams, read charts, understand UI screenshots - capabilities not explicitly in the training objective.

Deployment economics. Multimodal-capable models are more useful per deployment than single-modal alternatives; the economic case for multimodality is strong.

Research consolidation. The previously-separate research communities (CV, NLP, speech) have substantially merged. Modern AI labs do not maintain separate research tracks; multimodal work spans all.

Why multimodal matters in 2026

Four motivations.

1. Production deployment. Frontier consumer products are increasingly multimodal. ChatGPT, Claude, Gemini all handle image uploads. Computer-use agents (Anthropic Computer Use, OpenAI Operator) depend on vision. Voice interfaces (ChatGPT Voice Mode, Gemini Live) integrate speech with text. Video understanding is rolling out (Gemini’s video analysis; competitor products following). Multimodal is no longer optional for frontier deployment.

2. Robotics and physical action. VLA models bridge perception, language, and action - enabling AI systems that operate in the physical world. The 2024-2026 progress in robotics is substantially driven by multimodal models (RT-2, OpenVLA, π0, Cosmos and successors). Cross-reference Robotics (planned).

3. Scientific applications. Multimodal models support scientific applications spanning many data types. Microscopy + text descriptions; molecular structures + properties; satellite imagery + climate data. Cross-reference AI for Science.

4. The path to broader AI. Whether or not “AGI” is the right target, broader AI capability requires handling the full range of human communication and perception modalities. Multimodality is a necessary condition for AI systems that interact with the world the way humans do.

Boundaries with adjacent chapters

This chapter consolidates multimodal-specific content that overlaps substantially with other chapters.

  • Self-Supervised Learning §7 covers cross-modal SSL pretraining (CLIP, ALIGN, the vision-language pretraining literature). This chapter develops the modeling aspects on top of that pretraining substrate.

  • Generative Models §9 covers modality-specific generation (text-to-image, video, audio, 3D, molecules). This chapter develops the cross-modal and unified aspects of these systems.

  • Large Language Models §3 develops Transformer architecture. Multimodal models extend LLM architecture to handle non-text tokens; the architectural foundations are LLM-side.

  • Foundation Models provides the FM-as-substrate framing. Multimodal FMs are FMs; the broader pretraining/adaptation framing applies.

  • AI Agents §7 covers computer-use agents (Anthropic Computer Use, OpenAI Operator). These are multimodal in capability; the agent-specific aspects live in the AI Agents chapter.

  • Robotics (planned) develops embodied AI. VLA models (§9 of this chapter) are central to modern robotics; this chapter covers them as a multimodal pattern; the broader robotics context lives in the Robotics chapter.

  • Reinforcement Learning is relevant for VLA training. Many VLA models are trained with combinations of supervised behaviour cloning and RL; the RL details live in the RL chapter.

  • Evaluation §10 covers multimodal evaluation as part of agent evaluation; this chapter §11 covers the multimodal-specific aspects.

  • Deep Learning §4-§6 develop the architectural primitives (CNNs, Transformers, attention) that multimodal models combine.

  • Causality §10 has a minor connection (causal reasoning across modalities is a niche but interesting research direction).

What this chapter does not try to do

Several explicit exclusions.

  • We do not provide a complete history of computer vision. The classical-CV-to-deep-CV transition (LeNet → AlexNet → ResNet → ViT) is in DL §4 and is assumed background.

  • We do not develop modality-specific generation in depth. Text-to-image diffusion, video generation, audio synthesis are in Generative Models §9.

  • We do not cover every multimodal benchmark. The space is large; we focus on the dominant benchmarks and methodology.

  • We do not develop the full robotics context for VLA models. The planned Robotics chapter develops embodied AI; this chapter touches VLA as a multimodal pattern.

  • We do not extensively cover the engineering of multimodal serving (multimodal inference infrastructure, vision-tokenizer engineering, mixed-modal-batching). These matter for production but are largely engineering rather than research content.

Position taken in this chapter

The chapter takes multimodal models seriously as a substantial paradigm shift in AI architecture and deployment. The shift is real (frontier models are multimodal; deployment is multimodal; user experience is multimodal). The chapter is appropriately cautious about specific capability claims - multimodal hallucinations, modality-specific failure modes, and uneven quality across modalities are real concerns.

The chapter’s overall framing: multimodal is the modern reality of AI architecture; specific capabilities are uneven across modalities; the unification trend is strong and continuing. The technical content develops what multimodal models do, how, and what remains hard.


§2. Historical Context

This section traces multimodal models from pre-deep-learning specialized systems through the 2024-2026 native-multimodal frontier.

A timeline of the inflection points:

   Pre-2012     Multimodal as pipeline: separate vision and
                  language systems combined via narrow interfaces.
                  Each component trained separately; multimodal
                  applications limited.
                                  │
                                  ▼
   2014-2015    Image captioning emerges: deep-CNN-encoder
                  + RNN-decoder pipelines (Show and Tell, 2014;
                  Karpathy and Fei-Fei, 2015). First true
                  joint vision-language deep learning.
                                  │
                                  ▼
   2015-2018    Visual question answering (VQA) develops as
                  research direction. CNN + RNN architectures
                  with attention. Substantial benchmark
                  development (VQA, GQA, CLEVR).
                                  │
                                  ▼
   2018-2020    Vision-language pretraining via masked-
                  modeling on paired data: ViLBERT (Lu et al.
                  2019), VisualBERT (Li et al. 2019), and many
                  successors. The BERT-style multimodal era.
                                  │
                                  ▼
   2021         CLIP (Radford et al., OpenAI): contrastive
                  vision-language pretraining at internet scale
                  (400M image-text pairs). Zero-shot transfer
                  to many vision tasks. Inflection moment for
                  multimodal: simple objective, massive scale,
                  emergent capability.
                                  │
                                  ▼
   2021-2022    CLIP-style models proliferate: ALIGN (Google),
                  Florence, OpenCLIP. Vision-language
                  pretraining becomes standard.
                                  │
                                  ▼
   2022         BLIP (Li et al., Salesforce): unified
                  vision-language understanding-and-generation.
                  Bootstraps captions from noisy web data.
                  Flamingo (Alayrac et al., DeepMind): few-shot
                  in-context learning for visual tasks via a
                  large frozen LLM with image-perception module.
                                  │
                                  ▼
   2022-2023    LLaVA (Liu et al.) and the open vision-language
                  model proliferation: combining open LLMs with
                  CLIP image encoders via projection layers.
                  Substantial open-source progress.
                                  │
                                  ▼
   2023         GPT-4V (OpenAI, September 2023): GPT-4's vision
                  capabilities released publicly. The first
                  frontier vision-language model widely deployed.
                                  │
                                  ▼
   2023         Gemini 1.0 (Google DeepMind, December 2023):
                  trained natively multimodal from the start
                  (not added later). Pushes the "native
                  multimodality" framing.
                                  │
                                  ▼
   2024 Mar     Claude 3 (Anthropic) with vision capabilities.
                  Substantial vision-language capability in
                  Claude 3 Opus and successors.
                                  │
                                  ▼
   2024 May     GPT-4o (OpenAI): native multimodal frontier
                  model handling text, image, audio in a single
                  unified model. Substantial step beyond GPT-4V
                  in unification and capability. Audio I/O
                  capability brings voice interfaces.
                                  │
                                  ▼
   2024 May     Chameleon (Meta): early-fusion mixed-modal
                  autoregressive model. Image + text generated
                  in a single token stream.
                                  │
                                  ▼
   2024 Oct     Anthropic Computer Use (cross-reference AI
                  Agents §7): vision-and-action agent
                  controlling computers via screenshots.
                                  │
                                  ▼
   2022-2024    Vision-Language-Action models mature: RT-1
                  (2022), RT-2 (Google DeepMind 2023), PaLM-E
                  (2023). OpenVLA (2024) opens the field with
                  open weights. π0 (Physical Intelligence,
                  2024) and successors push robotic-
                  manipulation capability.
                                  │
                                  ▼
   2024-2025    Video generation matures: Sora (OpenAI, Feb
                  2024), Veo (Google), Gen-3 (Runway), Mochi
                  (Genmo). Text-to-video at minute-length
                  becomes possible (cross-reference Generative
                  Models §9).
                                  │
                                  ▼
   2024-2025    Audio-language models advance: AudioLM line,
                  MusicGen, AudioGen; voice-mode integration
                  in frontier chatbots (ChatGPT Advanced
                  Voice Mode, Gemini Live).
                                  │
                                  ▼
   2025-2026    Multimodal becomes standard. Every major
                  frontier model has substantial multimodal
                  capability (text + vision + audio + video).
                  VLA models become the standard for robotics.
                  Computer-use capabilities mature into
                  deployment.

We develop each phase below.

Pre-deep-learning multimodal as pipeline

The early architecture. Pre-deep-learning systems combining modalities did so via pipelines - a vision component produced features; an NLP component consumed them; outputs flowed through narrow interfaces.

Example. Image captioning before deep learning: extract image features (SIFT, HOG, etc.); compute similarity to caption templates; produce a caption via template selection. The system had no joint understanding; the components were independently trained.

The limitations. Pipeline systems were narrow (each component handled a specific task); brittle (errors compounded across the pipeline); non-adaptive (changes to one component required updates to others).

Image captioning and the deep-learning multimodal era

The first true multimodal deep learning. Show and Tell (Vinyals, Toshev, Bengio, Erhan, Google; arXiv 2014, CVPR 2015) - CNN encoder + LSTM decoder for image captioning. Trained end-to-end; substantially better than pipeline systems.

Karpathy and Fei-Fei (2015), “Deep Visual-Semantic Alignments for Generating Image Descriptions,” aligned sentence fragments with image regions and generated region-level descriptions. Show, Attend and Tell (Xu et al., 2015) added attention over image regions to the caption decoder.

The pattern. Encoder-decoder multimodal: a CNN encodes the image; an RNN decoder generates text token-by-token, attending to image features. The architecture became standard for image captioning and related vision-language tasks through 2015-2018.

The limitations. Specialized for vision-to-text; required paired training data; performance plateaued before reaching deployment-grade quality.

Visual question answering

A specific task category that drove multimodal research. Visual Question Answering (VQA): given an image and a natural-language question, produce the answer.

Benchmarks. VQA (Antol et al., 2015); GQA (Hudson and Manning, 2019); CLEVR (Johnson et al., 2017, focused on compositional reasoning); VCR (Visual Commonsense Reasoning; Zellers et al., 2019).

Architectures. CNN image encoder + RNN/Transformer question encoder + joint fusion + answer prediction. Multiple architectural variants explored attention mechanisms, bilinear pooling, graph-based reasoning.

The trajectory. VQA performance improved steadily through 2018-2020 but remained substantially below human level on many subsets. The architectural innovations of the period (attention, fusion methods) became foundations for later work.

Vision-language pretraining: BERT-style era

The next inflection. Following BERT’s success in NLP (2018), researchers developed vision-language pretraining models with similar self-supervised objectives.

ViLBERT (Lu et al., 2019) - two-stream Transformer with cross-attention; masked-language-modeling and image-text matching objectives.

VisualBERT (Li et al., 2019) - single-stream architecture.

LXMERT (Tan and Bansal, 2019) - cross-modality encoder.

UNITER (Chen et al., 2020) - universal image-text representation learning.

The pattern. Pretrain on paired image-text data with masked-modeling-style objectives; fine-tune for downstream vision-language tasks. The approach substantially improved VQA, image captioning, and similar tasks.

The limitations. The models required paired data (caption-image pairs); training was expensive but scales were modest by later standards (millions of pairs); zero-shot transfer to novel tasks was limited.

CLIP and the contrastive multimodal era

The inflection moment. CLIP (Contrastive Language-Image Pre-training; Radford, Kim, Hallacy et al., OpenAI, 2021) “Learning Transferable Visual Models From Natural Language Supervision.”

The recipe. Train two encoders (one for images, one for text) such that paired image-text examples have similar embeddings and unpaired examples have dissimilar embeddings. The training data: 400 million image-text pairs scraped from the web.

The result. CLIP’s image and text representations are aligned in a shared embedding space. This enables:

  • Zero-shot classification. Classify an image by computing similarity to text descriptions of candidate classes; no per-class training needed.

  • Image retrieval. Find images matching a text query (or vice versa) via embedding similarity.

  • Foundation for downstream models. CLIP-encoded features become substrate for many downstream vision-language tasks.

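The CLIP paper reported zero-shot results on more than 30 vision datasets, competitive with supervised baselines on many of them; on ImageNet, zero-shot CLIP matched the accuracy of the original ResNet-50 without using any of its labeled training examples. The result was striking: multimodal pretraining at internet scale produced general visual capability.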
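The recipe and its zero-shot use can be made concrete with a small sketch. It assumes the standard symmetric cross-entropy formulation over a batch in which image i is paired with text i; the fixed temperature of 0.07 is an assumption for illustration (CLIP learns this value during training), and the function names and toy embeddings are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i and text i form the positive pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: same, but over the columns of the similarity matrix.
    loss_t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text embedding is most similar to the image."""
    img = normalize(image_emb)
    scores = {name: dot(img, normalize(e)) for name, e in class_text_embs.items()}
    return max(scores, key=scores.get)


images = [[1.0, 0.0], [0.0, 1.0]]                          # toy embeddings
aligned_loss = clip_loss(images, [[1.0, 0.0], [0.0, 1.0]])   # pairs match
swapped_loss = clip_loss(images, [[0.0, 1.0], [1.0, 0.0]])   # pairs mismatched
print(aligned_loss < swapped_loss)   # True: matched pairs give the lower loss

prediction = zero_shot_classify([0.9, 0.1], {"dog": [1.0, 0.0], "cat": [0.0, 1.0]})
print(prediction)                    # dog
```

In practice the class text embeddings come from encoding prompts such as “a photo of a dog” with the text encoder, so adding a new class needs only a new prompt, not new training.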

The trajectory. CLIP catalyzed extensive subsequent work:

  • ALIGN (Jia et al., Google, 2021) - similar approach at larger scale.

  • OpenCLIP (Ilharco et al., 2022+) - open-source CLIP reproduction at increasing scale.

  • Florence (Yuan et al., Microsoft, 2021) - vision-language model with broader task coverage.

  • EVA-CLIP, SigLIP, and many variants - methodological refinements.

By 2022-2023, CLIP-style models were the standard visual-encoder substrate for downstream vision-language work. They remain central in 2026.

BLIP, Flamingo, and the encoder-decoder scaling

A different line of work. Several 2022-2023 systems combined large language models with vision encoders to produce generative vision-language models.

BLIP (Li et al., Salesforce, 2022) - unified vision-language understanding and generation. Combined CLIP-style contrastive objectives with image-grounded text generation objectives. Followed by BLIP-2 (Li et al., 2023) - a Q-Former architecture for connecting a frozen vision encoder to a frozen LLM.

Flamingo (Alayrac et al., DeepMind, 2022) - frozen vision encoder + frozen language model + lightweight cross-attention layers for connecting them. Supports few-shot in-context learning on visual tasks. Substantially advanced vision-language capability.

LLaVA (Liu, Li, Wu, Lee, 2023) - open-source vision-language model combining LLaMA (or other open LLMs) with CLIP image encoder via a projection layer. Sparked extensive open-source vision-language model work.

The pattern. Combine pretrained components (vision encoder + language model) with lightweight connectors. Train the connector (and sometimes parts of the LM) on image-text data. Achieve substantial vision-language capability at much lower cost than from-scratch multimodal pretraining.
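A minimal sketch of the connector pattern, under stated assumptions: the vision encoder and the language model are treated as frozen black boxes (replaced here by random stand-in vectors), and the connector is a single linear map, which is one simple choice among several (Q-Former and cross-attention connectors are more elaborate). All names and sizes are hypothetical.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector (weight has shape out_dim x in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias for the trainable connector."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


rng = random.Random(0)
VISION_DIM, LLM_DIM = 6, 4   # toy sizes; real models use far larger widths

# Stand-ins for the outputs of the frozen components.
patch_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(3)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(5)]

# The connector: in this sketch, the only parameters that would be trained.
W, b = init_linear(VISION_DIM, LLM_DIM, rng)
visual_tokens = [linear(p, W, b) for p in patch_features]

# Visual tokens are placed in front of the text embeddings; the language
# model then processes the combined sequence as if it were all text.
llm_input = visual_tokens + text_embeddings
print(len(llm_input), len(llm_input[0]))   # 8 4
```

The design point is that the projected image features share the language model's embedding width, so no change to the language model's architecture is needed.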

GPT-4V and the frontier-VLM era

A specific frontier release. GPT-4V (OpenAI, announced September 2023) brought vision capabilities to GPT-4. The first widely-deployed frontier-LLM with substantial vision capability.

The reception. Demonstrated substantially better vision-language capability than prior open systems. Users could upload images to ChatGPT; ChatGPT could discuss them, answer questions, identify content, OCR text.

The trajectory. Starting with GPT-4V, successive frontier releases advanced vision-language capability:

  • GPT-4V (September 2023).

  • Gemini 1.0 (December 2023) - natively multimodal from the start.

  • Gemini 1.5 Pro (February 2024) - long-context multimodal; video processing.

  • Claude 3 family (March 2024) - vision in Opus, Sonnet, Haiku.

  • GPT-4o (May 2024) - substantially advanced unified multimodal (text + vision + audio).

  • Claude 3.5 Sonnet (2024) - vision capability competitive with GPT-4o.

  • GPT-4.5, GPT-5, Claude 4, Gemini 2.0+ (2025-2026) - continued capability advancement.

The pattern. Vision-language capability became expected in frontier models. Deployment of vision-text-audio-video unified models became standard.

Gemini and native multimodality

A specific architectural choice. Gemini 1.0 (Google DeepMind, December 2023) was explicitly described as “natively multimodal” - trained on multimodal data from the start of pretraining, rather than starting with a text-pretrained LLM and adding vision later.

The argument. Native multimodal training produces better cross-modal capabilities than late-fusion approaches. Modalities can attend to each other from the earliest layers; cross-modal patterns emerge naturally.

The reception. The “native multimodality” framing has become widely adopted. Subsequent frontier models (GPT-4o, Gemini 1.5+, Claude 3.5+) increasingly emphasize joint multimodal pretraining.

The honest assessment. Whether fully native multimodality is necessary, or whether late-fusion methods can reach similar capability with appropriate scale, is contested. The 2026 evidence: both approaches produce capable systems; the differences may be substantial in specific capabilities (audio understanding, real-time interaction) but not universal.

Vision-language-action models

A specific direction with substantial 2023-2026 progress. Vision-Language-Action (VLA) models extend the multimodal paradigm to physical action.

RT-1 (Robotic Transformer 1; Google, 2022). An early large-scale Transformer-based robot policy; trained on roughly 130,000 real-robot demonstration episodes.

RT-2 (Google DeepMind, 2023). Builds on RT-1 with a vision-language-action foundation model. Uses a VLM as the backbone; fine-tunes for robotic action.

PaLM-E (Google DeepMind, 2023). Embodied multimodal language model integrating PaLM with robot state and visual input.

OpenVLA (Kim, Pertsch, Karamcheti et al., 2024). Open-source VLA model; democratized VLA research.

π0 (Pi-zero) (Physical Intelligence, late 2024). Frontier manipulation policy; demonstrates substantial capability on dexterous manipulation tasks.

Cosmos (NVIDIA, 2025+) and related world-model approaches for robotics.

The trajectory. VLA models have moved from research-stage demonstrations to substantive robotic capability in 2024-2026. Cross-reference Robotics (planned chapter) for the broader context.

Computer-use agents as multimodal applications

A specific applied direction. Anthropic Computer Use (October 2024) and OpenAI Operator (January 2025) - agents controlling computers via screenshots and keyboard/mouse. Cross-reference AI Agents §7.

The multimodal aspect. Computer-use agents are vision-and-action multimodal systems: they perceive screens (vision); they understand goals (language); they execute UI actions (action). The same paradigm as VLA but in a different domain (computer interfaces rather than physical robots).

Where this leaves us in 2026

The current state. Multimodal models are standard in frontier AI. Every major frontier model is multimodal; deployment is multimodal; user experience is multimodal. The capability range spans text, vision, audio, video, action.

Specific capability levels.

  • Vision-language understanding. Mature; production-grade for most tasks.

  • Audio-language understanding. Substantially mature; production-grade for speech-to-text.

  • Multi-modal generation. Mature for image and audio; rapidly maturing for video.

  • Vision-language-action. Research-grade to early-production; rapid progress.

  • Long-context multimodal. Emerging; some frontier models handle hour-long videos or large multimodal documents.

  • Cross-modal reasoning. Emerging; works for many cases; fails in subtle ways.

The remaining sections develop the technical content. §3 covers the modeling framework. §4 covers vision-language models. §5 covers vision generation (briefly; cross-reference Generative Models). §6 covers audio and speech. §7 covers video. §8 covers native multimodality. §9 covers VLA. §10 covers cross-modal retrieval. §11 covers evaluation. §12-§16 close out.

Editorial note. Multimodal models are rapidly evolving. Specific model releases and capability levels will date faster than the architectural framework. The chapter is a snapshot of the methodology and the state of the art as of mid-2026.


§3. The Multimodal Modeling Framework

A conceptual framework for organizing the multimodal modeling landscape. This section develops the vocabulary used throughout the chapter - modality, representation, fusion patterns, native vs late multimodality.

Modality vs representation

A useful initial distinction.

A modality is a form of data - text (sequences of characters), images (2D pixel arrays), audio (waveforms or spectrograms), video (sequences of images), 3D shapes (meshes, point clouds), structured data (tables, graphs), action commands (motor signals, mouse events).

A representation is how a modality is encoded inside a model - discrete tokens (text subwords, image patches as discrete codes from VQ-VAE), continuous embeddings (vision Transformer patch embeddings), spectrograms (time-frequency representations of audio).
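The step from a continuous embedding to a discrete token can be illustrated with nearest-neighbour vector quantization, the lookup at the core of VQ-VAE-style tokenizers. This is a sketch under simplifying assumptions: the codebook is hand-written rather than learned, and `quantize` is a hypothetical helper name.

```python
def quantize(vector, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


# Toy 2-D codebook with four entries; a real tokenizer learns thousands.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Continuous patch embeddings (for example, the output of an image encoder).
patch_embeddings = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]]

# Discrete representation: one integer token id per patch.
token_ids = [quantize(p, codebook) for p in patch_embeddings]
print(token_ids)   # [0, 3, 2]

# Decoding looks the ids back up; the result is a lossy reconstruction.
reconstructed = [codebook[i] for i in token_ids]
```

The integer ids can share a vocabulary with text tokens, which is what lets a single autoregressive model read and emit both.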

The same modality can have multiple representations. Audio can be represented as raw waveform samples; as spectrograms (frequency-time matrices); as discrete tokens from a learned codec (EnCodec, SoundStream). Each representation is more suitable for different model architectures and tasks.
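To make the waveform-versus-spectrogram distinction concrete, the sketch below converts a toy sine wave into a magnitude spectrogram with a naive short-time discrete Fourier transform. It is deliberately simplified: no window function, no mel scaling, and an unrealistically low sample rate; production front ends typically use windowed FFTs and log-mel features. The function name and all parameters are hypothetical.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_len, hop):
    """Naive short-time DFT: one row per frame, one column per frequency bin."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        row = []
        for k in range(frame_len // 2 + 1):   # non-negative frequencies only
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            row.append(abs(coeff))
        frames.append(row)
    return frames


SAMPLE_RATE = 64   # samples per second (toy value)
TONE_HZ = 8        # frequency of the test tone

# Representation 1: raw waveform samples (two seconds of a pure tone).
wave = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(128)]

# Representation 2: time-frequency matrix computed from the same audio.
spec = magnitude_spectrogram(wave, frame_len=32, hop=16)

# Each bin spans SAMPLE_RATE / frame_len = 2 Hz, so an 8 Hz tone peaks in bin 4.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)   # 7 17 4
```

The waveform is a single long sequence of 128 numbers, while the spectrogram is a short sequence of 7 frames with 17 features each; that change in shape is one reason the two representations suit different architectures.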

The implication. Multimodal model design involves choices both about which modalities to handle and how to represent each. These choices substantially shape the architecture.

Unimodal, bimodal, multimodal

Terminology levels.

Unimodal models process one modality. GPT-3 (text-only), AlphaFold 2 (sequence-only at the input level), pre-2021 vision Transformers (image-only). The classical deep-learning paradigm.

Bimodal models combine exactly two modalities. CLIP (image + text); audio-text models like Whisper (audio + text); early vision-language models. Substantively richer than unimodal but constrained to specific modality pairs.

Multimodal models combine more than two modalities. GPT-4o (text + image + audio); Gemini (text + image + audio + video); VLA models (vision + language + action). The 2024-2026 frontier.

The trend in 2026 is toward more modalities in single models. The boundaries between bimodal and multimodal are fluid; the central trend is integration.

Cross-modal vs joint-modal vs unified

A finer distinction within “multimodal.”

Cross-modal models map between modalities. A speech-to-text model (audio → text); a text-to-image model (text → image). The model has input modalities and output modalities; the mapping is the primary operation.

Joint-modal models process multiple modalities together to produce a unified output. A VQA model (image + question → answer); a vision-language model answering questions about images. The model jointly processes multiple input modalities.

Unified multimodal models can both input and output across multiple modalities, often interchangeably. GPT-4o (input: text, image, or audio; output: text, image, or audio). The most flexible category; the 2024-2026 frontier.

The classification matters because architectural choices differ.

  • Cross-modal often uses encoder-decoder (encode source modality; decode target modality).

  • Joint-modal uses fusion architectures (encode each modality; combine for downstream task).

  • Unified uses tokenization-across-modalities + Transformer (all modalities flow through the same architecture).

Alignment vs fusion vs generation

A different axis. What does the multimodal model do with the modalities?

Alignment. Train representations such that paired examples across modalities are similar in embedding space; unpaired are different. The CLIP recipe (§4). Useful for retrieval, classification, search.

Fusion. Combine representations from multiple modalities into a joint representation for a downstream task. The BLIP / Flamingo / LLaVA recipe (§4). Useful for VQA, image captioning, multimodal reasoning.

Generation. Produce output in one or more modalities conditional on inputs. Text-to-image diffusion (§5; cross-reference Generative Models §6); multimodal text-and-image generation (GPT-4o image generation; §8). Useful for content creation, design assistance.

A given system may do multiple of these. CLIP does alignment; it can be used as a component of fusion systems and as a conditioning signal for generation. GPT-4o does fusion (multimodal input → text output) and generation (text input → image/audio output).

Native multimodality vs late fusion

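A contested architectural choice: whether to combine separately trained per-modality components after the fact (late fusion) or to train a single model on all modalities from the start (native multimodality).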

Late fusion. Start with specialized per-modality components (a vision encoder; a language model). Train them separately on per-modality data. Combine them later via lightweight connectors (projection layers, cross-attention modules). The CLIP-then-LLaVA pattern; many open-source vision-language models.

   LATE FUSION ARCHITECTURE (e.g., LLaVA)

   Image input ─→ Vision Encoder (e.g., CLIP) ─→ Projection Layer ─┐
                  (pretrained, frozen)           (small, trained)  │
                                                                   ▼
   Text input ─→ Token embeddings ────────────────────────→ Language Model
                                                            (pretrained, may
                                                             be fine-tuned)
                                                                   │
                                                                   ▼
                                                              Output text

Advantages of late fusion. Cost-effective - reuse existing pretrained components. Modular - swap encoders or LMs without retraining everything. Open-source-friendly - combine open vision encoders with open LMs.

Disadvantages. Limited cross-modal integration - modalities only interact at the connection point; complex cross-modal reasoning may be limited. Constrained capability - the system inherits the limitations of its components without joint optimization.
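The connection point in late fusion can be sketched in a few lines. The following is a minimal illustration, not LLaVA's actual code: the dimensions, the random weights, and the helper names (`project_image_features`, `build_lm_input`) are all assumptions chosen for readability. It shows only the data flow: vision features pass through a small trainable projection and are then concatenated with text embeddings into one input sequence for the language model.

```python
# Toy late-fusion connector: project vision features into the LM's embedding
# space, then prepend them to the text embeddings.
# All sizes and weights are illustrative assumptions, not real model values.
import random

def matvec(weights, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def project_image_features(patch_features, weights, bias):
    """Linear connector: map each vision feature to the LM embedding width."""
    return [[y + b for y, b in zip(matvec(weights, f), bias)]
            for f in patch_features]

def build_lm_input(projected_patches, text_embeddings):
    """Late fusion happens here: one sequence of image, then text, embeddings."""
    return projected_patches + text_embeddings

rng = random.Random(0)
VISION_DIM, LM_DIM, NUM_PATCHES, NUM_TEXT_TOKENS = 4, 6, 3, 5

# Stand-ins for the frozen vision encoder output and the LM's text embeddings.
patch_features = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)]
                  for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(LM_DIM)]
                   for _ in range(NUM_TEXT_TOKENS)]

# The only trainable parameters in this sketch: the connector.
W = [[rng.uniform(-0.1, 0.1) for _ in range(VISION_DIM)] for _ in range(LM_DIM)]
b = [0.0] * LM_DIM

lm_input = build_lm_input(project_image_features(patch_features, W, b),
                          text_embeddings)
print(len(lm_input), len(lm_input[0]))  # sequence length, embedding width
```

Because the two modalities meet only in this concatenated sequence, everything the language model learns about the image has to pass through the projection, which is the integration limit described above.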

Native multimodality. Train a single model from the start on multimodal data. Tokens from all modalities flow through the same architecture from the earliest layers; cross-modal attention happens at every layer.

   NATIVE MULTIMODAL ARCHITECTURE (e.g., Gemini, GPT-4o, Chameleon)

   Image input ─→ Image tokenizer ─┐
                                    │
   Audio input ─→ Audio tokenizer ─┤
                                    │
   Text input  ─→ Text tokenizer  ─┼─→ Unified token sequence
                                    │
   (any modality)                   │
                                    ▼
                       ┌─────────────────────────┐
                       │  Transformer (all       │
                       │  layers, all tokens     │
                       │  attend to each other)  │
                       └─────────────────────────┘
                                    │
                                    ▼
                       Output: token sequence
                       (text tokens, image tokens,
                        audio tokens - any modality)

Advantages of native multimodality. Deep cross-modal integration - modalities interact at every layer; cross-modal reasoning is emergent. Capability ceiling may be higher than late fusion for complex multimodal tasks.

Disadvantages. Expensive - requires full multimodal pretraining. Less modular - modalities are entangled; harder to update specific components. Less open-source-accessible - only major labs can afford frontier-scale native-multimodal training.

The 2026 picture. Frontier proprietary models (Gemini, GPT-4o, Claude 3.5+) are increasingly native multimodal. Open-source models predominantly use late fusion. The capability gap is real but not absolute - late-fusion models can be very capable, and native multimodal is not always strictly better for specific tasks.

Tokenization across modalities

A specific technical concern for native multimodality. To process multiple modalities in a single Transformer, each modality must be tokenized into a sequence the Transformer can consume.

Text tokenization. Standard subword tokenization (BPE or unigram-LM algorithms, typically via libraries such as SentencePiece or tiktoken). Well-established; vocabulary sizes are typically ~30K-250K tokens.

Image tokenization. Several approaches.

  • Patch embeddings. Vision Transformer (ViT) splits the image into fixed-size patches (commonly 16x16 pixels); each patch is linearly embedded into a continuous vector. Continuous rather than discrete tokens.

  • VQ-VAE discrete codes. Train a VQ-VAE to map image patches (or whole images) to discrete codes from a learned codebook. Discrete tokens like text.

  • VQGAN, learned tokenizers. Refinements producing higher-quality discrete image tokens.
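The two image-tokenization routes above can be illustrated with a small sketch. It is a simplification under stated assumptions: the image is a tiny grayscale grid, the codebook is hand-written rather than learned, and quantization is applied directly to raw pixel patches, whereas a real VQ-VAE quantizes the latents produced by a learned encoder. The function names are hypothetical.

```python
def patch_grid(height, width, patch):
    """Number of non-overlapping patches along each axis."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return height // patch, width // patch

def patchify(image, patch):
    """Split a 2D list (grayscale image) into flattened patches, row-major."""
    rows, cols = patch_grid(len(image), len(image[0]), patch)
    patches = []
    for r in range(rows):
        for c in range(cols):
            patches.append([image[r * patch + i][c * patch + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def quantize(vector, codebook):
    """VQ step: index of the nearest codebook entry (squared Euclidean)."""
    def dist(code):
        return sum((v - k) ** 2 for v, k in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

# A 224x224 image with 16x16-pixel patches gives a 14x14 grid of patches.
rows, cols = patch_grid(224, 224, 16)
print(rows * cols)

# Toy 4x4 image, 2x2 patches, and a two-entry codebook ("dark", "bright").
image = [[0, 0, 9, 9],
         [0, 1, 9, 8],
         [9, 9, 0, 0],
         [8, 9, 1, 0]]
codebook = [[0, 0, 0, 0], [9, 9, 9, 9]]
tokens = [quantize(p, codebook) for p in patchify(image, 2)]
print(tokens)
```

The patch vectors themselves correspond to the continuous route (a ViT would apply a learned linear map to each one); the integer indices in `tokens` correspond to the discrete route.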

Audio tokenization. Similar choices.

  • Continuous features. Mel-spectrograms, wav2vec embeddings.

  • Discrete codecs. EnCodec, SoundStream, DAC - neural audio codecs producing discrete tokens.

Video tokenization. Extend image tokenization to space-time.

  • Per-frame tokenization. Independent tokenization of each frame.

  • Space-time tokenization. 3D tokens spanning space and time. Sora-style.

Action tokenization. For VLA models.

  • Discrete action codes. Discretize action space; train action tokenizer.

  • Continuous actions. Output continuous values directly; less Transformer-friendly.
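One simple way to obtain discrete action codes is uniform binning of each continuous action dimension. The sketch below assumes a single action dimension with known bounds and 256 bins; both the bin count and the function names are illustrative choices, and learned action tokenizers are more elaborate.

```python
def action_to_token(value, low, high, num_bins=256):
    """Map a continuous action in [low, high] to an integer bin index."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * num_bins), num_bins - 1)

def token_to_action(token, low, high, num_bins=256):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / num_bins
    return low + (token + 0.5) * width

low, high = -1.0, 1.0
command = 0.37                      # e.g., a normalized joint velocity
token = action_to_token(command, low, high)
recovered = token_to_action(token, low, high)

# Round-trip error is bounded by half a bin width.
half_bin = (high - low) / 256 / 2
print(token, abs(recovered - command) <= half_bin)
```

The round trip makes the cost of discretization explicit: precision is limited to half a bin width, which is the price paid for making actions look like ordinary tokens to the Transformer.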

The architectural implication. Unified discrete tokenization across all modalities enables fully-unified Transformer processing - every modality is a token sequence; the Transformer doesn’t distinguish them. Chameleon (Meta) and similar systems take this approach. Continuous representations require additional architectural machinery (cross-attention, projection layers).
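A common way to realize "every modality is a token sequence" is to give each modality a disjoint range of ids in one shared vocabulary. The sketch below illustrates that bookkeeping only; the vocabulary sizes and function names are hypothetical and do not describe Chameleon's actual configuration.

```python
# Hypothetical per-modality vocabulary sizes; real systems choose their own.
VOCAB_SIZES = {"text": 50_000, "image": 8_192, "audio": 1_024}

def build_offsets(vocab_sizes):
    """Assign each modality a disjoint id range in one shared vocabulary."""
    offsets, start = {}, 0
    for name, size in vocab_sizes.items():
        offsets[name] = start
        start += size
    return offsets, start

OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)

def to_unified(modality, local_ids):
    """Shift modality-local token ids into the shared id space."""
    return [OFFSETS[modality] + i for i in local_ids]

def from_unified(token_id):
    """Recover (modality, local id) from a shared-vocabulary id."""
    for name, size in VOCAB_SIZES.items():
        start = OFFSETS[name]
        if start <= token_id < start + size:
            return name, token_id - start
    raise ValueError(f"id {token_id} is outside the shared vocabulary")

# An interleaved sequence: two text tokens, two image tokens, one audio token.
sequence = (to_unified("text", [17, 42])
            + to_unified("image", [5, 900])
            + to_unified("audio", [3]))
print(sequence)
print([from_unified(t) for t in sequence])
```

With this layout the Transformer sees a single embedding table and a single output softmax of size `TOTAL_VOCAB`; the id range alone tells a decoder which modality-specific detokenizer should render each generated token.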

The trade-off. Discrete tokenization is simpler architecturally but loses information (especially for high-resolution images and audio). Continuous tokenization is higher-fidelity but more complex. The 2026 trend: hybrid approaches that maintain continuous representations where fidelity matters and discrete where simplicity matters.

Modality-specific characteristics

A practical observation. Different modalities have different characteristics that affect design.

Text. Discrete; relatively compact (~1 token per several characters); well-structured (grammar, syntax); rich semantic content. The most-developed modality in current AI.

Images. Continuous (or high-dimensional discrete); spatially structured; substantial information per item (a single image is ~256-2048 tokens depending on resolution and tokenizer). Visually rich; semantically complex.

Audio. Continuous (waveforms) or feature-based (spectrograms); temporally structured; substantial token cost (1 second of audio ranges from tens of encoder frames to hundreds or thousands of codec tokens, depending on the representation). Less semantically dense than images.

Video. Space-time; very large (1 minute of video can be millions of tokens at high resolution); temporally and spatially structured. The largest-token-budget modality.

Action. Often low-dimensional (motor commands, mouse positions); temporally structured; varies dramatically by domain (text typing vs robot manipulation).

Code. Like text but with stricter syntax and richer semantic structure; sometimes treated as a separate modality, sometimes as a text subspecies.

Structured data. Tables, graphs, knowledge bases. Often flattened to text for current models; native handling is research-stage.

The implication. Modalities have different token budgets and information densities. Multimodal model design must allocate context-window and compute appropriately across modalities.
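The budget differences are easiest to see as arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement of any particular model: the patch size, the audio frame rate and codebook count, the characters-per-token ratio, and the 128K context limit are all assumptions picked to be in a plausible range.

```python
def text_tokens(num_chars, chars_per_token=4):
    """Rough subword estimate: a few characters per token."""
    return num_chars // chars_per_token

def image_tokens(height, width, patch=16):
    """One token per non-overlapping patch."""
    return (height // patch) * (width // patch)

def audio_tokens(seconds, frames_per_second=75, codebooks=8):
    """Codec-style estimate: several codebook entries per audio frame."""
    return int(seconds * frames_per_second) * codebooks

def video_tokens(seconds, fps, height, width, patch=16):
    """Per-frame tokenization: image tokens times number of frames."""
    return int(seconds * fps) * image_tokens(height, width, patch)

CONTEXT_WINDOW = 128_000  # assumed context length, for comparison only

budget = {
    "text, 2,000 characters": text_tokens(2_000),
    "image, 512x512": image_tokens(512, 512),
    "audio, 60 s": audio_tokens(60),
    "video, 60 s at 24 fps, 512x512": video_tokens(60, 24, 512, 512),
}
for name, count in budget.items():
    print(f"{name}: {count:,} tokens ({count / CONTEXT_WINDOW:.1%} of context)")
```

Under these assumptions a minute of naively tokenized video exceeds the whole context window by an order of magnitude, which is why video models rely on frame subsampling, token compression, or space-time tokenizers.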

Where the framework fits

The summary. The conceptual framework - modality vs representation, unimodal/bimodal/multimodal, cross-modal vs joint vs unified, alignment vs fusion vs generation, native vs late, tokenization across modalities, modality-specific characteristics - provides the vocabulary for the remaining sections. The choices vary; the framework is consistent.

Where multimodal modeling sits in 2026

The 2026 frontier is unified, native multimodal models that handle many modalities through unified tokenization and Transformer architecture, with both discrete and continuous representations depending on the modality. The open-source ecosystem remains predominantly late-fusion; the architectural gap matters for some applications and not for others.

The next sections develop the specific modality combinations: §4 covers vision-language; §5 vision generation; §6 audio; §7 video; §8 native unified multimodality; §9 vision-language-action; §10 cross-modal retrieval; §11 evaluation.


§4. Vision-Language Models

The most-developed multimodal model category. Vision-Language Models (VLMs) combine vision and language, supporting tasks like image captioning, visual question answering, image-grounded reasoning, and multimodal chat. This section develops the dominant architectural patterns, the production frontier systems, and the empirical state.

CLIP and contrastive pretraining

The foundational modern approach. CLIP (Contrastive Language-Image Pre-training; Radford, Kim, Hallacy et al., OpenAI, 2021).

The recipe (review from §2).

  • Two encoders. An image encoder (Vision Transformer or modified ResNet) produces image embeddings. A text encoder (Transformer) produces text embeddings. Both produce vectors of the same dimensionality.

  • Contrastive objective. Train on 400M image-text pairs. For each batch, the diagonal of the image-text similarity matrix should be high (matching pairs); off-diagonal should be low (non-matching pairs).

  • Joint embedding space. After training, image and text embeddings live in a shared semantic space. An image and a matching caption have similar embeddings; image-text similarity computes semantic match.
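The contrastive objective in the recipe above can be written out directly. The following is a plain-Python sketch of a symmetric softmax contrastive loss over a batch similarity matrix; the toy embeddings are made up, and the temperature is fixed at 0.07 here although in CLIP it is a learned parameter. Production implementations use batched tensor operations rather than Python lists.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy where pair k in the batch is the positive."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should concentrate on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: the same, taken over columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.1], [0.1, 1.0]])
shuffled = contrastive_loss(images, [[0.1, 1.0], [1.0, 0.1]])
print(aligned < shuffled)
```

The loss is low when the diagonal of the similarity matrix dominates and high when captions are paired with the wrong images, which is exactly the training signal described in the second bullet.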

The applications.

Zero-shot image classification. To classify an image with N candidate classes, encode the image and the N class descriptions (“a photo of a [class]”); pick the class with highest embedding similarity. No per-class training needed.
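The zero-shot procedure reduces to a nearest-neighbour search over prompt embeddings. In the sketch below the text encoder is replaced by a lookup table of made-up vectors (`FAKE_TEXT_EMBEDDINGS`, a hypothetical stand-in), so only the control flow reflects the method; a real system would call a trained text encoder and image encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, class_names, encode_text):
    """Pick the class whose prompt embedding is most similar to the image."""
    prompts = [f"a photo of a {name}" for name in class_names]
    scores = [cosine(image_embedding, encode_text(p)) for p in prompts]
    best = max(range(len(class_names)), key=lambda i: scores[i])
    return class_names[best], scores

# Hypothetical stand-in for a trained text encoder.
FAKE_TEXT_EMBEDDINGS = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

def fake_encode_text(prompt):
    return FAKE_TEXT_EMBEDDINGS[prompt]

image_embedding = [0.8, 0.3, 0.1]   # pretend output of the image encoder
label, scores = zero_shot_classify(image_embedding, ["dog", "cat", "car"],
                                   fake_encode_text)
print(label)
```

Adding a class requires only a new prompt string, not new training data, which is what makes the approach zero-shot.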

Image retrieval. Given a text query, retrieve images with high text-image similarity. Given an image, retrieve similar-content images via embedding similarity.

Foundation for downstream models. CLIP image embeddings are general-purpose visual representations. Many subsequent vision-language models use CLIP’s image encoder as their visual substrate.

Filtering datasets. CLIP image-text similarity serves as a quality signal for filtering web-scraped training data. LAION-400M, LAION-5B (including its LAION-2B English subset), and other large datasets were built with CLIP-based filtering.

The CLIP family. Substantial follow-up work:

  • ALIGN (Google, 2021). Similar approach at larger scale (~1.8B noisy image-alt-text pairs); showed that scale can compensate for minimally filtered data.

  • OpenCLIP (Ilharco, Wortsman et al., 2021+). Open-source CLIP reproduction at increasing scale (up to ViT-bigG/14 models trained on LAION-2B).

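  • SigLIP (Zhai et al., Google, 2023). Pairwise sigmoid loss instead of batch-wide softmax cross-entropy; needs no normalization across the batch, which simplifies scaling the batch size and gives its largest gains at smaller batch sizes.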

  • EVA-CLIP (Sun et al., 2023). Substantial scaling; state-of-the-art zero-shot performance.
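The difference between the softmax objective and the sigmoid variant mentioned in the list is easiest to see in code. The sketch below treats every image-text pair in the batch as an independent binary classification (match or no match). It assumes unit-length embeddings and fixes the temperature and bias at 10 and -10, which are the initial values reported in the SigLIP paper; in training both are learned.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pair_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum of independent binary losses over all pairs, averaged per example.

    Pair (i, j) is labelled +1 when i == j (a true image-text pair)
    and -1 otherwise. Embeddings are assumed to be L2-normalized.
    """
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            label = 1.0 if i == j else -1.0
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            total += -log_sigmoid(label * logit)
    return total / len(image_embs)

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = sigmoid_pair_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = sigmoid_pair_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Each term depends only on its own pair, so the loss can be computed in chunks across devices without first gathering the full similarity matrix for a softmax.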

The 2026 state. CLIP-style models are standard infrastructure - frequently used as image encoders in larger systems; widely deployed for retrieval and search.

Encoder-fusion patterns: BLIP and BLIP-2

A different multimodal architecture. BLIP (Bootstrapping Language-Image Pre-training; Li, Li, Xiong, Hoi, Salesforce, 2022) unified vision-language understanding and generation.

The architecture. Multiple modules:

  • An image encoder.

  • A text encoder.

  • An image-grounded text encoder (cross-attention from text to image).

  • An image-grounded text decoder (generates text given image features).

Training objectives. Three objectives applied jointly: image-text contrastive learning (CLIP-style); image-text matching (binary classification of whether image-text pair matches); image-grounded language modeling (generate captions).

The result. BLIP supports both understanding (classification, retrieval) and generation (captioning, VQA in generative form). Substantial advance over CLIP for tasks requiring text generation.

BLIP-2 (Li et al., 2023) refined this. Used a frozen image encoder and a frozen large language model, with a small “Q-Former” (Query Transformer) bridging the two. Much more parameter-efficient; leveraged large pretrained LMs effectively.

The pattern. Bridge a vision encoder to a language model via a learned connector. Train only the connector (and sometimes the LM) on image-text data; leverage existing components.

Decoder-fusion patterns: Flamingo and LLaVA

A specific architectural pattern that became dominant. Decoder-fusion uses a pretrained language model as the backbone; injects visual information through specific attention mechanisms.

Flamingo (Alayrac, Donahue, Luc et al., DeepMind, 2022) “Flamingo: a Visual Language Model for Few-Shot Learning.”

The architecture.

  • Frozen vision encoder. Pretrained image encoder (Normalizer-Free ResNet).

  • Frozen large language model. Pretrained Chinchilla.

  • Perceiver Resampler. Variable-length image features → fixed-length tokens for the LM.

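  • Gated cross-attention layers interleaved between the layers of the frozen LM. These layers attend from text positions to image features; a tanh gate initialized at zero means training starts from the unmodified LM.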

The training. Train only the Perceiver Resampler and the cross-attention layers. The vision encoder and LM remain frozen.
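The reason new cross-attention layers can be inserted into a frozen LM without disrupting it is the gating. The sketch below is a stripped-down illustration: it omits the learned query, key, and value projections and the feed-forward sublayer, and uses a single scalar gate per layer. It shows only that with the gate at zero the text hidden states pass through unchanged.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

def gated_cross_attention(text_states, image_tokens, gate):
    """Residual update x + tanh(gate) * attention(x, image tokens)."""
    g = math.tanh(gate)
    fused = []
    for state in text_states:
        attended = cross_attention(state, image_tokens, image_tokens)
        fused.append([x + g * a for x, a in zip(state, attended)])
    return fused

text_states = [[0.2, -0.1], [0.5, 0.4]]                # frozen-LM hidden states
image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # resampler outputs

at_init = gated_cross_attention(text_states, image_tokens, gate=0.0)
after_training = gated_cross_attention(text_states, image_tokens, gate=1.0)
print(at_init == text_states, after_training != text_states)
```

As the gate moves away from zero during training, visual information is blended into the text stream gradually while the LM weights themselves stay fixed.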

The result. Flamingo supports few-shot in-context learning on visual tasks: provide a few image-text examples in the prompt; the model performs the task on a new image. Substantial advance for vision-language at the time.

LLaVA (Liu, Li, Wu, Lee, 2023) “Visual Instruction Tuning.”

The architecture. Simpler than Flamingo.

  • CLIP vision encoder. Produces image embeddings.

  • Projection layer. A linear or two-layer MLP mapping image embeddings to LM token-embedding space.

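  • Open large language model. Vicuna (an instruction-tuned LLaMA) in the original release; later variants use other open LMs.

The training. Two stages. First, train only the projection layer on image-caption pairs while the vision encoder and the LM stay frozen. Second, train the projection layer and the LM on image-instruction-response triplets, which were generated by prompting text-only GPT-4 with image captions and bounding-box annotations.

The contribution. Open-source vision-language modeling at competitive capability. Demonstrated that modest fine-tuning of an LM with a simple projection from CLIP features could produce a capable VLM.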

The trajectory. LLaVA spawned a substantial open-source ecosystem:

  • LLaVA-1.5, LLaVA-NeXT, LLaVA-OneVision. Improvements with better data, training, and architecture.

  • MiniGPT-4 (Zhu et al., 2023). Similar pattern.

  • InternVL (Chen et al., 2024). Open vision-language model with substantial scale.

  • Qwen-VL (Bai et al., 2023, 2024). Alibaba’s open VLM family.

  • Many others. The open-source VLM ecosystem became substantial.

By 2024-2026, the LLaVA-style pattern (CLIP encoder + projection + open LM) was the dominant open-source VLM architecture. The capability gap with frontier proprietary models exists but narrowed over time.

The native multimodal frontier

The 2023-2024 inflection. Frontier proprietary models moved from late-fusion (vision added to text-pretrained LM) to native multimodal (joint pretraining from the start).

GPT-4V (OpenAI, September 2023). GPT-4 with vision capability. Architecture details not fully disclosed; based on public information, appears to use a vision encoder bridged to GPT-4.

Gemini 1.0 (Google DeepMind, December 2023). Explicitly described as natively multimodal - trained on text, images, audio, video from the start.

GPT-4o (OpenAI, May 2024). “Omni” model - text, image, audio handled in a single unified architecture. Substantial step beyond GPT-4V in unification.

Claude 3 family (Anthropic, March 2024). Vision capabilities in Opus, Sonnet, Haiku. Substantial vision-language capability.

Claude 3.5 Sonnet (Anthropic, June 2024; upgraded October 2024). Competitive vision capability.

Gemini 1.5 Pro (2024). Long-context multimodal (handles hour-long videos, large multimodal documents).

The trajectory. Each new release adds multimodal capability. The 2026 picture: every major frontier model handles at minimum text+vision; most handle text+vision+audio; some handle text+vision+audio+video.

Performance and benchmarks

A specific empirical state. Vision-language benchmark performance for frontier models:

MMMU (Massive Multi-discipline Multimodal Understanding; Yue et al., 2024). College-level multimodal questions across many subjects. Frontier model performance ~60-75% by 2026; substantial but below human-expert level.

MathVista (Lu et al., 2024). Mathematical reasoning over diagrams, charts, geometric figures. Frontier model performance ~70-85% by 2026.

ChartQA, DocVQA, AI2D. Specialized benchmarks for charts, documents, diagrams.

MM-Vet (Yu et al., 2023). Tests integrated vision-language capabilities (recognition + OCR + knowledge + spatial reasoning + math + language generation).

RealWorldQA (xAI). Real-world image understanding.

The pattern. Frontier vision-language capability has substantially improved in 2023-2026. Many benchmarks are approaching saturation; harder benchmarks (MathVerse, MMMU-Pro, others) are emerging.

The gaps. Vision capability is uneven:

  • Visual recognition. Strong; near-human on many tasks.

  • OCR. Strong; production-grade for many use cases.

  • Spatial reasoning. Moderate; substantial failures on precise spatial tasks.

  • Chart and diagram understanding. Good and improving; not yet uniformly reliable.

  • Multi-image reasoning. Moderate; performance drops with more images.

  • Fine-grained recognition. Variable; struggles with rare or technical content.

  • Counting. Surprisingly weak; counting more than ~5-10 objects is often inaccurate.

The honest picture. Frontier VLMs are capable but uneven. Specific deployment depends on which capabilities are required.

Common VLM failure modes

A practical inventory of where current VLMs fail.

Visual hallucination. The model describes content not in the image: it may invent objects or attribute properties incorrectly. The related omission error is missing content that is present in the image.

OCR confidence errors. The model reads text from images; sometimes invents text when the image is illegible or text isn’t actually there.

Spatial relationship errors. “The dog is to the left of the cat” - VLMs often get spatial relationships wrong even when both objects are correctly identified.

Counting errors. Counting more than ~5-10 objects in an image is unreliable.

Tracking image references. “What did I show you in the previous image?” Multi-image context is handled imperfectly.

Refusal artifacts on images. VLMs sometimes refuse benign image-related requests because safety training has made them overcautious about visual content.

Modality confusion. Sometimes the model treats text that appears inside an image as if it were part of the user’s text input rather than as visual content to be described.

These failure modes inform deployment caveats; cross-reference §11 (evaluation) and §13 (critiques).

Where VLMs sit in 2026

The summary. Vision-language models are a production reality. Frontier deployment handles many visual tasks; the open-source ecosystem provides accessible alternatives.

The gaps. Specific capabilities (spatial reasoning, counting, fine-grained recognition, multi-image) remain imperfect. Visual hallucinations persist. Cost and latency for vision processing are substantial compared to text-only operations.

The trajectory. Continued capability improvement; expanded multimodal integration; declining cost per query. The 2026 state is much improved over 2022 baselines; further substantial improvement is expected.

The next section briefly covers vision generation; the more-detailed treatment lives in Generative Models §6-§9.


§5. Vision Generation: Text-to-Image and Beyond

The detailed treatment of the underlying generative techniques lives in Generative Models §6-§9. This section develops the multimodal aspect - how vision generation integrates with broader multimodal systems.

Text-to-image as multimodal generation

The dominant deployed application. Text-to-image takes a text prompt as input and produces an image. The architecture spans multiple modalities: language understanding (process the prompt); cross-modal alignment (relate text to image content); image generation (produce the pixel output).

Cross-reference Generative Models §6 for the underlying diffusion mechanism; §7 for flow-matching; §8 for conditional generation; §9 for modality-specific architectures. This section covers the multimodal-modeling aspects.

The standard architecture

The canonical text-to-image pipeline (review from Generative Models §8):

   TEXT-TO-IMAGE PIPELINE (review)

   text prompt
       │
       ▼
   Text encoder (CLIP / T5)
       │
       ▼
   Conditioning vectors
       │
       ▼
   Diffusion/flow-matching denoising loop
       │  (in latent space, conditioned on text)
       │  20-50 sampling steps
       ▼
   Latent representation
       │
       ▼
   VAE decoder
       │
       ▼
   Output image (e.g., 1024×1024)

The multimodal aspects.

  • Text encoder. CLIP or T5; produces text representations.

  • Cross-modal conditioning. The denoising network attends to text features via cross-attention.

  • Output modality. Image (different modality from input).
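The control flow of the pipeline can be sketched with toy stand-ins for each component. Everything here is an assumption made for illustration: the "text encoder" is a lookup table, the "decoder" is a reshape, and the velocity function is an oracle that already knows the target latent, where a real system uses a trained network conditioned on text through cross-attention. What the sketch does show faithfully is the shape of a flow-matching-style sampler: start from noise, take a fixed number of Euler steps along a predicted velocity, then decode.

```python
import random

# Hypothetical prompt-to-latent table standing in for a text encoder.
PROMPT_TO_TARGET = {
    "a red square": [1.0, 0.0, 0.0, 1.0],
    "a blue circle": [0.0, 1.0, 1.0, 0.0],
}

def encode_text(prompt):
    """Stand-in text encoder: returns conditioning for a known prompt."""
    return PROMPT_TO_TARGET[prompt]

def velocity(latent, t, conditioning):
    """Oracle velocity along a straight noise-to-data path.

    A trained denoising network would predict this from (latent, t, text).
    """
    return [(c - x) / (1.0 - t) for x, c in zip(latent, conditioning)]

def decode(latent):
    """Stand-in decoder: reshape a 4-vector into a 2x2 'image'."""
    return [latent[:2], latent[2:]]

def generate(prompt, steps=20, seed=0):
    """Noise -> iterative conditioned refinement in latent space -> decode."""
    rng = random.Random(seed)
    conditioning = encode_text(prompt)
    latent = [rng.gauss(0.0, 1.0) for _ in conditioning]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(latent, t, conditioning)
        latent = [x + dt * vi for x, vi in zip(latent, v)]
    return decode(latent)

image = generate("a red square")
print([[round(p, 6) for p in row] for row in image])
```

The text prompt enters only through `conditioning`, and the output is in a different modality from the input, which are the two multimodal aspects listed above.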

Notable production systems (cross-reference Generative Models §9):

  • Midjourney v6/v7. Visually distinctive; closed.

  • DALL-E 3 / GPT-4o image generation. Tightly integrated with chat.

  • Stable Diffusion 3 / SDXL. Open weights; widely deployed.

  • Flux (Black Forest Labs). Flow-matching-based; high quality.

  • Imagen 3 / Imagen 4 (Google).

  • Firefly (Adobe). Stock-photography-trained; clean licensing.

Specialized vision generation models

Beyond general text-to-image, specialized vision generation has matured.

ControlNet (Zhang et al., 2023). Adds spatial conditioning (depth maps, edge maps, pose skeletons) to text-to-image generation. Allows precise spatial control.

Image editing models. InstructPix2Pix, DALL-E 3 editing, Adobe Firefly editing. Take an image and instructions; produce edited image.

Image-to-image style transfer. Specialized models for stylistic transformations.

Layout-conditional generation. Generate images respecting specified bounding boxes or scene structures.

Personalized generation. DreamBooth, LoRA-based personalization. Generate images of specific subjects (people, objects) from a few reference images.

These specialized capabilities increasingly integrate into general multimodal models; the boundary between general and specialized is fluid.

The unification with vision-language understanding

A specific 2024 trend. Native multimodal models (GPT-4o, Gemini, Chameleon) increasingly unify vision understanding and vision generation in single architectures.

The architectural pattern.

   UNIFIED VISION UNDERSTANDING + GENERATION (e.g., GPT-4o)

   Mixed input tokens:
     - Text tokens (subword BPE)
     - Image tokens (VQ-VAE codes or patch embeddings)
       
       │
       ▼
   Unified Transformer
   (all tokens flow through same model;
    cross-modal attention at every layer)
       │
       ▼
   Mixed output tokens:
     - Text tokens (text response)
     - Image tokens (generated image)
     - May interleave (text with embedded images)

The capability. The model can:

  • Describe an image. (Text output from image input.)

  • Generate an image. (Image output from text input.)

  • Edit an image. (Image output from image + text input.)

  • Generate interleaved text and images. (Mixed output.)

  • Reason about generated images. (Self-grounded generation with verification.)

The trade-off. Unified systems are more flexible but less specialized. A dedicated text-to-image model (Stable Diffusion, Flux) may produce higher-quality images than a unified system; the unified system handles broader tasks with adequate quality.

The 2026 landscape. Both paradigms coexist. Specialized text-to-image dominates content creation use cases (artists, designers). Unified multimodal dominates general use (chatbots producing images as part of responses, agents that need to generate and reason about images).

Where vision generation in multimodal sits in 2026

The summary. Vision generation is mature production technology. The integration with vision-language understanding (unified multimodal models) is the 2024-2026 frontier. Specialized text-to-image and unified multimodal both have legitimate niches.

Cross-reference Generative Models for the detailed treatment of the underlying generative-modeling techniques.


§6. Audio and Speech

A specific modality with substantial 2022-2026 maturation. Audio spans speech (the dominant subcategory), music, environmental sound, and acoustic data more broadly. This section develops speech-to-text, text-to-speech, general audio understanding, music generation, and the integration with unified multimodal models.

Speech-to-text: Whisper and beyond

The dominant deployed audio application. Speech-to-text (or automatic speech recognition, ASR) takes audio input and produces text transcription.

Whisper (Radford, Kim et al., OpenAI, 2022) “Robust Speech Recognition via Large-Scale Weak Supervision.” The landmark modern speech-to-text model.

The recipe.

  • Architecture. Encoder-decoder Transformer. Encoder takes log-mel spectrograms of audio; decoder generates text token-by-token.

  • Training data. 680,000 hours of multilingual, multi-task audio paired with transcripts collected from the web. Among the largest supervised ASR training corpora at the time of release.

  • Multi-task training. Same model handles transcription, translation (audio in language X → text in English), language identification, voice activity detection. The task is selected by special tokens at the start of the decoder sequence, so all tasks share one next-token-prediction objective.

  • Robustness. Trained on noisy real-world data; substantially more robust to accents, background noise, and recording quality than prior systems.
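Two details of the recipe are concrete enough to sketch: the size of the encoder input and the way one decoder serves several tasks. The numbers below (16 kHz audio, 10 ms hop, 80 mel channels, 30-second chunks, a stride-2 convolution before the encoder) follow the original Whisper paper; the helper functions are illustrative and are not the library's API.

```python
SAMPLE_RATE = 16_000     # Hz
HOP_LENGTH = 160         # samples between frames (10 ms)
N_MELS = 80              # mel channels in the original models
CHUNK_SECONDS = 30       # audio is processed in fixed 30-second windows

def mel_frames(seconds, sample_rate=SAMPLE_RATE, hop=HOP_LENGTH):
    """Number of spectrogram frames for a clip of the given length."""
    return int(seconds * sample_rate) // hop

def encoder_positions(frames, conv_stride=2):
    """Sequence length seen by the Transformer encoder after downsampling."""
    return frames // conv_stride

def task_prompt(language, task, timestamps=False):
    """Special tokens that tell the decoder which task to perform."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

frames = mel_frames(CHUNK_SECONDS)
print((N_MELS, frames), encoder_positions(frames))
print(task_prompt("fr", "translate"))
```

Switching from French transcription to French-to-English translation changes one token in the prompt and nothing in the model, which is how a single set of weights covers the whole task list.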

The result. Whisper substantially advanced open-source ASR quality. Multiple model sizes were released (tiny through large, with later large-v2 and large-v3 checkpoints). It became standard infrastructure.

The trajectory. Subsequent ASR models extended Whisper or pursued similar large-scale training:

  • Whisper large-v3 (OpenAI, 2023). Improved multilingual performance.

  • Distil-Whisper (Hugging Face, 2023). Distilled smaller versions of Whisper.

  • Seamless (Meta, 2023). Streaming ASR + translation in a single model.

  • Many specialized ASR systems. Voice-assistant integrations; medical transcription; legal transcription.

The 2026 state. Speech-to-text is substantially solved for clean recordings of major languages. Edge cases (heavy accents, very low-resource languages, severely noisy audio, technical domain vocabulary) remain imperfect.

Text-to-speech

The inverse task. Text-to-speech (TTS) takes text and produces audio.

The pre-deep-learning era relied on concatenative synthesis (stitching together recorded speech units) and statistical parametric synthesis; the results often sounded robotic. Deep-learning TTS (WaveNet, Tacotron) substantially improved quality through the late 2010s.

The 2022-2026 advances. Modern TTS produces speech that listeners often cannot distinguish from human recordings in many settings.

Notable systems.

Tortoise TTS (Betker, 2022). Open-source high-quality TTS; expressive control; voice cloning from short reference samples.

XTTS, XTTS-v2 (Coqui, 2023). Multilingual TTS with voice cloning.

VALL-E (Microsoft, 2023). Speech LM trained on 60K hours of data; high-quality voice cloning from 3-second reference.

NaturalSpeech 2, NaturalSpeech 3 (Microsoft, 2023-2024). Diffusion-based TTS.

Voicebox (Meta, 2023). In-context-learning TTS.

ElevenLabs. Commercial high-quality TTS with voice cloning.

OpenAI TTS, Google TTS, Anthropic voice integration. Production-grade TTS in major chatbot products.

The 2026 capabilities.

  • Voice cloning. Replicate a specific voice from short reference audio.

  • Emotional control. Generate speech with specified emotion or style.

  • Multilingual. Single models handling many languages.

  • Real-time. Low-latency synthesis for interactive applications.

The concerns. Voice cloning enables deepfake voice impersonation. Production deployments include detection mechanisms; the cat-and-mouse dynamic between detection and adversarial deepfake creation is ongoing.

General audio understanding

A broader category. General audio modeling extends beyond speech-to-text to all kinds of audio - music, environmental sound, acoustic events - and covers both understanding and generation.

AudioLM (Borsos et al., Google, 2022). Audio language model trained on raw audio. Can continue audio sequences (music continuation, speech continuation); learns acoustic structure without text supervision.

MusicLM (Agostinelli et al., Google, 2023). Music generation from text descriptions. Hierarchical audio generation; produces 30+ second music clips from prompts.

AudioGen (Kreuk et al., Meta, 2023). Generate sound effects from text descriptions (“a dog barking at a thunderstorm”).

CLAP (Contrastive Language-Audio Pretraining; Microsoft and LAION variants, 2022+). Audio analogue of CLIP. Joint audio-text embeddings for audio retrieval and classification.

Audio Flamingo (Kong et al., NVIDIA, 2024). Audio-language model in the Flamingo lineage.

Qwen-Audio, Qwen2-Audio (Alibaba). Open audio-language models.

The 2026 state. General audio understanding is less mature than speech-to-text or vision-language. Production deployment exists but is narrower; research is active.

Music generation

A specific application. Music generation has become substantial commercial reality.

Notable systems.

MusicLM (Google, 2023). Foundational research; released only as a limited experimental demo rather than a full product.

MusicGen (Meta, 2023). Open-source music generation from text.

Suno (commercial, 2023-2024+). High-quality vocal-and-instrumental music generation from text prompts. Reached significant commercial scale by 2024-2025.

Udio (commercial, 2024+). Competitor to Suno; similar capabilities.

Stable Audio, Stable Audio 2.0, Stable Audio Open (Stability AI). Diffusion-based music and sound generation; open weights for some variants.

Adobe Generative Sound. Commercial sound design tools.

The 2026 state. Music generation can produce short polished pieces (30 seconds to a few minutes) in many genres. Longer pieces, complex compositions, specific stylistic requirements remain challenging.

The legal context. Music generation has been subject to substantial copyright concerns. Several lawsuits filed against Suno and Udio (2024) alleging training-data infringement; outcomes still evolving in 2026. The legal framework affects deployment but does not fundamentally change technical capabilities.

Audio in unified multimodal models

The 2024-2026 frontier. Audio is increasingly integrated into native multimodal models.

GPT-4o (OpenAI, May 2024). Handles audio input and output natively in a unified model. Voice Mode integrates speech with text in a single architecture. Substantial step beyond bolted-on TTS+ASR pipelines.

Gemini Live (Google, 2024). Real-time voice interaction with Gemini.

Claude voice integration (Anthropic, planned/rolling out 2024-2025).

The capability. The model can:

  • Listen. Process spoken queries.

  • Speak. Generate spoken responses.

  • Reason about audio content. Discuss music, identify sounds, analyze audio patterns.

  • Interactive conversation. Real-time back-and-forth conversation including voice.

The advantages over bolted-on TTS+ASR.

  • Tone preservation. The model “hears” emotional cues; can respond appropriately.

  • Conversational naturalness. Latency reduction; interruption handling; turn-taking dynamics.

  • Audio reasoning. Integrated understanding of audio content (music, ambient sound) within reasoning.

The 2026 deployment. Voice interfaces have become production-grade for major chatbots. Real-time voice conversation with frontier models is mainstream user experience.

Where audio sits in 2026

The summary. Audio modalities have substantially matured in 2022-2026. Speech-to-text (Whisper-family) is production-grade. Text-to-speech is at human-or-better quality. Music generation has substantial commercial deployment. General audio understanding is research-stage but advancing. Unified audio-language-vision multimodal is the frontier.

The remaining issues. Long-form music generation (full songs with development). High-quality audio at low compute cost (current models are still expensive). Audio understanding for non-speech non-music (environmental, acoustic monitoring). Voice cloning safeguards.

The trajectory. Continued capability improvement; expanded multimodal integration; declining cost. The 2026 state is much improved over 2022 baselines; further substantial improvement is expected.

The next section develops video, the modality with the largest data scarcity and highest compute requirements.


§7. Video Models

The modality with the largest data dimensions, the highest compute requirements, and the most rapid 2024-2026 advancement. This section develops video understanding and generation, with cross-references to Generative Models §9 for the modality-specific generation techniques.

Video as space-time data

The structural characteristic. Video is space-time data - a sequence of images (spatial structure) over time (temporal structure). The combination produces very high data volume.

The token-budget calculation. At standard resolution:

  • A single image at 1024×1024 with patch size 16×16 produces 4,096 patches → ~4,096 tokens.

  • One second of video at 30 fps produces 30 frames → 30 × 4,096 = 122,880 (~120,000) tokens.

  • One minute of video produces ~7.4 million tokens.

This is enormous. Standard model contexts (128K-1M tokens) cannot hold full-resolution video at meaningful length. Video processing requires aggressive compression and abstraction.
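
The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming non-overlapping square patches and no temporal compression; the function names are illustrative and not taken from any library.

```python
def tokens_per_frame(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for a single frame."""
    return (height // patch) * (width // patch)


def video_tokens(seconds: float, fps: float, height: int = 1024,
                 width: int = 1024, patch: int = 16) -> int:
    """Total patch tokens for a clip when every frame is tokenized in full."""
    frames = int(seconds * fps)
    return frames * tokens_per_frame(height, width, patch)


per_frame = tokens_per_frame(1024, 1024, 16)
per_second = video_tokens(1, 30)
per_minute = video_tokens(60, 30)

# Seconds of raw 30 fps video that would fit in a 1M-token context.
seconds_in_1m_context = 1_000_000 / per_second

print(per_frame, per_second, per_minute, round(seconds_in_1m_context, 1))
```

Under these assumptions a 1M-token context holds only about eight seconds of uncompressed 30 fps video, which is why the compression approaches below are necessary rather than optional.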

The compression approaches.

Reduced frame rate. Process 1-2 fps instead of 30. Loses motion fidelity; preserves semantic content.

3D tokenization. Group adjacent frames into space-time tokens. A 4-frame 64×64 spatial block becomes one token. Sora-style approach.

Hierarchical encoding. Process video at multiple scales - coarse global features plus selected fine-grained details.

Selective attention. Process all frames but use sparse attention patterns that don’t have quadratic cost in frame count.
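
Two of the approaches above, reduced frame rate and 3D tokenization, can be sketched with plain array reshaping. This is an illustrative sketch only: the tubelet size (4 frames × 64×64 pixels) follows the example in the text, the function names are hypothetical, and a real system would pass each tubelet through a learned encoder rather than flattening raw pixels.

```python
import numpy as np


def subsample_frames(video: np.ndarray, src_fps: float, dst_fps: float) -> np.ndarray:
    """Keep every n-th frame so the clip plays at roughly dst_fps."""
    step = max(1, round(src_fps / dst_fps))
    return video[::step]


def tubelet_tokens(video: np.ndarray, t: int = 4, p: int = 64) -> np.ndarray:
    """Group a (T, H, W, C) video into space-time blocks of t frames x p x p pixels.

    Returns an array of shape (num_tokens, t * p * p * C), one row per block.
    Frames and pixels that do not fill a whole block are cropped.
    """
    T, H, W, C = video.shape
    T2, H2, W2 = T - T % t, H - H % p, W - W % p
    v = video[:T2, :H2, :W2]
    v = v.reshape(T2 // t, t, H2 // p, p, W2 // p, p, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)  # block indices first, block contents last
    return v.reshape(-1, t * p * p * C)


# Two seconds of 30 fps video at 128x128, used only to show the shapes.
video = np.zeros((60, 128, 128, 3), dtype=np.float32)
sparse = subsample_frames(video, src_fps=30, dst_fps=2)  # 4 frames remain
tokens = tubelet_tokens(sparse, t=4, p=64)
print(sparse.shape, tokens.shape)
```

The two steps compose: frame subsampling cuts the temporal axis first, and tubelet grouping then reduces the number of tokens per remaining frame group.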

The 2026 state. Video tokenization remains an active engineering area. Different systems make different choices; no single approach dominates.

Video understanding

The task category. Video understanding takes video input and produces analytic output - captions, answers to questions about content, action recognition, temporal localization.

The evolution from image understanding. Pre-2022 video models were mostly trained per-task (action recognition; activity detection; etc.) with specialized architectures. Post-2022, video understanding increasingly extends vision-language models to handle temporal structure.

Notable systems.

Video-LLaMA, Video-ChatGPT, VideoChat. Open-source video-language models built on the LLaVA pattern with temporal extensions.

Gemini 1.5 Pro (Google, 2024). Native multimodal model with substantial video understanding. Can process hour-long videos within its long context window.

GPT-4o video understanding. Vision capabilities extended to video frames; multi-frame reasoning improving.

Claude 3.5+ video capability rolling out 2024-2025.

Qwen2.5-VL, InternVL2+ open-source video understanding.

The capabilities (2026).

  • Short-video understanding (< 1 minute): mature; production-grade for many tasks.

  • Long-video understanding (1-60 minutes): emerging; some frontier models (Gemini 1.5+) handle this.

  • Hour-scale video (1+ hour): demonstrated by Gemini 1.5 Pro and later; not widely available.

  • Temporal reasoning (cause-and-effect, action ordering): partial; reliable for simple cases.

  • Action recognition: good; was already mature.

  • Video question answering: good for factual questions; poorer for reasoning-intensive.

The benchmarks.

Video generation

The high-profile 2024-2025 development. Sora (OpenAI, February 2024) demonstrated minute-length text-to-video generation at substantially higher quality than prior systems. Cross-reference Generative Models §9 for the underlying mechanisms.

Notable systems (review from Generative Models §9).

Sora (OpenAI, Feb 2024). Diffusion-based; minute-length coherent generation; DiT architecture; substantial training compute. Announced February 2024; public product release December 2024.

Veo, Veo 2, Veo 3 (Google DeepMind, 2024-2025). Competitor to Sora; improving capability through 2025-2026.

Gen-3, Gen-4 (Runway, 2024-2025). Commercial text-to-video for creative applications.

Mochi (Genmo, 2024). Open-weight text-to-video model.

Hunyuan Video (Tencent, 2024). Open-source video generation.

Stable Video Diffusion (Stability AI, 2023+). Open-weight video generation foundation models.

Kling (Kuaishou, China, 2024+). Substantial commercial video generation.

Adobe Firefly Video (Adobe, 2024+). Creative-application-focused.

The capabilities (2026).

  • Short-clip generation (5-15 seconds): mature; production-grade for many use cases.

  • Minute-length generation: substantial; quality improving rapidly.

  • Multi-minute generation: emerging; reliability and quality still uneven.

  • Specific control (camera angles, character consistency, scene transitions): partial; getting better.

  • Visual coherence (consistent objects across frames): good for most cases; struggles with rare interactions.

  • Physical realism (gravity, contact, fluid dynamics): variable; better in 2025-2026 than earlier.

The current artefact regimes. Video generation in 2026 still exhibits:

  • Object permanence failures (objects appearing/disappearing).

  • Physics violations (impossible motions, floating objects).

  • Character drift (the same character looks slightly different across frames).

  • Text rendering (in-video text often garbled).

  • Long-range coherence (story consistency over minutes).

The trajectory. Each generation reduces these artefacts. The 2025-2026 frontier is substantially better than the 2024 baseline.

The data scarcity problem for video

A specific structural concern. Video data is substantially scarcer than text or image data.

The comparison.

  • Text. Trillions of tokens available from web scraping; effectively unlimited for training.

  • Images. Billions of image-text pairs (LAION-5B and successors); also effectively unlimited.

  • Audio. Hundreds of thousands of hours of paired audio-text (Whisper’s 680K hours; comparable to other large speech datasets).

  • Video. Hundreds of thousands of hours of paired video-text - but each hour is much larger in information content than text or images.

The mismatch. Video has orders of magnitude more bits per data point than text but orders of magnitude less data overall. The information bottleneck is opposite to text - too much per-item content, too few items.

The implications.

Training data quality matters more. Each video example contributes substantial information; careful curation has high marginal value.

Self-supervised pretraining is essential. Video models cannot rely on paired captions alone; they must use self-supervised objectives (predict next frame; reconstruct masked patches; etc.) to leverage unpaired video.

Data efficiency improvements matter. Methods that learn from less video data (e.g., transferring from image pretraining, using synthetic data) have outsized value.

Long-video benchmarks are expensive. Each evaluation point requires processing long video; benchmark costs are substantial.

Multimodal training compensates. Video models trained jointly with text and image data benefit from cross-modal transfer; pure-video pretraining is less effective.

Where video sits in 2026

The summary. Video generation has substantially advanced in 2024-2026; minute-length high-quality generation is production reality. Video understanding has matured for short clips; long-video understanding is emerging. The data scarcity problem persists but is partially compensated by multimodal training.

The remaining issues. Long-form video coherence remains imperfect. Physical-realism artefacts persist. Long-video understanding is uneven. Compute costs are substantial. Cross-reference Generative Models §10 OP-GM-8 (long-horizon coherent generation) for the related open problem.

The trajectory. Video is the fastest-improving multimodal category in 2024-2026. Continued advancement is expected.


§8. Native Multimodality and Token-Unified Models

The 2024-2026 frontier in multimodal modeling. Native multimodal models train on multiple modalities from the start; token-unified models process all modalities as a single token sequence. This section develops the dominant patterns and the production systems.

The native multimodal philosophy

A specific architectural commitment. Native multimodal training proposes:

  • Don’t start with a text-pretrained LLM and add vision/audio/video later.

  • Do train from the start on a mixture of modalities - text, images, audio, video, code, structured data.

  • The result: a model with integrated multimodal understanding from the earliest layers.

The argument. Modalities are not independent. Visual content has linguistic structure (text in images; visual metaphors in language; cross-references). Audio has linguistic content (speech; named entities in sound effects). Code is structured text. Adding modalities to a text-pretrained model produces bolted-on understanding; native training produces integrated understanding.

The empirical evidence. Frontier native multimodal models (Gemini 1.0+, GPT-4o, Chameleon) demonstrate emergent capabilities not easily replicated by late-fusion systems:

  • Reading diagrams and producing code based on them.

  • Interpreting screenshots of UIs and producing exact-match actions.

  • Reasoning about audio content (music structure, emotional cues) within text responses.

  • Cross-modal context grounding (referring back to images shown earlier in a conversation).

The contrarian view. Late-fusion models with sufficient scale can match native multimodal on most tasks. The capability gap is real but not absolute; it costs substantial compute to achieve and may not justify itself for all use cases.

The 2026 industry pattern. Frontier proprietary models go native; most open-source models are late-fusion. The gap is one factor (among others) driving the proprietary-vs-open capability differential.

Tokenization for unified multimodal

A specific technical requirement. To process multiple modalities in a single Transformer, all modalities must be tokenized into a common representation.

The unified-discrete approach. Tokenize every modality into discrete tokens; concatenate; feed to a Transformer that does not distinguish modalities.

  • Text: BPE/SentencePiece tokens.

  • Images: VQ-VAE codes or similar discrete image tokens.

  • Audio: EnCodec or similar discrete audio codes.

  • Video: 3D VQ-VAE or per-frame discrete codes.

  • Action: Discretized motor commands or UI events.
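
One common way to realize the unified-discrete approach is to give each modality its own contiguous range of ids inside a single shared vocabulary. The sketch below assumes this offset scheme; the codebook sizes are hypothetical and chosen only for illustration, and real systems typically also add special boundary tokens.

```python
# Hypothetical per-modality codebook sizes (not taken from any specific model).
SIZES = {"text": 32_000, "image": 8_192, "audio": 1_024}


def build_offsets(sizes):
    """Assign each modality a contiguous id range in one shared vocabulary."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = (start, size)
        start += size
    return offsets, start


OFFSETS, VOCAB_SIZE = build_offsets(SIZES)


def to_unified(modality, local_ids):
    """Map modality-local code indices to ids in the shared vocabulary."""
    start, size = OFFSETS[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"{modality} code {i} outside [0, {size})")
    return [start + i for i in local_ids]


def modality_of(token_id):
    """Recover which modality a shared-vocabulary id belongs to."""
    for name, (start, size) in OFFSETS.items():
        if start <= token_id < start + size:
            return name
    raise ValueError(f"token id {token_id} outside the vocabulary")


# One flat sequence the Transformer would see: text, then image codes, then audio.
sequence = (to_unified("text", [17, 942])
            + to_unified("image", [0, 8191])
            + to_unified("audio", [5]))
print(sequence, [modality_of(t) for t in sequence])
```

The Transformer itself never consults the offset table; the table matters only at the boundaries, when encoding inputs and when routing generated ids to the right decoder.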

The advantages.

  • Architectural simplicity. Single Transformer; no modality-specific machinery.

  • Unified generation. The model can generate any modality’s tokens; output flexible.

  • Cross-modal attention is automatic. All tokens attend to each other; cross-modal relationships emerge naturally.

The disadvantages.

  • Quality loss in tokenization. VQ-VAE-style tokenization loses some information; high-resolution images and high-fidelity audio suffer.

  • Tokenization is a learned component. Bad tokenizers limit downstream performance; tokenizer engineering matters.

The hybrid approach. Many modern systems use continuous representations for some modalities (high-fidelity vision via patch embeddings) and discrete tokens for others (text). The Transformer handles both via appropriate input adaptations.

Chameleon: early-fusion mixed-modal

A specific architectural example. Chameleon (Meta AI, 2024) “Chameleon: Mixed-Modal Early-Fusion Foundation Models.” A native multimodal model trained on interleaved image+text data.

The architecture.

  • Image and text tokens flow through the same Transformer from the start.

  • Images tokenized via VQ-VAE into discrete codes; text via standard BPE.

  • Output is also unified: the model produces token sequences that may interleave text and image tokens; image tokens are decoded by a VQ-VAE decoder into actual images.

The training. Trained on internet-scale interleaved image-text data (web pages, articles, documents with mixed content). No separate image-only or text-only stages.

The result. Chameleon demonstrates single-model image-and-text generation and understanding. Can produce interleaved outputs (“Here’s the analysis [text]. The graph would look like this: [generated image]. As you can see [text]...”).

The trade-off. Chameleon’s image-generation quality is below specialized text-to-image systems (Stable Diffusion, Flux); its image-understanding quality is below specialized VLMs. But it does both in one model.

GPT-4o: the omni model

The flagship native multimodal release. GPT-4o (OpenAI, May 2024) - “o” for “omni.” Handles text, image, audio in a unified architecture.

The capabilities.

  • Input. Text, images, audio (and combinations).

  • Output. Text, images, audio (and combinations).

  • Voice Mode. Real-time spoken conversation; latency comparable to human conversation; emotional tone in voice output.

  • Image generation. Native image generation was released later (March 2025).

  • Cross-modal reasoning. Reason about audio content while producing text; describe images in spoken responses; etc.

The technical details (largely undisclosed). OpenAI has not published detailed architectural specifications. Based on capabilities and behaviour, GPT-4o appears to be a native multimodal model with unified tokenization across modalities.

The reception. Substantial qualitative advance over GPT-4V. Voice Mode in particular changed the user-experience landscape for AI conversation.

Gemini: native multimodality from the start

A different frontier example. Gemini family (Google DeepMind, December 2023+).

Gemini 1.0 was explicitly natively multimodal from initial training. Gemini 1.5 Pro added long context (1M+ tokens) supporting hour-long video processing. Gemini 2.0+ continued capability expansion through 2025-2026.

The differentiators.

  • Long context. Gemini 1.5+ handles substantially longer contexts than competitors, including long videos.

  • Multimodal context. The long context can include mixed text, images, video, audio.

  • Native training emphasis. Marketing-and-engineering claim that native training produces superior cross-modal capability.

The 2026 product landscape. Gemini is one of three frontier multimodal model families (alongside GPT and Claude). The capability comparison varies by task; all three have substantial multimodal capability.

Mixed-modal output generation

A specific capability worth detailed treatment. Some modern systems can generate interleaved outputs across modalities - text with embedded images, transcripts with audio, etc.

The applications.

  • Documents with generated diagrams. Producing reports with explanatory diagrams alongside text.

  • Interactive tutorials. Generating educational content with images and explanations.

  • Multimodal storytelling. Stories with illustrations generated alongside narrative.

  • Code with visualizations. Code outputs that include generated plots, diagrams.

The technical pattern.

   MIXED-MODAL OUTPUT GENERATION

   Input: "Explain how photosynthesis works."

   Model output (token sequence):
     [text tokens] "Photosynthesis is the process by which..."
     [text tokens] "It can be summarized in this diagram:"
     [image tokens] (generates a diagram of photosynthesis)
     [text tokens] "As shown above, light energy is captured..."
     [text tokens] "Here's a chemical equation:"
     [text tokens] "6CO2 + 6H2O → C6H12O6 + 6O2"
     [text tokens] "Now let me explain each step..."

   The output is a single token stream that the decoder
   renders as text + embedded images.
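
The rendering step described above amounts to splitting the generated stream at image boundaries. A minimal sketch follows, assuming the model emits begin-of-image and end-of-image sentinel tokens; the sentinel ids are hypothetical, and the image segments would still need a separate decoder (for example a VQ-VAE decoder) to become pixels.

```python
BOI, EOI = 50_000, 50_001  # hypothetical begin-/end-of-image sentinel ids


def split_stream(tokens, boi=BOI, eoi=EOI):
    """Split one generated token stream into ordered (kind, tokens) segments."""
    segments, current, mode = [], [], "text"
    for tok in tokens:
        if tok == boi and mode == "text":
            if current:
                segments.append(("text", current))
            current, mode = [], "image"
        elif tok == eoi and mode == "image":
            segments.append(("image", current))
            current, mode = [], "text"
        else:
            current.append(tok)
    if current:
        # Trailing text, or an image the model never closed.
        segments.append((mode, current))
    return segments


stream = [11, 12, BOI, 900, 901, 902, EOI, 13]
segments = split_stream(stream)
print(segments)
```

Each text segment goes to the text detokenizer and each image segment to the image decoder, in the order they appear.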

The 2026 state. Mixed-modal output generation is increasingly common in frontier systems. Quality and reliability vary; the capability is real and growing.

The 2024-2026 frontier

A specific industry assessment. The native-multimodal trajectory through 2024-2026:

  • 2023. Gemini 1.0 introduces native multimodality framing. GPT-4V brings vision to GPT-4 (likely late fusion).

  • 2024 May. GPT-4o demonstrates substantial native multimodal capability; sets the high-water mark.

  • 2024 mid-year. Chameleon (paper May 2024; research weights released June 2024) and similar open systems demonstrate the native multimodal pattern is feasible outside frontier proprietary labs.

  • 2025. Native multimodality becomes table-stakes for frontier labs. Native video, audio, image, text integration is standard.

  • 2026. Expanding modality coverage (3D, structured data, action); cross-modal reasoning improves; long-multimodal-context becomes routine.

The trajectory continues. Native multimodality is the dominant architectural direction for frontier AI.

Where native multimodality sits in 2026

The summary. Native multimodal models are the frontier of AI architecture. Major frontier proprietary models are native multimodal; capability advances are substantial. The open-source ecosystem trails but is catching up.

The remaining issues. Training cost is substantial (native multimodal training requires multimodal data at scale). Tokenizer engineering is non-trivial. Cross-modal hallucinations occur. Long-multimodal-context quality is uneven.

The trajectory. Continued expansion across modalities; continued capability advancement; declining cost relative to capability. The 2026 state is substantially advanced over 2022; further substantial improvement is expected.

The next section develops vision-language-action models - the application of multimodal modeling to robotic control.


§9. Vision-Language-Action Models

The extension of multimodal modeling to physical action. Vision-Language-Action (VLA) models combine perception (vision), instruction-understanding (language), and motor control (action) in unified architectures. This section develops the VLA paradigm; the broader robotics context lives in the planned Robotics chapter.

The robotics connection

The motivating problem. Pre-VLA robotic control followed two patterns:

Classical control. Hand-engineered controllers for specific tasks. Reliable on the task they were designed for; brittle outside that task. Each new task required new engineering.

Specialized RL. Train an RL policy on a specific task via simulation or real-world rollouts (cross-reference RL §7). Substantially better than classical for many tasks; still narrow per-policy; sample-inefficient.

The promise of VLA. Use large pretrained multimodal models as policy backbones. The model brings broad visual understanding, language understanding, and (with appropriate adaptation) action capability. Tasks specified in natural language; behaviour emerges from large-scale pretraining plus task-specific fine-tuning.

The pattern. A VLA model receives:

  • Visual input - camera images of the robot’s environment.

  • Language input - task description (“pick up the red block”).

  • State input - robot proprioception (joint angles, gripper position) where relevant.

And produces:

  • Action output - motor commands (joint torques, gripper open/close, end-effector poses).

The architecture. Modern VLAs are typically Transformer-based with multimodal tokenization. Action tokens are integrated into the same vocabulary as vision and language tokens; the model generates action tokens autoregressively.

RT-1: the first scaled robotic Transformer

The early-2023 inflection. RT-1 (Robotic Transformer 1; Brohan et al., Google DeepMind, 2022-2023). The first large-scale Transformer-based robot policy.

The setup.

  • Architecture. Image encoder (EfficientNet-based) + Token Learner + Transformer.

  • Action representation. An 11-dimensional action space (arm pose, gripper, mobile base, mode switch) discretized into 256 bins per dimension; output is a sequence of action tokens.

  • Training data. ~130,000 robot trajectories collected over 17 months on a fleet of 13 real robots. Substantial scale for the time.

  • Tasks. 700+ language-described tasks; demonstrated cross-task generalization.
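
The action representation above, uniform binning of each action dimension into 256 bins, can be sketched as follows. This is an illustrative sketch rather than RT-1's code: the per-dimension bounds of [-1, 1] are an assumption, and decoding to bin centers is one reasonable choice among several.

```python
import numpy as np

NUM_BINS = 256
ACTION_DIMS = 11
# Assumed per-dimension bounds; a real system would use the robot's actual limits.
LOW = -np.ones(ACTION_DIMS)
HIGH = np.ones(ACTION_DIMS)


def discretize(action, low=LOW, high=HIGH, num_bins=NUM_BINS):
    """Map continuous actions to integer bin indices in [0, num_bins - 1]."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)  # in [0, 1]
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)


def undiscretize(bins, low=LOW, high=HIGH, num_bins=NUM_BINS):
    """Map bin indices back to continuous values at the bin centers."""
    return low + (bins + 0.5) / num_bins * (high - low)


rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=(5, ACTION_DIMS))
bins = discretize(actions)      # each row is 11 action tokens
recovered = undiscretize(bins)
max_error = np.abs(recovered - actions).max()
print(bins[0], max_error)
```

The round-trip error is bounded by half a bin width, which for 256 bins over a range of 2 is under 0.004 per dimension. That resolution loss is the price of treating control as token prediction.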

The result. RT-1 substantially advanced robotic learning at scale. Demonstrated that Transformer-based policies could absorb large diverse robot data; cross-task transfer was real.

The limitations. Required substantial real-robot data collection (expensive); generalization beyond training environment was limited; computational cost was significant.

RT-2: vision-language-action with a VLM backbone

The follow-up that defined the VLA paradigm. RT-2 (Brohan et al., Google DeepMind, 2023). “Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.”

The architecture.

  • Backbone. A large vision-language model (PaLI-X or PaLM-E) pretrained on web-scale image-text data.

  • Co-fine-tuning. Continue pretraining on a mixture of: (a) web image-text data; (b) robot trajectories.

  • Action representation. Actions tokenized into the VLM’s existing text vocabulary (using existing token IDs to represent action codes).

  • Output. The VLM produces text-token outputs that include action tokens; the action tokens are decoded to motor commands.
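
Representing actions in the text vocabulary means that a robot action becomes an ordinary string the VLM can emit. The sketch below assumes a space-separated string of bin indices; the separator, the dimension count, and the function names are illustrative assumptions, not the RT-2 implementation.

```python
def bins_to_text(bins):
    """Serialize action bin indices as a space-separated string."""
    return " ".join(str(int(b)) for b in bins)


def text_to_bins(text, expected_dims, num_bins=256):
    """Parse a generated string back into bin indices, rejecting malformed output."""
    parts = text.split()
    if len(parts) != expected_dims:
        raise ValueError(f"expected {expected_dims} values, got {len(parts)}")
    bins = [int(p) for p in parts]
    if any(not 0 <= b < num_bins for b in bins):
        raise ValueError("bin index out of range")
    return bins


encoded = bins_to_text([1, 128, 91, 241, 5, 101, 127])
decoded = text_to_bins(encoded, expected_dims=7)
print(encoded, decoded)
```

Validation on the parsing side matters in practice: a language model can emit any string, so the controller must reject outputs that do not decode to a well-formed action.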

The result. RT-2 substantially outperformed RT-1 on novel tasks. The web-pretraining provided world knowledge the robot inherited (recognizing objects it had never been trained on; following novel instructions). Demonstrated substantial generalization from web data to robotic action.

The conceptual lesson. Pretraining on web data benefits robotic policies. The VLA paradigm leverages large-scale multimodal pretraining as substrate for robotic capability.

PaLM-E and the embodied multimodal era

A specific framing. PaLM-E (Driess, Xia, Sajjadi et al., Google DeepMind, 2023). “PaLM-E: An Embodied Multimodal Language Model.”

The setup. A PaLM language model with multimodal input - text, images, robot proprioception, even raw sensor readings. Trained jointly on multimodal data and robotic data.

The contribution. Demonstrated that embodied multimodality - including robot-specific modalities like joint angles and sensor readings - could be unified in a single language model. Substantial cross-domain transfer (robot tasks benefiting from general vision-language training).

The 2023-2026 trajectory. PaLM-E and RT-2 established the VLA paradigm. Subsequent work has refined and extended.

OpenVLA and the open VLA family

The open-source 2024 inflection. OpenVLA (Kim, Pertsch, Karamcheti, Xiao et al., Stanford / Berkeley / Toyota Research Institute, 2024). “OpenVLA: An Open-Source Vision-Language-Action Model.”

The setup.

  • Backbone. Prismatic-7B vision-language model (open-source).

  • Training data. Open X-Embodiment dataset (Open X-Embodiment Collaboration, 2023-2024) - a unified collection of ~1M+ robot trajectories from 22+ robot platforms.

  • Action representation. Tokenized 7-DoF actions.

The contribution. Open VLA at competitive capability. Sparked extensive open-source VLA research. Demonstrated that VLA at substantial capability is feasible outside frontier proprietary labs.

The trajectory. OpenVLA, its successors (OpenVLA-OFT, others), and the broader open VLA ecosystem (Octo, RDT-1B, Pi-zero successors) have become the open-source VLA substrate.

π0 and frontier manipulation

A specific high-profile 2024-2025 system. π0 (Pi-zero) (Physical Intelligence, October 2024). “π0: A Vision-Language-Action Flow Model for General Robot Control.”

The architecture. Combines:

  • A vision-language model backbone.

  • A flow-matching (cross-reference Generative Models §7) action head producing continuous-valued actions.

  • Substantial training on diverse manipulation data.
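
A flow-matching action head generates a continuous action by integrating a velocity field from noise toward the action. The sketch below shows only the sampling loop, using Euler integration. The velocity function is a toy stand-in for the learned network: it is the exact velocity of a straight-line path toward one fixed target, whereas a real VLA would predict velocity from images, language, and robot state. All names are hypothetical.

```python
import numpy as np


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * target, the velocity
    at (x_t, t) is (target - x_t) / (1 - t).
    """
    return (target - x) / (1.0 - t)


def sample_action(target, steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                         # t never reaches 1, so no division by zero
        x = x + dt * toy_velocity(x, t, target)
    return x


target_action = np.array([0.1, -0.3, 0.5])  # e.g. a small end-effector displacement
action = sample_action(target_action)
print(action)
```

Unlike the binned representation used by token-based policies, the output here is continuous-valued, and a small fixed number of integration steps keeps inference cost predictable.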

The capabilities. Demonstrated substantial dexterous manipulation: folding laundry, packing groceries, complex multi-step tasks. Capability levels approached deployment-relevant thresholds for the first time in many tasks.

The reception. Substantial industry attention. Physical Intelligence raised substantial funding; multiple competitors emerged (Figure AI, 1X Technologies, others working on humanoid robotic systems).

The 2025-2026 trajectory.

  • π0.5 (Physical Intelligence, 2025) - successor with improved capability.

  • Helix (Figure AI, 2025) - humanoid-robot VLA with substantial dexterity.

  • Gemini Robotics (Google DeepMind, 2025) - VLA extending Gemini.

  • Cosmos (NVIDIA, 2025) - world-model-based approach for robotic learning.

The 2026 state. VLA models have moved from research-stage to early-deployment for specific applications. Substantial commercial activity; substantial unsolved problems.

The data and benchmarks problem

A specific structural concern. VLA training data is scarce relative to text or image data.

The comparison.

  • Text. ~10^13 tokens available.

  • Images. ~10^9 image-text pairs.

  • Video. ~10^5-10^6 hours.

  • Robot trajectories. ~10^6-10^7 episodes (Open X-Embodiment scale); each episode is short (seconds to minutes).

The bottleneck. Real-robot data collection is expensive - requires physical robots, human supervision, time. Synthetic data (simulation) helps but has its own sim-to-real gap concerns.

The approaches.

Cross-embodiment training. Train on data from many robot platforms; transfer across embodiments. Open X-Embodiment is a substantial example.

Simulation pretraining. Use simulated data for pretraining; fine-tune on real-robot data.

Web-data pretraining transfer. Leverage web vision-language pretraining (the RT-2 / OpenVLA pattern) to amortize robot-specific data needs.

Imitation from video. Learn from observational video of humans performing tasks; no robot trajectories required. Active research direction.

The benchmarks problem. Evaluating VLA is hard.

  • Real-robot evaluation. Expensive; per-trial costs in time and setup.

  • Simulation evaluation. Sim-to-real gap; results may not transfer.

  • Standardization. Different labs use different robots; comparisons are non-trivial.

Notable benchmarks.

  • LIBERO (Liu et al., 2023). Standardized robotic manipulation benchmark in simulation.

  • CALVIN (Mees et al., 2022). Language-conditioned manipulation tasks.

  • Open X-Embodiment evaluations. Cross-embodiment benchmark from the collaboration.

  • RoboBench, ManiSkill, RoboCasa. Simulation-based comprehensive benchmarks.

  • Real-robot benchmark suites (per-lab and growing community standards).

The 2026 state. VLA benchmarking is less mature than LLM benchmarking. Comparisons across systems require care; many published results are not directly comparable.

Vision-language-action in computer use

A specific connection. Computer-use agents (Anthropic Computer Use, OpenAI Operator; cross-reference AI Agents §7) are also a form of VLA - they take visual input (screenshots), language instructions, and produce actions (mouse, keyboard).

The differences from robotic VLA.

  • Action space. Discrete UI events (clicks at pixel coordinates, key presses) rather than continuous motor commands.

  • Environment. Computer interfaces (digital, well-defined) rather than physical world (continuous, noisy).

  • Failure modes. Different - UI changes vs physics violations.

The architectural similarity. Both are multimodal models producing action sequences conditional on visual and language input. The same paradigm; different action spaces.

The 2026 state. Computer-use VLA is more mature than robotic VLA - easier action space, easier environments, easier data collection (screen recordings can substitute for action labels).

Where VLA sits in 2026

The summary. VLA has moved from research demonstrations to early commercial deployment in 2024-2026. Frontier systems (π0, Helix, Gemini Robotics) demonstrate substantive manipulation capability. The open-source ecosystem (OpenVLA, successors) supports broader research.

The remaining issues. Data scarcity persists. Benchmark standardization is incomplete. Real-world reliability is uneven. Long-horizon manipulation remains hard. Sim-to-real gaps persist.

The cross-references. The planned Robotics chapter will develop the broader embodied-AI context; this chapter develops the multimodal-model aspect. Cross-reference Generative Models §7 for the flow-matching machinery used in π0 and similar; cross-reference RL for the RL-side of VLA training; cross-reference AI Agents §7 for the computer-use side.

The trajectory. VLA is one of the fastest-improving multimodal categories in 2024-2026. Continued advancement is expected; the destination (reliable general-purpose robotic VLA) is approached but not reached.


§10. Cross-Modal Retrieval

A specific multimodal application with substantial commercial deployment. Cross-modal retrieval searches one modality using queries from another - find images matching a text description, find audio clips matching a description, find videos matching a visual query. This section develops the dominant techniques and applications.

CLIP-style retrieval

The dominant retrieval mechanism. CLIP (§4) and similar contrastive models produce aligned embeddings for paired image-text data; embedding similarity supports cross-modal retrieval.

The mechanism.

   CLIP-STYLE CROSS-MODAL RETRIEVAL

   Setup:
     - Image encoder produces image embeddings.
     - Text encoder produces text embeddings.
     - Both live in shared semantic space.

   Index time:
     For each image in collection:
       compute image embedding.
       Store in vector index (FAISS, Pinecone, etc.).

   Query time:
     Input: text query.
     1. Compute text embedding.
     2. Find top-K images with highest cosine similarity to text embedding.
     3. Return ranked image results.
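
A minimal sketch of the mechanism above, in NumPy. The encoders are not shown: the toy four-dimensional vectors stand in for real image and text embeddings, and `BruteForceIndex` is a hypothetical exact-search class standing in for a vector index such as FAISS. The point is only the normalize-then-dot-product logic.

```python
from typing import List, Tuple

import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


class BruteForceIndex:
    """Exact nearest-neighbour search; a hypothetical stand-in for a vector index."""

    def __init__(self, image_embeddings: np.ndarray, image_ids: List[str]):
        # Index time: normalize once and store.
        self.embeddings = l2_normalize(np.asarray(image_embeddings, dtype=np.float64))
        self.ids = list(image_ids)

    def search(self, text_embedding: np.ndarray, k: int = 3) -> List[Tuple[str, float]]:
        # Query time: cosine similarity against every stored image, then top-K.
        query = l2_normalize(np.asarray(text_embedding, dtype=np.float64)[None, :])[0]
        scores = self.embeddings @ query
        top = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in top]


# Toy embeddings standing in for real encoder outputs (assumption).
image_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],  # pretend embedding of a dog photo
    [0.0, 1.0, 0.0, 0.0],  # pretend embedding of a car photo
    [0.9, 0.1, 0.0, 0.0],  # pretend embedding of a puppy photo
])
index = BruteForceIndex(image_embeddings, ["dog.jpg", "car.jpg", "puppy.jpg"])
query_embedding = np.array([1.0, 0.0, 0.0, 0.0])  # pretend text embedding for "a dog"
results = index.search(query_embedding, k=2)
```

With these toy vectors the query ranks dog.jpg first and puppy.jpg second. A production index would replace the exhaustive dot product with approximate nearest-neighbour search; the scoring rule is unchanged.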

The variants. The same mechanism works for any modality pair with aligned embeddings: text → image, image → text, text → audio (with CLAP), text → video (with VideoCLIP variants), image → product (for e-commerce).

The deployment. CLIP-style retrieval is substantial industrial infrastructure. Used in:

  • Stock-photo search (Adobe Stock, Getty Images, Shutterstock).

  • Image-search in major search engines (Google Images, Bing Images).

  • Product search in e-commerce (Amazon, eBay).

  • Content moderation (find images matching policy descriptions).

  • Personal photo libraries (Google Photos, Apple Photos search).

  • Document retrieval (find images in documents matching text queries).

The 2026 state. CLIP-style retrieval is mature production technology. It continues to be refined, but the fundamental approach is well-established.

Multimodal RAG

A specific application. Multimodal Retrieval-Augmented Generation extends RAG (cross-reference LLM §9; Causality discussions of RAG) to multimodal sources.

The pattern.

   MULTIMODAL RAG (illustrative)

   User query (text): "What does the cooling system in a Tesla
                        Model S look like and how does it work?"

   Step 1: Retrieve relevant content from multimodal corpus.
     - Text documents about Tesla cooling systems.
     - Diagrams and images of the cooling system.
     - Possibly videos demonstrating it.

   Step 2: Construct context for generation.
     - Include retrieved text passages.
     - Include retrieved images (as image tokens).
     - Possibly include video frames.

   Step 3: Generate response.
     - Multimodal LLM processes retrieved context.
     - Produces text response grounded in retrieved content.
     - May include retrieved images in the response.
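
A sketch of the three steps, under stated assumptions. `Item`, `retrieve`, `build_context`, and `answer` are hypothetical names; the message-part dictionaries are an illustrative layout, not any particular vendor's API; and `generate` is a stub for whatever multimodal LLM call a real system would make. Embeddings are assumed to live in a shared text-image space.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Item:
    modality: str          # "text" or "image"
    content: str           # passage text, or an image path
    embedding: np.ndarray  # vector in a shared text-image space (assumed)


def retrieve(query_emb: np.ndarray, corpus: List[Item], k: int) -> List[Item]:
    """Step 1: rank all items by cosine similarity to the query embedding."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return sorted(corpus, key=lambda it: cos(query_emb, it.embedding), reverse=True)[:k]


def build_context(query: str, items: List[Item]) -> List[dict]:
    """Step 2: interleave retrieved text and images into one list of parts."""
    parts = [{"type": "text", "text": "Answer using only the sources below."}]
    for i, it in enumerate(items, start=1):
        if it.modality == "image":
            parts.append({"type": "image", "source": it.content, "label": f"[{i}]"})
        else:
            parts.append({"type": "text", "text": f"[{i}] {it.content}"})
    parts.append({"type": "text", "text": f"Question: {query}"})
    return parts


def answer(query: str, query_emb: np.ndarray, corpus: List[Item],
           generate: Callable[[List[dict]], str], k: int = 2) -> str:
    """Step 3: hand the assembled context to a multimodal LLM (stubbed here)."""
    return generate(build_context(query, retrieve(query_emb, corpus, k)))


corpus = [
    Item("text", "The coolant loop passes through the battery pack.", np.array([1.0, 0.0])),
    Item("image", "diagrams/cooling_loop.png", np.array([0.9, 0.1])),
    Item("text", "Tire pressure should be checked monthly.", np.array([0.0, 1.0])),
]
query_emb = np.array([1.0, 0.0])  # pretend embedding of the user query
context = build_context("How does the cooling system work?", retrieve(query_emb, corpus, k=2))
reply = answer("How does the cooling system work?", query_emb, corpus,
               generate=lambda parts: f"{len(parts)} parts")
```

The numbered labels give the generator something to cite, which is one simple way to keep the response traceable to the retrieved sources.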

The applications.

  • Technical documentation Q&A. Mix of text manuals and diagrams.

  • Medical-information retrieval. Text references plus medical imaging.

  • Educational content. Lessons that mix text and figures.

  • Customer support. Retrieving from manuals that include text and images.

The 2026 state. Multimodal RAG is emerging production technology. Several commercial systems (Notion Q&A, Glean, Perplexity multimodal mode) integrate it. The infrastructure is maturing.

Cross-modal search applications

A broader category. Cross-modal search supports many specific applications.

Reverse image search. Find images visually similar to a query image. Foundational since the pre-deep-learning era; CLIP-style models substantially improved it.

Visual product search. “Show me products that look like this.” Used in e-commerce; integrates with shopping.

Music similarity search. “Find songs similar to this audio clip.” CLAP-style models.

Code search by description. Search code repositories by natural-language description rather than keyword. Used in GitHub Code Search and similar.

Document layout search. Find documents with similar visual structure (layout, formatting).

3D object search. Find 3D models similar to a query (text, image, or 3D). Specialized but growing.

Scientific literature search. Find papers with similar figures, diagrams, or data visualizations. Specialized academic application.

Hybrid retrieval

A specific methodological pattern. Hybrid retrieval combines multiple retrieval signals.

The pattern.

  • Sparse retrieval (BM25-style keyword matching) for exact terms.

  • Dense retrieval (embedding similarity) for semantic match.

  • Cross-modal retrieval (CLIP-style) for image queries.

  • Re-ranking (with a more expensive cross-encoder) for top results.

Combine signals via:

  • Score fusion (weighted combinations of similarity scores).

  • Reciprocal rank fusion.

  • Learned ranking models.
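
Two of the fusion rules above can be sketched briefly. The function names and document ids are illustrative; `k = 60` is a commonly used default for reciprocal rank fusion rather than a value taken from this chapter; and min-max normalization is one assumed way of making scores from different retrievers comparable before weighting.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def weighted_score_fusion(score_lists: List[Dict[str, float]],
                          weights: List[float]) -> List[str]:
    """Min-max normalize each retriever's scores, then take a weighted sum."""
    fused: Dict[str, float] = defaultdict(float)
    for scores, w in zip(score_lists, weights):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores match
        for doc_id, s in scores.items():
            fused[doc_id] += w * (s - lo) / span
    return sorted(fused, key=lambda d: (-fused[d], d))


# Rank-based fusion of sparse, dense, and CLIP-style result lists.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]
clip_ranking = ["doc_b", "doc_a", "doc_e"]
fused_by_rank = reciprocal_rank_fusion([bm25_ranking, dense_ranking, clip_ranking])

# Score-based fusion of two retrievers with equal weights.
bm25_scores = {"doc_a": 12.0, "doc_b": 8.0, "doc_c": 4.0}
dense_scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.5}
fused_by_score = weighted_score_fusion([bm25_scores, dense_scores], [0.5, 0.5])
```

Rank fusion ignores score magnitudes, so it needs no calibration across retrievers; score fusion keeps magnitudes but depends on the normalization and weights chosen.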

The 2026 production practice. Hybrid retrieval is standard in serious deployments. Pure-embedding retrieval is often the baseline; hybrid systems typically outperform it.

Cross-modal embedding spaces

A more general perspective. CLIP-style training produces a unified embedding space spanning text and images. The same idea generalizes.

  • Tri-modal embeddings. Text + image + audio in one space (CLAP-style extensions). Systems like ImageBind (Meta, 2023) go further and demonstrate six-modality embedding spaces.

  • Cross-lingual cross-modal. Embedding spaces spanning multiple languages and multiple modalities. Multilingual CLIP variants.

  • Embedding for retrieval and downstream tasks. Embeddings used both for retrieval and as features for downstream prediction.

The 2026 state. Unified multi-modality embedding spaces see substantial research activity and moderate deployment. The applications are clear; the engineering is maturing.

Where cross-modal retrieval sits in 2026

The summary. Cross-modal retrieval is mature production technology for the dominant cases (text-image retrieval, reverse image search). Multimodal RAG is emerging. Specialized cross-modal search applications (audio, code, 3D) have niches.

The infrastructure. Vector databases (Pinecone, Weaviate, Chroma, FAISS, pgvector), embedding-API providers, and retrieval-orchestration frameworks support the production ecosystem.

The trajectory. Continued capability advancement; expanded multimodal coverage; declining cost. The 2026 state is mature; further refinement rather than paradigm shifts is expected.

The next section develops evaluation of multimodal models - the methodology specific to multimodal capability assessment.


§11. Evaluating Multimodal Models

The methodological challenges specific to multimodal evaluation. Cross-reference the Evaluation chapter for the cross-cutting methodology; this section develops the multimodal-specific aspects.

Vision-language benchmarks

The dominant category. Most multimodal evaluation centres on vision-language tasks.

MMMU (Yue et al., 2024). “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark.” College-level questions across many subjects requiring multimodal reasoning. Frontier model performance ~60-75% by 2026; substantial but below human-expert level.

MathVista (Lu et al., 2024). Mathematical reasoning over diagrams, charts, figures. Frontier model performance ~70-85% by 2026.

ChartQA (Masry et al., 2022). Question answering over charts and plots. Frontier model performance ~80%+ by 2026.

DocVQA (Mathew et al., 2021). Document visual question answering. Production-grade for many documents; degrades on complex layouts.

AI2D (Kembhavi et al., 2016). Diagram understanding. Foundational benchmark; still relevant for diagram-specific evaluation.

MM-Vet (Yu et al., 2023). Integrated capabilities - recognition + OCR + spatial reasoning + math + language generation in unified evaluation.

RealWorldQA, MM-Vet v2, MathVerse, MMMU-Pro. Successor benchmarks with explicit contamination-resistance and increased difficulty.

The pattern. Vision-language benchmarks span recognition, OCR, reasoning, math, document understanding, chart interpretation. Frontier models perform well on many; capability is uneven across categories.

Vision-language evaluation challenges

A specific category of methodological concerns.

Visual grounding. Evaluating whether a model’s response is actually grounded in the image (vs hallucinated). Hard to measure automatically; often requires human review.

Spatial-reasoning evaluation. Standard benchmarks measure aggregate accuracy; spatial-reasoning failures are often qualitatively distinct (off-by-direction; wrong reference frame). Specialized benchmarks (Visual Spatial Reasoning) target this.

Hallucination evaluation. Specifically measure how often models describe content not present in images. Benchmarks like POPE (Polling-based Object Probing Evaluation; Li et al., 2023) target this.

Multi-image reasoning. Tasks requiring reasoning across multiple images. Standardized evaluation is emerging; benchmarks like BLINK (Fu et al., 2024) push on this.

Following image references in dialogue. “What was in the previous image?” Evaluating multi-turn multimodal context is harder than single-turn.

OCR accuracy under various conditions. Standard OCR benchmarks measure performance on clean text; real-world deployment encounters skewed, low-resolution, handwritten, multilingual text.

Video evaluation

A specific category with distinctive challenges.

Short-video benchmarks. MVBench (Li et al., 2023). Video-MME (Fu et al., 2024). Standard for general video understanding evaluation.

Long-video benchmarks. MLVU (Zhou et al., 2024) - multi-level long-video understanding. EgoSchema (Mangalam et al., 2023) - long-form egocentric video.

Temporal-reasoning benchmarks. TempCompass (Liu et al., 2024) - evaluates temporal reasoning specifically. Tests action ordering, cause-and-effect, temporal localization.

Generation benchmarks. For text-to-video: FVD (Fréchet Video Distance), human preference (analogues of Chatbot Arena for video), specific generation challenges.

The challenges.

  • Cost. Each evaluation requires substantial video processing; total benchmark cost is high.

  • Variability. Video benchmarks have substantial variability across runs; reproducibility is harder than for static benchmarks.

  • Long-video scarcity. Few well-curated long-video benchmarks; the long-video evaluation infrastructure is younger.

  • Ground-truth ambiguity. What’s the “right” answer to “what happened in this 10-minute video?” Substantive disagreement among human annotators.

Audio evaluation

Specific patterns for audio.

Speech-to-text. Word Error Rate (WER) is the standard metric. Whisper, competitors, and academic systems are evaluated on canonical speech corpora (LibriSpeech, Common Voice, etc.). Production-grade for many tasks.
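
A minimal sketch of the WER computation: word-level edit distance divided by the reference length. The function name is illustrative, and text normalization here is just lowercasing and whitespace splitting, which is an assumption; published evaluations typically apply more careful normalization (punctuation, numerals) before scoring.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of row i, prev[j] is the edit distance between
    # ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or free match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count against a fixed reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.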

Text-to-speech. Quality is subjective; pairwise human preference evaluation dominates. MOS (Mean Opinion Score) on a 1-5 scale is the historical metric; recent work uses pairwise preference in the style of Chatbot Arena.

Music generation. Quality is similarly subjective. FAD (Fréchet Audio Distance) provides a quantitative metric; pairwise human preference is the gold standard.

General audio understanding. Less mature; ad-hoc per-task evaluations.

Voice cloning quality. Speaker-similarity metrics; human evaluation for naturalness. Increasingly important as voice cloning becomes a deployed capability.

VLA evaluation

Specific challenges for vision-language-action models.

Real-robot evaluation. The gold standard - actually deploy the policy on a robot; measure success rate. Expensive; time-intensive; environment-specific.

Simulation evaluation. Cheaper but has sim-to-real gap concerns. LIBERO, CALVIN, RoboCasa provide standardized simulators.

Cross-embodiment evaluation. Different robots have different action spaces; standardized evaluation across embodiments is hard.

Long-horizon manipulation. Tasks requiring many sequential actions (folding laundry, packing groceries); evaluation requires substantial setup per trial.

Reproducibility. Real-robot evaluation is hard to reproduce across labs; even simulated evaluation varies with simulator version and hardware.

The reference-free quality problem

A specific instance of the broader problem (Evaluation §3, §6). Many multimodal outputs are open-ended - generated images, generated audio, generated video - and have no clear “reference” to compare against.

The patterns (from Evaluation §6 with multimodal specificity).

Pairwise human preference. Show humans two outputs (e.g., two generated images for the same prompt); pick which is better. The dominant evaluation pattern for generative multimodal.

LMSYS-style arenas. Chatbot Arena Vision compares VLM outputs; LMSYS Image Arena compares text-to-image outputs. Production-grade evaluation infrastructure.
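
One simple way to turn pairwise preferences into a ranking is an online Elo update, sketched below. This is an illustrative stand-in, not a description of how any particular arena computes its leaderboard (a Bradley-Terry fit over all battles is a common alternative). The function name, the K-factor of 32, and the base rating of 1000 are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def elo_ratings(
    battles: Iterable[Tuple[str, str, float]],
    k: float = 32.0,
    base: float = 1000.0,
) -> Dict[str, float]:
    """Online Elo over (model_a, model_b, outcome) records.

    outcome is 1.0 if A's output is preferred, 0.0 if B's is, 0.5 for a tie.
    """
    ratings: Dict[str, float] = defaultdict(lambda: base)
    for a, b, outcome in battles:
        # Expected score of A under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        delta = k * (outcome - expected_a)
        ratings[a] += delta  # zero-sum: whatever A gains, B loses
        ratings[b] -= delta
    return dict(ratings)


# Three human votes for model_x's image, one for model_y's.
battles = [("model_x", "model_y", 1.0)] * 3 + [("model_x", "model_y", 0.0)]
ratings = elo_ratings(battles)
```

Online Elo depends on the order of battles; that order sensitivity is one reason batch fits over the full preference set are often preferred for published rankings.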

LLM-as-judge with vision. Use a vision-capable LLM as judge for vision outputs. Inherits the biases of LLM-as-judge (Evaluation §7); additionally has visual judgment biases.

Specialized quality metrics. FID for images (Generative Models §10); FVD for videos; FAD for audio. Each captures specific quality dimensions imperfectly.
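
FID, FVD, and FAD share one formula: fit a Gaussian to feature vectors from real samples and another to features from generated samples, then compute the Fréchet distance between the two Gaussians. A NumPy sketch follows. The feature extractor (an image, video, or audio network, depending on the metric) is not shown, and the random features are placeholders, so the values here are not real FID scores.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Squared Fréchet distance between Gaussians fitted to two feature sets.

    d^2 = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    Inputs have shape (num_samples, feature_dim).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))
    # Tr((S_r S_g)^(1/2)) equals the sum of square roots of the eigenvalues of
    # S_r S_g, which are real and non-negative for PSD covariances (up to
    # numerical noise, hence the real part and the clip).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))                 # placeholder "real" features
shifted = real + np.array([3.0, 0.0, 0.0, 0.0])  # same spread, shifted mean
same_distance = frechet_distance(real, real)      # close to 0
shift_distance = frechet_distance(real, shifted)  # close to 9, the squared mean shift
```

The metric sees only the first two moments of the feature distribution. That is one concrete sense in which it captures quality imperfectly: two sample sets with matching means and covariances score as identical even if individual samples differ visibly.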

Human expert evaluation. For specialized domains (medical imaging quality, scientific visualization), domain experts evaluate.

Cross-modal hallucination evaluation

A specific safety-relevant evaluation category. Cross-modal hallucinations occur when models describe content not present in input. Multiple benchmarks specifically target hallucination.

  • POPE (Polling-based Object Probing Evaluation; Li et al., 2023). Object-existence questions on images.

  • HallusionBench (Guan et al., 2024). Vision-language hallucination evaluation across visual illusions, errors, mathematical content.

  • MMHalBench, AMBER. Specialized hallucination benchmarks.

  • CHAIR (Caption Hallucination Assessment with Image Relevance). Image-captioning-specific hallucination metric.
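
The scoring behind POPE and CHAIR is simple once the model outputs are parsed. The sketch below assumes POPE answers have already been mapped to booleans ("yes" is True) and that the objects mentioned in each caption have already been extracted and matched to the image's ground-truth object set. The extraction step is the hard part and is not shown; function names are illustrative.

```python
from typing import Dict, List, Set


def pope_metrics(predictions: List[bool], labels: List[bool]) -> Dict[str, float]:
    """Score yes/no object-existence answers; "yes" (True) is the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p and y for p, y in pairs)
    fp = sum(p and not y for p, y in pairs)
    fn = sum(not p and y for p, y in pairs)
    tn = sum(not p and not y for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    n = len(pairs)
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / n,  # a high value signals a bias toward "yes"
    }


def chair_scores(mentioned: List[Set[str]], present: List[Set[str]]) -> Dict[str, float]:
    """CHAIR_i: hallucinated object mentions / all object mentions.
    CHAIR_s: captions with at least one hallucinated object / all captions."""
    hallucinated = [m - p for m, p in zip(mentioned, present)]
    total_mentions = sum(len(m) for m in mentioned)
    chair_i = sum(len(h) for h in hallucinated) / total_mentions if total_mentions else 0.0
    chair_s = sum(1 for h in hallucinated if h) / len(mentioned)
    return {"chair_i": chair_i, "chair_s": chair_s}


# The model says "yes" to one absent object (a false positive).
pope = pope_metrics([True, True, True, False], [True, False, True, False])

# The first caption mentions a frisbee that is not in the image.
chair = chair_scores(
    mentioned=[{"dog", "frisbee"}, {"cat"}],
    present=[{"dog"}, {"cat", "sofa"}],
)
```

The yes-ratio is worth reporting alongside accuracy: a model that answers "yes" to nearly everything can score well on recall while hallucinating freely.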

The findings. Modern frontier VLMs hallucinate less than 2023 baselines but still hallucinate substantially. Hallucination rates of 10-30% on various tasks are common; specialized benchmarks reveal substantive failure modes.

Where multimodal evaluation sits in 2026

The summary. Multimodal evaluation is substantial but uneven. Vision-language has the most-mature benchmark ecosystem. Video, audio, and VLA evaluation are less mature, each with distinctive challenges. Hallucination evaluation is increasingly central.

The remaining issues. Reference-free quality measurement requires human evaluation, which has cost and scalability constraints. Long-context multimodal evaluation is expensive. Cross-embodiment VLA evaluation lacks standardization.

Cross-references. Evaluation chapter develops the cross-cutting methodology; this section develops multimodal-specific aspects. AI Agents §10 covers agent evaluation including computer-use agents. Generative Models §10 covers generative-model evaluation including image, video, audio generation.


§12. Connections to Other Chapters

This chapter consolidates multimodal-specific content that overlaps with many other chapters. The cross-references follow.

  • Self-Supervised Learning §7 covers cross-modal SSL pretraining (CLIP and successors at the training-objective level). This chapter develops the modeling aspects on top of that pretraining substrate.

  • Generative Models §9 covers modality-specific generation (text-to-image, video, audio, 3D, molecules) in detail. This chapter (§5) provides the multimodal-context overview.

  • Foundation Models provides the FM-as-substrate framing. Multimodal FMs are FMs; the broader pretraining/adaptation framing applies.

  • Large Language Models §3 develops the Transformer architecture that multimodal models extend. LLM §8 develops tool use; multimodal models often combine multimodal input with tool use.

  • AI Agents §7 develops computer-use agents - a specific multimodal-VLA application. §9 of this chapter discusses VLA more broadly.

  • Reinforcement Learning §7 develops continuous control and §12 develops reasoning-RL; both connect to VLA training. RL §10 develops RLHF which applies to multimodal models with adapted preference data.

  • Robotics (planned chapter) develops embodied AI in domain depth. This chapter §9 covers VLA as a multimodal pattern; the planned Robotics chapter develops the broader robotics context.

  • Evaluation §10 develops cross-cutting evaluation; this chapter §11 develops multimodal-specific evaluation.

  • Alignment §7, §11 develop dangerous-capability evaluations and agentic safety; multimodal capabilities introduce specific safety concerns (deepfakes, computer-use risks).

  • Deep Learning §4-§6 develop the architectural primitives (CNNs, Transformers, attention) that multimodal models combine.

  • Mechanistic Interpretability §10 develops MI for safety; multimodal MI is an emerging research direction.

  • Causality §11 has a minor connection (causal reasoning across modalities is a niche but interesting research direction).

  • AI for Science §3-§9 develops scientific multimodal applications (AlphaFold structure, microscopy, climate). Cross-modal scientific applications are a substantial growth area.


§13. Critiques and Alternative Perspectives

This section presents critiques of multimodal models as substantive intellectual positions.

“Multimodality is mostly cross-modal retrieval”

A specific sceptical position. The argument: most of what “multimodal” does in practice is cross-modal retrieval (find images matching text; find products matching descriptions). The more ambitious claims - joint multimodal reasoning, cross-modal understanding - are oversold relative to what’s actually deployed.

The pushback. The retrieval applications are substantial commercial reality but not the only use. GPT-4o’s voice interactions; Sora-style video generation; VLA robot control - none of these are “just retrieval.” The capability range is broader than the critique acknowledges.

The chapter’s position. Cross-modal retrieval is the most-deployed multimodal application; deeper multimodal capabilities exist and matter; both should be acknowledged.

“Native multimodality is unnecessary overhead”

A more substantive critique. The argument: late-fusion models (LLaVA-style: vision encoder + LM + projection) achieve substantial capability at much lower training cost. Native multimodal training is expensive and produces marginal capability gains over well-tuned late fusion. Frontier labs invest in native multimodality partly because they can afford it, not because it’s necessarily the best architectural choice.

The pushback. Native multimodal does provide real capability advantages on specific tasks (real-time audio conversation; tight cross-modal reasoning; mixed-modal output generation). The cost is real but the benefit is also real.

The chapter’s position. The architectural choice depends on goals and budget. For most open-source applications, late fusion is appropriate. For frontier proprietary models pursuing maximum multimodal capability, native is justified.

“Vision-language-action is dominated by data scarcity”

A specific critique. The argument: VLA progress in 2024-2026 is impressive but heavily constrained by robot-data scarcity. Until robotic data collection becomes substantially cheaper, VLA will remain narrow and brittle. The recent advances may saturate quickly.

The pushback. Cross-embodiment training, web-data transfer, simulation pretraining, and video-imitation approaches are reducing data dependence. The trajectory in 2024-2026 has continued advancing; sustained progress is plausible.

The chapter’s position. The data-scarcity concern is real. VLA progress depends partly on whether data-efficiency methods continue advancing. Honest forecasting acknowledges substantial uncertainty.

The interpretability gap

A specific concern. Multimodal models are harder to interpret than text-only models. Vision features are high-dimensional and continuous; cross-modal attention patterns are complex; native multimodal models have less mature interpretability tools (cross-reference Mechanistic Interpretability §10 OP-MI-7).

The chapter’s position. The interpretability gap is real. Multimodal MI is an underdeveloped area. Whether multimodal-specific MI matures quickly enough to support frontier multimodal alignment is uncertain.

Energy and compute concerns

A practical concern. Multimodal models are substantially more expensive than text-only models - both training (multimodal data is large; training compute scales) and inference (image / audio / video processing has substantial per-query cost).

The environmental implication. Multimodal AI consumes more energy than text-only AI per query. As multimodal deployment grows, the energy footprint grows commensurately.

The economic implication. Cost-per-query for multimodal is higher; deployment economics depend on whether users will pay the premium for multimodal capability.

The chapter’s position. The concerns are real. Efficiency improvements (better tokenization, sparse attention, distillation) help but don’t eliminate the gap. Cost-aware multimodal deployment is increasingly standard.

Multimodal hallucinations

A specific failure-mode critique. Multimodal hallucinations (describing content not in images; inventing details in videos; making up content for audio) are systematic failure modes. The rate is substantially nonzero for frontier systems.

The implications for deployment.

  • Medical applications cannot tolerate hallucinated diagnoses.

  • Legal applications cannot tolerate hallucinated case content.

  • Educational applications cannot tolerate hallucinated facts.

  • Many high-stakes applications require human review of multimodal AI outputs.

The chapter’s position. Hallucinations are a substantial deployment limitation for multimodal models. Mitigation (specific training; retrieval grounding; constrained generation; explicit uncertainty) helps but doesn’t eliminate them. OP-MM-5 captures this.


§14. Limitations and Open Problems

Consolidated open-problems list. Each carries an OP-MM-N identifier.

  • OP-MM-1. Native multimodal training at frontier scale. Native multimodal training is expensive and largely confined to frontier proprietary labs. Whether the capability advantage justifies the cost in production deployment is contested. Whether open-source ecosystems can produce competitive native multimodal models is open.

  • OP-MM-2. Video data scarcity. Video data is information-rich but underrepresented relative to text and images. Effective video pretraining at the scale needed for frontier capability requires either more data or more data-efficient methods.

  • OP-MM-3. Vision-language-action data scarcity. Real-robot data is expensive; substantial dependence on simulation, cross-embodiment transfer, and web-pretraining transfer. Whether VLA can match LLM-level capability on robotic control with current data approaches is open.

  • OP-MM-4. Long-context multimodal. Multimodal context is information-dense; long-multimodal-context capability requires substantial compute and engineering. Hour-long video reasoning, multi-document multimodal analysis, persistent multimodal context across long interactions - all are emerging frontiers.

  • OP-MM-5. Cross-modal hallucinations. Multimodal hallucinations persist as systematic failure modes. Specific mitigation (grounded generation; retrieval; constrained generation) reduces but does not eliminate them. Adversarial robustness of multimodal models is partial.

  • OP-MM-6. Multimodal evaluation methodology. Many evaluation categories (long video, audio understanding, VLA, mixed-modal generation, cross-modal reasoning) lack mature standardized methodology. Reference-free evaluation is expensive; LLM-as-judge has biases (cross-reference Evaluation §7).

  • OP-MM-7. Audio understanding at frontier capability. Speech-to-text is mature; general audio understanding (music, environment, acoustic patterns) is less developed. Whether audio LM capability can reach frontier-text-LM level is open.

  • OP-MM-8. Mixed-modal generation quality. Generating interleaved text and images is increasingly possible but quality is below specialized text-to-image. Whether unified generation can match specialized generation is open. Cross-reference Generative Models OP-GM-N for the underlying generative-modeling open problems.

  • OP-MM-9. Multimodal MI. Mechanistic interpretability for multimodal models is underdeveloped relative to text-only MI. Understanding cross-modal computations in detail is open.

  • OP-MM-10. Multimodal safety. Multimodal capabilities introduce specific safety concerns (deepfakes; voice cloning; computer-use risks; jailbreaks via multimodal injection). Multimodal-specific safety methodology is less mature than text-specific.


§15. Further Reading

Opinionated annotated list.

Foundational multimodal

  • Radford, A., et al. (OpenAI 2021). “Learning Transferable Visual Models From Natural Language Supervision.” CLIP.

  • Jia, C., et al. (Google 2021). “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision.” ALIGN.

  • Li, J., et al. (Salesforce 2022). “BLIP.”

  • Li, J., et al. (Salesforce 2023). “BLIP-2.”

  • Alayrac, J.-B., et al. (DeepMind 2022). “Flamingo.”

  • Liu, H., et al. (2023). “Visual Instruction Tuning.” LLaVA.

Native multimodal frontier

  • OpenAI (2023). “GPT-4V(ision) System Card.”

  • OpenAI (2024). “Hello GPT-4o.” Announcement and capabilities.

  • Google DeepMind (2023). “Gemini: A Family of Highly Capable Multimodal Models.”

  • Chameleon Team / Meta AI (2024). “Chameleon: Mixed-Modal Early-Fusion Foundation Models.”

Audio and speech

  • Radford, A., et al. (OpenAI 2022). “Robust Speech Recognition via Large-Scale Weak Supervision.” Whisper.

  • Borsos, Z., et al. (Google 2022). “AudioLM.”

  • Agostinelli, A., et al. (Google 2023). “MusicLM.”

  • Wang, C., et al. (Microsoft 2023). “Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.” VALL-E.

Video

  • Yu, J., et al. (Google 2022). “Scaling Autoregressive Models for Content-Rich Text-to-Image Generation.” Parti.

  • Brooks, T., et al. (OpenAI 2024). “Video Generation Models as World Simulators.” Sora technical report.

  • Reid, M., et al. (Google DeepMind 2024). “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.”

Vision-language-action

  • Brohan, A., et al. (Google DeepMind 2022). “RT-1.”

  • Brohan, A., et al. (Google DeepMind 2023). “RT-2: Vision-Language-Action Models.”

  • Driess, D., et al. (Google DeepMind 2023). “PaLM-E: An Embodied Multimodal Language Model.”

  • Kim, M. J., et al. (2024). “OpenVLA.”

  • Black, K., et al. (Physical Intelligence 2024). “π0: A Vision-Language-Action Flow Model for General Robot Control.”

Evaluation

  • Yue, X., et al. (2024). “MMMU.”

  • Lu, P., et al. (2024). “MathVista.”

  • Li, Y., et al. (2023). “POPE: Polling-based Object Probing Evaluation for Object Hallucination.”

  • Fu, C., et al. (2024). “Video-MME.”

Reading-order recommendation

For someone new to multimodal models: start with CLIP (Radford et al. 2021) for the foundational contrastive approach; then Flamingo (Alayrac et al. 2022) for vision-language with an LLM backbone; then LLaVA (Liu et al. 2023) for the open-source pattern; then the GPT-4o announcement and the Gemini paper for the native multimodal frontier. For VLA: RT-2 (2023) and π0 (2024). For evaluation: MMMU, MathVista, and POPE.


§16. Exercises and Experiments

Research-style exercises developing multimodal-modeling skills.

  • E1. Train a small CLIP-style model. Implement a small dual-encoder contrastive model. Train on a moderate-sized image-text dataset (e.g., subset of LAION). Evaluate zero-shot classification on a small image dataset. Reflect on contrastive-learning dynamics.

  • E2. Evaluate a frontier VLM. Pick a frontier vision-language model (GPT-4o, Claude 3.5+, Gemini, or open variant like LLaVA-NeXT). Evaluate on a public benchmark (MMMU subset; MathVista; POPE). Analyze patterns of success and failure.

  • E3. Build a small multimodal RAG system. Build a system that indexes a corpus of text-and-image content (e.g., technical documentation); given a query, retrieves relevant text and images; generates a grounded response with citations.

  • E4. Implement a basic VLA pipeline. Using simulation (e.g., Gymnasium or specialized robotics simulator), set up a small VLA pipeline. Use a pretrained vision-language model with a simple action head. Train on a small task. Evaluate; reflect on data efficiency.

  • E5. Compare native vs late-fusion architectures. For a specific task, implement both a late-fusion (CLIP + LM via projection) and a more native (joint pretraining-style) architecture. Compare capability and training cost.

  • E6. Multimodal hallucination analysis. Take a frontier VLM. Construct adversarial inputs designed to elicit hallucinations (images with subtle features; ambiguous content; missing content). Document hallucination patterns; reflect on mitigation strategies.

  • E7. Audio-vision integration. Build a system that takes both audio and video input; produces grounded responses about what’s happening. Evaluate on a small video-with-audio dataset.

  • E8. Cross-modal retrieval system. Build a CLIP-style cross-modal retrieval system over a substantial corpus (e.g., 10K images). Evaluate retrieval quality. Compare with hybrid (CLIP + BM25) retrieval.

  • E9. Voice cloning ethics experiment. Implement (or use) a voice-cloning system. Clone your own voice from a short reference. Reflect on the technical capability, ethical considerations, and detection challenges.

  • E10. Long-video understanding. Take a long-context multimodal model (e.g., Gemini 1.5 Pro). Test on a 30+ minute video. Ask comprehension questions; analyze success and failure patterns. Reflect on the long-multimodal-context capability frontier.
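
A starting point for E1. The symmetric contrastive objective at the core of a CLIP-style dual encoder can be sketched in NumPy as below, assuming a batch in which row i of the image embeddings is paired with row i of the text embeddings. The encoders, optimizer, and data loading are left to the exercise. The fixed temperature of 0.07 is an assumption made for simplicity; it can instead be treated as a learnable parameter.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over the image-text similarity matrix."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N); matched pairs lie on the diagonal
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text picks its image
    return float((loss_i2t + loss_t2i) / 2.0)


# Perfectly aligned pairs give a loss near zero; mismatched pairs a large one.
eye = np.eye(4)
aligned_loss = clip_contrastive_loss(eye, eye)
shuffled_loss = clip_contrastive_loss(eye, np.roll(eye, 1, axis=0))
```

Every other item in the batch acts as a negative, so batch size directly controls how hard the task is; that is a useful dynamic to examine when reflecting on training behaviour in E1.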


AI: A Living Reference by Fuzue. Content licensed under CC BY-SA 4.0 - share, adapt, and build on it; keep the attribution and the open licence on derivatives.