Multimodal Models

Sections §1–§6 are in draft status. Remaining sections (§7 onward) are still in outline form. Open problems are flagged inline and consolidated in §14.

This chapter is new, with no direct AIMA 4e antecedent. AIMA’s treatment of perception (Ch 24-25) covered classical computer vision and speech processing as separate topics; the modern multimodal model unifies these (and more) into single architectures that handle text, images, audio, video, code, structured data, and (in some cases) physical action. The shift is qualitative: a single foundation model handles cross-modal queries (“describe this image”, “convert this audio to text”, “generate code for this diagram”, “translate the speech in this video”) that previously required separate specialized systems.

The chapter consolidates multimodal-specific material that appears in many other chapters: cross-modal SSL (SSL §7), CLIP and contrastive multimodal training (SSL §5, Generative Models §3), multimodal generative models (Generative Models §9), computer-use agents (AI Agents §7), and the perception-side of robotic systems. This chapter develops the unified treatment.

The chapter assumes the Foundation Models chapter, the Large Language Models chapter (especially LLM §3 for Transformer architecture and §8 for tool use), the Self-Supervised Learning chapter (especially §7 for cross-modal SSL), and the Generative Models chapter for modality-specific generation.

Scope and What This Chapter Is About

The chapter develops multimodal models — neural networks that process and/or generate content across multiple modalities (text, images, audio, video, code, structured data, action). We cover the conceptual framing (cross-modal vs unimodal models; alignment vs fusion; native multimodality vs late fusion), the dominant architectures (CLIP-style contrastive models; encoder-fusion patterns; decoder-only multimodal Transformers; diffusion-based multimodal generators), the modality-specific patterns (vision-language, audio-language, video, vision-language-action), the training methodologies (multimodal pretraining, contrastive learning at scale, native interleaved training), the evaluation methodology, and the production landscape.

Approximate length target: 18,000–25,000 words (a major chapter — multimodal models are central to modern AI deployment).

§1. Motivation and Scope

Three worked instances

To anchor the chapter, three concrete instances spanning the modern multimodal landscape.

Instance 1: GPT-4o describes an image. A user uploads a photograph of a kitchen and asks “What’s wrong with this kitchen layout?”. GPT-4o (OpenAI, May 2024) — a native multimodal model — receives the image and the text query together, processes both jointly, and produces a text response identifying issues: “The refrigerator is positioned too far from the prep area; the sink-to-stove-to-fridge work triangle is broken; the island blocks workflow between the cooking and serving areas.” The response is grounded in the specific image, not generic kitchen advice.

Instance 2: Whisper transcribes a multilingual meeting. A user provides an audio recording of a meeting with speakers in English, Spanish, and Mandarin. Whisper (OpenAI, 2022), a speech-to-text model trained on roughly 680,000 hours of multilingual and multitask audio, produces a timestamped transcript, identifies the spoken language, and can translate the speech into English on request. Speaker turns are not part of Whisper’s own output; attributing utterances to speakers requires a separate diarization step. The model handles many languages without explicit per-language configuration.

Instance 3: π0 controls a robot arm. A user places a basket of laundry and a folding table in front of a robot arm controlled by π0 (Physical Intelligence, 2024-2025) and says “fold this laundry.” The model takes camera images of the scene, processes them jointly with the verbal instruction, and produces a sequence of motor commands. The robot picks up shirts, folds them, and stacks them. The model, a vision-language-action (VLA) model, bridges perception (vision), language understanding (instruction), and physical action (motor control).

These three instances span the multimodal landscape: vision-language understanding (GPT-4o), audio-language (Whisper), and vision-language-action (π0). They share architectural patterns (large Transformer models processing tokenized inputs from multiple modalities); they differ in modality combinations, training data, and deployment.

What multimodal models are

A working definition. Multimodal models are neural networks that process and/or generate content across more than one modality — text, images, audio, video, code, structured data, action commands, sensor readings.

The crucial property: the modalities are processed jointly, not by separate specialized models stitched together. A multimodal model can answer questions about an image because the image and the question are processed in the same architecture; the answer reflects joint understanding, not pipeline composition.

The contrast with classical computer vision. Pre-deep-learning systems for “image understanding plus text generation” used separate components: a vision system extracted features; an NLP system generated text. The systems were trained separately; interfaces were narrow. Modern multimodal models are single architectures with joint training.

The crucial enabler. The success of multimodal models depends on unifying representations across modalities. The dominant approach in 2026: tokenize all modalities into discrete (or continuous-but-structured) tokens; process the unified token sequence with a Transformer. The model learns cross-modal relationships through joint pretraining on multimodal data.

What multimodal models are not

Several boundaries.

Multimodal is not just “text plus images.” Audio, video, code, structured data, sensor data, action commands are all relevant modalities. The “vision-language model” subset has gotten most attention but is one part of a broader landscape.

Multimodal is not the same as multi-task. A model that does many tasks within a single modality (e.g., translation, summarization, classification — all text) is not multimodal. The defining feature is cross-modality, not task diversity.

Multimodal is not necessarily “native” multimodal. Some multimodal systems are late-fusion: separate per-modality encoders, then a joint head. Others are native multimodal: tokens from all modalities flow through the same Transformer from the start. Both qualify as “multimodal”; the architectural difference matters.

Multimodal is not a single architecture. CLIP-style contrastive models, encoder-fusion patterns, decoder-only multimodal Transformers, diffusion-based multimodal generators — all are multimodal models with different architectures. The chapter develops the range.

The shift from specialized to unified models

A specific industry transition. Pre-2021, computer vision and NLP were largely separate fields with separate conferences (CVPR vs ACL), separate benchmarks, separate communities. AI deployment combined them through pipelines — a vision component plus a text component.

From 2021 onward, the boundaries dissolved rapidly. CLIP (2021) trained vision and language jointly. Flamingo (2022) handled interleaved image and text input in a single network. GPT-4V (2023) and GPT-4o (2024) brought multimodal capabilities to deployed frontier LLMs. By 2026, every major frontier model has substantial multimodal capability.

The implications.

Model unification. A single deployed model handles many modalities. Reduces engineering complexity (no separate vision + text pipelines); improves cross-modal capability (joint training enables emergent abilities single-modality models lack).

Capability emergence. Some multimodal capabilities emerge from joint training rather than being explicitly trained. GPT-4o can interpret diagrams, read charts, understand UI screenshots — capabilities not explicitly in the training objective.

Deployment economics. Multimodal-capable models are more useful per deployment than single-modal alternatives; the economic case for multimodality is strong.

Research consolidation. The previously-separate research communities (CV, NLP, speech) have substantially merged. Modern AI labs do not maintain separate research tracks; multimodal work spans all.

Why multimodal matters in 2026

Four motivations.

1. Production deployment. Frontier consumer products are increasingly multimodal. ChatGPT, Claude, Gemini all handle image uploads. Computer-use agents (Anthropic Computer Use, OpenAI Operator) depend on vision. Voice interfaces (ChatGPT Voice Mode, Gemini Live) integrate speech with text. Video understanding is rolling out (Gemini’s video analysis; competitor products following). Multimodal is no longer optional for frontier deployment.

2. Robotics and physical action. VLA models bridge perception, language, and action — enabling AI systems that operate in the physical world. The 2024-2026 progress in robotics is substantially driven by multimodal models (RT-2, OpenVLA, π0, Cosmos and successors). Cross-reference Robotics (planned).

3. Scientific applications. Multimodal models support scientific applications spanning many data types. Microscopy + text descriptions; molecular structures + properties; satellite imagery + climate data. Cross-reference AI for Science.

4. The path to broader AI. Whether or not “AGI” is the right target, broader AI capability requires handling the full range of human communication and perception modalities. Multimodality is a necessary condition for AI systems that interact with the world the way humans do.

Boundaries with adjacent chapters

This chapter consolidates multimodal-specific content that overlaps substantially with other chapters.

Self-Supervised Learning §7 covers cross-modal SSL pretraining (CLIP, ALIGN, the vision-language pretraining literature). This chapter develops the modeling aspects on top of that pretraining substrate.
Generative Models §9 covers modality-specific generation (text-to-image, video, audio, 3D, molecules). This chapter develops the cross-modal and unified aspects of these systems.
Large Language Models §3 develops Transformer architecture. Multimodal models extend LLM architecture to handle non-text tokens; the architectural foundations are LLM-side.
Foundation Models provides the FM-as-substrate framing. Multimodal FMs are FMs; the broader pretraining/adaptation framing applies.
AI Agents §7 covers computer-use agents (Anthropic Computer Use, OpenAI Operator). These are multimodal in capability; the agent-specific aspects live in the AI Agents chapter.
Robotics (planned) develops embodied AI. VLA models (§9 of this chapter) are central to modern robotics; this chapter covers them as a multimodal pattern; the broader robotics context lives in the Robotics chapter.
Reinforcement Learning is relevant for VLA training. Many VLA models are trained with combinations of supervised behaviour cloning and RL; the RL details live in the RL chapter.
Evaluation §10 covers multimodal evaluation as part of agent evaluation; this chapter §11 covers the multimodal-specific aspects.
Deep Learning §4-§6 develop the architectural primitives (CNNs, Transformers, attention) that multimodal models combine.
Causality §10 has a minor connection (causal reasoning across modalities is a niche but interesting research direction).

What this chapter does not try to do

Several explicit exclusions.

We do not provide a complete history of computer vision. The classical-CV-to-deep-CV transition (LeNet → AlexNet → ResNet → ViT) is in DL §4 and is assumed background.
We do not develop modality-specific generation in depth. Text-to-image diffusion, video generation, audio synthesis are in Generative Models §9.
We do not cover every multimodal benchmark. The space is large; we focus on the dominant benchmarks and methodology.
We do not develop the full robotics context for VLA models. The planned Robotics chapter develops embodied AI; this chapter touches VLA as a multimodal pattern.
We do not extensively cover the engineering of multimodal serving (multimodal inference infrastructure, vision-tokenizer engineering, mixed-modal-batching). These matter for production but are largely engineering rather than research content.

Position taken in this chapter

The chapter takes multimodal models seriously as a substantial paradigm shift in AI architecture and deployment. The shift is real (frontier models are multimodal; deployment is multimodal; user experience is multimodal). The chapter is appropriately cautious about specific capability claims — multimodal hallucinations, modality-specific failure modes, and uneven quality across modalities are real concerns.

The chapter’s overall framing: multimodal is the modern reality of AI architecture; specific capabilities are uneven across modalities; the unification trend is strong and continuing. The technical content develops what multimodal models do, how, and what remains hard.

§2. Historical Context

This section traces multimodal models from pre-deep-learning specialized systems through the 2024-2026 native-multimodal frontier.

A timeline of the inflection points:

Pre-2012

Multimodal as pipeline

Separate vision and language systems combined via narrow interfaces. Each component trained separately; multimodal applications limited.
2014–2015

Image captioning emerges

Deep-CNN-encoder + RNN-decoder pipelines (Show and Tell, 2014; Karpathy and Fei-Fei, 2015). First true joint vision-language deep learning.
2015–2018

Visual question answering

VQA develops as a research direction. CNN + RNN architectures with attention. Substantial benchmark development (VQA, GQA, CLEVR).
2018–2020

Vision-language pretraining — BERT-style era

Masked-modeling on paired data. ViLBERT (Lu et al. 2019), VisualBERT (Li et al. 2019), and many successors.
2021

CLIP — the contrastive inflection

Radford et al. (OpenAI). Contrastive vision-language pretraining at internet scale (400M image-text pairs). Zero-shot transfer to many vision tasks. Simple objective, massive scale, emergent capability.
2021–2022

CLIP-style models proliferate

ALIGN (Google), Florence, OpenCLIP. Vision-language pretraining becomes standard.
2022

BLIP and Flamingo

BLIP (Li et al., Salesforce): unified vision-language understanding-and-generation. Flamingo (Alayrac et al., DeepMind): few-shot in-context learning for visual tasks via a large frozen LLM with an image-perception module.
2022–2023

LLaVA and the open VLM explosion

LLaVA (Liu et al.) and the open vision-language model proliferation — combining open LLMs with CLIP image encoders via projection layers. Substantial open-source progress.
2023

GPT-4V — first frontier VLM deployed

OpenAI, September 2023. GPT-4’s vision capabilities released publicly. The first frontier vision-language model widely deployed.
2023

Gemini 1.0 — native multimodality framing

Google DeepMind, December 2023. Trained natively multimodal from the start (not added later). Pushes the “native multimodality” framing.
2024

GPT-4o, Claude 3, Chameleon, Anthropic Computer Use

GPT-4o (OpenAI, May 2024): native multimodal frontier model handling text, image, audio — audio I/O brings voice interfaces. Claude 3 (Anthropic, March 2024): substantial vision-language capability in Opus and successors. Chameleon (Meta, May 2024): early-fusion mixed-modal autoregressive model, image + text in a single token stream. Anthropic Computer Use (October 2024): vision-and-action agent controlling computers via screenshots (cross-reference AI Agents §7).
2022–2024

Vision-Language-Action models mature

RT-1 (Google, 2022), RT-2 (Google DeepMind, 2023), PaLM-E (Google and TU Berlin, 2023). OpenVLA (2024) opens the field with open weights. π0 (Physical Intelligence, 2024) and successors push robotic-manipulation capability.
2024–2025

Video generation matures

Sora (OpenAI, Feb 2024), Veo (Google), Gen-3 (Runway), Mochi (Genmo). Text-to-video at minute-length becomes possible (cross-reference Generative Models §9).
2024–2025

Audio-language models advance

AudioLM line, MusicGen, AudioGen; voice-mode integration in frontier chatbots (ChatGPT Advanced Voice Mode, Gemini Live).
2025–2026

Multimodal becomes standard

Every major frontier model has substantial multimodal capability (text + vision + audio + video). VLA models become the standard for robotics. Computer-use capabilities mature into deployment.

We develop each phase below.

Pre-deep-learning multimodal as pipeline

The early architecture. Pre-deep-learning systems combining modalities did so via pipelines — a vision component produced features; an NLP component consumed them; outputs flowed through narrow interfaces.

Example. Image captioning before deep learning: extract image features (SIFT, HOG, etc.); compute similarity to caption templates; produce a caption via template selection. The system had no joint understanding; the components were independently trained.

The limitations. Pipeline systems were narrow (each component handled a specific task); brittle (errors compounded across the pipeline); non-adaptive (changes to one component required updates to others).

Image captioning and the deep-learning multimodal era

The first true multimodal deep learning. Show and Tell (Vinyals, Toshev, Bengio, Erhan, Google; arXiv 2014, CVPR 2015) used a CNN encoder and an LSTM decoder for image captioning. It was trained end-to-end and performed substantially better than pipeline systems.

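Karpathy and Fei-Fei (2015), “Deep Visual-Semantic Alignments for Generating Image Descriptions,” aligned sentence fragments with image regions in a shared embedding space and generated region-level descriptions. Attention over image locations during decoding was introduced by Show, Attend and Tell (Xu et al., 2015).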

The pattern. Encoder-decoder multimodal: a CNN encodes the image; an RNN decoder generates text token-by-token, attending to image features. The architecture became standard for image captioning and related vision-language tasks through 2015-2018.

The limitations. Specialized for vision-to-text; required paired training data; performance plateaued before reaching deployment-grade quality.

Visual question answering

A specific task category that drove multimodal research. Visual Question Answering (VQA): given an image and a natural-language question, produce the answer.

Benchmarks. VQA (Antol et al., 2015); GQA (Hudson and Manning, 2019); CLEVR (Johnson et al., 2017, focused on compositional reasoning); VCR (Visual Commonsense Reasoning).

Architectures. CNN image encoder + RNN/Transformer question encoder + joint fusion + answer prediction. Multiple architectural variants explored attention mechanisms, bilinear pooling, graph-based reasoning.

The trajectory. VQA performance improved steadily through 2018-2020 but remained substantially below human level on many subsets. The architectural innovations of the period (attention, fusion methods) became foundations for later work.

Vision-language pretraining: BERT-style era

The next inflection. Following BERT’s success in NLP (2018), researchers developed vision-language pretraining models with similar self-supervised objectives.

ViLBERT (Lu et al., 2019) — two-stream Transformer with cross-attention; masked-language-modeling and image-text matching objectives.

VisualBERT (Li et al., 2019) — single-stream architecture.

LXMERT (Tan and Bansal, 2019) — cross-modality encoder.

UNITER (Chen et al., 2020) — universal image-text representation learning.

The pattern. Pretrain on paired image-text data with masked-modeling-style objectives; fine-tune for downstream vision-language tasks. The approach substantially improved VQA, image captioning, and similar tasks.

The limitations. The models required paired data (caption-image pairs); training was expensive but scales were modest by later standards (millions of pairs); zero-shot transfer to novel tasks was limited.

CLIP and the contrastive multimodal era

The inflection moment. CLIP (Contrastive Language-Image Pre-training; Radford, Kim, Hallacy et al., OpenAI, 2021) “Learning Transferable Visual Models From Natural Language Supervision.”

The recipe. Train two encoders (one for images, one for text) such that paired image-text examples have similar embeddings and unpaired examples have dissimilar embeddings. The training data: 400 million image-text pairs scraped from the web.
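One common way to write this objective is the symmetric cross-entropy over scaled cosine similarities described in the CLIP paper. The notation below is our own labelling rather than something defined elsewhere in this chapter: $N$ is the batch size, $u_i$ and $v_i$ are the unit-normalized image and text embeddings of the $i$-th pair, and $\tau$ is a learned temperature.

```latex
\[
s_{ij} = \frac{u_i^\top v_j}{\tau}, \qquad
\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N}
\left[
  \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}
  + \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})}
\right]
\]
```

The first term treats each image as a query over all texts in the batch; the second treats each text as a query over all images. Matched pairs sit on the diagonal of the similarity matrix, so minimizing $\mathcal{L}$ raises $s_{ii}$ relative to every off-diagonal entry in its row and column.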

The result. CLIP’s image and text representations are aligned in a shared embedding space. This enables:

Zero-shot classification. Classify an image by computing similarity to text descriptions of candidate classes; no per-class training needed.
Image retrieval. Find images matching a text query (or vice versa) via embedding similarity.
Foundation for downstream models. CLIP-encoded features become substrate for many downstream vision-language tasks.
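The zero-shot classification step can be sketched in a few lines. This is a minimal illustration, not CLIP's actual code: the embeddings are hand-written toy vectors standing in for encoder outputs, and the function names, prompt strings, and temperature value are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, class_text_embs, temperature=0.01):
    """Return (predicted class index, class probabilities) for one image embedding."""
    image_emb = l2_normalize(image_emb)
    class_text_embs = l2_normalize(class_text_embs)
    logits = class_text_embs @ image_emb / temperature  # one score per class prompt
    logits = logits - logits.max()                      # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()       # softmax over classes
    return int(np.argmax(probs)), probs

# Toy stand-ins for text-encoder outputs (hypothetical 3-dimensional embeddings).
class_names = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
class_text_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Toy stand-in for an image-encoder output that lies closest to the "dog" prompt.
image_emb = np.array([0.2, 0.9, 0.1])

pred, probs = zero_shot_classify(image_emb, class_text_embs)
print(class_names[pred])
```

No classifier weights are trained here: adding a new class only requires embedding a new text prompt, which is what makes the approach zero-shot.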

The CLIP paper showed zero-shot performance on dozens of vision benchmarks rivaling supervised models. The result was striking; multimodal pretraining at internet scale produced general visual capability.

The trajectory. CLIP catalyzed extensive subsequent work:

ALIGN (Jia et al., Google, 2021) — similar approach at larger scale.
OpenCLIP (Ilharco et al., 2022+) — open-source CLIP reproduction at increasing scale.
Florence (Yuan et al., Microsoft, 2021) — vision-language model with broader task coverage.
EVA-CLIP, SigLIP, and many variants — methodological refinements.

By 2022-2023, CLIP-style models were the standard visual-encoder substrate for downstream vision-language work. They remain central in 2026.

BLIP, Flamingo, and the encoder-decoder scaling

A different line of work. Several 2022-2023 systems combined large language models with vision encoders to produce generative vision-language models.

BLIP (Li et al., Salesforce, 2022) — unified vision-language understanding and generation. Combined CLIP-style contrastive objectives with image-grounded text generation objectives. Followed by BLIP-2 (Li et al., 2023) — a Q-Former architecture for connecting a frozen vision encoder to a frozen LLM.

Flamingo (Alayrac et al., DeepMind, 2022) — frozen vision encoder + frozen language model + lightweight cross-attention layers for connecting them. Supports few-shot in-context learning on visual tasks. Substantially advanced vision-language capability.

LLaVA (Liu, Li, Wu, Lee, 2023). Open-source vision-language model combining a LLaMA-family LLM (Vicuna in the original release) with a CLIP image encoder via a projection layer. Sparked extensive open-source vision-language model work.

The pattern. Combine pretrained components (vision encoder + language model) with lightweight connectors. Train the connector (and sometimes parts of the LM) on image-text data. Achieve substantial vision-language capability at much lower cost than from-scratch multimodal pretraining.

GPT-4V and the frontier-VLM era

A specific frontier release. GPT-4V (OpenAI, announced September 2023) brought vision capabilities to GPT-4. The first widely-deployed frontier-LLM with substantial vision capability.

The reception. Demonstrated substantially better vision-language capability than prior open systems. Users could upload images to ChatGPT; ChatGPT could discuss them, answer questions, identify content, OCR text.

The trajectory. Subsequent frontier models substantially advanced vision-language capability:

GPT-4V (September 2023).
Gemini 1.0 (December 2023): natively multimodal from the start.
Gemini 1.5 Pro (February 2024): long-context multimodal; video processing.
Claude 3 family (March 2024): vision in Opus, Sonnet, Haiku.
GPT-4o (May 2024): substantially advanced unified multimodal (text + vision + audio).
Claude 3.5 Sonnet (June 2024): vision capability competitive with GPT-4o.
GPT-4.5, GPT-5, Claude 4, Gemini 2.0+ (2025-2026): continued capability advancement.

The pattern. Vision-language capability became expected in frontier models. Deployment of vision-text-audio-video unified models became standard.

Gemini and native multimodality

A specific architectural choice. Gemini 1.0 (Google DeepMind, December 2023) was explicitly described as “natively multimodal”: trained on multimodal data from the start of pretraining, rather than starting with a text-pretrained LLM and adding vision later.

The argument. Native multimodal training produces better cross-modal capabilities than late-fusion approaches. Modalities can attend to each other from the earliest layers; cross-modal patterns emerge naturally.

The reception. The “native multimodality” framing has become widely adopted. Subsequent frontier models (GPT-4o, Gemini 1.5+, Claude 3.5+) increasingly emphasize joint multimodal pretraining.

The honest assessment. Whether fully native multimodality is necessary, or whether late-fusion methods can reach similar capability with appropriate scale, is contested. The 2026 evidence: both approaches produce capable systems; the differences may be substantial in specific capabilities (audio understanding, real-time interaction) but not universal.

Vision-language-action models

A specific direction with substantial 2023-2026 progress. Vision-Language-Action (VLA) models extend the multimodal paradigm to physical action.

RT-1 (Robotic Transformer 1; Google, 2022). An early large-scale Transformer-based robot policy; trained on roughly 130,000 real-robot demonstration episodes covering more than 700 tasks.

RT-2 (Google DeepMind, 2023). Builds on RT-1 with a vision-language-action foundation model. Uses a VLM as the backbone; fine-tunes for robotic action.

PaLM-E (Google and TU Berlin, 2023). Embodied multimodal language model that feeds visual input and robot state into PaLM.

OpenVLA (Kim, Pertsch, Karamcheti et al., 2024). Open-source VLA model; democratized VLA research.

π0 (Pi-zero) (Physical Intelligence, late 2024). Frontier manipulation policy; demonstrates substantial capability on dexterous manipulation tasks.

Cosmos (NVIDIA, 2025+) and related world-model approaches for robotics.

The trajectory. VLA models have moved from research-stage demonstrations to substantive robotic capability in 2024-2026. Cross-reference Robotics (planned chapter) for the broader context.

Computer-use agents as multimodal applications

A specific applied direction. Anthropic Computer Use (October 2024) and OpenAI Operator (January 2025) — agents controlling computers via screenshots and keyboard/mouse. Cross-reference AI Agents §7.

The multimodal aspect. Computer-use agents are vision-and-action multimodal systems: they perceive screens (vision); they understand goals (language); they execute UI actions (action). The same paradigm as VLA but in a different domain (computer interfaces rather than physical robots).
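The perceive, decide, act cycle described above can be sketched as a simple loop. Everything here is hypothetical scaffolding for illustration: the `Action` type, the `run_agent` function, and the toy "screen" are assumptions, and a real computer-use agent would replace `policy` with a call to a multimodal model and `observe` with an actual screenshot.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "type" or "done" in this toy example
    payload: str = ""  # text to type, if any

def run_agent(goal, observe, policy, execute, max_steps=10):
    """Repeat observe -> decide -> act until the policy says it is done."""
    history = []
    for _ in range(max_steps):
        screenshot = observe()                       # perception (vision)
        action = policy(goal, screenshot, history)   # goal understanding (language)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                              # UI action
    return history

# Toy environment: the "screen" is a single text field.
screen = {"text": ""}

def observe():
    return dict(screen)

def execute(action):
    if action.kind == "type":
        screen["text"] += action.payload

def policy(goal, screenshot, history):
    """Stand-in for the model: type the next missing character of the goal."""
    typed = screenshot["text"]
    if typed == goal:
        return Action("done")
    return Action("type", goal[len(typed)])

history = run_agent("hi", observe, policy, execute)
print([a.kind for a in history], screen["text"])
```

The same loop shape applies to a VLA-controlled robot if `observe` returns camera frames and `execute` sends motor commands.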

Where this leaves us in 2026

The current state. Multimodal models are standard in frontier AI. Every major frontier model is multimodal; deployment is multimodal; user experience is multimodal. The capability range spans text, vision, audio, video, action.

Specific capability levels.

Vision-language understanding. Mature; production-grade for most tasks.
Audio-language understanding. Substantially mature; production-grade for speech-to-text.
Multimodal generation. Mature for image and audio; rapidly maturing for video.
Vision-language-action. Research-grade to early-production; rapid progress.
Long-context multimodal. Emerging; some frontier models handle hour-long videos or large multimodal documents.
Cross-modal reasoning. Emerging; works for many cases; fails in subtle ways.

The remaining sections develop the technical content. §3 covers the modeling framework. §4 covers vision-language models. §5 covers vision generation (briefly; cross-reference Generative Models). §6 covers audio and speech. §7 covers video. §8 covers native multimodality. §9 covers VLA. §10 covers cross-modal retrieval. §11 covers evaluation. §12-§16 close out.

Editorial note. Multimodal models are rapidly evolving. Specific model releases and capability levels will date faster than the architectural framework. The chapter is a snapshot of the methodology and the state of the art as of mid-2026.

§3. The Multimodal Modeling Framework

A conceptual framework for organizing the multimodal modeling landscape. This section develops the vocabulary used throughout the chapter — modality, representation, fusion patterns, native vs late multimodality.

Modality vs representation

A useful initial distinction.

A modality is a form of data — text (sequences of characters), images (2D pixel arrays), audio (waveforms or spectrograms), video (sequences of images), 3D shapes (meshes, point clouds), structured data (tables, graphs), action commands (motor signals, mouse events).

A representation is how a modality is encoded inside a model — discrete tokens (text subwords, image patches as discrete codes from VQ-VAE), continuous embeddings (vision Transformer patch embeddings), spectrograms (frequency-domain audio).

The same modality can have multiple representations. Audio can be represented as raw waveform samples; as spectrograms (frequency-time matrices); as discrete tokens from a learned codec (EnCodec, SoundStream). Each representation is more suitable for different model architectures and tasks.
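To make the waveform-versus-spectrogram distinction concrete, here is a small sketch that takes one synthetic signal and produces both representations. The sample rate, frame length, and hop size are arbitrary choices for the example, and `magnitude_spectrogram` is a hypothetical helper, not a function from any audio library. Learned codec tokens are not shown because they require a trained model.

```python
import numpy as np

SAMPLE_RATE = 8000  # samples per second (assumed for this example)
FRAME_LEN = 256     # samples per analysis frame
HOP = 128           # samples between frame starts

def magnitude_spectrogram(waveform, frame_len=FRAME_LEN, hop=HOP):
    """Short-time Fourier magnitudes: rows are time frames, columns are frequency bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

# Representation 1: raw waveform samples (one second of a 1 kHz tone).
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
waveform = np.sin(2 * np.pi * 1000 * t)

# Representation 2: frequency-time matrix of the same signal.
spec = magnitude_spectrogram(waveform)

# The tone shows up as a single dominant frequency column.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(waveform.shape, spec.shape, peak_hz)
```

The waveform is a long 1D sequence, while the spectrogram is a much shorter sequence of frequency vectors; a model consuming the latter sees far fewer time steps, which is one reason spectrogram inputs are common for Transformer-based audio encoders.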

The implication. Multimodal model design involves choices both about which modalities to handle and how to represent each. These choices substantially shape the architecture.

Unimodal, bimodal, multimodal

Terminology levels.

Unimodal models process one modality. GPT-3 (text-only), AlphaFold 2 (sequence-only at the input level), pre-2021 vision Transformers (image-only). The classical deep-learning paradigm.

Bimodal models combine exactly two modalities. CLIP (image + text); audio-text models like Whisper (audio + text); early vision-language models. Substantively richer than unimodal but constrained to specific modality pairs.

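Multimodal models, in the narrow sense of this taxonomy, combine three or more modalities. GPT-4o (text + image + audio); Gemini (text + image + audio + video); VLA models (vision + language + action). The 2024-2026 frontier. Note that the working definition in §1 is broader (any model spanning more than one modality), so bimodal models also count as multimodal in the chapter-wide sense.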

The trend in 2026 is toward more modalities in single models. The boundaries between bimodal and multimodal are fluid; the central trend is integration.

A finer distinction within “multimodal.”

Cross-modal models map between modalities. A speech-to-text model (audio → text); a text-to-image model (text → image). The model has input modalities and output modalities; the mapping is the primary operation.

Joint-modal models process multiple modalities together to produce a unified output. A VQA model (image + question → answer); a vision-language model answering questions about images. The model jointly processes multiple input modalities.

Unified multimodal models can both input and output across multiple modalities, often interchangeably. GPT-4o (input: text, image, or audio; output: text, image, or audio). The most flexible category; the 2024-2026 frontier.

The classification matters because architectural choices differ.

Cross-modal often uses encoder-decoder (encode source modality; decode target modality).
Joint-modal uses fusion architectures (encode each modality; combine for downstream task).
Unified uses tokenization-across-modalities + Transformer (all modalities flow through the same architecture).
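The "tokenization across modalities" pattern in the last item can be sketched as follows. This is an assumption-laden illustration, not any specific model's scheme: the vocabulary sizes, the begin/end-of-image marker IDs, and the `Segment` and `to_unified_sequence` names are all hypothetical. The idea is only that each modality's tokenizer-local IDs are mapped into disjoint ranges of one shared vocabulary.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical shared-vocabulary layout.
TEXT_VOCAB = 32000                   # ids 0 .. 31999 are text subwords
IMAGE_CODEBOOK = 8192                # ids 32000 .. 40191 are image codes
BOI = TEXT_VOCAB + IMAGE_CODEBOOK    # begin-of-image marker
EOI = BOI + 1                        # end-of-image marker

@dataclass
class Segment:
    modality: str    # "text" or "image"
    ids: List[int]   # ids local to that modality's tokenizer

def to_unified_sequence(segments):
    """Flatten interleaved segments into one id sequence over the shared vocabulary."""
    seq = []
    for seg in segments:
        if seg.modality == "text":
            seq.extend(seg.ids)
        elif seg.modality == "image":
            seq.append(BOI)
            seq.extend(TEXT_VOCAB + i for i in seg.ids)  # shift into the image range
            seq.append(EOI)
        else:
            raise ValueError(f"unknown modality: {seg.modality}")
    return seq

segments = [
    Segment("text", [5, 17, 42]),
    Segment("image", [0, 3, 8191]),
    Segment("text", [7]),
]
unified = to_unified_sequence(segments)
print(unified)
```

Once everything is a single id sequence, the Transformer needs no modality-specific branches; the marker tokens are the only signal telling it where an image begins and ends.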

Alignment vs fusion vs generation

A different axis. What does the multimodal model do with the modalities?

Alignment. Train representations such that paired examples across modalities are similar in embedding space; unpaired are different. The CLIP recipe (§4). Useful for retrieval, classification, search.

Fusion. Combine representations from multiple modalities into a joint representation for a downstream task. The BLIP / Flamingo / LLaVA recipe (§4). Useful for VQA, image captioning, multimodal reasoning.

Generation. Produce output in one or more modalities conditional on inputs. Text-to-image diffusion (§5; cross-reference Generative Models §6); multimodal text-and-image generation (GPT-4o image generation; §8). Useful for content creation, design assistance.

A given system may do multiple of these. CLIP does alignment; it can be used as a component of fusion systems and as a conditioning signal for generation. GPT-4o does fusion (multimodal input → text output) and generation (text input → image/audio output).

Native multimodality vs late fusion

A specific architectural choice that has been contested. Native multimodality vs late fusion.

Late fusion. Start with specialized per-modality components (a vision encoder; a language model). Train them separately on per-modality data. Combine them later via lightweight connectors (projection layers, cross-attention modules). The CLIP-then-LLaVA pattern; many open-source vision-language models.

flowchart TD
  ImgIn(["Image input"])
  TxtIn(["Text input"])
  VEnc["Vision Encoder (e.g., CLIP)\n(pretrained, frozen)"]
  Proj["Projection Layer\n(small, trained)"]
  LM["Language Model\n(pretrained, may be fine-tuned)"]
  Out(["Output text"])

  ImgIn --> VEnc --> Proj --> LM
  TxtIn --> LM
  LM --> Out

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class ImgIn,TxtIn,Out pill
  class VEnc,Proj,LM pill

Advantages of late fusion. Cost-effective — reuse existing pretrained components. Modular — swap encoders or LMs without retraining everything. Open-source-friendly — combine open vision encoders with open LMs.

Disadvantages. Limited cross-modal integration — modalities only interact at the connection point; complex cross-modal reasoning may be limited. Constrained capability — the system inherits the limitations of its components without joint optimization.
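A sketch. The late-fusion data path (frozen encoder output, small trained projection, language-model input) fits in a few lines. This is an illustration under stated assumptions, not any particular system's implementation: the dimensions, the random stand-ins for encoder outputs and text embeddings, and the names `project` and `build_lm_input` are all hypothetical.

```python
import random

random.seed(0)

VISION_DIM = 4   # assumed width of the frozen vision encoder's output
LM_DIM = 6       # assumed width of the language model's token embeddings


def rand_matrix(rows, cols):
    """Random stand-in for learned weights or encoder outputs."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(features, weight, bias):
    """Linear connector: map each vision feature vector into LM embedding space."""
    columns = list(zip(*weight))  # weight is VISION_DIM x LM_DIM
    return [
        [sum(f * w for f, w in zip(feat, col)) + b for col, b in zip(columns, bias)]
        for feat in features
    ]


def build_lm_input(image_features, text_embeddings, weight, bias):
    """Prepend projected image tokens to the text token embeddings."""
    return project(image_features, weight, bias) + text_embeddings


# Stand-ins: 3 image patch features from a frozen encoder, 2 text token embeddings.
image_features = rand_matrix(3, VISION_DIM)
text_embeddings = rand_matrix(2, LM_DIM)
W = rand_matrix(VISION_DIM, LM_DIM)   # the only trained parameters in this sketch
b = [0.0] * LM_DIM

lm_input = build_lm_input(image_features, text_embeddings, W, b)
print(len(lm_input), len(lm_input[0]))
```

The point of the sketch: the modalities meet only where the two lists are concatenated, which is the "connection point" limitation noted above.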

Native multimodality. Train a single model from the start on multimodal data. Tokens from all modalities flow through the same architecture from the earliest layers; cross-modal attention happens at every layer.

flowchart TD
  ImgIn(["Image input"])
  AudIn(["Audio input"])
  TxtIn(["Text input"])
  AnyIn(["… any modality"])
  ImgTok["Image tokenizer"]
  AudTok["Audio tokenizer"]
  TxtTok["Text tokenizer"]
  AnyTok["…"]
  Unified["Unified token sequence"]
  Transformer["Transformer\n(all layers; all tokens attend to each other)"]
  Out(["Output token sequence\n(text, image, audio — any modality)"])

  ImgIn --> ImgTok --> Unified
  AudIn --> AudTok --> Unified
  TxtIn --> TxtTok --> Unified
  AnyIn --> AnyTok --> Unified
  Unified --> Transformer --> Out

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim  fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class ImgIn,AudIn,TxtIn,AnyIn,Unified,Out pill
  class ImgTok,AudTok,TxtTok,Transformer pill
  class AnyTok dim

Advantages of native multimodality. Deep cross-modal integration — modalities interact at every layer; cross-modal reasoning is emergent. Capability ceiling may be higher than late fusion for complex multimodal tasks.

Disadvantages. Expensive — requires full multimodal pretraining. Less modular — modalities are entangled; harder to update specific components. Less open-source-accessible — only major labs can afford frontier-scale native-multimodal training.

The 2026 picture. Frontier proprietary models (Gemini, GPT-4o, Claude 3.5+) are increasingly native multimodal. Open-source models predominantly use late fusion. The capability gap is real but not absolute — late-fusion models can be very capable, and native multimodal is not always strictly better for specific tasks.

Tokenization across modalities

A specific technical concern for native multimodality. To process multiple modalities in a single Transformer, each modality must be tokenized into a sequence the Transformer can consume.

Text tokenization. Standard subword tokenization (BPE and related algorithms, via libraries such as SentencePiece and tiktoken). Well-established; vocabulary sizes ~50K-200K tokens.

Image tokenization. Several approaches.

Patch embeddings. A Vision Transformer (ViT) splits the image into fixed-size patches (16×16 pixels in the original ViT); each patch is linearly embedded into a continuous vector. These are continuous rather than discrete tokens.
VQ-VAE discrete codes. Train a VQ-VAE to map image patches (or whole images) to discrete codes from a learned codebook. Discrete tokens like text.
VQGAN, learned tokenizers. Refinements producing higher-quality discrete image tokens.
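A sketch. Two pieces of arithmetic sit behind the options above: how many tokens a patch grid yields, and how a continuous vector becomes a discrete code. The codebook, the input vectors, and the function names below are toy assumptions; real codebooks are learned and far larger, and a VQ-VAE quantizes encoder outputs rather than hand-written vectors.

```python
def num_patches(height, width, patch):
    """Number of non-overlapping patch tokens for an image (ViT-style grid)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch) * (width // patch)


def quantize(vector, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


# Toy codebook: 4 entries of dimension 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_outputs = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = [quantize(v, codebook) for v in encoder_outputs]
print(num_patches(224, 224, 16))  # 14 * 14 = 196 patch tokens
print(tokens)                      # [0, 3, 2]
```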

Audio tokenization. Similar choices.

Continuous features. Mel-spectrograms, wav2vec embeddings.
Discrete codecs. EnCodec, SoundStream, DAC — neural audio codecs producing discrete tokens.

Video tokenization. Extend image tokenization to space-time.

Per-frame tokenization. Independent tokenization of each frame.
Space-time tokenization. 3D tokens spanning space and time. Sora-style.

Action tokenization. For VLA models.

Discrete action codes. Discretize action space; train action tokenizer.
Continuous actions. Output continuous values directly; less Transformer-friendly.
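A sketch. The simplest discrete action code is uniform binning of each action dimension. The bin count (256), the action range, and the function names are assumptions for illustration; learned action tokenizers are more elaborate.

```python
def action_to_token(value, low, high, bins=256):
    """Map a continuous action value to a bin index in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # value == high falls into the last bin


def token_to_action(token, low, high, bins=256):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / bins
    return low + (token + 0.5) * width


# Example: one dimension of a normalized end-effector command in [-1, 1].
token = action_to_token(0.3, -1.0, 1.0)
recovered = token_to_action(token, -1.0, 1.0)
print(token, recovered)
```

The round trip is lossy by at most half a bin width, which is the precision cost of making actions look like ordinary tokens.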

The architectural implication. Unified discrete tokenization across all modalities enables fully-unified Transformer processing — every modality is a token sequence; the Transformer doesn’t distinguish them. Chameleon (Meta) and similar systems take this approach. Continuous representations require additional architectural machinery (cross-attention, projection layers).
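A sketch. One simple way to give every modality a place in a single token sequence is to assign each modality a disjoint ID range in a shared vocabulary. The vocabulary sizes and the offset scheme below are assumptions chosen for illustration; this is not a description of Chameleon's or any other system's actual vocabulary layout.

```python
TEXT_VOCAB = 1000    # assumed sizes, chosen only for illustration
IMAGE_CODES = 512
AUDIO_CODES = 256

# Each modality gets a disjoint ID range in one shared vocabulary.
OFFSETS = {"text": 0, "image": TEXT_VOCAB, "audio": TEXT_VOCAB + IMAGE_CODES}
SIZES = {"text": TEXT_VOCAB, "image": IMAGE_CODES, "audio": AUDIO_CODES}


def to_unified(modality, local_ids):
    """Shift modality-local token IDs into the shared vocabulary."""
    size = SIZES[modality]
    if any(not 0 <= i < size for i in local_ids):
        raise ValueError(f"{modality} id out of range")
    return [OFFSETS[modality] + i for i in local_ids]


def from_unified(token):
    """Recover (modality, local id) from a shared-vocabulary token."""
    for modality in ("audio", "image", "text"):  # highest offset first
        if token >= OFFSETS[modality]:
            return modality, token - OFFSETS[modality]
    raise ValueError("negative token id")


# Interleaved text, image, text: the Transformer sees one flat sequence of IDs.
sequence = to_unified("text", [5, 17]) + to_unified("image", [0, 511]) + to_unified("text", [42])
print(sequence)
print([from_unified(t)[0] for t in sequence])
```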

The trade-off. Discrete tokenization is simpler architecturally but loses information (especially for high-resolution images and audio). Continuous tokenization is higher-fidelity but more complex. The 2026 trend: hybrid approaches that maintain continuous representations where fidelity matters and discrete where simplicity matters.

Modality-specific characteristics

A practical observation. Different modalities have different characteristics that affect design.

Text. Discrete; relatively compact (~1 token per several characters); well-structured (grammar, syntax); rich semantic content. The most-developed modality in current AI.

Images. Continuous (or high-dimensional discrete); spatially structured; substantial information per item (a single image is ~256-2048 tokens depending on resolution and tokenizer). Visually rich; semantically complex.

Audio. Continuous (waveforms) or feature-based (spectrograms); temporally structured; substantial information density (1 second of audio is hundreds to thousands of tokens). Less semantically dense than images.

Video. Space-time; very large (1 minute of video can be millions of tokens at high resolution); temporally and spatially structured. The largest-token-budget modality.

Action. Often low-dimensional (motor commands, mouse positions); temporally structured; varies dramatically by domain (text typing vs robot manipulation).

Code. Like text but with stricter syntax and richer semantic structure; sometimes treated as a separate modality, sometimes as a text subspecies.

Structured data. Tables, graphs, knowledge bases. Often flattened to text for current models; native handling is research-stage.

The implication. Modalities have different token budgets and information densities. Multimodal model design must allocate context-window and compute appropriately across modalities.
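A sketch. Token-budget arithmetic makes the allocation problem concrete. Every number below (context window, patch size, audio token rate, frame rate, tokens per frame) is an assumed value for illustration, chosen to sit inside the rough ranges quoted above.

```python
def image_tokens(height, width, patch=16):
    """Patch-grid token count for one image."""
    return (height // patch) * (width // patch)


def audio_tokens(seconds, tokens_per_second):
    return int(seconds * tokens_per_second)


def video_tokens(seconds, fps, tokens_per_frame):
    """Per-frame tokenization: frames times tokens per frame."""
    return int(seconds * fps) * tokens_per_frame


def budget(context_window, **usage):
    """Return (total tokens used, tokens remaining) for a per-modality breakdown."""
    total = sum(usage.values())
    return total, context_window - total


CONTEXT = 128_000  # assumed context window

# One minute of video at full frame rate and high per-frame resolution overflows the window.
full_rate = video_tokens(60, fps=30, tokens_per_frame=1024)
print(full_rate, full_rate > CONTEXT)

# Subsampling to 1 frame per second at 256 tokens per frame makes it fit alongside other inputs.
total, remaining = budget(
    CONTEXT,
    text=2_000,
    image=image_tokens(512, 512),
    audio=audio_tokens(60, tokens_per_second=300),
    video=video_tokens(60, fps=1, tokens_per_frame=256),
)
print(total, remaining)
```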

Where the framework fits

The summary. The conceptual framework — modality vs representation, unimodal/bimodal/multimodal, cross-modal vs joint vs unified, alignment vs fusion vs generation, native vs late, tokenization across modalities, modality-specific characteristics — provides the vocabulary for the remaining sections. The choices vary; the framework is consistent.

Where multimodal modeling sits in 2026

The 2026 frontier is unified, native multimodal models that handle many modalities through unified tokenization and Transformer architecture, with both discrete and continuous representations depending on the modality. The open-source ecosystem remains predominantly late-fusion; the architectural gap matters for some applications and not for others.

The next sections develop the specific modality combinations: §4 covers vision-language; §5 vision generation; §6 audio; §7 video; §8 native unified multimodality; §9 vision-language-action; §10 cross-modal retrieval; §11 evaluation.

§4. Vision-Language Models

The most-developed multimodal model category. Vision-Language Models (VLMs) combine vision and language, supporting tasks like image captioning, visual question answering, image-grounded reasoning, and multimodal chat. This section develops the dominant architectural patterns, the production frontier systems, and the empirical state.

CLIP and contrastive pretraining

The foundational modern approach. CLIP (Contrastive Language-Image Pre-training; Radford, Kim, Hallacy et al., OpenAI, 2021).

The recipe (review from §2).

Two encoders. An image encoder (Vision Transformer or modified ResNet) produces image embeddings. A text encoder (Transformer) produces text embeddings. Both produce vectors of the same dimensionality.
Contrastive objective. Train on 400M image-text pairs. For each batch, the diagonal of the image-text similarity matrix should be high (matching pairs); off-diagonal should be low (non-matching pairs).
Joint embedding space. After training, image and text embeddings live in a shared semantic space. An image and a matching caption have similar embeddings; image-text similarity computes semantic match.
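A sketch. The contrastive objective above is a symmetric cross-entropy over the batch similarity matrix, with the matching pair on the diagonal as the correct class. The version below uses plain Python lists and hand-written toy embeddings; the fixed temperature of 0.07 is an assumption (in practice the temperature is typically a learned parameter), and the function names are illustrative.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs, temperature):
    """Cosine similarities scaled by 1/temperature; rows = images, columns = texts."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def clip_loss(image_embs, text_embs, temperature=0.07):
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    # Symmetric: image-to-text and text-to-image directions are averaged.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]     # caption k matches image k
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]    # captions swapped

print(clip_loss(images, aligned_texts) < clip_loss(images, shuffled_texts))
```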

The applications.

Zero-shot image classification. To classify an image with N candidate classes, encode the image and the N class descriptions (“a photo of a [class]”); pick the class with highest embedding similarity. No per-class training needed.
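A sketch. Zero-shot classification reduces to a nearest-neighbour lookup in the joint embedding space. The embeddings below are hand-written toy vectors and the lookup table is a hypothetical stand-in for a trained text encoder; only the prompt template comes from the description above.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_classify(image_emb, class_names, embed_text):
    """Pick the class whose prompt embedding is most similar to the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]
    scores = [cosine(image_emb, embed_text(p)) for p in prompts]
    best = max(range(len(class_names)), key=scores.__getitem__)
    return class_names[best], scores


# Hypothetical stand-in for a trained text encoder: a lookup table of toy vectors.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

image_embedding = [0.2, 0.8, 0.1]  # toy vector standing in for an image encoder output
label, scores = zero_shot_classify(
    image_embedding, ["cat", "dog", "car"], TOY_TEXT_EMBEDDINGS.__getitem__
)
print(label)
```

Adding a class means adding a prompt string, not retraining, which is what "no per-class training" amounts to in practice.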

Image retrieval. Given a text query, retrieve images with high text-image similarity. Given an image, retrieve similar-content images via embedding similarity.

Foundation for downstream models. CLIP image embeddings are general-purpose visual representations. Many subsequent vision-language models use CLIP’s image encoder as their visual substrate.

Filtering datasets. CLIP-image-text similarity helps filter training data. LAION-2B, LAION-5B, and other large datasets use CLIP filtering as a quality signal.

The CLIP family. Substantial follow-up work:

ALIGN (Google, 2021). Similar approach at larger scale (~1B pairs); demonstrated scaling-law behaviour.
OpenCLIP (Ilharco, Wortsman et al., 2021+). Open-source CLIP reproduction at increasing scale (up to ViT-G-class models trained on LAION data).
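SigLIP (Zhai et al., Google, 2023). Pairwise sigmoid loss instead of the softmax-based contrastive loss; removes the batch-wide normalization, which makes the loss less sensitive to batch size and improves results at smaller batch sizes.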
EVA-CLIP (Sun et al., 2023). Substantial scaling; state-of-the-art zero-shot performance.
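A sketch. The sigmoid loss used by SigLIP treats every image-text pair in the batch as an independent binary problem, so no softmax normalization over the batch is needed. The version below assumes unit-normalized input embeddings; the `scale` and `bias` values are illustrative assumptions (both are learned in practice), and the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def sigmoid_pair_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss over unit-normalized embeddings.

    Label is +1 on the diagonal (matching pair) and -1 everywhere else.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = scale * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

print(sigmoid_pair_loss(images, aligned_texts) < sigmoid_pair_loss(images, shuffled_texts))
```

Each term depends only on its own pair, so the loss can be computed in chunks across devices without gathering a full row of the similarity matrix.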

The 2026 state. CLIP-style models are standard infrastructure — frequently used as image encoders in larger systems; widely deployed for retrieval and search.

Encoder-fusion patterns: BLIP and BLIP-2

A different multimodal architecture. BLIP (Bootstrapping Language-Image Pre-training; Li, Li, Xiong, Hoi, Salesforce, 2022) unified vision-language understanding and generation.

The architecture. Multiple modules:

An image encoder.
A text encoder.
An image-grounded text encoder (cross-attention from text to image).
An image-grounded text decoder (generates text given image features).

Training objectives. Three objectives applied jointly: image-text contrastive learning (CLIP-style); image-text matching (binary classification of whether image-text pair matches); image-grounded language modeling (generate captions).

The result. BLIP supports both understanding (classification, retrieval) and generation (captioning, VQA in generative form). Substantial advance over CLIP for tasks requiring text generation.

BLIP-2 (Li et al., 2023) refined this. Used a frozen large language model with a small “Q-Former” (Query Transformer) bridging the vision encoder to the LM. Much more parameter-efficient; leveraged large pretrained LMs effectively.

The pattern. Bridge a vision encoder to a language model via a learned connector. Train only the connector (and sometimes the LM) on image-text data; leverage existing components.

Decoder-fusion patterns: Flamingo and LLaVA

A specific architectural pattern that became dominant. Decoder-fusion uses a pretrained language model as the backbone; injects visual information through specific attention mechanisms.

Flamingo (Alayrac, Donahue, Luc et al., DeepMind, 2022) “Flamingo: a Visual Language Model for Few-Shot Learning.”

The architecture.

Frozen vision encoder. Pretrained image encoder (Normalizer-Free ResNet).
Frozen large language model. Pretrained Chinchilla.
Perceiver Resampler. Variable-length image features → fixed-length tokens for the LM.
Cross-attention layers interleaved into the frozen LM. These layers attend from text positions to image features.

The training. Train only the Perceiver Resampler and the cross-attention layers. The vision encoder and LM remain frozen.
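A sketch. The resampling idea (variable-length image features in, fixed-length tokens out) can be shown with one cross-attention step in which a fixed set of latent vectors act as queries. This is heavily simplified and is not Flamingo's implementation: learned query/key/value projections, multiple heads, stacked layers, and feed-forward blocks are omitted, and the latents are random rather than trained. `DIM` and `NUM_LATENTS` are assumed values.

```python
import math
import random

random.seed(0)
DIM = 4
NUM_LATENTS = 3  # assumed; fixes the number of visual tokens the LM sees


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def resample(latents, features):
    """Single-head cross-attention: latents are queries, image features are keys and values."""
    out = []
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(DIM) for k in features]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, features)) for d in range(DIM)])
    return out


def rand_vectors(n):
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


latents = rand_vectors(NUM_LATENTS)          # trained in a real system; random here
for num_features in (16, 64):                # e.g. different resolutions or frame counts
    visual_tokens = resample(latents, rand_vectors(num_features))
    print(num_features, "->", len(visual_tokens))
```

The output length depends only on the number of latents, so the language model's visual token cost is constant regardless of input size.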

The result. Flamingo supports few-shot in-context learning on visual tasks: provide a few image-text examples in the prompt; the model performs the task on a new image. Substantial advance for vision-language at the time.

LLaVA (Liu, Li, Wu, Lee, 2023) “Visual Instruction Tuning.”

The architecture. Simpler than Flamingo.

CLIP vision encoder. Produces image embeddings.
Projection layer. A linear or two-layer MLP mapping image embeddings to LM token-embedding space.
Open large language model. LLaMA or other open LM.

The training. Train the projection layer (and optionally fine-tune the LM) on image-instruction-response triplets generated using GPT-4.

The contribution. Open-source vision-language modeling at competitive capability. Demonstrated that modest fine-tuning of an LM with a simple projection from CLIP features could produce capable VLM.

The trajectory. LLaVA spawned a substantial open-source ecosystem:

LLaVA-1.5, LLaVA-NeXT, LLaVA-OneVision. Improvements with better data, training, and architecture.
MiniGPT-4 (Zhu et al., 2023). Similar pattern.
InternVL (Chen et al., 2024). Open vision-language model with substantial scale.
Qwen-VL (Bai et al., 2023, 2024). Alibaba’s open VLM family.
Many others. The open-source VLM ecosystem became substantial.

By 2024-2026, the LLaVA-style pattern (CLIP encoder + projection + open LM) was the dominant open-source VLM architecture. The capability gap with frontier proprietary models exists but narrowed over time.

The native multimodal frontier

The 2023-2024 inflection. Frontier proprietary models moved from late-fusion (vision added to text-pretrained LM) to native multimodal (joint pretraining from the start).

GPT-4V (OpenAI, September 2023). GPT-4 with vision capability. Architecture details not fully disclosed; based on public information, appears to use a vision encoder bridged to GPT-4.

Gemini 1.0 (Google DeepMind, December 2023). Explicitly described as natively multimodal — trained on text, images, audio, video from the start.

GPT-4o (OpenAI, May 2024). “Omni” model — text, image, audio handled in a single unified architecture. Substantial step beyond GPT-4V in unification.

Claude 3 family (Anthropic, March 2024). Vision capabilities in Opus, Sonnet, Haiku. Substantial vision-language capability.

Claude 3.5 Sonnet (Anthropic, June 2024; upgraded October 2024). Competitive vision capability.

Gemini 1.5 Pro (2024). Long-context multimodal (handles hour-long videos, large multimodal documents).

The trajectory. Each new release adds multimodal capability. The 2026 picture: every major frontier model handles at minimum text+vision; most handle text+vision+audio; some handle text+vision+audio+video.

Performance and benchmarks

A specific empirical state. Vision-language benchmark performance for frontier models:

MMMU (Massive Multi-discipline Multimodal Understanding; Yue et al., 2024). College-level multimodal questions across many subjects. Frontier model performance ~60-75% by 2026; substantial but below human-expert level.

MathVista (Lu et al., 2024). Mathematical reasoning over diagrams, charts, geometric figures. Frontier model performance ~70-85% by 2026.

ChartQA, DocVQA, AI2D. Specialized benchmarks for charts, documents, diagrams.

MM-Vet (Yu et al., 2023). Tests integrated vision-language capabilities (recognition + OCR + knowledge + spatial reasoning + math + language generation).

RealWorldQA (xAI). Real-world image understanding.

The pattern. Frontier vision-language capability has substantially improved in 2023-2026. Many benchmarks are approaching saturation; harder benchmarks (MathVerse, MMMU-Pro, others) are emerging.

The gaps. Vision capability is uneven:

Visual recognition. Strong; near-human on many tasks.
OCR. Strong; production-grade for many use cases.
Spatial reasoning. Moderate; substantial failures on precise spatial tasks.
Chart and diagram understanding. Good and improving; not yet uniformly reliable.
Multi-image reasoning. Moderate; performance drops with more images.
Fine-grained recognition. Variable; struggles with rare or technical content.
Counting. Surprisingly weak; counting more than ~5-10 objects is often inaccurate.

The honest picture. Frontier VLMs are capable but uneven. Specific deployment depends on which capabilities are required.

Common VLM failure modes

A practical inventory of where current VLMs fail.

Visual hallucination. The model describes content not in the image. May invent objects, attribute properties incorrectly, miss content present in the image.

OCR confidence errors. The model reads text from images; sometimes invents text when the image is illegible or text isn’t actually there.

Spatial relationship errors. “The dog is to the left of the cat” — VLMs often get spatial relationships wrong even when both objects are correctly identified.

Counting errors. Counting more than ~5-10 objects in an image is unreliable.

Following along with image references. “What did I show you in the previous image?” Multi-image context is handled imperfectly.

Refusal artifacts on images. VLMs sometimes refuse benign image-related requests because their safety training is overcautious about visual content.

Modality confusion. Sometimes the model treats text in images as a separate text input rather than as visual content.

These failure modes inform deployment caveats; cross-reference §11 (evaluation) and §13 (critiques).

Where VLMs sit in 2026

The summary. Vision-language models are substantial production reality. Frontier deployment handles many visual tasks; the open-source ecosystem provides accessible alternatives.

The gaps. Specific capabilities (spatial reasoning, counting, fine-grained recognition, multi-image) remain imperfect. Visual hallucinations persist. Cost and latency for vision processing are substantial compared to text-only operations.

The trajectory. Continued capability improvement; expanded multimodal integration; declining cost per query. The 2026 state is much improved over 2022 baselines; further substantial improvement is expected.

The next section briefly covers vision generation; the more-detailed treatment lives in Generative Models §6-§9.

§5. Vision Generation: Text-to-Image and Beyond

The detailed treatment of vision generation lives in Generative Models §6-§9. This section develops only the multimodal aspect: how vision generation integrates with broader multimodal systems.

Text-to-image as multimodal generation

The dominant deployed application. Text-to-image takes a text prompt as input and produces an image. The architecture spans multiple modalities: language understanding (process the prompt); cross-modal alignment (relate text to image content); image generation (produce the pixel output).

Cross-reference Generative Models §6 for the underlying diffusion mechanism; §7 for flow-matching; §8 for conditional generation; §9 for modality-specific architectures. This section covers the multimodal-modeling aspects.

The standard architecture

The canonical text-to-image pipeline (review from Generative Models §8):

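flowchart TD
  Prompt(["Text prompt"])
  TEnc["Text encoder (CLIP / T5)"]
  Cond["Conditioning vectors"]
  Denoise["Diffusion / flow-matching denoising loop\n(in latent space, conditioned on text; 20–50 sampling steps)"]
  Latent["Latent representation"]
  VAE["VAE decoder"]
  Out(["Output image (e.g., 1024×1024)"])

  Prompt --> TEnc --> Cond --> Denoise --> Latent --> VAE --> Out

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class Prompt,Out pill
  class TEnc,Cond,Denoise,Latent,VAE pill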

The multimodal aspects.

Text encoder. CLIP or T5; produces text representations.
Cross-modal conditioning. The denoising network attends to text features via cross-attention.
Output modality. Image (different modality from input).
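A sketch. The denoising loop in the pipeline is a fixed-step iterative sampler conditioned on the prompt. The toy below shows only that loop structure under a flow-matching formulation, and everything in it is an assumption: it is one-dimensional, the "text encoder" is a hypothetical lookup table, and the velocity field is known in closed form (because each prompt's "data distribution" is a single point) rather than being a trained network. No latent space or VAE decoder is modelled.

```python
import random

# Hypothetical "text encoder": maps a prompt to a conditioning value.
# Here the condition is simply the target the sample should reach.
CONDITIONING = {"bright": 1.0, "dark": -1.0}


def velocity(x, t, cond):
    """Velocity field for a straight-line path from noise (t=0) to the target (t=1).

    In a real model this is a trained network that takes the text features as input.
    """
    return (cond - x) / (1.0 - t)


def sample(prompt, steps=20, seed=0):
    cond = CONDITIONING[prompt]
    x = random.Random(seed).gauss(0.0, 1.0)   # start from noise
    dt = 1.0 / steps
    for k in range(steps):                    # Euler integration from t=0 towards t=1
        t = k / steps                         # t never reaches 1, so no division by zero
        x += dt * velocity(x, t, cond)
    return x


print(round(sample("bright"), 6), round(sample("dark", steps=50), 6))
```

The same noise seed yields different outputs for different prompts: the conditioning, not the starting noise, determines where the trajectory ends.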

Notable production systems (cross-reference Generative Models §9):

Midjourney v6/v7. Visually distinctive; closed.
DALL-E 3 / GPT-4o image generation. Tightly integrated with chat.
Stable Diffusion 3 / SDXL. Open weights; widely deployed.
Flux (Black Forest Labs). Flow-matching-based; high quality.
Imagen 3 / Imagen 4 (Google).
Firefly (Adobe). Stock-photography-trained; clean licensing.

Specialized vision generation models

Beyond general text-to-image, specialized vision generation has matured.

ControlNet (Zhang et al., 2023). Adds spatial conditioning (depth maps, edge maps, pose skeletons) to text-to-image generation. Allows precise spatial control.

Image editing models. InstructPix2Pix, DALL-E 3 editing, Adobe Firefly editing. Take an image and instructions; produce edited image.

Image-to-image style transfer. Specialized models for stylistic transformations.

Layout-conditional generation. Generate images respecting specified bounding boxes or scene structures.

Personalized generation. DreamBooth, LoRA-based personalization. Generate images of specific subjects (people, objects) from a few reference images.

These specialized capabilities increasingly integrate into general multimodal models; the boundary between general and specialized is fluid.

The unification with vision-language understanding

A specific 2024 trend. Native multimodal models (GPT-4o, Gemini, Chameleon) increasingly unify vision understanding and vision generation in single architectures.

The architectural pattern.

flowchart TD
  TxtTok["Text tokens (subword BPE)"]
  ImgTok["Image tokens (VQ-VAE codes or patch embeddings)"]
  Transformer["Unified Transformer\n(all tokens flow through same model;\ncross-modal attention at every layer)"]
  OutTxt(["Text tokens (text response)"])
  OutImg(["Image tokens (generated image)"])
  OutMix(["Interleaved text + image tokens"])

  TxtTok --> Transformer
  ImgTok --> Transformer
  Transformer --> OutTxt
  Transformer --> OutImg
  Transformer --> OutMix

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class TxtTok,ImgTok,Transformer pill
  class OutTxt,OutImg,OutMix pill

The capability. The model can:

Describe an image. (Text output from image input.)
Generate an image. (Image output from text input.)
Edit an image. (Image output from image + text input.)
Generate interleaved text and images. (Mixed output.)
Reason about generated images. (Self-grounded generation with verification.)

The trade-off. Unified systems are more flexible but less specialized. A dedicated text-to-image model (Stable Diffusion, Flux) may produce higher-quality images than a unified system; the unified system handles broader tasks with adequate quality.

The 2026 landscape. Both paradigms coexist. Specialized text-to-image dominates content creation use cases (artists, designers). Unified multimodal dominates general use (chatbots producing images as part of responses, agents that need to generate and reason about images).

Where vision generation in multimodal sits in 2026

The summary. Vision generation is mature production technology. The integration with vision-language understanding (unified multimodal models) is the 2024-2026 frontier. Specialized text-to-image and unified multimodal both have legitimate niches.

Cross-reference Generative Models for the detailed treatment of the underlying generative-modeling techniques.

§6. Audio and Speech

A specific modality with substantial 2022-2026 maturation. Audio spans speech (the dominant subcategory), music, environmental sound, and acoustic data more broadly. This section develops speech-to-text, text-to-speech, general audio understanding, music generation, and the integration with unified multimodal models.

Speech-to-text: Whisper and beyond

The dominant deployed audio application. Speech-to-text (or automatic speech recognition, ASR) takes audio input and produces text transcription.

Whisper (Radford, Kim et al., OpenAI, 2022) “Robust Speech Recognition via Large-Scale Weak Supervision.” The landmark modern speech-to-text model.

The recipe.

Architecture. Encoder-decoder Transformer. Encoder takes log-mel spectrograms of audio; decoder generates text token-by-token.
Training data. 680,000 hours of multilingual, multi-task audio paired with transcripts, collected from the web with weak supervision. Among the largest supervised ASR training corpora reported at the time of release.
Multi-task training. Same model handles transcription, translation (audio in language X → text in English), language identification, voice activity detection. Trained with a unified objective.
Robustness. Trained on noisy real-world data; substantially more robust to accents, background noise, and recording quality than prior systems.
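A sketch. Multi-task training with a single model works by specifying the task to the decoder as a short prefix of special tokens, so that transcription and translation differ only in the prompt. The helper below is hypothetical; the token spellings follow the public Whisper release as best recalled and should be treated as illustrative rather than as an exact specification.

```python
def build_decoder_prefix(language, task, timestamps=False):
    """Assemble the special-token prefix that tells the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


# Same audio, same weights; only the prefix changes.
print(build_decoder_prefix("fr", "transcribe"))  # French audio -> French text
print(build_decoder_prefix("fr", "translate"))   # French audio -> English text
```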

The result. Whisper substantially advanced open-source ASR quality. Multiple model sizes were released (Whisper-Tiny through Whisper-Large-v3). It has become standard infrastructure.

The trajectory. Subsequent ASR systems, many building on Whisper or on similar large-scale training:

Whisper-v3 (OpenAI, 2023). Improved multilingual performance.
Distil-Whisper (Hugging Face, 2023). Distilled smaller versions of Whisper.
Seamless (Meta, 2023). Streaming ASR + translation in a single model.
Many specialized ASR systems. Voice-assistant integrations; medical transcription; legal transcription.

The 2026 state. Speech-to-text is substantially solved for clean recordings of major languages. Edge cases (heavy accents, very low-resource languages, severely noisy audio, technical domain vocabulary) remain imperfect.

Text-to-speech

The inverse task. Text-to-speech (TTS) takes text and produces audio.

The pre-deep-learning era used concatenative synthesis (stitching together recorded phonemes); the result was robotic-sounding. Deep-learning TTS (Tacotron, WaveNet) substantially improved quality through the late 2010s.

The 2022-2026 advances. Modern TTS produces speech indistinguishable from human in many settings.

Notable systems.

Tortoise TTS (Betker, 2022). Open-source high-quality TTS; expressive control; voice cloning from short reference samples.

XTTS, XTTS-v2 (Coqui, 2023). Multilingual TTS with voice cloning.

VALL-E (Microsoft, 2023). Speech LM trained on 60K hours of data; high-quality voice cloning from 3-second reference.

NaturalSpeech 2, NaturalSpeech 3 (Microsoft, 2023-2024). Diffusion-based TTS.

Voicebox (Meta, 2023). In-context-learning TTS.

ElevenLabs. Commercial high-quality TTS with voice cloning.

OpenAI TTS, Google TTS, Anthropic voice integration. Production-grade TTS in major chatbot products.

The 2026 capabilities.

Voice cloning. Replicate a specific voice from short reference audio.
Emotional control. Generate speech with specified emotion or style.
Multilingual. Single models handling many languages.
Real-time. Low-latency synthesis for interactive applications.

The concerns. Voice cloning enables deepfake voice impersonation. Production deployments include detection mechanisms; the cat-and-mouse with adversarial deepfake creation is ongoing.

General audio understanding

A broader category. General audio understanding extends speech-to-text to all kinds of audio — music, environmental sound, acoustic events.

AudioLM (Borsos et al., Google, 2022). Audio language model trained on raw audio. Can continue audio sequences (music continuation, speech continuation); learns acoustic structure without text supervision.

MusicLM (Agostinelli et al., Google, 2023). Music generation from text descriptions. Hierarchical audio generation; produces 30+ second music clips from prompts.

AudioGen (Kreuk et al., Meta, 2023). Generate sound effects from text descriptions (“a dog barking at a thunderstorm”).

CLAP (Contrastive Language-Audio Pretraining; Microsoft and LAION variants, 2022+). Audio analogue of CLIP. Joint audio-text embeddings for audio retrieval and classification.

AudioFlamingo (Kong et al., NVIDIA, 2024). Audio-language model in the Flamingo lineage.

Qwen-Audio, Qwen2-Audio (Alibaba). Open audio-language models.

The 2026 state. General audio understanding is less mature than speech-to-text or vision-language. Production deployment exists but is narrower; research is active.

Music generation

A specific application. Music generation has become substantial commercial reality.

Notable systems.

MusicLM (Google, 2023). Foundational research; not deployed as product.

MusicGen (Meta, 2023). Open-source music generation from text.

Suno (commercial, 2023-2024+). High-quality vocal-and-instrumental music generation from text prompts. Reached significant commercial scale by 2024-2025.

Udio (commercial, 2024+). Competitor to Suno; similar capabilities.

Stable Audio, Stable Audio 2.0, Stable Audio Open (Stability AI). Diffusion-based music and sound generation; open weights for some variants.

Adobe Generative Sound. Commercial sound design tools.

The 2026 state. Music generation can produce short polished pieces (30 seconds to a few minutes) in many genres. Longer pieces, complex compositions, specific stylistic requirements remain challenging.

The legal context. Music generation has been subject to substantial copyright concerns. Several lawsuits filed against Suno and Udio (2024) alleging training-data infringement; outcomes still evolving in 2026. The legal framework affects deployment but does not fundamentally change technical capabilities.

Audio in unified multimodal models

The 2024-2026 frontier. Audio is increasingly integrated into native multimodal models.

GPT-4o (OpenAI, May 2024). Handles audio input and output natively in a unified model. Voice Mode integrates speech with text in a single architecture. Substantial step beyond bolted-on TTS+ASR pipelines.

Gemini Live (Google, 2024). Real-time voice interaction with Gemini.

Claude voice integration (Anthropic, planned/rolling out 2024-2025).

The capability. The model can:

Listen. Process spoken queries.
Speak. Generate spoken responses.
Reason about audio content. Discuss music, identify sounds, analyze audio patterns.
Interactive conversation. Real-time back-and-forth conversation including voice.

The advantages over bolted-on TTS+ASR.

Tone preservation. The model “hears” emotional cues; can respond appropriately.
Conversational naturalness. Latency reduction; interruption handling; turn-taking dynamics.
Audio reasoning. Integrated understanding of audio content (music, ambient sound) within reasoning.

The 2026 deployment. Voice interfaces have become production-grade for major chatbots. Real-time voice conversation with frontier models is mainstream user experience.

Where audio sits in 2026

The summary. Audio modalities have substantially matured in 2022-2026. Speech-to-text (Whisper-family) is production-grade. Text-to-speech is indistinguishable from human speech in many settings. Music generation has substantial commercial deployment. General audio understanding is research-stage but advancing. Unified audio-language-vision multimodal is the frontier.

The remaining issues. Long-form music generation (full songs with development). High-quality audio at low compute cost (current models are still expensive). Audio understanding for non-speech non-music (environmental, acoustic monitoring). Voice cloning safeguards.

The trajectory. Continued capability improvement; expanded multimodal integration; declining cost. The 2026 state is much improved over 2022 baselines; further substantial improvement is expected.

The next section develops video, the modality with the largest data scarcity and highest compute requirements.

§7. Video Models

Video as space-time data.
Video understanding (vision-language extended to video).
Video generation (Sora, Veo, Gen-3; cross-reference Generative Models §9).
The data scarcity problem for video.

§8. Native Multimodality and Token-Unified Models

Tokenization across modalities.
Mixed-modal autoregressive training (Chameleon, GPT-4o, Gemini Native).
Image-and-text generation in unified models.
The 2024-2026 frontier.

§9. Vision-Language-Action Models

The robotics connection.
RT-1, RT-2 (Google DeepMind), PaLM-E.
OpenVLA and the open VLA family.
π0 (Pi-zero) and successor manipulation models.
The data and benchmarks problem.

§10. Cross-Modal Retrieval and Search

CLIP-style retrieval.
Multimodal RAG.
Cross-modal search applications.
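
As a placeholder illustration for this section, the following is a minimal sketch of CLIP-style retrieval: text and images are assumed to be already embedded into a shared vector space, and retrieval ranks items by cosine similarity to the query. The embeddings here are hand-written toy vectors, and the names `l2_normalize`, `cosine_similarity`, `retrieve`, and `image_index` are hypothetical helpers introduced for this example, not part of any library. A real system would obtain embeddings from trained text and image encoders and would typically use an approximate nearest-neighbor index instead of a linear scan.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query_embedding, index, top_k=2):
    """Return the top_k (item_id, score) pairs, highest similarity first.

    `index` maps item ids to embeddings in the same space as the query.
    This is a linear scan, which is fine for a toy index only.
    """
    scored = [
        (item_id, cosine_similarity(query_embedding, emb))
        for item_id, emb in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy image embeddings (assumption: produced by some image encoder).
image_index = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}

# Toy text embedding standing in for an encoded query such as "a photo of a dog".
text_query = [0.8, 0.2, 0.1]

results = retrieve(text_query, image_index, top_k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

The same ranking function works in the other direction (image query, text index) because both modalities share one embedding space; that symmetry is what distinguishes contrastive retrieval from a captioning-then-text-search pipeline.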

§11. Evaluating Multimodal Models

Multimodal benchmarks (MMMU, MathVista, ChartQA).
Vision-language evaluation challenges.
Video and audio evaluation.
The reference-free quality problem (cross-reference Evaluation §3).

§12. Connections to Other Chapters

Self-Supervised Learning §7.
Generative Models §9.
Foundation Models.
Large Language Models.
AI Agents §7.
Robotics (planned).
Evaluation §10.

§13. Critiques and Alternative Perspectives

“Multimodality is mostly cross-modal retrieval.”
“Native multimodality is unnecessary overhead.”
“Vision-language-action is dominated by data scarcity.”
The interpretability gap.
Energy and compute concerns.

§14. Limitations and Open Problems

OP-MM-1. Native multimodal training at frontier scale.
OP-MM-2. Video data scarcity.
OP-MM-3. Vision-language-action data scarcity.
OP-MM-4. Long-context multimodal.
OP-MM-5. Cross-modal hallucinations.
OP-MM-6. Multimodal evaluation methodology.
OP-MM-7. Audio understanding at frontier capability.
OP-MM-8. Mixed-modal generation quality.

§15. Further Reading

Annotated list. CLIP, Flamingo, BLIP, LLaVA; native-multimodal papers (GPT-4o, Gemini, Chameleon); VLA papers (RT-1, RT-2, π0); evaluation references.

§16. Exercises and Experiments

Tentative.

E1. Train a small CLIP-style model.
E2. Evaluate a frontier VLM on a vision-language benchmark.
E3. Build a small multimodal RAG system.
E4. Implement a basic VLA pipeline.
E5. Compare native vs late-fusion multimodal architectures.
E6. Analyze multimodal hallucination patterns.
