Robotics
All sixteen sections are in draft status. Open problems are flagged inline and consolidated in §14.
This chapter replaces AIMA 4e Chapter 26 (“Robotics”). The AIMA chapter’s classical material (kinematics, motion planning, classical control, SLAM) is preserved as foundational background, but the chapter is restructured around the 2022-2026 transformation of robotics through deep learning, foundation models, and vision-language-action (VLA) systems. The shift is qualitative: a field that for decades depended on hand-engineered controllers, classical planning, and per-task supervision has been substantially transformed by general-purpose learned policies trained on large-scale data.
The chapter consolidates robotics-specific material from several other chapters: VLA models (Multimodal Models §9), continuous-control RL (RL §7), model-based RL (RL §8), and embodied agents (AI Agents §11). It also connects to AI Safety / Alignment for deployment-safety concerns. Those chapters develop the underlying methods; this chapter develops the robotics-specific synthesis.
The chapter assumes the Reinforcement Learning chapter, the Multimodal Models chapter, and the AI Agents chapter as substantial background; AI for Science as relevant context.
Scope and What This Chapter Is About
The chapter develops modern robotics - the field of physical machines that perceive, reason about, and act in the physical world, with substantial emphasis on the 2022-2026 transformation through deep learning and foundation models. We cover the conceptual framing (classical vs learned approaches; embodiment; sim-to-real), the substrate (hardware platforms; sensors; actuators), the methodological pillars (perception; planning; control; learning), the dominant modern paradigms (imitation learning at scale; reinforcement learning for control; vision-language-action models; world models), the application domains (industrial, service, mobile, manipulation, locomotion, humanoids), the safety considerations, and the connections to broader AI.
Approximate length target: 18,000–25,000 words (a major chapter - robotics is a substantial domain at the intersection of AI and physical engineering).
§1. Motivation and Scope
Three worked instances
To anchor the chapter, three concrete instances from 2024-2026 robotics.
Instance 1: A warehouse robot picks and packs. A mobile manipulator (e.g., Amazon Robotics Sparrow, Covariant) operates in a fulfillment centre. The robot navigates to a shelf, identifies a requested item using vision, plans a grasp, picks the item, transports it, and places it in the correct shipping container. The full task takes ~30 seconds per item; the robot operates continuously across an 8-hour shift. The underlying technology combines learned vision (object detection, grasp prediction) with classical control (motion planning, low-level joint control). Production deployment at substantial scale.
Instance 2: A humanoid robot folds laundry. A humanoid manipulator (e.g., π0 deployed on a research platform; Figure 02 in pilot deployments) is shown a basket of mixed laundry and asked to fold and stack the items. The robot picks up shirts and towels one at a time, identifies each item’s type via vision, executes a learned folding policy, and places folded items in a stack. The full task takes 5-10 minutes; some items (complex shapes, slippery fabrics) cause failures. The underlying technology is a vision-language-action model trained on extensive teleoperation data. Early commercial deployment.
Instance 3: A quadruped robot navigates rough terrain. A quadruped (e.g., Boston Dynamics Spot, Unitree Go2, ANYmal) navigates a construction site, climbing stairs, traversing debris, avoiding obstacles. The robot operates semi-autonomously - given a destination, it plans a path and executes it without per-step human guidance. The underlying technology combines learned locomotion policies (RL-trained for terrain adaptation) with classical SLAM and path planning. Substantial commercial deployment in industrial inspection.
These three instances span the modern robotics landscape: manipulation in structured environments (warehouse), manipulation in unstructured environments (humanoid laundry), and locomotion in unstructured environments (quadruped). They share substantial AI substrate (vision, planning, learned policies); they differ in form factor, environment, and task complexity.
What modern robotics is
A working definition. Robotics is the field of physical machines that perceive, reason about, and act in the physical world. Modern robotics (the 2022-2026 era of focus) is characterized by substantial reliance on deep learning, foundation models, and large-scale data - alongside continued use of classical engineering methods where appropriate.
The five pillars of modern robotic systems.
Hardware. Physical platform - mechanical structure, actuators, sensors, computational hardware. The substrate on which everything else runs.
Perception. Processing sensor inputs (cameras, depth sensors, IMUs, tactile sensors, LiDAR) to understand the environment. Increasingly dominated by deep-learning perception, with classical methods (SLAM, ICP) where needed.
Planning. Deciding what to do - high-level (which actions to take, in what order) and low-level (specific motion trajectories). Combines classical search-based planning with learned components.
Control. Producing motor commands that achieve planned actions. Classical control (PID, MPC, optimal control) still dominates low-level; learned control increasingly used at higher levels.
Learning. Improving from experience - imitation learning from demonstrations, RL from rewards, transfer learning from related tasks. The 2022-2026 development that has substantially transformed the field.
The modern integration. These pillars are increasingly unified by foundation-model-style approaches. Vision-language-action models combine perception (vision), planning (high-level reasoning), and control (action prediction) in single architectures. Classical components remain where their guarantees matter; learned components dominate where flexibility matters.
What modern robotics is not
Several boundaries.
Robotics is not just AI. Hardware engineering, mechanical design, manufacturing, materials science all matter. Software alone - no matter how sophisticated - cannot make a robot that doesn’t exist physically.
Robotics is not just deep learning. Classical methods (motion planning, control theory, SLAM) remain essential for low-level performance and safety guarantees. The 2022-2026 transformation has added learned components; it has not replaced classical methods entirely.
Robotics is not autonomous driving alone. Autonomous vehicles are a large subfield with heavy commercial activity. They are also a special case of robotics - focused on navigation and decision-making in a well-structured environment (roads with rules). General-purpose robotics is much broader.
Robotics is not solved. Despite substantial progress, robotic systems remain narrow relative to human capability. General-purpose robots that can handle arbitrary tasks in arbitrary environments do not exist in 2026.
The 2022-2026 transformation
A specific assertion. The 2022-2026 period has seen substantial transformation in robotics, comparable to the 2012-2017 transformation of computer vision.
The key changes.
Foundation models as robotic substrate. Pretrained vision-language models (CLIP, SigLIP, Gemini, etc.) became the standard backbone for robotic perception. Pretrained LLMs began informing robotic planning. The PaLM-E and RT-2 papers (both 2023) demonstrated that web-pretrained foundation models could substantially benefit robotic capability.
VLA models. A new architectural pattern emerged (cross-reference Multimodal Models §9). Vision-language-action models combine perception, language understanding, and action prediction in unified architectures. RT-2, OpenVLA, and the π0 family substantially advanced robotic capability in 2023-2024.
Data scaling. Open X-Embodiment (2023+) aggregated robotic trajectory data across 22+ platforms. The dataset enabled cross-embodiment training; substantially improved generalization.
Humanoid commercial activity. Figure AI, 1X Technologies, Apptronik, Tesla Optimus, Sanctuary AI, and others raised substantial funding for humanoid robotics. Multiple pilot deployments by 2024-2026.
Diffusion policies and modern imitation. Diffusion-based action policies (Chi et al., 2023 “Diffusion Policy”) demonstrated substantial improvement over earlier imitation learning. Combined with VLA training, they enabled substantially more capable manipulation.
Learned locomotion at scale. Quadruped locomotion was transformed by RL-based policies trained in simulation and deployed to real robots (Lee et al. ANYmal, 2020; subsequent work). Humanoid locomotion followed a similar trajectory in 2024-2025.
The aggregate effect. Robotic capabilities that seemed years away in 2021 (folding laundry, walking on rough terrain, autonomous warehouse manipulation) reached pilot or production deployment in 2024-2026. The trajectory is fast and accelerating.
The honest caveat. The transformation is real but uneven. Some capabilities (locomotion, simple manipulation in structured environments) are mature. Others (dexterous manipulation in arbitrary environments, generalization to novel tasks) remain hard. The hype around humanoid robots has run substantially ahead of demonstrated capability.
Why robotics matters in 2026
Four motivations.
1. Production deployment is real and growing. Industrial robotics generated $50B+ in annual revenue by 2025. Service robotics (delivery, cleaning, healthcare) is a smaller but growing market. Humanoid robotics is at the pilot-deployment stage with substantial investment. The economic impact is substantial.
2. Labour-market implications. Robotic automation affects employment patterns. Warehouse, manufacturing, agriculture, and service-industry jobs see direct robotic substitution. The pace and breadth of substitution are among the central socioeconomic questions of the 2020s.
3. Scientific and societal applications. Robotic surgery improves clinical outcomes. Robotic agriculture supports food production. Robotic inspection of infrastructure improves safety. Search-and-rescue robots save lives.
4. AI-system embodiment. The broader question of what AI systems do in the world - beyond producing text or images - depends on embodiment. Robotics is the technical substrate for embodied AI. Whether the trajectory continues to general-purpose embodied AI (a “robot for everything”) is one of the consequential open questions of the field.
Boundaries with adjacent chapters
This chapter covers a substantial domain at the intersection of many others.
Reinforcement Learning §7 (continuous control), §8 (model-based RL), §9 (offline RL), §11 (exploration) all directly inform robotic learning. This chapter applies RL to robotics; the RL chapter develops the algorithms.
Multimodal Models §9 develops vision-language-action models as a multimodal pattern. This chapter develops the robotics-specific application and context.
AI Agents §11 covers embodied agentic safety. The agentic concerns translate to robotics with physical-action specifics.
Generative Models §7 (flow-matching) is directly used in modern VLA policies (π0). The Generative Models chapter develops the machinery; this chapter applies it to robotics.
Foundation Models provides the FM-as-substrate framing. Robotic foundation models (RT-2, OpenVLA, π0) are FMs.
Self-Supervised Learning covers pretraining methods applicable to robotic data.
Causality §10 is relevant for interventional reasoning in physical contexts. Robotic systems intervene in the world; causal reasoning about intervention effects matters for alignment and design.
Alignment §11 covers agentic safety; robotic deployment raises specific safety concerns developed here.
Evaluation §10 covers cross-cutting evaluation methodology; this chapter §14 covers robotics-specific evaluation challenges.
AI for Science §6 covers materials and chemistry; some applications integrate with robotic synthesis/characterization.
What this chapter does not try to do
Several explicit exclusions.
We do not provide a complete textbook treatment of classical robotics. Siciliano-Khatib (Springer Handbook of Robotics, 2nd ed.) provides ~1500 pages on classical material; this chapter covers it in summary (§4).
We do not develop autonomous driving in depth. Substantial subfield with its own conventions; covered only where it overlaps with general robotics.
We do not cover hardware design and manufacturing in detail. Substantial engineering disciplines; this chapter covers what software/AI people need to know about hardware.
We do not extensively cover industrial robotics engineering (PLC programming, factory automation systems, robot teach pendants). These are real and important but largely outside the AI-research scope.
We do not develop robotic ethics in depth. The broader societal questions (labour displacement, autonomous weapons, etc.) live in Alignment / Ethics chapters; this chapter touches them where relevant.
Position taken in this chapter
The chapter takes modern robotics seriously as a real and growing field. The 2022-2026 transformation is substantive. The remaining problems (sample efficiency, sim-to-real, dexterous manipulation, safe deployment, hardware cost) are also substantive.
The chapter is cautious about humanoid-robotics hype. Substantial commercial investment; substantial demonstration capability; substantial gap between demonstrations and reliable deployment. Honest characterization acknowledges both the promise and the gap.
The chapter’s overall framing: robotics in 2026 is transforming but not transformed. The technical foundations have shifted substantially; production-grade capability has expanded substantially; the trajectory is positive but the destination (general-purpose robotic autonomy) remains distant.
§2. Historical Context
This section traces robotics from classical origins through the 2022-2026 transformation. The history is essential because modern robotics inherits substantial classical methodology even as it has been transformed by deep learning.
A timeline of the inflection points:
1950s-1960s Early industrial robots: Unimate (1961),
the first deployed industrial robot, on
General Motors assembly line. Classical
control; programmed-by-hand trajectories.
│
▼
1970s-1980s Robotics as engineering discipline. Classical
control theory (Brady, Hollerbach, others);
kinematics and dynamics formalized;
SCARA and other industrial-arm designs.
│
▼
1990s Mobile robotics emerges. SLAM (Simultaneous
Localization And Mapping) formalized.
Probabilistic robotics (Thrun, Burgard,
Fox: 2005 textbook) consolidates the field.
│
▼
2004-2007 DARPA Grand Challenges: autonomous-driving
catalysts. 2004 first challenge: no team
finishes. 2005 second challenge: Stanley
(Stanford) wins. 2007 Urban Challenge:
multiple teams complete urban driving course.
Autonomous-driving R&D substantially
catalyzed.
│
▼
2008-2015 Deep learning enters perception. Pre-deep-
learning robotic perception used hand-
engineered features (SIFT, HOG); deep CNNs
(AlexNet 2012, ResNet 2015) substantially
advance robotic vision capability.
│
▼
2015-2018 Deep RL for robotics begins. Levine et al.
"End-to-end Training of Deep Visuomotor
Policies" (2016). DeepMind's robotic-grasping
work; OpenAI's robotic hand (Rubik's Cube,
2019). Substantial demonstrations; limited
production deployment.
│
▼
2018-2020 Sim-to-real becomes substantial research
programme. Domain randomization (Tobin et al.
2017); meta-learning approaches; substantial
effort on bridging simulation-trained policies
to real deployment.
│
▼
2020 Lee et al. (ETH Zurich / ANYbotics) "Learning
quadrupedal locomotion over challenging
terrain." Sim-to-real RL for ANYmal quadruped;
substantial advance for legged-robot locomotion.
│
▼
2022-2023 Foundation models begin entering robotics.
RT-1 (late 2022) and PaLM-E (March 2023) demonstrate
pretrained-foundation-model substrate for
robotic policies. The VLA paradigm emerges.
│
▼
2022 Boston Dynamics' Atlas demonstrations (parkour,
somersaults); demonstrates frontier humanoid
capability. Spot quadruped enters substantial
commercial deployment.
│
▼
2022 Tesla Optimus prototype announcement. Substantial
industry attention to humanoid robotics.
Subsequent updates (2023, 2024, 2025)
demonstrate continued capability advancement.
│
▼
2023 RT-2 (Google DeepMind, July 2023): VLA model
with VLM backbone. Substantial generalization
improvements over RT-1. Establishes the modern
VLA paradigm.
│
▼
2023 Figure AI founded (2022); 1X Technologies
(NEO humanoid); Sanctuary AI (Phoenix);
Apptronik (Apollo); Agility Robotics (Digit);
substantial humanoid-robotics commercial
activity. Multiple companies raise hundreds
of millions in funding.
│
▼
2023 Diffusion Policy (Chi et al., 2023). Diffusion
models for robotic action prediction;
substantial improvement over previous
imitation-learning approaches.
│
▼
2023-2024 Open X-Embodiment Collaboration: cross-
institution effort to standardize robotic-
trajectory data. Releases ~1M trajectories
across 22+ robot platforms. Substantially
advances cross-embodiment training.
│
▼
2024 OpenVLA (Kim, Pertsch et al., 2024): open-
source VLA at competitive capability.
Democratized VLA research.
│
▼
2024 Oct π0 (Pi-zero; Physical Intelligence, October
2024). Flow-matching action policy on
vision-language backbone. Substantial
capability advance in dexterous manipulation
(folding laundry, packing groceries).
│
▼
2024-2025 Humanoid robotics commercial inflection.
Figure 02 (Figure AI). Helix (Figure VLA,
February 2025). 1X NEO Beta. Tesla Optimus
V2/V3 updates. Apptronik Apollo deployments.
Multiple pilot deployments at substantial
customers (BMW, Mercedes, Amazon, others).
│
▼
2025 Gemini Robotics (Google DeepMind, 2025).
VLA model integrated with Gemini foundation
model line.
│
▼
2025-2026 NVIDIA Cosmos and physical world models.
Increasing integration of world models with
robotic learning. Substantial industry-wide
investment in robotic foundation models.
│
▼
2026 Robotics consolidates as substantial AI
application area. Production deployment
growing; humanoid trajectory uncertain;
research frontier active across VLA,
world models, sim-to-real, dexterous
manipulation.
We develop each phase below.
The classical robotics era
The 1950s-1990s. Robotics began as an engineering discipline. The first deployed industrial robot - Unimate (1961) - was a hydraulic arm performing simple assembly tasks at General Motors. Through the 1960s-1980s, industrial robotics expanded into manufacturing; classical control theory, kinematics, and dynamics were formalized; the SCARA and other arm designs were established.
The methodology. Robots were programmed by hand or via teach-pendant demonstrations. Trajectories were specified explicitly; control loops handled position and velocity tracking; the same robot performed the same task indefinitely.
The limitations. Classical robots were narrow (programmed for specific tasks; brittle outside their programmed envelope), expensive (substantial engineering per deployment), and unsafe near humans (fast, strong, unable to perceive their environment).
The trajectory. Industrial robotics matured through the 1980s-2000s; by 2010, robotic arms were standard in many manufacturing contexts (especially automotive). The technology was useful but narrow.
Mobile robotics and SLAM
The 1990s-2000s. Mobile robotics - robots that move through environments - required substantial new methodology. Path planning (A*, RRT, RRT*); localization (Kalman filtering, particle filtering); mapping (SLAM).
The breakthrough work. SLAM (Simultaneous Localization And Mapping; multiple researchers including Durrant-Whyte, Newman, Leonard, Thrun, Burgard) developed methods for a mobile robot to simultaneously build a map of an unknown environment and localize itself within that map. Foundational for autonomous navigation.
Probabilistic Robotics (Thrun, Burgard, Fox, 2005) consolidated the probabilistic-mobile-robotics methodology. Became the standard textbook.
The impact. Mobile robotics enabled applications beyond fixed-arm manufacturing - warehouse robots, autonomous vehicles, household robots. The methodology became foundational for autonomous driving.
The DARPA Grand Challenges and autonomous driving
A specific catalyst. The DARPA Grand Challenges (2004, 2005, 2007) substantially accelerated autonomous-driving R&D.
The challenges. 2004 (off-road autonomous driving): no team finished. 2005 (similar challenge): five teams finished; Stanford’s Stanley (Thrun, Montemerlo et al.) won. 2007 Urban Challenge (autonomous urban driving): six teams completed; CMU’s Boss won.
The lasting impact. The challenges:
Recruited substantial researcher attention to autonomous driving.
Established core methodology (sensor fusion, planning, control for vehicles).
Launched careers and companies (Waymo’s lineage traces to the DARPA work; many founders moved from research to commercial).
Demonstrated feasibility of autonomous driving.
The autonomous-driving industry that emerged (2010-2026: Waymo, Cruise, Tesla Autopilot/FSD, Mobileye, multiple Chinese companies) has deep roots in the DARPA work.
Deep learning enters robotics
The 2010s gradual transition. Pre-deep-learning robotics used hand-engineered features for perception (SIFT, HOG, etc.) and explicit modeling for everything else. Deep learning entered piecemeal.
Perception first. Deep CNNs (AlexNet 2012, ResNet 2015) substantially improved robotic vision. Object detection, semantic segmentation, depth estimation all benefited from deep learning. By 2017, deep-learning perception was standard in robotic systems.
Policy learning second. Levine et al. (2016) “End-to-End Training of Deep Visuomotor Policies” demonstrated end-to-end RL for robotic manipulation. Substantial demonstration that deep networks could learn manipulation policies directly from visual input.
DeepMind’s robotic-arm work and OpenAI’s robotic-hand work. Substantial 2017-2020 demonstrations of deep RL solving previously-hard manipulation tasks. OpenAI’s 2019 demonstration of in-hand Rubik’s Cube manipulation was a landmark.
The limitations. The demonstrations were narrow (single task per policy); sample-inefficient (millions of simulated trials or hundreds of real-robot trials); brittle (sim-to-real gaps; failure under distribution shift). Deep RL for robotics was promising but not production-grade through 2020.
Sim-to-real and the simulation era
A specific 2018-2020 research focus. Sim-to-real transfer - taking a policy trained in simulation and deploying it on real robots - became a substantial research direction.
The problem. Simulation is much cheaper than real-robot data collection. But simulators don’t perfectly model real physics; policies trained in simulation often fail on real robots.
The approaches.
Domain randomization (Tobin et al., 2017). Randomize simulation parameters (textures, lighting, friction, masses) so the trained policy is robust to a distribution of possible physics; the real world is one sample from this distribution.
Domain adaptation. Use techniques that explicitly bridge simulation and real distributions.
Meta-learning. Learn policies that can rapidly adapt to new dynamics (real vs simulated).
Co-training with real data. Use small amounts of real-robot data to fine-tune simulation-trained policies.
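The domain-randomization idea in the list above can be sketched in a few lines. This is a minimal illustration, not the procedure of Tobin et al.: the `SimParams` container, the parameter names, and every numeric range are hypothetical, and a real pipeline would hand each sampled set of values to a physics simulator before running a training episode.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Hypothetical bundle of simulator settings randomized per episode."""
    friction: float         # ground friction coefficient
    mass_scale: float       # multiplier on nominal link masses
    light_intensity: float  # relative scene brightness
    texture_id: int         # index into a pool of surface textures


# Illustrative ranges only; real ranges are chosen per robot and task.
RANGES = {
    "friction": (0.4, 1.2),
    "mass_scale": (0.8, 1.2),
    "light_intensity": (0.5, 1.5),
}
NUM_TEXTURES = 100


def sample_params(rng: random.Random) -> SimParams:
    """Draw one simulated 'world' from the randomization distribution."""
    return SimParams(
        friction=rng.uniform(*RANGES["friction"]),
        mass_scale=rng.uniform(*RANGES["mass_scale"]),
        light_intensity=rng.uniform(*RANGES["light_intensity"]),
        texture_id=rng.randrange(NUM_TEXTURES),
    )


def randomized_episodes(num_episodes: int, seed: int = 0) -> list[SimParams]:
    """Return the parameters that each training episode would use."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(num_episodes)]
```

Calling `randomized_episodes(1000)` yields one parameter set per episode. The policy never sees the same physics twice, which is what pushes it toward behaviour that holds across the whole range rather than at one simulator setting.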
The 2026 state. Sim-to-real has substantially advanced but remains an active research area. Some tasks (legged locomotion) transfer well from simulation to real hardware; others (dexterous manipulation) transfer less well.
The legged-locomotion breakthrough
A specific landmark. Lee, Hwangbo, Wellhausen, Koltun, Hutter (ETH Zurich / ANYbotics, 2020) “Learning quadrupedal locomotion over challenging terrain.”
The setup. Train an RL policy in simulation for quadruped locomotion across diverse terrains. Deploy on ANYmal quadruped. Use careful domain randomization and a teacher-student training scheme.
The result. The deployed policy substantially advanced legged-robot locomotion capability. ANYmal walked across mud, rubble, stairs, slopes that previous controllers could not handle. Demonstrated that sim-to-real RL works at production scale for legged locomotion.
The follow-up work. Margolis et al., Wang et al., and many other groups substantially advanced legged-locomotion RL through 2021-2024. By 2026, RL-based locomotion is standard for legged robots; commercial deployment is broad (Boston Dynamics Spot, Unitree, ANYmal).
Foundation models enter robotics
The 2022-2023 inflection. PaLM-E (Driess et al., Google DeepMind, March 2023) and RT-1 (Brohan et al., Google DeepMind, late 2022) demonstrated that foundation models could substantially benefit robotic capability.
PaLM-E. An embodied multimodal language model - PaLM extended with robotic-state and image inputs. Substantial cross-task generalization; substantial benefit from PaLM’s pretrained capabilities.
RT-1. A Transformer-based robot policy trained on ~130K real-robot trajectories. Substantially better cross-task generalization than per-task RL policies.
RT-2 (Brohan et al., Google DeepMind, July 2023). Built on RT-1 but with a vision-language foundation model as backbone. Web pretraining transferred to robotic capability.
The conceptual lesson. Pretraining on web data benefits robotics. The VLA paradigm emerged; substantial subsequent work followed.
The humanoid commercial inflection
A specific 2023-2024 development. Substantial commercial activity emerged in humanoid robotics.
Figure AI (founded 2022). Raised substantial Series A (2023), Series B (2024 at $2.6B valuation). Figure 01 (2023) and Figure 02 (2024) demonstrated substantial capability. Pilot deployment at BMW manufacturing.
1X Technologies (formerly Halodi Robotics; founded 2014, accelerated 2023+). NEO humanoid; raised substantial funding (2024).
Apptronik (founded 2016). Apollo humanoid; partnership with Mercedes for manufacturing pilots.
Sanctuary AI (founded 2018). Phoenix humanoid.
Agility Robotics (founded 2015). Digit bipedal robot; deployment in logistics.
Tesla Optimus (announced 2021, ongoing development). Substantial Tesla investment; multiple iterations (V1, V2, V3) demonstrating continued capability advancement.
Boston Dynamics Atlas (electric version 2024). Substantial capability demonstration; commercial trajectory uncertain.
Unitree humanoids (G1, H1). Chinese-market humanoid systems at substantially lower price points.
The aggregate. By 2026, humanoid robotics is one of the most-funded commercial categories in AI/robotics. The capability has substantially advanced; whether the commercial trajectory will sustain depends on substantial unresolved engineering and deployment questions.
VLA frontier through 2024-2026
Continued progression. OpenVLA (June 2024) - open-source VLA. π0 (Physical Intelligence, October 2024) - flow-matching VLA with substantial dexterous-manipulation capability. π0.5 (2025) - successor. Helix (Figure AI, February 2025) - humanoid VLA. Gemini Robotics (Google DeepMind, 2025) - VLA in Gemini line. NVIDIA Cosmos (2025+) - world-model approach.
The trajectory. Continued capability advancement. The 2025-2026 systems are substantially better than 2023 baselines. The destination (reliable general-purpose robotic VLA) is approached but not reached.
Where this leaves us in 2026
The current state. Robotics in 2026 is a substantial and growing field whose categories sit at different levels of maturity:
Industrial robotics. Mature; substantial commercial deployment; gradual capability expansion.
Mobile robotics. Mature for structured environments; advancing for unstructured.
Quadruped locomotion. Substantially solved for many terrains; deployed for industrial inspection.
Humanoid robotics. Commercial pilot stage; substantial capability gaps; trajectory uncertain.
Manipulation. Substantial advance via VLA; long-horizon dexterous manipulation remains hard.
VLA models. Active research and early deployment.
Sim-to-real. Substantial advance but uneven; some tasks transfer well, others poorly.
The remaining sections develop the technical content. §3 covers the robotic substrate (hardware). §4 covers classical methods. §5 covers perception. §6 covers imitation learning. §7 covers RL for robotics. §8 covers VLA. §9 covers world models. §10 covers locomotion. §11 covers manipulation. §12 covers industrial and service. §13-§16 close out.
Editorial note. Robotics - especially humanoid robotics - is one of the most-hyped AI categories in 2024-2026. Capabilities are real and improving; specific claims should be evaluated against demonstrated performance. Honest characterization (this chapter’s approach) distinguishes demonstration from production reality.
§3. The Robotics Substrate
The physical and engineering foundations on which robotic systems are built. AI software cannot run without hardware. This section develops the substrate - hardware platforms, sensors, actuators, the hardware-software co-design problem, and the cost trajectory that shapes what robotics is economically feasible.
Hardware platforms
The main categories of modern robotic platforms.
Manipulator arms. Fixed-base robotic arms with 6-7 degrees of freedom (DoF). The most-mature category; standard in industrial settings. Examples: Universal Robots UR series (collaborative, lightweight); ABB IRB, KUKA iiwa (industrial workhorses); Franka Emika Panda (research-grade collaborative); xArm (low-cost collaborative).
The standard configuration. Mounted on a fixed base; end-effector (gripper, suction, specialized tool) at the end of the arm. Reach typically 0.5-1.5m; payload 0.5-30kg depending on model.
Mobile manipulators. Manipulator arm on a mobile base. Examples: Fetch (Fetch Robotics; acquired by Zebra), Hello Robot Stretch, PR2 (research-historical, Willow Garage). Commercial maturity is moderate; substantial research deployment.
Mobile robots (wheeled). Mobile bases without manipulators. Used for navigation, transport, exploration. Examples: TurtleBot (research), Amazon Robotics Kiva-derived (warehouse), Starship Technologies (delivery).
Quadrupeds. Four-legged robots. Substantial maturity; substantial commercial deployment. Examples: Boston Dynamics Spot; Unitree Go2 (~$3-10K; consumer/research); ANYmal (ANYbotics; industrial inspection).
Bipeds. Two-legged (humanoid-like but possibly without arms). Less mature than quadrupeds. Examples: Cassie (Agility Robotics, research); Digit (Agility Robotics, commercial deployment in logistics).
Humanoids. Full bipedal humanoid robots. The 2022-2026 commercial-frontier category. Examples: Boston Dynamics Atlas (research/demo; electric version 2024); Tesla Optimus (in development, multiple versions); Figure 01/02 (Figure AI); 1X NEO; Sanctuary Phoenix; Agility Digit; Apptronik Apollo; Unitree G1/H1; many others.
The humanoid category. Substantial commercial activity in 2024-2026; capability advancing but production-grade reliability still limited.
Drones / aerial robots. Quadrotors and other flying platforms. Substantial commercial deployment (DJI dominates consumer; many specialized commercial platforms).
Underwater robots. ROVs (Remotely Operated Vehicles) and AUVs (Autonomous Underwater Vehicles). Specialized industrial and scientific applications.
Soft robots. Robots with substantial compliance (often pneumatic or made from elastomers). Research-stage but growing.
Specialized industrial. Welding robots, painting robots, pick-and-place machines, parallel kinematic robots (delta robots). Mature; widely deployed.
The 2026 commercial landscape. Industrial arms and quadrupeds are mature commercial categories. Humanoids are early-pilot-stage. Mobile manipulators are research-and-early-commercial. Drones are mature for many applications.
Sensors
What robots use to perceive the world.
Cameras (RGB). Standard digital cameras producing 2D images. Cheap; ubiquitous. The foundation of modern robotic perception.
Depth cameras. Produce per-pixel depth measurements. Several technologies:
Structured light (Kinect-style; Intel RealSense earlier models): project infrared pattern; observe distortions. Indoor; limited range.
Time-of-flight (ToF): measure the round-trip time (or phase shift) of emitted IR light. Indoor-to-moderate outdoor; medium range.
Stereo: two cameras; compute disparity. Works outdoors; quality depends on texture and lighting.
RGB-D combined: integrate RGB with depth in single device. Intel RealSense, Orbbec, ZED, others.
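To make the stereo entry concrete, the sketch below applies the standard pinhole-stereo relation Z = f * B / d, together with its first-order error term. The function names and the example focal length and baseline are illustrative assumptions, not values taken from any particular camera.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth (metres) of a point from its stereo disparity: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, focal_px, baseline_m, disparity_error_px=1.0):
    """First-order depth uncertainty: dZ ~= Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Assumed rig: 700 px focal length, 12 cm baseline.
F_PX, B_M = 700.0, 0.12
near = depth_from_disparity(42.0, F_PX, B_M)   # roughly 2 m
far = depth_from_disparity(4.2, F_PX, B_M)     # roughly 20 m
near_err = depth_error(near, F_PX, B_M)        # error for a 1 px disparity mistake
far_err = depth_error(far, F_PX, B_M)
```

Because the error term grows with the square of depth, a one-pixel matching mistake costs a hundred times more at ten times the range, which is one reason stereo quality falls off with distance.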
LiDAR. Light Detection And Ranging - scanning lasers measuring distances. Substantially more accurate and longer-range than depth cameras. Used in autonomous driving (Velodyne, Luminar, Waymo’s in-house, others), industrial mapping (Faro, Leica), warehouse robotics.
Cost trajectory. LiDAR cost dropped to the $1K-10K range by 2024-2026 (solid-state designs from multiple vendors), enabling broader deployment.
IMU (Inertial Measurement Unit). Accelerometers + gyroscopes; sometimes magnetometers. Measure motion and orientation. Standard on essentially all mobile robots; essential for state estimation.
Tactile sensors. Detect contact and force at robot’s contact points. Several technologies:
Capacitive tactile sensors (arrays of taxels that register capacitance changes under pressure).
Vision-based tactile (GelSight, DIGIT): camera observes deformation of elastomer skin.
Strain-gauge-based force sensors on joints.
Tactile sensing has substantially advanced in 2023-2026. Still less developed than vision; substantial open research.
Force-torque sensors. Measure force and torque at specific joints (typically the wrist). Used for compliant control, assembly tasks.
Microphones / audio. Audio sensors enable voice interaction, sound-based environment awareness. Standard on humanoid robots, less common on industrial arms.
Encoders / joint position sensors. Measure joint angles. Essential for control; standard on all controlled-joint robots.
Sensor fusion. Combining inputs from multiple sensors to produce coherent state estimates. Kalman-filter-style approaches; increasingly deep-learning-based.
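As a minimal illustration of the Kalman-filter-style idea, the sketch below fuses two independent noisy measurements of the same quantity by inverse-variance weighting, which is what a Kalman measurement update reduces to in the static scalar case. The sensor readings and variances are made-up illustrative values.

```python
def fuse(z1, var1, z2, var2):
    """Fuse two independent Gaussian measurements of the same quantity.

    Each measurement is weighted by the inverse of its variance, so the
    more precise sensor dominates the fused estimate.
    """
    w1 = 1.0 / var1
    w2 = 1.0 / var2
    fused = (w1 * z1 + w2 * z2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)
    return fused, fused_var


# Assumed example: a depth camera reads 2.10 m (variance 0.04),
# a LiDAR reads 2.00 m (variance 0.01).
estimate, variance = fuse(2.10, 0.04, 2.00, 0.01)
```

The fused variance is smaller than either input variance, which is the basic argument for combining sensors rather than picking the best one.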
Actuators
How robots produce motion.
Electric motors. The dominant actuator category. Several types:
Brushed DC motors. Simple, cheap, suitable for many applications.
Brushless DC motors. Higher efficiency, more compact, more common in robotics.
Servomotors. Integrate motor + control + feedback. Standard in hobby/research robotics (RC servos) and industrial robotics (industrial servos).
Direct-drive motors. No gearbox; smooth motion; expensive. Used in high-performance applications.
Electric motors plus gearboxes (harmonic drives, cycloidal gears, planetary gears) produce the high-torque low-speed output needed for robot joints.
Quasi-direct-drive (QDD) actuators. Brushless motors with low-reduction gearboxes. Much better backdrivability and compliance than traditional highly geared motors. Used in modern legged robots (MIT Mini Cheetah and successors; modern humanoids).
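The backdrivability argument can be made with simple arithmetic: an ideal gearbox multiplies motor torque by the gear ratio N, but the rotor inertia felt at the joint grows with N squared. The sketch below uses assumed, illustrative motor parameters and gear ratios (100:1 for a conventional drive, 6:1 for a QDD-style drive).

```python
def joint_side(motor_torque_nm, rotor_inertia_kgm2, gear_ratio, efficiency=1.0):
    """Return (output torque, reflected rotor inertia) at the joint.

    Torque scales with N (times gearbox efficiency); the rotor inertia
    reflected to the joint side scales with N squared.
    """
    output_torque = motor_torque_nm * gear_ratio * efficiency
    reflected_inertia = rotor_inertia_kgm2 * gear_ratio ** 2
    return output_torque, reflected_inertia


# Illustrative (assumed) motor: 2 N*m peak torque, 1e-4 kg*m^2 rotor inertia.
geared_torque, geared_inertia = joint_side(2.0, 1e-4, gear_ratio=100)
qdd_torque, qdd_inertia = joint_side(2.0, 1e-4, gear_ratio=6)
```

With these numbers the high-ratio drive delivers far more torque, but its reflected inertia is several hundred times larger, so external forces have a much harder time moving the joint. That is the trade QDD designs make in favour of compliance.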
Hydraulics. Powerful, high-force-density. Used in heavy industrial robotics and some legged robots (Boston Dynamics Atlas historically; some military applications). Heavy and complex; modern trend is toward electric.
Pneumatic actuators. Compressed air. Used in some industrial settings; soft-robotics applications (Soft Robotics Inc. grippers).
Series elastic actuators (SEA). Spring in series with motor; spring deflection gives a direct force measurement, which enables force control. Used in some legged robots (e.g., ANYmal's ANYdrive actuators) and some humanoid designs.
Linear actuators. Produce straight-line motion rather than rotation. Used in some specialized applications.
Computational hardware
Onboard computation matters for autonomous robots.
General-purpose CPUs. x86 (Intel/AMD) or ARM (Apple, Qualcomm, others). Standard for most robot software.
GPUs. For deep-learning inference. NVIDIA Jetson series (embedded GPUs designed for robotics); discrete GPUs in robots with substantial compute.
Specialized AI accelerators. TPUs, Coral, NPU integrations. Increasing use in robots requiring low-latency deep-learning inference.
Real-time controllers. FPGAs or specialized real-time processors for low-level control loops requiring deterministic timing.
The trade-off. Onboard compute = power consumption + thermal management + cost. Cloud/edge compute = latency + connectivity requirements. Modern robots typically combine onboard compute for low-level control with cloud/edge for higher-level reasoning.
The hardware-software co-design problem
A specific consideration. Modern robotic systems require co-designing hardware and software.
The traditional approach. Hardware design first (mechanical engineers design the robot); software added later (the AI/robotics team programs it). Suboptimal because hardware choices constrain software possibilities.
The modern integrated approach. Hardware and software designed together:
Choose actuators to match planned control strategies.
Choose sensors to match planned perception algorithms.
Choose compute to match planned ML workloads.
Design mechanical structure to support planned behaviour.
The companies doing this well (Boston Dynamics, Figure AI, Physical Intelligence and its hardware partners) have integrated hardware-software teams. The companies doing it poorly produce robots with capability gaps that better integration would have avoided.
Costs and hardware-availability trajectory
A practical consideration. Robotics costs substantially shape what’s feasible.
Industrial arm costs. Up to roughly $200K depending on payload, reach, and capabilities. Mature category; modest price declines.
Quadruped costs. Boston Dynamics Spot is priced far above research/consumer quadrupeds, which run ~$3-10K. Substantial price differential; capability differential is real but smaller than price suggests.
Humanoid costs. Tesla Optimus targeted at ~$30K eventually; Figure 02 expected at ~$90K; Apptronik Apollo similar. Substantial uncertainty; commercial pricing is at an early stage.
Sensor costs. LiDAR has declined substantially (to ~$1-10K). Depth cameras run ~$50-500. These declines enable broader deployment.
Compute costs. Onboard GPUs (Jetson Orin AGX at $2K range) enable substantial onboard deep-learning. Cloud inference adds operational cost.
The 2026 trajectory. Substantial cost declines across the substrate. Industrial robotics cost declines moderate; consumer robotics cost declines substantial. Humanoid robotics is at early commercial stage with significant cost uncertainty.
Where the substrate sits in 2026
The summary. The robotic substrate is mature and improving. Industrial categories are mature. Quadrupeds are commercially deployed. Humanoids are early-commercial with cost uncertainty. Sensor and compute costs continue declining. Hardware-software co-design is increasingly recognized as essential for capability.
The remaining issues. Humanoid hardware reliability gaps. Tactile sensing immaturity. Cost barriers for many applications. Hardware ecosystem fragmentation (different software stacks for different platforms).
The next section covers the classical methods that remain essential even in the deep-learning era.
§4. Classical Methods (Foundations)
The pre-deep-learning robotics toolkit. Even in 2026, modern robotic systems substantially depend on classical methods - kinematics, dynamics, motion planning, classical control, SLAM. This section summarizes the classical foundations; the AIMA-era treatment (Siciliano-Khatib Handbook; Thrun-Burgard-Fox Probabilistic Robotics) provides the depth.
Kinematics
The mathematics of robotic motion without considering forces.
Forward kinematics. Given joint angles, compute the position and orientation of the end-effector. Standard formulation: Denavit-Hartenberg parameters; product-of-exponentials. For serial-chain arms this is a closed-form product of per-joint transforms.
Inverse kinematics. Given desired end-effector pose, compute joint angles. Substantially harder than forward kinematics: solutions may be multiple or absent, and there is generally no closed form, so practice relies on numerical methods or on analytical solutions derived for specific manipulator designs.
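A planar two-link arm is one of the specific designs for which both directions have closed forms, and it shows the multiple-solution issue directly. The sketch below is a textbook-style illustration; the link lengths, function names, and the `flip_elbow` flag are assumptions made for the example.

```python
import math


def forward_kinematics(q1, q2, l1=1.0, l2=1.0):
    """End-effector (x, y) of a planar 2-link arm from joint angles (radians)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def inverse_kinematics(x, y, l1=1.0, l2=1.0, flip_elbow=False):
    """One of the two joint-angle solutions reaching (x, y).

    Raises ValueError when the target lies outside the reachable annulus.
    """
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("target out of reach")
    s2 = math.sqrt(1.0 - c2 * c2)
    if flip_elbow:
        s2 = -s2  # select the mirrored elbow configuration
    q2 = math.atan2(s2, c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * s2, l1 + l2 * c2)
    return q1, q2


target = (1.2, 0.5)
solution_a = inverse_kinematics(*target)
solution_b = inverse_kinematics(*target, flip_elbow=True)
reached_a = forward_kinematics(*solution_a)
reached_b = forward_kinematics(*solution_b)
```

Both solutions reach the same point with different joint angles. For arms with six or more joints the same ambiguity appears in higher dimension, which is part of why general inverse kinematics is solved numerically.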
The methodology is well-established. Standard textbook content (Spong et al. Robot Modeling and Control; Craig Introduction to Robotics; many others).
The modern integration. Kinematics remains foundational. Even VLA models use kinematics for low-level joint-space-to-task-space conversions. The methodology is not replaced by deep learning; it is complemented.
Dynamics
The mathematics of robotic motion with forces and torques.
Forward dynamics. Given torques, compute resulting motion (joint accelerations).
Inverse dynamics. Given desired motion, compute required torques.
The standard formulations. Lagrangian dynamics; Newton-Euler dynamics. Recursive computation methods (Newton-Euler recursive algorithm) for efficient computation.
The modern integration. Dynamics is essential for many applications:
Model-predictive control.
High-performance manipulation (e.g., dexterous tasks requiring force-aware planning).
Whole-body control for humanoids and quadrupeds.
Learning-based control with model-based components.
Learned dynamics models (training neural networks to predict dynamics) complement analytical dynamics; rarely replace it entirely.
Classical motion planning
Planning trajectories in configuration space.
Sampling-based methods. RRT (Rapidly-exploring Random Trees; LaValle 1998); RRT* (asymptotically optimal RRT; Karaman and Frazzoli 2011); PRM (Probabilistic Roadmap; Kavraki et al. 1996). Standard for high-DoF motion planning.
The pattern. Randomly sample points in configuration space; connect to existing nodes; build a tree or roadmap; query for paths.
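The sample-connect-extend pattern can be sketched as a minimal 2D RRT with a single circular obstacle. This is an illustrative toy, not a production planner: the workspace bounds, step size, goal bias, and fixed random seed are all assumptions chosen for the example, and nearest-neighbour search is a plain linear scan.

```python
import math
import random


def segment_hits_circle(p, q, center, radius):
    """True if the segment p-q passes within `radius` of `center`."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    seg_len2 = dx * dx + dy * dy
    t = 0.0
    if seg_len2 > 0.0:
        t = ((center[0] - p[0]) * dx + (center[1] - p[1]) * dy) / seg_len2
        t = max(0.0, min(1.0, t))
    closest = (p[0] + t * dx, p[1] + t * dy)
    return math.dist(closest, center) <= radius


def rrt(start, goal, obstacle, bounds=(0.0, 10.0), step=0.5,
        goal_tol=0.5, goal_bias=0.1, max_iters=5000, seed=0):
    """Grow a tree from start; return a list of waypoints, or None on failure."""
    rng = random.Random(seed)
    center, radius = obstacle
    nodes = [start]
    parent = {0: None}
    for _ in range(max_iters):
        # Sample a random point, occasionally the goal itself.
        if rng.random() < goal_bias:
            sample = goal
        else:
            sample = (rng.uniform(*bounds), rng.uniform(*bounds))
        # Find the nearest existing node and step from it toward the sample.
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        near = nodes[i_near]
        d = math.dist(near, sample)
        if d == 0.0:
            continue
        scale = min(step, d) / d
        new = (near[0] + scale * (sample[0] - near[0]),
               near[1] + scale * (sample[1] - near[1]))
        if segment_hits_circle(near, new, center, radius):
            continue  # edge collides; discard
        nodes.append(new)
        parent[len(nodes) - 1] = i_near
        if math.dist(new, goal) <= goal_tol:
            # Walk parent pointers back to the root to recover the path.
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None


OBSTACLE = ((5.0, 5.0), 2.0)
path = rrt(start=(1.0, 1.0), goal=(9.0, 9.0), obstacle=OBSTACLE)
```

The returned path is feasible but not short or smooth; RRT* and post-hoc smoothing address that, at additional computational cost.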
Optimization-based methods. CHOMP (Covariant Hamiltonian Optimization for Motion Planning); STOMP; TrajOpt. Cast trajectory generation as an optimization problem; solve with gradient-based or stochastic methods.
Grid-based methods. A* and variants on discretized configuration space. Standard for 2D mobile-robot navigation; less common for high-DoF arms.
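For the grid-based case, a minimal A* on a 4-connected occupancy grid looks roughly like the sketch below. The grid layout, unit step cost, and Manhattan-distance heuristic are assumptions for the example; real navigation stacks typically add costmaps and inflation around obstacles.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) / 1 (blocked) cells.

    Returns a list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    best_cost = {start: 0}
    came_from = {start: None}
    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if cost > best_cost[current]:
            continue  # stale heap entry
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = current
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


GRID = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
route = astar(GRID, start=(0, 0), goal=(4, 3))
```

The number of grid cells grows exponentially with the number of degrees of freedom, which is why this approach suits 2D navigation and is rarely applied directly to high-DoF arms.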
Trajectory optimization with constraints. Solve constrained nonlinear programs to generate trajectories satisfying physical limits, obstacle avoidance, smoothness criteria.
The modern integration. Classical motion planning is standard production technology. Industrial robotic arms run sampling-based planning hundreds of times per shift. Modern integration adds learned components (heuristics; collision predictors) but the substrate is classical.
Classical control
Producing motor commands to track planned trajectories.
PID control (Proportional-Integral-Derivative). The workhorse. Simple, well-understood, widely effective for low-DoF systems. Standard for joint-level control on most robots.
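A minimal sketch of a PID loop on a single joint follows. The plant (unit inertia with viscous friction and a constant load torque), the gains, and the time step are all illustrative assumptions; the point is only to show the three terms and the role of the integral term in removing steady-state error.

```python
class PID:
    """Textbook PID controller acting on a scalar error signal."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        # No derivative on the very first sample (avoids a startup spike).
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_joint(kp, ki, kd, target=1.0, dt=0.001, steps=5000):
    """Drive an assumed joint model toward `target`; return the final angle."""
    controller = PID(kp, ki, kd)
    angle, velocity = 0.0, 0.0
    for _ in range(steps):
        torque = controller.step(target - angle, dt)
        # Assumed plant: unit inertia, friction 0.5, constant load torque -2.0.
        acceleration = torque - 0.5 * velocity - 2.0
        velocity += acceleration * dt
        angle += velocity * dt
    return angle


final_pd = simulate_joint(kp=100.0, ki=0.0, kd=20.0)     # settles short of the target
final_pid = simulate_joint(kp=100.0, ki=200.0, kd=20.0)  # integral term removes the offset
```

Without the integral term the joint settles where the proportional torque balances the load, slightly short of the target; with it, the accumulated error supplies the missing torque.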
Computed torque control / inverse dynamics control. Use dynamics model to feedforward required torques; PID corrects for model errors. Substantially better tracking than PID alone for high-DoF systems.
Model predictive control (MPC). At each step, solve a constrained optimization problem to plan a short horizon; execute the first control input; replan. Powerful, because constraints are handled explicitly, but computationally expensive. Standard for autonomous vehicles and modern legged robots.
Optimal control. Solve infinite-horizon or finite-horizon optimal control problems. Linear-quadratic regulator (LQR) for linear systems; nonlinear extensions (iLQR, DDP) for nonlinear.
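The infinite-horizon LQR case can be illustrated on a scalar discrete-time system x[k+1] = a*x[k] + b*u[k] with cost sum(q*x^2 + r*u^2). The sketch below iterates the discrete Riccati recursion to a fixed point; the system and cost values are assumptions chosen so that the answer is easy to check by hand.

```python
def scalar_lqr(a, b, q, r, iters=200):
    """Infinite-horizon discrete LQR for a scalar system.

    Iterates P <- q + a^2 P - (a b P)^2 / (r + b^2 P) until it settles,
    then returns (gain K, cost-to-go weight P) for the policy u = -K x.
    """
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)
    return k, p


# Assumed example: a pure integrator with equal state and control weights.
gain, cost_weight = scalar_lqr(a=1.0, b=1.0, q=1.0, r=1.0)
closed_loop_pole = 1.0 - 1.0 * gain  # a - b*K; magnitude below 1 means stable
```

For this particular system the fixed point satisfies P^2 - P - 1 = 0, so P is the golden ratio and the gain is about 0.618. iLQR and DDP apply the same backward recursion to local linear-quadratic approximations of a nonlinear problem.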
Compliance control / impedance control. Control the robot’s stiffness and damping rather than just position. Important for contact-rich tasks (assembly, manipulation against surfaces).
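A one-dimensional sketch of the impedance idea follows: the controller behaves like a virtual spring-damper pulling the end-effector toward a setpoint, so when a surface blocks the motion the contact force is set by the chosen stiffness rather than by a position error the robot fights to eliminate. The mass, wall stiffness, gains, and setpoint are illustrative assumptions.

```python
def impedance_force(x, v, x_des, stiffness, damping, v_des=0.0):
    """Virtual spring-damper command: F = K (x_des - x) + D (v_des - v)."""
    return stiffness * (x_des - x) + damping * (v_des - v)


def press_on_wall(stiffness=100.0, damping=20.0, x_des=0.05,
                  wall_stiffness=1e4, dt=1e-4, steps=30000):
    """Unit mass starts in free space and presses on a stiff wall at x = 0.

    The setpoint lies 5 cm inside the wall, so it is never reached.
    Returns the final contact force.
    """
    x, v = -0.1, 0.0
    for _ in range(steps):
        command = impedance_force(x, v, x_des, stiffness, damping)
        contact = -wall_stiffness * x if x > 0.0 else 0.0  # wall pushes back
        v += (command + contact) * dt  # unit mass
        x += v * dt
    return wall_stiffness * max(x, 0.0)


contact_force = press_on_wall()
```

The steady contact force comes out close to stiffness times setpoint depth (about 5 N here), so halving the virtual stiffness roughly halves how hard the robot presses. A stiff position controller in the same situation would keep increasing effort to close the 5 cm gap.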
The modern integration. Classical control remains dominant for low-level control. Joint-level PID on most industrial arms; MPC on most autonomous vehicles; compliance control on collaborative arms. Modern learned-control work focuses on higher-level decisions; classical control handles the low-level loops.
SLAM and localization
A foundational mobile-robotics capability.
SLAM. Build a map of an unknown environment while simultaneously localizing within it. Standard formulations:
Filter-based. EKF SLAM, particle-filter SLAM. Process measurements sequentially.
Optimization-based. Pose-graph optimization; bundle adjustment. Solve for trajectory and map jointly.
Localization (in known map). Given a map, estimate the robot’s position from sensor measurements. Particle filter (AMCL - Adaptive Monte Carlo Localization) is standard.
Modern visual and multi-sensor SLAM. Visual systems such as ORB-SLAM and DSO estimate trajectory and map from camera images; multi-sensor systems such as LIO-SAM tightly couple lidar with IMU measurements.
Deep-learning SLAM. Learned features for SLAM (DROID-SLAM, others); NeRF-based mapping (representing the environment as a learned neural field). Substantial 2022-2026 advances.
The modern integration. Classical SLAM is production technology for many applications (warehouse robots, autonomous driving). Deep-learning SLAM is complementary; substantially improves visual feature extraction but doesn’t replace the fundamental SLAM framework.
Probabilistic robotics
The broader methodology underlying classical robotics.
Bayesian filtering. Maintain probability distributions over robot state; update with sensor observations and action models. Kalman filters, extended Kalman filters, unscented Kalman filters, particle filters.
Sensor models. Probabilistic models of how sensors produce measurements given true state. Essential for filtering.
Action models. Probabilistic models of how actions affect state. Used in both filtering and planning.
The methodology, as articulated in Thrun, Burgard, Fox (2005) “Probabilistic Robotics,” remains the conceptual framework for most mobile robotics. Modern deep-learning approaches build on this framework; they do not replace it.
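The predict-update cycle of Bayesian filtering is easiest to see in a one-dimensional Kalman filter, where the belief is a single Gaussian. In the sketch below, `kf_predict` plays the role of the action model and `kf_update` the role of the sensor model; the noise values and the example step are assumptions.

```python
def kf_predict(mean, var, control, motion_var):
    """Action model: shift the belief by the commanded motion and grow its variance."""
    return mean + control, var + motion_var


def kf_update(mean, var, measurement, meas_var):
    """Sensor model: blend prediction and measurement by their relative certainty."""
    gain = var / (var + meas_var)  # Kalman gain, between 0 and 1
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var


# Assumed example: robot believed at 0.0 m (variance 1.0) commands a 1.0 m move,
# then a range sensor reports 1.2 m.
belief = (0.0, 1.0)
belief = kf_predict(*belief, control=1.0, motion_var=0.5)
predicted = belief
belief = kf_update(*belief, measurement=1.2, meas_var=0.5)
```

Prediction always increases uncertainty and a measurement always reduces it. Extended and unscented Kalman filters keep this structure for nonlinear models, and particle filters keep it while replacing the Gaussian with a set of samples.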
Why classical methods persist
A specific question. Given the substantial advances in deep learning, why do classical methods remain essential?
Several reasons.
Safety guarantees. Classical control theory provides provable stability and safety guarantees for known systems. Learned controllers generally do not.
Sample efficiency. Classical methods can be designed with limited data. Learned methods require substantial data (especially for the long tail of edge cases).
Interpretability. A PID controller’s behaviour is understandable; a learned policy’s behaviour is less so.
Robustness. Classical methods often degrade gracefully under conditions outside their design assumptions. Learned methods may fail unpredictably.
Composability. Classical methods compose cleanly (e.g., low-level PID inside higher-level MPC). Learned components compose less cleanly.
Mature tooling. Decades of engineering have produced mature classical-control libraries, simulation tools, debugging methodology. Learned-control tooling is less mature.
The 2026 integration. Most production robotic systems use classical methods for low-level control (joint-level PID; collision-avoidance MPC; SLAM-based localization) and learned methods for higher-level perception and planning (deep-learning perception; learned manipulation policies; VLA models). The combination is more capable than either approach alone.
A worked example: a modern manipulation pipeline
To anchor the integration. A modern industrial manipulation system might combine:
MODERN MANIPULATION PIPELINE (illustrative)
  1. PERCEPTION (deep learning)
     Camera input → deep network → object detection,
     pose estimation, grasp prediction.
  2. PLANNING (classical + learned)
     High-level: which object to manipulate (learned from
       task specification or LLM-based reasoning).
     Mid-level: where to grasp (learned grasp predictor).
     Low-level: trajectory to reach grasp pose (classical
       sampling-based motion planning with collision avoidance).
  3. CONTROL (classical)
     Joint-level control with compliance for contact-rich
     tasks (impedance control). Force feedback for fine
     manipulation.
  4. VERIFICATION (mixed)
     Vision-based verification that grasp succeeded;
     tactile-feedback verification that object is held.
     If failure detected, return to planning step.
Each component uses the methodology most appropriate for its task. The classical components ensure safety and reliability at low levels; the learned components provide flexibility and broad capability at higher levels.
Where classical methods sit in 2026
The summary. Classical methods are foundational and persistent. They are not replaced by deep learning; they are complemented. Modern production robotic systems use both throughout.
The remaining issues. The classical-vs-learned boundary is contested (where exactly should the line be?). The integration of classical and learned components requires substantial engineering. Some traditional researchers underweight modern learned methods; some modern researchers underweight classical foundations. The healthy middle (using both appropriately) is the production standard.
The next sections develop the modern learning-based methods that complement the classical substrate: §5 covers perception (where deep learning is now dominant); §6 covers imitation learning; §7 covers RL for robotics; §8 covers VLA.
§5. Perception for Robotics
The robot’s interface to the physical world. Robotic perception has been substantially transformed by deep learning over 2015-2026. This section develops the dominant approaches - 3D perception, object detection and pose estimation, scene understanding, foundation models as backbones, and sensor fusion.
3D perception
The defining requirement. Robots act in 3D space; their perception must produce 3D understanding.
Depth estimation. Producing per-pixel depth from sensors.
Stereo depth. Disparity computation from two cameras. Classical methods (block matching, SGBM); modern deep-learning methods (PSMNet, RAFT-Stereo).
Monocular depth. Estimate depth from a single image. Substantially harder than stereo; modern deep methods (MiDaS, ZoeDepth, Depth Anything) produce surprisingly good results.
Time-of-flight and structured light. Direct depth measurement (cross-reference §3 sensors).
Point clouds. 3D point sets from LiDAR or depth cameras. Standard representation for many tasks.
Point-cloud processing methods.
Classical. ICP (Iterative Closest Point) for alignment; voxel-grid filtering; RANSAC for plane fitting.
Deep learning on point clouds. PointNet (Qi et al., 2017) and successors (PointNet++, PointCNN, transformers for point clouds). Process point sets directly without voxelization.
Voxel-based methods. VoxelNet, SECOND. Voxelize point clouds for 3D convolution.
Sparse 3D convolutions. MinkowskiNet and similar. Efficient processing of sparse 3D data.
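Of the classical operations listed above, voxel-grid filtering is the simplest to show in full: points are bucketed by the voxel they fall in, and each occupied voxel is replaced by the centroid of its points. The sketch below is a plain-Python illustration with assumed sample points; libraries such as PCL and Open3D provide optimized equivalents.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel_size):
    """Replace all points inside each occupied voxel by their centroid."""
    buckets = defaultdict(list)
    for point in points:
        # Integer voxel index along each axis.
        key = tuple(math.floor(coord / voxel_size) for coord in point)
        buckets[key].append(point)
    return [
        tuple(sum(axis) / len(members) for axis in zip(*members))
        for members in buckets.values()
    ]


# Assumed sample cloud: two points share a 10 cm voxel, one is far away.
cloud = [(0.01, 0.02, 0.0), (0.03, 0.04, 0.0), (0.51, 0.0, 0.0)]
reduced = voxel_downsample(cloud, voxel_size=0.1)
```

The same bucketing step is the front end of the voxel-based deep methods: they compute learned features per occupied voxel instead of a centroid.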
NeRF and 3D scene representations. Neural Radiance Fields (Mildenhall et al., 2020) and successors enable photo-realistic 3D scene reconstruction. 3D Gaussian Splatting (Kerbl et al., 2023) renders substantially faster than NeRF.
The robotic applications.
Scene understanding from multiple views. Combine multiple camera viewpoints into coherent 3D representation.
Object manipulation in unknown environments. Build 3D scene model from a few observations.
Place recognition and re-localization. Match current observation to previously-built 3D scene.
The 2026 state. 3D perception is substantially mature. Production systems combine multiple sensor modalities; LiDAR-based and vision-based approaches both work; the choice depends on the application.
Object detection and pose estimation
A specific capability. Object detection identifies object instances and their bounding boxes (2D) or 3D poses.
2D object detection. YOLO (multiple versions through 2026), DETR, modern transformer-based detectors. Real-time; widely deployed. Tens of milliseconds per frame on modern hardware.
3D object detection. Produces 3D bounding boxes: position, orientation, and size in 3D space. Used in autonomous driving (PointPillars, PV-RCNN, CenterPoint) and manipulation (finding graspable objects in 3D).
Pose estimation. Estimate the 6-DoF pose (3D position + 3D orientation) of an object. Substantially harder than detection alone. Modern methods (FoundationPose, MegaPose) leverage foundation models for substantial generalization.
Grasp pose prediction. Specialized: identify where on an object the robot should grasp. From hand-engineered (force-closure analysis) to learned (Dex-Net, GraspNet, modern transformer-based grasp predictors).
The 2026 state. Object detection in 2D is mature. 3D pose estimation is substantially mature for common objects and uneven for rare or novel objects. Grasp prediction is substantially advanced but not universally reliable.
Scene understanding
A higher-level capability. Scene understanding produces semantic representations of robotic environments - not just “there is an object here” but “there is a coffee mug on the table next to the laptop; the room is a home office.”
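Semantic segmentation. Pixel-level classification of image content. Per-pixel labels (furniture, person, floor, wall). Standard deep-learning methods (DeepLab and successors); promptable models such as Segment Anything produce class-agnostic masks that are labelled in a separate step.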
Instance segmentation. Per-instance object masks (not just “person” but “person 1” and “person 2”). Mask R-CNN, modern segmentation transformers.
Panoptic segmentation. Combines semantic and instance segmentation; produces both per-pixel semantic labels and per-instance masks.
Open-vocabulary segmentation. Use VLMs (cross-reference Multimodal Models §4) to segment based on natural-language descriptions. SAM (Segment Anything Model; Meta, 2023) supplies promptable, class-agnostic masks; pairing it with language-grounded detectors or CLIP-style embeddings yields open-vocabulary segmentation.
3D scene graphs. Represent the scene as a graph: nodes are objects/regions; edges encode spatial and semantic relationships. Used for high-level robotic reasoning.
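As a data-structure sketch, a scene graph can be as simple as a dictionary of object nodes plus a list of relation triples. The class below is a hypothetical minimal version built around the home-office example from the start of this subsection; the attribute names and relation labels are assumptions, and real systems attach poses, meshes, and room hierarchy to the nodes.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Objects as nodes; (subject, relation, object) triples as edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_object(self, name, **attributes):
        self.nodes[name] = attributes

    def relate(self, subject, relation, obj):
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both ends of a relation must be known objects")
        self.edges.append((subject, relation, obj))

    def query(self, relation, obj):
        """All subjects standing in `relation` to `obj`, in insertion order."""
        return [s for s, r, o in self.edges if r == relation and o == obj]


graph = SceneGraph()
graph.add_object("table", category="furniture")
graph.add_object("mug", category="container", graspable=True)
graph.add_object("laptop", category="electronics", graspable=False)
graph.relate("mug", "on", "table")
graph.relate("laptop", "on", "table")
graph.relate("mug", "next_to", "laptop")

on_table = graph.query("on", "table")
graspable_on_table = [n for n in on_table if graph.nodes[n].get("graspable")]
```

A planner or LLM-based reasoner can answer questions such as "what can I pick up from the table?" against this structure without touching raw pixels.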
The 2026 deployment. Scene understanding is standard in many production systems. Warehouse robots use semantic segmentation to distinguish floor from shelf from product; autonomous vehicles use comprehensive scene understanding; humanoid robots use scene graphs for reasoning about their environments.
Foundation models as perception backbones
A specific 2022-2026 development. Pretrained vision and vision-language foundation models (cross-reference Multimodal Models §4) have become standard perception substrates for robotic systems.
The pattern.
Backbone. Pretrained vision model (CLIP, DINOv2, SAM, or modern VLM).
Adaptation. Lightweight adapter on top for the specific robotic task.
Training. Adapter trained on robotic data; backbone frozen or lightly fine-tuned.
The benefits.
Substantial pretraining transfer. The backbone has learned from web-scale data; robotic tasks benefit from this prior.
Reduced data needs. Less robotic-specific data required to achieve good performance.
Cross-task generalization. The same backbone serves many downstream robotic tasks.
Notable foundation-model-based robotic perception systems.
DINOv2 + adapters. Common backbone for robotic feature extraction.
SAM-based segmentation in robotic perception. Used for open-vocabulary segmentation.
CLIP/SigLIP-based scene understanding. Used for language-conditioned scene interpretation.
VLM-based perception. Modern VLA models (RT-2, π0) use full vision-language models for perception within unified architectures.
The 2026 state. Foundation models are standard substrate for new robotic perception systems. Specialized per-task perception models are increasingly outperformed by foundation-model-based approaches.
Tactile perception
A specific underdeveloped area. Tactile perception uses contact sensors to perceive interactions with the environment.
The challenges. Tactile sensing has been technically harder than vision; sensors are less mature; data is less abundant.
Recent advances.
GelSight, DIGIT. Vision-based tactile sensors (camera observes deformation of elastomer skin). Produces dense tactile images.
Touch-and-vision combined models. ViTac and successors combine visual and tactile features.
Tactile-language models. Recent work extending VLM patterns to tactile.
The 2026 state. Tactile perception is less mature than visual perception. Progress has been substantial, but robots remain far behind humans on contact-rich tasks. Cross-reference OP-RO-3 (long-horizon manipulation reliability).
Sensor fusion
Combining multiple sensor inputs.
The classical approach. Kalman filtering and extended-Kalman-filtering for state estimation from multiple sensor sources. Standard in mobile robotics; substantial existing software (e.g., ROS robot_localization).
The modern approach. Deep-learning-based sensor fusion. Networks consume multiple modalities (RGB + depth + LiDAR + IMU) and produce unified representations.
Notable applications.
Autonomous driving. Camera + LiDAR + radar fusion is standard; modern approaches use deep multi-modal networks (e.g., BEVFusion, MV-Net), alongside camera-only multi-view models such as BEVFormer.
Mobile robot navigation. Vision + depth + IMU + wheel odometry. Modern visual-inertial odometry (VINS-Mono, VINS-Fusion) combines vision and IMU effectively.
Manipulation. Vision + tactile + force-torque for contact-rich tasks.
The 2026 state. Sensor fusion is standard in serious robotic systems. The deep-learning-vs-classical balance varies; both approaches are commonly used.
Where perception sits in 2026
The summary. Robotic perception has been substantially transformed by deep learning and foundation models. 3D perception, object detection, pose estimation, scene understanding are production-grade for many applications. Tactile perception lags. Sensor fusion combines classical and learned methods.
The remaining issues. Tactile perception immaturity. Generalization to novel objects and environments is uneven. Robustness under unusual conditions (poor lighting, transparent or reflective objects, edge cases) remains a challenge.
The next section develops imitation learning - the dominant learning paradigm in modern manipulation.
§6. Imitation Learning
The dominant learning paradigm for modern manipulation. Imitation learning trains a policy from demonstrations of the desired behaviour. This section develops behaviour cloning, demonstration sources, modern methods (diffusion policies, Action Chunking Transformer), and scaling.
Behaviour cloning: the basic approach
The simplest imitation-learning recipe. Treat the problem as supervised learning: given a state (observation), predict the action a demonstrator would take.
BEHAVIOUR CLONING (basic recipe)
  Given:
    Dataset D = {(state_i, action_i)} from demonstrations.
  Train:
    A policy network π(action | state):
      Minimize cross-entropy (for discrete actions) or MSE
      (for continuous actions) between predicted and
      demonstrated actions.
  Deploy:
    At each timestep, observe state; produce action via π.
The strengths. Simple to implement. Reuses standard supervised-learning machinery. Works well when demonstrations are abundant and diverse.
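The recipe reduces to ordinary regression. The sketch below fits a linear policy to a handful of (state, action) pairs by minimizing MSE in closed form; the one-dimensional state, the linear policy class, and the `expert` function standing in for a human demonstrator are all assumptions made to keep the example self-contained.

```python
def fit_linear_policy(states, actions):
    """Least-squares fit of action = w * state + b (the MSE minimizer)."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    covariance = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    variance = sum((s - mean_s) ** 2 for s in states)
    w = covariance / variance
    b = mean_a - w * mean_s
    return w, b


def expert(state):
    """Hypothetical demonstrator: steer the state toward 0.25."""
    return -2.0 * state + 0.5


# Demonstrations only cover states in [-1, 1].
demo_states = [-1.0, -0.5, 0.0, 0.5, 1.0]
demo_actions = [expert(s) for s in demo_states]

w, b = fit_linear_policy(demo_states, demo_actions)


def policy(state):
    return w * state + b
```

Here the policy class contains the expert exactly, so the clone is perfect everywhere. With a neural network and image observations the fit is only trustworthy near the demonstrated states, which is where the distribution-shift problem comes from.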
The classical limitation. Distribution shift. The policy only sees states from the demonstration distribution; at deployment, even small errors compound, eventually putting the robot in states it has never seen. The policy then produces poor actions, compounding the error further.
The mitigations.
DAgger (Dataset Aggregation; Ross et al., 2011). Iteratively expand the training distribution by querying the demonstrator on states encountered during deployment. Requires expert availability during training.
Action chunking. Predict multiple actions at once rather than one-at-a-time; reduces compounding-error frequency. The basis of modern ACT methods (below).
Data augmentation. Augment demonstration data with synthetic perturbations to broaden the state distribution.
Larger and more diverse demonstrations. Scaling solves many problems; abundant demonstrations across diverse situations reduce distribution-shift severity.
Demonstration sources
How to get demonstration data.
Teleoperation. A human controls the robot via a remote interface (VR controllers, motion capture, joystick). The robot performs the task; data is recorded.
The most-common source for modern imitation learning. Used by Figure AI, 1X, Physical Intelligence, OpenAI’s robot teams, others.
The challenges. Slow (real-time human demonstration takes time); expensive (skilled human operators); ergonomically demanding (long teleop sessions are tiring).
Kinesthetic teaching. Physically guide the robot through the desired motion; the robot records its own joint angles. Used in some industrial collaborative-robot deployments.
The limitations. Requires direct physical access; limited to simple motions; doesn’t capture force/torque profile naturally.
Video demonstrations. Learn from observational video of humans performing tasks; no robot trajectories required.
The advantages. Massive data available (YouTube, internet video); humans can demonstrate without specialized hardware.
The challenges. Embodiment gap. Human anatomy differs from robot; direct trajectory transfer is impossible. Action inference. Need to infer actions from observations; ambiguous.
Recent advances. Datasets such as DexYCB, Ego4D, and Ego-Exo4D enable substantial video-based learning. Vision-language models help bridge the embodiment gap by understanding what the human is doing at a semantic level; robot policies then translate this into robot-specific actions.
Cross-embodiment data. Use data from other robots (Open X-Embodiment, cross-reference §2). The robot benefits from data collected by different robots performing similar tasks.
Simulation demonstrations. Generate demonstrations in simulation; transfer to real (sim-to-real, cross-reference §7).
LLM-generated demonstrations. Use an LLM to plan actions; execute in simulation or with a teleoperated robot; record. Recent direction; emerging.
The 2026 production picture. Most production VLA training uses teleoperated data. Cross-embodiment data and simulation provide additional scale. Video-based learning is increasingly important but not yet dominant.
Diffusion policies
A specific 2023 advance. Diffusion Policy (Chi, Feng, Du, Xu, Cousineau, Burchfiel, and Song, 2023), “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.”
The mechanism. Apply diffusion models (cross-reference Generative Models §6) to action prediction. Given observation, generate actions via iterative denoising from random noise.
The recipe.
DIFFUSION POLICY (Chi et al., 2023, sketch)
  Training:
    For each (observation, action_sequence) in demonstrations:
      Add noise to action_sequence at random noise level t.
      Train network to predict the noise given (noisy_action_sequence,
        observation, t).
      Loss: MSE on noise prediction.
  Inference:
    Start with random noise as initial action sequence.
    Iteratively denoise (10-100 steps) conditioned on current
      observation.
    Output: predicted action sequence.
    Execute first K actions; observe new state; replan.
The advantages.
Multimodal action distributions. Diffusion can represent multiple valid actions for a given state (different valid trajectories); previous methods often produced averaged-out single-mode predictions.
Smooth action sequences. Predicts action sequences rather than single actions; smoother and more coherent.
Empirical advance. Outperformed previous imitation-learning methods by a clear margin on standard manipulation benchmarks.
The trajectory. Diffusion Policy substantially shifted the field. By 2024-2026, diffusion-based action prediction is standard in manipulation imitation learning.
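The noising and denoising steps in the recipe can be traced numerically without training anything. The sketch below uses a DDPM-style linear noise schedule and, as a loudly stated simplification, replaces the trained network with an analytic "oracle" noise predictor that is exact when the dataset contains a single demonstrated action sequence; conditioning on the observation is omitted for the same reason. The schedule values, function names, and demonstration are assumptions, and this is not the paper's implementation.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linear beta schedule with cumulative products alpha_bar[t]."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alphas = [1.0 - beta for beta in betas]
    alpha_bars, running = [], 1.0
    for alpha in alphas:
        running *= alpha
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def add_noise(actions, t, alpha_bars, rng):
    """Training-time forward process: returns (noisy actions, the noise used)."""
    noise = [rng.gauss(0.0, 1.0) for _ in actions]
    ab = alpha_bars[t]
    noisy = [math.sqrt(ab) * a + math.sqrt(1.0 - ab) * e for a, e in zip(actions, noise)]
    return noisy, noise


def oracle_noise(noisy, t, alpha_bars, demo):
    """Stand-in for the trained network: exact noise when `demo` is the only datum."""
    ab = alpha_bars[t]
    return [(x - math.sqrt(ab) * d) / math.sqrt(1.0 - ab) for x, d in zip(noisy, demo)]


def sample_actions(demo, num_steps=50, seed=0):
    """Inference: start from Gaussian noise and denoise step by step."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in demo]
    for t in reversed(range(num_steps)):
        eps_hat = oracle_noise(x, t, alpha_bars, demo)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps_hat)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]  # re-inject noise
        else:
            x = mean  # final step is deterministic
    return x


DEMO = [0.1, 0.2, 0.3, 0.4]  # assumed 4-step action sequence
_, _, ALPHA_BARS = make_schedule()
noisy_example, true_noise = add_noise(DEMO, 25, ALPHA_BARS, random.Random(1))
recovered_noise = oracle_noise(noisy_example, 25, ALPHA_BARS, DEMO)
generated = sample_actions(DEMO)
```

With one demonstration the sampler recovers that demonstration from pure noise. A trained network plays the same role as the oracle but averages over many demonstrations, and the injected noise during sampling is what lets different runs land on different valid modes.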
Action Chunking Transformer (ACT)
Another specific 2023 advance. Action Chunking with Transformers (ACT) (Zhao, Kumar, Levine, and Finn, Stanford ALOHA project, 2023), “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.”
The mechanism. Predict chunks of consecutive actions rather than one action at a time. Use a Transformer with a conditional variational autoencoder.
The setup.
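Encoder. Used only during training. Takes the robot's joint positions and the demonstrated action chunk; produces a latent embedding.
Decoder. Takes the current observation (camera images and joint positions) and the latent; predicts the action chunk.
Training. Reconstruct demonstrated action chunks via the encoder-decoder, with a KL term regularizing the latent toward the prior.
Inference. Discard the encoder; set the latent to the prior mean (zero) rather than sampling; decode to an action chunk; execute; replan.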
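The execute-then-replan loop shared by chunked policies can be sketched independently of the network. Below, `run_chunked` queries a policy for a chunk of actions, executes the first `execute_k` of them, and only then queries again. The toy policy and environment are hypothetical stand-ins, not the ACT model.

```python
def run_chunked(policy, step_fn, obs, horizon, execute_k):
    """Roll out a chunk-predicting policy for `horizon` environment steps.

    Returns (final observation, number of policy queries).
    """
    if execute_k < 1:
        raise ValueError("execute_k must be at least 1")
    queries = 0
    t = 0
    while t < horizon:
        chunk = policy(obs)  # one forward pass yields several actions
        queries += 1
        for action in chunk[:execute_k]:
            obs = step_fn(obs, action)
            t += 1
            if t >= horizon:
                break
    return obs, queries


TARGET = 1.0
CHUNK_LEN = 8


def toy_policy(obs):
    """Hypothetical policy: plan CHUNK_LEN equal steps that end at TARGET."""
    step = (TARGET - obs) / CHUNK_LEN
    return [step] * CHUNK_LEN


def toy_step(obs, action):
    """Hypothetical environment: the observation is a 1D position."""
    return obs + action


final_full, queries_full = run_chunked(toy_policy, toy_step, 0.0, horizon=16, execute_k=8)
final_half, queries_half = run_chunked(toy_policy, toy_step, 0.0, horizon=16, execute_k=4)
_, queries_single = run_chunked(toy_policy, toy_step, 0.0, horizon=16, execute_k=1)
```

Executing whole chunks cuts the number of decision points (2 queries instead of 16 here), which is the mechanism behind reduced compounding error. Executing fewer actions per chunk trades some of that benefit for faster reaction to new observations.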
The advantages.
Reduces compounding error. Predicting K actions ahead reduces decision frequency.
Captures action correlations. Adjacent actions are correlated; chunking captures this naturally.
Bimanual manipulation. ACT was specifically validated on bimanual manipulation tasks; established the feasibility.
The ALOHA system. ACT was demonstrated on ALOHA - a low-cost bimanual teleoperation system. ALOHA-style hardware became standard infrastructure for bimanual manipulation research.
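The execute-then-replan loop for a chunk-predicting policy can be sketched as follows. This is a minimal sketch, not the ACT implementation: `run_chunked`, `toy_policy`, and `toy_env` are hypothetical names, and the 1-D environment stands in for a real robot.

```python
def run_chunked(policy, step_env, initial_obs, horizon, chunk_size, execute_per_plan):
    """Run a chunk-predicting policy in receding-horizon fashion.

    policy(obs) returns a list of chunk_size actions; only the first
    execute_per_plan of them are executed before the policy is queried again.
    Returns (final observation, number of policy queries).
    """
    assert 1 <= execute_per_plan <= chunk_size
    obs, steps, queries = initial_obs, 0, 0
    while steps < horizon:
        chunk = policy(obs)
        queries += 1
        for action in chunk[:execute_per_plan]:
            if steps >= horizon:
                break
            obs = step_env(obs, action)
            steps += 1
    return obs, queries

def toy_policy(obs, chunk_size=8, target=1.0):
    """Toy planner: equal steps that together close half of the remaining gap."""
    step = 0.5 * (target - obs) / chunk_size
    return [step] * chunk_size

def toy_env(obs, action):
    """Toy 1-D environment: the action is added to the position."""
    return obs + action

# Executing whole chunks: one policy query per 8 control steps.
final_obs, queries = run_chunked(toy_policy, toy_env, 0.0,
                                 horizon=100, chunk_size=8, execute_per_plan=8)
# Replanning every step: one policy query per control step.
_, queries_every_step = run_chunked(toy_policy, toy_env, 0.0,
                                    horizon=100, chunk_size=8, execute_per_plan=1)
```

Over a 100-step episode, executing full 8-action chunks needs 13 policy queries where replanning every step needs 100. How many actions to execute before replanning is a design choice that trades reactivity against decision frequency.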
Scaling imitation learning
The scaling trajectory. Modern imitation learning has scaled substantially through 2022-2026.
RT-1 (Brohan et al., Google DeepMind, 2022-2023). ~130K real-robot episodes collected over 17 months with a fleet of 13 robots. Substantial scale for the time.
Open X-Embodiment (Collaboration, 2023-2024). ~1M+ trajectories across 22+ robot platforms. Cross-embodiment imitation learning at substantial scale.
π0 (Physical Intelligence, October 2024). Even larger scale; substantial commercial investment in teleoperation infrastructure.
The scaling laws. Like other ML modalities, robotic imitation learning shows scaling-law behaviour (more data → better performance). Progress per dollar is slower than for text/image learning because robotic data is more expensive per sample, but the trajectory is positive.
The data-efficiency techniques. Multiple approaches reduce data needs.
Cross-embodiment transfer. Use data from other robots.
Simulation augmentation. Generate additional data in simulation.
Foundation-model substrate. Web pretraining transfers to robotic capability.
Video-based learning. Learn from human-demonstration videos.
LLM-based augmentation. Use LLMs to generate task variations and demonstrations.
Imitation + reinforcement
A specific hybrid. Many modern systems combine imitation learning (for initialization) with reinforcement learning (for refinement).
The pattern.
Pretrain a policy via imitation on demonstrations.
Deploy the policy in simulation or real-world rollouts.
Use RL (cross-reference RL §7) to refine the policy with reward signal.
The advantages. Imitation provides good starting policies; RL refines to overcome demonstration limitations.
Notable examples. Many modern systems (from research demonstrations through production deployment) use imitation + RL.
Failure modes of imitation learning
A practical inventory of imitation-learning failures.
Distribution shift. Policy encounters states unlike training; produces poor actions.
Demonstrator suboptimality. Demonstrations may not represent optimal behaviour; policy inherits demonstrator’s mistakes.
Limited generalization. Policy may not transfer well to novel tasks or environments.
Compounding error. Even small per-step errors accumulate over long trajectories.
Causal confusion. Policy may learn correlations rather than causal relationships (e.g., keying on the gripper's appearance in the image because demonstrations always end with the gripper in view, which is brittle when the visual context differs).
Hardware-specific overfitting. Policy may overfit to specific robot hardware; doesn’t transfer to similar but different platforms.
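The compounding-error item in this inventory can be made concrete with a little arithmetic. This is a minimal sketch under a loudly simplifying assumption: per-step errors are treated as independent with a fixed probability. In practice an error tends to push the policy into unfamiliar states where further errors become more likely, so the numbers below are optimistic. The 1% error rate is illustrative, not a measured value.

```python
def prob_any_error(per_step_error, horizon):
    """Probability of at least one error in `horizon` steps (independence assumed)."""
    return 1.0 - (1.0 - per_step_error) ** horizon

# Illustrative 1% per-step error rate at three episode lengths.
failure_by_horizon = {h: prob_any_error(0.01, h) for h in (10, 100, 1000)}
```

With a 1% per-step error rate, roughly 63% of 100-step episodes contain at least one error under this assumption, which is why long-horizon tasks are disproportionately hard for imitation-trained policies.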
Where imitation learning sits in 2026
The summary. Imitation learning is the dominant learning paradigm for modern manipulation. Diffusion policies and ACT are the dominant modern methods. Scale is substantial and growing. Combined with RL refinement, imitation provides substantially capable policies.
The remaining issues. Data collection cost remains substantial. Generalization beyond training distribution is uneven. Long-horizon manipulation reliability is incomplete.
The next section develops reinforcement learning for robotics - the complementary paradigm.
§7. Reinforcement Learning for Robotics
The complementary learning paradigm to imitation. Reinforcement learning trains policies from reward signals - the robot tries actions, observes rewards, updates the policy. This section develops the robotics-specific challenges, sim-to-real, domain randomization, offline RL for robotics, and sample efficiency.
Why RL for robotics
The motivating cases. RL is particularly suitable when:
Demonstrations are unavailable or insufficient. Some tasks have no good human demonstrators (e.g., extremely fast motions; specific industrial fine-tuning).
Optimization beyond imitation matters. The demonstrator may not be optimal; RL can refine.
Reward signals are easy to specify. Goal-reaching tasks, success-binary tasks, performance metrics.
Long-horizon credit assignment matters. RL with appropriate algorithms handles the credit-assignment problem (RL §4-§5).
Notable RL-for-robotics applications.
Legged locomotion. Sim-to-real RL is the dominant approach for modern quadruped and humanoid locomotion. Cross-reference §10.
Dexterous manipulation. OpenAI’s 2019 Rubik’s-Cube hand demonstration. Subsequent work.
Industrial tuning. Fine-tuning industrial controllers with RL.
Game-playing robots. Robotic table tennis, robotic chess.
Drone control. Autonomous flight; aerobatic maneuvering.
The sim-to-real gap
The dominant challenge for RL-for-robotics. Real-robot data collection is slow and expensive, so real-world sample budgets are severely limited (cross-reference §6 demonstrations and OP-RO-1). Most RL training therefore happens in simulation while deployment is on real robots, and the gap between simulated and real dynamics causes failures.
The sources of the gap.
Physics modeling errors. Simulators approximate physics; real physics has phenomena (contact dynamics, friction nonlinearities, deformable objects) that simulators model imperfectly.
Sensor modeling errors. Simulators may not perfectly model real sensor noise, latency, distortion.
Actuator modeling errors. Real actuators have characteristics (backlash, friction, saturation) hard to model exactly.
Environment modeling errors. The simulator’s world is built; the real world includes elements the simulator omits (dust, glare, novel objects).
System-identification errors. Even with good simulation, real robots may have slightly different parameters (mass distribution, joint friction) than assumed.
The impact. A policy trained in simulation may fail outright on the real robot - wrong joint torques, incorrect timing, missed contact events.
Sim-to-real techniques
The methodology that has substantially advanced sim-to-real.
Domain randomization. Tobin, Fong, Ray, Schneider, Zaremba, Abbeel (2017) “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World.”
The recipe. During simulation training, randomize the simulator parameters at the start of each episode. The policy learns to be robust to a distribution of physics rather than overfit to a specific simulator.
The variations.
Visual domain randomization. Randomize textures, lighting, camera positions; trained policy is robust to visual variations.
Dynamics domain randomization. Randomize physics parameters (masses, friction, motor strengths); trained policy is robust to physics variations.
Combined randomization. Both; standard for modern sim-to-real.
The empirical impact. Domain randomization markedly improves sim-to-real performance. The 2017 paper demonstrated transfer of object-localization models (used for grasping) from simulation to real robots; subsequent work scaled the idea to manipulation and locomotion tasks.
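The per-episode randomization recipe can be sketched as below. This is a minimal sketch, not tied to any particular simulator: the parameter names, the ranges in `RANGES`, and the uniform sampling are all illustrative assumptions.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SimParams:
    mass_scale: float            # dynamics: multiplier on link masses
    friction: float              # dynamics: contact friction coefficient
    motor_strength_scale: float  # dynamics: multiplier on actuator strength
    light_intensity: float       # visual: scene lighting level

# Hypothetical randomization ranges; real ranges are tuned per robot and task.
RANGES = {
    "mass_scale": (0.8, 1.2),
    "friction": (0.5, 1.25),
    "motor_strength_scale": (0.9, 1.1),
    "light_intensity": (0.3, 1.0),
}

def sample_sim_params(rng):
    """Draw one set of simulator parameters uniformly from RANGES."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})

def randomized_episodes(num_episodes, seed=0):
    """Yield (episode index, parameters), resampling at the start of each episode."""
    rng = random.Random(seed)
    for episode in range(num_episodes):
        params = sample_sim_params(rng)
        # A real pipeline would reset the simulator with `params` here and
        # then roll out the policy for one episode.
        yield episode, params

episodes = list(randomized_episodes(5))
```

Combining dynamics and visual parameters in a single sampled record mirrors the combined-randomization variant; dropping fields gives the visual-only or dynamics-only variants.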
Domain adaptation. Use techniques that explicitly adapt the policy from simulation to real. Examples: fine-tuning with small amounts of real data; learning a domain-invariant feature representation; adversarial domain adaptation.
System identification. Use real-robot data to update the simulator’s parameters (mass distribution, joint friction, actuator characteristics). Closes the gap by improving the simulation rather than the policy.
Real-world fine-tuning. Use the simulation-trained policy as initialization; fine-tune in real-world rollouts. Combines sample-efficient simulation pretraining with real-world adaptation.
Privileged information / teacher-student. Train a “teacher” in simulation with access to privileged information (true state); distill into a “student” using only deployment-available observations (sensor measurements). The student transfers to real-world more reliably.
The ANYmal locomotion success (cross-reference §2). Lee et al. (2020) used careful domain randomization + teacher-student training to achieve substantial sim-to-real transfer for quadruped locomotion. The recipe became standard for legged-robot RL.
The simulator landscape
The infrastructure for sim-to-real RL.
MuJoCo (originally developed by Roboti LLC; acquired by DeepMind in 2021). Standard robotics-simulation tool. Fast, accurate enough for many tasks, open-source since 2022.
Isaac Gym / Isaac Sim (NVIDIA). GPU-accelerated robotics simulation. Substantially faster than CPU-based for parallel-instance training; widely used for RL training at scale.
PyBullet (open-source). Lightweight; popular for research.
Gazebo (open-source). Mature; ROS-integrated; widely used for system testing.
Drake (Toyota Research Institute / open-source). Substantial physics modeling capability; multi-body dynamics with contact.
Genesis (2024+). Newer; emphasizes both rendering and simulation quality.
The trade-offs.
Speed. GPU-accelerated simulation (Isaac) is much faster than CPU-based simulation (MuJoCo, PyBullet) when running many parallel instances. Matters at scale.
Fidelity. Higher-fidelity simulators (Drake) better match real physics; slower.
Ecosystem. MuJoCo and Gazebo have substantial ecosystem (existing models, controllers, examples).
The 2026 landscape. Multiple simulators in active use. The choice depends on application - high-fidelity for precise modeling; GPU-fast for large-scale RL training; ecosystem-mature for system integration.
Offline RL for robotics
A specific application of offline RL (cross-reference RL §9) to robotic learning. Offline RL uses previously-collected trajectory data without further environment interaction.
The motivation for robotics. Real-robot data collection is expensive; once data is collected, maximizing its value via offline RL is appealing.
The methods (review from RL §9).
CQL (Conservative Q-Learning). Penalize Q-values for OOD actions.
IQL (Implicit Q-Learning). Use expectile regression to avoid OOD-action exploitation.
AWAC (Advantage-Weighted Actor-Critic). Imitation-weighted-by-advantage with smooth transition to online fine-tuning.
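Two of the objectives in this list reduce to short formulas. This is a minimal sketch of the scalar loss and weight only, not a full training loop: the function names are illustrative, the default values of `tau` and `temperature` are assumptions, and the clipping in `awac_weight` is an assumed stabilizer rather than part of the definition. CQL's penalty needs Q-values over sampled actions and is not shown.

```python
import math

def expectile_loss(diff, tau=0.7):
    """IQL-style asymmetric squared loss on diff = target - prediction.

    With tau > 0.5, under-predictions (diff > 0) are penalized more than
    over-predictions, pushing the estimate toward an upper expectile of the
    values seen in the data without querying out-of-distribution actions.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff

def awac_weight(advantage, temperature=1.0, max_weight=20.0):
    """AWAC-style weight on a logged action's log-likelihood: exp(A / temperature).

    Clipping at max_weight is an assumption added here for numerical stability.
    """
    return min(math.exp(advantage / temperature), max_weight)
```

For example, `expectile_loss(1.0)` is larger than `expectile_loss(-1.0)` at the default `tau`, and `awac_weight` gives logged actions with positive advantage more weight than those with negative advantage, which is what makes the update "imitation weighted by advantage".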
The robotics applications.
Learning from teleoperation data. Treat teleoperation trajectories as offline RL data; learn policies that match or exceed the demonstrator.
Learning from logged industrial-robot data. Industrial robots accumulate substantial trajectory data; offline RL can extract value.
Multi-task and multi-embodiment learning. Aggregate data across many sources; train policies offline.
The 2026 state. Offline RL for robotics is active research and emerging practice. Substantial industrial interest in extracting value from accumulated trajectory data. Some production deployment.
Sample efficiency at robot scale
A specific structural concern. Real-robot data is expensive. RL methods that are sample-efficient matter substantially more for robotics than for simulated environments.
The approaches.
Model-based RL. Cross-reference RL §8. Learn a model of the environment from data; plan or train policies in the model. Substantially more sample-efficient than model-free. Dreamer family applied to robotics.
Offline RL. As above; reuse existing data.
Cross-embodiment transfer. Train on data from many robots; transfer to target robot.
Sim-to-real with domain randomization. Train in simulation (essentially infinite data); transfer to real.
Imitation + RL. Initialize from imitation; refine with RL. Substantially more sample-efficient than RL from scratch.
Curriculum learning. Train on easy tasks first; gradually increase difficulty.
Meta-learning. Train policies that quickly adapt to new tasks with few examples.
The 2026 picture. Sample-efficiency techniques combine in production systems. Pure model-free RL with real-robot data is rare; some combination of pretraining + sim-to-real + offline RL + fine-tuning is standard.
RL for legged locomotion: the success story
The dominant successful application of RL in robotics. Quadruped locomotion has been substantially transformed by sim-to-real RL.
The pre-RL approach. Hand-designed controllers (e.g., MIT Cheetah’s controller). Substantial engineering per robot; limited adaptability to novel terrains.
The RL approach. Train a policy in simulation; deploy on real robot. The Lee et al. (2020) ANYmal work demonstrated the recipe; subsequent work (multiple groups) substantially refined.
The 2026 state. RL is the standard for quadruped locomotion. Boston Dynamics Spot, Unitree, ANYbotics, and multiple other quadruped vendors use RL-trained policies for locomotion, in some cases alongside classical controllers. The capability is markedly better than what hand-engineered controllers alone achieved.
The pattern. RL + sim-to-real + domain randomization + (optionally) teacher-student or curriculum. The standard recipe for legged robots.
RL for manipulation: less mature
A contrast. Manipulation has been less transformed by RL than locomotion.
The reasons.
Reward specification is harder. Locomotion reward (walk in a direction; don’t fall) is straightforward. Manipulation reward (perform this specific task successfully) is often hard to specify.
Sim-to-real is harder. Contact-rich manipulation involves fine-grained physics that simulators model imperfectly.
Action spaces are higher-dimensional. Manipulation requires fine motor control; sample complexity grows.
Imitation often works better. For many manipulation tasks, demonstrations are easier to obtain than reward signals.
The 2026 manipulation paradigm. Imitation learning (cross-reference §6) is dominant; RL is complementary. Pure-RL manipulation is rarer than pure-RL locomotion.
Where RL for robotics sits in 2026
The summary. RL is substantial and important but less universally applicable than imitation learning. Sim-to-real RL is the dominant approach for legged locomotion. RL refinement is standard for imitation-trained policies. Pure-RL manipulation is less common than RL-imitation hybrids.
The remaining issues. Sample efficiency at real-robot scale. Sim-to-real gap for contact-rich manipulation. Reward specification for complex tasks.
The next section develops vision-language-action models - the modern multimodal paradigm that combines imitation, RL, and foundation-model substrates.
§8. Vision-Language-Action Models
The dominant new paradigm in 2023-2026 robotics. Vision-Language-Action (VLA) models extend the multimodal paradigm to physical action; cross-reference Multimodal Models §9 for the broader multimodal treatment. This section develops the robotics-specific aspects.
The VLA framing
A specific architectural commitment. VLA models combine:
Vision - process camera images of the robot’s environment.
Language - process natural-language task instructions.
Action - produce motor commands (joint torques, end-effector poses, gripper commands).
The motivating insight. Robotic capability traditionally required:
Per-task engineering - hand-design each task.
Per-task training - train separate policy per task.
Per-environment adaptation - different environments require different code.
VLA replaces this with a general model that:
Accepts arbitrary visual input - same model handles different scenes.
Accepts arbitrary language input - same model handles different tasks.
Produces appropriate actions - generalized via foundation-model substrate.
The premise. A general multimodal model - pretrained on web-scale vision-language data, fine-tuned on robotic trajectories - substantially outperforms per-task specialized policies.
The architectural patterns
Two dominant architectures for VLA.
Decoder-only Transformer with action tokens. A vision-language model with the action space tokenized into the same vocabulary as text. The model autoregressively predicts text and action tokens.
The recipe (RT-2 style).
Backbone. Pretrained vision-language model (PaLI-X, PaLM-E, or similar).
Action tokenization. Discretize the action space (e.g., 256 bins per dimension); reuse existing text token IDs for action codes.
Training. Fine-tune on robot trajectories with action tokens; possibly co-train with web vision-language data.
Inference. Given image and instruction, autoregressively decode text + action tokens; convert action tokens back to motor commands.
The advantages. Simple architectural extension to existing VLMs. Action prediction inherits the VLM’s broad capabilities.
The disadvantages. Discrete action tokenization loses precision; continuous control may suffer.
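The discretization step, and the precision it gives up, can be sketched as follows. This is a minimal sketch assuming uniform bins over a known per-dimension range; the function names are illustrative, and the mapping from bin index to a specific text-token ID is model-specific and omitted.

```python
def action_to_token(value, low, high, num_bins=256):
    """Map a continuous action value in [low, high] to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * num_bins), num_bins - 1)

def token_to_action(token, low, high, num_bins=256):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / num_bins
    return low + (token + 0.5) * width

# Round trip for one action dimension normalized to [-1, 1].
original = 0.1234
token = action_to_token(original, -1.0, 1.0)
recovered = token_to_action(token, -1.0, 1.0)
quantization_error = abs(recovered - original)
```

With 256 bins over [-1, 1], the round-trip error is at most half a bin width (2 / 256 / 2 ≈ 0.004 in normalized units). Whether that matters depends on the physical range each dimension covers.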
Continuous action heads with flow-matching. π0-style.
Backbone. Pretrained vision-language model.
Action head. Flow-matching network (cross-reference Generative Models §7) producing continuous-valued action sequences.
Training. Train action head jointly with vision-language backbone on robot trajectories.
Inference. Given image and instruction, run flow-matching to generate action sequence.
The advantages. Continuous actions preserve precision. Flow-matching produces smooth action sequences.
The disadvantages. More complex architecture; action head adds parameters and training complexity.
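Inference with a flow-matching head amounts to integrating a learned velocity field from noise to an action sequence. This is a minimal sketch using fixed-step Euler integration and the convention that t=0 is noise and t=1 is data (an assumption; conventions vary). `toy_velocity` is a hypothetical stand-in for the trained network: it is the exact velocity field for straight-line paths when the data distribution is a single known action vector, so the result can be checked by hand. A real head would also be conditioned on the image and instruction.

```python
import random

def sample_actions(velocity_fn, action_dim, num_steps=10, rng=None):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = rng or random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(target):
    """Velocity field for a point-mass data distribution at `target`:
    v(x, t) = (target - x) / (1 - t)."""
    def velocity_fn(x, t):
        return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]
    return velocity_fn

target_action = [0.3, -0.5, 0.8]  # hypothetical 3-dimensional action
generated = sample_actions(toy_velocity(target_action), action_dim=3)
```

In this toy case the sampler lands on the target action up to floating-point error, whatever noise it starts from. With a learned field, the number of integration steps trades inference latency against accuracy.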
The 2026 state. Both patterns are used. RT-2 and OpenVLA use tokenized actions. Octo uses a diffusion action head, and π0 uses flow-matching; both produce continuous actions. Capability comparisons depend on task; both patterns can produce strong results.
Notable VLA systems (in chronological order)
A specific survey, complementing the Multimodal Models §9 treatment.
RT-1 (Robotic Transformer 1; Brohan et al., Google DeepMind, 2022-2023). First scaled Transformer-based robot policy. ~130K real-robot episodes collected over 17 months with a fleet of 13 robots. Substantial cross-task generalization within the trained distribution.
PaLM-E (Driess et al., Google DeepMind, March 2023). Embodied multimodal language model - PaLM with multimodal input including robotic state and visual input.
RT-2 (Brohan et al., Google DeepMind, July 2023). VLA with VLM backbone. Substantial generalization beyond training distribution via web-pretraining transfer.
RT-X (Open X-Embodiment Collaboration, 2023-2024). Train VLA models on the Open X-Embodiment dataset (1M+ trajectories, 22+ platforms). Demonstrated substantial cross-embodiment generalization.
OpenVLA (Kim, Pertsch, Karamcheti, Xiao et al., June 2024). Open-source VLA model. Prismatic-7B vision-language backbone. ~970K trajectories from Open X-Embodiment. Democratized VLA research.
Octo (Octo Team, 2024). Open-source generalist robot policy. Transformer architecture, diverse training data.
π0 (Physical Intelligence, October 2024). Flow-matching action head on vision-language backbone. Substantial dexterous-manipulation capability (folding laundry, packing groceries).
π0.5 (Physical Intelligence, 2025). Successor with improved capability.
Helix (Figure AI, February 2025). VLA for Figure humanoid; dual-system architecture (high-level reasoning + low-level control).
Gemini Robotics (Google DeepMind, 2025). VLA model integrated with Gemini foundation-model line.
RT-X2 and successor RT-family models (Google DeepMind, 2024-2025). Continued advances in the RT lineage.
MolmoAct, NVIDIA’s GR00T family (2024-2026). Various competing VLA systems.
Production deployment of VLA in 2026
A specific assessment. By 2026, VLA models have moved from research demonstrations to early production deployment in specific contexts.
The deployment contexts.
Industrial pilots. Figure AI deployed at BMW (2024+). Apptronik Apollo at Mercedes (2024+). Boston Dynamics Atlas in pilot industrial deployments. Pilots involve specific tasks (parts handling, simple assembly) in semi-structured environments.
Logistics. Agility Robotics Digit at Amazon (2024+). Mobile-manipulator deployment for tote moving.
Warehousing. Multiple deployments of mobile manipulators for picking and packing.
Research deployments. Substantial academic and industrial-research use of OpenVLA, π0, and related models for research.
The deployment caveats. VLA reliability is imperfect; most deployments involve substantial human oversight and constrained task scope. Production deployment at full autonomy on complex tasks is not yet routine.
The honest 2026 picture. VLA is transitioning from research to production. Specific applications (well-defined tasks in known environments) are at production stage. General-purpose deployment (arbitrary tasks in arbitrary environments) is not yet ready.
VLA training pipelines
A specific operational concern. How are VLA models actually trained?
The pattern.
VLA TRAINING PIPELINE (illustrative)
1. VISION-LANGUAGE PRETRAINING
(Reuse existing VLM checkpoints - Gemini, PaLI, Prismatic, etc.)
2. ROBOTIC TRAJECTORY COLLECTION
- Teleoperation (humans control robots; data recorded).
- Existing open data (Open X-Embodiment).
- Cross-embodiment data from previous robots.
- Synthetic data from simulation.
Total: 100K to 1M+ trajectories.
3. VLA TRAINING
Fine-tune vision-language backbone on robot trajectories.
Either:
(a) Tokenize actions into VLM vocabulary; autoregressive training.
(b) Add flow-matching action head; train jointly.
Substantial compute investment (hundreds of GPU-days).
4. DEPLOYMENT EVALUATION
Test on held-out tasks; held-out environments; novel objects.
Iterate based on observed failure modes.
5. CONTINUOUS UPDATING
Collect additional teleoperation data from deployment failures.
Periodically retrain with augmented data.
The cost. VLA training at competitive scale requires substantial investment:
Teleoperation infrastructure ($M+ for substantial data collection).
Compute (hundreds of thousands of GPU-hours).
Engineering team (dozens of researchers and engineers).
Multiple robot platforms for cross-embodiment training.
The economic implication. VLA training at frontier scale is the province of well-funded labs (frontier-AI labs; well-funded robotics startups; major tech companies). Open-source efforts (OpenVLA, Octo) provide academic access but typically lag frontier capability.
VLA failure modes
A specific inventory.
Generalization gaps. VLA models trained on specific tasks/environments may fail on novel ones. The generalization is better than per-task policies but not complete.
Embodiment-specific overfitting. Trained on specific robot hardware; transfer to slightly different hardware may degrade.
Long-horizon manipulation failures. Multi-step tasks (folding all items in a laundry basket) accumulate errors. Long-horizon reliability remains hard (OP-RO-3).
Edge-case failures. Unusual scenarios (unusual lighting, rare objects, complex backgrounds) can produce poor actions.
Slow decision-making. VLA inference on frontier models is slow compared to classical controllers; latency can be a constraint for fast-response tasks.
Safety incidents. Production deployment occasionally produces unsafe actions; safety layers must catch these.
Where VLA sits in 2026
The summary. VLA is the dominant new paradigm in 2023-2026 robotics, with heavy commercial activity, heavy research investment, and rapid capability advancement. Early production deployment for specific applications.
The remaining issues. Generalization gaps persist. Long-horizon manipulation reliability is incomplete. Production deployment with full autonomy is not yet routine. Cost barriers limit who can train frontier VLA models.
The trajectory. VLA is one of the fastest-improving areas in 2024-2026 robotics. Continued advancement expected; the destination (general-purpose reliable VLA) is approached but not reached.
The next section develops world models - the model-based complement to VLA.
§9. World Models and Model-Based Approaches
A complementary paradigm to imitation and reinforcement learning. World models are learned models of the environment that enable planning in imagination - predict consequences of actions; choose actions based on predicted consequences. This section develops the robotics-specific aspects; cross-reference RL §8 for the broader model-based-RL framework.
Why world models for robotics
The motivating insight. Real-robot data is expensive (cross-reference §6 demonstrations and OP-RO-1). If we can build a learned model that predicts robot dynamics and environment evolution, we can:
Plan ahead by imagining action consequences.
Train policies in the model rather than on the real robot.
Evaluate policies without expensive real-world rollouts.
Transfer across tasks - same world model serves many tasks.
The promise. World models trade additional model-learning effort for substantially reduced policy-learning effort. For sample-efficiency-bounded robotics, the trade-off is often favourable.
The challenges. World models are imperfect; planning in imperfect models can produce bad policies; the model-quality bottleneck limits performance.
Dreamer family in robotic application
The dominant world-model RL family. Dreamer (Hafner et al., 2019, 2020, 2023; cross-reference RL §8) trains a recurrent latent dynamics model jointly with policy and value functions; the policy is trained on imagined rollouts in the learned model.
The robotic applications.
DreamerV3 for robotics. Hafner, Pasukonis, Ba, Lillicrap (2023) “Mastering Diverse Domains through World Models.” DreamerV3 achieved strong performance across many domains (Atari, DMC continuous control, Crafter, MineRL) with a single fixed set of hyperparameters. It has since been applied to robotic RL.
DayDreamer (Wu et al., 2022). Apply Dreamer to real robots; demonstrated learning manipulation policies from real-world data.
Multi-task Dreamer variants. Extend Dreamer to learn world models across multiple tasks; enable cross-task transfer.
The strengths. Substantial sample efficiency improvements over model-free RL. Single algorithm across diverse domains.
The limitations. World-model quality bounds policy quality. Long-horizon predictions accumulate error. Real-world manipulation has substantial sim-to-real-style challenges even with learned models.
Genie and learned environment models
A different mode of learned models. Genie (Bruce et al., DeepMind, 2024) “Genie: Generative Interactive Environments.” Train a model that, given an image and an action sequence, generates the resulting image sequence - effectively a video generation model conditioned on actions.
The application to robotics. Genie-style models can serve as learned simulators - generate plausible future states given current state and actions. The “simulator” is a generative model trained on observational data, not a hand-engineered physics simulation.
Genie 2 (DeepMind, 2024). Extended capability; can generate playable environments from single images.
Cosmos (NVIDIA, 2025+). Physical world models specifically targeted at robotics. Generate realistic future video given current state and action commands. Substantial training compute; commercial deployment direction.
The robotic applications.
Trained simulators for sim-to-real. Use learned simulators in place of (or augmenting) hand-engineered simulators.
Synthetic data generation. Generate diverse training data using learned world models.
Predictive control. Use world models for short-horizon prediction in MPC.
The 2026 state. Generative world models for robotics are research-stage to early-production. Substantial investment (NVIDIA Cosmos, DeepMind work, others); commercial deployment trajectory is positive.
Simulation as model
A specific framing. Hand-engineered simulators (MuJoCo, Isaac, Drake, Gazebo, etc.) are themselves a form of world model - a structured, physics-based model of the environment.
The advantages of hand-engineered simulators.
High fidelity for many tasks (rigid-body dynamics; common materials).
Extensibility (add new objects, new robots).
Mature ecosystem (existing models, tools, debugging).
Well-understood limitations (known where they fail).
The disadvantages.
Sim-to-real gap for some tasks (contact-rich; deformables; complex fluids).
Engineering cost to model new environments.
Doesn’t learn from data (unlike generative models).
The 2026 picture. Hand-engineered simulators are the dominant world model for robotic RL. Learned generative models are complementary for cases where hand-engineering is inadequate (e.g., photorealistic rendering for vision-based policies; novel environment generation).
The hybrid pattern. Use hand-engineered physics for dynamics; use learned models for rendering or specific hard-to-model components (e.g., deformable cloth dynamics).
Predictive control with learned models
A specific application. Model Predictive Control (cross-reference §4) is a classical technique; it requires a model of the dynamics. Use learned dynamics models for MPC.
The pattern.
MPC WITH LEARNED MODEL
At each control step:
1. Observe current state.
2. Use learned dynamics model to predict consequences of
candidate action sequences over short horizon.
3. Choose action sequence optimizing some objective
(reach goal; avoid obstacles; minimize energy).
4. Execute first action.
5. Repeat.
The advantages. Combines classical-control reliability with learned-model flexibility. Handles tasks with complex dynamics that hand-engineered models miss.
The 2026 deployment. Learned-model MPC is used in production for some applications (humanoid balance control; some manipulation tasks). Not universal; classical MPC with hand-engineered models still dominates many use cases.
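The control loop above can be sketched end to end on a toy problem. This is a minimal sketch, not a production controller: `learned_model` is a hand-written 1-D point-mass stand-in for a learned dynamics model, the cost weights are arbitrary, and the planner only enumerates a handful of constant-action candidates, where real systems typically use sampling-based or gradient-based optimizers. All names are illustrative.

```python
def learned_model(state, action, dt=0.1):
    """Stand-in for a learned dynamics model: 1-D point with (position, velocity)."""
    position, velocity = state
    velocity = velocity + dt * action
    position = position + dt * velocity
    return (position, velocity)

def sequence_cost(model, state, actions, goal):
    """Roll the model forward and sum squared goal error plus a small effort penalty."""
    cost = 0.0
    for action in actions:
        state = model(state, action)
        cost += (state[0] - goal) ** 2 + 0.01 * action ** 2
    return cost

CANDIDATE_ACTIONS = (-1.0, -0.5, 0.0, 0.5, 1.0)

def mpc_action(model, state, goal, horizon=10):
    """Steps 2-4: score each candidate sequence with the model, return its first action."""
    best_cost, best_action = float("inf"), 0.0
    for a in CANDIDATE_ACTIONS:
        cost = sequence_cost(model, state, [a] * horizon, goal)
        if cost < best_cost:
            best_cost, best_action = cost, a
    return best_action

def run_closed_loop(steps=100, goal=1.0):
    """Steps 1-5 repeated; here the 'real' system is the same as the model."""
    state = (0.0, 0.0)
    for _ in range(steps):
        state = learned_model(state, mpc_action(learned_model, state, goal))
    return state

final_position, final_velocity = run_closed_loop()
```

Because only the first action of each plan is executed before replanning, errors in the model's longer-range predictions are corrected at the next step. That replanning is what lets MPC tolerate an imperfect learned model over short horizons.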
The cross-reference to scientific simulation
A specific connection. World models for robotics share structure with scientific simulators (cross-reference AI for Science §9 on physics simulators). Both predict physical-system evolution; both face accuracy-vs-speed trade-offs; both benefit from learned components.
The differences. Robotic world models focus on control-relevant dynamics (what happens when this robot takes this action?); scientific simulators focus on fidelity-to-physics (what does this physical system actually do?). Both are valuable; the design priorities differ.
Where world models sit in 2026
The summary. World models are one of several approaches in modern robotics. Dreamer-family RL is substantial research with some production application. Generative world models (Genie, Cosmos) are emerging. Hand-engineered simulators remain the dominant world model in production.
The remaining issues. Learned world models’ fidelity is limited; long-horizon prediction accumulates error. Production deployment of learned world models is uneven across applications.
The trajectory. Continued advancement in learned world models; continued use of hand-engineered simulators; increasing hybrid approaches. World models are one tool among several, not the dominant paradigm.
§10. Locomotion
The robotic capability category with substantial recent success. Modern locomotion - especially quadruped - has been transformed by sim-to-real RL. This section develops quadruped, biped, and humanoid locomotion; the deployment landscape; the broader pattern.
Quadruped locomotion
The dominant success story in modern robotics. Quadruped robots (four-legged) have become substantial commercial reality in 2018-2026.
Boston Dynamics Spot. Launched 2019; substantial commercial deployment by 2024-2026 (~$75K each; deployed at hundreds of industrial sites for inspection, monitoring, mapping). Spot uses learned and classical controllers; the locomotion is reliable and impressive.
Unitree quadrupeds (Go1, Go2, B1, B2). Chinese manufacturer producing much cheaper quadrupeds ($20-50K commercial). Substantial commercial uptake and research adoption.
ANYbotics ANYmal. Industrial inspection robot; substantial deployment in oil and gas, mining, infrastructure inspection. Uses sim-to-real RL extensively.
Other quadrupeds. Multiple Chinese manufacturers (DEEP Robotics, XiaoMi CyberDog, others); various research platforms.
The capability state in 2026. Quadrupeds:
Walk on rough terrain (rubble, stairs, slopes, snow). Reliable.
Navigate complex environments (factories, construction sites, outdoor terrain). Substantially reliable.
Carry payloads (up to 10-20kg depending on model). Reliable.
Climb obstacles (low obstacles; some can climb stairs reliably). Mostly reliable.
Long-duration operation (hours of continuous operation). Reliable.
The commercial deployment. Industrial inspection is the dominant deployment category. Tens of thousands of quadrupeds deployed globally by 2026.
Bipedal locomotion
A harder capability. Two-legged robots are inherently less stable than four-legged; locomotion is harder.
Cassie (Agility Robotics, research-historical). Research bipedal platform; demonstrated substantial capability through RL-based locomotion.
Digit (Agility Robotics, commercial). Humanoid-form bipedal robot with arms. Commercial deployment in logistics at GXO, with pilot testing at Amazon.
Atlas (Boston Dynamics). Iconic research humanoid; demonstrated substantial capabilities (parkour, somersaults, manipulation). Hydraulic version retired; electric version (2024) continues the lineage.
The capability state. Bipedal locomotion is substantially less mature than quadruped. Stable walking on smooth surfaces is reliable; rough-terrain walking is uneven; reliability under perturbation is imperfect.
Humanoid locomotion
The 2024-2026 commercial frontier. Full humanoid robots - bipedal with two arms - have become an area of substantial commercial activity.
The notable platforms.
Figure 01 / Figure 02 (Figure AI). Commercial humanoid; substantial demos; pilot deployments at BMW.
1X NEO (1X Technologies). Consumer-targeted humanoid; pilot deployments.
Apptronik Apollo. Commercial humanoid; pilot deployments at Mercedes.
Sanctuary AI Phoenix. Commercial humanoid; substantial demos.
Tesla Optimus. In development; multiple versions; substantial Tesla investment.
Boston Dynamics Atlas (electric). Research/demo; commercial trajectory uncertain.
Unitree G1 / H1. Chinese humanoids at substantially lower prices (about $90K for the H1, with the G1 priced well below that).
Agility Digit. Commercial deployment (above).
XPeng IRON, Engineered Arts Ameca, multiple others. Substantial humanoid ecosystem.
The capability state in 2026. Humanoid locomotion is advancing rapidly but is substantially less mature than quadruped locomotion. Walking is reliable on flat surfaces for production-grade humanoids. Stair climbing is moderately reliable. Recovery from perturbations is uneven. Long-duration operation is limited (battery life, thermal management).
The commercial trajectory. Pilot deployments at major customers (BMW, Mercedes, Amazon, others) in 2024-2026. Whether this scales to substantial production deployment by 2027-2028 is one of the most-watched commercial questions.
The locomotion learning revolution
The methodological transformation. Lee, Hwangbo, Wellhausen, Koltun, Hutter (2020) “Learning quadrupedal locomotion over challenging terrain” was the catalyst. Sim-to-real RL substantially advanced quadruped locomotion; subsequent work spread the approach.
Notable contributions.
Lee et al. ANYmal. Foundational sim-to-real RL for quadruped locomotion.
Margolis et al. (MIT, 2022+). “Walk These Ways” and successors. Demonstrate gait control via RL; substantial capability advances.
Wang et al. (2022-2024). Multiple contributions to RL-based locomotion at MIT, Berkeley, Stanford, ETH.
Smith et al. (UC Berkeley, 2022). Real-world RL for quadruped locomotion (training on the real robot rather than primarily in simulation).
Cheetah series (MIT Cheetah, Mini Cheetah). Hardware and control co-design; substantial influence on the field.
Boston Dynamics’ Atlas RL transition. Atlas’s electric version uses substantially more learned components than the earlier hydraulic version.
Humanoid locomotion advances 2024-2026. Figure, 1X, Apptronik, Tesla, others using RL-based locomotion for humanoid bipedal walking. Substantial capability advances.
The standard recipe in 2026.
STANDARD LOCOMOTION RL RECIPE
1. Build accurate-enough simulator with domain randomization.
2. Train policy with PPO or similar (substantial parallel-environment
training; tens of thousands of simulated robot-hours).
3. Use teacher-student training: privileged "teacher" with full state
access; "student" with only deployment-available observations.
4. Deploy student to real robot.
5. Optionally fine-tune with limited real-world data.
Result: substantially capable locomotion policy that handles
varied terrain, perturbations, and tasks.
The 2026 state. This recipe is standard production technology for legged robots. The implementation details vary; the conceptual framework is well-established.
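Steps 1 and 3 of the recipe can be illustrated with a toy sketch. Everything named here is a hypothetical stand-in: `SimParams`, the randomization ranges, the gain `KP`, and the one-joint "robot" are invented for illustration, and the teacher is a hand-written controller rather than a PPO-trained policy. The sketch shows only the data flow: physics is re-sampled per episode, the teacher acts on privileged simulator state, and the student is fitted by supervised imitation using deployment-available observations alone.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Physics parameters randomized per episode (illustrative choices)."""
    friction: float     # ground friction coefficient
    payload_kg: float   # extra mass carried by the robot
    motor_scale: float  # ratio of delivered to commanded torque


def sample_sim_params(rng: random.Random) -> SimParams:
    """Step 1: domain randomization. The ranges are made up for illustration."""
    return SimParams(
        friction=rng.uniform(0.3, 1.2),
        payload_kg=rng.uniform(0.0, 5.0),
        motor_scale=rng.uniform(0.8, 1.2),
    )


KP = 2.0  # hypothetical proportional gain of the stand-in teacher


def teacher_action(joint_angle: float, params: SimParams) -> float:
    """Stand-in for an RL-trained teacher (step 3).

    It reads privileged simulator state (motor_scale) that the real robot
    cannot observe, so it compensates exactly for weak or strong motors.
    """
    return -KP * joint_angle / params.motor_scale


def distill_student(num_samples: int = 20_000, lr: float = 0.01, seed: int = 0) -> float:
    """Fit a student u = w * joint_angle that imitates the teacher.

    The student sees only the joint angle, so it learns one gain that works
    on average across the randomized motors. Plain SGD on squared error.
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(num_samples):
        params = sample_sim_params(rng)         # new physics for every sample
        angle = rng.uniform(-1.0, 1.0)          # toy proprioceptive reading
        target = teacher_action(angle, params)  # label from privileged teacher
        error = w * angle - target
        w -= lr * 2.0 * error * angle           # gradient of error**2 w.r.t. w
    return w


if __name__ == "__main__":
    print(f"student gain: {distill_student():.3f}")
```

The student cannot match the teacher exactly because it lacks `motor_scale`; it settles near the gain that is best on average over the randomized range, which is the robustness that domain randomization is meant to buy.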
Real-world deployment of learned locomotion
A specific assessment. Learned locomotion is substantially deployed in production in 2026.
Industrial inspection. Boston Dynamics Spot at hundreds of sites for autonomous inspection of factories, refineries, construction sites. ANYbotics ANYmal at similar deployments. Substantial economic value.
Security and patrol. Quadrupeds for facility security; autonomous patrol routes.
Search and rescue (research). Quadrupeds for navigating disaster sites; emerging application.
Logistics. Digit at Amazon and GXO for tote moving.
Manufacturing pilots. Humanoids at BMW, Mercedes, others for parts handling.
Consumer / hobby. Substantial consumer market for Unitree Go2 and similar consumer quadrupeds.
The aggregate. Legged robots - especially quadrupeds - generate substantial revenue and operate in many production environments. The technology has crossed the threshold from research demonstrations to commercial reality.
Where locomotion sits in 2026
The summary. Quadruped locomotion is substantially mature and commercially deployed. Humanoid locomotion is advancing rapidly but less mature; commercial deployment is at pilot stage. The methodological substrate - sim-to-real RL with domain randomization and teacher-student - is well-established.
The remaining issues. Humanoid locomotion reliability needs to advance for broader deployment. Battery life and thermal management constrain long-duration operation. Multi-terrain robustness is uneven.
The trajectory. Continued improvement; expanded commercial deployment; whether humanoids cross the threshold to widespread production deployment is the central commercial question.
The next section develops manipulation - the complementary capability category that is substantially less mature than locomotion.
§11. Manipulation
The capability category complementing locomotion. Manipulation - using robotic arms and grippers to interact with objects - has substantially advanced in 2022-2026 but remains less mature than locomotion. This section develops dexterous manipulation challenges, grasping, bimanual manipulation, long-horizon manipulation, and the modern π0-style systems.
Why manipulation is hard
The structural difficulty. Compared to locomotion, manipulation faces several harder problems.
Contact-rich dynamics. Manipulation involves substantial contact with objects; contact dynamics are hard to simulate and hard to model. Locomotion has contact too, but the relevant contacts (feet with ground) are more regular.
Dexterity requirements. Many manipulation tasks require fine motor control - placing objects precisely, applying appropriate force, handling delicate items. Locomotion typically requires less precision.
Object diversity. Manipulation interacts with diverse objects (different shapes, weights, materials, deformabilities). Locomotion interacts primarily with surfaces and obstacles.
Task diversity. “Manipulation” spans an enormous range - picking and placing; folding; assembling; pouring; opening; closing; cutting; tying. Locomotion has fewer fundamentally different sub-tasks.
Reward specification. Locomotion has natural rewards (forward velocity; don’t fall). Manipulation rewards vary substantially per task and are often hard to specify.
Failure consequences. Manipulation failures can damage objects, robots, or environments. Locomotion failures usually just stop the robot.
The implication. Manipulation requires more domain-specific engineering, more data, more careful evaluation. Progress is slower than locomotion.
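The reward-specification contrast can be made concrete. A minimal sketch of a generic locomotion reward follows; the `StepInfo` fields, the weights, and the height threshold are all hypothetical choices for illustration, not values from any particular system. The "forward velocity" term is written here as tracking a commanded speed, which is one common way to express it.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Quantities assumed to be available from the simulator each step."""
    forward_velocity: float     # m/s along the commanded heading
    torso_height: float         # m above the ground
    joint_torques: list[float]  # N*m, one entry per actuated joint


def locomotion_reward(
    step: StepInfo,
    target_velocity: float = 1.0,  # commanded speed (illustrative)
    min_height: float = 0.25,      # below this the robot counts as fallen
    torque_weight: float = 1e-3,   # small penalty on actuation effort
    fall_penalty: float = 10.0,
) -> float:
    """Reward = velocity tracking - effort - penalty for falling."""
    tracking = -abs(step.forward_velocity - target_velocity)
    effort = -torque_weight * sum(t * t for t in step.joint_torques)
    fallen = step.torso_height < min_height
    return tracking + effort - (fall_penalty if fallen else 0.0)
```

The same few terms apply to almost any legged platform. No comparably generic formula exists for "fold the shirt" or "pack the bag", which is why manipulation rewards are usually engineered per task or replaced by demonstrations.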
Grasping: from classical to learned
The most-foundational manipulation capability. Grasping - picking up an object - is required by most manipulation tasks.
The classical approach. Hand-designed grasp planning based on force-closure analysis (Bicchi and Kumar 2000; Mason 2001 Mechanics of Robotic Manipulation textbook). Given a model of the object’s geometry, compute grasp poses that achieve force-closure (the grasp resists arbitrary external forces).
The limitations. Requires explicit object models; brittle for objects with unusual geometry; doesn’t handle novel objects without per-object engineering.
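The force-closure condition has a simple closed form in the most basic case: a planar two-finger grasp with point contacts and Coulomb friction. In that case the grasp is force-closure when the line joining the two contacts lies strictly inside both friction cones (a classical result usually attributed to Nguyen). The sketch below checks that condition; the function name and argument layout are illustrative assumptions, and real grasp planners work in 3D with more contacts.

```python
import math


def antipodal_force_closure(p1, n1, p2, n2, mu):
    """Planar two-contact force-closure test.

    p1, p2: contact points (x, y) on the object boundary.
    n1, n2: inward-pointing contact normals (any nonzero length).
    mu:     Coulomb friction coefficient at both contacts.
    """
    half_angle = math.atan(mu)  # half-width of each friction cone

    def angle_between(u, v):
        norm = math.hypot(*u) * math.hypot(*v)
        if norm == 0.0:
            raise ValueError("contacts must be distinct and normals nonzero")
        cos = (u[0] * v[0] + u[1] * v[1]) / norm
        return math.acos(max(-1.0, min(1.0, cos)))

    d12 = (p2[0] - p1[0], p2[1] - p1[1])  # direction from contact 1 to 2
    d21 = (-d12[0], -d12[1])              # direction from contact 2 to 1
    return (angle_between(d12, n1) < half_angle
            and angle_between(d21, n2) < half_angle)


if __name__ == "__main__":
    # Parallel-jaw grasp squarely across a box: contacts directly opposed.
    print(antipodal_force_closure((-1, 0), (1, 0), (1, 0), (-1, 0), mu=0.5))
    # Same jaws, but one contact slid sideways: the joining line leaves the cone.
    print(antipodal_force_closure((-1, 0), (1, 0), (1, 1.5), (-1, 0), mu=0.5))
```

The check needs the contact locations and surface normals, which is exactly the explicit object model that the classical approach depends on.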
The deep-learning era. Dex-Net (Mahler et al., Berkeley, 2016-2018). Learn grasp quality predictors from synthetic data. Substantially advanced grasping capability.
GraspNet (Fang et al., 2020). Large-scale grasp dataset and benchmark. Trained models predict grasp poses for diverse objects.
Modern grasp models (2024-2026). Transformer-based and VLM-based grasp predictors. Substantially better generalization to novel objects.
Modern grippers. From simple parallel-jaw grippers to dexterous multi-finger hands (Shadow Hand; Allegro Hand; modern integrated humanoid hands). The hardware advance enables more dexterous capability.
The 2026 state. Grasping is substantially mature for many objects. Common-shape objects in known environments are grasped reliably; novel objects, transparent/reflective objects, deformable objects remain challenging.
Bimanual manipulation
A specific capability advance. Bimanual manipulation uses two arms cooperating on tasks that single-arm manipulation cannot perform.
The motivating tasks. Many real-world tasks require two hands - folding laundry, opening packages, cooking, assembly.
The architectural challenge. Coordinating two arms requires joint planning that respects both arms’ kinematics and the task’s constraints.
ALOHA (Zhao et al., Stanford, 2023). Low-cost bimanual teleoperation setup; substantial influence on bimanual research. Mobile ALOHA extension added mobility.
Action Chunking with Transformers (ACT; cross-reference §6). Originally developed for bimanual manipulation; demonstrated substantial capability.
Modern bimanual systems in production humanoids. Figure, Apptronik, Tesla, and other humanoid systems have two arms; bimanual capability is essential.
The 2026 state. Bimanual manipulation is active research with growing production deployment. Substantial capability advances in 2024-2025; substantial open problems remain.
Long-horizon manipulation
The hardest manipulation challenge. Long-horizon manipulation involves multi-step tasks (folding all items in a laundry basket; assembling a piece of furniture; packing groceries).
The challenges.
Compounding errors. Each step has some failure probability; many steps multiply.
Plan adaptation. Early steps may fail; subsequent steps must adapt.
State tracking. Track progress through the task; recognize completion.
Recovery from failure. Detect and respond to errors.
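The compounding-error and recovery points can be quantified with a back-of-the-envelope model. The sketch assumes that step failures are independent, that every failure is detected, and that a retry has the same success probability as the first attempt; real tasks violate all three to some degree, so the numbers are indicative only.

```python
def task_success(p_step: float, n_steps: int, retries: int = 0) -> float:
    """Probability that all n_steps succeed, allowing retries per step."""
    # A step fails only if the first attempt and every retry fail.
    step_with_retries = 1.0 - (1.0 - p_step) ** (retries + 1)
    return step_with_retries ** n_steps


def required_step_success(target: float, n_steps: int) -> float:
    """Per-step success rate needed to reach a target task success rate."""
    return target ** (1.0 / n_steps)


if __name__ == "__main__":
    print(f"20 steps at 95%, no recovery:  {task_success(0.95, 20):.3f}")
    print(f"20 steps at 95%, one retry:    {task_success(0.95, 20, retries=1):.3f}")
    print(f"per-step rate for 90% overall: {required_step_success(0.90, 20):.4f}")
```

Under these assumptions a 20-step task at 95% per-step success completes only about 36% of the time, while a single detected-and-retried failure per step lifts that to roughly 95%. This is why failure detection and recovery matter as much as raw per-step skill.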
The approaches.
Hierarchical planning + per-step manipulation. Decompose the task into substeps; execute each with substep-specific policy.
Long-horizon VLA models. End-to-end policies trained on long demonstrations. π0 and successors demonstrate substantial capability.
Mixed classical + learned. Classical task planning at high level; learned policies for substeps.
The 2026 state. Long-horizon manipulation is the frontier of robotic manipulation. Substantial capability advances 2024-2025 (π0’s laundry folding; commercial demos); reliability gaps remain (OP-RO-3).
π0 and the modern manipulation frontier
A specific 2024-2026 landmark. π0 (Physical Intelligence, October 2024) substantially advanced dexterous manipulation. Cross-reference §8 for the VLA-architectural details.
The capabilities demonstrated.
Folding laundry - pick up a shirt; manipulate it through the folding sequence; stack folded items.
Packing groceries - pick items; arrange them in bags appropriately.
Bussing tables - clear and clean tables in restaurants.
Many household tasks.
The methodological contributions. Flow-matching action prediction; large-scale teleoperation data collection; substantial commercial investment.
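At inference time, flow-matching action prediction amounts to integrating a learned velocity field from a noise sample to an action. The sketch below shows only that sampling loop with a toy analytic field. The time convention (t = 0 is noise, t = 1 is the action), the Euler integrator, the step count, and both function names are assumptions for illustration and are not taken from the π0 paper; a real policy would replace `make_toy_velocity` with a network conditioned on images, language, and robot state.

```python
def sample_action(velocity_fn, noise, num_steps=10):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (action)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field is finite
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Exact velocity field when the 'dataset' is a single action `target`.

    With straight-line paths x_t = (1 - t) * noise + t * target, the
    velocity at a point x and time t is (target - x) / (1 - t).
    """
    def velocity(x, t):
        return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    target = [0.2, -0.5, 1.0]  # hypothetical 3-DoF action
    action = sample_action(make_toy_velocity(target), noise=[1.3, 0.4, -2.0])
    print([round(a, 6) for a in action])
```

Because sampling takes a handful of integration steps rather than one token per action dimension, this style of action head suits the high-rate continuous control that dexterous manipulation needs.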
The follow-up.
π0.5 (2025). Successor with improved capability.
Helix (Figure AI, February 2025). VLA for Figure humanoid; dual-system architecture combining slow high-level reasoning with fast low-level control.
Multiple competitors. Skild AI, Multion, multiple stealth and open companies. Substantial research investment.
The 2026 state. Modern VLA-based manipulation is the leading edge of robotic capability. Reliability for specific tasks in specific environments is substantial. Reliability for arbitrary tasks in arbitrary environments is not yet achieved.
Where manipulation sits in 2026
The summary. Manipulation advanced substantially in 2022-2026 but remains less mature than locomotion. Grasping is mature for many object classes. Bimanual manipulation is active research with growing production deployment. Long-horizon manipulation is the frontier.
The remaining issues. Cross-reference OP-RO-3 (long-horizon manipulation reliability). Dexterous manipulation in arbitrary environments. Tactile-perception integration. Cost-effective dexterous hardware.
The next section covers the deployed applications - industrial and service robotics.
§12. Industrial and Service Robotics
The deployment landscape. This section surveys industrial robotics, service robotics, and specialized domains (surgical, agricultural) - where modern robotics has and has not transformed practice.
Industrial robotics
The mature commercial substrate. Industrial robotics has been a commercial reality since the 1960s; the AI transformation has been incremental rather than disruptive.
The traditional industrial-robot deployment.
Arm robots in manufacturing for welding, painting, assembly, palletizing. Several million units in operation globally. The dominant industrial-arm vendors are FANUC, ABB, Yaskawa, and KUKA.
Pick-and-place machines in electronics, packaging, pharmaceuticals. Specialized high-speed robots.
Mobile bases (AGVs / AMRs) in warehouses for material movement. Substantial growth 2010-2026.
Collaborative arms (cobots) designed to operate near humans without safety cages. Universal Robots, Franka Emika, others. Growing market.
The 2022-2026 AI transformation.
AI-augmented industrial arms. Vision-guided picking, learned grasping for diverse parts. Modest but real capability advances.
Mobile manipulation deployment. Combining mobile bases with manipulators for warehouse picking. Amazon Robotics Sparrow, Covariant, others.
Vision-AI inspection. Deep-learning quality inspection replacing or augmenting traditional vision systems.
Collaborative-arm capability expansion. Cobots gaining more AI capability (vision, learned manipulation).
The deployment scale. Industrial robotics generated $50B+ annual revenue by 2025. The AI transformation adds value within an already-large established market.
Service robotics
A different category. Service robotics operates in human environments - homes, offices, public spaces, healthcare facilities.
Cleaning robots. Robotic vacuum cleaners (Roomba, Roborock, others) are mature consumer products. Commercial floor-cleaning robots are widely deployed in airports, retail, and offices.
Delivery robots. Sidewalk delivery (Starship Technologies, Serve Robotics, Coco); restaurant delivery; hotel delivery (Savioke / Relay). Substantial commercial deployment for specific use cases.
Hospitality. Robotic concierges; robotic waiters; robotic bartenders. Some commercial deployment; primarily novelty applications.
Healthcare robotics. Robots for patient handling, medication delivery, telepresence, rehabilitation. A substantial and growing market.
Eldercare robotics. Robots designed to assist elderly people; tasks include companionship, monitoring, basic assistance. Substantial Japanese commercial activity; emerging Western activity.
Educational robotics. Robots for STEM education; programmable platforms for learning.
The 2026 state. Service robotics is a commercial reality in specific niches but not universal. Cleaning robots are a mature mass-market product; other service categories are niche or emerging.
Surgical robotics
A specialized commercial success. Surgical robotics - robotic systems for medical procedures - is a substantial commercial reality.
Intuitive Surgical’s da Vinci. The dominant surgical robotics platform, used in millions of procedures globally. Teleoperated (a surgeon controls the robot); not autonomous. The robot provides enhanced surgical capability - better dexterity, better visualization, less invasive procedures.
Other surgical robotics. Medtronic Hugo, J&J Ottava, Stryker Mako (orthopedic), Smith+Nephew CORI, multiple others.
Autonomous surgical components. Some specific surgical tasks have autonomous robotic implementations (e.g., autonomous suturing research; specific orthopedic tasks).
The 2026 state. Surgical robotics is mature commercial technology for teleoperation; autonomous surgery remains research-stage with substantial regulatory and safety hurdles.
Agricultural robotics
A growing application. Agricultural robotics covers robots for farming tasks (harvesting, weeding, planting, monitoring).
Harvesting robots. Tomato pickers, strawberry pickers, apple pickers. Specialized per-crop. Substantial research; emerging commercial.
Weeding robots. Laser weeders (Carbon Robotics); mechanical weeders. Substantial commercial deployment.
Autonomous tractors. Several companies (John Deere autonomous tractor; Monarch Tractor; others). Substantial commercial activity.
Crop monitoring drones. Substantial market for agricultural drones.
Dairy automation. Robotic milking (Lely, DeLaval); substantial commercial deployment.
The 2026 state. Agricultural robotics is a substantial and growing market. Specific tasks (milking, weeding, some monitoring) are mature; harvesting remains hard for many crops.
Warehouse and logistics robotics
A specific high-growth category. Warehouse robotics has been substantially transformed.
Amazon Robotics. Acquired Kiva Systems 2012; deployed >750K mobile robots in Amazon warehouses by 2024. Substantial economic impact on warehouse operations.
Other warehouse robotics. Locus Robotics, AutoStore, GreyOrange, Geek+, multiple others. Substantial commercial deployment.
Mobile manipulators in warehouses. Combining mobile bases with arms for picking. Amazon Sparrow, Covariant, Pickle Robot, others. Growing deployment.
Humanoid robots in warehouses. Pilot deployments of humanoids (Figure, 1X, Apptronik) for warehouse tasks. Early stage but substantial investment.
The 2026 state. Warehouse robotics is a substantial commercial reality. The deployment economics are clear; substantial growth is expected.
Autonomous vehicles
A specific subdomain that deserves brief mention. Autonomous vehicles are a substantial subfield of robotics with their own conventions; this chapter does not develop them in depth.
The 2026 state.
Waymo (Alphabet) operating commercial autonomous taxi service in multiple US cities. Substantial commercial deployment.
Tesla FSD providing driver-assistance with periodic full-autonomy demonstrations; commercial trajectory uncertain.
Cruise (GM) - substantial setbacks 2023; restructured 2024.
Chinese autonomous-vehicle companies (Baidu Apollo, Pony.ai, WeRide) operating commercial services in Chinese cities.
Autonomous trucking (Aurora, Kodiak, Plus) emerging commercial trajectory.
Autonomous vehicles are one of the largest robotics subdomains in commercial impact. The technical methodology overlaps substantially with general robotics; the deployment context differs.
The deployment economics
A specific consideration. When does robotic deployment make economic sense?
The factors.
Robot cost. Capital cost (purchase price + installation).
Operating cost. Energy; maintenance; software licensing; integration.
Labour displacement value. Cost of human labour replaced (with appropriate caveats about employment effects).
Capability gap. Tasks where humans are substantially better remain humans’ domain; tasks where robots are competitive get automated.
Specific deployment friction. Hardware reliability; software complexity; integration challenges.
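The factors above combine into a simple payback calculation. The sketch is deliberately crude: it ignores discounting, financing, and downtime, and it folds the capability gap into a single `utilization` fraction (the share of the human task the robot can actually perform). That modelling choice is an assumption made here, and all the numbers in the example are hypothetical.

```python
from typing import Optional


def payback_years(
    capital_cost: float,           # purchase price plus installation
    annual_operating_cost: float,  # energy, maintenance, licences, integration
    annual_labour_value: float,    # yearly cost of the labour being replaced
    utilization: float = 1.0,      # fraction of that labour the robot covers
) -> Optional[float]:
    """Years until cumulative net savings equal the capital cost.

    Returns None when the robot never pays for itself.
    """
    net_annual_saving = annual_labour_value * utilization - annual_operating_cost
    if net_annual_saving <= 0:
        return None
    return capital_cost / net_annual_saving


if __name__ == "__main__":
    # Hypothetical robot: 150K up front, 20K per year to run, 70K labour value.
    for u in (1.0, 0.5, 0.2):
        print(f"utilization {u:.0%}: payback = {payback_years(150_000, 20_000, 70_000, u)}")
```

The sensitivity to `utilization` is the point: a robot that handles the whole task pays back in three years in this example, one that handles half takes ten, and one that handles a fifth never does.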
The 2026 deployment patterns. Robotics deployment is expanding in:
High-volume repetitive tasks (warehouse picking, manufacturing).
Hazardous environments (inspection in dangerous locations).
Tasks where humans are scarce (eldercare in aging societies; agricultural labour shortages).
Specialized high-value tasks (surgical robotics; precision manufacturing).
Robotics deployment is limited in:
Tasks requiring substantial dexterity (most household manipulation).
Tasks requiring substantial generality (general-purpose assistants in unstructured environments).
Tasks where humans are abundant and cheap (many service tasks in low-wage labour markets).
The trajectory. As robotics capabilities improve and costs decline, the deployable task envelope expands. The pace depends on technological progress, cost trajectories, and labour-market dynamics.
Where deployment sits in 2026
The summary. Robotics deployment is substantial and growing across many categories. Industrial robotics is mature. Warehouse robotics is growing rapidly. Surgical robotics is mature for teleoperation. Service robotics is emerging in specific niches. Humanoid robotics is at an early pilot stage with substantial uncertainty. Autonomous vehicles are deployed at scale in specific contexts.
The next sections close out the chapter: §13 covers connections; §14 covers critiques/limitations/open problems; §15 further reading; §16 exercises.
§13. Connections to Other Chapters
This chapter integrates substantial content from many others; cross-references throughout.
Reinforcement Learning §7 (continuous control), §8 (model-based RL), §9 (offline RL) all directly inform robotic learning. This chapter applies RL to robotics; the RL chapter develops the algorithms. Cross-references particularly central in §7 and §9 of this chapter.
Multimodal Models §9 develops vision-language-action models as a multimodal pattern. This chapter §8 applies it to robotics. The two chapters cover the same machinery from complementary angles.
AI Agents §11 covers embodied agentic safety. The agentic concerns translate to robotics with physical-action specifics. Cross-referenced throughout this chapter’s safety discussion.
Generative Models §6 (diffusion) and §7 (flow-matching) underlie modern action policies. Diffusion Policy uses diffusion; π0 uses flow-matching. The Generative Models chapter develops the machinery; this chapter applies it.
Foundation Models provides the FM-as-substrate framing. Modern VLA models are foundation models; the FM scaling and adaptation framing applies.
Self-Supervised Learning is the pretraining substrate; modern VLA models inherit from SSL pretraining.
Large Language Models is the substrate for the language-understanding component of VLA. Cross-referenced for the VLM-component aspects.
Causality §10 is relevant for interventional reasoning in physical contexts. Robotic systems intervene in the world; causal reasoning about intervention effects matters for safety and design.
Alignment §11 covers agentic safety; robotic deployment raises specific safety concerns developed in this chapter.
AI for Science §6 covers materials chemistry; some scientific applications integrate with robotic synthesis (autonomous chemistry labs). Cross-references for the autonomous-chemistry connection.
Evaluation §10 covers cross-cutting evaluation; this chapter covers robotics-specific evaluation challenges (notably the real-robot vs simulation trade-off; cross-embodiment evaluation).
Mechanistic Interpretability §10 covers MI for safety; mechanistic interpretability for robotic policies is an underdeveloped area with growing interest.
Deep Learning §4-§6 develop architectural primitives (CNNs, Transformers, attention) that robotic perception and policy networks build on.
§14. Critiques, Limitations, and Open Problems
This section presents both substantive critiques of the modern-robotics direction and the consolidated open-problems list.
“Robotics is data-bottlenecked”
A prominent critique. The argument: modern robotics’ progress is bottlenecked by data. Unlike text (web-scale data available) or images (LAION-5B etc.), robotic-trajectory data is scarce and expensive. The capability ceiling is limited by what data can be collected.
The substantive content. Robotic data is substantially less abundant than text or image data. The Open X-Embodiment dataset (~1M trajectories) is small compared to text corpora (trillions of tokens) or image corpora (billions of images). The data bottleneck is real.
The pushback.
Synthetic data (simulation) provides essentially unlimited data; sim-to-real transfer increasingly bridges to real.
Cross-embodiment training leverages data from many robots; substantially amortizes per-platform data costs.
Web-pretraining transfer (RT-2 style) imports broad capability from non-robotic data.
Video-based learning leverages observational video of human demonstrations.
The chapter’s position. The data-bottleneck concern is real but partially addressed. Continued progress depends on whether data-efficiency techniques continue advancing.
“The hardware lags the software”
A specific critique. The argument: AI/software capability has substantially advanced in 2022-2026; robotic hardware has advanced more slowly. The bottleneck is increasingly hardware (cost, reliability, dexterity), not algorithms.
The substantive content. Hardware costs (especially humanoids) remain substantial; reliability gaps for some hardware (humanoid balance; complex grippers) are real; manufacturing scale lags software innovation.
The pushback.
Costs are declining substantially (Unitree quadrupeds at $3-10K; humanoid prices declining).
Reliability is improving - modern humanoids substantially more reliable than 2022 baselines.
Substantial commercial investment is going to hardware (Figure, 1X, Tesla, Sanctuary, Apptronik raising hundreds of millions).
The chapter’s position. The hardware-vs-software gap is real but narrowing. The 2024-2026 humanoid commercial activity reflects substantial hardware advances; whether hardware can keep pace with rapidly-advancing software-side capabilities is open.
“Sim-to-real is unsolved”
A specific technical critique. Despite substantial advances, sim-to-real transfer remains imperfect for many tasks - particularly contact-rich manipulation and tasks involving deformable materials.
The pushback. Sim-to-real is substantially solved for legged locomotion (cross-reference §10). Manipulation sim-to-real remains harder; substantial active research; gradual progress. The framing matters - “unsolved” overstates the case; “imperfect for some categories” is accurate.
The chapter’s position. Sim-to-real is partially solved. Locomotion: substantially solved. Manipulation: substantial advances, substantial remaining work. OP-RO-7 captures the open frontier.
The humanoid hype
A specific present-tense concern. The 2024-2026 humanoid robotics boom has substantial commercial activity but also substantial overpromising. Marketing materials show capabilities far beyond what production-grade deployment can reliably do.
The critique. Humanoid robotics is in a hype cycle analogous to earlier AI cycles. Substantial capital has been deployed and the demonstrations are impressive, but a wide gap remains between demonstration and reliable deployment. Whether the commercial trajectory can be sustained depends on closing that gap.
The chapter’s position. Honest characterization (this chapter’s approach) acknowledges both the capability advances and the gap. The commercial trajectory of humanoid robotics is one of the most-watched technology questions of 2026-2028.
The labour-displacement concern
A substantive societal critique. Robotic automation displaces human labour. The 2022-2026 robotics advances (especially humanoids) have potential to displace substantial categories of human work.
The implications.
Warehouse and manufacturing roles increasingly automated.
Service-industry roles potentially impacted (delivery, cleaning, basic care).
Long-term agricultural labour shifting.
The honest framing. Labour-market effects of robotics are substantial but uneven. Some affected workers retrain; some don’t. Aggregate productivity gains accrue to capital; distribution of gains depends on policy. The chapter notes this concern; the broader analysis lives in the Alignment / Ethics chapter and the labour-economics literature.
Open problems
Consolidated open-problems list. Each carries an OP-RO-N identifier for cross-chapter reference.
OP-RO-1. Sample efficiency at real-robot scale. Real-robot data collection is expensive. Methods that learn from limited real-robot data (combining sim-to-real, cross-embodiment, video-based, foundation-model transfer) are central to robotic-AI progress. Whether sample efficiency can advance fast enough to support general-purpose robotic capability is open.
OP-RO-2. Generalization across embodiments. Policies trained on one robot platform often don’t transfer well to others. Cross-embodiment training (Open X-Embodiment) helps but doesn’t solve the problem. Universal robotic foundation models that work across diverse hardware are an open research direction.
OP-RO-3. Long-horizon manipulation reliability. Multi-step manipulation tasks accumulate errors; reliable long-horizon completion remains hard. Whether VLA scaling will resolve this or whether qualitatively different approaches are needed is open.
OP-RO-4. Safe deployment in unstructured environments. Robots in unstructured environments (homes; outdoor spaces; healthcare facilities) encounter rare situations that can produce unsafe actions. Designing for safe deployment in such environments - combining classical safety guarantees with learned-policy flexibility - is an ongoing engineering challenge.
OP-RO-5. Cost-effective robotic hardware. Hardware costs constrain deployment. Cost-effective humanoid robots (sub-5K range) and high-quality tactile sensors would substantially expand deployment. Manufacturing scale and design innovations both matter.
OP-RO-6. Robotic foundation models. Whether universal robotic foundation models - trained on broad data, deployable across diverse hardware and tasks - emerge as a substantial paradigm. The 2024-2026 progress (RT-X, OpenVLA, π0) suggests yes; production-grade universal robotic FMs are not yet realized.
OP-RO-7. Sim-to-real transfer. Some categories (legged locomotion) have substantial sim-to-real solutions. Others (contact-rich manipulation; deformable-object manipulation; tasks involving complex physics) remain hard. Continued progress on sim-to-real for the harder categories is open.
OP-RO-8. Multi-robot coordination. Coordinating multiple robots on shared tasks raises both classical (multi-agent planning) and modern (multi-agent reinforcement learning) challenges. Substantial industrial multi-robot deployments (warehouse robots) work because of careful per-deployment engineering; general-purpose multi-robot coordination is open.
OP-RO-9. Tactile and force perception. Tactile sensing has substantially lagged vision in development. Robust tactile perception at scale would unlock substantial manipulation capability. Cross-reference §5.
OP-RO-10. Whole-system reliability. Production robotic deployment requires whole-system reliability - hardware, software, perception, planning, control, error handling. Each component may be 99%+ reliable; combined system reliability may be substantially lower. Whole-system reliability engineering is an underdeveloped area.
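The arithmetic behind OP-RO-10 is short. A minimal sketch, assuming the components fail independently and that any single component failure fails the whole task (a series system); the component list mirrors the one above and the 99% figures are illustrative, not measurements.

```python
import math


def series_reliability(components: dict[str, float]) -> float:
    """Probability that every component works, assuming independence."""
    return math.prod(components.values())


if __name__ == "__main__":
    stack = {
        "hardware": 0.99,
        "software": 0.99,
        "perception": 0.99,
        "planning": 0.99,
        "control": 0.99,
        "error_handling": 0.99,
    }
    r = series_reliability(stack)
    print(f"system reliability: {r:.3f}")
    print(f"expected tasks per failure: {1.0 / (1.0 - r):.1f}")
```

Six components at 99% each give a system that succeeds about 94% of the time, or roughly one failure every 17 tasks. Correlated failures and graceful degradation change the numbers in practice, but not the direction of the effect.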
§15. Further Reading
Opinionated annotated list.
Classical robotics foundations
Siciliano, B., and Khatib, O. (eds.) (2016). Springer Handbook of Robotics (2nd ed.). The canonical reference for classical robotics. ~1500 pages; comprehensive.
Spong, M. W., Hutchinson, S., and Vidyasagar, M. (2020). Robot Modeling and Control (2nd ed.). Textbook for kinematics, dynamics, control.
Thrun, S., Burgard, W., and Fox, D. (2005). Probabilistic Robotics. Foundational mobile-robotics textbook.
LaValle, S. M. (2006). Planning Algorithms. Comprehensive motion-planning reference. Available free online.
Modern robotics learning
Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., and Hutter, M. (2020). “Learning quadrupedal locomotion over challenging terrain.” Science Robotics. Foundational sim-to-real RL for legged locomotion.
Tobin, J., et al. (2017). “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World.”
Chi, C., et al. (2023). “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.”
Zhao, T. Z., et al. (2023). “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.” Introduces ACT (Action Chunking with Transformers) and the ALOHA teleoperation platform.
Vision-language-action
Brohan, A., et al. (Google 2022). “RT-1: Robotics Transformer for Real-World Control at Scale.”
Brohan, A., et al. (Google DeepMind 2023). “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.”
Driess, D., et al. (Google and TU Berlin 2023). “PaLM-E: An Embodied Multimodal Language Model.”
Kim, M. J., et al. (2024). “OpenVLA: An Open-Source Vision-Language-Action Model.”
Black, K., et al. (Physical Intelligence 2024). “π0: A Vision-Language-Action Flow Model for General Robot Control.”
Open X-Embodiment Collaboration (2023-2024). “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.”
Humanoid robotics
Figure AI, 1X Technologies, Apptronik, Tesla Optimus - substantial commercial-website content and technical demos.
Boston Dynamics technical reports and demos on Atlas and Spot.
Various humanoid-locomotion papers (Margolis et al., Wang et al., Smith et al.).
Industrial and commercial
International Federation of Robotics annual reports - industrial-robotics market data.
Various specific company technical materials (Boston Dynamics, Unitree, Universal Robots, ABB).
Surgical and specialized
Intuitive Surgical da Vinci technical documentation.
Various specialized-robotics application papers.
Reading-order recommendation
For someone new to modern robotics: start with the Siciliano-Khatib handbook for classical foundations, then Thrun-Burgard-Fox for probabilistic robotics. Next, read Lee et al. 2020 (ANYmal) for the modern sim-to-real RL paradigm, RT-2 (Brohan et al. 2023) for the VLA paradigm, and π0 (Black et al. 2024) for the modern manipulation frontier. Add OpenVLA and Diffusion Policy for the open-source landscape.
§16. Exercises and Experiments
Research-style exercises that develop robotics-specific skills.
E1. Classical control on a simulated arm. Use MuJoCo or PyBullet. Implement PID control for joint-space tracking. Implement inverse kinematics for end-effector positioning. Reflect on the classical-control approach.
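As a warm-up for E1 before moving to MuJoCo or PyBullet, the joint-space PID loop can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a reference solution: the plant is a hypothetical single revolute joint with constant inertia, viscous damping, and a constant load torque standing in for gravity, and the gains and physical constants are illustrative values chosen only so the loop is stable.

```python
class PID:
    """Textbook PID controller acting on a scalar tracking error."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None  # no derivative estimate until a second sample exists

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_joint(pid, target, steps=5000, dt=0.001,
                   inertia=0.1, damping=0.05, load_torque=0.5):
    """Integrate inertia * q'' = tau - damping * q' - load_torque (semi-implicit Euler)."""
    q, qd = 0.0, 0.0  # joint angle (rad) and velocity (rad/s)
    for _ in range(steps):
        tau = pid.step(target - q, dt)
        qdd = (tau - damping * qd - load_torque) / inertia
        qd += qdd * dt
        q += qd * dt
    return q


# Same plant, with and without the integral term (illustrative gains).
q_pd = simulate_joint(PID(kp=20.0, ki=0.0, kd=2.0), target=1.0)
q_pid = simulate_joint(PID(kp=20.0, ki=50.0, kd=2.0), target=1.0)
print(f"PD  final angle: {q_pd:.3f} rad")
print(f"PID final angle: {q_pid:.3f} rad")
```

With these values the PD controller should settle short of the 1.0 rad target by about load_torque / kp = 0.025 rad, while the integral term removes that steady-state error. A real arm adds configuration-dependent gravity and coupling between joints, which is part of what the exercise asks the reader to reflect on.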
E2. RL policy in simulation. Use Isaac Gym (or its successor, Isaac Lab) or Gymnasium. Train a small RL policy (PPO or SAC) on a simulation task (Ant locomotion from the Gymnasium MuJoCo environments, FetchReach from Gymnasium-Robotics, etc.). Reflect on training dynamics and sample efficiency.
E3. Diffusion Policy implementation. Implement Diffusion Policy on a simple manipulation task in simulation. Compare to a behaviour-cloning baseline. Investigate the multimodal-distribution advantage.
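The multimodal-distribution advantage mentioned in E3 can be previewed with a toy example before implementing anything. The sketch below is not Diffusion Policy: it uses made-up one-dimensional demonstrations in which, at a single state, demonstrators steer either left (about -1) or right (about +1) around an obstacle, and it stands in for a generative policy by resampling the demonstrated actions. The `is_valid` check and its margin are hypothetical.

```python
import random
import statistics

random.seed(0)

# Hypothetical demonstrations at one state: half steer left, half steer right,
# with a little noise. Steering near 0 means driving into the obstacle.
demos = [random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05) for _ in range(1000)]

# A deterministic policy trained with mean-squared error converges to the
# mean of the demonstrated actions for this state.
bc_action = statistics.fmean(demos)

# Stand-in for a generative policy: draw actions from the demonstrated
# distribution instead of averaging it.
sampled_actions = [random.choice(demos) for _ in range(1000)]


def is_valid(action, margin=0.5):
    """An action avoids the obstacle if it commits clearly to one side."""
    return abs(action) > margin


valid_fraction = sum(is_valid(a) for a in sampled_actions) / len(sampled_actions)
print(f"MSE behaviour-cloning action: {bc_action:+.3f} (valid: {is_valid(bc_action)})")
print(f"fraction of sampled actions that are valid: {valid_fraction:.2f}")
```

The averaged action lands between the two modes and collides, while every sampled action commits to one side. In the exercise, comparing success rates of the behaviour-cloning baseline and Diffusion Policy on tasks with several valid strategies tests whether the same effect appears at scale.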
E4. Small VLA experiment. Using OpenVLA or similar open model, set up a small VLA pipeline. Test on a small task. Investigate generalization to slight variations of the task.
E5. Sim-to-real investigation. For a specific simulated task, characterize the sim-to-real gap. Identify which physics aspects (friction, contact, dynamics) cause the gap. Apply domain randomization; measure improvement.
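For the domain-randomization step in E5, the core mechanism is to resample physics parameters at the start of every episode. The sketch below shows only that sampling loop. The parameter names, the ranges, and `toy_episode` are hypothetical placeholders; in practice `run_episode` would configure a real simulator with the sampled values and roll out the policy.

```python
import random

# Hypothetical randomization ranges; real ranges should bracket values
# measured (or plausibly expected) on the target robot.
RANDOMIZATION_RANGES = {
    "friction_coefficient": (0.4, 1.2),
    "payload_mass_kg": (0.0, 0.5),
    "motor_strength_scale": (0.8, 1.2),
    "control_latency_s": (0.0, 0.04),
}


def sample_sim_params(rng):
    """Draw one set of physics parameters, uniformly within each range."""
    return {name: rng.uniform(low, high)
            for name, (low, high) in RANDOMIZATION_RANGES.items()}


def collect_randomized_returns(run_episode, num_episodes, seed=0):
    """Run episodes under freshly sampled physics; return (params, return) pairs."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_episodes):
        params = sample_sim_params(rng)
        results.append((params, run_episode(params)))
    return results


def toy_episode(params):
    """Stand-in for a simulator rollout: a made-up score that drops as
    friction moves away from a nominal value of 0.8."""
    return 1.0 - abs(params["friction_coefficient"] - 0.8)


results = collect_randomized_returns(toy_episode, num_episodes=100)
worst_return = min(episode_return for _, episode_return in results)
print(f"worst return across randomized physics: {worst_return:.3f}")
```

Logging the sampled parameters alongside each return, as done here, also helps with the diagnostic half of the exercise: sorting episodes by return shows which physics parameters the policy is most sensitive to.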
E6. Open X-Embodiment exploration. Download a subset of Open X-Embodiment data. Train a small cross-embodiment policy. Investigate cross-platform generalization.
E7. Deployed-product analysis. Pick a deployed robotics product (Spot, da Vinci, Unitree quadruped, Roomba). Analyze its technical architecture from publicly available information. Identify which components are classical vs learned, and how the system is integrated.
E8. Tactile-perception experiment. Using a dataset collected with a tactile sensor (GelSight, DIGIT, others), train a small tactile-perception model. Compare with vision-only baselines on a contact-rich task.
E9. World-model experiment. Implement a small Dreamer-style world model. Train on a simple environment; use for policy training. Compare to model-free RL.
E10. Safety-case construction. For a hypothetical robotic deployment (e.g., a humanoid in a warehouse), construct a structured safety case. Specify hazards; identify mitigations; assess residual risks. Reflect on the safety-engineering challenges of robotic deployment.