A modern dexterous-manipulation pipeline consists of several layers. (1) Perception: RGB-D cameras, depth sensors, and tactile sensors on fingertips (e.g. GelSight, DIGIT) provide observations of pose and contact. (2) State representation: a VLA model or a separate encoder (CNN, ViT, point-cloud network) compresses raw sensor data into a compact state vector. (3) Policy: a neural network (MLP, transformer, diffusion policy) produces an action vector each step, typically joint targets for all hand DoFs. (4) Training: imitation learning from teleoperated demonstrations, RL in simulation (Isaac Gym, MuJoCo) with domain randomization for sim-to-real transfer, or a hybrid IL+RL approach (residual policies). (5) Execution: a low-level controller (joint-impedance, operational-space control) converts actions into actuator torques at 100 to 1000 Hz.
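As a minimal sketch of the execution layer in (5), the snippet below turns joint targets from the policy into torque commands using a per-joint spring-damper (joint-impedance) law. The gain values, the torque limit, and the names `JointImpedanceGains` and `impedance_torques` are illustrative assumptions and are not taken from any particular hand or controller.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class JointImpedanceGains:
    """Hypothetical gains shared by all joints (real hands tune these per joint)."""
    stiffness: float      # Kp in N*m/rad
    damping: float        # Kd in N*m*s/rad
    torque_limit: float   # symmetric actuator limit in N*m


def impedance_torques(
    q_target: Sequence[float],
    q: Sequence[float],
    q_dot: Sequence[float],
    gains: JointImpedanceGains,
) -> List[float]:
    """Compute tau_i = Kp * (q_target_i - q_i) - Kd * q_dot_i, clipped to the limit."""
    if not (len(q_target) == len(q) == len(q_dot)):
        raise ValueError("q_target, q and q_dot must have the same length")
    torques = []
    for qt, qi, qdi in zip(q_target, q, q_dot):
        tau = gains.stiffness * (qt - qi) - gains.damping * qdi
        # Saturate so a large position error cannot command an unsafe torque.
        tau = max(-gains.torque_limit, min(gains.torque_limit, tau))
        torques.append(tau)
    return torques


# Usage: two joints, the first is 0.1 rad short of its target,
# the second is on target but still moving at 1 rad/s.
gains = JointImpedanceGains(stiffness=2.0, damping=0.1, torque_limit=0.5)
tau = impedance_torques([0.1, 0.0], [0.0, 0.0], [0.0, 1.0], gains)
```

Lower stiffness makes the fingers more compliant on contact, at the cost of tracking accuracy; that trade-off is one reason impedance-style controllers are a common choice under a learned policy.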
Classical two-finger grippers struggle with objects of complex shape, with manipulation that requires in-hand re-orientation, and with delicate operations that need force control. Dexterous manipulation addresses the problem of general, adaptive object manipulation in unstructured environments, which is a prerequisite for humanoids, home robots, and advanced industrial automation.
A mechanical hand with many degrees of freedom (Shadow Hand: 24 DoF, Allegro: 16, Inspire: 12, robotic hands of Tesla/Figure humanoids: 11 to 17 DoF). The physical interface between the policy and the world.
Fingertip sensors (GelSight, DIGIT, ReSkin, piezoresistive arrays) measure contact force, slip and local surface geometry. Critical for tasks that require force control.
A neural network mapping observations to actions. Modern variants: MLP/transformer for RL tasks, diffusion policy for imitation learning, VLA models for tasks requiring natural-language reasoning.
A mechanism that lets policies trained in simulation work on a physical robot. Typically domain randomization (sampling friction, masses, latencies), domain adaptation, residual policy, or fine-tuning on a small number of real-world demonstrations.
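Domain randomization as described above can be sketched as drawing a fresh set of physics parameters for every simulated episode. The parameter ranges and the names `PhysicsParams`, `RANGES`, and `sample_physics` below are hypothetical; in practice the ranges are chosen to bracket measurements from the real robot, and the sampled values are written into the simulator before each reset.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    friction: float          # fingertip/object friction coefficient
    object_mass_kg: float    # mass of the manipulated object
    action_latency_s: float  # delay between policy output and actuation


# Hypothetical uniform ranges (low, high); not taken from any published setup.
RANGES = {
    "friction": (0.5, 1.5),
    "object_mass_kg": (0.03, 0.30),
    "action_latency_s": (0.0, 0.04),
}


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Draw one independent uniform sample per parameter."""
    values = {name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
    return PhysicsParams(**values)


# Usage: one parameter set per training episode; a seeded generator keeps runs reproducible.
rng = random.Random(0)
episode_params = [sample_physics(rng) for _ in range(4)]
```

Because the policy never sees the same physics twice, it is pushed toward behaviour that works across the whole range, with the aim that the real robot falls somewhere inside it.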
Policies that work perfectly in simulation can fail completely on a physical robot due to differences in friction, actuator latency, sensor noise, and contact dynamics.
The policy finds unexpected ways to maximise reward (e.g. spinning a cube by finger vibration instead of coordinated grasping).
Most simulators do not model tactile sensors with sufficient fidelity, which prevents learning force-controlled tasks purely in simulation.
Foundational work by Mason, Salisbury and Bicchi on form- and force-closure grasps, hand kinematics, and grasp analysis.
First commercial 24-DoF anthropomorphic multi-fingered end-effector (Shadow Robot Company), which became the de facto standard for dexterous-manipulation research.
Ken Goldberg's group showed that deep learning on synthetic grasp datasets transfers effectively to physical robots, laying a foundation for modern learned grasping.
A neural policy trained in massively parallel simulation with domain randomization solved a Rubik's cube with a Shadow Hand: the first spectacular RL success in dexterous manipulation.
NVIDIA released GPU-native simulation with thousands of parallel environments; training dexterous-manipulation policies dropped from days to hours.
Teams at Stanford and Berkeley showed that demonstrations collected with low-cost teleoperation rigs (ALOHA, DexCap) enable effective imitation-learning policies without RL, offering an alternative path to RL plus sim-to-real.
Vision-Language-Action models with billions of parameters, trained on massive robotic-demonstration corpora, became a leading approach in dexterous manipulation by integrating natural-language understanding with policy generation.
Number of controlled hand joints. Affects both manipulation expressiveness and learning difficulty (higher action-space dimensionality).
Low-level control loop rate. Higher frequencies allow reacting to contact dynamics; lower frequencies reduce policy compute cost.
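The trade-off between control rate and policy compute cost is commonly handled by running the two at different rates: the policy is queried occasionally and its last target is held while the low-level controller runs every tick. The sketch below simulates that schedule by counting steps instead of using a real-time clock; `run_multirate`, the 1000 Hz and 50 Hz defaults, and the toy single-joint callables in the usage lines are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def run_multirate(
    policy: Callable[[float], float],
    controller: Callable[[float, float], float],
    read_state: Callable[[], float],
    n_control_steps: int,
    control_hz: int = 1000,
    policy_hz: int = 50,
) -> Tuple[int, List[float]]:
    """Run a fast control loop that re-queries the slow policy every few ticks.

    Returns the number of policy calls and the list of controller commands.
    """
    if policy_hz <= 0 or control_hz % policy_hz != 0:
        raise ValueError("control_hz must be a positive multiple of policy_hz")
    steps_per_action = control_hz // policy_hz
    target = 0.0
    policy_calls = 0
    commands: List[float] = []
    for step in range(n_control_steps):
        state = read_state()
        if step % steps_per_action == 0:
            # Expensive call (e.g. a neural network); made only once per policy period.
            target = policy(state)
            policy_calls += 1
        # Cheap call made every tick, using the most recent (held) target.
        commands.append(controller(target, state))
    return policy_calls, commands


# Usage with a toy single-joint system: constant state, constant target, P-controller.
calls, cmds = run_multirate(
    policy=lambda state: 1.0,
    controller=lambda target, state: 2.0 * (target - state),
    read_state=lambda: 0.0,
    n_control_steps=100,
)
```

With these defaults the policy runs once every 20 control ticks, so 100 ticks cost 5 policy evaluations while the controller still reacts to the measured state on every tick.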
Which sensors feed the policy: proprioceptive only, +RGB, +RGB-D, +tactile. Each additional modality can improve task success but complicates training.
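One way to express this choice is a small configuration object that switches modalities on and off and reports the resulting policy input size. The feature dimensions below are hypothetical placeholders for the output size of each modality's encoder, and `ObservationConfig`, `FEATURE_DIMS`, and `observation_dim` are illustrative names only.

```python
from dataclasses import dataclass, fields

# Hypothetical per-modality feature sizes after encoding.
FEATURE_DIMS = {
    "proprioception": 32,  # joint positions and velocities
    "rgb": 256,            # image-encoder embedding
    "depth": 128,          # depth-encoder embedding
    "tactile": 64,         # fingertip-sensor embedding
}


@dataclass(frozen=True)
class ObservationConfig:
    proprioception: bool = True
    rgb: bool = False
    depth: bool = False
    tactile: bool = False


def enabled_modalities(cfg: ObservationConfig) -> list:
    """Names of the modalities switched on, in declaration order."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name)]


def observation_dim(cfg: ObservationConfig) -> int:
    """Length of the concatenated feature vector fed to the policy."""
    return sum(FEATURE_DIMS[name] for name in enabled_modalities(cfg))


# Usage: the four configurations named in the text.
proprio_only = ObservationConfig()
with_rgb = ObservationConfig(rgb=True)
with_rgbd = ObservationConfig(rgb=True, depth=True)
with_tactile = ObservationConfig(rgb=True, depth=True, tactile=True)
```

Under these assumed sizes the input grows from 32 features with proprioception alone to 480 with every modality enabled, which illustrates why richer sensing makes the policy harder to train.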
The policy conditions actions on visual, tactile, and proprioceptive observations; different action sequences are activated depending on task phase (grasp, transport, in-hand manipulation, release).
RL training in simulation is massively parallel (thousands of Isaac Gym environments on a single GPU). Inference on a physical robot remains sequential due to the closed contact loop at 100 to 1000 Hz.
Training RL policies in simulation requires massive GPU parallelism (Isaac Gym, MuJoCo MJX). VLA inference also runs best on GPUs.
Low-level control loops (joint impedance, MPC) typically run on a real-time CPU alongside the high-level policy on GPU.