
InfiniBand

A switched-fabric network with native RDMA, lossless credit-based link-level flow control, and sub-microsecond latencies, designed from the ground up as an HPC/AI interconnect rather than as a bolt-on over an existing stack.

Category
Abstraction level
Operation level
  • LLM training clusters (NVIDIA SuperPOD, DGX, frontier compute)
  • TOP500 supercomputers
  • Scale-out storage (Lustre, GPFS, NVMe-oF)
  • Oracle Exadata databases
  • Scientific simulations / CFD / climate

Each host carries a Host Channel Adapter (HCA), an intelligent NIC that implements the entire protocol stack in hardware. The application uses the verbs API (ibv_post_send) to post an RDMA WRITE/READ/SEND or an atomic operation; the HCA directly reads or writes remote memory with zero copies and no CPU involvement on the remote side. The switched fabric relies on a Subnet Manager to compute paths (linear forwarding tables) and on credit-based flow control: a sender transmits only when the receiver has buffer credit available, which guarantees losslessness. At the physical layer, links are aggregated (1×/4×/8×/12×) with QSFP (up to HDR) and OSFP (NDR and beyond) connectors; copper reaches up to 10 m, fiber up to 10 km.
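
A minimal sketch of the host-side verbs call described above, assuming a queue pair that is already connected and a buffer registered with ibv_reg_mr; ibv_post_send and the work-request structures are standard libibverbs, while the wrapper function and its parameters are purely illustrative:

```c
/* Sketch only: post a one-sided RDMA WRITE with libibverbs.
 * Assumes qp is a connected RC queue pair, mr covers local_buf,
 * and the peer's remote_addr/rkey were exchanged out of band. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* local source buffer */
        .length = len,
        .lkey   = mr->lkey,               /* local key from ibv_reg_mr() */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;    /* one-sided: no remote CPU involved */
    wr.send_flags = IBV_SEND_SIGNALED;    /* request a local completion */
    wr.wr.rdma.remote_addr = remote_addr; /* peer virtual address */
    wr.wr.rdma.rkey        = rkey;        /* peer's remote key */

    /* The HCA performs the transfer; completion is reaped via ibv_poll_cq(). */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```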

Traditional Ethernet with TCP/IP brought high latency, CPU overhead, and lossy behavior that disqualified it as an HPC/AI interconnect. InfiniBand solves this with native RDMA, lossless link-level flow control, and a switched-fabric topology designed from layer 1 up.

01

Host Channel Adapter (HCA)

Host hardware endpoint

Host-side adapter that implements the IB transport stack in hardware and serves the RDMA verbs (send, receive, write, read, atomic).

02

IB Switch

Forwarding plane

Fabric switch that forwards IB packets between HCAs based on the linear forwarding table installed by the Subnet Manager.
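
Conceptually, the forwarding decision is a single array lookup rather than a longest-prefix match. A rough sketch of the idea (table size and names are illustrative, not an actual switch implementation):

```c
/* Illustrative only: a linear forwarding table (LFT) maps each
 * destination LID to an egress port; the Subnet Manager fills it in. */
#include <stdint.h>

#define LFT_SIZE (48 * 1024)      /* unicast LID space covered by the LFT */

static uint8_t lft[LFT_SIZE];     /* programmed by the Subnet Manager */

static inline uint8_t egress_port(uint16_t dlid)
{
    return lft[dlid];             /* O(1) lookup per packet */
}
```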

03

Subnet Manager (SM)

Control plane

Modular

Control-plane component (typically run on one of the nodes or in a switch) that discovers topology, assigns LIDs, and programs routing tables in switches.

04

Verbs API

Software interface

IBTA-standardized set of programming operations (ibv_post_send, ibv_open_device, ibv_reg_mr, ...) implemented by the libibverbs library (OFED).
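
A hedged sketch of the typical setup sequence with these verbs (open the first HCA, allocate a protection domain, register a buffer); the calls are standard libibverbs, while the buffer size and access flags are arbitrary examples:

```c
/* Sketch only: minimal libibverbs setup with abbreviated error handling. */
#include <infiniband/verbs.h>
#include <stdlib.h>

int main(void)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) return 1;

    struct ibv_context *ctx = ibv_open_device(dev_list[0]); /* HCA handle */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                  /* protection domain */

    size_t len = 1 << 20;                                    /* example: 1 MiB buffer */
    void *buf = malloc(len);

    /* Pin and register the buffer so the HCA can DMA to/from it and
     * so remote peers are allowed to RDMA WRITE into it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* mr->lkey / mr->rkey go into work requests; the rkey would be
     * sent to the peer out of band before it can write here. */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    free(buf);
    return 0;
}
```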

Parallelism

Fully parallel

A switched fabric with multi-rail HCAs and adaptive routing enables parallel communication between thousands of GPUs without a single-link bottleneck.

Data rate (SDR/DDR/QDR/FDR/EDR/HDR/NDR/XDR)

Critical
  • EDR (100 Gbit/s 4Γ—, 2014)
  • HDR (200 Gbit/s 4Γ—, 2018)
  • NDR (400 Gbit/s 4Γ—, 2022)
  • XDR (800 Gbit/s 4Γ—, 2024)

Per-lane bandwidth generation β€” from 2.5 Gbit/s (SDR) up to 200 Gbit/s (XDR).

Lane width (1×/4×/8×/12×)

Standard

Number of aggregated physical lanes per port. 4× is the standard; 12× is used switch-to-switch.
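
For orientation, the nominal port rate is simply the per-lane rate multiplied by the lane count; a small illustrative calculation using the generations listed above:

```c
/* Illustrative arithmetic only: nominal port rate = per-lane rate * lanes. */
#include <stdio.h>

int main(void)
{
    struct { const char *gen; double per_lane_gbps; } gens[] = {
        {"EDR", 25.0}, {"HDR", 50.0}, {"NDR", 100.0}, {"XDR", 200.0},
    };
    int lanes = 4; /* 4x is the common port width */

    for (int i = 0; i < 4; i++)
        printf("%s %dx = %.0f Gbit/s\n",
               gens[i].gen, lanes, gens[i].per_lane_gbps * lanes);
    return 0;
}
```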

Fabric topology

Critical

Fat tree, dragonfly, torus; the choice affects bisection bandwidth, cost, and diameter.
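
As a rough illustration of why the topology matters: a non-blocking fat tree of N hosts at link rate R offers a bisection bandwidth of about N/2 × R, and oversubscription (tapering) divides that figure accordingly. The numbers below are arbitrary examples:

```c
/* Back-of-the-envelope sketch: full-bisection fat tree vs. a 2:1 taper. */
#include <stdio.h>

int main(void)
{
    double hosts = 2048;       /* example cluster size */
    double link_gbps = 400;    /* example: one NDR 4x link per host */
    double oversub = 2.0;      /* example 2:1 tapering at the leaf layer */

    double full = hosts / 2.0 * link_gbps;
    printf("full bisection: %.0f Gbit/s, with 2:1 taper: %.0f Gbit/s\n",
           full, full / oversub);
    return 0;
}
```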

MTU

Standard

IB packet size: typically 256 B to 4 KB (max).

Common pitfalls

Vendor lock-in (NVIDIA/Mellanox)
HIGH

After the Mellanox acquisition (2019) and Intel's exit (Omni-Path), NVIDIA is effectively the sole IB hardware vendor.

Choose RoCE / Ethernet as the alternative, or pursue a multi-vendor strategy using Ultra Ethernet.

Subnet Manager as single point of failure
MEDIUM

Master SM failure blocks new path setup; a standby SM must be configured.

Master/standby SM, monitoring, and automatic failover.

No native IP routing
MEDIUM

IB is a dedicated fabric; IPoIB or a gateway is required to interoperate with the broader IP infrastructure.

IPoIB, EoIB, gateway switches.

CapEx / OpEx cost
MEDIUM

IB hardware (HCAs, switches, cabling) is typically more expensive than equivalent Ethernet at the same line rate.

1999

IBTA founded (merger of NGIO and Future I/O)

NGIO (Intel) and Future I/O (Compaq, IBM, HP) merge into the InfiniBand Trade Association.

2000

InfiniBand Architecture Specification 1.0

breakthrough

First release of the IB architecture specification.

2001

Mellanox InfiniBridge: first 10 Gbit/s product

Mellanox ships the first commercial InfiniBand products at 10 Gbit/s line rate (SDR).

2005

InfiniBand lands in Linux Kernel 2.6.11

OpenIB Alliance (later OpenFabrics) integrates the IB stack into the mainline kernel.

2014

IB becomes the most-used TOP500 interconnect

breakthrough

After years of HPC ecosystem growth, InfiniBand becomes the dominant interconnect on the TOP500 list.

2019

NVIDIA acquires Mellanox for USD 6.9 billion

breakthrough

The acquisition makes IB a strategic component of NVIDIA's AI platform: the Quantum (switches) and ConnectX (HCA) lines.

2022

NDR: 400 Gbit/s

Introduction of NDR (Quantum-2, ConnectX-7), the scale-out fabric of frontier-class AI clusters.

2024

XDR: 800 Gbit/s (Quantum-X800)

breakthrough

NVIDIA announces Quantum-X800 and ConnectX-8 as the next-gen fabric for Blackwell GPUs.

GPU Tensor Cores
PRIMARY

IB is the primary scale-out fabric of the NVIDIA DGX/SuperPOD platforms for H100/H200/B200 GPU clusters.

ALTERNATIVE TO

RoCE

RDMA over Converged Ethernet (RoCE) is a family of network protocols standardized by the InfiniBand Trade Association (IBTA) that bring RDMA semantics (remote memory access that bypasses the host CPU networking stack) onto Ethernet. Three variants exist: RoCE v1 operates as an Ethernet link-layer protocol (Ethertype 0x8915) confined to a single broadcast domain; the experimental RoCE v1.5 runs over IP; RoCE v2 encapsulates packets inside UDP/IP (port 4791) and is routable across IPv4/IPv6 networks. To approach InfiniBand-class performance, RoCE typically requires a lossless Ethernet fabric configured with Priority Flow Control (PFC) and Data Center Bridging (DCB); RoCE v2 additionally defines an ECN-based congestion-control mechanism using CNP frames. RoCE is today the dominant interconnect for GPU clusters in large-scale AI training, with end-to-end latencies as low as 1.3 µs on modern host channel adapters.
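
A toy sketch that only makes the variant distinction concrete: RoCE v2 can be recognized by its UDP destination port and RoCE v1 by its Ethertype (both constants are taken from the description above; the helper functions are hypothetical):

```c
/* Illustration only: classify traffic by the identifiers named above. */
#include <stdint.h>
#include <stdbool.h>

#define ROCE_V2_UDP_DPORT 4791    /* RoCE v2 rides in UDP/IP on this port */
#define ROCE_V1_ETHERTYPE 0x8915  /* RoCE v1 is a link-layer protocol */

static bool is_roce_v2(uint16_t udp_dst_port) { return udp_dst_port == ROCE_V2_UDP_DPORT; }
static bool is_roce_v1(uint16_t ethertype)    { return ethertype == ROCE_V1_ETHERTYPE; }
```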


Commonly used with

Synchronous Training

Synchronous Distributed Training is the dominant paradigm for scaling deep learning, in which N workers (typically GPUs or TPUs) replicate the model and process different shards of a minibatch. After computing local gradients, all workers synchronously aggregate them via an all-reduce operation: every worker receives the sum (or mean) of all peers' gradients. Only after the all-reduce completes does the optimizer update the weights, keeping all replicas identical. Mathematically the scheme is equivalent to single-node SGD on an N×B minibatch, eliminating the stale-gradient issue of the Asynchronous Parameter Server approach. Goyal et al. (Facebook, 2017, "Accurate, Large Minibatch SGD") demonstrated that with the linear scaling rule and a learning-rate warmup, ResNet-50 can be trained on ImageNet in one hour on 256 GPUs without loss of accuracy. Synchronous training is today the standard for LLM training (PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, JAX pmap) and requires low-latency interconnects, hence the central role of RoCE/InfiniBand and NCCL.
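
A minimal sketch of the all-reduce step, written with MPI purely for illustration; production LLM stacks use NCCL/DDP and related frameworks, and the gradient buffer and its size here are hypothetical:

```c
/* Sketch only: synchronous gradient aggregation via all-reduce. */
#include <mpi.h>

void sync_gradients(float *grads, int n_params)
{
    int world;
    MPI_Comm_size(MPI_COMM_WORLD, &world);

    /* Sum gradients across all workers in place; every rank gets the result. */
    MPI_Allreduce(MPI_IN_PLACE, grads, n_params, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    /* Average so the update matches single-node SGD on the N x B minibatch. */
    for (int i = 0; i < n_params; i++)
        grads[i] /= (float)world;
}
```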

MRC

Multipath Reliable Connection (MRC) is a network protocol designed for training frontier AI models on supercomputer clusters with more than 100,000 GPUs. It extends the RDMA over Converged Ethernet (RoCE) standard from the InfiniBand Trade Association and builds on techniques from the Ultra Ethernet Consortium (UEC), adding SRv6 source routing on top. MRC has been deployed across all of OpenAI's largest NVIDIA GB200 supercomputers, including the Stargate site operated with Oracle Cloud Infrastructure in Abilene, Texas, and in Microsoft Fairwater supercomputers. The specification was published on May 5, 2026 as an Open Compute Project (OCP) contribution and is publicly available. MRC addresses three problems of large-scale synchronous training: it enables two-tier multi-plane networks connecting 131,000 GPUs instead of conventional three- or four-tier designs, virtually eliminates core network congestion via adaptive packet spraying, and routes around failures on a microsecond timescale using static source routing instead of dynamic BGP.

SRv6

SRv6 (Segment Routing over IPv6, RFC 8754, March 2020) is a source-routing architecture in which the ingress node injects a list of instructions, called SIDs (Segment Identifiers), encoded as 128-bit IPv6 addresses inside a dedicated IPv6 extension header named the Segment Routing Header (SRH). Each SID combines locator semantics (where the packet should go) with function semantics (what the node should do: forwarding, VPN, encap, decap, service chaining, traffic engineering). The overarching segment-routing architecture is specified in RFC 8402 (July 2017); SRv6 is its IPv6-native instantiation and an alternative to SR-MPLS. The key benefit is that a single IPv6 data plane carries underlay forwarding, routing, traffic engineering, network slicing, VPN, and Network Programming simultaneously, without separate protocols (LDP, RSVP-TE) and without per-flow state in the core. In AI contexts, SRv6 is deployed in hyperscaler scale-out fabrics (Microsoft, Meta, Alibaba) to multipath RoCE/RDMA traffic and apply per-path congestion control.
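
A hedged sketch of the Segment Routing Header layout from RFC 8754, with field names abbreviated; real code on Linux would typically use the kernel's struct ipv6_sr_hdr rather than redeclaring the header:

```c
/* Sketch of the SRH (RFC 8754): fixed 8-byte header followed by a list
 * of 128-bit SIDs encoded as IPv6 addresses. */
#include <stdint.h>
#include <netinet/in.h>

struct srh {
    uint8_t  next_header;       /* protocol following the SRH */
    uint8_t  hdr_ext_len;       /* length in 8-octet units, excluding the first 8 */
    uint8_t  routing_type;      /* 4 = Segment Routing */
    uint8_t  segments_left;     /* index of the next SID to visit */
    uint8_t  last_entry;        /* index of the last element in the list */
    uint8_t  flags;
    uint16_t tag;
    struct in6_addr segments[]; /* the SID list: 128-bit IPv6 addresses */
};
```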

InfiniBand Trade Association (IBTA)
official website, InfiniBand Trade Association
InfiniBand Roadmap
documentation, InfiniBand Trade Association