Host Channel Adapter (HCA)
Host hardware endpoint
Host-side adapter that implements the IB transport stack in hardware and exposes the RDMA verbs (send, receive, write, read, atomic).
A switched-fabric network with native RDMA, lossless credit-based link-level flow control, and sub-microsecond latencies, designed from the ground up as an HPC/AI interconnect rather than as a bolt-on over an existing stack.
Each host carries a Host Channel Adapter (HCA), an intelligent NIC that implements the entire protocol stack in hardware. The application uses the verbs API (ibv_post_send) to post an RDMA WRITE/READ/SEND or an atomic operation; the HCA reads or writes remote memory directly, with zero copies and no CPU involvement on the remote side. The switched fabric relies on a Subnet Manager to compute paths (linear forwarding tables) and on credit-based flow control: a sender transmits only when the receiver has buffer credit available, guaranteeing losslessness. At the physical layer, links aggregate 1×, 4×, 8×, or 12× lanes, with QSFP connectors (up to HDR) and OSFP (NDR and beyond); copper reaches up to 10 m, fiber up to 10 km.
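The credit-based flow control described above can be illustrated with a toy model (a sketch of the mechanism only, not the wire protocol; the class and method names are invented for illustration): the sender may transmit only while it holds buffer credits granted by the receiver, so congestion surfaces as back-pressure instead of packet loss.

```python
# Illustrative model of credit-based link-level flow control:
# a packet is sent only while the sender holds receiver-granted
# buffer credits, so no packet is ever dropped on the link.
from collections import deque

class CreditLink:
    def __init__(self, receiver_buffers: int):
        self.credits = receiver_buffers   # credits advertised by the receiver
        self.rx_queue = deque()
        self.dropped = 0                  # stays 0 by construction

    def try_send(self, packet) -> bool:
        if self.credits == 0:
            return False                  # back-pressure: sender must wait
        self.credits -= 1
        self.rx_queue.append(packet)
        return True

    def receiver_consume(self):
        if self.rx_queue:
            self.rx_queue.popleft()
            self.credits += 1             # credit returned to the sender

link = CreditLink(receiver_buffers=2)
sent = [link.try_send(f"pkt{i}") for i in range(4)]
print(sent)                    # [True, True, False, False]
link.receiver_consume()
print(link.try_send("pkt4"))   # True, a credit came back
```

Because a send without credit simply blocks rather than dropping, the fabric stays lossless end to end; real HCAs implement this per virtual lane in hardware.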
Traditional Ethernet with TCP/IP introduced high latency, CPU overhead, and lossy behavior, disqualifying it as an HPC/AI interconnect. InfiniBand solves this with native RDMA, lossless link-level flow control, and a switched-fabric topology from layer 1 up.
Forwarding plane
Fabric switch that forwards IB packets between HCAs based on the linear forwarding table installed by the Subnet Manager.
Control plane
Control-plane component (typically run on one of the nodes or in a switch) that discovers topology, assigns LIDs, and programs routing tables in switches.
Software interface
IBTA-standardized set of programming operations (ibv_post_send, ibv_open_device, ibv_reg_mr, …) implemented by the libibverbs library (OFED).
Fully parallel
A switched fabric with multi-rail HCAs and adaptive routing enables parallel communication between thousands of GPUs without a single-link bottleneck.
Data rate (SDR/DDR/QDR/FDR/EDR/HDR/NDR/XDR)
Per-lane bandwidth generation, from 2.5 Gbit/s (SDR) up to 200 Gbit/s (XDR).
Lane width (1×/4×/8×/12×)
Number of aggregated physical lanes per port. 4× is the standard; 12× is used switch-to-switch.
Fabric topology
Fat tree, dragonfly, torus; the choice affects bisection bandwidth, cost, and diameter.
MTU
IB packet size, typically 256 B up to the 4 KB maximum.
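The first two parameters combine multiplicatively: port bandwidth is the per-lane rate times the lane width. A small sketch using nominal signalling rates per lane (encoding overhead such as 8b/10b or 64b/66b is ignored for simplicity):

```python
# Nominal per-lane signalling rate for each InfiniBand generation, Gbit/s.
# SDR/DDR/QDR use 8b/10b encoding, so usable data rate is lower; this
# sketch ignores encoding overhead.
PER_LANE_GBPS = {
    "SDR": 2.5, "DDR": 5, "QDR": 10, "FDR": 14,
    "EDR": 25, "HDR": 50, "NDR": 100, "XDR": 200,
}

def port_bandwidth_gbps(generation: str, lanes: int) -> float:
    """Aggregate port bandwidth: per-lane rate times lane width."""
    return PER_LANE_GBPS[generation] * lanes

print(port_bandwidth_gbps("NDR", 4))   # 400, a standard 4x NDR port
print(port_bandwidth_gbps("XDR", 4))   # 800
```

A standard 4× port at NDR thus yields 400 Gbit/s, and switch-to-switch 12× links scale the same way.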
After the Mellanox acquisition (2019) and Intel's exit (Omni-Path), NVIDIA is effectively the sole IB hardware vendor.
RoCE/Ethernet as an alternative, or a multi-vendor strategy built on Ultra Ethernet.
Master SM failure blocks new path setup; a standby SM must be configured.
Master/standby SM, monitoring, and automatic failover.
IB is a dedicated fabric; IPoIB or a gateway is required to interoperate with the broader IP infrastructure.
IPoIB, EoIB, gateway switches.
IB hardware (HCAs, switches, cabling) is typically more expensive than equivalent Ethernet at the same line rate.
1999: IBTA founded (merger of NGIO and Future I/O)
NGIO (Intel) and Future I/O (Compaq, IBM, HP) merge into the InfiniBand Trade Association.
2000: InfiniBand Architecture Specification 1.0
First release of the IB architecture specification.
2001: Mellanox InfiniBridge, the first 10 Gbit/s product
Mellanox ships the first commercial InfiniBand products at 10 Gbit/s line rate (SDR).
2005: InfiniBand lands in Linux kernel 2.6.11
OpenIB Alliance (later OpenFabrics) integrates the IB stack into the mainline kernel.
IB becomes the most-used TOP500 interconnect
After years of HPC ecosystem growth, InfiniBand becomes the dominant interconnect on the TOP500 list.
2019: NVIDIA acquires Mellanox for USD 6.9 billion
The acquisition makes IB a strategic component of NVIDIA's AI platform: the Quantum (switch) and ConnectX (HCA) product lines.
2021: NDR, 400 Gbit/s
Introduction of NDR (Quantum-2, ConnectX-7), the scale-out fabric of frontier-class AI clusters.
2024: XDR, 800 Gbit/s (Quantum-X800)
NVIDIA announces Quantum-X800 and ConnectX-8 as the next-gen fabric for Blackwell GPUs.
IB is the primary scale-out fabric of the NVIDIA DGX/SuperPOD platforms for H100/H200/B200 GPU clusters.
RDMA over Converged Ethernet (RoCE) is a family of network protocols standardized by the InfiniBand Trade Association (IBTA) that bring RDMA semantics (remote memory access that bypasses the host CPU's networking stack) onto Ethernet. Three variants exist: RoCE v1 operates as an Ethernet link-layer protocol (Ethertype 0x8915) confined to a single broadcast domain; the experimental RoCE v1.5 runs over IP; RoCE v2 encapsulates packets inside UDP/IP (port 4791) and is routable across IPv4/IPv6 networks. To approach InfiniBand-class performance, RoCE typically requires a lossless Ethernet fabric configured with Priority Flow Control (PFC) and Data Center Bridging (DCB); RoCE v2 additionally defines an ECN-based congestion-control mechanism using CNP frames. RoCE is today the dominant interconnect for GPU clusters in large-scale AI training, with end-to-end latencies as low as 1.3 µs on modern host channel adapters.
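The v2 encapsulation can be illustrated with a simplified packet builder: a 12-byte InfiniBand Base Transport Header (BTH) carried in a UDP datagram aimed at destination port 4791. This is a sketch only; it omits the IP header, the trailing invariant CRC (ICRC), and checksum computation, and collapses the BTH flag bits to zero.

```python
# Simplified sketch of RoCE v2 framing: a Base Transport Header (BTH)
# inside a UDP datagram with destination port 4791. IP header and ICRC
# are omitted for brevity.
import struct

ROCE_V2_UDP_PORT = 4791

def build_bth(opcode: int, pkey: int, dest_qp: int, psn: int) -> bytes:
    """Pack a 12-byte BTH (SE/M flags, pad count, transport version all 0)."""
    return struct.pack(
        ">BBHII",
        opcode,               # e.g. 0x0A = RC RDMA WRITE Only
        0,                    # SE | M | PadCnt | TVer
        pkey,                 # partition key
        dest_qp & 0xFFFFFF,   # 8 reserved bits + 24-bit destination QP
        psn & 0xFFFFFF,       # ack-request bit + 24-bit packet sequence number
    )

def build_udp(payload: bytes, src_port: int = 49152) -> bytes:
    """Minimal UDP header (checksum left 0, as IPv4 permits)."""
    return struct.pack(">HHHH", src_port, ROCE_V2_UDP_PORT,
                       8 + len(payload), 0) + payload

bth = build_bth(opcode=0x0A, pkey=0xFFFF, dest_qp=0x12, psn=1)
frame = build_udp(bth)
print(len(bth), len(frame))   # 12 20
```

Because the IB transport headers ride unchanged inside UDP/IP, any IP router can forward RoCE v2 traffic, which is what makes it routable where v1 is not.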
Synchronous Distributed Training is the dominant paradigm for scaling deep learning, in which N workers (typically GPUs or TPUs) replicate the model and process different shards of a minibatch. After computing local gradients, all workers synchronously aggregate them via an all-reduce operation: every worker receives the sum (or mean) of all peers' gradients. Only after the all-reduce completes does the optimizer update the weights, keeping all replicas identical. Mathematically the scheme is equivalent to single-node SGD on an N×B minibatch, eliminating the stale-gradient issue of asynchronous parameter-server training. Goyal et al. (Facebook, 2017, "Accurate, Large Minibatch SGD") demonstrated that with the linear scaling rule and a learning-rate warmup, ResNet-50 can be trained on ImageNet in one hour on 256 GPUs without loss of accuracy. Synchronous training is today the standard for LLM training (PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, JAX pmap) and requires low-latency interconnects, hence the central role of RoCE/InfiniBand and NCCL.
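The equivalence to single-node SGD on the combined minibatch can be checked numerically: the mean of per-worker gradients (what an all-reduce delivers) equals the full-batch gradient when the shards are equal-sized. A toy example with a 1-D least-squares loss:

```python
# Illustrative check that synchronous data parallelism matches
# single-worker SGD on the combined minibatch, for f(w) = mean((w*x - y)^2).

def grad(w, xs, ys):
    """Mean gradient of (w*x - y)^2 over a batch: 2*x*(w*x - y)."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Two workers, each holding its own shard of the minibatch.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
local_grads = [grad(w, sx, sy) for sx, sy in shards]
allreduce_mean = sum(local_grads) / len(local_grads)  # what all-reduce computes

full_batch = grad(w, xs, ys)
print(abs(allreduce_mean - full_batch) < 1e-12)   # True
```

Since every replica applies the identical averaged gradient, the weights stay bitwise-synchronized across workers; unequal shard sizes would require weighting the per-worker terms.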
Multipath Reliable Connection (MRC) is a network protocol designed for training frontier AI models on supercomputer clusters with more than 100,000 GPUs. It extends the RDMA over Converged Ethernet (RoCE) standard from the InfiniBand Trade Association and builds on techniques from the Ultra Ethernet Consortium (UEC), adding SRv6 source routing on top. MRC has been deployed across all of OpenAI's largest NVIDIA GB200 supercomputers, including the Stargate site operated with Oracle Cloud Infrastructure in Abilene, Texas, and in Microsoft Fairwater supercomputers. The specification was published on May 5, 2026 as an Open Compute Project (OCP) contribution and is publicly available. MRC addresses three problems of large-scale synchronous training: it enables two-tier multi-plane networks connecting 131,000 GPUs instead of conventional three- or four-tier designs, virtually eliminates core network congestion via adaptive packet spraying, and routes around failures on a microsecond timescale using static source routing instead of dynamic BGP.
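The congestion claim rests on per-packet spraying rather than per-flow hashing. A toy simulation (invented for illustration, not taken from the MRC specification) of why spraying flattens link load across equal-cost paths:

```python
# Toy contrast of per-flow ECMP hashing vs. per-packet spraying across
# equal-cost paths: hashing can pile several flows onto one link, while
# spraying each packet to the least-loaded path keeps load at the mean.
import zlib

N_PATHS, N_FLOWS, PKTS_PER_FLOW = 4, 8, 1000

# Per-flow hashing: every packet of a flow is pinned to one path.
hashed = [0] * N_PATHS
for flow in range(N_FLOWS):
    path = zlib.crc32(f"flow-{flow}".encode()) % N_PATHS
    hashed[path] += PKTS_PER_FLOW

# Per-packet spraying: each packet is steered to the least-loaded path.
sprayed = [0] * N_PATHS
for _ in range(N_FLOWS * PKTS_PER_FLOW):
    sprayed[sprayed.index(min(sprayed))] += 1

print(max(hashed) >= max(sprayed))   # True: hashing is never better here
```

The sprayed case ends perfectly balanced (2000 packets per path), whereas hash collisions can leave some links carrying multiple flows while others idle; real fabrics additionally need receiver-side reordering, which is part of what a multipath reliable transport provides.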
SRv6 (Segment Routing over IPv6, RFC 8754, March 2020) is a source-routing architecture in which the ingress node injects a list of instructions, called SIDs (Segment Identifiers), encoded as 128-bit IPv6 addresses inside a dedicated IPv6 extension header named the Segment Routing Header (SRH). Each SID combines locator semantics (where the packet should go) with function semantics (what the node should do: forwarding, VPN, encap, decap, service chaining, traffic engineering). The overarching segment-routing architecture is specified in RFC 8402 (July 2017); SRv6 is its IPv6-native instantiation, an alternative to SR-MPLS. The key benefit is that a single IPv6 data plane carries underlay forwarding, routing, traffic engineering, network slicing, VPN, and Network Programming simultaneously, without separate protocols (LDP, RSVP-TE) and without per-flow state in the core. In AI contexts, SRv6 is deployed in hyperscaler scale-out fabrics (Microsoft, Meta, Alibaba) to multipath RoCE/RDMA traffic and apply per-path congestion control.
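The SRH layout from RFC 8754 can be sketched as a small packer (illustrative only: flags, tag, and optional TLVs are left at zero, and note that the segment list is encoded in reverse order, with Segment List[0] holding the final SID):

```python
# Minimal sketch of an SRv6 Segment Routing Header (RFC 8754):
# an 8-byte fixed header followed by 128-bit SIDs in reverse order.
import ipaddress
import struct

ROUTING_TYPE_SRV6 = 4   # routing type assigned to the SRH

def build_srh(segments: list, next_header: int = 17) -> bytes:
    """Pack an SRH; next_header=17 means UDP follows (e.g. RoCE v2)."""
    sids = [ipaddress.IPv6Address(s).packed for s in reversed(segments)]
    hdr_ext_len = 2 * len(sids)       # in 8-octet units, first 8 bytes excluded
    segments_left = len(sids) - 1     # segments still to visit (first SID is
                                      # already in the IPv6 destination address)
    last_entry = len(sids) - 1        # index of the last SID slot
    fixed = struct.pack(">BBBBBBH",
                        next_header, hdr_ext_len, ROUTING_TYPE_SRV6,
                        segments_left, last_entry, 0, 0)  # flags=0, tag=0
    return fixed + b"".join(sids)

srh = build_srh(["2001:db8::1", "2001:db8::2", "2001:db8::3"])
print(len(srh))   # 56 = 8 fixed + 3 * 16
```

Each transit endpoint decrements Segments Left and copies the next SID into the IPv6 destination address, which is how the ingress-chosen path is enforced without per-flow state in the core.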