
RDMA over Converged Ethernet

Brings RDMA (zero-copy, kernel-bypass) directly over standard Ethernet networks, eliminating the need for a dedicated InfiniBand fabric.

Category
Abstraction level: Operation level

GPU training clusters for LLMs
Scale-out fabrics (NVIDIA Spectrum-X, NVLink-over-Ethernet)
Distributed storage (NVMe-over-Fabrics)
High-performance computing clusters
In-memory databases and caching

RoCE encapsulates InfiniBand transport-layer messages (Base Transport Header + payload) inside Ethernet frames (RoCE v1) or UDP/IP packets (RoCE v2). The Host Channel Adapter (HCA) implements the entire protocol stack in hardware: the application posts a READ/WRITE/SEND verb and the HCA accesses remote memory without kernel involvement or data copies. Because RoCE is sensitive to packet loss, deployments use Priority Flow Control (PFC) for losslessness and ECN-based congestion signaling (CNP frames in v2).
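
The encapsulation is easiest to see as a packet layout. Below is an illustrative C sketch of the RoCE v2 on-the-wire framing and the 12-byte Base Transport Header; the struct and names are expository assumptions based on the IBTA field layout, not a wire-parsing library.

```c
/* Illustrative sketch of RoCE v2 framing (not a parser):
 *   Ethernet / IPv4-or-IPv6 / UDP (dst port 4791) / BTH / payload / ICRC
 * All BTH fields are big-endian on the wire. */
#include <stdint.h>

enum { ROCE_V2_UDP_DPORT = 4791 };   /* IANA-assigned port for RoCE v2 */

/* InfiniBand Base Transport Header: 12 bytes. */
struct bth {
    uint8_t  opcode;         /* transport opcode: SEND, RDMA WRITE, ACK, ... */
    uint8_t  se_m_pad_tver;  /* solicited event, migration, pad count, version */
    uint16_t pkey;           /* partition key */
    uint32_t resv_dest_qp;   /* 8 reserved bits + 24-bit destination QP number */
    uint32_t ack_psn;        /* ack-request bit + 7 reserved bits + 24-bit PSN */
};
```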

Conventional TCP/IP-over-Ethernet imposes high latency and CPU overhead on inter-node communication in HPC and AI-training clusters. RoCE solves this by delivering RDMA (zero-copy, kernel-bypass) without requiring a dedicated InfiniBand fabric.
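
To make the kernel-bypass data path concrete, here is a minimal sketch of a one-sided RDMA WRITE using the libibverbs API. It assumes a connected reliable-connection (RC) queue pair and completion queue already exist; device discovery, QP state transitions (INIT to RTR to RTS), and the out-of-band exchange of the remote address and rkey are omitted, and rdma_write_example is an illustrative name.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static char buf[4096];  /* local source buffer */

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                       struct ibv_cq *cq,
                       uint64_t remote_addr, uint32_t rkey)
{
    strcpy(buf, "hello over RoCE");

    /* Register (pin) local memory so the HCA can DMA from it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof(buf),
                                   IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = sizeof(buf),
        .lkey   = mr->lkey,
    };

    /* One-sided RDMA WRITE: neither the remote CPU nor either kernel
     * touches the data. */
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;  /* learned out of band */
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr;
    if (ibv_post_send(qp, &wr, &bad_wr))   /* hand the WR to the HCA */
        return -1;

    /* The HCA reports completion through the CQ; busy-poll it here. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    int ok = (wc.status == IBV_WC_SUCCESS);
    ibv_dereg_mr(mr);
    return ok ? 0 : -1;
}
```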

Common pitfalls

PFC-induced deadlocks
HIGH

Priority Flow Control, required for losslessness, can trigger credit-loop deadlocks in large fabrics.

Use DCQCN, SRv6 path routing, or adaptive routing; constrain PFC domains.

Packet-loss sensitivity
CRITICAL

RoCE v2 runs over UDP, which offers no loss recovery of its own; recovery falls to the transport's go-back-N retransmission, so a single drop forces replay of a whole in-flight window and a dramatic goodput loss, worst under incast.

Use lossless ECN/PFC tuning, selective repeat (Reliable RoCE), or Multipath Reliable Connection (MRC); a back-of-envelope goodput comparison follows this list.

DCB configuration complexity
MEDIUM

Configuring Data Center Bridging (PFC, ETS, DCBX) per switch is considerably more complex than configuring an InfiniBand fabric.
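
To see why loss sensitivity is rated critical, here is a back-of-envelope model, an assumption-laden illustration rather than a measurement: with go-back-N, one lost packet forces retransmission of roughly a full bandwidth-delay window W of packets, so goodput scales like 1/(1 + pW), while selective repeat resends only the lost packet, giving roughly 1 - p. The window size W = 1000 is an assumed value.

```c
#include <stdio.h>

int main(void) {
    double W = 1000.0;  /* assumed packets in flight (bandwidth-delay product) */
    double losses[] = {1e-6, 1e-5, 1e-4, 1e-3};
    printf("%10s %12s %18s\n", "loss p", "go-back-N", "selective repeat");
    for (int i = 0; i < 4; i++) {
        double p = losses[i];
        /* Fraction of line rate delivered as useful data under each scheme. */
        printf("%10.0e %12.4f %18.6f\n", p, 1.0 / (1.0 + p * W), 1.0 - p);
    }
    return 0;
}
```

Even at a loss rate of 1e-3, the modeled go-back-N goodput halves while selective repeat barely notices, which is why lossless operation (PFC/ECN) or a smarter retransmission scheme is essential.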

2010

RoCE v1 specification published (IBTA Annex A16)

breakthrough

The InfiniBand Trade Association ratifies RoCE v1 as Annex A16 to IBA specification 1.2.1.

2014

RoCE v2 specification published (IBTA Annex A17)

breakthrough

RoCE v2 introduces UDP/IP encapsulation (UDP destination port 4791), enabling routable RDMA across IP networks and ECN/CNP-based congestion control.

2016

RoCE v2 lands in Linux kernel 4.5

The mainline Linux kernel adds RoCE v2 support (Mellanox OFED 2.3+), enabling broad data-center deployment.

2020

NVIDIA acquires Mellanox

The acquisition makes RoCE a strategic component of NVIDIA's AI platform (Spectrum, ConnectX, BlueField).

2024

Spectrum-X and Ultra Ethernet Consortium

breakthrough

NVIDIA launches Spectrum-X, an Ethernet platform optimized for RoCE in AI clusters; the Ultra Ethernet Consortium (AMD, Broadcom, Cisco, Meta, Microsoft) forms to design a RoCE successor.

GPU Tensor Cores
PRIMARY

RoCE is the standard scale-out fabric for GPU clusters (NVIDIA ConnectX/BlueField, Spectrum-X) used in LLM training.

Hardware agnostic
GOOD

RoCE requires an RDMA-capable NIC (HCA) but is agnostic to the CPU/GPU/accelerator above it.

ALTERNATIVE TO

IB

InfiniBand (IB) is a networking standard maintained by the InfiniBand Trade Association (IBTA, founded 1999), in which hosts connect to the fabric via Host Channel Adapters (HCAs) and peripherals via Target Channel Adapters (TCAs). Its switched-fabric topology, credit-based link-level flow control, and native RDMA deliver microsecond latencies (1.3 µs at QDR, <0.6 µs at HDR) and full line rate without packet loss. Successive bandwidth generations (4× link data rates in Gbit/s) are: SDR (8, 2001), DDR (16, 2005), QDR (32, 2007), FDR (54.54, 2011), EDR (100, 2014), HDR (200, 2018), NDR (400, 2022), and XDR (800, 2024). InfiniBand supports five message types: RDMA read/write, channel send/receive, transactional operations, multicast, and atomics. The Linux kernel has supported IB since 2.6.11 (2005) via the OpenFabrics Enterprise Distribution (OFED) and the verbs API. After 2014, IB briefly led the TOP500 interconnect ranking, but Ethernet/RoCE later reclaimed market share. In 2019 NVIDIA announced its acquisition of Mellanox, the last independent vendor (completed in 2020), and today IB is the primary scale-out fabric of NVIDIA's AI platforms (Quantum-2, Quantum-X800), used for LLM training in conjunction with NVLink/NVSwitch.


Commonly used with

MRC

Multipath Reliable Connection (MRC) is a network protocol designed for training frontier AI models on supercomputer clusters with more than 100,000 GPUs. It extends the RDMA over Converged Ethernet (RoCE) standard from the InfiniBand Trade Association and builds on techniques from the Ultra Ethernet Consortium (UEC), adding SRv6 source routing on top. MRC has been deployed across all of OpenAI's largest NVIDIA GB200 supercomputers, including the Stargate site operated with Oracle Cloud Infrastructure in Abilene, Texas, and in Microsoft Fairwater supercomputers. The specification was published on May 5, 2026 as an Open Compute Project (OCP) contribution and is publicly available. MRC addresses three problems of large-scale synchronous training: it enables two-tier multi-plane networks connecting 131,000 GPUs instead of conventional three- or four-tier designs, virtually eliminates core network congestion via adaptive packet spraying, and routes around failures on a microsecond timescale using static source routing instead of dynamic BGP.

SRv6

SRv6 (Segment Routing over IPv6, RFC 8754, March 2020) is a source-routing architecture in which the ingress node injects a list of instructions, called SIDs (Segment Identifiers), encoded as 128-bit IPv6 addresses inside a dedicated IPv6 extension header named the Segment Routing Header (SRH). Each SID combines locator semantics (where the packet should go) with function semantics (what the node should do: forwarding, VPN, encap, decap, service chaining, traffic engineering). The overarching segment-routing architecture is specified in RFC 8402 (July 2018); SRv6 is its IPv6-native instantiation and an alternative to SR-MPLS. The key benefit is that a single IPv6 data plane simultaneously carries underlay forwarding, routing, traffic engineering, network slicing, VPN, and Network Programming, without separate protocols (LDP, RSVP-TE) and without per-flow state in the core. In AI contexts, SRv6 is deployed in hyperscaler scale-out fabrics (Microsoft, Meta, Alibaba) to spread RoCE/RDMA traffic across multiple paths and apply per-path congestion control.
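
As an illustration of the SRH described above, here is the RFC 8754 header layout expressed as a C struct; this is an expository sketch whose comments paraphrase the RFC, not a packet-processing library.

```c
#include <stdint.h>

/* Segment Routing Header (IPv6 routing extension header, type 4),
 * per RFC 8754. Multi-byte fields are big-endian on the wire. */
struct ipv6_sr_hdr {
    uint8_t  next_header;    /* protocol following the SRH              */
    uint8_t  hdr_ext_len;    /* length in 8-byte units, minus the first 8 bytes */
    uint8_t  routing_type;   /* 4 = Segment Routing                     */
    uint8_t  segments_left;  /* index of the next SID to visit; each
                                segment endpoint decrements it          */
    uint8_t  last_entry;     /* index of the last element of the list   */
    uint8_t  flags;
    uint16_t tag;
    uint8_t  segments[][16]; /* SID list: 128-bit IPv6 addresses, stored
                                in reverse order of traversal           */
};
```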

Synchronous Training

Synchronous Distributed Training is the dominant paradigm for scaling deep learning, in which N workers (typically GPUs or TPUs) replicate the model and process different shards of a minibatch. After computing local gradients, all workers synchronously aggregate them via an all-reduce operation: every worker receives the sum (or mean) of all peers' gradients. Only after the all-reduce completes does the optimizer update the weights, keeping all replicas identical. Mathematically the scheme is equivalent to single-node SGD on an N×B minibatch, eliminating the stale-gradient issue of asynchronous parameter-server training. Goyal et al. (Facebook, 2017, "Accurate, Large Minibatch SGD") demonstrated that with the linear scaling rule and a learning-rate warmup, ResNet-50 can be trained on ImageNet in one hour on 256 GPUs without loss of accuracy. Synchronous training is today the standard for LLM training (PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, JAX pmap) and requires low-latency interconnects, hence the central role of RoCE/InfiniBand and NCCL.
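
A toy sketch of what one synchronous step computes, with made-up gradient values and plain C standing in for a real framework (the all-reduce here is a local loop; NCCL would perform it over RoCE/InfiniBand):

```c
#include <stdio.h>

#define N_WORKERS 4
#define DIM 3

int main(void) {
    double weights[DIM] = {0.5, -0.2, 0.1};  /* identical on every replica */
    double grads[N_WORKERS][DIM] = {         /* per-worker local gradients  */
        {0.10, 0.00, -0.05}, {0.08, 0.02, -0.01},
        {0.12, -0.04, 0.00}, {0.06, 0.02, -0.02},
    };
    double lr = 0.1;

    /* All-reduce: every worker ends up with the mean of all gradients. */
    double avg[DIM] = {0};
    for (int w = 0; w < N_WORKERS; w++)
        for (int d = 0; d < DIM; d++)
            avg[d] += grads[w][d] / N_WORKERS;

    /* Each replica applies the same averaged gradient, so all replicas
     * remain bit-identical, exactly as single-node SGD on the combined
     * N*B minibatch would. */
    for (int d = 0; d < DIM; d++)
        weights[d] -= lr * avg[d];

    printf("updated weights: %.4f %.4f %.4f\n",
           weights[0], weights[1], weights[2]);
    return 0;
}
```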
