Multipath Reliable Connection
MRC sprays a single RDMA transfer across hundreds of paths through multiple parallel network planes, using static SRv6 source routing instead of dynamic routing protocols, eliminating core congestion and routing around failures on a microsecond timescale.
MRC splits each 800 Gb/s NIC into eight independent 100 Gb/s links connected to different switches, creating parallel network planes. For a single RDMA transfer, packets are sprayed across hundreds of paths in all planes. Each packet carries the final memory address, so packets can arrive out of order and be written directly. MRC keeps state for many paths and swaps a path when it detects congestion; on a packet loss it immediately stops using that path and probes it. For destination-side congestion it uses packet trimming: the switch strips the payload and forwards only the header, triggering an explicit retransmission request. Routing uses IPv6 Segment Routing (SRv6): the sender encodes a sequence of switch identifiers in the destination address, and each switch removes its own identifier and consults a static routing table to decide the next hop. Dynamic routing (BGP) is disabled.
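The sketch below is a plain-Python illustration of the sender-side behavior described above, not OpenAI's implementation; the names (`Path`, `Packet`, `spray`, `on_trim`, `on_loss`) and the round-robin policy are hypothetical. It shows the key ideas: each packet carries its destination memory offset, packets are distributed across all currently usable paths, and a path is retired when it signals congestion (a trim) or loses a packet.

```python
# Hypothetical sketch of MRC-style packet spraying (names and policy are made up).
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Path:
    segment_list: list[str]   # SRv6 SIDs: switch identifiers toward the destination
    healthy: bool = True      # set False on loss; re-enabled after a successful probe
    congested: bool = False   # set True on a trim (header-only) notification

@dataclass
class Packet:
    transfer_id: int
    mem_offset: int           # destination memory offset carried in every packet,
                              # so it can be written on arrival regardless of order
    payload: bytes
    segment_list: list[str]   # encoded into the SRv6 header on the wire

def spray(transfer_id: int, data: bytes, paths: list[Path], mtu: int = 4096) -> list[Packet]:
    """Split one transfer into packets and assign each to the next usable path."""
    usable = [p for p in paths if p.healthy and not p.congested]
    if not usable:
        raise RuntimeError("no usable paths")
    rr = cycle(usable)
    packets = []
    for offset in range(0, len(data), mtu):
        path = next(rr)
        packets.append(Packet(transfer_id, offset, data[offset:offset + mtu],
                              list(path.segment_list)))
    return packets

def on_trim(path: Path) -> None:
    # Destination-side congestion: the switch forwarded only the header,
    # so stop picking this path and let the receiver request retransmission.
    path.congested = True

def on_loss(path: Path) -> None:
    # Stop using the path immediately; probe it before reuse.
    path.healthy = False
```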
In AI training clusters at the scale of hundreds of thousands of GPUs, a single late transfer can stall an entire synchronous training step, and link or switch failures in classic single-path RoCE networks cause multi-second pauses or job crashes. Traditional transport protocols pin all packets of a transfer to a single path, leading to hot spots and underuse of the available path diversity.
Fully parallel: A single transfer is sprayed across hundreds of concurrent paths through all network planes.
Conditional / input dependent: Adaptive packet spraying selects paths dynamically in response to load and loss/trim signals.
Reference implementations
GENESIS · Source paper
Resilient AI Supercomputer Networking using MRC and SRv6
MRC specification released through Open Compute Project
Breakthrough · May 5, 2026: OpenAI publishes the MRC 1.0 specification as an OCP contribution alongside the MRC + SRv6 white paper.
Deployed on OpenAI NVIDIA GB200 clusters; built into 800 Gb/s NICs attached to GPUs.
Commonly used with
RoCE
RDMA over Converged Ethernet (RoCE) is a family of network protocols standardized by the InfiniBand Trade Association (IBTA) that bring RDMA semantics (remote memory access that bypasses the host CPU networking stack) onto Ethernet. Three variants exist: RoCE v1 operates as an Ethernet link-layer protocol (Ethertype 0x8915) confined to a single broadcast domain; the experimental RoCE v1.5 runs over IP; RoCE v2 encapsulates packets inside UDP/IP (port 4791) and is routable across IPv4/IPv6 networks. To approach InfiniBand-class performance, RoCE typically requires a lossless Ethernet fabric configured with Priority Flow Control (PFC) and Data Center Bridging (DCB); RoCE v2 additionally defines an ECN-based congestion-control mechanism using CNP frames. RoCE is today the dominant interconnect for GPU clusters in large-scale AI training, with end-to-end latencies as low as 1.3 µs on modern host-channel adapters.
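As a rough illustration of the RoCE v2 encapsulation described above, the sketch below packs a simplified Base Transport Header (BTH) and pairs it with the well-known UDP destination port 4791. It is illustrative only: the helper names are hypothetical, real adapters carry additional fields (e.g., the invariant CRC trailer) and build these headers in hardware.

```python
# Simplified RoCE v2 framing sketch: BTH + payload carried over UDP/IP (dst port 4791).
import struct

ROCE_V2_UDP_DPORT = 4791  # IANA-assigned UDP port for RoCE v2

def build_bth(opcode: int, dest_qp: int, psn: int) -> bytes:
    """Pack a simplified 12-byte InfiniBand Base Transport Header."""
    # opcode (1 B), SE/M/pad/version flags (1 B), partition key (2 B),
    # reserved + destination QP (4 B), ack-request bit + PSN (4 B)
    return struct.pack("!BBHII", opcode, 0, 0xFFFF, dest_qp & 0xFFFFFF, psn & 0xFFFFFF)

def roce_v2_datagram(payload: bytes, dest_qp: int, psn: int) -> tuple[int, bytes]:
    """Return (udp_dst_port, udp_payload) for a RoCE v2 datagram."""
    bth = build_bth(opcode=0x04, dest_qp=dest_qp, psn=psn)  # 0x04 = RC SEND-only
    return ROCE_V2_UDP_DPORT, bth + payload
```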
SRv6
SRv6 (Segment Routing over IPv6, RFC 8754, March 2020) is a source-routing architecture in which the ingress node injects a list of instructions, called SIDs (Segment Identifiers), encoded as 128-bit IPv6 addresses inside a dedicated IPv6 extension header named the Segment Routing Header (SRH). Each SID combines locator semantics (where the packet should go) with function semantics (what the node should do: forwarding, VPN, encap, decap, service chaining, traffic engineering). The overarching segment-routing architecture is specified in RFC 8402 (July 2017); SRv6 is its IPv6-native instantiation, an alternative to SR-MPLS. The key benefit is that a single IPv6 data plane carries underlay forwarding, routing, traffic engineering, network slicing, VPN, and Network Programming simultaneously, without separate protocols (LDP, RSVP-TE) and without per-flow state in the core. In AI contexts, SRv6 is deployed in hyperscaler scale-out fabrics (Microsoft, Meta, Alibaba) to multipath RoCE/RDMA traffic and apply per-path congestion control.
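A minimal sketch of how a sender might encode such a SID list into an RFC 8754 Segment Routing Header is shown below. The `build_srh` helper and the example SIDs are made up for illustration; the outer IPv6 header, whose destination address would carry the first SID, is omitted.

```python
# Sketch of SRH (RFC 8754) encoding for a SID list given in travel order.
import ipaddress
import struct

def build_srh(sids: list[str], next_header: int = 17) -> bytes:
    """Encode an SRH: the segment list is stored in reverse order, Segments Left
    indexes the active SID, and each segment endpoint decrements it and copies
    the next SID into the IPv6 destination address."""
    segs = [ipaddress.IPv6Address(s).packed for s in reversed(sids)]
    hdr_ext_len = 2 * len(segs)        # header length in 8-octet units, excluding the first 8
    segments_left = len(segs) - 1      # SIDs still to visit after the one already in the DA
    last_entry = len(segs) - 1
    fixed = struct.pack("!BBBBBBH", next_header, hdr_ext_len, 4,   # routing type 4 = SRH
                        segments_left, last_entry, 0, 0)           # flags = 0, tag = 0
    return fixed + b"".join(segs)

# Example: a three-hop path expressed as 128-bit SIDs (addresses are made up).
srh = build_srh(["fc00:1::1", "fc00:2::1", "fc00:3::1"])
```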
Synchronous Training
Synchronous Distributed Training is the dominant paradigm for scaling deep learning, in which N workers (typically GPUs or TPUs) replicate the model and process different shards of a minibatch. After computing local gradients, all workers synchronously aggregate them via an all-reduce operation: every worker receives the sum (or mean) of all peers' gradients. Only after the all-reduce completes does the optimizer update the weights, keeping all replicas identical. Mathematically the scheme is equivalent to single-node SGD on an N×B minibatch, eliminating the stale-gradient issue of asynchronous parameter-server training. Goyal et al. (Facebook, 2017, "Accurate, Large Minibatch SGD") demonstrated that with the linear scaling rule and a learning-rate warmup, ResNet-50 can be trained on ImageNet in one hour on 256 GPUs without loss of accuracy. Synchronous training is today the standard for LLM training (PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, JAX pmap) and requires low-latency interconnects, hence the central role of RoCE/InfiniBand and NCCL.
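The sketch below shows one synchronous data-parallel step with an explicit per-parameter gradient all-reduce using `torch.distributed`. It assumes a process group has already been initialized; production frameworks such as DDP and FSDP fuse gradients into buckets and overlap the all-reduce with the backward pass rather than issuing one call per parameter as done here.

```python
# One synchronous training step with an explicit gradient all-reduce.
import torch
import torch.distributed as dist

def sync_step(model: torch.nn.Module, optimizer: torch.optim.Optimizer,
              loss: torch.Tensor, world_size: int) -> None:
    loss.backward()                                        # local gradients on this worker's shard
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # every worker receives the sum
            p.grad /= world_size                           # mean gradient: equivalent to N×B SGD
    optimizer.step()                                       # identical update on every replica
    optimizer.zero_grad()
```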
IB
InfiniBand (IB) is a networking standard maintained by the InfiniBand Trade Association (IBTA, founded 1999), in which hosts connect to the fabric via Host Channel Adapters (HCAs) and peripherals via Target Channel Adapters (TCAs). Its switched-fabric topology, credit-based link-level flow control, and native RDMA deliver microsecond latencies (1.3 µs at QDR, <0.6 µs at HDR) and full line rate without packet loss. Successive bandwidth generations (4× link speeds in Gbit/s) are: SDR (8, 2001), DDR (16, 2005), QDR (32, 2007), FDR (54.54, 2011), EDR (100, 2014), HDR (200, 2018), NDR (400, 2022), and XDR (800, 2024). InfiniBand supports five message types: RDMA read/write, channel send/receive, transactional operations, multicast, and atomics. The Linux kernel has supported IB since 2.6.11 (2005) via the OpenFabrics Enterprise Distribution (OFED) and the so-called verbs API. After 2014, IB briefly led the TOP500 interconnect ranking, but Ethernet/RoCE later reclaimed market share. In 2019 NVIDIA acquired Mellanox, the last independent vendor, and today IB is the primary scale-out fabric of NVIDIA's AI platforms (Quantum-2, Quantum-X800), used for LLM training in conjunction with NVLink/NVSwitch.
| Title | Publisher | Type |
|---|---|---|
| Supercomputer networking to accelerate large scale AI training | OpenAI | official website |
| Resilient AI Supercomputer Networking using MRC and SRv6 (white paper) | OpenAI | scientific article |
| OCP-MRC-1.0 specification | Open Compute Project | documentation |
| AMD advances AI networking at scale with MRC | AMD | blog |
| Enabling AI networking at scale with Multi-Path Reliable Connections (MRC) | Broadcom | blog |
| Building Resilient Networks for AI Supercomputers | Microsoft Azure | blog |
| NVIDIA Spectrum-X Ethernet MRC | NVIDIA | blog |