Multipath Reliable Connection
MRC sprays a single RDMA transfer across hundreds of paths through multiple parallel network planes, using static SRv6 source routing instead of dynamic routing protocols, eliminating core congestion and routing around failures on a microsecond timescale.
MRC splits each 800 Gb/s NIC into eight independent 100 Gb/s links connected to different switches, creating parallel network planes. For a single RDMA transfer, packets are sprayed across hundreds of paths in all planes. Each packet carries the final memory address, so packets can arrive out of order and be written directly. MRC keeps state for many paths and swaps a path when it detects congestion; on a packet loss it immediately stops using that path and probes it. For destination-side congestion it uses packet trimming: the switch strips the payload and forwards only the header, triggering an explicit retransmission request. Routing uses IPv6 Segment Routing (SRv6): the sender encodes a sequence of switch identifiers in the destination address, and each switch removes its own identifier and consults a static routing table to decide the next hop. Dynamic routing (BGP) is disabled.
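The sketch below is a plain-Python illustration of the sender-side behavior described above, not OpenAI's implementation; the names (`Path`, `Packet`, `spray`, `on_trim`, `on_loss`) and the round-robin policy are hypothetical. It shows the key ideas: each packet carries its destination memory offset, packets are distributed across all currently usable paths, and a path is retired when it signals congestion (a trim) or loses a packet.

```python
# Hypothetical sketch of MRC-style packet spraying (names and policy are made up).
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Path:
    segment_list: list[str]   # SRv6 SIDs: switch identifiers toward the destination
    healthy: bool = True      # set False on loss; re-enabled after a successful probe
    congested: bool = False   # set True on a trim (header-only) notification

@dataclass
class Packet:
    transfer_id: int
    mem_offset: int           # destination memory offset carried in every packet,
                              # so it can be written on arrival regardless of order
    payload: bytes
    segment_list: list[str]   # encoded into the SRv6 header on the wire

def spray(transfer_id: int, data: bytes, paths: list[Path], mtu: int = 4096) -> list[Packet]:
    """Split one transfer into packets and assign each to the next usable path."""
    usable = [p for p in paths if p.healthy and not p.congested]
    if not usable:
        raise RuntimeError("no usable paths")
    rr = cycle(usable)
    packets = []
    for offset in range(0, len(data), mtu):
        path = next(rr)
        packets.append(Packet(transfer_id, offset, data[offset:offset + mtu],
                              list(path.segment_list)))
    return packets

def on_trim(path: Path) -> None:
    # Destination-side congestion: the switch forwarded only the header,
    # so stop picking this path and let the receiver request retransmission.
    path.congested = True

def on_loss(path: Path) -> None:
    # Stop using the path immediately; probe it before reuse.
    path.healthy = False
```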
In AI training clusters at the scale of hundreds of thousands of GPUs, a single late transfer can stall an entire synchronous training step, and link or switch failures in classic single-path RoCE networks cause multi-second pauses or job crashes. Traditional transport protocols pin all packets of a transfer to a single path, leading to hot spots and underuse of the available path diversity.
Fully parallel: A single transfer is sprayed across hundreds of concurrent paths through all network planes.
Conditional / input dependent: Adaptive packet spraying selects paths dynamically in response to load and loss/trim signals.
Reference implementations
GENESIS · Source paper
Resilient AI Supercomputer Networking using MRC and SRv6
MRC specification released through Open Compute Project
Breakthrough · May 5, 2026: OpenAI publishes the MRC 1.0 specification as an OCP contribution alongside the MRC + SRv6 white paper.
Deployed on OpenAI NVIDIA GB200 clusters; built into 800 Gb/s NICs attached to GPUs.
Commonly used with
RoCE
RDMA over Converged Ethernet (RoCE) is a family of network protocols standardized by the InfiniBand Trade Association (IBTA) that bring RDMA semantics (remote memory access that bypasses the host CPU networking stack) onto Ethernet. Three variants exist: RoCE v1 operates as an Ethernet link-layer protocol (Ethertype 0x8915) confined to a single broadcast domain; the experimental RoCE v1.5 runs over IP; RoCE v2 encapsulates packets inside UDP/IP (port 4791) and is routable across IPv4/IPv6 networks. To approach InfiniBand-class performance, RoCE typically requires a lossless Ethernet fabric configured with Priority Flow Control (PFC) and Data Center Bridging (DCB); RoCE v2 additionally defines an ECN-based congestion-control mechanism using CNP frames. RoCE is today the dominant interconnect for GPU clusters in large-scale AI training, with end-to-end latencies as low as 1.3 µs on modern host-channel adapters.
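As a rough illustration of the RoCE v2 encapsulation described above, the sketch below packs a simplified Base Transport Header (BTH) and pairs it with the well-known UDP destination port 4791. It is illustrative only: the helper names are hypothetical, real adapters carry additional fields (e.g., the invariant CRC trailer) and build these headers in hardware.

```python
# Simplified RoCE v2 framing sketch: BTH + payload carried over UDP/IP (dst port 4791).
import struct

ROCE_V2_UDP_DPORT = 4791  # IANA-assigned UDP port for RoCE v2

def build_bth(opcode: int, dest_qp: int, psn: int) -> bytes:
    """Pack a simplified 12-byte InfiniBand Base Transport Header."""
    # opcode (1 B), SE/M/pad/version flags (1 B), partition key (2 B),
    # reserved + destination QP (4 B), ack-request bit + PSN (4 B)
    return struct.pack("!BBHII", opcode, 0, 0xFFFF, dest_qp & 0xFFFFFF, psn & 0xFFFFFF)

def roce_v2_datagram(payload: bytes, dest_qp: int, psn: int) -> tuple[int, bytes]:
    """Return (udp_dst_port, udp_payload) for a RoCE v2 datagram."""
    bth = build_bth(opcode=0x04, dest_qp=dest_qp, psn=psn)  # 0x04 = RC SEND-only
    return ROCE_V2_UDP_DPORT, bth + payload
```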
SRv6
SRv6 (Segment Routing over IPv6, RFC 8754, March 2020) is a source-routing architecture in which the ingress node injects a list of instructions, called SIDs (Segment Identifiers), encoded as 128-bit IPv6 addresses inside a dedicated IPv6 extension header named the Segment Routing Header (SRH). Each SID combines locator semantics (where the packet should go) with function semantics (what the node should do: forwarding, VPN, encap, decap, service chaining, traffic engineering). The overarching segment-routing architecture is specified in RFC 8402 (July 2017); SRv6 is its IPv6-native instantiation, an alternative to SR-MPLS. The key benefit is that a single IPv6 data plane carries underlay forwarding, routing, traffic engineering, network slicing, VPN, and Network Programming simultaneously, without separate protocols (LDP, RSVP-TE) and without per-flow state in the core. In AI contexts, SRv6 is deployed in hyperscaler scale-out fabrics (Microsoft, Meta, Alibaba) to multipath RoCE/RDMA traffic and apply per-path congestion control.
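A minimal sketch of how a sender might encode such a SID list into an RFC 8754 Segment Routing Header is shown below. The `build_srh` helper and the example SIDs are made up for illustration; the outer IPv6 header, whose destination address would carry the first SID, is omitted.

```python
# Sketch of SRH (RFC 8754) encoding for a SID list given in travel order.
import ipaddress
import struct

def build_srh(sids: list[str], next_header: int = 17) -> bytes:
    """Encode an SRH: the segment list is stored in reverse order, Segments Left
    indexes the active SID, and each segment endpoint decrements it and copies
    the next SID into the IPv6 destination address."""
    segs = [ipaddress.IPv6Address(s).packed for s in reversed(sids)]
    hdr_ext_len = 2 * len(segs)        # header length in 8-octet units, excluding the first 8
    segments_left = len(segs) - 1      # SIDs still to visit after the one already in the DA
    last_entry = len(segs) - 1
    fixed = struct.pack("!BBBBBBH", next_header, hdr_ext_len, 4,   # routing type 4 = SRH
                        segments_left, last_entry, 0, 0)           # flags = 0, tag = 0
    return fixed + b"".join(segs)

# Example: a three-hop path expressed as 128-bit SIDs (addresses are made up).
srh = build_srh(["fc00:1::1", "fc00:2::1", "fc00:3::1"])
```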
Synchronous Training
Synchronous Distributed Training is the dominant paradigm for scaling deep learning, in which N workers (typically GPUs or TPUs) replicate the model and process different shards of a minibatch. After computing local gradients, all workers synchronously aggregate them via an all-reduce operation: every worker receives the sum (or mean) of all peers' gradients. Only after the all-reduce completes does the optimizer update the weights, keeping all replicas identical. Mathematically the scheme is equivalent to single-node SGD on an N×B minibatch, eliminating the stale-gradient issue of asynchronous parameter-server training. Goyal et al. (Facebook, 2017, "Accurate, Large Minibatch SGD") demonstrated that with the linear scaling rule and a learning-rate warmup, ResNet-50 can be trained on ImageNet in one hour on 256 GPUs without loss of accuracy. Synchronous training is today the standard for LLM training (PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, JAX pmap) and requires low-latency interconnects, hence the central role of RoCE/InfiniBand and NCCL.
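The sketch below shows one synchronous data-parallel step with an explicit per-parameter gradient all-reduce using `torch.distributed`. It assumes a process group has already been initialized; production frameworks such as DDP and FSDP fuse gradients into buckets and overlap the all-reduce with the backward pass rather than issuing one call per parameter as done here.

```python
# One synchronous training step with an explicit gradient all-reduce.
import torch
import torch.distributed as dist

def sync_step(model: torch.nn.Module, optimizer: torch.optim.Optimizer,
              loss: torch.Tensor, world_size: int) -> None:
    loss.backward()                                        # local gradients on this worker's shard
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # every worker receives the sum
            p.grad /= world_size                           # mean gradient: equivalent to N×B SGD
    optimizer.step()                                       # identical update on every replica
    optimizer.zero_grad()
```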
IB
InfiniBand (IB) is a networking standard maintained by the InfiniBand Trade Association (IBTA, founded 1999), in which hosts connect to the fabric via Host Channel Adapters (HCAs) and peripherals via Target Channel Adapters (TCAs). Its switched-fabric topology, credit-based link-level flow control, and native RDMA deliver microsecond latencies (1.3 µs at QDR, <0.6 µs at HDR) and full line rate without packet loss. Successive bandwidth generations (4× link speeds in Gbit/s) are: SDR (8, 2001), DDR (16, 2005), QDR (32, 2007), FDR (54.54, 2011), EDR (100, 2014), HDR (200, 2018), NDR (400, 2022), and XDR (800, 2024). InfiniBand supports five message types: RDMA read/write, channel send/receive, transactional operations, multicast, and atomics. The Linux kernel has supported IB since 2.6.11 (2005) via the OpenFabrics Enterprise Distribution (OFED) and the so-called verbs API. After 2014, IB briefly led the TOP500 interconnect ranking, but Ethernet/RoCE later reclaimed market share. In 2019 NVIDIA acquired Mellanox, the last independent vendor, and today IB is the primary scale-out fabric of NVIDIA's AI platforms (Quantum-2, Quantum-X800), used for LLM training in conjunction with NVLink/NVSwitch.
| Title | Publisher | Type |
|---|---|---|
| Supercomputer networking to accelerate large scale AI training | OpenAI | official website |
| Resilient AI Supercomputer Networking using MRC and SRv6 (white paper) | OpenAI | scientific article |
| OCP-MRC-1.0 specification | Open Compute Project | documentation |
| AMD advances AI networking at scale with MRC | AMD | blog |
| Enabling AI networking at scale with Multi-Path Reliable Connections (MRC) | Broadcom | blog |
| Building Resilient Networks for AI Supercomputers | Microsoft Azure | blog |
| NVIDIA Spectrum-X Ethernet MRC | NVIDIA | blog |