
NVIDIA's Multi-GPU Communication Technologies and Outlook

Terence S.-Y. Chen, Latitude Design Systems

Introduction

To meet the growing demands of artificial intelligence (AI), high-performance computing (HPC), and other data-intensive workloads, advanced interconnect solutions are needed to enable high-bandwidth, low-latency communication among multiple GPUs. NVIDIA has been at the forefront of developing these technologies, including NVLink for fast GPU-to-GPU connections, NVSwitch for switched multi-GPU network topologies, and advanced packaging and silicon photonics for chip-to-chip communication.

Figure 1. NVLink: Direct GPU Interconnect with Scalable IO and NVSwitch for Full-Speed Multi-Node Communication
NVLink

NVLink is NVIDIA's high-speed direct GPU interconnect, enabling multi-GPU scaling within servers or workstations. Compared to traditional PCIe solutions, NVLink provides significantly higher bandwidth, lower latency, and better scalability, allowing multi-GPU systems to handle the communication demands of AI and HPC workloads far more efficiently. The latest NVLink implementation offers up to 900 GB/s of bidirectional bandwidth per GPU, more than an order of magnitude above the roughly 64 GB/s of a PCIe 4.0 x16 link, while latency drops to the order of hundreds of nanoseconds. This dramatic improvement in interconnect performance enables much more efficient workload scaling across multiple GPUs.

NVLink uses an optimized, packet-based protocol tailored to GPU traffic patterns and connects over simple, narrow traces on the board or through an NVLink bridge, minimizing routing congestion. The links are managed by the GPU driver and are transparent to applications, requiring no code changes, and the interconnect is fully cache coherent between CPUs and GPUs, maintaining data consistency.

Overall, NVLink delivers the high-bandwidth, low-latency, scalable interconnect performance demanded by multi-GPU AI and HPC workloads, overcoming the limitations of PCIe while integrating seamlessly into existing software ecosystems. NVLink has evolved generation over generation with increasing bandwidth, lower latency, and more parallel links to support larger-scale multi-GPU interconnects.

Key characteristics of different NVLink generations:

Feature | 2nd Gen | 3rd Gen | 4th Gen
NVLink Bandwidth per GPU | 300 GB/s | 600 GB/s | 900 GB/s
Max Links per GPU | 6 | 12 | 18
Supported NVIDIA Architecture | Volta | Ampere | Hopper
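
To make the "no code changes" point above concrete, the following minimal sketch (illustrative CUDA code, not an NVIDIA sample; the buffer size and device indices are arbitrary) shows how an application queries and enables peer-to-peer access between two GPUs and performs a direct device-to-device copy. When the GPUs are connected by NVLink, the CUDA driver carries this traffic over NVLink automatically.

// Minimal sketch: query and use direct peer-to-peer (P2P) access between
// two GPUs. When the GPUs are linked by NVLink, the CUDA driver routes the
// peer copy over NVLink automatically; no NVLink-specific API is needed.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) {
        printf("Need at least two GPUs for this example.\n");
        return 0;
    }

    // Check whether each GPU can directly access the other's memory.
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("P2P 0->1: %d, 1->0: %d\n", canAccess01, canAccess10);

    if (canAccess01 && canAccess10) {
        // Enable peer access in both directions.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);

        // Allocate a buffer on each GPU (size is arbitrary for illustration).
        const size_t bytes = 64ull << 20;  // 64 MiB
        void *buf0 = nullptr, *buf1 = nullptr;
        cudaSetDevice(0);
        cudaMalloc(&buf0, bytes);
        cudaSetDevice(1);
        cudaMalloc(&buf1, bytes);

        // Direct GPU-to-GPU copy; travels over NVLink when available.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
    }
    return 0;
}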

NVSwitch

Building on NVLink, NVIDIA developed NVSwitch, a network switch that enables full-speed multi-GPU communication within single-node or multi-node systems. NVSwitch can interconnect up to 256 GPUs with a total bandwidth of 57.6 TB/s, ideal for large-scale AI workloads.

NVSwitch uses NVLink to provide direct, full-bandwidth peer-to-peer connectivity between GPU pairs. NVSwitch systems can be configured in different topologies, such as mesh or hybrid cube mesh, for various workloads and packaging constraints. NVSwitch handles routing between GPUs and applies congestion control to prevent bottlenecks.

Compared to earlier NVIDIA GPU interconnects, NVSwitch offers significantly higher bandwidth, maximizing inter-GPU communication capacity, and its direct GPU-to-GPU pathways reduce latency. This enables more scalable multi-GPU systems. NVSwitch is fully integrated into NVIDIA's accelerated computing stack, including CUDA, the GPU drivers, the Multi-Process Service, and communication libraries, so applications can leverage its capabilities without any code changes. NVSwitch systems can also be connected over NVIDIA Mellanox InfiniBand networks for multi-node configurations.
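
As an illustration of how applications typically exercise NVSwitch and NVLink bandwidth through that software stack, the minimal single-process sketch below runs an NCCL all-reduce across all GPUs in a node. This is illustrative code, not NVIDIA reference code; the buffer size is arbitrary, and NCCL transparently selects NVSwitch/NVLink paths when available, falling back to PCIe or the network otherwise.

// Minimal single-process sketch: an NCCL all-reduce across all visible GPUs.
// NCCL picks the fastest available path (NVLink/NVSwitch, then PCIe/network),
// so the application code stays the same regardless of the interconnect.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    std::vector<ncclComm_t> comms(nDev);
    std::vector<float*> sendbuf(nDev), recvbuf(nDev);
    std::vector<cudaStream_t> streams(nDev);
    const size_t count = 1 << 24;  // 16M floats per GPU (illustrative size)

    // One communicator per local GPU (NULL device list = devices 0..nDev-1).
    ncclCommInitAll(comms.data(), nDev, nullptr);

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaMemset(sendbuf[i], 1, count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-GPU calls so NCCL launches them as one collective.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i) {
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}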

NVSwitch performance metrics across generations:

Feature | 1st Gen | 2nd Gen | 3rd Gen
GPUs per Node | Up to 8 | Up to 8 | Up to 8
GPU-to-GPU Bandwidth | 300 GB/s | 600 GB/s | 900 GB/s
Total Aggregate Bandwidth | 2.4 TB/s | 4.8 TB/s | 7.2 TB/s
Supported NVIDIA Architecture | Volta | Ampere | Hopper

NVIDIA NVSwitch sets a new standard for intra-node GPU communication scalability and performance, paving the way for extremely powerful AI supercomputers and data centers.

Advanced Packaging and Silicon Photonics

To further advance interconnect capabilities, NVIDIA leverages advanced 2.5D/3D packaging and silicon photonics, enabling highly efficient, high-density chip-to-chip communication.

  1. Advanced Packaging

  Using advanced wafer-level packaging, multiple dies can be vertically interconnected with high-density microbumps, enabling far more data transfer than board-level connections. NVIDIA utilizes chip-on-wafer-on-substrate (CoWoS) packaging to integrate GPU dies with HBM memory stacks and to connect to an NVLink switch chip through over 10,000 microbumps. This 2.5D integration provides very high GPU-to-GPU bandwidth over NVLink along with high memory bandwidth. Additional 3D stacking techniques, such as through-silicon vias (TSVs), can further increase interconnect density between dies. Combined, these packaging innovations enable tighter integration of more components for higher system performance.

  2. Silicon Photonics

  While electrical signaling is reaching its limits in interconnect capacity and efficiency, silicon photonics offers a path to overcome these challenges through optical chip-to-chip transmission. Silicon photonics can deliver higher bandwidth at lower power consumption than electrical links. NVIDIA is developing silicon photonics to enable future high-throughput, power-efficient optical NVLink connections for GPU systems. Using wavelength division multiplexing, each fiber optic cable can potentially carry terabits per second of data (a rough capacity estimate follows this list), allowing the NVLink fabric to scale to thousands of interconnected GPUs across a data center. The optical transceivers and modulators are integrated into the silicon package along with CMOS control logic, while light sources are provided externally. This co-packaged optics (CPO) approach combines the benefits of silicon photonics and advanced 2.5D integration.

  Electronic-photonic co-design tools like PIC Studio can facilitate the design and simulation of these advanced photonic packaging solutions. PIC Studio enables co-simulation of electronic driver circuits and photonic components to ensure signal integrity across the electro-optic boundary, and it interoperates with electronic simulators such as Spectre to link CMOS and photonic simulations. PDK support from major silicon photonics foundries further accelerates development, and the optical transceiver building blocks in PIC Studio can be leveraged to simulate NVLink's envisioned high-speed optical links.

  Overall, silicon photonics leverages the speed and efficiency of optics to push interconnect performance to new levels. Combined with advanced packaging, this technology will enable next-generation exascale supercomputing and massive-scale AI training.
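
As a rough, illustrative estimate of the per-fiber capacity that wavelength division multiplexing makes possible (the lane count and per-wavelength rate below are assumptions for illustration, not NVIDIA specifications):

capacity per fiber ≈ (wavelengths per fiber) × (data rate per wavelength)
e.g., 32 wavelengths × 100 Gb/s per wavelength ≈ 3.2 Tb/s per fiber

Scaling either the number of wavelengths or the per-wavelength modulation rate increases capacity proportionally, which is how multi-terabit optical links of the kind envisioned for NVLink become feasible.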

Interconnect Technology Roadmap

NVIDIA continues advancing NVLink and NVSwitch to keep pace with the growing demands of AI and HPC workloads, increasing bandwidth, lowering latency, and expanding connectivity with each generation.

Last year’s announcement of NVLink-C2C opened up NVLink for use as a high-speed cache-coherent interconnect between custom chips and NVIDIA GPUs, DPUs and CPUs. This will enable new heterogeneous computing systems composed of chiplets. NVLink-C2C supports industry standards like CXL and AMBA CHI for interoperability.

Looking ahead, as advanced packaging and silicon photonics solutions mature, optical I/O technology will make its way into NVIDIA products. Multi-GPU servers with CPO will achieve huge leaps in bandwidth capacity and energy efficiency. Tighter integration of GPUs, memory, and switches will also boost system performance.

Conclusion

NVIDIA’s long-term vision is to architect high-performance accelerated computing platforms, from chips to full data centers. Key to this is developing innovative interconnects like NVLink, NVSwitch, and optical networking to eliminate communication bottlenecks. Multi-GPU scaling and chiplet integration are becoming critical to advance AI and HPC capabilities. With its extensive expertise and sustained research, NVIDIA aims to provide world-leading interconnect solutions to meet the demands of next-generation exascale computing.
