
The Emergence of Integrated Heterogeneous Architectures

Abstract

This article examines how technology leaders like NVIDIA and Apple are pioneering tightly integrated heterogeneous computing architectures, unifying CPUs, GPUs, and accelerators using high-bandwidth interconnects and shared memory to deliver new levels of performance and efficiency.

Introduction

Heterogeneous computing architectures that tightly couple different types of processors are becoming critical for high-performance computing across applications. By integrating CPUs, GPUs, and accelerators into unified systems that are optimized as a whole, large gains in efficiency and capability become possible. Companies like NVIDIA and Apple are leading the way.

NVIDIA's rise to prominence is no accident. As early as 2007, when it launched CUDA, NVIDIA focused on general-purpose GPU (GPGPU) computing to accelerate parallel workloads beyond graphics, an opportunity that AMD and Intel failed to foresee. GPGPU first gained traction in high-performance computing (HPC), where it substantially accelerated parallel computations. However, programming complexity slowed mainstream adoption until NVIDIA built out its ecosystem to make GPU programming more accessible. Even so, before the AI boom GPGPU remained a niche outside academia, with little industry adoption in places like Taiwan.

NVIDIA Bets Early on GPU Compute

NVIDIA focused very early on GPGPU computing for parallel workloads beyond graphics. It lowered the barrier to entry with the CUDA ecosystem, bringing GPU acceleration to the mainstream. GPGPU first gained traction in HPC before finding its killer application in AI.

GPUs Accelerate AI

The rise of deep learning gave GPUs a killer application. Training deep networks is highly parallel and computationally demanding, which matches the parallel processing strengths of GPUs. NVIDIA's years of ecosystem development for GPUs in high-performance computing provided the foundation that let it meet AI computing demand quickly. As a result, NVIDIA achieved rapid growth for GPUs in the AI field.
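
To make the parallelism claim concrete, here is a minimal sketch in Python. It counts the work in the forward pass of a single dense layer, which is a matrix multiply in which every output element is a dot product that depends on no other output. The function name and the layer sizes are hypothetical, chosen only for illustration.

```python
# Illustrative only: sizes are hypothetical, not taken from any real model.

def dense_layer_work(batch: int, in_features: int, out_features: int) -> dict:
    """Count independent outputs and multiply-adds for one dense layer."""
    # Each output element is one dot product and needs no other output,
    # so all of them could in principle be computed at the same time.
    independent_outputs = batch * out_features
    # Each dot product runs over the input features.
    macs_per_output = in_features
    total_macs = independent_outputs * macs_per_output
    return {
        "independent_outputs": independent_outputs,
        "macs_per_output": macs_per_output,
        "total_macs": total_macs,
    }


work = dense_layer_work(batch=512, in_features=4096, out_features=4096)
print(work)
```

With these assumed sizes, one layer already exposes about two million independent dot products, which is the kind of workload that thousands of GPU cores can share.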

Interconnects and Infrastructure

High-speed interconnect technologies such as NVIDIA's in-house NVLink make it possible to build tightly coupled, unified GPU clusters. By acquiring Mellanox, NVIDIA obtained leading high-speed networking technology, further strengthening GPU connectivity at the data center level. NVIDIA also developed Data Processing Units (DPUs), a form of smart network interface card, to further optimize data center network infrastructure and provide secure, reliable GPU virtualization.

Figure 1. Topology of a fully connected NVIDIA NVLink Switch System across NVIDIA DGX GH200 consisting of 256 GPUs (Source: NVIDIA)
Figure 2. Performance comparisons for giant memory AI workloads (Source: NVIDIA)

The Rise of Unified Architectures

In heterogeneous systems, mechanisms such as shared memory and cache coherency enable tighter CPU-GPU integration, bringing major performance improvements over discrete architectures. NVIDIA's Grace processor employs new interconnect and memory architectures, delivering 900GB/s of bandwidth between the CPU and GPU. Apple's in-house M-series chips let the CPU and GPU share a single unified memory.

Although its planned acquisition of ARM did not go through, NVIDIA still collaborates closely with ARM, since both companies face common competitors in the x86-dominated landscape. In the GH200 chip, built on the Grace Hopper architecture and launched this year, NVIDIA connects the CPU and GPU with a 900GB/s link that supports cache coherency, allowing the GPU to access the CPU's 480GB of shared memory 7 times faster than over PCIe Gen5.
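
The "7 times" figure can be checked with simple arithmetic. The sketch below assumes a PCIe Gen5 x16 link with a peak bidirectional bandwidth of roughly 128GB/s; that number is an assumption added here, not a figure from the article, and the variable and function names are hypothetical. It compares peak rates only and ignores protocol overhead and latency.

```python
# Bandwidth figures in GB/s. The 900 and 480 values come from the text above;
# the PCIe value is an assumption (peak bidirectional rate of a Gen5 x16 link).
CPU_GPU_LINK_GB_PER_S = 900.0
PCIE_GEN5_X16_GB_PER_S = 128.0  # assumed, not stated in the article
CPU_MEMORY_GB = 480.0


def sweep_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to stream size_gb once at the given peak bandwidth."""
    return size_gb / bandwidth_gb_per_s


speedup = CPU_GPU_LINK_GB_PER_S / PCIE_GEN5_X16_GB_PER_S
link_time = sweep_seconds(CPU_MEMORY_GB, CPU_GPU_LINK_GB_PER_S)
pcie_time = sweep_seconds(CPU_MEMORY_GB, PCIE_GEN5_X16_GB_PER_S)

print(f"speedup over PCIe Gen5 x16: {speedup:.2f}x")
print(f"one pass over 480GB: {link_time:.2f}s vs {pcie_time:.2f}s")
```

Under these assumptions the ratio comes out at about 7, and a full pass over the 480GB of CPU memory drops from several seconds to roughly half a second.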

Figure 3. DGX GH200 System Architecture (Source: NVIDIA)

This architecture is well suited to large language model applications. Currently the largest memory on a single NVIDIA GPU is 96GB, with 144GB versions to be launched in the future. But does the memory have to be concentrated on the GPU side? The GPU in Apple's M2 Ultra chip can access the chip's 192GB of shared memory at 800GB/s, a larger memory space than a single NVIDIA GPU offers. It can therefore serve as a low-cost AI development platform. The newly released M3 Max gives the MacBook Pro up to 128GB of shared memory, so it too can run much larger language models than ordinary graphics cards.
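
A rough way to see why memory capacity matters is to estimate how much space a model's weights occupy. The sketch below uses the three capacities quoted above and a hypothetical 70-billion-parameter model. It counts weights only (no KV cache, activations, or operating system overhead) and treats all of the stated capacity as usable, which real systems will not quite allow. All names in the code are illustrative.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Memory capacities in GB, as quoted in the paragraph above.
MEMORY_GB = {
    "single NVIDIA GPU": 96,
    "MacBook Pro (M3 Max)": 128,
    "M2 Ultra": 192,
}


def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate size of the weights in GB (1e9 params x bytes / 1e9)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str) -> dict:
    """Report, per platform, whether the weights alone fit in memory."""
    need = weights_gb(params_billion, precision)
    return {name: need <= capacity for name, capacity in MEMORY_GB.items()}


# Hypothetical 70B-parameter model at two precisions.
print(weights_gb(70, "fp16"), fits(70, "fp16"))
print(weights_gb(70, "int8"), fits(70, "int8"))
```

In this estimate the 16-bit weights of a 70B model need about 140GB, which exceeds the 96GB and 128GB capacities but fits in 192GB, while an 8-bit version fits on all three.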

Figure 4. Apple M3 chip

Co-design and Ecosystems Drive Innovation

NVIDIA and Apple design hardware around software and application requirements rather than confining themselves to existing specifications, a forward-looking co-design approach. By building ecosystems around their platforms and continuously co-optimizing hardware and software, they can fully exploit the cooperative computing capabilities of heterogeneous platforms. This design philosophy gave rise to the integrated heterogeneous computing paradigm and opened a new chapter in high-performance computing.

Conclusion

The rise of heterogeneous computing architectures marks a new trend in the development of computing systems. Flexible combinations of specialized processors, together with hardware-software co-design, can deliver leaps in system capability. NVIDIA and Apple are leading the industry toward an era of highly integrated and optimized heterogeneous computing, which promises major gains in computing efficiency and system capability and opens up a wide range of new applications.
