Architecting Chips For High-Performance Computing

Introduction

The world's leading hyperscaler cloud data center companies — Amazon, Google, Meta, Microsoft, Oracle, and Akamai — are driving rapid innovation in chip architectures specifically designed for the cloud. To pack more compute horsepower into a smaller area with lower cooling costs, these companies are embracing heterogeneous, multi-core architectures optimized for specific data types and workloads.

This trend follows in the footsteps of mobile devices, which had to contend with tight footprints and stringent power and thermal requirements. Steve Roddy, VP of Marketing at Quadric, notes, "Monolithic silicon from industry stalwarts like Intel have AI NPUs in nearly every product code. And of course, AI pioneer NVIDIA has long utilized a mixture of CPUs, shader (CUDA) cores, and Tensor cores in its enormously successful data center offerings. The move in coming years to chiplets will serve to cement the transition completely."

The Economics of Custom Architectures

With the benefits of traditional scaling diminishing, and with advanced packaging now mature enough to enable customized features that were previously constrained by reticle size, the competitive race for performance per watt and per dollar has kicked into high gear. This has led to an explosion of custom architectures optimized for different workloads.

Neil Hand, Director of Marketing for the IC segment at Siemens EDA, explains, "Everyone is building their own architectures these days, especially the data center players, and a lot of the processor architecture comes down to what the workload looks like. At the same time, these developers are asking what the best path to acceleration is."

Some companies focus on parallelism with many cores, while others target memory bandwidth improvements. Many are developing dedicated accelerators for tasks like data manipulation, matrix operations, and compression/decompression.
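
To see why those priorities diverge, it helps to look at arithmetic intensity, the number of operations a workload performs per byte of data it moves. The short Python sketch below applies a simple roofline check to two workloads; the peak compute and memory bandwidth figures are illustrative assumptions rather than the specs of any particular chip.

```python
# Back-of-envelope roofline check: is a workload limited by compute or by
# memory bandwidth? The hardware numbers below are illustrative assumptions.
PEAK_FLOPS = 200e12      # 200 TFLOP/s of peak compute (assumed)
PEAK_BW    = 3.0e12      # 3 TB/s of memory bandwidth (assumed)

def roofline(flops, bytes_moved):
    """Return attainable FLOP/s and the limiting resource for a kernel."""
    intensity = flops / bytes_moved                  # FLOPs per byte of traffic
    attainable = min(PEAK_FLOPS, intensity * PEAK_BW)
    bound = "compute-bound" if attainable >= PEAK_FLOPS else "memory-bound"
    return attainable, bound

# Dense matrix multiply (N=4096): ~2*N^3 FLOPs over ~3*N^2 fp16 values moved
n = 4096
print(roofline(2 * n**3, 3 * n * n * 2))   # high intensity -> compute-bound

# Streaming elementwise op: 1 FLOP for every 4 bytes read or written
print(roofline(1e9, 4e9))                  # low intensity -> memory-bound
```

The dense matrix multiply saturates the compute units, while the streaming operation is starved by memory bandwidth, which is why different chips chase different bottlenecks.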

Heterogeneous Multi-Core Architectures

The resulting chip architectures are heterogeneous multi-core designs mixing general-purpose CPUs, GPUs, and fixed-function accelerators. As Patrick Verbist, Product Manager for ASIP Tools at Synopsys, describes:

"They are heterogeneous multi-core architectures, typically a mix of general-purpose CPUs and GPUs, depending on the type of company, since they have a preference for one or the other. Then there are RTL accelerators with a fixed function...The type of application loads these accelerators run, in general, include data manipulation, matrix multiplication engines, activation functions, compression/decompression of parameters, the weights of the graphs, and other things."

To support changing workload requirements, many companies are adopting application-specific instruction-set processors (ASIPs), which allow the datapath and instruction set to be customized.

"ASIPs allow you to customize the operators, so the data path and the instruction set only execute a limited set of operations in a more efficient way than a regular DSP can do," says Verbist. "If you look at a GPU, it has to support a variety of workloads, but not all workloads. And that's where ASIPs come into play to support the flexibility and the programmability."

Accommodating AI/ML Workloads

The rise of AI and machine learning is a major driver of this architectural diversity. Andy Heinig, Head of Efficient Electronics at Fraunhofer IIS, states, "The need for AI/ML will accelerate the process of developing new application-specific architectures. The classic CPUs can be part of this revolution if they provide a much better memory interface to solve the memory problem. If CPUs provide such new memory architectures, AI/ML accelerators can be the best solution for data centers alongside the CPU."

Arm is collaborating directly with hyperscalers like AWS, Google, and Microsoft on optimizing their Neoverse-based solutions for AI/ML and high-performance computing. "On-CPU inference is very important, and we are seeing our partners take advantage of our SVE pipes and matrix math enhancements and data types to run inference," says Brian Jeff, Senior Director of Product Management for Arm's Infrastructure line.

The massive model sizes required for large language models like GPT-3 are also driving new architectural considerations. Priyank Shukla, Principal Product Manager at Synopsys, explains:

"Let's take the example of GPT-3, which has 175 billion parameters. Each parameter has a 2-byte width, so that is 16 bits. You need to store this much information — 175 billion parameters into 2 bytes, which is equal to 350 billion bytes in memory. That memory needs to be stored in all the accelerators that share that model, which needs to be placed on the fabric of the accelerators...You need a fabric that can take that bigger model and then process it."

Some parts of these large models can be processed in parallel across multiple chips or racks, while other parts must be processed serially with low latency access to the full model.
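
Shukla's arithmetic translates directly into accelerator counts. The back-of-envelope Python sketch below works through it; the 80 GB of HBM per accelerator is an illustrative assumption.

```python
# Back-of-envelope sizing for serving a GPT-3-class model, following the
# arithmetic in the quote above. The per-accelerator HBM capacity is an
# illustrative assumption.
import math

params          = 175e9   # GPT-3 parameter count
bytes_per_param = 2       # 16-bit (fp16/bf16) weights
hbm_per_accel   = 80e9    # assumed 80 GB of HBM per accelerator

weight_bytes = params * bytes_per_param              # 350e9 bytes = 350 GB
accelerators = math.ceil(weight_bytes / hbm_per_accel)

print(f"Weights alone: {weight_bytes / 1e9:.0f} GB")                  # 350 GB
print(f"Minimum accelerators just to hold the weights: {accelerators}")  # 5
# Activations, KV caches, and (for training) optimizer state add to this,
# which is why the model must be partitioned across a coherent fabric.
```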

Figure 1: An ML-optimized server rack designed to handle such large models efficiently.

The Multi-Die Imperative

To integrate all the required compute elements — CPUs, GPUs, custom accelerators, high-bandwidth memory, etc. — while managing power and thermals, a multi-die or chiplet-based approach is becoming essential.

"The whole industry is at an inflection point where you cannot avoid this anymore," says Sutirtha Kabir, R&D Director at Synopsys. "We talk about Moore's Law and 'SysMoore' in the background, but the designers have to add more functionality in the CPUs and GPUs and there's just no way they can do that because of reticle size limits, yield limits, and all that in one chip. Multi-die chips are an inevitability here."

Multi-die design introduces new challenges around partitioning, inter-die synchronization, thermal management, and 3D floorplanning. "You're taking a single story house and making it three stories or four stories," explains Kabir. "But then there are other design challenges. You cannot ignore thermal anymore...If you don't take that into account during your floorplanning, you're going to fry your processors."

Power is a growing concern as well. "These data centers use a huge amount of power," remarked Marc Swinnen, Director of Product Marketing at Ansys. "I was at ISSCC in San Francisco and we were in a booth right next to NVIDIA, which was showing one of its AI training boxes — a big, bold box with eight chips, and scads and scads of fans and heat sinks. We asked how much power it uses, and they said, 'Oh, 10,000 watts at the top, but on average 6,000 watts.' Power is really becoming crazy."
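
Those figures add up quickly over a year of operation. The short Python sketch below estimates the annual energy for one such box from the quoted 6,000-watt average; the electricity price and PUE (power usage effectiveness) values are assumptions for illustration.

```python
# Rough annual energy and cost for one 8-accelerator training box, using
# the 6 kW average / 10 kW peak figures quoted above. Electricity price
# and data center PUE are illustrative assumptions.
avg_power_kw   = 6.0        # average draw, from the quote
hours_per_year = 24 * 365
pue            = 1.3        # assumed power usage effectiveness (cooling overhead)
price_per_kwh  = 0.10       # assumed electricity price in $/kWh

it_energy    = avg_power_kw * hours_per_year          # ~52,560 kWh of IT load
total_energy = it_energy * pue                        # ~68,300 kWh incl. cooling
print(f"IT energy:   {it_energy:,.0f} kWh/year")
print(f"Facility:    {total_energy:,.0f} kWh/year")
print(f"Energy bill: ${total_energy * price_per_kwh:,.0f}/year per box")
```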

Taking a Systems Approach

To tackle these multi-faceted design challenges, a comprehensive system-level approach is required that spans the instruction set, microarchitecture, memory subsystem, interconnects, and more.

"A complete system approach enables us to work with our partner to tailor SoC designs to modern workloads and process nodes, while taking advantage of chiplet-based design approaches," states Arm's Jeff. "This approach to custom chip design enables data center operators to optimize their power costs and computing efficiency."

Siemens' Hand also emphasizes the importance of system-level analysis and optimization: "The system-level co-design into the application has become important, and it's become more accessible because high-performance compute is no longer what it used to be. This is a data center on wheels."

The Road Ahead

Where this architectural evolution leads is difficult to predict, but it's clear the definition of "high-performance compute" will continue expanding.

"Once you start breaking von Neumann architectures and start using different memory flows and start looking at in-memory compute, it gets really cool. And then you say, 'What does high-performance compute even mean?' It's going to depend on the application," says Hand.

Factors like integrating silicon photonics, unified memory architectures across racks, and non-von Neumann computing models could radically reshape data center system topologies and redefine what constitutes optimal architecture and performance.

The one certainty is that the pace of innovation in cloud data center chip design will only accelerate, as the world's largest tech companies continue their race to deliver leading performance, efficiency, and scalability for rapidly growing AI/ML and traditional computing workloads.

Reference

[1] B. Smith, "Architecting Chips For High-Performance Computing," Semiconductor Engineering, May 15, 2024. [Online]. Available: https://semiengineering.com/architecting-chips-for-high-performance-computing/. [Accessed: May 23, 2024].
