TPU v4: An Optically Reconfigurable Supercomputer for Accelerating Machine Learning

Abstract

This report summarizes the key innovations in Google’s Tensor Processing Unit version 4 (TPU v4), a domain-specific supercomputer for training machine learning models. TPU v4 introduces optical circuit switches for a flexible interconnect topology and Sparse Cores dedicated to accelerating embeddings. The optical switches build a 4096-chip interconnect whose reconfigurability improves reliability and whose tailorable topologies improve performance. Sparse Cores provide a 5-7x speedup for embeddings at only ~5% incremental area and power. Across production workloads, TPU v4 outperforms TPU v3 by 2.1x on average, with 2.7x better performance per Watt.

Introduction

Machine learning models continue to advance rapidly in scale and algorithmic complexity. TPU v4 is Google’s latest specialized hardware for training these demanding models, part of an integrated machine learning stack spanning algorithms, software infrastructure, and customized silicon. This report focuses on two vital architectural innovations in TPU v4:

  1. Optical circuit switches enabling a flexible interconnect topology

  2. Sparse cores dedicated to accelerating embedding lookups

Optical Circuit Switches for Flexible Topology

To scale up from 1024 chips per supercomputer in TPU v3 to 4096 chips, TPU v4 introduces optical circuit switches (OCSes) that connect the TPU chips over optical links. The OCSes offer eight substantial benefits:

  1. Scalability up to 4096 chips, a 4x increase over TPU v3

  2. Improved reliability through reconfigurability around failed chips

  3. Flexible topology tailored to optimize each job

  4. 1.2-2.3x higher performance from topology tuning

  5. Reduced power versus electronic packet switching

  6. Simplified scheduler for better utilization

  7. Faster partial deployment of systems

  8. Enhanced security isolation between jobs

The optical switches assemble the TPU v4 supercomputer from 64-chip building blocks, with electrical links between chips within a block and optical links between blocks. Despite the interconnect flexibility it enables, the OCS fabric accounts for less than 5% of overall system cost and less than 3% of system power, which makes a 4096-chip system practical to build with high efficiency and fault tolerance.
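
To make the topology tailoring concrete, here is a minimal Python sketch that enumerates the 3D torus shapes an OCS fabric could wire up for a job of a given size, assuming the 4×4×4 building blocks above (so every torus side is a multiple of 4). The function name and enumeration are illustrative assumptions, not Google’s scheduler logic.

```python
from itertools import product

# Minimal sketch (not Google's scheduler): list the 3D torus shapes an OCS
# fabric could wire up for n_chips, given 4x4x4 building blocks, so every
# torus side must be a multiple of 4.
def candidate_torus_shapes(n_chips):
    sides = range(4, n_chips + 1, 4)
    return sorted({tuple(sorted((x, y, z)))
                   for x, y, z in product(sides, repeat=3)
                   if x * y * z == n_chips})

# A 512-chip job could be wired as any of these shapes; which is fastest
# depends on the job's mix of data and model parallelism.
print(candidate_torus_shapes(512))  # [(4, 4, 32), (4, 8, 16), (8, 8, 8)]
```

Because the OCSes can rewire the same blocks into any of these shapes, the scheduler can pick whichever topology best matches a job’s communication pattern, which is where the 1.2-2.3x tuning gains come from.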

Figure 1: Connectivity of a 4×4×4 cube (top) to 3 OCSes (bottom). The “+” and “–” links with the same dimension and index connect to the same OCS; 48 of these in-out pairs each connect to a distinct OCS.
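
A quick way to see where Figure 1’s 48 pairs come from: each of the cube’s three dimensions exposes a 4×4 grid of link positions, and the “+” and “–” links at the same position attach to the same switch. The counting sketch below is illustrative; the mapping function is an assumption, not the paper’s wiring table.

```python
BLOCK = 4  # chips per side of a 64-chip (4x4x4) building block

# Illustrative mapping of the Figure 1 rule: the "+" and "-" optical links
# with the same dimension and (row, col) index share one OCS, and each of
# the resulting in-out pairs gets its own switch.
def ocs_for_link(dim, row, col, block=BLOCK):
    """Return the id of the OCS serving the +/- pair at this position."""
    return (dim * block + row) * block + col

pairs = {(d, r, c): ocs_for_link(d, r, c)
         for d in range(3) for r in range(BLOCK) for c in range(BLOCK)}
assert len(set(pairs.values())) == 48  # 3 dimensions x 4x4 positions each
```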

Sparse Cores for Embedding Acceleration

A major portion of Google’s machine learning workload consists of deep learning recommendation models (DLRMs). DLRMs rely heavily on embedding lookups, which stress memory bandwidth and generate all-to-all communication traffic. To accelerate embeddings, TPU v4 contains dedicated Sparse Cores that operate as an interconnected sea of simple dataflow processors. The Sparse Cores improve embedding performance by 5-7x over CPUs and 3.1x over TPU v3, at only ~5% incremental area and power; without them, placing embeddings in host CPU memory would make these models run 5-7x slower.
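
To see why embedding lookups turn into all-to-all traffic, consider a toy NumPy sketch of a row-sharded embedding table: each “device” owns a slice of the rows, so a batch of lookups must route every index to the shard that owns it and route the resulting vector back. The sharding scheme, sizes, and names below are illustrative assumptions, not the Sparse Core design.

```python
import numpy as np

NUM_DEVICES, ROWS_PER_SHARD, DIM = 4, 8, 16
rng = np.random.default_rng(0)

# Each "device" owns a contiguous shard of the embedding table's rows.
shards = [rng.standard_normal((ROWS_PER_SHARD, DIM)) for _ in range(NUM_DEVICES)]

def embedding_lookup(indices):
    """Fetch one embedding row per global index from its owning shard."""
    out = np.empty((len(indices), DIM))
    for i, idx in enumerate(indices):
        owner, local = divmod(idx, ROWS_PER_SHARD)  # which device owns the row
        out[i] = shards[owner][local]               # this hop is the all-to-all
    return out

# A single small batch already touches rows on several devices, so every
# device must exchange indices and result rows with every other device.
print(embedding_lookup([0, 9, 17, 31]).shape)  # (4, 16)
```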

Figure 2: Block diagram of the Sparse Cores in TPU v4.

Production Workload Performance

Across eight production workloads, TPU v4 demonstrates significantly higher performance than TPU v3. As Figure 3 shows, speedups range from 1.5x to 3.5x at the same slice sizes. The DLRMs see especially large gains of 3.0-3.5x because they benefit twice: from the optical interconnect’s greater bisection bandwidth and from dedicated acceleration in the Sparse Cores. Surprisingly, one RNN workload runs 3.3x faster on TPU v4, thanks to the increased scratchpad memory bandwidth.


Figure 3: TPU v4 speedup over TPU v3 on production jobs.

Conclusion

The optical circuit switches and Sparse Cores in TPU v4 enable scaling to 4096 chips while improving reliability, efficiency, and suitability for demanding machine learning workloads. TPU v4 delivers substantially higher performance across Google’s production jobs than prior generations. This specialized architecture exemplifies domain-specific system design that keeps pace with the rapidly changing machine learning field across multiple generations.

Reference

Jouppi, N., Kurian, G., Li, S., Ma, P., Nagarajan, R., Nai, L., Patil, N., Subramanian, S., Swing, A., Towles, B., Young, C., Zhou, X., Zhou, Z., & Patterson, D. A. (2023). TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA '23), 1-14. https://doi.org/10.1145/3579371.3589350
