Abstract

A new era of computing has begun with the development of High Performance Computing (HPC), Artificial Intelligence (AI), Machine Learning (ML) and Cognitive Systems. Dramatic increases in the power density of electronic components have driven the design of efficient thermal management technologies for these systems. In 2018, IBM designed and delivered Summit and Sierra, which at the time ranked among the world's fastest and most powerful supercomputers on the LINPACK benchmark, with peak performance of up to 200 petaflops. These systems come in both air-cooled and liquid-cooled configurations. In the liquid-cooled systems, water cools the high-power electronic components, including the IBM POWER9 processors and NVIDIA GPUs. In this paper, we present an overview of the thermal and mechanical design strategies applied to these systems. Experimental test results are analyzed and compared with computational models. Thermal control strategies are investigated to optimize overall system efficiency. For the air-cooled systems, we discuss the fan and heat sink designs, as well as the preheating effect on the PCIe section. For the liquid-cooled systems, which use a unique cold plate design to cool both the processors and the GPUs with water, we examine the water flow path design for the CPUs and GPUs and the thermal performance of the cold plate. Cooling assemblies such as thermal interface materials (TIMs) and air baffles are also discussed. Unit and rack manifolds and the rear door heat exchanger are investigated, and water flow and pressure distributions at the node and rack levels are provided.