Abstract
Due to the continuous growth of AI accelerator chip power and heat flux, adopting advanced cooling technologies for AI platforms appears inevitable for hyperscale users. Liquid cooling is one of the more mature categories of advanced cooling technology and has been adopted in a variety of forms across the industry. However, not all liquid cooling solutions deliver high performance at reasonable cost and efficiency. In addition, it is not straightforward to strike a proper balance of performance, reliability, serviceability, and scalability for a product, or to prepare the facility accordingly so that it aligns with long-term strategy.
In this presentation, we introduce a passive cold plate loop solution (Tide 1.0) based on Meta’s AI training platform (Zion) with eight Open Accelerator Modules (OAMs). The design reflects considerations of both performance and serviceability. Thermal simulation and optimization studies will be presented. The solution was tested on dummy thermal test vehicles (TTVs) and on a real functional system, and a cooling capability forecast is also presented. Results showed a good match between simulation, TTV testing, and real system testing. The resulting performance demonstrates a strong use case for liquid cooling solutions on upcoming AI platforms.