In recent years, the number, size, and power density of data centers have grown significantly. A significant part of data center power consumption is attributed to the cooling infrastructure, which consists of computer room air conditioning (CRAC) units, chillers, and cooling towers. To operate and manage cooling resources energy-efficiently, data centers are beginning to be extensively instrumented with temperature sensors. While this allows cooling actuators, such as the CRAC set point temperature, to be dynamically controlled and data centers to be operated at higher temperatures to save energy, it also increases the chance of thermal anomalies. Moreover, because large data centers can contain thousands to tens of thousands of such sensors, manually inspecting and analyzing the large volumes of dynamic data they generate is virtually impossible, which necessitates autonomous mechanisms for thermal anomaly detection. Threshold-based detection methods alone do not suffice; other mechanisms of anomaly detection are also necessary. In this paper, we describe the commonly occurring thermal anomalies in a data center and, with examples from a production data center, techniques to detect them autonomously. In particular, we show the usefulness of a principal component analysis (PCA) based methodology applied to a large temperature sensor network. Specifically, we examine thermal anomalies related to misconfigured equipment, blocked vent tiles, faulty sensors, and CRAC units. Several of these anomalies normally go undetected because no temperature thresholds are violated.
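To illustrate why a PCA-based check can catch anomalies that violate no temperature threshold, the following is a minimal sketch, not the method of the paper, whose exact formulation the abstract does not give. It assumes one common variant: fit PCA on normal readings, then flag samples whose squared prediction error (the energy left outside the retained principal subspace) exceeds an empirical limit. The function names `fit_pca` and `spe`, the synthetic data, the single retained component, and the 99th-percentile limit are all illustrative assumptions.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA on training readings X (rows = time samples, columns = sensors)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant sensors
    Z = (X - mean) / std
    # Rows of Vt are the principal directions in sensor space.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components]

def spe(X, mean, std, components):
    """Squared prediction error: energy left outside the retained subspace."""
    Z = (X - mean) / std
    residual = Z - (Z @ components.T) @ components
    return np.sum(residual ** 2, axis=1)

# Synthetic data (hypothetical): 20 sensors following one shared load pattern.
rng = np.random.default_rng(0)
t = np.arange(500)
load = np.sin(2 * np.pi * t / 100)
gains = rng.uniform(0.5, 1.5, size=20)
X = 22.0 + np.outer(load, gains) + 0.05 * rng.standard_normal((500, 20))

train, test = X[:400], X[400:].copy()
# Sensor 7 drifts by 1 degree C from test sample 50 onward. Its readings
# stay below 25 C, so an assumed fixed alarm threshold of 27 C never fires.
test[50:, 7] += 1.0

mean, std, comps = fit_pca(train, n_components=1)
limit = np.percentile(spe(train, mean, std, comps), 99)  # assumed empirical limit
flags = spe(test, mean, std, comps) > limit  # True where a sample looks anomalous
```

In this sketch the drifting sensor is flagged because it breaks the correlation it normally shares with the other sensors, even though its absolute temperature stays unremarkable. In practice, the number of retained components and the limit would have to be chosen from the data.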
