Fan efficiency is known to increase with size. In part I of this study, savings in server fan power on the order of 50% were reported by replacing server-enclosed 60 mm fans with a rear-mounted wall of larger fans (80 mm or 120 mm in size). A methodology for row-wise control of such rack-level fans, intended to simulate an actual product, is presented, and savings comparable to part I are reported. Performance under real-life scenarios such as nonuniform computational loads and fan failure is investigated. Each rack-level setup has distinct advantages. Selecting between configurations would necessitate a compromise between efficiency, redundancy, and cost.
The growing dependence of global commerce, social interaction, news media, and other industries on information technology (IT) systems over the last decade has contributed to the rise of large data centers. These facilities are responsible for a significant portion of national and global energy consumption. National energy usage by data centers more than doubled between 2000 and 2005, and consumption was projected to continue rising over the course of the following five years. By 2010, data centers were reported to account for around 2% (between 1.7% and 2.2%) of total national electricity consumption. In the United States, this figure continues to grow, with usage increasing by 8.7% between 2011 and 2012 and projected growth of around 9.8% over the following year. Worldwide, data centers are reported to account for around 1.8% of electricity usage, corresponding to a power consumption of 322 TWh. With this industry continuing to grow and strain the national electricity grid, there is a need to target energy savings within the data center.
Power usage effectiveness is a common metric used to gauge the energy efficiency of a facility's operation and is the ratio of total facility power to IT equipment power. A recent survey reported that the average power usage effectiveness is around 2.9, with only 20% of surveyed facilities recording a value of less than 2.0. Thus, the majority of data centers in North America operate extremely inefficiently, with IT equipment accounting for less than half the total power consumption (support infrastructure is responsible for the majority of the remainder). It is, therefore, imperative to reduce data center power consumption and operating cost by increasing the efficiencies of power distribution and cooling systems. With a sizeable portion (around 30%) of typical data center power consumption attributed to cooling, which is categorized as a parasitic load, it has become vital that energy savings and efficiencies be pursued in these components at various levels within the data center facility.
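As a quick arithmetic check of the claim above, the relationship between power usage effectiveness and the IT share of facility power can be sketched as follows (a minimal illustration; the numeric inputs simply restate the survey figures, not measurements from this study):

```python
def pue(total_facility_power_kw: float, it_power_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    return total_facility_power_kw / it_power_kw

# At the surveyed average PUE of 2.9, the IT share of facility power is
# 1 / 2.9, i.e. roughly a third -- less than half, as stated above.
it_share = 1 / pue(2.9, 1.0)
print(f"IT share of facility power at PUE 2.9: {it_share:.1%}")
```

A PUE of 2.0 is thus the break-even point at which IT equipment consumes exactly half of the total facility power.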
In this study, focus is placed at the server and rack levels. Traditional servers are configured to include all subsystems such as compute, memory, storage, networking, and cooling within a single chassis. Common rack-mount units generally have a low profile of 1 U (1 U = 1.75 in), which accommodates small 40 mm fans. Manufacturers and standards associations have published data that encourage designers to opt for larger fans to increase peak total efficiency. However, since fans are generally selected based on server profile, there exist opportunities to instead consolidate larger fans at the rear of a rack to increase savings.
Over the past few years, original equipment manufacturers, semiconductor device manufacturers, and hyper-scale data center owners have been promoting the concept of “rack disaggregation.” This refers to the separation of resources or subsystems that are traditionally included in a server into individual modules at the rack. Disaggregation covers compute, storage, networking, power distribution, and cooling, and endeavors to make the rack the fundamental building block of a data center. The increased distance between IT components is countered by the introduction of silicon photonics [11,12]. The primary advantage of such a deployment is the ability to change or refresh subsystems at different frequencies. In addition, disaggregation promotes dematerialization, which can have a significant environmental impact through reduction in printed circuit board sizes and in the sheet metal otherwise used for server chassis. In particular, disaggregation of cooling at the rack is synonymous with the focus of this study.
Preliminary work predicted savings of up to 55% in cooling power by replacing smaller, chassis-enclosed (60 mm) fans with larger rack-mount units for a stack of four servers. The present studies (parts I and II) advance this work by experimentally validating the maximum possible savings through deployment of 80 mm and 120 mm fans. This paper presents a methodology for implementation of a control system that replicates the in-built scheme for modulation of chassis fan speeds. Thus, with minor modifications, row-wise control of rack-level fans is executed with input from each server in the stack. Performance of the larger fans under different rack loads and failure conditions is reported, and savings in power over the baseline configuration are quantified.
Experimental Setup and Procedures
Server Under Study.
Figure 1 shows an Intel-based Open Compute server [15,16] used in this study, similar to that employed in Ref. This 1.5 U rack-mount unit has two central processing units (CPUs), each with a rated thermal design power of 95 W. Four 60 mm direct current fans are installed within the chassis to provide cooling to the motherboard and its critical components (CPU and memory). The two processors represent the principal heat load within the system, and their temperatures drive a native algorithm that controls the speed of the fans using a pulse width modulation (PWM) signal. It is to be noted that, as seen in Fig. 1, a sheet metal partition isolates the flow through the motherboard portion of the server from the flow through the power supply unit (PSU) and hard drive seen at the right. In this study, focus is placed on the application of larger rack-mount fans for cooling the motherboard section only. Henceforth, for the sake of simplicity, figures depicting the 80 mm and 120 mm fan configurations will be illustrated without the PSU channel.
As previously discussed in Ref., a stack of four servers is considered when evaluating the rack-level solutions. The rear of the stack provides an area of 330 mm × 333 mm within which the larger fans must be accommodated. Figure 2 shows the fan wall installed at a distance of 25.4 mm from the rear of the stack for both the 80 mm (nine units in a 3 × 3 array) and 120 mm (four units in a 2 × 2 array) cases. Table 1 lists the specifications of the fans used. For a detailed description of this setup, please refer to part I of this study. To enable cross referencing between the two parts, the naming scheme is maintained, with servers termed A–D from the bottom of the stack to the top and each row of fans similarly numbered 1–3 (1 and 2 for the 120 mm configuration). Since the focus of this study is to monitor cooling power consumption, the fans are powered externally as shown in Fig. 3. However, the internal PWM signals from each server are still used to control the fans and are delivered through a control circuit. The PWM signal component of the test setup will be explained in detail in the Controlling the Fans section. The tachometer output from each fan is logged by a data acquisition unit as well as returned to each server to prevent triggering of a failure scenario (running all remaining fans at full speed to prevent shutdown). It is imperative that the ground signal from each server and fan be shared among all monitoring and controlling equipment. Since the fans are not powered by the server, a power meter measures the rack (or stack) IT power consumption from a 277 VAC source. Together, the fan cooling power and IT power represent the total power consumption of the system. A workstation communicates with all components in the setup and provides a common timestamp for effective data reduction. An ambient conditions logger records the air temperature at the inlet to the server stack.
Over the duration of testing, the inlet temperature is found to have a maximum variation of ±1 °C with a mean of 25 °C.
Stressing the Servers.
These rack-mount units are configured with CPU and memory resources to function as web servers. Applications typically deployed on such systems utilize relatively more CPU than memory. Table 2 outlines the different simulated computational loads set up to run on these servers to mirror operation in a data center. The synthetic load generator lookbusy is employed to create the loads as outlined. A bash script, executed on each server at the start of a test, automates a procedure wherein each load is applied (in order) for a duration of 30 min with 30 min of idling between loads (with the exception of the initial Idle load, which runs for 60 min). This sequence of stressing the servers from idle to maximum load is repeated two more times before a test is concluded to ensure that the results are repeatable. The native Linux tools mpstat and free are employed to measure CPU utilization and memory usage over the course of each test and to verify that the loads are as specified. An internal diagnostic tool provided by the motherboard manufacturer provides readings from each CPU digital temperature sensor. Along with the rack (IT) and fan power consumptions, these represent the data of primary interest in this study. Steady-state operation is achieved within the first 20 min of each load, and individual measurements made within the last 10 min are averaged before reporting. Results presented in this study represent averages over all three runs. Note that reported CPU temperatures, unless mentioned specifically, are the mean of the individual readings from the four servers.
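The automated test sequence described above can be sketched as follows. This is an illustrative Python equivalent of the bash script; the specific CPU utilization targets are hypothetical placeholders for the loads in Table 2, and only the schedule structure (initial 60 min Idle, 30 min per load, 30 min idling between loads, three repetitions) follows the text:

```python
# Hypothetical CPU utilization targets (%) standing in for the Table 2 loads.
LOADS = [30, 50, 70, 90, 98]

def build_schedule(runs: int = 3) -> list[tuple[str, int]]:
    """Build the stress schedule as (command, duration-in-minutes) pairs.

    The very first Idle period runs for 60 min; every other period, whether
    a load or the idling gap between loads, runs for 30 min. The full
    sequence is repeated `runs` times for repeatability.
    """
    schedule: list[tuple[str, int]] = []
    for run in range(runs):
        schedule.append(("idle", 60 if run == 0 else 30))
        for pct in LOADS:
            # lookbusy's -c/--cpu-util flag targets a CPU utilization level.
            schedule.append((f"lookbusy -c {pct}", 30))
            schedule.append(("idle", 30))
    return schedule

steps = build_schedule()
```

In the actual harness, each command would be launched on every server for its listed duration; here the schedule is only constructed, not executed.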
Controlling the Fans
As previously stated, one of the primary objectives is to outline a methodology for row-wise control of larger rack-mount fans. By controlling each row independently based on adjacent server loads, cooling power consumption can be minimized compared to a scheme where all fan speeds are modulated equally based on the maximum load across the stack. To enable such an arrangement, the four PWM signals (one from each server) from the stack need to be converted to row-wise inputs: three signals for the 80 mm configuration or two for the 120 mm counterpart. The contribution of each server's signal to a given row's input needs to be determined such that fan power consumption is minimized. A simple “zone of influence” test is carried out to determine these parameters.
Zone of Influence.
where Ii is the influence on a given server by row j of fans operating at a higher speed. Influence values for all tests are reported in Fig. 4(b). In an ideal state, values for rows 1 and 3 should mirror each other and distribution for row 2 should be uniform (25% across the stack). However, each distribution is skewed toward server D. This is to be expected, as clearly shown by the CPU temperature variation when all fans are operated at 10% duty cycle in Fig. 4(a). These results provide an indication of the tolerances across the stack, specifically, in terms of difference in CPU powers. To counter this effect and to provide a scheme that would function across a multitude of servers in a data center, the weight of each server’s PWM output with respect to a given fan row is based on the influences reported in Fig. 4(b) and generalized to provide a mirrored form as listed in Table 3. These coefficients are inputs to the control circuit to process each PWM input from the stack to a row-wise duty cycle output. The Control Circuit and Operation section will describe the control circuit and system in detail.
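The influence distribution can be illustrated with a short sketch. The definition below, each server's measured response to speeding up one fan row, normalized by the stack total and expressed as a percentage, is an assumption consistent with the 25%-uniform ideal described above, and the temperature responses are hypothetical values, not data from this study:

```python
def influence(response: dict[str, float]) -> dict[str, float]:
    """Percent influence of one fan row on each server, taken here as the
    server's response (e.g., CPU temperature change) normalized by the
    total response across the stack."""
    total = sum(response.values())
    return {server: 100.0 * r / total for server, r in response.items()}

# Hypothetical responses of servers A-D to speeding up the middle row:
row2 = influence({"A": 5.0, "B": 5.0, "C": 5.0, "D": 5.0})
# A perfectly uniform response yields 25% influence on every server.

# A response skewed toward server D, as observed in Fig. 4(b):
row3 = influence({"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0})
```

By construction the influences for any row sum to 100% across the four servers, which is what allows the values in Fig. 4(b) to be compared directly between rows.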
Control Circuit and Operation.
DCj = Σi Ci,j · DCi

where DCj is the duty cycle signal to fan row j, DCi is the PWM duty cycle output of server i, and Ci,j is the weight of server i corresponding to row j. Thus, the control system, as described in the Controlling the Fans section, is implemented for both the 80 mm and 120 mm cases.
Need for a Lower Bound.
No further modification to this system is required as the fans are found to overcool the servers while idling at 10% PWM irrespective of the load as seen in Fig. 7. Thus, settings for each control system (80 mm and 120 mm cases) are finalized and deployed for further testing and evaluation.
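The complete row-wise control logic, a weighted combination of the four server PWM duty cycles clamped to the 10% lower bound, can be sketched as follows. The weight values below are hypothetical mirrored placeholders, not the actual Table 3 coefficients:

```python
# Hypothetical, mirrored weight matrix: each fan row's duty cycle is a
# weighted blend of the four server PWM outputs (servers A-D, bottom to top).
WEIGHTS = {
    1: {"A": 0.45, "B": 0.30, "C": 0.15, "D": 0.10},  # bottom row
    2: {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25},  # middle row, uniform
    3: {"A": 0.10, "B": 0.15, "C": 0.30, "D": 0.45},  # top row, mirrored
}
LOWER_BOUND = 0.10  # 10% duty cycle floor, per the lower-bound discussion

def row_duty_cycles(server_pwm: dict[str, float]) -> dict[int, float]:
    """Convert four server PWM duty cycles (0-1) into per-row fan commands,
    never commanding a row below the 10% idling floor."""
    return {
        row: max(LOWER_BOUND,
                 sum(c * server_pwm[srv] for srv, c in coeffs.items()))
        for row, coeffs in WEIGHTS.items()
    }

# All servers idling at 5% PWM: every row is clamped to the 10% floor.
idle = row_duty_cycles({"A": 0.05, "B": 0.05, "C": 0.05, "D": 0.05})

# Only server D (top) at full load: the top row responds most strongly.
hot_d = row_duty_cycles({"A": 0.05, "B": 0.05, "C": 0.05, "D": 1.00})
```

The mirrored weighting means a hot server near the top of the stack drives the top fan row hardest while barely disturbing the bottom row, which is the mechanism behind the row-wise savings reported later.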
Results and Discussion
The objective of this study is to confirm that substantial savings are available when 60 mm chassis fans are replaced by rack-level configurations with effective control schemes that simulate a product or solution deployed within a data center facility. In addition to the identical utilizations seen in Fig. 7, realistic conditions such as nonuniform workloads and fan failure scenarios must be investigated to reinforce the merits of employing larger fans.
Comparison Under Uniform Loading.
It has been shown that, despite overcooling the servers with the rack-level setups, savings in cooling power under uniform utilizations are available irrespective of the load. Table 4 summarizes the extent of the reduction in fan power available through deployment of 80 mm and 120 mm fans with their designed control method. In either case, it is apparent that savings are maximized when the stack is operated at high workloads. The reason for reduced efficiency at utilizations at or below 30% is that the 60 mm fans operate at idling speeds under such conditions. In comparison to the baseline, the 80 mm case reports savings of around 45–51%. Similarly, the 120 mm configuration provides a reduction in cooling power of the order of 33–50%. This setup provides reduced savings at lower loads owing to the fact that these fans operate at idling speeds (10% lower bound) across the entire test spectrum and significantly overcool the stack for utilizations below 70%.
Comparisons must be made with the results from part I to ensure that the deployed control systems do not deviate significantly from the reported maximum available savings (by comparing average CPU temperatures). For the 80 mm case, a maximum deviation of 5% is observed when compared to projected savings of 50–53%. While this reduction can be considered acceptable, it exists because of overcooling at loads between idle and 30% as seen in Fig. 7(b). However, for the 120 mm case, a reduction of 15% is observed in comparison to expected savings of 48–54%. This can be attributed to the fact that in part I, comparisons between the baseline and 120 mm configurations are made with respect to CPU temperatures reported while testing the latter case. It is, therefore, unfair to draw comparisons between the results from part I and this study for the 120 mm setup.
When comparing the two rack-level configurations from the perspective of cooling power consumption alone, it would be understandable to claim that the 80 mm case is more efficient. However, as seen in Fig. 7(c), a reduction in IT power consumption is also available due to overcooling and lower leakage power. Therefore, comparisons can also be made in terms of total power consumption (cooling + IT), as outlined in Table 5. In this case, it is apparent that the benefits of reduced IT power more than compensate for the lower savings in fan power. However, it must be noted that increased total savings through overcooling are accompanied by an increase in air flow rates through the stack, which raises cooling power consumption at the facility level. Therefore, attention is focused on rack-level fan performance only.
Comparison Under Nonuniform Load.
Since it would be time- and resource-intensive to test each configuration under all possible workloads observed in a data center facility, a simple test was conducted to show that rack-level fans are more efficient than the baseline under nonuniform loads. This involved comparing the 60 mm setup with all servers idling (lowest fan power consumption) against the larger fan cases with only one server in the stack operating at maximum utilization. The latter represents an extreme case of nonuniform loading, and server D was chosen as it consistently reported CPU temperatures higher than the remaining units (see Fig. 4(a)). Thus, by reporting savings under such conditions, it is implicit that the rack-level fans under study are superior under nonuniform loads.
Table 6 outlines the results of these tests. It is observed that both control schemes provide substantial savings (around 35%) in fan power when compared to the baseline. In addition, these results support the need for row-wise control as, for the 80 mm case, a 5% reduction in fan power is observed when compared to a similar test conducted in part I, wherein all fans are controlled by server D. Since the fan control does not engage for the 120 mm configuration, similar savings are reported in both part I and this study.
Fan Failure Study.
Along with other server components such as hard drives and memory, fans commonly fail in data center facilities. To account for this, thermal engineers are required to design server cooling systems that ensure uptime even when a fan fails. Therefore, it is imperative that all configurations under study be tested while simulating failure to ensure no detriment to system performance such as throttling, increased power consumption, etc. Figure 8 illustrates the locations of single fan failures for all configurations. The diagonal pattern is chosen to reduce testing time, as power consumption under failure is found to be independent of location. For the rack-level setups, preliminary trials were found to be in agreement as well. For each configuration, results from all tests are averaged and reported in Table 7. It is observed that failure of a single fan has a marginal effect on power consumption for the baseline and 80 mm cases. However, for the 120 mm setup, a 33% increase in fan power is observed under simulated failure. This can be attributed to the lower available redundancy caused by the restriction in available area at the rear of the stack, which limits the number of units to four. It is important to note that, while the difference in performance between the 80 mm and 120 mm cases is apparent, the frequency of failures and the time between failure and replacement are equally important when making a decision between configurations.
where β is the Weibull shape parameter, chosen to be three in this case [23,25]. L1 values for all three fans as a function of operating temperature are plotted in Fig. 9. The 80 mm fans have 22.5% greater life expectancy than their 60 mm and 120 mm counterparts. However, this does not translate to fewer failures than the 120 mm units. Considering there are more than twice as many units in the 80 mm fan wall, more failures would be expected due to the sheer difference in quantity. In addition, since the 120 mm fans operate at a fixed duty cycle (at ambient temperatures of 25 °C or below), and changes in speed are detrimental to fan life, lower failure rates can be expected.
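The L1 computation referenced above can be sketched numerically. The two-parameter Weibull form below is a standard assumption consistent with the L1/L10 nomenclature, with β = 3 as stated; the manufacturer's L10 value used in the example is hypothetical:

```python
import math

def l1_from_l10(l10_hours: float, beta: float = 3.0) -> float:
    """Scale the 10%-failure life L10 down to the 1%-failure life L1,
    assuming failure times follow a two-parameter Weibull distribution
    F(t) = 1 - exp(-(t/eta)**beta), so that
    L1/L10 = (ln(1/0.99) / ln(1/0.90))**(1/beta)."""
    ratio = (math.log(1 / 0.99) / math.log(1 / 0.90)) ** (1 / beta)
    return l10_hours * ratio

# Hypothetical L10 of 70,000 h at the use temperature; with beta = 3,
# L1 works out to roughly 46% of L10.
l1 = l1_from_l10(70_000.0)
```

Note that the L10 value itself must first be corrected from the test temperature Ttest to the use temperature Tuse before this step, which is why L1 in Fig. 9 varies with operating temperature.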
Previous discussions in the results section have made valid arguments for choosing either rack-level configuration over the baseline case. However, making a selection between the larger fans is not straightforward. A decision between configurations would require consideration of advantages outlined herein as well as factors that exist beyond the scope of this study. Accommodation of all parameters would necessitate a study of total cost of ownership. Setup and results from this study would factor in both capital and operational expenditure, based on which a decision could be made. Regardless, based on reported observations, maximizing savings through deployment of rack-level fans can be achieved through a compromise between efficiency, redundancy, and cost. These terms are ultimately dependent on the size of fans selected.
Configurations of larger fans, set up to mirror a product that could be deployed in a data center facility, have been shown to provide greater efficiency in cooling web servers traditionally configured to use smaller (60 mm) chassis fans. In conjunction with earlier work and part I of this study, a detailed methodology that outlines selection of fans, prediction and validation of savings, and setup for server-dependent operation has been presented. A control scheme was implemented for both the 80 mm and 120 mm cases that delivered savings in cooling power consumption of the order of 45–51% and 33–50%, respectively, under uniform rack load. Testing both rack-level setups under highly nonuniform load yields savings of around 35% compared to the baseline (60 mm) fan power at idling utilization. Simulation of fan failure showed marginal performance penalties for both the 60 mm and 80 mm cases. However, a 33% increase in cooling power was observed for the 120 mm configuration and attributed to the lack of available redundancy (fan count). A summary of advantages in support of either rack-level setup is included below.
Advantages of 80 mm Configuration
Consistent savings in fan power across a spectrum of loads
No penalty or increase in cooling power for single fan failure
22.5% greater life expectancy per fan
Less overcooling corresponds to lower air flow requirement, which influences facility-level cooling power consumption.
Advantages of 120 mm Configuration
Greater overcooling gives rise to decrease in IT power consumption due to reduced leakage power. These savings more than compensate for relatively higher fan power at lower loads.
At ambient temperatures of 25 °C and below, fans will idle (10% duty cycle). Lack of change in speed will increase life expectancy.
With less than half the number of fans (4) in the wall, lower number of total failures is expected.
Analysis of total cost of ownership would be required to make an informed decision between the two rack-level solutions. A final selection would represent a compromise between efficiency, redundancy, and cost.
The authors thank Jacob Na for providing information on life expectancy for the selected fans. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Directorate for Engineering (Grant No. 1134821).
- C = server coefficient or weight
- i = server name, A to D
- I = influence of a fan row on a server, %
- j = fan row number, 1 to 3
- L1 = time at which 1% of fans will fail, hours
- L10 = time at which 10% of fans will fail, hours
- T = CPU temperature, °C
- Ttest = temperature at which fan life test is conducted, °C
- Tuse = air temperature at inlet side of fan, °C
- Vout = DC analog output voltage from low-pass filter, V