Abstract
We consider the problem of optimal control of district cooling energy plants (DCEPs) consisting of multiple chillers, a cooling tower, and a thermal energy storage (TES), in the presence of time-varying electricity prices. A straightforward application of model predictive control (MPC) requires solving a challenging mixed-integer nonlinear program (MINLP) because of the on/off of chillers and the complexity of the DCEP model. Reinforcement learning (RL) is an attractive alternative since its real-time control computation is much simpler. But designing an RL controller is challenging due to myriad design choices and computationally intensive training. In this paper, we propose an RL controller and an MPC controller for minimizing the electricity cost of a DCEP, and compare them via simulations. The two controllers are designed to be comparable in terms of objective and information requirements. The RL controller uses a novel Q-learning algorithm that is based on least-squares policy iteration. We describe the design choices for the RL controller, including the choice of state space and basis functions, that are found to be effective. The proposed MPC controller does not need a mixed-integer solver for implementation, but only a nonlinear program (NLP) solver. A rule-based baseline controller is also proposed to aid in comparison. Simulation results show that the proposed RL and MPC controllers achieve similar savings over the baseline controller, about 17%.
1 Introduction
In the US, 75% of the electricity is consumed by buildings, and a large part of that is due to heating, ventilation, and air conditioning (HVAC) systems [1]. In university campuses and large hotels, a large portion of the HVAC’s share of electricity is consumed by district cooling energy plants (DCEPs), especially in hot and humid climates. A DCEP produces and supplies chilled water to a group of buildings it serves (hence the moniker “district”), and the air handling units in those buildings use the chilled water to cool and dehumidify air before supplying it to building interiors. Figure 1 shows a schematic of such a plant, which consists of multiple chillers that produce chilled water, a cooling tower that rejects the heat extracted from chillers to the environment, and a thermal energy storage system (TES) for storing chilled water. Chillers—the most electricity-intensive equipment in the DCEP—can produce more chilled water than the buildings need when the electricity price is low. The extra chilled water is stored in the TES, and then used during periods of high electricity price to reduce the total electricity cost. District cooling energy plants are also called central plants or chiller plants.
DCEPs are traditionally operated with rule-based control algorithms that use heuristics to reduce electricity cost while meeting the load, such as “chiller priority,” “storage priority,” and additional control sequencing for the cooling tower operation [2–8]. But making the best use of the chillers and the TES to keep the electricity cost at the minimum requires non-trivial decision-making due to the discrete nature of some control commands, such as chiller on/off actuation, and highly nonlinear dynamics of the equipment in DCEPs. A growing body of work has proposed algorithms for optimal real-time control of DCEPs. Both model predictive control (MPC) [9–17] and reinforcement learning (RL) [18–26] have been studied.
For MPC, a direct implementation requires solving a high-dimensional mixed-integer nonlinear program (MINLP) that is quite challenging to solve. Various alternative approaches are thus used, which can be categorized into two groups: NLP approximations [9–12] and MILP approximations [13–17]. NLP approximations generally leave the discrete commands to some predetermined control logic and only deal with continuous control commands, which may limit their potential savings. MILP approximations mostly adopt a linear DCEP model so that the problem is tractable, though solving large MILPs is also challenging.
An alternative to MPC is RL—an umbrella term for a set of tools used to approximate an optimal policy using data collected from a physical system, or more frequently, its simulation. Despite a burdensome design and learning phase, real-time control is simpler since control computation is an evaluation of a state-feedback policy. However, designing an RL controller for a DCEP is quite challenging. The performance of an RL controller depends on many design choices and training an RL controller is computationally onerous.
In this paper, we propose an RL controller and an MPC controller for a DCEP, and compare their performances with that of a rule-based baseline (BL) controller through simulations. All three controllers are designed to minimize total energy cost while meeting the required cooling load. The main source of flexibility is the TES, which allows a well-designed controller to charge the TES in periods of low electricity price. The proposed RL controller is based on a new learning algorithm that is inspired by the “convex Q-learning” proposed in recent work [27] and the classical least-squares policy iteration (LSPI) algorithm [28]. Basis functions are carefully designed to reduce the computational burden in training the RL controller. The proposed MPC controller solves a sequence of nonlinear programs (NLPs) obtained from the original MINLP through relaxation and heuristics. Hence the MPC controller is a “stand-in” for a true optimal controller and provides a sub-optimal solution to the original MINLP. The baseline controller that is used for comparison is designed to utilize the TES and time-varying electricity prices (to the extent possible with heuristics) to reduce energy costs. The RL controller and the baseline controller have the same information about the electricity price: the current price and a backward moving average.
The objective behind this work is to compare the performance of the two complementary approaches, MPC and RL, for the optimal control of all the principal actuators in a DCEP. The two controllers are designed to be comparable in terms of objective and information requirements. We are not aware of many works that have performed such a comparison; the only exceptions are Refs. [25,26], where the decision-making is limited to a TES or temperature setpoints. Since both RL and MPC approaches have merits and weaknesses, designing a controller with one approach and showing that it performs well leaves open the question: would the other have performed better? This paper takes a first step in addressing such questions. To aid in this comparison, both controllers are designed to be approximations of the same intractable infinite horizon optimal control problem. Due to the large differences in the respective approaches (MPC and RL), it is not possible to ensure exact parallels for an “apples-to-apples” comparison. But the design problems for the RL and MPC controllers have been formulated to be as similar as possible.
Simulation results show that both the controllers, RL and MPC, lead to significant and similar cost savings (16–18%) over a rule-based baseline controller. These values are comparable to those of MPC controllers with mixed-integer formulations reported in the literature, which vary from 10% to 17% [13–17]. The cooling load tracking performance is similar between them. The real-time computation burden of the RL controller is trivial compared to that of the MPC controller, but the RL controller leads to more frequent chiller switches (from off to on and vice versa). However, the MPC controller enjoys the advantage of error-free forecasts in the simulations, an advantage the RL controller does not have.
The remainder of the paper is organized as follows. The contribution of the paper over the related literature is discussed in detail in Sec. 1.1. Section 2 describes the district cooling energy plant and its simulation model as well as the control problem. Section 3 describes the proposed RL controller, Sec. 4 presents the proposed MPC controller, and Sec. 5 describes the baseline controller. Section 6 provides a simulation evaluation of the controllers. Section 7 provides an “under-the-hood” view of the design choices for the RL controller. Section 8 concludes the paper.
1.1 Literature Review and Contributions
1.1.1 Prior Work on Reinforcement Learning for DCEP.
There is a large and growing body of work in this area, e.g., Refs. [18–26]. Most of these papers limit the problem to controlling part of a DCEP. For instance, the DCEPs considered in Refs. [18–21,23] do not have a TES. References [18–22] optimize only the chilled water loop but not the cooling water loop (at the cooling tower), while Ref. [24] only optimizes the cooling water loop. The reported energy savings are in the 10–20% range over rule-based baseline controllers, e.g., 15.7% in Ref. [23], 11.5% in Ref. [18], and around 17% in Ref. [21].
Reference [25] considers a complete DCEP, but the control command computed by the RL agent is limited to TES charging and discharging. It is not clear what control law is used to decide chiller commands and cooling water loop setpoints. Reference [26] also considers a complete DCEP, with two chillers, a TES, and a large building with an air handling unit. The RL controller is tasked with commanding only the zone temperature setpoint and TES charging/discharging flowrate whilst the control of the chillers or the cooling tower is not considered. Besides, trajectories of external inputs, e.g., outside air temperature and electricity price, are the same for all training days in Ref. [26]. Another similarity of Refs. [25,26] with this paper is that these references compare the performance of RL with that of a model-based predictive control.
1.1.2 Prior Work on Model Predictive Control for DCEP.
The works that are closest to us in terms of problem setting are Refs. [13–15], which all reported MILP relaxation-based MPC schemes to optimally operate a DCEP with a TES in the presence of time-varying electricity prices. Reference [13] reports an energy cost savings with MPC of about 10% over a baseline strategy that uses a common heuristic (charge TES all night) with some decisions made by optimization. In Ref. [14], around 15% savings over the currently installed rule-based controller is achieved in a real DCEP. Reference [15] reported a cost savings of 17% over “without load shifting” with the help of the TES in a week-long simulation. Reference [16] also proposes an MILP relaxation-based MPC scheme for controlling a DCEP, and approximately 10% savings in electricity cost over a baseline controller is reported in a one-day long simulation. But the DCEP model in Ref. [16] ignores the effect of weather conditions on plant efficiency, and the baseline controller is not purely rule-based; it makes TES and chiller decisions based on a greedy search. Reference [17] deserves special mention since it reports an experimental demonstration of MPC applied to a large DCEP; the control objective being the manipulation of demand to help with renewable integration and decarbonization. It too uses an MILP relaxation. The decision variables include plant mode (combination of chillers on) and TES operation, but cooling water loop decisions are left to legacy rule-based controllers.
1.2 Contribution Over Prior Art
1.2.1 Contribution Over “Reinforcement Learning for DCEP” Literature.
Unlike most prior works on RL for DCEPs that only deal with a part of a DCEP [18–24], the control commands in this work consist of all the available commands (five in total) of both the chilled and cooling water loops in a full DCEP. To the best of our knowledge, no prior work has used RL to command both the water loops and a TES. Second, unlike some closely related work such as Ref. [26], we treat external inputs such as weather and electricity price as RL states, making the proposed RL controller applicable to any time-varying disturbances that can be measured in real-time. Otherwise, the controller is likely to work well only for disturbances seen during training. Third, the proposed RL controller commands the on/off status of chillers directly rather than the chilled/cooling water temperature setpoints [19,21,23] or zone temperature setpoints [26], which eliminates the need for another control system to translate those setpoints into chiller commands. Fourth, all the works cited above rely on discretizing the state and/or action spaces in order to use classical tabular learning algorithms, with the exception of Ref. [22]. The size of the table becomes prohibitively large if the number of states and control commands is large and a fine-resolution discretization is used. Training such a controller and using it in real-time, which requires searching over this table, then becomes computationally challenging. That is perhaps why only a small number of inputs are chosen as control commands in prior work even though several more setpoints can be manipulated in a real DCEP. Although Ref. [22] considers continuous states, its proposed method only controls part of a DCEP with simplified linear plant models, which may significantly limit its potential for cost savings in practice. In contrast, the RL controller proposed in this paper is for a DCEP model consisting of highly nonlinear equations, and the states and actions are kept continuous except for the one command that is naturally discrete (the number of chillers that are on).
While there is an extensive literature on learning algorithms and on designing RL controllers, the design of an RL controller for practically relevant applications with non-trivial dynamics is quite challenging. RL’s performance depends on myriad design choices, not only on the stage cost/reward, function approximation architecture and basis functions, learning algorithm and method of exploration, but also on the choice of the state space itself. A second challenge is that training an RL controller is computationally intensive and brute force training is beyond the computational means of most researchers. For instance, the hardware cost for a single AlphaGo Zero system in 2017 by DeepMind has been quoted to be around $25 million [29]. Careful selection of the design choices mentioned above is thus required, which leads to the third challenge: if a particular set of design choices leads to a policy that does not perform well, there is no principled method to look for improvement. Although RL is being extensively studied in the control community, most works demonstrate their algorithms on plants with simple dynamics with a small number of states and inputs, e.g., Refs. [30,31]. The model for a DCEP used in this paper, arguably still simple compared to available simulation models (e.g., Ref. [32]), is quite complex: it has eight states, five control inputs, three disturbance inputs, and requires solving an optimization problem to compute the next state given the current state, control, and disturbance.
1.2.2 Contribution Over “Model Predictive Control for DCEP” Literature.
The MPC controller proposed here uses a combination of relaxation and heuristics to avoid the MINLP formulation. In contrast to Refs. [13–17], the MPC controller does not use an MILP relaxation. The controller does compute discrete decisions (number of chillers to be on, TES charge/discharge) directly, but it does so by using NLP solvers in conjunction with heuristics. The cost saving obtained is similar to those reported in earlier works that use an MILP relaxation. Compared with other NLP formulations [9–12], our MPC controller determines the on/off actuation of equipment and the TES charging/discharging operation directly.
Closed-loop simulations are provided for all three controllers, RL, MPC, and baseline, to assess the trade-offs among these controllers, especially between the model-based MPC controller and the “model-free” RL controller.
1.2.3 Contribution Over a Preliminary Version.
The RL controller described here was presented in a preliminary version of this paper [33]. There are three improvements. First, an MPC controller, which is not presented in Ref. [33], was designed, evaluated, and compared with our RL controller. Therefore, the optimality of our control with RL is better assessed. Another difference is that the baseline controller described here is improved over that in Ref. [33] so that the frequency of on/off switching of chillers is reduced. Lastly, a much more thorough discussion of the RL controller design choices and their observed impacts are included here than in Ref. [33]. Given the main challenge with designing RL controllers for complex physical systems discussed above, namely, “what knob to tweak when it doesn’t work?,” we believe this information will be valuable to other researchers.
2 System Description and Control Problem
The DCEP contains a TES, multiple chillers and chilled water pumps, a cooling tower and cooling water pumps, and finally a collection of buildings that use the chilled water to provide air conditioning; see Fig. 2. The heat load from the buildings is absorbed by the cold chilled water supplied by the DCEP, and thus the return chilled water temperature is warmer. This part of the water is called load water, and the related variables are denoted by the subscript lw for “load water.” The chilled water loop (subscript chw) removes this heat and transmits it to the cooling water loop (subscript cw). The cooling water loop absorbs this heat and sends it to the cooling tower, where the heat is then rejected to the ambient. The cooling tower cools down the cooling water returned from the chiller by passing both the sprayed cooling water and ambient air through a fill. During this process, a small amount of the water spray evaporates into the air, removing heat from the cooling water. The cooling water lost to evaporation is replenished by fresh water; thus we assume the supply water flowrate equals the return water flowrate at the cooling tower. A fan or a set of fans is used to maintain the ambient airflow at the cooling tower. Connected to the chilled water loop is a TES tank that stores water (subscript tw). The total volume of the water in the TES tank is constant, but a thermocline separates two volumes: cold water that is supplied by the chiller (subscript twc for “tank water, cold”) and warm water returned from the load (subscript tww for “tank water, warm”).
2.1 DCEP Dynamics.
Time is discretized with a sampling period ts, with a counter k = 0, 1, … denoting the time-step. Considering hardware limits and ease of implementation, the control commands are chosen as follows:
ṁchw, the chilled water flowrate going through the cooling coil, to ensure the required cooling load is met.
ṁtw, the charging/discharging flowrate of the TES, to take advantage of load shifting.
nch, the number of active chillers, to ensure the amount of chilled water required is met and the coldness of the chilled water is maintained.
ṁcw, the flowrate of cooling water going through the condenser of the chillers, to absorb the heat from the chilled water loop.
ṁa, the flowrate of ambient air that cools down the cooling water, to maintain its temperature within the desired range.
Since the TES can be charged and discharged, we declare ṁtw > 0 for charging and ṁtw < 0 for discharging as a convention; the five commands are collected in the illustrative container below.
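The field names in the following sketch are our shorthand for the five commands above and are not the paper’s nomenclature; the bounds listed in Table 1 are not reproduced here.

from dataclasses import dataclass

@dataclass
class DCEPCommand:
    # Control command applied at time-step k (illustrative names only).
    m_chw: float   # chilled water flowrate through the cooling coil [kg/s]
    m_tw: float    # TES flowrate [kg/s]; > 0 charging, < 0 discharging (sign convention above)
    n_ch: int      # number of active chillers (the one integer-valued command)
    m_cw: float    # cooling water flowrate through the chiller condensers [kg/s]
    m_a: float     # cooling tower air flowrate [kg/s]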
The plant state xp is affected by exogenous disturbances: the required cooling load, i.e., the rate at which heat needs to be removed from the buildings, and the ambient wet-bulb temperature. The disturbance cannot be ignored; e.g., the ambient wet-bulb temperature plays a critical role in the cooling tower dynamics.
2.2 Electrical Demand and Electricity Cost.
2.3 Model Calibration and Validation.
The parameters of the simulation model in Sec. 2.1 and the electrical demand model in Sec. 2.2 are calibrated using data from the energy management system in United World College (UWC) of South East Asia Tampines Campus in Singapore, shown in Fig. 3(b). The data are publicly available in Ref. [40], and details of the data are discussed in Ref. [41]. There are three chillers and nine cooling towers in the DCEP. The data from chiller one and cooling tower one are used for model calibration. We use 80% of the data for model identification and 20% for verification. The out-of-sample prediction results for the total electrical demand are shown in Fig. 3. Comparisons between data and predictions for other variables are not shown in the interest of space.
2.4 The (Ideal) Control Problem.
3 Reinforcement Learning Basics and Proposed Reinforcement Learning Controller
3.1 Reinforcement Learning Basics.
For the following construction, let x represent the state with state space 𝒳 and u the input with input space 𝒰. We consider an infinite horizon discounted optimal control problem.
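A generic statement of such a problem, written with a stage cost c(x, u), dynamics f, and a discount factor 0 < γ < 1, is sketched below; the notation here is generic, and the DCEP-specific stage cost and dynamics are those described in Secs. 2.1 and 3.3.2.

\begin{align}
\min_{u_0, u_1, \ldots} \; & \sum_{k=0}^{\infty} \gamma^{k}\, c(x_k, u_k) \\
\text{s.t.} \; & x_{k+1} = f(x_k, u_k), \qquad x_k \in \mathcal{X}, \; u_k \in \mathcal{U}.
\end{align}

The associated optimal Q-function satisfies the Bellman equation

\begin{equation}
Q^{*}(x, u) = c(x, u) + \gamma \min_{u' \in \mathcal{U}} Q^{*}\big(f(x, u), u'\big),
\end{equation}

and the optimal policy is \pi^{*}(x) = \arg\min_{u \in \mathcal{U}} Q^{*}(x, u). The RL algorithm that follows approximates Q^{*} with a parameterized Q_{\theta} and extracts its greedy policy.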
3.2 Proposed Reinforcement Learning Algorithm.
This algorithm is inspired by: (i) the Batch Convex Q-learning algorithm found in Ref. [27, Sec. III] and (ii) the LSPI algorithm [28]. The approach here is simpler than the batch optimization problem that underlies the algorithm in Ref. [27, Sec. III], which has an objective function that itself contains an optimization problem. In comparison to Ref. [28] we include a regularization term that is inspired by proximal methods in optimization that aid convergence, and a constraint to ensure the learned Q-function is non-negative.
Result: An approximate optimal policy π̂.
Input: discount factor γ, regularization weight β, trajectory length, and initial parameter vector θ(0).
for i = 0, 1, 2, … do
(1) Follow an exploration strategy and obtain an input sequence {uk}, initial state x0, and state sequence {xk}.
(2) For each time-step k in the trajectory, obtain the stage cost c(xk, uk).
(3) Set the data matrices and vectors appearing in Eq. (23) from these samples.
(4) Use the samples {xk}, {uk}, and {c(xk, uk)} to construct and solve Eq. (23) for the new parameter vector θ(i+1).
(5) Set the working policy to the greedy policy with respect to Qθ(i+1).
end
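To make the loop concrete, the following Python sketch mirrors Algorithm 1 under the assumption that the policy-evaluation problem (23) is a regularized least-squares fit of the Bellman residual with a non-negativity constraint on the fitted Q-values; since Eq. (23) is not reproduced above, this is our reading rather than a definitive implementation. The helpers collect_rollout, features, and greedy_policy are hypothetical placeholders, and cvxpy stands in for the cvx toolbox mentioned in Sec. 3.3.5.

import numpy as np
import cvxpy as cp

def policy_evaluation(Phi, Phi_next, cost, theta_prev, gamma=0.97, beta=100.0):
    # One policy-evaluation step: regularized least-squares fit of the Bellman
    # residual, with the fitted Q-values constrained to be non-negative at the
    # samples. Phi and Phi_next are (N, d) feature matrices for (x_k, u_k) and
    # (x_{k+1}, pi_i(x_{k+1})); cost is the length-N vector of stage costs.
    d = Phi.shape[1]
    theta = cp.Variable(d)
    residual = Phi @ theta - (cost + gamma * (Phi_next @ theta))
    objective = cp.sum_squares(residual) + beta * cp.sum_squares(theta - theta_prev)
    problem = cp.Problem(cp.Minimize(objective), [Phi @ theta >= 0])
    problem.solve()
    return theta.value

def train_rl_controller(collect_rollout, features, greedy_policy, theta0, n_iters=50):
    # Outer loop of Algorithm 1: explore, evaluate the current policy, improve.
    theta = theta0
    policies = [theta0]
    for i in range(n_iters):
        X, U, C, X_next = collect_rollout(theta, i)                          # step (1): exploration
        Phi = np.vstack([features(x, u) for x, u in zip(X, U)])
        Phi_next = np.vstack([features(xn, greedy_policy(theta, xn)) for xn in X_next])
        theta = policy_evaluation(Phi, Phi_next, np.asarray(C), theta)       # steps (2)-(4)
        policies.append(theta)                                               # step (5): greedy policy w.r.t. new theta
    return policies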
3.3 Proposed Reinforcement Learning Controller for DCEP.
We now specify the ingredients required to apply Algorithm 1 to obtain an RL controller (i.e., a state-feedback policy) for the DCEP from simulation data. Namely, (1) the state description, (2) the cost function design, (3) the approximation architecture, and (4) the exploration strategy. Parts (1), (2), and (3) refer to the setup of the optimal control problem that the RL algorithm is attempting to approximately solve. Part (4) refers to the selection of how the state/input space is explored (step (1) in Algorithm 1).
3.3.1 State Space Description.
3.3.2 Design of Stage Cost.
3.3.3 Approximation Architecture.
3.3.4 Exploration Strategy.
The entries correspond to the probability of using the corresponding control strategy, which appears in the (i)–(iii) order as just introduced. The rationale for this choice is that the BL controller provides “reasonable” state input examples for the RL algorithm in the early learning iterations to steer the parameter values in the correct direction. After this early learning phase, weight is shifted toward the current working policy to force the learning algorithm to update the parameter vector in response to its actions.
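A minimal sketch of this mixed exploration strategy is given below, assuming the three candidate strategies are (i) the BL controller, (ii) the current working policy, and (iii) a random input, in that order; the probability values and the iteration threshold are illustrative placeholders, not the values used in the paper.

import numpy as np

def exploration_input(x, iteration, bl_controller, current_policy, random_input, rng=None):
    # Choose the control input at state x from one of three strategies:
    # (i) the BL controller, (ii) the current working policy, (iii) a random input.
    rng = rng or np.random.default_rng()
    if iteration < 10:              # early learning phase: lean on the BL controller
        probs = [0.7, 0.2, 0.1]
    else:                           # later iterations: lean on the current working policy
        probs = [0.2, 0.7, 0.1]
    choice = rng.choice(3, p=probs)
    if choice == 0:
        return bl_controller(x)
    if choice == 1:
        return current_policy(x)
    return random_input()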
3.3.5 Training Settings.
The policy evaluation problem (23) during training is solved using cvx [43]. The simulation model (4) to generate state updates, which requires solving a non-convex NLP, is solved using casadi and ipopt [36,37].
The parameters used for RL training are γ = 0.97, d = 36, κ = 500, and β = 100. The parameter τ for the backward moving average filter on the electricity price is chosen to represent 4 h. The choice of the 36 basis functions is a bit involved; they are discussed in Sec. 7. Because a simulation time-step, k to k + 1, corresponds to a time interval of 10 min, each training trajectory corresponds to 3 days. The controller was trained with weather and load data for the 3 days (Oct. 10–12, 2011) from the Singapore UWC campus dataset described in Sec. 2.3. The electricity price data used for training were taken as a scaled version of the locational marginal price from PJM [44] for the 3 days (Aug. 30–Sept. 1, 2021).
3.4 Real-Time Implementation.
Due to the non-convexity of the feasible set and the integer nature of nch, Eq. (31) is a non-convex problem with an integer decision variable. We solve it using exhaustive search: for each possible value of nch, we solve the corresponding continuous-variable nonlinear program using casadi/ipopt [36,37], and then choose the minimum among the resulting candidate solutions by direct search. Direct search is feasible because the number of chillers in a DCEP is small in practice, as it is in our simulated example.
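The enumeration can be sketched as follows; solve_continuous_subproblem is a hypothetical wrapper around the casadi/ipopt NLP solved for a fixed number of active chillers, assumed to return the continuous commands and the associated Q-value.

def rl_control_action(x, theta, n_ch_max, solve_continuous_subproblem):
    # Evaluate the greedy policy of Eq. (31) by enumerating the chiller count.
    # For each candidate n_ch, the remaining continuous commands come from a
    # small NLP; the candidate with the lowest Q-value wins.
    best = None
    for n_ch in range(n_ch_max + 1):
        u_cont, q_val = solve_continuous_subproblem(x, theta, n_ch)
        if best is None or q_val < best[0]:
            best = (q_val, n_ch, u_cont)
    q_star, n_ch_star, u_cont_star = best
    return n_ch_star, u_cont_star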
4 Proposed Model Predictive Controller
The first challenge we have to overcome is not related to the mixed-integer nature of the problem but is related to the complex nature of the dynamics. Recall from Sec. 2.1 that the dynamic model, i.e., the function f in the equality constraint xk+1 = f(·) in Eq. (4) is not available in explicit form; rather the state is propagated in the simulation by solving an optimization problem. Without an explicit form for the function f(·), modern software tools that reduce the drudgery in nonlinear programming, namely numerical solvers with automatic differentiation, cannot be used.
The proposed algorithm to approximately solve Eq. (33) without using an MINLP solver or an MILP relaxation consists of three steps. These are listed below in brief, with more details provided subsequently.
The integer variable nch is relaxed to a continuous one, nch,c. The relaxed problem, an NLP, is solved using an NLP solver to obtain a locally optimal solution. In this paper, we use ipopt (through casadi) to solve this relaxed NLP.
The continuous solution nch,c resulting from step 1 is processed by Algorithms 2 and 3 to produce a transformed solution that is integer-valued, denoted by nch,d.
In Eq. (33), the input nch is fixed at the integer values nch,d obtained in step 2, and the resulting NLP is solved again. The resulting solution is called the MBOC. This three-step procedure is sketched below.
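In the sketch, solve_relaxed_nlp and solve_nlp_with_fixed_chillers are hypothetical wrappers around the casadi/ipopt calls, and rounding_moving_average and reduce_switching correspond to Algorithms 2 and 3 presented later in this section.

def mpc_step(x0, forecasts, solve_relaxed_nlp, solve_nlp_with_fixed_chillers,
             rounding_moving_average, reduce_switching, window=12):
    # One MPC solve without an MINLP solver: relax, post-process, re-solve.
    # Step 1: relax the integer chiller count to a continuous variable and solve the NLP.
    relaxed = solve_relaxed_nlp(x0, forecasts)
    n_ch_c = relaxed["n_ch_continuous"]
    # Step 2: convert the continuous chiller trajectory to an integer-valued one.
    n_ch_d = reduce_switching(rounding_moving_average(n_ch_c, window), window)   # Algorithms 2 and 3
    # Step 3: fix the chiller trajectory at n_ch_d and re-solve the NLP for the other commands.
    return solve_nlp_with_fixed_chillers(x0, forecasts, n_ch_d)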
In the sequel, we will refer to a vector x with non-negative integer components as an n-length discrete signal. For a discrete signal x, the number of switches, sw(x), is defined as the number of times two consecutive entries differ: sw(x) = ∑_{i=1}^{n−1} I(x_{i+1} − x_i), where I(·) is the indicator function: I(0) = 0 and I(y) = 1 for y ≠ 0.
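In code, this switch count is a one-line computation:

def num_switches(x):
    # Number of switches of a discrete signal: how many consecutive pairs of entries differ.
    return sum(1 for a, b in zip(x[:-1], x[1:]) if a != b)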
The continuous relaxation in step 1 is inspired by branch and bound algorithms for solving MINLPs, since such a relaxation is the first step in branch and bound algorithms. However, a simple round-off-based method to convert the continuous variable nch,c to a discrete one leads to a high number of oscillations in the solution. This corresponds to frequent turning on and off of one or more chillers, which is detrimental to them.
Step 2 converts the continuous solution from step 1 to a discrete signal, and involves multiple steps in itself. The first step is Algorithm 2, which filters the signal nch,c with a modified moving average filter with a two-hour window (corresponding to 12 samples with a 10 min sampling period) and then rounds the filtered value up to the nearest integer. Thus, by operating the moving average filter on nch,c, one obtains a discrete signal for the chiller command.
Input: Signal x (length n), w (window length)
for i = 1 : w
    y[i] = ⌈ mean(x[1 : i + w]) ⌉    # window truncated at the start of the signal
end
for i = w + 1 : n − w
    y[i] = ⌈ mean(x[i − w : i + w]) ⌉    # full window
end
for i = n − w + 1 : n
    y[i] = ⌈ mean(x[i − w : n]) ⌉    # window truncated at the end of the signal
end
Output: Discrete signal y.
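A runnable version of Algorithm 2 is sketched below, under our reading of the listing above (a window truncated at the signal boundaries, followed by a ceiling operation); the exact window handling in the original algorithm may differ.

import math

def rounding_moving_average(x, w):
    # Algorithm 2 (sketch): smooth the continuous chiller-count trajectory with a
    # moving average whose window is truncated at the signal boundaries, then
    # round each averaged value up to the nearest integer.
    n = len(x)
    y = [0] * n
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)   # truncate the window at the edges
        y[i] = math.ceil(sum(x[lo:hi]) / (hi - lo))
    return y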
The rounding moving average filter typically does not reduce the switching frequency sufficiently. This is why an additional step, Algorithm 3, described below, is used to operate on this signal and produce the output that has fewer switches.
xrs = reduce_switching(x, w)
Input: Discrete signal x (length n) and w (window length)
1: Obtain the indices of the entries of x that are not to be changed, index_freezed, as follows:
    Initialize index_freezed = zeros(n, 1)    # Array of dimension n with all entries zero
    for i = 1 : n − w + 1
        if sw(x[i : i + w − 1]) = 0    # no switch within this window of length w
            index_freezed[i : i + w − 1] = 1
        end
    end
2: Initialize xrs: set xrs[i] = x[i] for each i such that index_freezed[i] = 1.
3: For each i at which index_freezed[i] is 0:
    Find all the consecutive 0 entries till the next 1. Let these indices be i, i + 1, …, i + m, and define x̄ = max(x[i], …, x[i + m]).
    Set xrs[j] = x̄ for every j ∈ {i, …, i + m}.
    Set i ← i + m + 1.
end
Output: xrs.
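A runnable version of Algorithm 3 is sketched below. Both the freezing criterion (a switch-free window of length w) and the replacement rule for non-frozen runs (the maximum over the run, so that cooling capacity is not reduced) follow our reading of the listing above; the original may use a different rule.

def reduce_switching(x, w):
    # Algorithm 3 (sketch): freeze entries that stay constant over a window of
    # length w; replace every run of non-frozen entries by a single value.
    n = len(x)
    frozen = [False] * n
    # Step 1: freeze entries that belong to a switch-free window of length w.
    for i in range(n - w + 1):
        if all(x[j] == x[i] for j in range(i, i + w)):
            for j in range(i, i + w):
                frozen[j] = True
    # Step 2: frozen entries are copied unchanged.
    x_rs = list(x)
    # Step 3: replace each run of non-frozen entries by one value.
    i = 0
    while i < n:
        if frozen[i]:
            i += 1
            continue
        j = i
        while j < n and not frozen[j]:
            j += 1
        fill = max(x[i:j])
        for k in range(i, j):
            x_rs[k] = fill
        i = j
    return x_rs

For example, with w = 12 (two hours at a 10 min sampling period), a run such as 2, 3, 2, 3, 2 that is shorter than the window is collapsed to a constant value, eliminating rapid on/off cycling.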
The need for step 3 is that the chiller command {nch,d} at the end of the second step, together with the other variables in the solution vector from step 1, may violate some constraints of the optimization problem (33). Even if the step-1 solution and {nch,d} are feasible, the resulting control commands may not track the cooling load adequately. Step 3 ensures a feasible solution and improves tracking.
Forecasts: Implementation requires the availability of forecasts of the disturbances, i.e., the cooling load reference, the ambient wet-bulb temperature, and the electricity price, over the next planning horizon. There is a large literature on estimating and/or forecasting loads for buildings and real-time electricity prices; see Refs. [45–47] and references therein. The forecast of the wet-bulb temperature is available from the National Weather Service [48]. We therefore assume the forecasts of the three disturbance signals (cooling load, wet-bulb temperature, and electricity price ρk) are available to the MPC controller at each k.
5 Rule-Based Baseline Controller
In order to evaluate the performance of the RL and MPC controllers, we compare them to a rule-based baseline (BL) controller. The proposed baseline controller is designed to utilize the TES and time-varying electricity prices (to the extent possible with heuristics) to reduce energy costs. The RL controller and the baseline controller have the same information about the price: the current price ρk and its backward moving average. At each time-step k, the baseline controller determines the control command following the procedure shown in Fig. 4. The flowcharts are elaborated in Ref. [34] and briefly explained in Secs. 5.1 and 5.2. The subscript “sat” indicates that the variable is saturated at its upper or lower bound; the numerical values of the bounds used in simulations are shown in Table 1. For estimating the outputs under nominal conditions and the time-dependent bounds, please refer to Ref. [34].
5.1 For Chilled Water Loop.
At time-step k, the chilled water loop commands ṁchw, ṁtw, and nch are initialized to their values at time-step k − 1.
The BL controller increases or decreases ṁtw by a fixed amount (10 kg/s) if ρk is 5% lower or higher, respectively, than its backward moving average, in order to take advantage of the time-varying electricity price (a code sketch of this rule is given at the end of Sec. 5.1).
The BL controller then estimates the relevant outputs of the chilled water loop under nominal conditions. If these estimated outputs are within their bounds, the current control command for the chilled water loop is executed. Otherwise, the controller repeatedly increases/decreases ṁchw and ṁtw by a fixed amount (10 kg/s), and nch by 1, until the estimated outputs are within their bounds. Since nch determines the minimum required chilled water flowrate, the final ṁchw is readjusted to meet this minimum.
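A minimal sketch of the price-responsive rule above follows; the assignment of the rule to ṁtw and the function name are our inference from the description, and saturation at the TES flowrate bounds (Table 1) is omitted.

def bl_tes_flow_update(m_tw_prev, price_now, price_avg, step=10.0):
    # Baseline rule for the TES flowrate: push toward charging when the price is
    # at least 5% below its backward moving average, and toward discharging when
    # it is at least 5% above.
    if price_now < 0.95 * price_avg:
        return m_tw_prev + step        # more charging (positive flow = charging)
    if price_now > 1.05 * price_avg:
        return m_tw_prev - step        # more discharging
    return m_tw_prev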
5.2 For Cooling Water Loop.
The cooling water loop commands ṁcw and ṁa are initialized to their values at time-step k − 1.
The BL controller estimates the heat added to the cooling water loop by assuming that a fixed fraction of the electric power consumed by the chillers is rejected into the cooling water loop. This fraction is to be estimated from historical data. If the resulting estimate is above/below its bound, ṁcw is increased/decreased by a fixed amount (20 kg/s) repeatedly until the estimate is within its bound.
Once ṁcw is determined, the capacity of the cooling tower and the required cooling needed to bring the cooling water back down to its supply temperature are computed. If the capacity is adequate, the current control command for the cooling water loop is executed. If the capacity is below or above the requirement, ṁa is increased or decreased, respectively, by a fixed amount (0.05 kg/s). Since the cooling tower capacity depends on the ambient wet-bulb temperature (illustrated in Ref. [34]), there can be a case in which the required cooling cannot be met even when ṁa is already at its bound. In this case, ṁcw is varied by a fixed amount (20 kg/s) repeatedly until both quantities are within their bounds.
6 Performance Evaluation
6.1 Simulation Setup.
Simulations for closed-loop control with RL, MPC, and baseline controllers are performed for the week of Sept. 6–12, 2021, which we refer to as the testing week in the sequel. The weather data for the testing week are obtained from the Singapore data set described in Sec. 2.3. The real-time electricity price used is a scaled version of PJM’s locational marginal price for the same week [44]. Other relevant simulation parameters are listed in Table 1. There is no plant-model mismatch in the MPC simulations. In particular, since the forecasts of disturbance signals are available in practice (see the discussion at the end of Sec. 4), in the simulations the MPC controller is provided with error-free forecasts in the interest of simplicity.
We emphasize that the closed-loop results with the RL controller presented here are “out-of-sample” results, meaning the external disturbance wk (weather, cooling load, and electricity price) used in the closed-loop simulations are different from those used in training the RL controller.
6.2 Numerical Results and Discussion.
A summary of performance comparisons from the simulations is shown in Table 2. All three controllers meet the cooling load adequately (more on this later), and both the RL and MPC controllers reduce energy cost over the baseline by about the same amount (16.8% for RL versus 17.8% for MPC). These savings are comparable with those reported in the literature for MPC with MILP relaxation and RL.
| | Total cost ($) | Load tracking RMSE (kW) | No. of switches | Control computation time (s, μ ± σ) |
|---|---|---|---|---|
| Baseline | 3308 | 4.14 × 10⁻⁴ | 45 | 8.9 × 10⁻⁵ ± 3.9 × 10⁻⁴ |
| RL | 2752 | 1.85 | 114 | 0.32 ± 0.01 |
| MPC | 2719 | 61.38 | 65 | 27.33 ± 5.99 |
In terms of tracking the reference load, both RL and MPC again perform similarly while the baseline controller performs the best in terms of the standard deviation of tracking error; see Fig. 5 and Table 2. The worst tracking RMSE is 61 kW, which is a small fraction of the mean load (1313 kW). Thus the tracking performance is considered acceptable for all three controllers. The fact that the baseline performs the best in tracking the cooling load is perhaps not surprising since it is designed primarily to meet the required load and keep chiller switching low, with energy cost a secondary consideration.
In terms of chiller switches, the RL controller performs the worst; see Table 2. This is not surprising because no cost was assigned to higher switching in its design. The MPC performs the best in this metric, again most likely since keeping switching frequency low was an explicit consideration in its design. Ironically, this feature was introduced into the MPC controller after an initial design attempt without it, which led to a high switching frequency.
In terms of real-time computation cost, the baseline performs the best, which is not surprising since no optimization is involved. The RL controller has two orders of magnitude lower computation cost compared to MPC. The computation time for all controllers is well within the time budget since control commands are updated every 10 min.
Deeper look: Simulations are done for a week, but the plots below show only 2 days to avoid clutter. The cost savings by RL and MPC controller come from their ability to use the TES to shift the peak electric demand to periods of low price better than that of the baseline controller; see Fig. 6. The MPC controller has the knowledge of electricity price along the whole planning horizon, and thus achieves the most savings. The cause for the cost-saving differences between BL and RL controllers is that the RL controller learns the variation in the electricity price well, or at least better than the BL controller. This can be seen in Fig. 7. The RL controller always discharges the TES (Stwc drops) during the peak electricity price while the baseline controller sometimes cannot do so because the volume of cold water is already at its minimum bound. The BL controller discharges the TES as soon as the electricity price rises, which may result in insufficient cold water stored in the TES when the electricity price reaches its maximum. While both the RL and BL controllers are forced to use the same price information (current and a backward moving average), the rule-based logic in the baseline controller cannot use that information as effectively as RL.
An alternate view of this behavior can be obtained by looking at the times when the chillers are turned on and off, since using chillers costs much more electricity than using the TES, which only needs a few pumps. We can see from Fig. 8 that all controllers shift their peak electricity demand to the times when electricity is cheap. But the rule-based logic of the BL controller is not able to line up electric demand with the low price as well as the RL and MPC controllers do.
Another benefit of the RL controller is that it typically activates fewer chillers than the BL controller, though the cost of running active chillers is not incorporated in the cost function; see Fig. 9. This effect may increase the life expectancy of the DCEP.
7 Under the Hood of the Reinforcement Learning Controller
More insights about why the learned policy works under various conditions can be obtained by taking a closer look at the design choices made for the RL controller. All these choices were the result of considerable trial-and-error.
Choice of basis functions: The choice of basis to approximate the Q-function is essential to the success of the RL controller. It defines the approximate Q-function, and consequently the policy (31). Redundant basis functions can lead to overfitting, which causes poor out-of-sample performance of the policy. We avoid this effect by selecting a reduced quadratic basis, which are the 36 unique non-zero entries shown in Fig. 10. Another advantage of reducing the number of basis functions is that it reduces the number of parameters to learn, as training effort increases dramatically with the number of parameters to learn.
The choices for the basis were based on physical intuition about the DCEP. First, basis functions can be simplified by dropping redundant states. One example is Stww: since Stwc and Stww are complementary, Stwc + Stww = 1, one of them can be dropped. Considering that Stwc reflects the amount of cooling saved in the TES, we dropped Stww. Another example is the term Ttww, which is dropped since it is bounded by a temperature that is already included in the basis. Second, if two terms have a strong causal or dependent relationship, then the corresponding quadratic (cross) term should be selected as an element of the basis. Third, if two terms have minimal causal or dependent relationship, e.g., because they belong to different equipment and water loops, then the corresponding quadratic term should not be selected as an element of the basis.
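As an illustration of such a reduced quadratic basis, the feature map below keeps the retained linear terms, a few squares, and only the cross terms between strongly coupled quantities; the particular selection here is illustrative and is not the exact 36-element basis of Fig. 10.

import numpy as np

def reduced_quadratic_features(s_twc, q_cl, price, m_chw, m_tw, n_ch):
    # Illustrative reduced quadratic basis: linear terms, selected squares, and
    # cross terms only between physically coupled quantities.
    return np.array([
        1.0,
        s_twc, q_cl, price, m_chw, m_tw, n_ch,   # linear terms
        s_twc**2, q_cl**2, m_chw**2,             # selected squares
        q_cl * m_chw,                            # load and chilled water flow (strongly coupled)
        price * m_tw,                            # price and TES charging decision
        s_twc * m_tw,                            # TES level and TES flow
        q_cl * n_ch,                             # load and number of active chillers
    ])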
Choice of States: Exogenous disturbances have to be included into the RL states to make the controller work under various cooling load, electricity price, and weather trajectories that are distinct from what is seen during training. Without this feature, the RL controller will not be applicable in the field.
Convergence of the learning algorithm: The learning algorithm appears to converge in training, meaning |θk − θk−1| is seen to decrease as the number of training epochs k increases; see Fig. 11. This convergence should not be confused with convergence to a meaningful optimal policy: the policy learned in the 40th iteration can be a better-performing controller than the policy obtained in the 50th iteration. We believe the proximal-gradient-type regularization used in learning helps keep the parameters from diverging, but for the same reason it may prevent them from converging to a faraway optimum. This trade-off is necessary: our initial attempts without the damping proximal term were not successful in learning anything useful. As a result, after a few policy improvement iterations, every new policy obtained had to be tested by running a closed-loop simulation to assess its performance. The best performing one was selected as “the RL controller,” which happened to be the 26th one.
Numerical considerations for training: Training of the RL controller is an iterative task that requires trying many different configurations of the parameters appearing in Table 1. In particular, we found the following considerations useful:
If the value of κ is too small, the controller will not learn to track the cooling load. On the other hand, if κ is too large, the controller will not save energy cost. The value of κ given in Sec. 3.3.5 was determined by trial-and-error.
The condition number of Eq. (23) significantly affects the performance of Algorithm 1. However, the relative magnitudes of the state and input values are very different, for example, the cooling load is on the order of 10³ kW while Stwc ∈ [0.05, 0.95], which makes the condition number of Eq. (23) extremely large. Therefore, we normalize the state and input values by their average values, as sketched below. With appropriate scaling of the states/inputs, we reduced the condition number from about 10²⁰ to about 10³.
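A minimal normalization helper, assuming the regression data are stacked as a feature matrix; the scaling by mean absolute values is our illustration of normalizing with average values.

import numpy as np

def normalize_columns(Phi):
    # Scale each feature column by its mean absolute value so that quantities of very
    # different magnitudes (loads in kW versus TES fractions in [0, 1]) contribute
    # comparably to the regression in Eq. (23).
    scales = np.mean(np.abs(Phi), axis=0)
    scales[scales == 0] = 1.0          # guard against all-zero columns
    return Phi / scales, scales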
8 Conclusion
The proposed MPC and RL controllers are able to reduce energy cost significantly, by around 17% in a week-long simulation, over the rule-based baseline controller. Apart from the dramatically lower real-time computational cost of the RL controller compared to the MPC controller, the load tracking and energy cost-saving performances of the two controllers are similar. This similarity in performance is somewhat surprising. Though both controllers are designed to be approximations of the same intractable infinite horizon problem, there are nonetheless significant differences between them, especially in the information the controllers have access to and the objectives they are designed to minimize. It should be noted that the MPC controller has a crucial advantage over the RL controller in our simulations: the RL controller has to implicitly learn to forecast disturbances while the MPC controller is provided with error-free forecasts. How much MPC’s performance will degrade in practice due to inevitable plant-model mismatch is an open question.
Existing work on RL and on MPC tend to lie in their own silos, with comparisons between them for the same application being rare. This paper contributes to such comparisons for a particular application: control of DCEPs. Much more remains to be done, such as an examination of robustness to uncertainties.
There are several other avenues for future work. One is to explore nonlinear bases, such as neural networks, for designing an RL controller. Another is to augment the state space with additional signals, especially forecasts, which might improve performance. Of course, such augmentation will also increase the cost and complexity of training the policy. Another avenue for improvement in the RL controller is to reduce the number of chiller switches. In this paper, all the chillers are considered to be identical. An area of improvement is to extend the formulation to heterogeneous chillers with distinct performance curves, for both RL and MPC. On the MPC front, an MILP relaxation is a direction to pursue in the future.
Acknowledgement
The research reported here has been partially supported by the NSF through awards 1934322 (CMMI) and 2122313 (ECCS).
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.