Abstract
Connected and autonomous vehicles have the potential to minimize energy consumption by optimizing the vehicle velocity and powertrain dynamics with Vehicle-to-Everything information en route. Existing deterministic and stochastic methods for solving the eco-driving problem generally suffer from high computational and memory requirements, which makes online implementation challenging. This work proposes a hierarchical multi-horizon optimization framework implemented via a neural network. The neural network learns a full-route value function to account for the variability in route information and is then used to approximate the terminal cost in a receding horizon optimization. Simulations over real-world routes demonstrate that the proposed approach achieves performance comparable to a stochastic optimization solution obtained via reinforcement learning, while requiring no sophisticated training paradigm and negligible on-board memory.
1 Introduction
The introduction of connected and automated vehicles (CAVs) offers significant potential to improve transportation efficiency by reducing energy consumption. The increased amount of information the vehicle receives from vehicle-to-infrastructure (V2I), vehicle-to-vehicle (V2V), and global positioning system (GPS) technology affords CAVs the ability to make context-aware decisions en route for higher travel and fuel efficiency [1].
The eco-driving problem for CAVs can be formulated as finding the vehicle speed trajectory that minimizes a given cost functional over a route. Recent work in this field explores the opportunity to leverage information on surrounding traffic and signalized intersections, allowing for the harmonization of traffic speed [2]. Recent studies aim to minimize energy consumption by either sequentially optimizing [3] or co-optimizing [4] the speed and powertrain dynamics. Similarly, the potential of leveraging information from signalized intersections to improve CAV energy efficiency has been shown in Ref. [5], while V2V opportunities for efficiency improvements in large-scale traffic scenarios are explored in Ref. [6].
Different solution methods have been investigated in the literature to solve the eco-driving problem. Most rely on Pontryagin's minimum principle (PMP) [7] or dynamic programming (DP) [8]. Although PMP and DP provide globally optimal solutions for a given route itinerary, their high computational requirements make solving the problem online difficult. Alternatively, the problem can be solved hierarchically using DP over multiple horizons [9]. This solution method, referred to as the rollout algorithm [4], first solves a long-term optimization under nominal conditions and then solves a short-term optimization to account for variability en route. The long-term optimization serves as a base-heuristic, and its value function approximates the terminal cost of the short-horizon optimization.
Different methods may be explored to generate the base-heuristic. For example, a full-route deterministic optimization can be performed using DP and stored as multi-dimensional maps [10]. Alternatively, machine learning (ML) approaches that leverage data sets obtained via field experiments or simulation may be applied. For instance, a Safe Model-based Off-policy Reinforcement Learning (SMORL) method was demonstrated to learn the terminal cost offline [1]. In addition, several approaches use online training based on Q-learning [11] and actor-critic networks [12] to optimize the base-heuristic during simulation.
Computing a full-route DP solution requires no online training but fails to account for unknown route variability, such as signal phase and timing (SPaT), which refers to the duration and order of a traffic light's states (i.e., red, green, and yellow). Moreover, storing the precomputed deterministic value function is memory intensive, hence prohibitive for in-vehicle deployment. Conversely, using ML for offline or online training requires less memory than the DP solution, but the extensive training and the feedback process intrinsic to reinforcement learning make this approach computationally expensive. More specifically, model-based reinforcement learning is computationally expensive due to its complexity, while model-free reinforcement learning lessens this expense but introduces inaccuracy by relying on a reward function that does not account for the system characteristics or physics.

This work proposes a neural network (NN) based methodology for extracting the base-heuristic from a pre-computed full-route solution derived from a co-optimization of powertrain dynamics and velocity [4]. Given a terminal state as input, the NN approximates the terminal cost for the remainder of the driving mission. Training is performed using the value function computed on different routes and SPaT combinations. The presented approach is illustrated in Fig. 1 and offers three advantages over previous methods. First, the offline training of the terminal cost saves significant computation time relative to the SMORL method. Second, the NN is trained on a set of optimal DP solutions based on an accurate physical model, which improves on the accuracy achieved by online reinforcement learning methods. Third, the NN is implemented as a function approximator rather than a static map, significantly decreasing the memory requirements.
2 Vehicle Dynamics and Powertrain Model
A forward-looking model of the longitudinal vehicle dynamics and a 48 V mild-hybrid powertrain is used in this work to compute energy use [8]. The powertrain consists of a Belted Starter Generator (BSG) connected to a 1.8 L turbocharged gasoline engine. The battery is modeled as a zeroth-order equivalent circuit to determine the State-of-Charge (SoC). Quasi-static models predict the engine fuel consumption as well as the BSG, torque converter, and transmission efficiencies. The model was validated with test data collected on a chassis dynamometer over regulatory drive cycles [13].
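For reference, a common zeroth-order equivalent circuit formulation is sketched below. The open-circuit voltage V_oc, internal resistance R_0, and nominal capacity Q_nom are not specified in this paper, so the expressions are a generic sketch of the model structure rather than the authors' exact implementation.

```latex
% Zeroth-order equivalent circuit: the battery power balance
% P_batt = V_oc I - R_0 I^2 is solved for the current I,
% which then drives the SoC dynamics.
\begin{align}
  I(t) &= \frac{V_{oc} - \sqrt{V_{oc}^{2} - 4 R_{0} P_{batt}(t)}}{2 R_{0}}, \\
  \frac{d\,\mathrm{SoC}(t)}{dt} &= -\frac{I(t)}{Q_{nom}}
\end{align}
```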
3 Problem Formulation
3.1 Full Route Optimization.
3.2 Receding Horizon Optimization.
The OCP formulated in Eq. (7) is solved using SMORL [1]. The algorithm performs a data-driven approximation of the terminal cost, accounting for the variations in SPaT using a simulator of the vehicle interacting with the environment. The simulated ego-vehicle is assumed to be equipped with GPS and a Dedicated Short-Range Communication (DSRC) sensor, providing SPaT data within the communication range.
When compared against the deterministic rollout solution, assumed as the baseline strategy, SMORL showed 11.0% lower fuel consumption [1]. While the stochastic OCP solved using SMORL allows the incorporation of a wide range of scenarios, the algorithm requires a large number of online simulations for the training process. In fact, the main limitation of this method is its reliance on an actor-critic network system and a perturbation network supported by both a safe set and a replay buffer [1]. As a direct consequence, this method requires a very large memory allocation, which is prohibitive in realistic online implementations.
3.3 Neural Network–Based Rollout Algorithm.
Instead of learning the value function using SMORL, a novel rollout algorithm is developed in this paper. This approach requires the full-route optimization in Eq. (5) to be run offline for a wide range of simulated routes and SPaT combinations, storing the resulting value functions. A fully connected feed-forward NN is then trained to approximate the cost-to-go from the terminal cost matrices as a function of an extended state-space. The inputs to the NN extend those of Ref. [1] and are summarized in Table 1.
| Variable | Description |
| --- | --- |
| SoC ∈ ℝ | Battery SoC |
| Vveh ∈ ℝ | Vehicle velocity |
| Vrlim ∈ ℝ | Difference between vehicle velocity and speed limit at the current road segment |
| | Difference between vehicle velocity and upcoming speed limit |
| dtfc ∈ ℝ | Distance to upcoming traffic light |
| | Distance to road segment at which the speed limit changes |
| drem ∈ ℝ | Remaining distance of the trip |
| xtfc ∈ [−1, 1]⁶ | Sampled status of the upcoming traffic light, encoded as six digits |
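To make the input concrete, the following is a minimal sketch of how the 13-dimensional input vector of Table 1 could be assembled. The function name, signature, and entry ordering are illustrative assumptions; only the quantities themselves come from Table 1.

```python
import numpy as np

def build_nn_input(soc, v_veh, v_lim_cur, v_lim_next,
                   d_tfc, d_rlim, d_rem, x_tfc):
    """Assemble the 13-dimensional NN input of Table 1 (hypothetical layout).

    x_tfc is the six-digit encoding of the upcoming traffic light
    status, with each entry in [-1, 1].
    """
    assert len(x_tfc) == 6
    return np.array([
        soc,                 # battery SoC
        v_veh,               # vehicle velocity
        v_veh - v_lim_cur,   # difference from current speed limit
        v_veh - v_lim_next,  # difference from upcoming speed limit
        d_tfc,               # distance to upcoming traffic light
        d_rlim,              # distance to next speed-limit change
        d_rem,               # remaining trip distance
        *x_tfc,              # sampled SPaT encoding (6 entries)
    ], dtype=np.float32)
```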
The NN is trained offline with supervised learning, using the processed DP solution as ground truth. This benefits the short-term rollout solution because both the data generation and the training are performed inexpensively offline. In addition, the resulting network accounts for SPaT, allowing it to handle stochastic elements while requiring significantly less storage space. A summary of the process for producing the neural network is given in Fig. 2.
3.3.1 Data Set Creation and Augmentation.
The 3-state vector is augmented with speed limit and SPaT information, under the assumption that the location and value of every speed limit along the route are known, so that its relative difference from the vehicle velocity can be defined. The route length is also known; hence, the remaining distance is included as an additional state. SPaT is then included through the distance to the next traffic light, its phase, and the time sample within that phase. The final NN input vector for training takes the form shown in Table 1.
The cost-to-go values obtained from DP include high-cost regions that correspond to infeasible operation of the vehicle. Because the objective is to use supervised learning to approximate the value function, these infeasibilities were addressed in preprocessing. Specifically, the infeasibilities due to violation of the state limits are shown in Figs. 3–5. Infeasibilities imposing the terminal recharge constraint (Fig. 4) were removed because that constraint is enforced during rollout. Infeasibilities penalizing excessively low speeds (Fig. 5) were removed because they are not applicable in an online simulation scenario. Infeasibilities due to traffic lights were instead truncated so that the neural network can still account for the obstacle, as in Fig. 5. Finally, infeasibilities from exceeding the speed limit (Fig. 3) were removed because that constraint is also applied during rollout.
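As an illustration of this preprocessing, the sketch below prunes and truncates the high-cost regions of a DP cost-to-go array. The mask construction and the cost cap j_cap are hypothetical placeholders; the paper does not specify the exact thresholds used.

```python
import numpy as np

def preprocess_cost_to_go(J, keep_mask, tfc_mask, j_cap):
    """Prune and truncate the DP cost-to-go before supervised training.

    J         : array of cost-to-go values from the full-route DP
    keep_mask : False where infeasibility stems from the recharge, low-speed,
                or speed-limit constraints (re-imposed during rollout)
    tfc_mask  : True where infeasibility stems from a traffic light
    j_cap     : finite cap so the NN can still learn the obstacle
    """
    J = J.copy()
    J[tfc_mask] = np.minimum(J[tfc_mask], j_cap)  # truncate, keep the signal
    # The corresponding state samples would be filtered with the same mask.
    return J[keep_mask]
```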
3.3.2 Training Process.
To produce the raw data, the full-route DP optimization was performed on a set of routes. Once post-processed, these data were used to train the NN, whose input layer matches the input vector in Table 1 and whose output layer is the predicted value function. The network is optimized with an offline supervised training process: the data were normalized, shuffled, and applied in mini-batches, with Adam as the optimization algorithm. Using a learning rate of 0.001 and a dropout rate of 30% on a two-layer network with 500 neurons per layer, an optimal set of network parameters was derived with the procedure summarized in Algorithm 1. Training was terminated by early stopping, once the changes in training and test error became negligible.
Algorithm 1: Neural network supervised learning
1: Initialize the neural network, encoder, Adam optimizer, loss function, and output file for the training and test loss
2: for n_iter in N epochs do
3:  Initialize loss values
4:  Randomly sample 100,000 points uniformly from the processed DP solution
5:  for the jth mini-batch of 500 in the data set do
6:   Encode the traffic light data
7:   Perform a forward pass on the input data
8:   Update the model weights using back-propagation on the prediction error
9:   Add the training loss to a running sum
10:  end for
11:  Average the running sum to obtain the average training loss
12:  Perform a forward pass and compute the loss on the test set using the current network
13: end for
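For concreteness, below is a minimal PyTorch sketch of Algorithm 1 using the hyperparameters stated above (two hidden layers of 500 neurons, 30% dropout, Adam with a 0.001 learning rate, mini-batches of 500, and 100,000 samples per epoch). The ReLU activation, MSE loss, and the sample_epoch data-access callable are assumptions; the paper does not specify them.

```python
import torch
import torch.nn as nn

class TerminalCostNN(nn.Module):
    """Two-layer MLP terminal-cost approximator (Sec. 3.3.2):
    13 inputs (Table 1), two hidden layers of 500 neurons, 30% dropout."""
    def __init__(self, n_in=13, n_hidden=500, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(n_hidden, 1),  # predicted cost-to-go
        )

    def forward(self, x):
        return self.net(x)

def train(model, sample_epoch, test_set, n_epochs, batch=500, lr=1e-3):
    """Algorithm 1: per epoch, sample 100,000 (state, cost-to-go) pairs
    uniformly from the processed DP solution, then iterate mini-batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    x_test, y_test = test_set
    for epoch in range(n_epochs):
        x, y = sample_epoch(100_000)       # already encoded and normalized
        running, n_batches = 0.0, 0
        model.train()
        for i in range(0, len(x), batch):
            opt.zero_grad()
            loss = loss_fn(model(x[i:i + batch]), y[i:i + batch])
            loss.backward()                # back-propagate prediction error
            opt.step()                     # update model weights
            running += loss.item()
            n_batches += 1
        model.eval()
        with torch.no_grad():              # forward run on the test set
            test_loss = loss_fn(model(x_test), y_test).item()
        print(f"epoch {epoch}: train {running / n_batches:.4f}, "
              f"test {test_loss:.4f}")
```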
3.3.3 Neural Network.
The converged NN approximates the terminal cost as a function of the augmented vehicle state vector in Table 1. Because it uses an offline training process and requires far less memory to store and evaluate than the fully deterministic solution, this formulation differs from SMORL by accounting for stochastic variation in a more computationally efficient manner.
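Conceptually, the trained network enters the receding-horizon problem only through its terminal cost. The sketch below illustrates this for a single candidate trajectory from the short-horizon solver; since the paper solves the short-term RHOCP deterministically, this is an illustration rather than the authors' solver.

```python
import torch

def rollout_cost(stage_costs, terminal_state, terminal_nn):
    """Cost of one candidate trajectory in the RHOCP: the sum of the
    short-horizon stage costs plus the NN estimate of the cost-to-go
    at the horizon's terminal state (the base-heuristic)."""
    with torch.no_grad():
        terminal_cost = terminal_nn(terminal_state.unsqueeze(0)).item()
    return sum(stage_costs) + terminal_cost

# The candidate minimizing rollout_cost(...) over the short horizon is
# selected, its first control action is applied, and the horizon recedes.
```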
4 Simulation Results
The trained NN is used as the base-heuristic in the RHOCP of Eq. (6) and compared against SMORL. To demonstrate the ability of the NN to predict the terminal cost for a given set of states, the network was first deliberately overfit to a single training data set, i.e., trained to accurately predict only its training data. The chosen route was an 8.2 km mixed-urban route with two traffic lights, a representative example from the 100-route data set.
The trained NN was then integrated in the rollout scheme shown in Fig. 1 with a horizon length of 1 km. Figure 6 compares the solutions of the RHOCP in Eq. (6) using the deterministic value function and the developed NN as the terminal cost. The velocity trajectories obtained with the two methods are nearly identical, which indirectly confirms that the trained NN correctly predicts the terminal cost of the RHOCP.
After this initial confirmation, the NN was retrained to generalize over several possible routes. A set of 20 real-world routes was chosen, with SPaT profiles generated in SUMO (Simulation of Urban Mobility) [14]. Each route contains around 2 × 10⁸ data points, together composing a set of approximately 4 × 10⁹ data points. For this training process, the routes were split into 16 training and 4 test routes. The training process detailed in Sec. 3.3.2 was followed. Figure 7 shows the average loss per epoch over the test set, demonstrating convergence during learning.
The resulting NN is integrated with the rollout using a 200 m horizon and simulated over five representative routes selected from a set of 100 routes. The routes include three mixed-urban routes and two urban routes, summarized in Table 2.
| Route | Length (km) | Average speed limit (m/s) | Traffic lights | Stop signs |
| --- | --- | --- | --- | --- |
| Overfit | 8.22 | 25.32 | 2 | 2 |
| Urban Route 1 (UR1) | 10.01 | 18.27 | 7 | 2 |
| Urban Route 2 (UR2) | 8.41 | 19.22 | 3 | 2 |
| Mixed-Urban Route 1 (MUR1) | 10.83 | 23.45 | 5 | 2 |
| Mixed-Urban Route 2 (MUR2) | 7.40 | 22.44 | 6 | 2 |
| Mixed-Urban Route 3 (MUR3) | 7.17 | 23.23 | 7 | 2 |
The cumulative performance of the NN-rollout framework is compared against SMORL over the five selected routes, as summarized in Table 3. The results indicate that the NN-rollout method reduces the cumulative cost compared to SMORL on all the routes considered, because the NN learned the value function from globally optimal DP solutions. Further, the NN-rollout method improves the fuel economy over SMORL on four of the five selected routes, with minimal effect on the travel time.
| Route | Fuel economy (mpg) | | Time (s) | | Cumulative cost (—) | | Final SoC (—) | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | SMORL | NN-rollout | SMORL | NN-rollout | SMORL | NN-rollout | SMORL | NN-rollout |
| UR1 | 44.22 | 43.72 (−1.1%) | 752 | 748 (−0.53%) | 594.74 | 594.44 (−0.050%) | 0.5006 | 0.5689 |
| MUR1 | 42.87 | 45.68 (+6.4%) | 702 | 712 (+1.4%) | 583.96 | 578.43 (−0.95%) | 0.5025 | 0.5448 |
| UR2 | 43.38 | 46.22 (+6.3%) | 583 | 596 (+2.2%) | 473.29 | 472.11 (−0.25%) | 0.5009 | 0.5679 |
| MUR2 | 40.08 | 42.52 (+5.9%) | 575 | 588 (+2.2%) | 462.10 | 461.80 (−0.065%) | 0.5008 | 0.5648 |
| MUR3 | 39.79 | 41.54 (+4.3%) | 532 | 532 (+0%) | 434.31 | 428.91 (−1.3%) | 0.5002 | 0.5639 |
Figures 8 and 9 show the optimal state and control input trajectories, along with the time-space plots, for UR1 and UR2, respectively. The state and control input trajectories show that all the constraints on speed limits, battery SoC, and torque are met, which is ensured by the deterministic solution of the short-term RHOCP. A correct prediction of the terminal cost becomes important when approaching intersections, where errors could lead to infeasibilities. Indeed, the results indicate that the vehicle does not violate any red lights or stop signs, passing through intersections only when the light is green.
The velocity trajectory exhibits some noise resulting from the generalized NN fit, but it also demonstrates the ability to extrapolate from the training set. For example, the vehicle accelerates and then coasts through a speed limit change, as seen on UR1 between 8000 m and 10,000 m in Fig. 8(a) and on UR2 between 6000 m and 7000 m in Fig. 9(a). Constant velocities are held on unobstructed sections of road, as seen on UR1 between 7000 m and 8000 m in Fig. 8(a) and on UR2 between 5000 m and 6000 m in Fig. 9(a).
The slight change in the velocity trajectory between the NN-based method and SMORL can cause the vehicle to experience a different SPaT sequence along the route, which is evident from the time-space plots in Figs. 8 and 9. For the UR2 case, shown in Fig. 9(a), the speed profile of the NN-based approach is almost identical to that of SMORL, except that the NN commands a slower constant velocity between 2000 m and 3000 m. This allows the vehicle to avoid stopping at a red light at 5000 m, as shown in Fig. 9(d), whereas SMORL encounters a red phase at the same intersection.
Comparison of the SoC trajectories demonstrates close to charge-sustaining behavior in both cases, with only a slight deviation from the 50% SoC target at the end of the route. This deviation is due to pruning the infeasible raw data shown in Fig. 4, which discouraged low SoC values at the end of the route and created a preference for slight overshoots.
Overall, the NN-rollout method outperforms the SMORL approach in approximating the terminal cost. Due to its offline training, the NN-rollout is not only significantly less computationally expensive, but also results in faster simulation times. In addition, compared to the deterministic DP solution, the NN-rollout method provides a full approximation of the mapping between terminal state and terminal cost without the need to store the complete value function. For the routes considered in this paper, Table 4 shows that the NN used 268 kilobytes (kB) of memory, compared to the 2–5 gigabytes (GB) needed for the deterministic approach. Therefore, the NN-rollout approach approximates the terminal cost in a more computationally and memory-efficient manner than the existing methods.
5 Conclusion
A neural network approximation of the terminal cost in a receding horizon optimal control problem is proposed in this paper. The method is compared against a stochastic approach that captures the variability in SPaT. Simulations over five real-world routes show that the proposed NN-rollout outperforms the stochastic method while reducing the computational and memory requirements. The NN-rollout provides an exhaustive mapping of the terminal cost similar to a full-route DP solution, but without the significant memory footprint. Additionally, the NN approximation shows robustness to variability similar to that of the reinforcement learning method, while requiring a completely offline and computationally more efficient process.
Current work focuses on integrating the proposed NN as the terminal cost approximator of the rollout algorithm for eco-driving presented in Ref. [4]. This will enable in-vehicle integration and experimental verification of the proposed strategy. Future work will extend the NN framework to include uncertainties due to variations in traffic density.
Acknowledgment
The authors acknowledge the support from the United States Department of Energy, Advanced Research Projects Agency—Energy (ARPA-E) NEXTCAR project (Award No. DE-AR0000794).
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.
Footnote
Paper presented at the 2023 Modeling, Estimation, and Control Conference (MECC 2023), Lake Tahoe, NV, October 2–5. Paper No. MECC2023-46.