
CN112026744B - Energy management method for a series-parallel hybrid power system based on a DQN variant - Google Patents


Info

Publication number
CN112026744B
CN112026744B (application CN202010845021.9A)
Authority
CN
China
Prior art keywords
hybrid
energy management
vehicle
dqn
battery
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010845021.9A
Other languages
Chinese (zh)
Other versions
CN112026744A (en)
Inventor
周健豪
薛四伍
廖宇晖
薛源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202010845021.9A
Publication of CN112026744A
Application granted
Publication of CN112026744B
Legal status: Active
Anticipated expiration

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 20/00 - Control systems specially adapted for hybrid vehicles
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 50/00 - Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 50/00 - Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W 2050/0001 - Details of the control system
    • B60W 2050/0019 - Control system elements or transfer functions
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 50/00 - Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W 2050/0001 - Details of the control system
    • B60W 2050/0019 - Control system elements or transfer functions
    • B60W 2050/0028 - Mathematical models, e.g. for simulation
    • B60W 2050/0031 - Mathematical model of the vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Mechanical Engineering (AREA)
  • Transportation (AREA)
  • Human Computer Interaction (AREA)
  • Hybrid Electric Vehicles (AREA)
  • Electric Propulsion And Braking For Vehicles (AREA)

Abstract

The invention discloses an energy management method for a series-parallel hybrid power system based on a DQN variant, which belongs to the technical field of series-parallel hybrid electric vehicles and can improve training convergence speed and vehicle fuel economy. The invention includes: establishing a series-parallel hybrid electric vehicle model and obtaining the environmental parameters that affect the energy management strategy, including road gradient and onboard mass; solving for the optimal energy management strategy with the dynamic programming (DP) algorithm and saving the resulting experience into an optimal experience buffer (OEB); and, combined with hybrid experience replay (HER), training the model with the Dueling DQN strategy to obtain a trained deep reinforcement learning agent that performs energy management of the series-parallel hybrid vehicle under different working conditions. The HER technique and the Dueling architecture of the DQN variant constructed by the invention can effectively improve training convergence speed, vehicle fuel economy, and algorithm robustness.

Description

Series-parallel hybrid power system energy management method based on DQN variants
Technical Field
The invention belongs to the technical field of series-parallel hybrid electric vehicles, and particularly relates to a series-parallel hybrid power system energy management method based on DQN variants.
Background
At present, the energy crisis is deepening and automobile emission standards are becoming more stringent, so the use of pure fuel vehicles is being challenged. Hybrid electric vehicles combine the long driving range of fuel vehicles with the zero local emissions of electric vehicles, alleviating the problem of fossil fuel combustion; the energy management problem of hybrid power has therefore always been a key research topic.
At present, most energy management for hybrid electric vehicles is rule-based: a set of energy management thresholds is defined, and the most common rule for plug-in hybrid electric vehicles is to first deplete the battery energy and then maintain the battery charge. Among optimization-based strategies, the representative benchmark is dynamic programming (DP): with the global driving cycle known in advance, a relatively optimal energy management solution is obtained offline by using the known speed profile to distribute the optimal energy demand between the engine and the battery of the hybrid vehicle. In the prior art, engineers either hand-tune rules for rule-based energy management or apply optimized model predictive control based on known or predicted speeds, thereby adjusting the equivalent fuel consumption of the hybrid vehicle.
However, the prior-art methods have several disadvantages. Rule-based energy management is often not effective enough and requires substantial expert knowledge for each individual driving condition. Optimization-based DP requires the global driving cycle to be known, and its computation time is too long for real-time online application. Model predictive control can run optimization in real time, but the prediction horizon cannot be chosen too large, so its results still fall well short of the DP optimum. Moreover, many optimization methods are not comprehensive: they ignore the road gradient information and the changes in the vehicle's onboard mass.
Disclosure of Invention
The invention provides a series-parallel hybrid power system energy management method based on a DQN variant, which combines hybrid experience replay built from the OEB and PER techniques to improve training convergence speed and vehicle fuel economy.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a method for the energy management of a series-parallel hybrid power system based on DQN variants comprises the following steps:
Step one: establishing a model of a passive series-parallel vehicle;
Step two: acquiring the parameters influencing energy management of the experimental vehicle under a fixed-route working condition, then solving with DP to obtain the optimal solution, and storing the optimal-solution experience in an OEB;
Step three: based on the parameters and observed quantities influencing energy management, training a Dueling DQN neural network model using HER combined with PER to obtain a trained deep reinforcement learning agent;
Step four: acquiring the parameters and observed quantities influencing energy management during actual driving, and performing energy management of the hybrid electric vehicle under different working conditions based on those quantities and the trained deep reinforcement learning agent.
In the above steps, the objective function of energy management is:
J = Σ_t [ ṁ_fuel(t) + γ·(SOC(t) − SOC_ref(t))² ]
wherein γ is a positive weighting factor, indicating a balanced equivalence between fuel consumption and battery power consumption; SOC_ref represents the SOC reference value, and ṁ_fuel is the fuel consumption at each sampling time;
The passive series-parallel vehicle model comprises a vehicle dynamics model, a planetary gear transmission, a motor, and a battery.
The vehicle dynamics model is as follows:
T_out = R·(F_a + F_r + F_g + F_f),  F_a = m·a,  F_r = (1/2)·ρ·A·C_D·v²,  F_g = m·g·sin α,  F_f = μ_r·m·g·cos α
wherein T_out is the drive shaft torque, R is the wheel radius, F_a is the inertial resistance of the vehicle, F_r is the aerodynamic resistance, F_g is the grade (ramp) resistance, F_f is the rolling resistance, m is the vehicle mass, v is the vehicle speed, a is the vehicle acceleration, ρ is the air density, A is the frontal area of the vehicle, C_D is the air drag coefficient, α is the road gradient, and μ_r is the rolling resistance coefficient;
The planetary gear transmission model is as follows:
(1 + β)·n_e = n_m + β·n_out,  T_m = T_e / (1 + β)
wherein n_m is the motor speed; β is the planetary gear parameter; n_out is the drive shaft speed; n_e is the engine speed; T_m is the motor torque and T_e is the engine torque;
The battery model is:
P_batt(t) = V_oc·I_b(t) − I_b(t)²·r_int,  I_b(t) = [V_oc − √(V_oc² − 4·r_int·P_batt(t))] / (2·r_int)
wherein P_batt(t) is the battery power, V_oc is the battery voltage, I_b(t) is the battery current, r_int is the internal battery resistance, and P_m(t) is the motor power, whose magnitude satisfies
P_batt(t) = P_m(t)/η_m when discharging and P_batt(t) = P_m(t)·η_m when charging, with SOC dynamics d(SOC)/dt = −I_b(t)/Q_max
wherein η_m is the motor efficiency, SOC is the battery state of charge, and Q_max is the maximum battery capacity;
The parameters influencing energy management comprise the road conditions of the hybrid electric vehicle under different working conditions, namely the road gradient, and the onboard mass changes caused by passenger or cargo changes;
The observed quantities influencing energy management comprise the vehicle speed, vehicle acceleration, engine speed, engine torque, motor speed, motor torque, battery state of charge, fuel consumption at the current moment, the difference between the SOC and the reference SOC, the vehicle displacement, and the measurable disturbances, namely the road gradient and the onboard mass variation;
Steps two to four specifically comprise the following:
Under the condition that the working-condition information is known in advance, the optimal energy management experience is obtained by solving with the DP algorithm and is stored in the OEB. Then, under real-time working conditions, a Dueling DQN in deep reinforcement learning is trained. In each training step, the SOC at the current moment, the difference between the SOC and the reference SOC, the vehicle speed, the vehicle displacement, the vehicle acceleration, the fuel consumption, the road gradient, and the onboard mass variation are used as the observation input data of the Dueling DQN agent, and the reward value at the current moment is used as its reward input data. The experience obtained at each step is stored in the PEB; PER is then used to sample from the PEB while samples are drawn at random from the OEB, and the two sets of experience are combined, with the proportion of PEB experience decreasing continuously over time. The neural network of the HER-based Dueling DQN is thereby trained to obtain a converged agent, and the equivalent fuel consumption of the experimental vehicle under different working conditions and the parameters influencing energy management are obtained. The observed quantities are input into the deep reinforcement learning agent, and the output is the control quantity of the series-parallel hybrid electric vehicle, namely the engine torque demand and engine speed at the moment following the current moment, where the current moment is the moment of the current observation;
The parameters of the experimental vehicle that influence energy management under different working conditions are obtained as follows: a plurality of samples are obtained, each sample including parameters collected on the experimental vehicle at different moments that may affect energy management;
After the equivalent fuel consumption of the experimental vehicle under different working conditions and the parameters influencing energy management have been obtained, the equivalent fuel consumption of the experimental vehicle under the fixed-route working condition and the parameters influencing energy management are collected at a preset sampling frequency, and the collected data are smoothed and normalized.
Beneficial effects: the invention provides a series-parallel hybrid power system energy management method based on a DQN variant. The method takes into account the environmental parameters affecting energy management, namely road gradient and onboard mass change; it solves for optimal experience in advance with DP and combines it with PER to form HER, so that the Dueling-architecture DQN agent can be trained more effectively, and the equivalent fuel consumption of the experimental vehicle under different working conditions, together with the environmental parameters affecting energy management, is obtained. A deep reinforcement learning agent model is trained on the parameters and observed quantities influencing energy management to obtain a trained agent. The environmental parameters influencing energy management during actual driving are then acquired, and the energy management of the hybrid electric vehicle is performed based on these parameters and the trained agent. In this way, energy optimization can be controlled effectively, real-time online application is possible, the energy management of the hybrid vehicle is controlled more effectively, and energy consumption is reduced.
Drawings
FIG. 1 is a schematic illustration of the training and application process of a series-parallel hybrid power system energy management method based on DQN variants in an embodiment of the present invention;
FIG. 2 is a flow chart of a specific application of a series-parallel hybrid power system energy management method based on DQN variants in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the Dueling DQN network structure in the series-parallel hybrid power system energy management method based on DQN variants in the embodiment of the invention;
FIG. 4 is a graph of reference SOC versus time for an embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the following figures and specific examples:
a method for the energy management of a series-parallel hybrid power system based on DQN variants comprises the following steps:
Step one: establishing a model of a passive series-parallel vehicle;
Step two: acquiring the parameters influencing energy management of the experimental vehicle under a fixed-route working condition, then solving with DP to obtain the optimal solution, and storing the optimal-solution experience in an OEB;
Step three: based on the parameters and observed quantities influencing energy management, training a Dueling DQN neural network model using HER combined with PER to obtain a trained deep reinforcement learning agent;
Step four: acquiring the parameters and observed quantities influencing energy management during actual driving, and performing energy management of the hybrid electric vehicle under different working conditions based on those quantities and the trained deep reinforcement learning agent.
In the above steps, the objective function of energy management is:
J = Σ_t [ ṁ_fuel(t) + γ·(SOC(t) − SOC_ref(t))² ]
wherein γ is a positive weighting factor, indicating a balanced equivalence between fuel consumption and battery power consumption, and SOC_ref represents the SOC reference value. As shown in FIG. 4, the reference SOC in the objective function is obtained mainly from the trip duration of the vehicle's working condition, known from historical travel information; a simple SOC reference can thus be used, namely a reference SOC that decreases uniformly as a linear function of time, under which the battery works better in this condition;
The passive series-parallel vehicle model comprises a vehicle dynamics model, a planetary gear transmission, a motor, and a battery.
The vehicle dynamics model is as follows:
T_out = R·(F_a + F_r + F_g + F_f),  F_a = m·a,  F_r = (1/2)·ρ·A·C_D·v²,  F_g = m·g·sin α,  F_f = μ_r·m·g·cos α
wherein T_out is the drive shaft torque, R is the wheel radius, F_a is the inertial resistance of the vehicle, F_r is the aerodynamic resistance, F_g is the grade (ramp) resistance, F_f is the rolling resistance, m is the vehicle mass, v is the vehicle speed, a is the vehicle acceleration, ρ is the air density, A is the frontal area of the vehicle, C_D is the air drag coefficient, α is the road gradient, and μ_r is the rolling resistance coefficient;
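As an illustration of the force balance above, a minimal sketch follows; the default parameter values are assumptions for the example, not figures from the embodiment:

```python
import math

def drive_shaft_torque(v, a, alpha, m,
                       rho=1.2, A=2.2, C_D=0.3, mu_r=0.015, R=0.3, g=9.81):
    """Drive-shaft torque demand from the longitudinal force balance.

    A sketch of the vehicle dynamics model above; all default parameter
    values are illustrative assumptions.
    """
    F_a = m * a                                   # inertial resistance
    F_r = 0.5 * rho * A * C_D * v ** 2            # aerodynamic resistance
    F_g = m * g * math.sin(alpha)                 # grade (ramp) resistance
    F_f = mu_r * m * g * math.cos(alpha)          # rolling resistance
    return R * (F_a + F_r + F_g + F_f)            # T_out

# Example: 1.5 t vehicle at 15 m/s, mild acceleration, 2 % grade
print(drive_shaft_torque(v=15.0, a=0.5, alpha=math.atan(0.02), m=1500.0))
```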
The planetary gear transmission model is as follows:
(1 + β)·n_e = n_m + β·n_out,  T_m = T_e / (1 + β)
wherein n_m is the motor speed; β is the planetary gear parameter; n_out is the drive shaft speed; n_e is the engine speed; T_m is the motor torque and T_e is the engine torque;
The battery model is:
P_batt(t) = V_oc·I_b(t) − I_b(t)²·r_int,  I_b(t) = [V_oc − √(V_oc² − 4·r_int·P_batt(t))] / (2·r_int)
wherein P_batt(t) is the battery power, V_oc is the battery voltage, I_b(t) is the battery current, r_int is the internal battery resistance, and P_m(t) is the motor power, whose magnitude satisfies
P_batt(t) = P_m(t)/η_m when discharging and P_batt(t) = P_m(t)·η_m when charging, with SOC dynamics d(SOC)/dt = −I_b(t)/Q_max
wherein η_m is the motor efficiency, SOC is the battery state of charge, and Q_max is the maximum battery capacity;
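A one-step SOC update under the battery and motor-power relations above can be sketched as follows; all numeric parameter values (V_oc, r_int, η_m, Q_max) are illustrative assumptions rather than values given in the patent:

```python
import math

def battery_step(P_m, soc, dt, eta_m=0.92, V_oc=350.0, r_int=0.1,
                 Q_max=6.5 * 3600):
    """One SOC update step for the internal-resistance battery model.

    Motor power is mapped to battery power through the motor efficiency,
    the current follows from P_batt = V_oc*I_b - I_b^2*r_int, and the SOC
    integrates -I_b/Q_max. Parameter defaults are illustrative assumptions.
    """
    # Discharging draws more than the motor delivers; charging stores less.
    P_batt = P_m / eta_m if P_m >= 0 else P_m * eta_m
    I_b = (V_oc - math.sqrt(V_oc ** 2 - 4.0 * r_int * P_batt)) / (2.0 * r_int)
    return soc - I_b * dt / Q_max

print(battery_step(P_m=20e3, soc=0.7, dt=1.0))  # slightly below 0.7
```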
The parameters influencing energy management comprise the road conditions of the hybrid electric vehicle under different working conditions, namely the road gradient, and the onboard mass changes caused by passenger or cargo changes;
The observed quantities influencing energy management comprise the vehicle speed, vehicle acceleration, engine speed, engine torque, motor speed, motor torque, battery state of charge, fuel consumption at the current moment, the difference between the SOC and the reference SOC, the vehicle displacement, and the measurable disturbances, namely the road gradient and the onboard mass variation;
As shown in FIG. 1, the offline training and online real-time application process of the series-parallel hybrid power system energy management method based on the DQN variant comprises the following steps:
Under the condition that the working-condition information is known in advance, the optimal energy management experience is obtained by solving with the DP algorithm and is stored in the OEB. Then, under real-time working conditions, a Dueling DQN in deep reinforcement learning is trained. In each training step, the SOC at the current moment, the difference between the SOC and the reference SOC, the vehicle speed, the vehicle displacement, the vehicle acceleration, the fuel consumption, the road gradient, and the onboard mass variation are used as the observation input data of the Dueling DQN agent, and the reward value at the current moment is used as its reward input data. The experience obtained at each step is stored in the PEB; PER is then used to sample from the PEB while samples are drawn at random from the OEB, and the two sets of experience are combined, with the proportion of PEB experience decreasing continuously over time. The neural network of the HER-based Dueling DQN is thereby trained to obtain a converged agent, and the equivalent fuel consumption of the experimental vehicle under different working conditions and the parameters influencing energy management are obtained. The observed quantities are input into the deep reinforcement learning agent, and the output is the control quantity of the series-parallel hybrid electric vehicle, namely the engine torque demand and engine speed at the moment following the current moment, where the current moment is the moment of the current observation;
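The hybrid experience replay sampling described above can be sketched as follows; the linear decay schedule and the sample_prioritized helper are assumptions for illustration, since the text only states that the PEB share decreases over time:

```python
import random

def sample_hybrid_batch(peb, oeb, batch_size, step, total_steps,
                        peb_start_ratio=0.8, peb_end_ratio=0.2):
    """Hybrid experience replay (HER) batch assembly.

    Mixes PER samples from the online buffer (PEB) with uniform random
    samples from the DP-optimal buffer (OEB). The linear decay of the PEB
    share and the sample_prioritized() helper are illustrative assumptions.
    """
    frac = min(step / float(total_steps), 1.0)
    peb_ratio = peb_start_ratio + (peb_end_ratio - peb_start_ratio) * frac
    n_peb = int(round(batch_size * peb_ratio))
    # Prioritized draw from the online experience (hypothetical buffer API)
    batch = list(peb.sample_prioritized(n_peb))
    # Uniform random draw of DP-optimal experience from the OEB
    n_oeb = batch_size - n_peb
    batch += random.sample(oeb, min(n_oeb, len(oeb)))
    return batch
```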
The parameters of the experimental vehicle that influence energy management under different working conditions are obtained as follows: a plurality of samples are obtained, each sample including parameters collected on the experimental vehicle at different moments that may affect energy management;
After the equivalent fuel consumption of the experimental vehicle under different working conditions and the parameters influencing energy management have been obtained, the equivalent fuel consumption of the experimental vehicle under the fixed-route working condition and the parameters influencing energy management are collected at a preset sampling frequency, and the collected data are smoothed and normalized.
The method can be applied to series-parallel hybrid electric vehicles under different working conditions. For example, when the method is used for vehicle energy management, the vehicle can adjust its online driving behavior according to the pre-trained agent, thereby reducing equivalent fuel consumption and achieving more precise control. As shown in FIG. 2, the specific application process of the method is as follows:
Step 201: obtain the environmental parameters of the experimental vehicle that influence energy management under different working conditions.
A working condition represents the change of the experimental vehicle's driving speed over time; for example, the 1024 s trip of a vehicle from a starting station to a terminal station, together with its speed profile, can be regarded as one working condition. In this embodiment, the NEDC is used as the experimental training condition, and data from at least three cycles of the experimental vehicle under this condition are collected to ensure the reliability of the training data.
The environmental parameters affecting energy management may be at least one of the road conditions of the hybrid vehicle under different working conditions, i.e. the road gradient, and the onboard mass changes due to passenger and cargo changes.
In implementation, at least one environmental parameter that may influence energy management is obtained for the experimental vehicle under each working condition, and the environmental parameters influencing energy management are selected from these. The equivalent fuel consumption of the experimental vehicle under each working condition is obtained by reading the fuel consumption and battery-charge sensors installed on the vehicle.
Optionally, the method includes acquiring the equivalent fuel consumption of the experimental vehicle and the environmental parameters and observed quantities influencing energy management under the various working conditions at a preset sampling frequency, and smoothing and normalizing the acquired parameters.
The higher the sampling frequency, the smaller the interval between sampling points, the more data are obtained, and the stronger the correlation between the data, so the output of the finally trained agent model is more accurate. A technician can preset a sampling time interval and sample the running experimental vehicle accordingly; for example, the sampling frequency may be set to 1 Hz.
Smoothing the collected data suppresses inaccurate measurements.
Normalizing the acquired parameters makes the values of different parameters comparable, improves the accuracy of the agent network model, and greatly helps in setting the reward value.
Specifically, since multiple groups of parameter items have been obtained in the above process, the normalization in this embodiment is computed by determining the maximum and minimum parameter values in each group of parameter items and applying the following formula:
x_norm = (x − x_min) / (x_max − x_min)
wherein x_norm is the normalized value of a parameter in each group of parameter items, x_min is the minimum parameter value in the group, and x_max is the maximum parameter value in the group.
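A direct implementation of this normalization formula might look like the following sketch (the guard against a constant parameter group is an added assumption):

```python
def min_max_normalize(values):
    """Min-max normalization of one parameter group to [0, 1].

    Direct implementation of the formula above; returning zeros for a
    constant group avoids division by zero (an added assumption).
    """
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]

print(min_max_normalize([10.0, 25.0, 40.0]))  # [0.0, 0.5, 1.0]
```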
Step 202: based on the environmental parameters and observed values influencing energy management, train the agent model using deep reinforcement learning to obtain a trained, converged agent.
The deep reinforcement learning agent is built from a combination of neural network models and can determine the action for the next moment from the observation data at the current moment. Observation data are fed into the input layer of the agent model, and Q-value estimates are produced at its output layer.
The neural network computes the Q function to obtain the Q value Q(s, a | θ^Q): the input is the state s and the action a, and the output is the Q function value Q(s, a | θ^Q). The network is divided into an Online evaluation network and a Target evaluation network of identical structure. The parameters θ^Q of the Online evaluation network are initialized randomly, and the parameters θ^Q′ of the Target evaluation network are initialized from them; at the same time, a buffer space (the PEB) is opened up as storage for experience replay, alongside the OEB that already holds the DP-optimal experience.
After initialization is completed, the iterative solution begins. Action exploration uses the ε-greedy algorithm: action a_t is executed in the current state, the corresponding reward and next state are obtained, and the transition is formed into the tuple (s_t, a_t, r_t, s_{t+1}) and stored into the PEB. A small batch of data is then selected from the PEB using the PER technique and combined with a small batch randomly selected from the OEB to perform HER; the combined batch serves as the training data of the Online evaluation network, which is then updated.
The Loss function of the Online evaluation network is defined as
L = [ r + γ·max_a′ Q′(s′, a′ | θ^Q′) − Q(s, a | θ^Q) ]²
and the Online evaluation network is updated by minimizing this Loss function. The updated Online network parameters θ^Q are then used to update the Target network parameters θ^Q′ softly:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
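One Online-network update with a soft Target-network update can be sketched in PyTorch as follows; the hyperparameter values, the tensor batch layout, and the done-flag handling are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, optimizer, batch, gamma=0.99, tau=0.01):
    """One update of the Online evaluation network plus a soft Target update.

    `batch` is assumed to be a tuple of tensors (s, a, r, s_next, done);
    gamma and tau are illustrative hyperparameters, not patent values.
    """
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # Standard DQN target: r + gamma * max_a' Q'(s', a' | theta^Q')
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, target)          # the Loss function defined above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Soft update of the Target network toward the Online network
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p_o.data)
    return loss.item()
```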
In the Dueling DQN neural network framework of the deep reinforcement learning agent shown in FIG. 3, the Q value of a standard DQN represents the value of a state-action pair. In states where the Q value hardly depends on the action, this makes the evaluation of an action's value inaccurate and lacking in robustness and stability.
The Dueling DQN divides the abstract features extracted by the convolutional layer into two streams in the fully connected layer. One stream represents the state-value function V(s), which the original DQN represented only jointly with the action; this makes the state-value estimate independent of the action and of environmental noise and gives each state a relative value with respect to the other, unselected actions. The other stream represents the state-action advantage function A(s, a) in a given state. Finally, a special aggregation layer combines the two streams to produce the estimate of the state-action value function. The advantage is that learning generalizes across actions without any modification to the underlying RL algorithm. In Dueling DQN, the Q-value function is constructed as follows:
Q(s,a;θ,α,β)=V(s;θ,α)+A(s,a;θ,β) (7)
where α and β are the parameters of the two streams in the fully connected layer, and θ is the parameter of the convolutional layer. However, for a given Q, the values of V and A are not unique; in other words, different combinations of V and A can produce the same Q value, which makes the algorithm less stable. The mean of the advantage function is therefore used to improve the stability of the proposed algorithm:
Q(s, a; θ, α, β) = V(s; θ, α) + ( A(s, a; θ, β) − (1/|A|)·Σ_{a′} A(s, a′; θ, β) )   (8)
only more layers are needed as compared with standard DQN training, but when there are many behaviors of similar value, dulling DQN can better perform strategy evaluation and improve stability and robustness.
In the actual process, the number of neurons in the hidden layer of the agent model is set to 40. To evaluate the effect of deep-reinforcement-learning energy management accurately, the control effect of the deep reinforcement learning can be assessed through the equivalent fuel consumption ratio R.
The equivalent fuel consumption ratio reflects the comparison between the actual control effect and the DP reference, with good results as the R value approaches 0. The ratio R is calculated by the following formula:
R = (S_RL − S_DP) / S_DP
wherein R is the ratio between the actual data and the DP reference data, S_RL is the equivalent fuel consumption obtained by deep reinforcement learning training, and S_DP is the equivalent fuel consumption reference obtained under the DP benchmark.
It should be noted that the control performance of the trained agent model is evaluated by calculating the ratio and the root mean square between the reference data and the actual data; for example, a ratio close to 1 together with a root mean square close to 0 indicates that the agent model trained by deep reinforcement learning has good control performance.
Optionally, the training condition is the NEDC, and the detection conditions are the WLTP, FTP75, UDDS, or JN1015 cycles. To make the control data of the deep reinforcement learning algorithm more reliable, the trained agent models can be tested separately under these different working conditions; the R value is calculated during the training process under each working condition and used as an index to compare the control performance of each back-propagation training method, thereby checking the effectiveness and robustness of the deep reinforcement learning. The results are shown in Table 1:
TABLE 1
[Table 1: equivalent fuel consumption ratio R of the trained agent under the NEDC, WLTP, FTP75, UDDS and JN1015 cycles; the table is rendered as an image in the original and its values are not recoverable]
As can be seen from Table 1, the ratio of the reference data to the actual data obtained during the deep reinforcement learning training process is close to 90%, which demonstrates the effectiveness of real-time application.
Step 203: acquire the environmental parameters and observed quantities influencing energy management during the actual driving of the vehicle, and control the vehicle's energy management based on these quantities and the trained agent model.
In the above steps, since the environmental parameter items influencing energy management and the trained agent model have been obtained, the parameters and observed quantities influencing energy can be input into the trained agent model in real time to control the driving and energy management of the vehicle.
Specifically, to control the vehicle's action at the estimated moment, the environmental parameters and observed quantities affecting energy management must first be obtained, such as at least one of the road gradient and the onboard mass change of the hybrid vehicle under different working conditions, together with the vehicle speed, vehicle acceleration, engine speed, engine torque, motor speed, motor torque, battery state of charge, current fuel consumption, difference between the SOC and the reference SOC, and vehicle displacement. These parameters are input into the trained agent model, which outputs the vehicle's control action at the estimated moment, namely the engine torque and engine speed at the next moment; the estimated moment is the moment of the sampling point immediately following the one corresponding to the current moment.
The above description is only a preferred embodiment of the present invention, in which the purpose, technical solution, and advantages of the invention are described in further detail; it is not intended to limit the invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A method for energy management of a series-parallel hybrid power system based on a DQN variant, characterized by comprising the following steps: establishing a model of a passive series-parallel vehicle; with the working-condition information known in advance, solving with the DP algorithm to obtain the optimal energy management experience and storing it in an OEB; then, under real-time working conditions, training a Dueling DQN in deep reinforcement learning, wherein in each training step the SOC at the current moment, the difference between the SOC and the reference SOC, the vehicle speed, the vehicle displacement, the vehicle acceleration, the fuel consumption, the road gradient and the onboard mass variation are used as the observation input data of the Dueling DQN agent, and the reward value at the current moment is used as the reward input data of the Dueling DQN agent; storing the experience obtained at this time in a PEB, then sampling from the PEB using PER and sampling randomly from the OEB, and combining the two sets of experience, wherein the proportion of experience from the PEB decreases continuously as time progresses, so that the neural network of the HER-based Dueling DQN is trained to obtain a converged agent, and the equivalent fuel consumption of the experimental vehicle under different working conditions and the parameters influencing energy management are obtained; and inputting the observed quantities into the deep reinforcement learning agent to perform energy management of the hybrid electric vehicle under different working conditions, the output being the control quantities of the series-parallel hybrid electric vehicle at the moment following the current moment, namely the engine torque demand and engine speed.

2. The method for energy management of a series-parallel hybrid power system based on a DQN variant according to claim 1, characterized in that the model of the passive series-parallel vehicle comprises a vehicle dynamics model, a planetary gear transmission, a motor and a battery.

3. The method for energy management of a series-parallel hybrid power system based on a DQN variant according to claim 2, characterized in that the vehicle dynamics model is as follows:
T_out = R·(F_a + F_r + F_g + F_f),  F_a = m·a,  F_r = (1/2)·ρ·A·C_D·v²,  F_g = m·g·sin α,  F_f = μ_r·m·g·cos α
wherein T_out is the drive shaft torque, R is the wheel radius, F_a is the inertial resistance of the vehicle, F_r is the aerodynamic resistance, F_g is the grade resistance, F_f is the rolling resistance, m is the vehicle mass, v is the vehicle speed, a is the vehicle acceleration, ρ is the air density, A is the frontal area of the vehicle, C_D is the air drag coefficient, α is the road gradient, and μ_r is the rolling resistance coefficient.

4. The method for energy management of a series-parallel hybrid power system based on a DQN variant according to claim 2, characterized in that the planetary gear transmission model is as follows:
(1 + β)·n_e = n_m + β·n_out,  T_m = T_e / (1 + β)
wherein n_m is the motor speed; β is the planetary gear parameter; n_out is the drive shaft speed; n_e is the engine speed; T_m is the motor torque and T_e is the engine torque.

5. The method for energy management of a series-parallel hybrid power system based on a DQN variant according to claim 2, characterized in that the battery model is:
P_batt(t) = V_oc·I_b(t) − I_b(t)²·r_int,  I_b(t) = [V_oc − √(V_oc² − 4·r_int·P_batt(t))] / (2·r_int)
wherein P_batt(t) is the battery power, V_oc is the battery voltage, I_b(t) is the battery current, r_int is the internal battery resistance, and P_m(t) is the motor power, whose magnitude satisfies
P_batt(t) = P_m(t)/η_m when discharging and P_batt(t) = P_m(t)·η_m when charging, with d(SOC)/dt = −I_b(t)/Q_max
wherein η_m is the motor efficiency, SOC is the battery state of charge, and Q_max is the maximum battery capacity.

6. The method for energy management of a series-parallel hybrid power system based on a DQN variant according to claim 1, characterized in that the parameters influencing energy management comprise the road conditions of the hybrid vehicle under different working conditions, namely the road gradient, and the onboard mass changes caused by passenger or cargo changes; and the observed quantities influencing energy management comprise the vehicle speed, vehicle acceleration, engine speed, engine torque, motor speed, motor torque, battery state of charge, fuel consumption at the current moment, the difference between the SOC and the reference SOC, the vehicle displacement, and the measurable disturbances, namely the road gradient and the onboard mass variation.

7. The method for energy management of a series-parallel hybrid power system based on a DQN variant according to claim 1, characterized in that the objective function of energy management is:
J = Σ_t [ ṁ_fuel(t) + γ·(SOC(t) − SOC_ref(t))² ]
wherein γ is a positive weighting factor, indicating a balanced equivalence between fuel consumption and battery power consumption; SOC_ref represents the SOC reference value, and ṁ_fuel is the fuel consumption at each sampling time.

8. The method for energy management of a series-parallel hybrid power system based on a DQN variant according to claim 1, characterized in that obtaining the parameters of the experimental vehicle that influence energy management under different working conditions comprises: obtaining a plurality of samples, each sample including parameters collected on the experimental vehicle at different moments that may affect energy management.

9. The method for energy management of a series-parallel hybrid power system based on a DQN variant according to claim 1, characterized in that after the equivalent fuel consumption of the experimental vehicle under different working conditions and the parameters influencing energy management are obtained, the equivalent fuel consumption of the experimental vehicle under a fixed-route working condition and the parameters influencing energy management are collected at a preset sampling frequency, and the collected data are smoothed and normalized.
CN202010845021.9A 2020-08-20 2020-08-20 Energy management method for a series-parallel hybrid power system based on a DQN variant Active CN112026744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010845021.9A CN112026744B (en) 2020-08-20 2020-08-20 Energy management method for a series-parallel hybrid power system based on a DQN variant

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010845021.9A CN112026744B (en) 2020-08-20 2020-08-20 Energy management method for a series-parallel hybrid power system based on a DQN variant

Publications (2)

Publication Number Publication Date
CN112026744A CN112026744A (en) 2020-12-04
CN112026744B true CN112026744B (en) 2022-01-04

Family

ID=73581036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010845021.9A Active CN112026744B (en) 2020-08-20 2020-08-20 Energy management method for a series-parallel hybrid power system based on a DQN variant

Country Status (1)

Country Link
CN (1) CN112026744B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112498334B (en) * 2020-12-15 2022-03-11 清华大学 Robust energy management method and system for intelligent networked hybrid vehicle
CN113492827A (en) * 2021-06-23 2021-10-12 东风柳州汽车有限公司 Energy management method and device for hybrid electric vehicle
CN113997926A (en) * 2021-11-30 2022-02-01 江苏浩峰汽车附件有限公司 Parallel hybrid electric vehicle energy management method based on layered reinforcement learning
CN115284973B (en) * 2022-09-05 2024-07-19 湖南大学 Fuel cell automobile energy management method based on improved multi-objective Double DQN
CN115476841B (en) * 2022-10-10 2025-01-07 湖南大学重庆研究院 A plug-in hybrid vehicle energy management method based on improved multi-objective DDPG

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1252036B1 (en) * 2000-01-31 2006-03-15 Azure Dynamics Inc. Method and apparatus for adaptive hybrid vehicle control
CN102717797A (en) * 2012-06-14 2012-10-10 北京理工大学 Energy management method and system of hybrid vehicle
CN110682905A (en) * 2019-10-12 2020-01-14 重庆大学 A method for obtaining the reference variation of battery state of charge in time domain based on mileage
CN111267831A (en) * 2020-02-28 2020-06-12 南京航空航天大学 An intelligent variable time domain model prediction energy management method for hybrid electric vehicles
CN111267830A (en) * 2020-02-10 2020-06-12 南京航空航天大学 Hybrid power bus energy management method, device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8543272B2 (en) * 2010-08-05 2013-09-24 Ford Global Technologies, Llc Distance oriented energy management strategy for a hybrid electric vehicle

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1252036B1 (en) * 2000-01-31 2006-03-15 Azure Dynamics Inc. Method and apparatus for adaptive hybrid vehicle control
CN102717797A (en) * 2012-06-14 2012-10-10 北京理工大学 Energy management method and system of hybrid vehicle
CN110682905A (en) * 2019-10-12 2020-01-14 重庆大学 A method for obtaining the reference variation of battery state of charge in time domain based on mileage
CN111267830A (en) * 2020-02-10 2020-06-12 南京航空航天大学 Hybrid power bus energy management method, device and storage medium
CN111267831A (en) * 2020-02-28 2020-06-12 南京航空航天大学 An intelligent variable time domain model prediction energy management method for hybrid electric vehicles

Also Published As

Publication number Publication date
CN112026744A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN112026744B (en) Energy management method for a series-parallel hybrid power system based on a DQN variant
WO2021114742A1 (en) Comprehensive energy prediction and management method for hybrid electric vehicle
CN112249002B (en) A heuristic series-parallel hybrid energy management method based on TD3
CN110696815B (en) A predictive energy management method for connected hybrid electric vehicles
Liessner et al. Deep reinforcement learning for advanced energy management of hybrid electric vehicles.
CN111267830B (en) A hybrid electric bus energy management method, device and storage medium
CN107688343B (en) Energy control method for a hybrid electric vehicle
CN112668799A (en) Intelligent energy management method and storage medium for PHEV (Power electric vehicle) based on big driving data
CN108647836B (en) Driver energy-saving evaluation method and system
CN107516107A (en) A method for classification and prediction of driving conditions of hybrid electric vehicles
CN112084700B (en) A hybrid power system energy management method based on A3C algorithm
CN110962837A (en) Plug-in hybrid electric vehicle energy management method considering driving style
CN110682905B (en) Method for acquiring battery charge state reference variable quantity in time domain based on driving mileage
CN101519073A (en) Method for forecasting running load of hybrid electric vehicle
CN115805840B (en) Energy consumption control method and system for range-extending motor loader
CN112406875A (en) Vehicle energy consumption analysis method and device
CN105527110B (en) The appraisal procedure and device of automobile fuel ecomomy
CN113554337A (en) Construction method of energy management strategy for plug-in hybrid vehicle integrating traffic information
CN116946107A (en) Hybrid system mode decision and power distribution method under energy track following
Chen et al. On the relationship between energy consumption and driving behavior of electric vehicles based on statistical features
CN111198501A (en) Method for determining fuel equivalent factor by RBF neural network
CN114154729A (en) A hybrid electric vehicle composite energy storage system energy management system and method
Fechert et al. Using deep reinforcement learning for hybrid electric vehicle energy management under consideration of dynamic emission models
CN114435378A (en) Pure electric vehicle whole vehicle mass estimation method based on neural network
CN115257694A (en) An energy management control method for hybrid electric vehicles based on stochastic dynamic programming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant