Disclosure of Invention
To address the problem that edge-computing-based energy trading in the prior art cannot balance user privacy protection against utility, the invention provides an energy internet transaction method and system based on a reinforcement-learning-enabled blockchain. To solve this technical problem, the invention adopts the following technical scheme:
An energy internet transaction method based on a reinforcement-learning-enabled blockchain comprises the following steps:
S1, constructing a three-stage game model among an operator, a retailer and prosumers based on the energy trading relationship of the operator, the retailer and the prosumers on a blockchain transaction platform;
S2, solving the game equilibrium point of the three-stage game model by using a hierarchical distributed policy gradient algorithm, wherein the game equilibrium point comprises an optimal unit service price, an optimal unit energy price and an optimal energy demand;
S3, the operator, the retailer and the prosumers conducting energy transactions according to the game equilibrium point obtained in step S2.
The step S2 includes the steps of:
S2.1, setting network parameters of the three-stage game model;
S2.2, initializing weight parameters of the three-stage game model;
S2.3, acquiring the state of the operator, the state of the retailer and the state of each prosumer, respectively; using a Markov decision process, the operator takes the operator utility U_o(η) as its reward function and selects an appropriate unit service price η to maximize U_o(η), the retailer takes the retailer utility U_r(p) as its reward function and selects an appropriate unit energy price p to maximize U_r(p), and each edge server in the blockchain transaction platform takes its own prosumer utility as its reward function and selects an appropriate energy demand q_j to maximize that prosumer utility, in this order.
In step S2.3, the state of the operator is expressed as:
where p_{t-1} represents the unit energy price at step t-1, and q_j^{t-1} represents the energy demand submitted by the prosumer to the local retailer through edge server j at step t-1;
the maximization of the operator utility U_o(η) is expressed as:
where U_m represents the extra reward the operator obtains for the trusted blockchain service it provides in each energy transaction, φ represents the transmission loss rate, C_t represents the unit transmission cost, C_o represents the fixed operation and maintenance cost, η_min represents the lowest unit service price, η_max represents the highest unit service price, q_j represents the energy demand submitted by the prosumer to the local retailer through edge server j, and N represents the set of all edge servers in the blockchain transaction platform.
The extra reward U_m is calculated as:
U_m = (R_f + r·s)·λ;
where R_f represents the fixed block reward, r represents the blockchain service fee paid to the operator by the prosumer for each energy transaction, s represents a block parameter, and λ represents a probability factor in the blockchain.
The state of the retailer is expressed as:
where η_t represents the unit service price at step t, and q_j^{t-1} represents the energy demand submitted by the prosumer to the local retailer through edge server j at step t-1;
the maximization of the retailer utility U_r(p) is expressed as:
where C_g represents the production cost the retailer bears to produce energy, C_s represents the storage cost the retailer bears to store energy, p_min represents the lowest unit energy price, p_max represents the highest unit energy price, q_j represents the energy demand submitted by the prosumer to the local retailer through edge server j, and N represents the set of all edge servers in the blockchain transaction platform.
The production cost C_g is calculated as:
where a, b and k are weighting factors of the power generation cost incurred by the retailer during production, and φ represents the transmission loss rate.
The storage cost C_s is calculated as:
where c_s represents the unit cost of the retailer's energy storage, ζ_c represents the charging efficiency of the energy storage device, and ζ_d represents the discharging efficiency of the energy storage device.
The state of the prosumer is expressed as:
the maximization of the prosumer utility is expressed as:
where δ represents a conversion factor, w_j represents the energy usage scenario of edge server j, q_min represents the minimum energy demand, q_max represents the maximum energy demand, q_j represents the energy demand submitted by the prosumer to the local retailer through edge server j, and r represents the blockchain service fee paid to the operator by the prosumer for each energy transaction.
An energy internet transaction system based on a reinforcement-learning-enabled blockchain comprises an energy application layer, an energy data layer and an edge control layer. The energy application layer interacts with the edge control layer through a smart contract interface, and the energy data layer interacts with the edge control layer. The energy application layer comprises retailers and prosumers, which interact through a blockchain transaction platform. The edge control layer comprises edge servers and distributed SDN controllers maintained by the operator, and each edge server serves as a node in the blockchain transaction platform. The energy data layer comprises switches and energy routers; a switch is connected with the distributed SDN controllers, receives the scheduling instructions they send and forwards those instructions to the corresponding energy routers, and an energy router senses the state of the energy line and reports it to the edge servers.
The smart contract interface is built on a smart contract system comprising a user registration module, an energy transaction module, an energy transmission module, an energy recording module and an information query module. After the operator, the prosumer and the retailer have each registered an account through the user registration module, the prosumer places orders through the energy transaction module according to its own needs, and energy is transmitted from the retailer to the prosumer through the energy transmission module. The energy recording module records the electric quantity information of the retailer and the prosumer, and the information query module allows each party to query its own account information.
The invention has the beneficial effects that:
The hierarchical design of the actions and learning processes of the multiple agents helps each agent learn its own policy in response to competing policies, which improves the overall performance of the transaction system. Compared with popular deep reinforcement learning algorithms, the method helps operators and retailers achieve higher utility, while prosumers also achieve better utility. Under the unified pricing mechanism, the convergence order of the different entities is consistent with the action order of the three stages of the Stackelberg game, so that the leader in the game is more likely to obtain better benefits than the followers.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
Embodiment 1: an energy internet transaction method based on a reinforcement-learning-enabled blockchain, comprising the following steps:
S1, constructing a three-stage game model among an operator, a retailer and prosumers based on the energy trading relationship of the operator, the retailer and the prosumers on a blockchain transaction platform;
As shown in fig. 2 and 3, the three-stage game model includes three policy networks corresponding to the operator, the retailer and the prosumers, respectively. The prosumers submit energy demands to the operator through the blockchain transaction platform, and the operator, acting as an intermediary, assists in completing energy transactions and transmissions between the retailer and the prosumers through the blockchain transaction platform. The blockchain transaction platform is composed of a plurality of edge servers; each edge server serves as a node in the blockchain transaction platform and undertakes accounting, broadcasting, verification and consensus functions, and the set of edge servers is expressed as N = {1, 2, ..., j, ..., N'}.
S2, solving the game equilibrium point of the three-stage game model by using a hierarchical distributed policy gradient algorithm (Hierarchical Distributed Policy Gradient, HDPG);
Considering a Stackelberg game under incomplete information, the game equilibrium point of the three-stage game model is solved by using a Markov decision process, where one Markov decision process corresponds to the whole process of one energy transaction and transmission. The game equilibrium point comprises an optimal unit service price, an optimal unit energy price and an optimal energy demand, at which the operator utility U_o(η), the retailer utility U_r(p) and the prosumer utility are maximized so that the overall benefit is maximized. Here η represents the unit service price, p represents the unit energy price, q_j represents the energy demand submitted by the prosumer to the local retailer through edge server j, and j ∈ N.
The step S2 includes the steps of:
S2.1, setting network parameters of three strategy networks;
The network parameters include the learning rate α_r of the retailer policy network, the learning rate α_p of the prosumer policy network, the learning rate α_o of the operator policy network, and the discount factor γ.
S2.2, randomly initializing the weight parameters of the three policy networks;
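Steps S2.1 and S2.2 can be pictured with the following minimal sketch, which is an illustration under stated assumptions rather than the implementation of the invention. The hidden width, the numeric learning rates and discount factor, the Gaussian initialization and the helper name `init_policy_network` are all hypothetical. The state sizes assume that the operator and retailer states each hold one price plus one demand per edge server, and that a prosumer state holds two prices.

```python
import random

def init_policy_network(state_dim, hidden_dim, rng):
    """Randomly initialize a policy network with two fully connected layers."""
    def layer(n_in, n_out):
        return {
            "W": [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)],
            "b": [0.0] * n_out,
        }
    # Layer 1 maps the state to hidden units, layer 2 maps them to one action value.
    return [layer(state_dim, hidden_dim), layer(hidden_dim, 1)]

NUM_EDGE_SERVERS = 10  # matches the 10 edge servers used in the convergence experiment

# S2.1: network parameters (numeric values are hypothetical)
hyperparams = {"alpha_o": 1e-3, "alpha_r": 1e-3, "alpha_p": 1e-3, "gamma": 0.95}

# S2.2: random initialization of the three policy networks
rng = random.Random(0)
theta_o = init_policy_network(1 + NUM_EDGE_SERVERS, 16, rng)  # operator
theta_r = init_policy_network(1 + NUM_EDGE_SERVERS, 16, rng)  # retailer
theta_p = [init_policy_network(2, 16, rng) for _ in range(NUM_EDGE_SERVERS)]  # one per edge server
```

Keeping one set of prosumer weights per edge server reflects the distributed character of the prosumer stage; sharing a single set of weights would be an alternative design.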
S2.3, acquiring the state of the operator, the state of the retailer and the state of each prosumer, respectively; in turn, the operator takes the operator utility U_o(η) as its reward function and selects an appropriate unit service price η to maximize U_o(η), the retailer takes the retailer utility U_r(p) as its reward function and selects an appropriate unit energy price p to maximize U_r(p), and each edge server takes its own prosumer utility as its reward function and selects an appropriate energy demand to maximize that prosumer utility, where η_t ∈ A_o and A_o represents the action space of the operator, p_t ∈ A_r and A_r represents the action space of the retailer, and the energy demand selected by each prosumer belongs to A_p, which represents the action space of the prosumer;
The state of the operator is expressed as:
where p_{t-1} represents the unit energy price at step t-1, and q_j^{t-1} represents the energy demand submitted by the prosumer to the local retailer through edge server j at step t-1.
The maximization of the operator utility U_o(η) is expressed as:
where U_m represents the extra reward the operator obtains for the trusted blockchain service it provides in each energy transaction, φ represents the transmission loss rate, C_t represents the unit transmission cost, C_o represents the fixed operation and maintenance cost, η_min represents the lowest unit service price, and η_max represents the highest unit service price.
The extra reward U_m is calculated as:
U_m = (R_f + r·s)·λ;
where R_f represents the fixed block reward, r represents the blockchain service fee paid to the operator by the prosumer for each energy transaction, s represents a block parameter, and λ represents a probability factor in the blockchain. The extra reward U_m further motivates the operator to maintain the blockchain.
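As a small worked example of the formula U_m = (R_f + r·s)·λ, the sketch below evaluates it for one transaction. The function name and the numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def blockchain_bonus(fixed_block_reward, service_fee, block_param, prob_factor):
    """Extra reward U_m = (R_f + r * s) * lambda for one energy transaction."""
    return (fixed_block_reward + service_fee * block_param) * prob_factor

# Hypothetical values: R_f = 10, r = 0.5, s = 4, lambda = 0.25
u_m = blockchain_bonus(fixed_block_reward=10.0, service_fee=0.5,
                       block_param=4.0, prob_factor=0.25)
print(u_m)  # (10 + 0.5 * 4) * 0.25 = 3.0
```

Because r enters the bonus linearly, a higher service fee paid for each transaction raises the operator's incentive to keep maintaining the blockchain.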
The state of the retailer is expressed as:
The retailer utility U_r(p) is determined by the unit energy price p and the total energy demand, and the maximization of the retailer utility U_r(p) is expressed as:
where C_g represents the production cost the retailer bears to produce energy, C_s represents the storage cost the retailer bears to store energy, p_min represents the lowest unit energy price, and p_max represents the highest unit energy price.
The production cost C_g is calculated as:
where a, b and k are weighting factors of the power generation cost.
The storage cost C_s is calculated as:
where c_s represents the unit cost of the retailer's energy storage, ζ_c represents the charging efficiency of the energy storage device, ζ_d represents the discharging efficiency of the energy storage device, and the total quantity appearing in the formula represents the energy that the retailer actually needs to produce and store once the energy lost during transmission is taken into account.
The state of the prosumer is expressed as:
the maximization of the prosumer utility is expressed as:
where δ represents a conversion factor, w_j represents the energy usage scenario of edge server j, q_min represents the minimum energy demand, q_max represents the maximum energy demand, p·q_j represents the energy payment made by the prosumer to the retailer, and the benefit term represents the benefit that edge server j derives in actual production from the energy it purchases.
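The bounds η_min, η_max, p_min, p_max, q_min and q_max define the action spaces from which the three kinds of agents choose. One common way to respect such bounds, shown here only as a sketch, is to sample from a Gaussian policy and clip the sample into the allowed interval. The Gaussian form, the helper name `select_action` and every numeric value are assumptions for illustration and are not taken from the source.

```python
import random

def select_action(mean, std, low, high, rng):
    """Sample from a Gaussian policy and clip the sample into [low, high]."""
    return min(high, max(low, rng.gauss(mean, std)))

rng = random.Random(0)

# Hypothetical bounds: eta in [0.1, 1.0], p in [1.0, 3.0], q_j in [0.0, 10.0]
eta = select_action(mean=0.5, std=0.1, low=0.1, high=1.0, rng=rng)   # operator acts first
p = select_action(mean=2.0, std=0.3, low=1.0, high=3.0, rng=rng)     # retailer acts second
q = [select_action(mean=5.0, std=1.0, low=0.0, high=10.0, rng=rng)   # edge servers act last
     for _ in range(3)]
```

Clipping keeps every sampled price or demand feasible; a squashing function such as tanh followed by rescaling would be an alternative.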
The specific flow of the Markov decision process is prior art and is not repeated here. The time complexities of the outer loop (the total number of iterations of the Markov decision process), the middle loop (the number of iteration steps of each policy network) and the inner loop (the number of prosumers in the prosumer policy network) are O(E), O(T) and O(N), respectively. Each policy network includes two fully connected layers, and the time complexity of the fully connected layers is denoted T_f, which depends on K_l, the number of neural units of a fully connected layer, and on L, the number of layers of a policy network. Because each policy network uses two fully connected layers to generate its policy, the overall time complexity of the algorithm is O(ETN(T_f)).
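The loop structure behind the O(E), O(T) and O(N) terms can be sketched as follows. This is an illustrative counter under assumptions and not the algorithm itself: it assumes one operator decision, one retailer decision and one decision per edge server in every step, and the function name is hypothetical.

```python
def count_policy_evaluations(episodes, steps, num_edge_servers):
    """Count policy-network forward passes across the three nested loops."""
    count = 0
    for _ in range(episodes):                  # outer loop: E runs of the decision process
        for _ in range(steps):                 # middle loop: T steps per run
            count += 1                         # operator selects the unit service price
            count += 1                         # retailer selects the unit energy price
            for _ in range(num_edge_servers):  # inner loop: N edge servers
                count += 1                     # edge server j selects its energy demand
    return count

print(count_policy_evaluations(episodes=3, steps=4, num_edge_servers=10))  # 3 * 4 * (10 + 2) = 144
```

The count grows as E·T·(N + 2), so the inner loop over edge servers dominates. Multiplying by the cost of one forward pass through the fully connected layers gives the stated overall order.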
The operator policy network updates its weight parameter θ_o according to:
where the quantities involved are the policy of the operator at step t, the reward of the operator at step t, and the state of the operator at step t, which belongs to the state space S_o of the operator; α_o represents the learning rate of the operator policy network.
The retailer policy network updates its weight parameter θ_r according to:
where the quantities involved are the policy of the retailer at step t, the reward of the retailer at step t, and the state of the retailer at step t, which belongs to the state space S_r of the retailer; α_r represents the learning rate of the retailer policy network.
The prosumer policy network updates its weight parameters according to:
where the quantities involved are the policy of the prosumer at step t, the reward of the prosumer at step t, and the state of the prosumer at step t, which belongs to the state space S_p of the prosumer; α_p represents the learning rate of the prosumer policy network.
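The update rules above combine a policy, a reward, a state and a learning rate, which is the shape of a policy gradient step. The sketch below therefore shows the generic REINFORCE-style update θ ← θ + α·R·∇_θ log π(a|s), which is swapped in here for illustration only. The linear-mean Gaussian policy, the fixed standard deviation, the one-step reward without discounting, and the function name are all assumptions; this is not presented as the exact update of the invention.

```python
def reinforce_update(theta, state, action, reward, alpha, sigma=1.0):
    """One REINFORCE-style step for a Gaussian policy with mean theta . state."""
    mean = sum(w * x for w, x in zip(theta, state))
    # For this policy, grad_theta log pi(a|s) = (a - mean) / sigma^2 * state.
    scale = alpha * reward * (action - mean) / sigma ** 2
    return [w + scale * x for w, x in zip(theta, state)]

theta = [0.0, 0.0]
theta = reinforce_update(theta, state=[1.0, 2.0], action=1.0, reward=2.0, alpha=0.1)
print(theta)
```

With a positive reward the step moves the policy mean toward the sampled action, and with a negative reward it moves the mean away. The same helper could be called once for the operator, once for the retailer and once per edge server, each with its own learning rate.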
In this embodiment, prosumers are distributed energy users who cannot produce energy themselves or whose self-produced energy cannot meet their consumption; they can purchase energy from retailers of public energy companies or BSDEI according to their energy demand and the unit energy price.
Retailers are energy users equipped with distributed power generation and energy storage devices whose total power generation exceeds their total power consumption, and they benefit from supplying energy to a variety of distributed applications. On the other hand, they must bear the costs of distributed power generation and energy storage, and pay the operator for transmission routing services.
The operator is an intermediary between the prosumers and the retailer and assists in completing the energy trading process. To provide more convenient service and lower delay, the operator deploys hardware devices such as edge servers and distributed SDN controllers on the edge control layer, thereby realizing edge-to-edge coordination among devices. In return, it charges the retailer for transmission routing services and the prosumers for trusted blockchain services.
Embodiment 2: an energy internet transaction system based on a reinforcement-learning-enabled blockchain. As shown in fig. 1, the system comprises an energy application layer, an energy data layer and an edge control layer, which together form the energy transaction service system of a distributed energy market. The three layers are mutually independent yet interrelated, so that energy routing and scheduling control in the blockchain-enabled energy internet (Blockchain-Assisted Energy Internet, BEI) are decoupled. The energy application layer interacts with the edge control layer through a smart contract interface, and the energy data layer interacts with the edge control layer through the standard OpenFlow interface.
The energy application layer comprises retailers and prosumers, which exchange information directly through the blockchain transaction platform. This direct transaction process motivates retailers and prosumers to participate more actively in the distributed energy market, and the blockchain transaction platform provides a reliable and stable third-party service platform for energy transactions in the energy application layer.
The edge control layer comprises edge servers and distributed SDN controllers maintained by the operator. The distributed SDN controllers schedule and control the energy routing of the energy data layer. Each edge server serves as a node in the blockchain transaction platform and undertakes accounting, broadcasting, verification and consensus functions; the set of edge servers is expressed as N = {1, 2, ..., j, ..., N'}. Smart contracts provide reliable and automatic process control for energy transactions in the blockchain transaction platform.
The energy data layer comprises switches and energy routers. A switch receives scheduling instructions from the distributed SDN controller of the edge control layer and forwards them to the corresponding energy router. The energy router senses the state of the energy line, which includes the electric energy value, voltage, current and the like on the line, and reports the real-time state to the edge server, thereby helping the distributed SDN controller to adjust its scheduling instructions. In addition, the energy router may receive commands from the distributed SDN controller to change its own state.
As shown in fig. 4, the smart contract interface is built on a smart contract system comprising a user registration module, an energy transaction module, an energy transmission module, an energy recording module and an information query module. After the operator, the prosumer and the retailer have each registered an account through the user registration module, the prosumer places orders through the energy transaction module according to its own requirements, and the energy transmission module makes energy flow from the retailer to the prosumer. The energy recording module records the electric quantity information of the retailer and the prosumer, and the information query module allows each party to query its own account information.
Specifically, in the user registration module, the deployer of the smart contract is the initial administrator of the transaction system and initializes some common parameters. Each participant registers an account through the user registration module by providing a user name, an account address and an account type. The information of these accounts, including the available energy and energy coins and the total energy generated and used, is then initialized. To address the cold start problem, participants may acquire energy coins through the energy transaction module. After this, the prosumer confirms the exact amount of energy required and places an order; the transaction is completed by the energy transaction module, and the entire energy transaction process proceeds according to the method described in Embodiment 1. The energy transaction module verifies the authority of the account and ensures that the energy balance is sufficient. The energy transmission module is then invoked; it settles the transaction between the retailer and the operator and also activates an energy scheduling switch, which ensures that energy flows from the retailer to the prosumer. The energy recording module updates the total generated energy and the total consumed energy according to smart meter data from the premises of the retailer and the prosumer, and costs are then accounted for according to the recorded electric quantities. The information query module provides six types of interfaces through which participants can query their own account information.
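The module flow described above (register, order, transmit, record, query) can be sketched as a small in-memory simulation. This is a toy stand-in under stated assumptions and not on-chain contract code. The class and method names are hypothetical, the six query interfaces are collapsed into a single `query` method, smart meter readings are replaced by the ordered quantity, and the settlement rule (the buyer pays the retailer p·q and the operator r, and the retailer pays the operator η·q) is an assumption based on who charges whom in Embodiment 1.

```python
class EnergyContract:
    """Toy in-memory stand-in for the smart contract system."""

    def __init__(self):
        self.accounts = {}

    def register(self, name, account_type, energy=0.0, coins=0.0):
        # User registration module: create and initialize an account.
        self.accounts[name] = {"type": account_type, "energy": energy,
                               "coins": coins, "generated": 0.0, "used": 0.0}

    def place_order(self, buyer, retailer, operator, quantity, p, eta, r):
        # Energy transaction module: verify account types and balances.
        b, s, o = (self.accounts[k] for k in (buyer, retailer, operator))
        if (b["type"], s["type"], o["type"]) != ("prosumer", "retailer", "operator"):
            raise PermissionError("account type not permitted for this role")
        if s["energy"] < quantity or b["coins"] < p * quantity + r:
            raise ValueError("insufficient energy or coin balance")
        self._transmit(b, s, o, quantity, p, eta, r)

    def _transmit(self, b, s, o, quantity, p, eta, r):
        # Energy transmission module: settle payments, then move the energy.
        b["coins"] -= p * quantity + r
        s["coins"] += p * quantity - eta * quantity
        o["coins"] += eta * quantity + r
        s["energy"] -= quantity
        b["energy"] += quantity
        # Energy recording module: update the running totals.
        s["generated"] += quantity
        b["used"] += quantity

    def query(self, name):
        # Information query module: return a copy of the account record.
        return dict(self.accounts[name])

contract = EnergyContract()
contract.register("op", "operator")
contract.register("ret", "retailer", energy=50.0)
contract.register("pro", "prosumer", coins=100.0)
contract.place_order("pro", "ret", "op", quantity=10.0, p=2.0, eta=0.5, r=1.0)
print(contract.query("pro"))
```

Checking balances before any transfer means a rejected order leaves every account unchanged, which mirrors the verify-then-transmit order of the modules.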
The convergence performance of the invention under a unified pricing mechanism is illustrated below with a setup of 1 operator, 1 retailer and 10 edge servers. Since the optimal demands of the edge servers are quite similar under the unified pricing mechanism, one of the edge servers is selected for presentation. As shown in fig. 5, both the operator and the retailer achieve quite good utility. Under the operator's high-level policy and the retailer's middle-level policy, the prosumer, i.e., the consumer, quickly converges to a relatively good solution. The convergence order of the different entities is consistent with the action order of the three stages of the Stackelberg game, so that the leader is more likely to gain better benefits than the followers.
To demonstrate the performance of the invention, it was compared with several popular deep reinforcement learning algorithms from the perspective of economic analysis, as shown in fig. 6, where each data point is the average of 10 experimental results to reduce random error. As is evident from fig. 6a, HDPG obtains a higher total reward than three deep reinforcement learning algorithms: PPO (proximal policy optimization), SAC (soft actor-critic) and DQN (deep Q-network). In addition, different algorithms have their own characteristics; for example, SAC helps the retailer achieve the highest utility but performs poorly on the operator's policy. In contrast, HDPG helps both the operator and the retailer achieve higher utility. Edge devices using HDPG also achieve better utility. The hierarchical design of the actions and learning processes of the multiple agents helps each agent learn its own policy in response to competing policies, which is a potential reason for the better performance of HDPG.
As shown in fig. 7, the sensitivity of the utility to the number of prosumers was analyzed. Since the utility of an edge server is not very sensitive to the number of participants, a box plot is used to describe in detail the impact of an edge server's energy usage on its utility. As shown in fig. 7a, it is worthwhile for an edge server with high-value production to participate in the energy market. As can be seen from fig. 7b and 7c, the transmission loss rate largely determines the utility of the operator and the retailer. Adopting more advanced technology in the energy transmission and distribution network to reduce the transmission loss rate is therefore of great significance. As the number of edge servers increases, the utility of the operator and the retailer steadily increases, while the utility of each edge server tends to decrease slightly. It should be noted that an increase in the number of edge servers may affect the policy of each edge server, resulting in some fluctuation in the utility of all entities. Fig. 7d shows the trend of the retailer utility for different energy storage efficiencies. Low energy storage efficiency may lead to negative utility, while the higher the efficiency of the energy storage device, the better the utility of the retailer. However, the cost and technical difficulty of reducing the transmission loss rate and increasing the energy storage efficiency remain challenges in the energy trading market.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.