CN113888327B

CN113888327B - Energy internet transaction method and system based on reinforcement learning block chain enabling

Info

Publication number: CN113888327B
Application number: CN202111164320.7A
Authority: CN
Inventors: 曹一凡; 仇超; 任晓旭; 王晓飞
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-09-30
Filing date: 2021-09-30
Publication date: 2024-12-27
Anticipated expiration: 2041-09-30
Also published as: CN113888327A

Abstract

The present invention discloses an energy Internet transaction method and system based on reinforcement learning blockchain enabling, comprising the following steps: constructing a three-stage game model among operators, retailers, and prosumers based on their energy transaction relationships on a blockchain transaction platform; using a distributed hierarchical policy gradient algorithm to solve for the game equilibrium point of the three-stage game model, the game equilibrium point including the optimal unit service price, the optimal unit energy price, and the optimal energy demand; and having operators, retailers, and prosumers conduct energy transactions according to the game equilibrium point. The present invention helps operators and retailers achieve higher utility, while prosumers also achieve better utility.

Description

Energy internet transaction method and system based on reinforcement learning block chain enabling

Technical Field

The invention belongs to the technical field of energy Internet, and particularly relates to an energy Internet transaction method and system based on reinforcement learning blockchain enabling.

Background

With the trend toward distributed energy, the energy Internet (EI) is rapidly becoming a focus of attention. However, because distributed energy sources are intermittent and uncertain, the influx of a large number of them, combined with traditional control methods, has hampered the development of the energy Internet. At the same time, the advent of software-defined networking (SDN) has brought the reliability and flexibility needed to address these issues. Owing to reasonable prices and efficient transmission, distributed energy markets are developing within the energy Internet. Such markets enable traditional energy consumers to become energy retailers capable of producing, storing, and selling distributed energy, and this mode can reduce power transmission loss and load peaks in the energy Internet.

On the other hand, with the general trend toward the Internet of Things, edge computing is widely applied in various network computing architectures by virtue of its advantages in network delay, scalability, and reliability. In order to continuously provide reliable computing, storage, and communication services, the energy utilization and energy supply of devices such as edge servers and gateways are urgent problems to be explored.

In order to meet the energy demands of both the emerging energy retailers and the various edge devices in the energy Internet, an energy trading market serving edge computing needs to be constructed. Although this model can effectively meet the demands of both parties, many problems remain: (1) credit crisis: a lack of trust among different trading entities makes it impossible to conduct energy trading reliably; (2) imperfect market modeling: the models established for each role in the existing energy trading market are imperfect, and a trading process in which the parties interact with and constrain one another has not been formed; and (3) unbalanced utility: the optimization mechanisms in existing energy trading focus on maximizing the utility of one party and do not consider the utility balance among multiple parties. In addition, most current research uses game theory to model the interactions between parties in the transaction process in order to achieve a utility balance. Conventional approaches typically assume a centralized organization that collects users' information and assists them in developing relevant policies; these are targeted optimizations built on complete information, and they ignore the protection of users' private parameters. Meanwhile, in real life, complete information about an individual, particularly some private parameters, cannot be readily obtained, so information collection becomes difficult when a traditional method is used to formulate the relevant policies, and the traditional method cannot be applied.

Disclosure of Invention

Aiming at the problem that energy transactions based on edge computing in the prior art cannot achieve both user privacy protection and utility balance, the invention provides an energy Internet transaction method and system based on reinforcement learning blockchain enabling. In order to solve the technical problem, the technical scheme adopted by the invention is as follows:

An energy internet transaction method based on reinforcement learning block chain enabling comprises the following steps:

S1, constructing a three-stage game model among an operator, a retailer, and a prosumer based on the energy transaction relationship of the operator, the retailer, and the prosumer on a blockchain transaction platform;

S2, solving for the game equilibrium point of the three-stage game model by using a distributed hierarchical policy gradient algorithm, wherein the game equilibrium point comprises an optimal unit service price, an optimal unit energy price, and an optimal energy demand;

S3, the operator, the retailer, and the prosumer conduct energy transactions according to the game equilibrium point obtained in step S2.

The step S2 includes the steps of:

S2.1, setting network parameters of the three-stage game model;

S2.2, initializing weight parameters of the three-stage game model;

S2.3, respectively acquiring the state of the operator, the state of the retailer, and the state of the prosumer; using a Markov decision process, each edge server in the blockchain transaction platform in turn takes the operator utility U^o(η) as the reward function and selects an appropriate unit service price η to maximize the operator utility U^o(η), takes the retailer utility U^r(p) as the reward function and selects an appropriate unit energy price p to maximize the retailer utility U^r(p), and takes the respective prosumer utility as the reward function and selects an appropriate energy demand q_j to maximize the prosumer utility.

In step S2.3, the state of the operator is expressed as:

where p^{t-1} represents the unit energy price at step t-1, and q_j^{t-1} represents the energy demand submitted by the prosumer to the local retailer through edge server j at step t-1;

the maximization of the operator utility U^o(η) is expressed as:

where U_m represents the additional reward that the operator obtains for the trusted blockchain service provided in each energy transaction, φ represents the transmission loss rate, C_t represents the unit transmission cost, C_o represents the fixed operation and maintenance cost, η_min represents the lowest unit service price, η_max represents the highest unit service price, q_j represents the energy demand submitted by the prosumer to the local retailer through edge server j, and N represents the set of all edge servers in the blockchain transaction platform.

The additional reward U_m is calculated as:

U_m = (R_f + rs)λ;

where R_f represents the fixed block reward, r represents the blockchain service fee paid by the prosumer to the operator for each energy transaction, s represents a block parameter, and λ represents a probability factor in the blockchain.

The state of the retailer is expressed as:

where η^t represents the unit service price at step t, and q_j^{t-1} represents the energy demand submitted by the prosumer to the local retailer through edge server j at step t-1;
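As a minimal sketch, assuming each state is held as a flat list of numbers, the operator and retailer states described above could be assembled as follows. The function names and the list layout are illustrative assumptions and do not come from the patent text.

```python
from typing import List, Sequence


def operator_state(prev_energy_price: float, prev_demands: Sequence[float]) -> List[float]:
    """Operator state at step t: p^{t-1} followed by every q_j^{t-1}."""
    return [prev_energy_price, *prev_demands]


def retailer_state(service_price: float, prev_demands: Sequence[float]) -> List[float]:
    """Retailer state at step t: eta^t followed by every q_j^{t-1}."""
    return [service_price, *prev_demands]


# Hypothetical demands q_j^{t-1} reported through three edge servers.
demands_prev = [3.0, 4.5, 2.0]
state_o = operator_state(prev_energy_price=0.8, prev_demands=demands_prev)
state_r = retailer_state(service_price=0.1, prev_demands=demands_prev)
```

The only difference between the two states is the leading price: the operator sees the previous unit energy price, while the retailer sees the unit service price just announced for the current step.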

The maximization of the retailer utility U^r(p) is expressed as:

where C_g represents the production cost that the retailer bears to produce energy, C_s represents the storage cost that the retailer bears to store energy, p_min represents the lowest unit energy price, p_max represents the highest unit energy price, q_j represents the energy demand submitted by the prosumer to the local retailer through edge server j, and N represents the set of all edge servers in the blockchain transaction platform.

The production cost C_g is calculated as:

where a, b, and k are weighting factors of the retailer's power generation cost, and φ represents the transmission loss rate.

The storage cost C_s is calculated as:

where c_s represents the retailer's unit cost of energy storage, ζ_c represents the charging efficiency of the energy storage device, and ζ_d represents the discharging efficiency of the energy storage device.

The state of the prosumer is expressed as:

The maximization of the prosumer utility is expressed as:

where δ represents a conversion factor, w_j represents the usage scenario of edge server j in terms of energy utilization, q_min represents the minimum energy demand, q_max represents the maximum energy demand, q_j represents the energy demand submitted by the prosumer to the local retailer through edge server j, and r represents the blockchain service fee paid by the prosumer to the operator for each energy transaction.

An energy Internet transaction system based on reinforcement learning blockchain enabling comprises an energy application layer, an energy data layer, and an edge control layer. The energy application layer interacts with the edge control layer through a smart contract interface, and the energy data layer interacts with the edge control layer. The energy application layer comprises retailers and prosumers, which interact through a blockchain transaction platform. The edge control layer comprises edge servers and distributed SDN controllers maintained by operators, and each edge server serves as a node in the blockchain transaction platform. The energy data layer comprises a switch and an energy router; the switch is connected to the distributed SDN controllers and is used for receiving scheduling instructions sent by the distributed SDN controllers and forwarding them to the corresponding energy routers, and the energy router is used for sensing the states of the energy lines and reporting those states to the edge servers.

The smart contract interface is built on a smart contract system comprising a user registration module, an energy transaction module, an energy transmission module, an energy recording module, and an information query module. After the operator, the prosumer, and the retailer each register an account through the user registration module, the prosumer places orders through the energy transaction module according to its own needs, and the energy is transmitted from the retailer to the prosumer through the energy transmission module. The energy recording module records the electricity quantity information of the retailer and the prosumer, and the information query module allows each party to query its own account information.

The invention has the beneficial effects that:

The hierarchical design of the actions and learning processes of the multiple agents helps each agent learn its own strategy in response to competing strategies and improves the overall performance of the transaction system. Compared with popular deep reinforcement learning algorithms, the method helps operators and retailers achieve higher utility, while prosumers also achieve better utility. Under the unified pricing mechanism, the convergence order of the different entities is consistent with the action order of the three stages of the Stackelberg game, so a leader in the game is more likely to obtain better benefits than a follower.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of the transaction system according to the present invention.

FIG. 2 is a schematic diagram of the three-stage Stackelberg game model.

FIG. 3 is a block diagram of the hierarchical policy gradient algorithm.

FIG. 4 is a block diagram of the smart contract system.

FIG. 5 shows the convergence of the game under HDPG.

FIG. 6 is a performance comparison of various algorithms.

FIG. 7 is a schematic diagram of utility under different parameters.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.

Embodiment 1: an energy Internet transaction method based on reinforcement learning blockchain enabling, comprising the following steps:

As shown in FIG. 2 and FIG. 3, the three-stage game model includes three policy networks corresponding to the operator, the retailer, and the prosumer, respectively. The prosumer submits its energy demand to the operator through the blockchain transaction platform, and the operator, acting as an intermediary, assists in completing energy transactions and transmission between the retailer and the prosumer through the blockchain transaction platform. The blockchain transaction platform is composed of a plurality of edge servers; each edge server serves as a node in the blockchain transaction platform and performs accounting, broadcasting, verification, and consensus functions, and the set of edge servers is represented as N = {1, 2, ..., j, ..., N'}.

S2, solving for the game equilibrium point of the three-stage game model by using a distributed hierarchical policy gradient algorithm (Hierarchical Distributed Policy Gradient, HDPG);

Considering a Stackelberg game under incomplete information, a Markov decision process is used to solve for the game equilibrium point of the three-stage game model, where one Markov decision process corresponds to the whole process of one energy transaction and transmission. The game equilibrium point comprises the optimal unit service price, the optimal unit energy price, and the optimal energy demand, at which the operator utility U^o(η), the retailer utility U^r(p), and the prosumer utility are maximized and the overall benefit is maximized, where η represents the unit service price, p represents the unit energy price, q_j represents the energy demand submitted by the prosumer to the local retailer through edge server j, and j ∈ N.

The step S2 includes the steps of:

S2.1, setting the network parameters of the three policy networks;

The network parameters include the learning rate α_r of the retailer policy network, the learning rate α_p of the prosumer policy network, the learning rate α_o of the operator policy network, and the discount factor γ.

S2.2, randomly initializing the weight parameters of the three policy networks;

S2.3, respectively acquiring the state of the operator, the state of the retailer, and the state of the prosumer; each edge server in turn takes the operator utility U^o(η) as the reward function and selects an appropriate unit service price η to maximize U^o(η), takes the retailer utility U^r(p) as the reward function and selects an appropriate unit energy price p to maximize U^r(p), and takes the respective prosumer utility as the reward function and selects an appropriate energy demand to maximize the prosumer utility, where η^t ∈ A_o and A_o represents the operator's action space, p^t ∈ A_r and A_r represents the retailer's action space, and the selected energy demand belongs to A_p, where A_p represents the prosumer's action space;

The state of the operator is expressed as:

where p^{t-1} represents the unit energy price at step t-1, and q_j^{t-1} represents the energy demand submitted by the prosumer to the local retailer through edge server j at step t-1.

The maximization of the operator utility U^o(η) is expressed as:

where U_m represents the additional reward that the operator obtains for the trusted blockchain service provided in each energy transaction, φ represents the transmission loss rate, C_t represents the unit transmission cost, C_o represents the fixed operation and maintenance cost, η_min represents the lowest unit service price, and η_max represents the highest unit service price.

The additional reward U_m is calculated as:

U_m = (R_f + rs)λ;

where R_f represents the fixed block reward, r represents the blockchain service fee paid by the prosumer to the operator for each energy transaction, s represents a block parameter, and λ represents a probability factor in the blockchain. The additional reward U_m further motivates the operator to maintain the blockchain.
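The formula for U_m can be evaluated directly. The sketch below reads rs as the product of r and s; the function name, the parameter names, and the numeric values are illustrative assumptions only.

```python
def blockchain_bonus(fixed_block_reward: float, service_fee: float,
                     block_parameter: float, probability_factor: float) -> float:
    """Return U_m = (R_f + r * s) * lambda."""
    return (fixed_block_reward + service_fee * block_parameter) * probability_factor


# Hypothetical values: R_f = 2.0, r = 0.5, s = 4.0, lambda = 0.25.
example_bonus = blockchain_bonus(fixed_block_reward=2.0, service_fee=0.5,
                                 block_parameter=4.0, probability_factor=0.25)
```

With these values the service-fee term r * s equals 2.0, so the bonus is (2.0 + 2.0) * 0.25.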

The state of the retailer is expressed as:

The retailer utility U^r(p) is determined by the unit energy price p and the total energy demand, and the maximization of the retailer utility U^r(p) is expressed as:

where C_g represents the production cost that the retailer bears to produce energy, C_s represents the storage cost that the retailer bears to store energy, p_min represents the lowest unit energy price, and p_max represents the highest unit energy price.

The production cost C_g is calculated as:

where a, b, and k are weighting factors of the power generation cost.

The storage cost C_s is calculated as:

where c_s represents the retailer's unit cost of energy storage, ζ_c represents the charging efficiency of the energy storage device, ζ_d represents the discharging efficiency of the energy storage device, and the remaining quantities are the energy that the retailer actually needs to produce and the energy it needs to store, taking into account the energy lost during transmission.

The state of the prosumer is expressed as:

The maximization of the prosumer utility is expressed as:

where δ represents a conversion factor, w_j represents the usage scenario of edge server j in terms of energy utilization, q_min represents the minimum energy demand, q_max represents the maximum energy demand, pq_j represents the payment made by the prosumer to the retailer for energy, and the remaining term represents the benefit that edge server j obtains from its purchased energy in actual production.
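The leader-to-follower order of step S2.3 can be sketched as one decision step in which the operator, then the retailer, then each prosumer chooses an action inside its admissible interval. This is only an illustration under stated assumptions: the policies are arbitrary callables rather than the policy networks of the invention, the prosumer state is assumed to be the pair of posted prices because its expression is not reproduced in this text, and all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Policy = Callable[[Sequence[float]], float]


def clip(value: float, low: float, high: float) -> float:
    """Keep an action inside its admissible interval [low, high]."""
    return max(low, min(high, value))


def stackelberg_step(operator_policy: Policy, retailer_policy: Policy,
                     prosumer_policies: Sequence[Policy],
                     prev_price: float, prev_demands: Sequence[float],
                     eta_bounds: Tuple[float, float],
                     p_bounds: Tuple[float, float],
                     q_bounds: Tuple[float, float]) -> Tuple[float, float, List[float]]:
    # Stage 1: the operator observes p^{t-1} and q^{t-1} and sets eta^t.
    eta = clip(operator_policy([prev_price, *prev_demands]), *eta_bounds)
    # Stage 2: the retailer observes eta^t and q^{t-1} and sets p^t.
    price = clip(retailer_policy([eta, *prev_demands]), *p_bounds)
    # Stage 3: each prosumer picks q_j^t (state = [eta, p] is an assumption).
    demands = [clip(policy([eta, price]), *q_bounds) for policy in prosumer_policies]
    return eta, price, demands


# Toy policies, chosen only to make the data flow visible.
eta_t, p_t, q_t = stackelberg_step(
    operator_policy=lambda s: 0.5,
    retailer_policy=lambda s: s[0] + 1.0,
    prosumer_policies=[lambda s: 10.0 - s[1], lambda s: 100.0],
    prev_price=1.0, prev_demands=[2.0, 3.0],
    eta_bounds=(0.0, 1.0), p_bounds=(0.5, 2.0), q_bounds=(0.0, 20.0),
)
```

Clipping is one simple way to respect the bounds η_min, η_max, p_min, p_max, q_min, and q_max; a learned policy could equally restrict its output range by construction.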

The specific flow of the Markov decision process is prior art and is not repeated here. The time complexities of the outer loop (the total number of iterations of the Markov decision process), the middle loop (the number of iteration steps of each policy network), and the inner loop (the number of prosumers in the prosumer policy network) are O(E), O(T), and O(N), respectively. Each policy network includes two fully connected layers, and the time complexity of the fully connected layers is expressed in terms of K_l and L, where K_l refers to the number of neural units in a fully connected layer and L represents the number of layers of a policy network. Because each policy network contains two fully connected layers to generate policies, the overall time complexity of the algorithm is O(ETN(T_f)).
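The loop structure behind the O(ETN(T_f)) bound can be made concrete with a counting sketch. The function below is hypothetical and performs no learning; it only counts how many times the innermost body, which would cost one policy-network evaluation, is reached.

```python
def count_policy_evaluations(num_episodes: int, steps_per_episode: int,
                             num_prosumers: int) -> int:
    """Count inner-loop visits of the three nested loops (E x T x N)."""
    evaluations = 0
    for _episode in range(num_episodes):            # outer loop, O(E)
        for _step in range(steps_per_episode):      # middle loop, O(T)
            for _prosumer in range(num_prosumers):  # inner loop, O(N)
                evaluations += 1                    # one evaluation per visit
    return evaluations


total_evaluations = count_policy_evaluations(num_episodes=3,
                                             steps_per_episode=4,
                                             num_prosumers=5)
```

Multiplying the returned count by the per-evaluation cost of the fully connected layers gives the overall bound stated above.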

The operator policy network updates its weight parameter θ_o according to the formula:

In the formula, the terms represent, respectively, the policy of the operator at step t, the reward of the operator at step t, and the state of the operator at step t, which belongs to the operator's state space S_o; α_o represents the learning rate of the operator policy network.

The retailer policy network updates its weight parameter θ_r according to the formula:

In the formula, the terms represent, respectively, the policy of the retailer at step t, the reward of the retailer at step t, and the state of the retailer at step t, which belongs to the retailer's state space S_r; α_r represents the learning rate of the retailer policy network.

The prosumer policy network updates its weight parameter according to the formula:

In the formula, the terms represent, respectively, the policy of the prosumer at step t, the reward of the prosumer at step t, and the state of the prosumer at step t, which belongs to the prosumer's state space S_p; α_p represents the learning rate of the prosumer policy network.
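The update formulas themselves are not reproduced in this text. For orientation only, the sketch below shows the generic REINFORCE-style policy gradient step that combines the same ingredients (a policy, a reward, a state, and a learning rate). It assumes a Gaussian policy with a linear mean and uses the immediate reward without discounting; it should not be read as the patent's exact update rule, and all names are hypothetical.

```python
import random
from typing import List, Sequence


def sample_action(theta: Sequence[float], state: Sequence[float],
                  sigma: float, rng: random.Random) -> float:
    """Sample an action from a Gaussian whose mean is the dot product theta . state."""
    mean = sum(w * x for w, x in zip(theta, state))
    return rng.gauss(mean, sigma)


def reinforce_update(theta: Sequence[float], state: Sequence[float],
                     action: float, reward: float,
                     learning_rate: float, sigma: float) -> List[float]:
    """Return theta + alpha * reward * grad_theta log pi(action | state)."""
    mean = sum(w * x for w, x in zip(theta, state))
    # For a Gaussian with linear mean: grad log pi = (action - mean) / sigma^2 * state.
    scale = learning_rate * reward * (action - mean) / (sigma ** 2)
    return [w + scale * x for w, x in zip(theta, state)]


# One illustrative step with made-up numbers.
rng = random.Random(0)
theta_o = [0.0, 0.0]
state_o = [1.0, 2.0]
action_o = sample_action(theta_o, state_o, sigma=1.0, rng=rng)
theta_next = reinforce_update(theta_o, state_o, action=1.0, reward=2.0,
                              learning_rate=0.1, sigma=1.0)
```

The same function could be called once per agent with its own weights, state, reward, and learning rate (α_o, α_r, or α_p), which mirrors the three separate updates described above.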

In this embodiment, prosumers are distributed energy users who cannot produce energy themselves or whose energy production cannot meet their energy consumption; they can purchase energy from retailers of public energy companies or of BSDEI according to their energy demand and the unit energy price.

Retailers are energy users who use distributed power generation and energy storage devices and whose total power generation exceeds their total power consumption; they benefit from providing energy to a variety of distributed applications. On the other hand, they need to bear the costs of distributed power generation and energy storage and pay the operator for transmission routing services.

The operator is an intermediary between the prosumer and the retailer that assists in completing the energy trading process. In order to provide more convenient service and lower delay, the operator deploys hardware devices such as edge servers and distributed SDN controllers at the edge control layer, thereby realizing edge-to-edge coordination among the devices. In return, it charges the retailer for transmission routing services and the prosumer for trusted blockchain services.

Embodiment 2: an energy Internet transaction system based on reinforcement learning blockchain enabling. As shown in FIG. 1, the system comprises an energy application layer, an energy data layer, and an edge control layer, which together form an energy transaction service system for a distributed energy market. The three layers are mutually independent yet interrelated, and they decouple energy routing from scheduling control in the blockchain-enabled energy Internet (Blockchain-Assisted Energy Internet, BEI). The energy application layer interacts with the edge control layer through a smart contract interface, and the energy data layer interacts with the edge control layer through the standard OpenFlow interface.

The energy application layer comprises retailers and prosumers, which exchange information directly through a blockchain transaction platform; this direct transaction process motivates retailers and prosumers to participate more actively in the distributed energy market. The blockchain transaction platform provides a reliable and stable third-party service platform for energy transactions in the energy application layer.

The edge control layer comprises edge servers and distributed SDN controllers maintained by operators. The distributed SDN controllers schedule and control the energy routing of the energy data layer. Each edge server serves as a node in the blockchain transaction platform and performs accounting, broadcasting, verification, and consensus functions; the set of edge servers is expressed as N = {1, 2, ..., j, ..., N'}. Smart contracts provide reliable and automatic process control for energy transactions in the blockchain transaction platform.

The energy data layer comprises a switch and an energy router. The switch receives scheduling instructions from the distributed SDN controller of the edge control layer and sends them to the corresponding energy router. The energy router senses the state of the energy line, which includes the electric energy value, voltage, current, and the like on the line, and reports the real-time state to the edge server, thereby helping the distributed SDN controller to modify its scheduling instructions. In addition, the energy router may receive a command from the distributed SDN controller to change its own state.
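The control path between the edge control layer and the energy data layer described in this embodiment can be sketched with a few plain classes. This is a simplified in-memory illustration; the class names, method names, and the string instruction are assumptions, and no OpenFlow messages are modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EnergyRouter:
    router_id: str
    line_state: Dict[str, float] = field(default_factory=dict)
    received: List[str] = field(default_factory=list)

    def apply(self, instruction: str) -> None:
        """Accept a scheduling instruction forwarded by the switch."""
        self.received.append(instruction)

    def sense(self) -> Dict[str, float]:
        """Return a snapshot of the energy line state (voltage, current, ...)."""
        return dict(self.line_state)


@dataclass
class Switch:
    routers: Dict[str, EnergyRouter]

    def forward(self, router_id: str, instruction: str) -> None:
        """Forward an instruction to the corresponding energy router."""
        self.routers[router_id].apply(instruction)


@dataclass
class SdnController:
    switch: Switch

    def dispatch(self, router_id: str, instruction: str) -> None:
        """Send a scheduling instruction toward the data layer."""
        self.switch.forward(router_id, instruction)


@dataclass
class EdgeServer:
    latest_states: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def report(self, router: EnergyRouter) -> None:
        """Store the real-time line state reported by an energy router."""
        self.latest_states[router.router_id] = router.sense()


# Hypothetical wiring: one controller, one switch, one router, one edge server.
router = EnergyRouter("r1", {"voltage": 220.0, "current": 5.0})
controller = SdnController(Switch({"r1": router}))
controller.dispatch("r1", "open_line")
edge_server = EdgeServer()
edge_server.report(router)
```

Instructions travel downward (controller, switch, router) while sensed state travels upward (router, edge server), which is the feedback the controller would use to revise its scheduling.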

As shown in FIG. 4, the smart contract interface is established on the basis of a smart contract system comprising a user registration module, an energy transaction module, an energy transmission module, an energy recording module, and an information query module. After the operator, the prosumer, and the retailer each register an account through the user registration module, the prosumer places an order through the energy transaction module according to its own requirements, and the energy transmission module makes energy flow from the retailer to the prosumer. The energy recording module records the electricity quantity information of the retailer and the prosumer, and the information query module allows each party to query its own account information.

Specifically, in the user registration module, the deployer of the smart contract is the initial administrator of the transaction system and initializes some common parameters. Each participant registers an account through the user registration module by providing a user name, an account address, and an account type. The information for these accounts, including available energy and energy coins as well as total generated and total used energy, is then initialized. To address the cold-start problem, participants may acquire energy coins through the energy transaction module. After this, the prosumer confirms the exact amount of energy required and places an order, and the transaction is completed by the energy transaction module; the entire energy transaction process proceeds according to the method described in Embodiment 1. The energy transaction module verifies the authority of the account and ensures a sufficient energy balance. The energy transmission module is then invoked; it not only clears the transaction between the retailer and the operator but also activates an energy scheduling switch, which ensures that energy flows from the retailer to the prosumer. The energy recording module updates the total generated energy and the total used energy according to the smart meter data at the premises of the retailer and the prosumer, and then accounts for costs according to the electricity quantity. The information query module provides six types of interfaces for participants to query their own account information.
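The module flow described above (register, order, transmit, record, query) can be sketched as an in-memory Python class. This is not an on-chain contract and not the actual contract code of the invention: the class and method names are hypothetical, the routing payment is assumed to be the unit service price times the traded quantity, transmission loss is ignored, and the traded quantity stands in for smart meter readings.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Account:
    user_name: str
    account_type: str            # "operator", "retailer", or "prosumer"
    energy: float = 0.0          # available energy
    coins: float = 0.0           # energy coins
    total_generated: float = 0.0
    total_used: float = 0.0


class EnergyContractSketch:
    """In-memory stand-in for the five modules."""

    def __init__(self) -> None:
        self.accounts: Dict[str, Account] = {}

    def register(self, address: str, user_name: str, account_type: str,
                 energy: float = 0.0, coins: float = 0.0) -> None:
        """User registration module."""
        if address in self.accounts:
            raise ValueError("address already registered")
        self.accounts[address] = Account(user_name, account_type, energy, coins)

    def place_order(self, prosumer: str, retailer: str, operator: str,
                    quantity: float, unit_energy_price: float,
                    unit_service_price: float, blockchain_fee: float) -> None:
        """Energy transaction module: check balances, then trigger transmission."""
        buyer, seller = self.accounts[prosumer], self.accounts[retailer]
        if seller.energy < quantity:
            raise ValueError("retailer has insufficient energy")
        if buyer.coins < quantity * unit_energy_price + blockchain_fee:
            raise ValueError("prosumer has insufficient energy coins")
        self._transmit(buyer, seller, self.accounts[operator], quantity,
                       unit_energy_price, unit_service_price, blockchain_fee)

    def _transmit(self, buyer: Account, seller: Account, operator: Account,
                  quantity: float, unit_energy_price: float,
                  unit_service_price: float, blockchain_fee: float) -> None:
        """Energy transmission module: settle payments and move energy."""
        energy_payment = quantity * unit_energy_price
        routing_payment = quantity * unit_service_price
        buyer.coins -= energy_payment + blockchain_fee
        seller.coins += energy_payment - routing_payment
        operator.coins += routing_payment + blockchain_fee
        seller.energy -= quantity
        buyer.energy += quantity
        self._record(buyer, seller, quantity)

    @staticmethod
    def _record(buyer: Account, seller: Account, quantity: float) -> None:
        """Energy recording module: update generated and used totals."""
        seller.total_generated += quantity
        buyer.total_used += quantity

    def query(self, address: str) -> Account:
        """Information query module."""
        return self.accounts[address]


contract = EnergyContractSketch()
contract.register("0xop", "operator-1", "operator")
contract.register("0xre", "retailer-1", "retailer", energy=10.0)
contract.register("0xpr", "prosumer-1", "prosumer", coins=20.0)
contract.place_order("0xpr", "0xre", "0xop", quantity=4.0,
                     unit_energy_price=2.0, unit_service_price=0.5,
                     blockchain_fee=1.0)
```

In the example order, the prosumer pays for energy and the blockchain service fee, the retailer passes the routing payment on to the operator, and the balance checks run before any state changes so that a rejected order leaves every account untouched.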

The performance of the present invention is illustrated below in terms of convergence under the unified pricing mechanism, with 1 operator, 1 retailer, and 10 edge servers. Since the optimal demands of the edge servers are quite similar under the unified pricing mechanism, one edge server is selected for presentation. As shown in FIG. 5, both the operator and the retailer achieve quite good utility. Under the operator's high-level policy and the retailer's middle-level policy, the prosumer, i.e., the consumer, quickly converges to a relatively good solution. The convergence order of the different entities is consistent with the action order of the three stages of the Stackelberg game, so the leader is more likely to gain better benefits than the follower.

To demonstrate the performance of the present invention, it was compared with some popular deep reinforcement learning algorithms from the perspective of economic analysis, as shown in FIG. 6, where each data point is the average of 10 experimental runs to reduce random error. As is evident from FIG. 6a, HDPG obtains a higher total reward than three deep reinforcement learning algorithms: PPO (proximal policy optimization), SAC (soft actor-critic), and DQN (deep Q-network). In addition, the algorithms have their own characteristics; for example, SAC helps the retailer achieve the highest utility but performs poorly on the operator's policy. In contrast, HDPG helps both operators and retailers achieve higher utility. In addition, edge devices using HDPG also achieve better utility. The hierarchical design of the actions and learning processes of the multiple agents helps each agent learn its own strategy in response to competing strategies, which is a potential reason for the better performance of HDPG.

As shown in FIG. 7, the sensitivity of utility to the parameters was analyzed as a function of the number of prosumers. Since the utility of an edge server is not very sensitive to the number of participants, a box plot is used to describe in detail the impact of an edge server's energy usage scenario on its utility. As shown in FIG. 7a, it is sensible for an edge server engaged in high-value production to participate in the energy market. As can be seen from FIG. 7b and FIG. 7c, the transmission loss rate largely determines the utility of the operator and the retailer. Therefore, adopting more advanced technology in the energy transmission and distribution network to reduce the transmission loss rate is of great significance. As the number of edge servers increases, the utility of the operator and the retailer steadily increases, while the utility of each edge server tends to decrease slightly. It should be noted that an increase in the number of edge servers may affect the policy of each edge server, resulting in some fluctuation in the utility of all entities. FIG. 7d shows the trend of retailer utility for different energy storage efficiencies. Low energy storage efficiency may lead to negative utility, whereas the higher the efficiency of the energy storage device, the better the retailer's utility. However, the cost and difficulty of reducing the transmission loss rate and increasing energy storage efficiency remain challenges in the energy trading market.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. An energy Internet transaction method based on reinforcement learning blockchain enabling, characterized by comprising the following steps:

S3, the operator, the retailer, and the prosumer conduct energy transactions according to the game equilibrium point obtained in step S2;

the blockchain transaction platform comprises a plurality of interconnected edge servers, the edge servers are connected with a distributed SDN controller through a smart contract interface, the retailer and the prosumer carry out energy transactions through the edge servers and the smart contract interface, and the operator provides blockchain services for the retailer and the prosumer through the edge servers and the smart contract interface;

the step S2 includes the steps of:

S2.1, setting network parameters of the three-stage game model;

S2.2, initializing weight parameters of the three-stage game model;

S2.3, respectively acquiring the state of the operator, the state of the retailer, and the state of the prosumer; using a Markov decision process, each edge server in the blockchain transaction platform in turn takes the operator utility U^o(η) as the reward function and selects an appropriate unit service price η to maximize the operator utility U^o(η), takes the retailer utility U^r(p) as the reward function and selects an appropriate unit energy price p to maximize the retailer utility U^r(p), and takes the respective prosumer utility as the reward function and selects an appropriate energy demand q_j to maximize the prosumer utility;

the maximization of the operator utility U^o(η) is expressed as:

wherein U_m represents the additional reward that the operator obtains for the trusted blockchain service provided in each energy transaction, φ represents the transmission loss rate, C_t represents the unit transmission cost, C_o represents the fixed operation and maintenance cost, η_min represents the lowest unit service price, η_max represents the highest unit service price, q_j represents the energy demand submitted by the prosumer to the local retailer through edge server j, and N represents the set of all edge servers in the blockchain transaction platform;

the additional reward U_m is calculated as:

U_m = (R_f + rs)λ;

wherein R_f represents the fixed block reward, r represents the blockchain service fee paid by the prosumer to the operator for each energy transaction, s represents a block parameter, and λ represents a probability factor in the blockchain;

the maximization of the retailer utility U^r(p) is expressed as:

wherein C_g represents the production cost that the retailer bears to produce energy, C_s represents the storage cost that the retailer bears to store energy, p_min represents the lowest unit energy price, and p_max represents the highest unit energy price;

the maximization of the prosumer utility is expressed as:

wherein δ represents a conversion factor, w_j represents the usage scenario of edge server j in terms of energy utilization, q_min represents the minimum energy demand, and q_max represents the maximum energy demand;

the production cost C_g is calculated as:

wherein a, b, and k are weighting factors of the retailer's power generation cost, and φ represents the transmission loss rate;

the storage cost C_s is calculated as:

2. The reinforcement learning blockchain enabled energy internet transaction method of claim 1, wherein in step S2.3, the state of the operator is expressed as:

3. The reinforcement learning blockchain enabled energy internet transaction method of claim 1, wherein,

the state of the retailer is expressed as:

wherein η^t represents the unit service price at step t, and q_j^{t-1} represents the energy demand submitted by the prosumer to the local retailer through edge server j at step t-1.

4. The reinforcement learning blockchain enabled energy internet transaction method of claim 1, wherein,

the state of the prosumer is expressed as:

wherein p^t represents the unit energy price at step t.