Study on Electricity Settlement Mechanism Considering Competitiveness of Electricity Markets

Introduction

In economics, the spot market generally refers to the market for immediate delivery of commodities. However, because electricity must be balanced in real time and cannot be stored on a large scale, the trading horizon of the electricity spot market is usually extended from the moment of real-time operation back to the day before real-time operation, forming a three-tiered day-ahead, intraday, and real-time market [1-2]. In terms of the specific market model, the electricity spot market can be organized as a decentralized or a centralized market. In the decentralized market, generators and buyers sign bilateral contracts and self-schedule according to those contracts, while the grid dispatching agency ensures contract execution and maintains system power balance. In the centralized market, all energy is traded through centralized bidding, and unit commitments and output curves are determined subject to system security constraints, so the market is closely coupled with grid operation [3-6].

Different countries and regions design their own spot market structures according to their power mix, power system characteristics, and other factors. The trading and settlement mechanism of the power market is a prerequisite for the normal operation of the entire market and directly affects its operating results [7-9]. Only a scientific and appropriate trading and settlement mechanism can sustain the smooth development of the power market and better protect the fundamental interests of market participants [10-11].

Since China launched a new round of power system reform, the share of market-based electricity transactions has kept increasing and trading practices have continued to deepen. Market access standards have been further relaxed, a large number of electricity retailers have entered the market, and the share of renewable energy and of the corresponding transactions has kept growing. The power market therefore exhibits a widening scope of opening-up, diversified market participants, and diversified transaction types and trading cycles. These changes in turn place higher requirements on transaction settlement, so the establishment and improvement of the power market transaction settlement mechanism should not be delayed [12-15].

The power market supports a variety of trading methods. Classified by trading form, transactions can be divided into spot trading, forward contract trading, and futures and options trading; classified by time scale, they can be divided into ultra-short-term, short-term, and medium- and long-term transactions [16-17]. Electricity spot trading can be further divided into day-ahead market trading, real-time market trading, and so on. Its main features are short-horizon or real-time quotation and real-time execution, with frequent and wide price fluctuations. Forward contract trading in the power market is carried out by signing forward contracts [18-20]. The corresponding contract clearly specifies the total traded energy as well as the principles and methods of apportionment, which facilitates the subsequent execution of the power transaction.

Electricity market trading is of great significance for the normal functioning of the social economy, and its trading mechanisms and regulatory models have long been research hotspots in this field. Literature [21] devised an optimized bidding and settlement strategy for generation enterprises in the day-ahead market, tested it numerically with the league championship algorithm, and found that the proposed bidding and settlement strategy increased expected returns. Literature [22] proposed a dynamic regulatory mechanism for the electricity market and corroborated it with three regional cases, confirming that this mechanism effectively improves the financial settlement situation in the wholesale electricity market. Literature [23] verified through simulation experiments that electricity retailers can reduce EDS costs by choosing partners, and that the coalition revenue allocation method proposed in the study is more scientific and targeted than the cooperative game method.

Literature [24] explores the bidding procedures and fairness of market parties under three settlement mechanisms in the electricity spot market (locational marginal pricing, zonal pricing, and system average pricing) and proposes a strategy for selecting an electricity pricing mechanism suited to China's national conditions. Literature [25], based on the current state of power market transaction settlement, explores effective operating modes for the settlement mechanism, aiming to promote the establishment of a more advanced power market transaction settlement mechanism. Literature [26], in order to solve the problems of electricity tariff diversion and imbalance cost sharing in the bilateral electricity market under China's dual-track system, adopts a dual settlement system based on day-ahead benchmarks, real-time volume differences, and contracts for differences, which effectively raises the market participation enthusiasm of electricity sellers and consumers. Literature [27] analyzes the Brazilian wholesale electricity trading model in depth and argues that its liberalized market environment promotes large-scale investment in low-carbon generation, but that its regulatory design is weak and prone to imbalances that generate financial and fiscal pressure. Literature [28] studies the single clearing mechanism in the Brazilian electricity market and finds that a dual clearing system helps mitigate anti-competitive behavior and promote market efficiency. Literature [29] studies two relatively novel peer-to-peer (P2P) power exchange settlement mechanisms and concludes that they reduce the transaction costs of both parties compared with traditional settlement mechanisms. Literature [30] proposes an Ethereum-based energy trading and settlement framework, which is of positive significance for improving transaction transparency and security. The above studies investigate electricity market bidding and settlement mechanisms from the perspectives of expected benefits, transaction efficiency, transparency, and fairness.

In this paper, based on infinitely repeated game theory, a bidding process model for power producers and an independent system operator (ISO) market clearing model are constructed using the SFE model, and the MADDPG deep reinforcement learning algorithm is used to solve the model. To verify the feasibility of the model and the algorithm, a modified IEEE 33-node distribution system is selected for case study analysis, which demonstrates the superiority of the game-theoretic bargaining model over fixed and time-of-use tariff methods and compares the performance of the MADDPG algorithm with a traditional reinforcement learning algorithm in optimizing offer strategies. Finally, based on the model construction and solution, a universal settlement mechanism for the electricity market is designed in combination with the principles of electricity settlement.

Agent-Based Equilibrium Modeling of Electricity Markets Introducing Reinforcement Learning

To explore and optimize the settlement mechanism of a competitive electricity market, the first step is to bring the electricity market to a Nash equilibrium. To this end, this chapter constructs an agent-based equilibrium model of the electricity market and introduces a reinforcement learning method to solve it.

Equilibrium modeling of power market agents
Bidding procedures for power producers

Generator Bidding

Generator bidding is the act by which a power producer submits offers to the demand side to provide electricity services, with the aim of offering the most competitive price in order to win orders from electricity buyers.

SFE model-based bidding process among power producers

The SFE (supply function equilibrium) model is a mathematical modeling method used to describe the bidding process of power producers [31]. Its basic idea is that each producer estimates the bidding price that maximizes its profit, taking into account its own maximum biddable price as well as the prices of its competitors in the bidding process.

The bidding process of the power producer is studied using the SFE model. The cost function of power producer g is modeled as a quadratic function of its output power: $$\begin{array}{l} C_g^m(p_{gt}) = \alpha_g^m p_{gt} + \frac{1}{2}\beta_g^m p_{gt}^2 \\ p_g^{\min} \leq p_{gt} \leq p_g^{\max}, \quad \forall g \in M \end{array}$$

where pgt is the output power in time interval t, $$p_g^{\min}$$ and $$p_g^{\max}$$ are the minimum and maximum generation limits, and M is the set of generators. The marginal cost of the generator is a linear function of the output power: $$p_g^m(p_{gt}) = \alpha_g^m + \beta_g^m p_{gt}$$

where $$\alpha_g^m$$ and $$\beta_g^m$$ represent the intercept and slope of the marginal cost function, respectively.

At time interval t, each power producer submits a supply offer to the ISO. In this paper, the intercept parameterization is used to model the bidding strategy of the power producers, and the supply offer is formulated as: $$p_{gt}(p_{gt}) = \alpha_{gt} + \beta_g^m p_{gt}, \quad \alpha_{gt} \in A_g$$

where αgt is the intercept of the supply function, known as the strategic variable, specified by Genco g within its strategy space Ag; it can deviate from $$\alpha_g^m$$ to exert market power. The slope of the supply function is kept equal to $$\beta_g^m$$.
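For concreteness, a minimal Python sketch of these producer-side functions is given below; the coefficient values are hypothetical and only illustrate how a strategic intercept αgt shifts the offer above marginal cost.

```python
# Minimal sketch of a producer's cost, marginal cost, and strategic supply offer.
# Coefficient values are hypothetical examples, not taken from the paper.

def cost(p_gt, alpha_m, beta_m):
    """Quadratic generation cost C_g^m(p) = alpha_m*p + 0.5*beta_m*p^2."""
    return alpha_m * p_gt + 0.5 * beta_m * p_gt ** 2

def marginal_cost(p_gt, alpha_m, beta_m):
    """Marginal cost: alpha_m + beta_m * p."""
    return alpha_m + beta_m * p_gt

def supply_offer(p_gt, alpha_gt, beta_m):
    """Intercept-parameterized offer: alpha_gt + beta_m * p (slope fixed at beta_m)."""
    return alpha_gt + beta_m * p_gt

alpha_m, beta_m = 20.0, 0.05      # true cost coefficients (example values)
alpha_gt = 1.4 * alpha_m          # strategic intercept within [0, 3*alpha_m]
p = 100.0                         # output power (example)
print(cost(p, alpha_m, beta_m), marginal_cost(p, alpha_m, beta_m), supply_offer(p, alpha_gt, beta_m))
```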

Independent system operator market clearing model

ISO market clearing refers to the process by which the independent system operator (ISO) collects the supply offers and demand bids and determines the cleared quantities and prices for each time interval, subject to system operating constraints.

The electricity consumption of load d in time interval t is modeled with a linear demand curve: $$p_{dt}(q_{dt}) = f_d \cdot (q_{dt} - D_{dt}^{\max})$$

where fd is the slope of the demand curve (constant over time), qdt is the demand, $$D_{dt}^{\max}$$ is the maximum load demand in time interval t, and D is the set of loads. The total maximum load demand over all loads is: $$D_t = \sum\limits_{d \in D} D_{dt}^{\max}$$

Subject to nodal power balance, branch flow constraints, and generator output constraints, the ISO clears the market with the objective of maximizing total social welfare. The market clearing in time interval t can be expressed with the DC power flow model as: $$\begin{array}{c} \max\limits_{p_{gt},\, q_{dt}} \sum\limits_{d \in D} \left( -f_d D_{dt}^{\max} q_{dt} + \frac{1}{2} f_d q_{dt}^2 \right) - \sum\limits_{g \in M} \left( \alpha_{gt} p_{gt} + \frac{1}{2}\beta_g^m p_{gt}^2 \right) \\ \text{s.t.} \quad \sum\limits_{g \in M} p_{gt} - \sum\limits_{d \in D} q_{dt} = 0 \\ -F \leq PTDF\,(p_t - q_t) \leq F \\ p_g^{\min} \leq p_{gt} \leq p_g^{\max}, \quad \forall g \in M \end{array}$$

where pt and qt are the nodal generation and demand vectors, obtained as linear combinations of pgt and qdt, respectively; PTDF is the matrix of power transfer distribution factors; and F is the vector of maximum line flow limits.

During this time interval, the revenue of each power producer is: $$r_{gt} = \lambda_{it} p_{gt} - \left( \alpha_g^m p_{gt} + \frac{1}{2}\beta_g^m p_{gt}^2 \right), \quad g \in M$$

where λit is the nodal price at bus i where generator g is located, calculated from the Lagrange multipliers corresponding to the constraints in Eq. (6); the subscript i indicates the bus at which generator g is connected.
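A minimal sketch of this single-interval clearing problem, using the cvxpy modeling library (an assumption; the paper's own implementation uses Gurobi), can illustrate how the cleared quantities, the price from the balance-constraint dual, and the producer revenues are obtained. The network data below are hypothetical and line limits are omitted for brevity.

```python
import cvxpy as cp
import numpy as np

# Hypothetical single-interval example with 2 producers and 2 loads;
# line-flow (PTDF) limits are omitted, so the dual of the balance
# constraint is a single system-wide price.
alpha_bid = np.array([22.0, 30.0])    # strategic intercepts alpha_gt
beta_m    = np.array([0.10, 0.08])    # marginal-cost slopes beta_g^m
alpha_m   = np.array([20.0, 25.0])    # true cost intercepts alpha_g^m
p_max     = np.array([150.0, 120.0])  # generator limits
f_d       = np.array([-0.20, -0.25])  # demand-curve slopes (negative)
D_max     = np.array([100.0, 90.0])   # maximum load demands

p = cp.Variable(2, nonneg=True)       # cleared generation p_gt
q = cp.Variable(2, nonneg=True)       # cleared demand q_dt

benefit = cp.sum(cp.multiply(-f_d * D_max, q) + 0.5 * cp.multiply(f_d, cp.square(q)))
cost    = cp.sum(cp.multiply(alpha_bid, p) + 0.5 * cp.multiply(beta_m, cp.square(p)))
balance = cp.sum(p) - cp.sum(q) == 0

prob = cp.Problem(cp.Maximize(benefit - cost),
                  [balance, p <= p_max, q <= D_max])
prob.solve()

# Dual of the power-balance constraint, interpreted as the market price
# (its sign convention depends on cvxpy's internal formulation).
lam = balance.dual_value
revenue = lam * p.value - (alpha_m * p.value + 0.5 * beta_m * p.value ** 2)
print("price:", lam, "generation:", p.value, "producer profit:", revenue)
```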

Infinitely Repeated Game Theory and Nash Equilibrium Analysis

Game Theory

Game theory is the branch of mathematics that studies the choices of decision makers whose payoffs depend on one another's actions. The core idea of game theory is to model the behavior of participants in order to understand the possible outcomes and optimal strategies [32]. The analytical tools of game theory include the game tree, Nash equilibrium, optimal strategies, the game matrix, and so on.

Static game theory

Static game theory is a branch of game theory that studies how participants make decisions according to their own interests within a single period. The strategic game is a typical static game and is divided into pure-strategy games and mixed-strategy games according to whether the participants' actions are randomized.

A pure-strategy game consists of a set of participants, action sets, and utility functions. For a game with N participants, a participant is denoted by a positive integer i, and the set of participants is denoted $$I = \left\{ i \mid i \leq N, i \in Z^+ \right\}$$. The action set of participant i is denoted Si and contains all of participant i's optional actions; the decision of participant i is to select an action element si in Si. The utility ui is a function of the action profile s = (s1, s2, …, sN) of all participants, and ui(s) denotes the benefit or cost of that profile to participant i.

Mixed-strategy games add mixed-strategy sets to pure-strategy games. The decision of participant i in a mixed-strategy game is a probability distribution σi over the action set Si, so actions are randomized. Specifically, let there be n action elements si,j in set Si, where j = 1, 2, …, n; the probability distribution σi assigns probability σi(si,j) to choosing action si,j. Participant i's mixed-strategy set $$\Sigma_i$$ is the set of all σi satisfying $$\sum\nolimits_{j = 1}^{n} \sigma_i(s_{i,j}) = 1$$. The utility of a mixed-strategy game is defined as the expectation of the pure-strategy payoffs under the joint probability distribution: $$E[u_i(\sigma)] = \sum\limits_{s \in S} \left( u_i(s_1, s_2, \ldots, s_N) \prod\limits_{j = 1}^{N} \sigma_j(s_j) \right)$$
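As a small illustration (with made-up payoffs, not taken from the paper), the expected utility under mixed strategies can be computed by enumerating all joint action profiles:

```python
import itertools
import numpy as np

# Hypothetical 2-player game: payoff matrices for players 1 and 2 (2 actions each).
u1 = np.array([[3.0, 0.0], [5.0, 1.0]])
u2 = np.array([[3.0, 5.0], [0.0, 1.0]])
sigma1 = np.array([0.6, 0.4])   # player 1's mixed strategy
sigma2 = np.array([0.7, 0.3])   # player 2's mixed strategy

# E[u_i(sigma)] = sum over joint profiles of u_i(s) * prod_j sigma_j(s_j)
E1 = sum(u1[a, b] * sigma1[a] * sigma2[b] for a, b in itertools.product(range(2), repeat=2))
E2 = sum(u2[a, b] * sigma1[a] * sigma2[b] for a, b in itertools.product(range(2), repeat=2))
print(E1, E2)
```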

The constraints on actions in a strategic game are implicit in the action sets, but expressing them this way assumes that the action set Si is unaffected by the actions of the other participants, i.e., si is not constrained by s−i. If si is constrained by s−i, the problem becomes a generalized Nash equilibrium problem. In this type of problem, the action set Si(s−i) of participant i is a function of the actions s−i of the other participants, and each participant solves the following constrained optimization problem given s−i: $$\min\limits_{s_i} g_i(s_i, s_{-i}) \quad \text{s.t.} \quad s_i \in S_i(s_{-i})$$

where gi is the objective function of participant i. The action profile s* = (s1*, …, sN*) is a generalized Nash equilibrium when each participant minimizes its objective function given s−i*.

In static game theory, the operation of the electricity market in time interval t can be viewed as a static game Γt whose participants are the Gencos, with strategy spaces A1, …, AG and a vector of Genco payoffs (r1t, …, rGt). In the game Γt, each power producer chooses its strategic variable αgt to maximize its payoff through ISO market clearing; this is an MPEC model that can be formulated as a bilevel optimization problem. Since the market clearing problem is a quadratic program, its Karush-Kuhn-Tucker (KKT) conditions characterize the global optimum. Each power producer's MPEC problem can be formulated as: $$\begin{array}{l} \max\limits_{\alpha_{gt}} \ r_{gt} = \lambda_{it} p_{gt} - \left( \alpha_g^m p_{gt} + \frac{1}{2}\beta_g^m p_{gt}^2 \right) \\ \text{s.t.} \quad \alpha_{gt} \in [0, 3\alpha_g^m] \\ \end{array}$$ subject to the KKT conditions of the lower-level market clearing problem.

Infinite Repetition Game Theory

Infinitely repeated game theory studies the strategies and outcomes of games that are played repeatedly. In an infinitely repeated game, the participants face the game many times, and the outcome of each stage affects the following stages. Participants therefore need to consider long-term interests and strategies rather than focusing only on short-term payoffs.

Repeated trading in the electricity market can be modeled as an infinite sequence of static games Γ1, Γ2, … with discount factor γ. For each generator, the discounted payoff sequence rg1, rg2, rg3, … is: $$r_{g1} + \gamma r_{g2} + \gamma^2 r_{g3} + \ldots = \sum\limits_{t = 1}^{\infty} \gamma^{t - 1} r_{gt}$$

The factor γ ∈ [0, 1] reflects the time value of payoffs. The closer γ is to 1, the more weight future payoffs carry and the more the power producers can gain.

If the static game Γt is the same at each time interval (Γ = Γ1 = Γ2 = Γ3 = ⋯ = Γt), the sequence of games Γ1, Γ2, Γ3, … is called an infinitely repeated game Γ(∞, γ). In an infinitely repeated game, players who are patient enough (γ → 1) can obtain higher payoffs than in a single-stage game.
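For instance, under the assumption of a constant per-stage payoff r, the discounted sum has a closed form that makes the role of γ explicit: $$\sum\limits_{t=1}^{\infty} \gamma^{t-1} r = \frac{r}{1-\gamma}, \qquad \text{e.g. } \gamma = 0.9 \Rightarrow \frac{r}{1-0.9} = 10\,r$$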

Nash equilibrium

Nash equilibrium is an important concept in game theory: it is a strategy profile in which every player chooses an optimal strategy, and no player can obtain a higher payoff by unilaterally changing its strategy. If for every participant i: $$u_i\left( \sigma_i^*, \sigma_{-i}^* \right) \geq u_i\left( s_i, \sigma_{-i}^* \right) \quad \forall s_i \in S_i$$

then the mixed-strategy profile σ* is a Nash equilibrium.

The application of Nash equilibria in infinitely repeated games can help players determine the best strategy, respond to changes in their opponents’ strategies, and deal with issues such as cooperation and betrayal to maximize long-term gains.

Reinforcement learning based model solving approach

The whole model framework can be solved iteratively with multi-agent reinforcement learning (MARL) or multi-agent deep reinforcement learning (MADRL) methods, where the ISO market clearing model is a mixed-integer quadratic programming (MIQP) problem that can be solved to optimality with commercial solvers such as CPLEX or Gurobi.

Setting up the reinforcement learning elements

Multi-agent (deep) reinforcement learning treats each power producer as an agent with autonomous learning capability, and the day-ahead market with which it interacts as the environment. Its elements include:

State s: the weighted average electricity price of each unit in the previous round of interaction is taken as the state, $$s_d = \{\lambda_{1,d-1}, \lambda_{2,d-1}, \cdots, \lambda_{I,d-1}\} \in S$$.

Action a: the action of agent i corresponds to the unit's offer decision variable ki (∈ A = [0, Kmax]), and the joint action set of all units is denoted $$o = \{a_1, a_2, \cdots, a_I\}$$.

Strategy μθ: the way an agent chooses its action in state s is called its strategy. In MARL, the strategy corresponds to the probability of choosing each action; in MADRL, the strategy is represented as a mapping from state to action, μθ : S → A, and a deterministic action is obtained in state s, i.e., a = μθ(s).

Payoff r: the day-ahead market clearing profit obtained by agent i interacting with the environment is taken as the payoff ri,d, and the cumulative payoff over repeated interactions is $$r_i^{tot} = \sum\nolimits_d \gamma^{d - 1} r_{i,d}$$.

The state transition of the agents is determined by the market clearing process.

Model Solving Based on Multi-Agent Reinforcement Learning

To avoid the computation falling into local optima and to improve convergence speed, the WoLF ("win or learn fast") mechanism is combined with the policy hill-climbing (PHC) algorithm to form the WoLF-PHC reinforcement learning algorithm, which solves for the optimal offer strategy of each power producer and the market equilibrium outcome [33].

The main idea of the WoLF mechanism is to adjust the learning rate: learn slowly and cautiously when the strategy performs well, and learn fast when it performs poorly. In this way an agent can quickly adapt to the other agents' strategies when it performs worse than expected, while leaving the other agents enough time to adjust their strategies when it performs better than expected.

The solution process of multi-agent reinforcement learning is shown schematically in Fig. 1. Before each trial, each agent selects an offer from the offer space according to its own state and strategy by sampling (e.g., roulette-wheel selection); the payoff of each agent and its new state are obtained through the market clearing calculation, the agents update their strategies accordingly, and the process finally reaches the market equilibrium.

Figure 1.

Solution flow diagram of multi-agent reinforcement learning

The specific process is as follows:

Discretize the state space and action space.

Generate strategies and select offer curves. The strategy corresponds to the probability of selecting each ki value in the action space; initially, every offer curve is selected with equal probability, and the probabilities are updated continuously during the iterative computation. In each iteration, the agent selects an action from the action space according to its strategy in a roulette-wheel fashion and executes it.

Market clearing. The market clearing model is used to calculate the cleared power and revenue of each agent, and the clearing result is fed back to the agents for updating their strategies.

Q-value update. After the market is cleared, each agent updates the corresponding Q-value using the profit gained as the immediate reward: $$Q_i(s, a_c) = (1 - l^r) Q_i(s, a_c) + l^r\left[ r_i + \gamma \max\limits_{a'} Q_i(s', a') \right]$$

where ac is the currently selected action and l^r is the learning rate.

Average strategy update. Update the expected value of the agent's historical strategies: $$\bar{\mu}_i(s, a_i) = \bar{\mu}_i(s, a_i) + \frac{1}{C^{count}(s)}\left[ \mu_i(s, a_i) - \bar{\mu}_i(s, a_i) \right]$$

where $$C^{count}(s)$$ is the number of occurrences of state s and $$\bar{\mu}_i(s, a_i)$$ is the average strategy of agent i in state s.

Strategy update: μi(s,ai)=μi(s,ai)+Δsai,aiAi$${\mu_i}(s,{a_i}) = {\mu_i}(s,{a_i}) + {\Delta_{s{a_i}}},\forall {a_i} \in {A_i}$$

where: $$\Delta_{s a_i} = \begin{cases} -\delta_{s a_i} & a_i \neq \arg\max\limits_{a'} Q_i(s, a') \\ \sum\limits_{a' \neq a_i} \delta_{s a'} & a_i = \arg\max\limits_{a'} Q_i(s, a') \end{cases}$$ $$\delta_{s a_i} = \min\left( \mu_i(s, a_i), \frac{\delta}{N - 1} \right)$$

where N is the total number of actions of agent i and δ is the learning rate factor.

The WoLF mechanism is added on the basis of Eqs. (15)~(17). In this paper, the difference between the expected Q-value under the current strategy and under the average strategy is used as the criterion for judging strategy performance: $$\delta = \begin{cases} \delta_w & \sum\limits_{a_i \in A_i} \mu_i(s, a_i) Q_i(s, a_i) > \sum\limits_{a_i \in A_i} \bar{\mu}_i(s, a_i) Q_i(s, a_i) \\ \delta_v & \text{otherwise} \end{cases}$$

where δw and δv denote the learning rates used when the agent performs better or worse than expected, respectively.
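To make the update steps above concrete, a minimal tabular WoLF-PHC sketch is given below; it is a simplified single-state illustration with hypothetical parameter values, not the paper's full implementation.

```python
import numpy as np

# Minimal single-agent, single-state WoLF-PHC update sketch (hypothetical values).
n_actions = 5
Q       = np.zeros(n_actions)                    # action-value estimates
mu      = np.full(n_actions, 1.0 / n_actions)    # current strategy (probabilities)
mu_bar  = mu.copy()                              # average strategy
count_s = 0
lr, gamma = 0.1, 0.9                             # Q learning rate l^r and discount
delta_w, delta_v = 0.01, 0.04                    # "win" / "lose" strategy learning rates

def wolf_phc_step(action, reward, q_next_max):
    global count_s
    # Q-value update with immediate reward and bootstrapped next-state value
    Q[action] = (1 - lr) * Q[action] + lr * (reward + gamma * q_next_max)
    # Average strategy update
    count_s += 1
    mu_bar[:] = mu_bar + (mu - mu_bar) / count_s
    # WoLF criterion: "winning" if the current strategy beats the average strategy
    delta = delta_w if mu @ Q > mu_bar @ Q else delta_v
    # Policy hill-climbing step toward the greedy action
    best = int(np.argmax(Q))
    step = np.minimum(mu, delta / (n_actions - 1))
    for a in range(n_actions):
        if a != best:
            mu[a] -= step[a]
            mu[best] += step[a]

# Example interaction: the agent took action 2, got reward 1.5, next-state max-Q is 0.0
wolf_phc_step(action=2, reward=1.5, q_next_max=0.0)
print(mu, Q)
```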

Model Solving Based on Multi-Agent Deep Reinforcement Learning

In this paper, the MADDPG method is used to iteratively solve the two-layer model [34]. MADDPG uses fully connected neural networks to approximate the policy function μ and the Q-value function, and integrates the experience replay and independent target networks of deep Q-learning to form two policy networks and two value networks: the main policy network (AON) with parameters θ, the target policy network (ATN) with parameters θ′, the main Q network (CON) with parameters ω, and the target Q network (CTN) with parameters ω′. The function of the policy network is to select actions ai to interact with the environment based on the state s and the deterministic policy μθ. The role of the Q network is to evaluate the behavior of the policy network and guide its subsequent actions. The main networks (AON, CON) are trained and updated using gradient methods, while the target networks (ATN, CTN) periodically copy the parameters from the main networks using a soft update approach.

The MADDPG framework is shown in Fig. 2. Each agent consists mainly of three modules: the actor, the critic, and the experience replay buffer. The whole framework uses centralized training with decentralized execution: each network is trained in a centralized manner, where the Q network can observe global information to guide the policy network's training, while during execution actions are taken using only the policy network with local information.

Figure 2.

MADDPG framework

The process of iteratively solving the market equilibrium two-layer model based on the MADDPG algorithm is shown in Fig. 3.

Figure 3.

Flowchart of MADDPG solving two-layer model

First, each policy network generates an action μθj(sd) based on the state sd and adds Gaussian noise $$N_{j,d}(0, \sigma_{j,d}^2)$$ (see Eq. (19)) to interact with the environment with the action aj,d = μθj(sd) + Nj,d. The environment determines the next state sd+1 and the payoff rj,d of all agents and stores the complete information set $$\{s_d, o_d, r_{j,d}, s_{d+1}\}$$ in the experience replay buffer Bj of each agent.

Next, a batch of Nbat samples $$\{s_n, o_n, r_{j,n}, s_{n+1}\}$$ is generated by sampling and passed to the policy network and the Q network, respectively. The target Q network (CTN) updates the target Q value yj,n by combining the sampled states with the target-policy actions $$o_{j,n+1} = \mu_{\theta_j'}(s_{n+1})$$, as shown in Eq. (21). In turn, the main Q network (CON) is trained as shown in Eq. (22).

Then, the main strategy network (AON) determines the sample actions oj,n=μθj(sn)$${o_{j,n}} = \mu {\theta_j}\left( {{s_n}} \right)$$, which are evaluated by the main Q-network, and updates its own parameters, as shown in equation (25).

Finally, the target network periodically copies the parameters from the main network and performs soft updates, as shown in equation (26). The relevant equations are as follows:

Gaussian noise $$N_{j,d}(0, \sigma_{j,d}^2)$$ parameters: $$\sigma_{j,d} = \begin{cases} 1 & 0 < d \leq T_B \\ \max\left( e^{-\varepsilon (d - T_B)}, 0.03 \right) & T_B < d \leq T_{train} \end{cases}$$

where σj,d denotes the noise standard deviation, e and ε denote the natural constant and the decay rate, respectively, and Ttrain denotes the total number of training rounds.

Main Q network training

The training objective of the main Q network is to minimize the mean squared error between the sample Q value and the target Q value; its loss function is: $$L(\omega_j) = \frac{1}{N_{bat}}\sum\limits_n \left[ y_{j,n} - Q_{\omega_j}(s_n, o_n) \big|_{o_{j,n} = \mu_{\theta_j}(s_n)} \right]^2$$

where: $$y_{j,n} = r_{j,n} + \gamma Q'_{\omega_j'}(s_{n+1}, o_{n+1}) \big|_{o_{j,n+1} = \mu'_{\theta_j'}(s_{n+1})}$$

Its gradient can be calculated by automatic differentiation, and ω is updated according to: $$\omega_j \leftarrow \omega_j - \zeta_2(d - T_B)\nabla_{\omega_j} L(\omega_j)$$

where Nbat and n denote the number of samples in a batch and the sample index, respectively; $$o_n = (o_{1,n}, o_{2,n}, \ldots)$$ denotes the nth sampled joint action, and oj,n denotes the action of agent j therein; yj,n denotes the target value used to update the sample Q value of agent j; μ′ and Q′ denote the policy function and Q-value function of the corresponding target networks, respectively; ζ2 denotes the learning rate of the main Q network; and TB denotes the capacity of the experience replay buffer.

Main policy network training

The deterministic policy gradient formula is: $$\nabla_{\theta_j} J(\mu_{\theta_j}) = E_{s_n, o_n}\left[ \nabla_{\theta_j} \mu_{\theta_j}(s_n)\, \nabla_{o_{j,n}} Q_{\omega_j}(s_n, o_n) \big|_{o_{j,n} = \mu_{\theta_j}(s_n)} \right]$$

According to the Monte Carlo method, substituting the sampled data into the above equation gives an unbiased estimate of this expectation, so it can be rewritten as the sampled policy gradient: $$\nabla_{\theta_j} J(\mu_{\theta_j}) \approx \frac{1}{N_{bat}}\sum\limits_n \nabla_{o_{j,n}} Q_{\omega_j}(s_n, o_n) \big|_{o_{j,n} = \mu_{\theta_j}(s_n)}\, \nabla_{\theta_j} \mu_{\theta_j}(s_n)$$

θ is updated as follows: $$\theta_j \leftarrow \theta_j + \zeta_1(d - T_B)\nabla_{\theta_j} J(\mu_{\theta_j})$$

where ζ1 denotes the learning rate of the main strategy network.

Update the target network parameters: $$\begin{cases} \theta_j' \leftarrow \tau \theta_j + (1 - \tau)\theta_j' \\ \omega_j' \leftarrow \tau \omega_j + (1 - \tau)\omega_j' \end{cases}$$

where τ is the soft-update rate of the target networks. The entire algorithm is programmed and computed based on the PyTorch and Gurobi frameworks.
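A condensed PyTorch sketch of one MADDPG update step for a single agent is shown below; network sizes, dimensions, and hyperparameter values are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

# Illustrative dimensions and hyperparameters (assumed, not the paper's settings).
state_dim, act_dim, n_agents = 4, 1, 3
gamma, tau, lr_actor, lr_critic = 0.9, 0.01, 1e-3, 1e-3

actor       = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
actor_targ  = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic      = nn.Sequential(nn.Linear(state_dim + n_agents * act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic_targ = nn.Sequential(nn.Linear(state_dim + n_agents * act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_targ.load_state_dict(actor.state_dict())
critic_targ.load_state_dict(critic.state_dict())
opt_actor  = torch.optim.Adam(actor.parameters(), lr=lr_actor)
opt_critic = torch.optim.Adam(critic.parameters(), lr=lr_critic)

def maddpg_update(s, joint_a, r, s_next, joint_a_next_targ):
    """One update on a sampled batch: s is (B, state_dim), joint actions are (B, n_agents*act_dim)."""
    # Critic (main Q network): minimize MSE between Q(s, o) and the target y
    with torch.no_grad():
        y = r + gamma * critic_targ(torch.cat([s_next, joint_a_next_targ], dim=1))
    q = critic(torch.cat([s, joint_a], dim=1))
    critic_loss = nn.functional.mse_loss(q, y)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor (main policy network): ascend the deterministic policy gradient,
    # replacing this agent's own action with the current policy output.
    own_a = actor(s)
    joint_a_actor = torch.cat([own_a, joint_a[:, act_dim:]], dim=1)
    actor_loss = -critic(torch.cat([s, joint_a_actor], dim=1)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # Soft update of the target networks
    for p, p_t in zip(actor.parameters(), actor_targ.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
    for p, p_t in zip(critic.parameters(), critic_targ.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)

# Example call with a random batch of 32 transitions (reward shape (32, 1)).
B = 32
maddpg_update(torch.randn(B, state_dim), torch.randn(B, n_agents * act_dim),
              torch.randn(B, 1), torch.randn(B, state_dim), torch.randn(B, n_agents * act_dim))
```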

Analysis of examples
Parameters of the algorithm

Example setup: in this paper, a modified IEEE 33-node power distribution system is used as the test case. Its system wiring is shown in Figure 4.

Figure 4.

Distribution system of the IEEE33 nodes

The distributed energy operator (DGO) operates WG-BESS units at nodes 24 and 30, and the load aggregator (LA) operates PV-BESS units at nodes 8 and 24. The distribution network operator (DNO) installs a gas turbine M at node 14 with a maximum available power of 160 kW. The LA has a controllable interruptible load (IL) at node 14 with a maximum controllable power of 90 kW and a maximum continuous control duration of 6 hours. Both the interruptible load and the gas turbine participating in the market transactions are used to assume the responsibility of guaranteeing power supply. The three-party game relationship is shown in Table 1.

Tripartite game relationship (√ = participates in supplying the load, × = does not participate)

Market transaction load number   DNO   DGO   LA
8                                 √     ×     √
24                                √     √     √
30                                √     √     ×

For the Q-learning parameters, the learning factor α is set to 0.04 and the reward discount factor γ to 0.9. For the action space, energy storage has three actions (charging, idling, and discharging), i.e., $$A_E = (-1, 0, 1)$$; the gas turbine and the IL each have only two actions (idling and running), i.e., $$A_M = (0, 1), A_{IL} = (0, 1)$$. For the state space, the installed wind/photovoltaic capacity P, the storage capacity E, and the scheduling period T are divided into intervals of 60 kW, 120 kW·h, and 1 h, respectively, so the state-space dimensions are $$P_S = (10, 7, 10, 7), E_S = (6, 6, 6, 6), T_S = 24$$. At any moment, given the new-energy output and the current stored energy, a unique state $$S = \{S_t, S_P, S_E\}$$ can be determined, and Q-learning uses this to establish action-state value pairs (see the indexing sketch below). In addition, this paper takes the midday period T1 as 11:00-14:00 and the evening period T2 as 17:00-19:00.
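As an illustration of how a continuous operating point is mapped to the discrete Q-learning state, the sketch below uses the interval widths stated above; the function name and the example values are assumptions.

```python
# Map a continuous operating point to the discrete Q-learning state tuple,
# using the interval widths stated above (60 kW for renewable output,
# 120 kWh for stored energy, 1 h for time). Example values are hypothetical.

P_STEP_KW, E_STEP_KWH = 60.0, 120.0

def discretize_state(hour, renewable_output_kw, stored_energy_kwh):
    """Return the discrete state tuple (S_t, S_P, S_E)."""
    s_t = int(hour)                                 # 1-hour scheduling periods
    s_p = int(renewable_output_kw // P_STEP_KW)     # renewable output bin
    s_e = int(stored_energy_kwh // E_STEP_KWH)      # stored energy bin
    return (s_t, s_p, s_e)

# Example: at 12:00 with 430 kW of wind/PV output and 520 kWh stored
print(discretize_state(12, 430.0, 520.0))           # -> (12, 7, 4)
```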

The energy storage system and other related parameters are shown in Tables 2 and 3.

Wind-storage and PV-storage system parameters

Node   System    Capacity /(kW-kW·h)   Max charge/discharge power /kW   Min stored energy /kW·h   Charge/discharge efficiency   End-of-period stored energy requirement /kW·h
8      PV-BESS   550-800               50                               120                       0.9                           400
24     WG-BESS   700-800               70                               60                        0.9                           300
24     PV-BESS   700-800               70                               60                        0.9                           400
30     WG-BESS   550-800               60                               60                        0.9                           300

Other simulation example parameters

Parameter                        Value   Parameter   Value
$$w_t^{pd}$$ /(yuan/kW·h)        0.30    λ1          12000
$$w_t^{wg}$$ /(yuan/kW·h)        0.07    λ2          12000
$$w_t^{pv}$$ /(yuan/kW·h)        0.07    KDNO        0.95-1.15
$$w_t^{us}$$ /(yuan/kW·h)        0.16    KDGO        0.90-1.10
$$C_t^M$$ /(yuan/time)           12      KLA         0.97-1.12
Scenario comparison

To illustrate the validity of the model in this paper, the current mainstream trading mechanisms in the electricity retail market, fixed tariffs and time-of-use tariffs, are introduced for comparison.

Scenario 1: New energy in the distribution network is sold at a fixed tariff.

Scenario 2: Time-of-use tariffs are adopted, and the new energy sources have the right to formulate their own electricity sales strategies.

In Scenarios 1 and 2, the interruptible loads do not participate in the market, the distribution network maximizes the acceptance of the new energy, and there is no market gaming behavior among the three parties.

Scenario 3: The three-party game is conducted by the method of this paper.

Scenario 2 is solved by the Q-learning algorithm, and Scenario 3 by the Nash-Q method.

In terms of economic benefits, the benefits and costs of the parties in the three scenarios are shown in Table 4. In Scenario 1, since the fixed tariff provides no guidance for the new energy output, the output is determined entirely by the resources' own characteristics; the new energy profit is the lowest among the three scenarios, while the cost of the gas turbine dispatched by the DNO under the guaranteed power supply mechanism is the highest. Scenario 2 adopts the time-of-use tariff and the distribution network maximizes the acceptance of new energy; since prices are higher in the midday and evening periods when the new energy supplies more power, the profits of the LA, DGO, and DNO increase over Scenario 1 by 268.27 yuan (12.93%), 191.89 yuan (6.56%), and 143.35 yuan (6.9%), respectively. The time-of-use tariff also guides the new energy to participate in load peak shaving, so the DNO's cost of dispatching the gas turbine is reduced by 23.67 yuan (4.81%) compared with Scenario 1.

Benefits and partial costs in different scenarios

             DNO profit /yuan   DNO gas turbine dispatch cost /yuan   DNO IL purchase cost /yuan   DNO guaranteed supply cost /yuan   LA profit /yuan   LA IL electricity sales benefit /yuan   DGO profit /yuan
Scenario 1   2157.16            492.51                                0                            0                                  2074.31           0                                       2924.35
Scenario 2   2341.52            468.84                                0                            0                                  2342.58           0                                       3116.24
Scenario 3   2738.67            174.38                                254.17                       0                                  2617.63           254.17                                  3435.19

The results of the tripartite game of load supply power at node 24 under the three scenarios are shown in Fig. 5. In the figure, P-DNO, P-WG, and P-PV are the power supply of DNO, DGO, and LA, respectively, and EV-WG and EV-PV are the values of power stored in wind storage and optical storage, respectively. Figures (a)~(c) represent the power supply game results of Scenario 1, Scenario 2 and Scenario 3, respectively.

Figure 5.

Tripartite power supply game results of node 24 under different scenarios

From Fig. 5(b), it can be seen that in the midday period the PV-storage and wind-storage systems supply power at maximum output while satisfying the tripartite revenue requirements, so the DNO supply power reaches a minimum of 11 kW; in the evening period the new energy sources reduce their own supply power to satisfy the energy storage constraints, so the DNO supply power reaches a maximum of 347 kW. The peak-valley difference of the DNO supply is 336 kW. It can be seen that the time-of-use tariff mechanism can improve the profit of new energy feed-in, but it sacrifices the DNO's profit and increases the fluctuation of the DNO's supply power.

Comparing Fig. 5(c) with Figs. 5(a) and 5(b), it can be seen that the DNO's output increases significantly during the peak load hours: the DNO competes with the new energy for the load supply by adjusting its offer price, which improves the DNO's position in the market game and also restrains, to a certain extent, the tendency of the new energy and other subjects to supply power arbitrarily in pursuit of profit. At the same time, because the DNO has bargaining power, its output in other hours is more moderate; the peak-valley difference of its supply power is only 128 kW, and the DNO's supply pressure becomes smaller.

Scenario 3 adopts game bargaining, in which all three parties have bargaining rights. The load supply power and the benefits obtained by the three parties in each round of the game for the typical time period t=14 are shown in Figs. 6 and 7, respectively. Figures 6(a)~(c) show the two-party supply of load 8, the three-party supply of load 24, and the two-party supply of load 30, respectively.

Figure 6.

Tripartite supply power in each round at t=14

Figure 7.

Tripartite benefits in each round at t=14

At the beginning of the game, the initial offers of the three parties differ greatly, so the load supply power obtained from the three-party competition fluctuates considerably in each clearing round, and the profit obtained by each party fluctuates as well. To compete for a larger share of the load supply, the three parties adjust their electricity prices downward, so each party's benefit gradually declines. Late in the game, because the market rules require prices to be adjusted within a certain range, the prices tend to stabilize, the profits of the three parties also stabilize, and the game reaches equilibrium. At this point, however, the price of electricity is still higher than the fixed tariff, so compared with Scenario 1 the new mechanism can enhance the interests of all parties while also incentivizing new energy and other subjects to participate actively in the market, which is conducive to promoting market reform.

Optimal offer strategy results

The purpose of this section is to compare the traditional reinforcement learning algorithm (DQN) and the MADDPG deep reinforcement learning algorithm in the solution process in order to verify the effectiveness of the proposed MADDPG method. The default hyperparameters of both algorithms are set as follows: training rounds N = 12000, time intervals $$T = 24\ (t = 1, 2, \ldots, 24)$$, neural network learning rate learning_rate = 10−3, discount factor γ = 0.9, and action selection step ka = 0.1.

In the experiment, one day is designated as a training round, and the optimal offers of each time interval are averaged to observe the power producer's offer strategy in an unknown environment. The convergence curves of the offer strategy variable $$\alpha_{g,t}^{strategy}$$ under the two algorithms are shown in Fig. 8.

Figure 8.

Convergence curve of the offer strategy variable $$\alpha_{g,t}^{strategy}$$ under the two algorithms

Under the MADDPG algorithm, generator g explores and learns the optimal offer strategy more cautiously in an unknown electricity market environment: its $$\alpha_{g,t}^{strategy}$$ fluctuates within [0, 0.47165] and gradually stabilizes at 0.2534 after about 5,000 training rounds. In contrast, the $$\alpha_{g,t}^{strategy}$$ obtained with the traditional reinforcement learning algorithm behaves much more aggressively: its fluctuation range widens to [0, 0.51156], and it finally stabilizes at 0.2644.

The comparison of the optimal offer strategy variable $$\alpha_{g,t}^{strategy}$$ under the two algorithms is shown in Fig. 9. The optimal offer strategy variable obtained with the traditional reinforcement learning algorithm fluctuates too much, which is not conducive to the power producer's learning and also increases its risk to a certain extent. The MADDPG deep reinforcement learning algorithm can avoid the overfitting problem, ensuring that the power producer obtains a higher return while also, to a certain extent, avoiding the risk of losses caused by insufficient understanding of the market environment.

Figure 9.

Comparison of the best quotation strategy variable under the two algorithms

The comparison of the strategy variance and the total returns of the two algorithms is shown in Table 5. The strategies learned under the MADDPG algorithm have smaller variance and more stable results, and obtain a higher total return per round, reaching a mean value of $2206.52. This validates the effectiveness of the MADDPG deep reinforcement learning algorithm proposed in this paper.

Comparison of strategy variance and total returns under the two algorithms

Algorithm   $$\mu(\alpha_{g,t}^{strategy})$$ ($/MWh)   $$\sigma(\alpha_{g,t}^{strategy})$$ ($/MWh)   Average round total return ($)   Standard deviation of total return ($)
DQN 0.2617 0.04021 812.38 542.13
MADDPG 0.2504 0.03257 2206.52 478.24
Design of universal settlement mechanisms for electricity markets taking into account game equilibria

Synthesizing the construction and solution of the agent-based equilibrium model of the power market in the preceding sections, this paper designs a universal settlement mechanism for the power market on the basis of the principles of power market settlement.

Principles of Electricity Market Settlement

The special characteristics of the power market and the indispensability of deviation power settlement in the electricity spot market

Like general commodity trading, the various power trading products have their own rules in terms of trading method, trading volume, price formation mechanism, delivery mode, and settlement mode. However, apart from power financial derivatives trading, the power market differs from general commodity markets: a centrally traded electricity spot market must be established, and a deviation power settlement mechanism must be established within it, for four reasons:

First, electric energy is special in that generation, transmission, distribution, and consumption are completed simultaneously. Power system operation and power market transactions must therefore maintain a real-time balance between generated and consumed power while satisfying grid security operating conditions, in order to ensure safe, reliable, and high-quality grid operation and the execution of power trading contracts. This requires market mechanisms that replace the original planned dispatch, namely the electricity spot market and the ancillary services market.

Second, if every electric energy transaction could be delivered exactly as agreed, then under normal grid operating conditions the power market mechanism (including transmission congestion management and re-dispatch of generation and consumption) would fully guarantee the real-time power balance of the system. However, due to the uncertainty of renewable generation output, temporary maintenance of generation and transmission equipment, and so on, the supply side may not be able to provide electricity as agreed, and the demand side may also deviate from the agreed curve because of load forecasting errors, causing actual power imbalance and jeopardizing grid operating security. It is therefore necessary to establish a deviation power settlement mechanism whose constraint strength is differentiated according to the performance awareness, performance ability, and potential of the market players, so as to push market players to assume the responsibility of power balancing under their trading agreements, i.e., to generate and consume electricity in accordance with the agreements or to pay the cost of breach of contract (i.e., assume the economic responsibility), which compensates the expenses incurred by the power dispatch organization in maintaining the real-time power balance of the system.

Third, within a given time period, the electric energy of each transaction of a generation enterprise is delivered in that period and is simultaneously transmitted by the grid to the power users. The energy delivered by a market player in a period cannot be metered separately for each of its transactions; the actual generation or consumption in each period is the only measurement data available for settlement. Consequently, a unified calculation and settlement rule can only be formulated for the total deviation from the total contracted energy. In this way, all electric energy contracts can be regarded as having been delivered as agreed, can be decoupled from each other and settled separately, and no question of settlement order arises.

Fourth, considering the time sequence, the electricity spot market is the last market mechanism that guarantees real-time power balance. The power imbalance caused by delivery defaults of all electricity transactions, and the short-term costs paid by the power dispatch organization to guarantee real-time balance, can be reflected in the spot market price. Therefore, every power market needs to establish a deviation power settlement mechanism in the electricity spot market.

Principle of Settlement of Deviation Power in Electricity Markets

For centralized power markets, power financial derivatives transactions are used to hedge spot market price risks, and these contracts are settled in cash without physical delivery. The deviation power calculation and settlement rules are therefore related only to the trading products and trading rules of the electric energy spot market.

For decentralized electricity markets, medium- and long-term contracts for electric energy are subject to physical delivery, and electricity spot is traded in the day-ahead, intraday, and real-time balancing markets. Considering the time sequence, the real-time balancing market is the last market mechanism guaranteeing the real-time power balance of the system. The power demand in the real-time balancing market is the difference between the actual system load and the sum of the quantities traded in the medium- and long-term market, the day-ahead market, and the intraday market. In other words, the demand-side load forecasting errors prior to the real-time balancing market, the power imbalances caused by delivery defaults of all power transactions, and the upward or downward adjustments of unit output made by the power dispatchers to guarantee real-time balance, together with their short-term costs, are all reflected in the prices of the real-time balancing market.

Design of a universal settlement mechanism for the electricity market

In view of the special characteristics of the electricity market described above, to study and design its settlement mechanism it is first necessary to clarify the relationship of the trading of the various power products with power dispatch and contract delivery, and their relationship with the settlement of the electricity spot market.

Power financial derivatives are settled in cash: their contracted power does not need to be physically delivered, so the power dispatch organization does not need to arrange generation plans for it, and there is no settlement issue arising from differences between contracted and actual generation or consumption. The settlement of the electricity spot market therefore has nothing to do with power financial derivatives transactions. However, since the spot market price is used as the settlement reference price of power financial derivatives, their transaction price is strongly influenced by expectations of future spot prices. Conversely, the trading scale, trading price, and liquidity of the power financial derivatives market have a considerable influence on power investment and will also have some impact on future spot market prices. Therefore, the fairness and reasonableness of the trading and settlement mechanisms of the electricity spot market are crucial to the sound operation of the whole power market and to letting the market mechanism optimize resource allocation.

Medium- and long-term electricity transactions are physical transactions requiring physical delivery. Even if a generation enterprise transfers its physical contract to a third party, the third party has to provide the electric energy and make the delivery according to the contract. The physical delivery of a medium- and long-term contract must be transmitted through the power grid and is therefore subject to transmission capacity constraints. Both parties to the transaction need to formulate their trading strategy and carry out the transaction in accordance with the available transmission capacity of the relevant line (or section) released by the market, so as to avoid economic losses caused by transmission congestion preventing delivery. If necessary, the contract terms should also stipulate the economic responsibility of both parties in the event of losses caused by transmission congestion.

The relationship between trading, dispatch, delivery, and settlement in the electricity spot market varies with the market model. In the centralized power market, except for a few units tasked with guaranteeing the safe and stable operation of the grid, all centrally dispatched power sources are required to trade their feed-in energy through the spot market. The generation curves of the units cleared in the day-ahead and real-time markets are then used as the generation dispatch plans to realize power delivery, and settlement is carried out at the market clearing price. The market settlement price can simply be the market clearing price; if the market clearing period is 5 minutes and the settlement period is 15 minutes, the settlement price is the weighted average or arithmetic mean of the three 5-minute clearing prices within the 15-minute interval. If the market is cleared at locational marginal prices but settled at zonal prices, the settlement price of each zone is the weighted average of the clearing prices of the nodes in that zone. Apart from captive power plant users, all power users (or their power purchasing agents) are required to settle in the spot market based on time-phased electricity consumption and market clearing prices.
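For example, under the arithmetic-mean variant (with hypothetical prices), three 5-minute clearing prices of 320, 350, and 335 yuan/MWh within one 15-minute settlement period give a settlement price of: $$\frac{320 + 350 + 335}{3} = 335 \ \text{yuan/MWh}$$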

In the decentralized electricity market, the portion of demand met through medium- and long-term electricity trading does not need to be repurchased in the electricity spot market; electricity demand not covered by medium- and long-term contracts is declared in the spot market. Generation enterprises declare to the power dispatch organization the time-phased trading power (trading curves) of their medium- and long-term contracts; the corresponding generating capacity is no longer offered to the spot market, while the remaining capacity is offered to the spot market and competes there. The power dispatch organization arranges the generation plan of each unit based on the medium- and long-term contract trading curves and the generation curves of the units cleared in the spot market, carries out power dispatch, and realizes power delivery.

In summary, the universal settlement mechanism for the power market is to establish a deviation power settlement mechanism based on the real-time balancing market price, so that the settled quantities of the various types of power transactions are decoupled and the different trading varieties are settled independently without stipulating a settlement priority. In view of the large differences among the settlement mechanisms of the spot pilot power markets and the high cost of building and upgrading them, it is proposed to take the establishment of the deviation power settlement mechanism as the entry point: standardize the settlement rules for power market transactions, realize the separate settlement of each contract, make the market settlement imbalance funds transparent, and generalize the settlement software system. In addition, in order to promote performance by market players and avoid grid security risks caused by excessive power deviations in regions adopting the decentralized market model, a differentiated deviation power settlement price mechanism can be adopted at the initial stage of the market according to the impact of the deviation on system balance: deviation power that helps to promote the real-time balance of the system is settled at the real-time balancing market price, while deviation power that is not conducive to the real-time balance of the system is settled with a punitive pricing mechanism. A sketch of such a settlement calculation follows.
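A minimal sketch of this deviation settlement logic is given below; the contract data, the real-time balancing price, and the penalty factor are hypothetical assumptions used only to illustrate the decoupled, per-contract settlement plus deviation settlement described above.

```python
# Sketch of the universal settlement logic: each contract is settled independently
# at its own price, and only the total deviation (actual minus total contracted energy)
# is settled at the real-time balancing price, with a punitive adjustment when the
# deviation works against the system balance. All numbers are hypothetical.

def settle(contracts, actual_mwh, rt_price, system_short, penalty_factor=1.5):
    """Settle one generator for one period.
    contracts: list of (energy_mwh, price_per_mwh) tuples, each settled on its own terms.
    system_short: True if the system lacks power in this period."""
    contract_revenue = sum(e * p for e, p in contracts)   # decoupled per-contract settlement
    contracted = sum(e for e, _ in contracts)
    deviation = actual_mwh - contracted                    # >0: over-delivery, <0: under-delivery

    helps_balance = (deviation > 0) == system_short        # extra power helps a short system, etc.
    if helps_balance or deviation == 0:
        dev_price = rt_price                               # settle at the real-time balancing price
    elif deviation < 0:
        dev_price = rt_price * penalty_factor              # punitive buy-back price for a harmful shortfall
    else:
        dev_price = rt_price / penalty_factor              # punitive (reduced) price for a harmful surplus

    deviation_settlement = deviation * dev_price           # negative when energy must be bought back
    return contract_revenue + deviation_settlement

# Example: two contracts (settled separately), actual output below the contracted total
# while the system is short, so the shortfall is bought back at a punitive price.
total = settle(contracts=[(50.0, 400.0), (30.0, 380.0)], actual_mwh=70.0,
               rt_price=500.0, system_short=True)
print(total)   # 50*400 + 30*380 - 10*500*1.5 = 23900.0
```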

Conclusion

In this paper, the construction and solution of the agent-based equilibrium model of the power market are realized by combining infinitely repeated game theory with the MADDPG deep reinforcement learning algorithm, and a universal settlement mechanism for the power market is designed in combination with the principles of electricity settlement.

The modified IEEE 33-node distribution network system is selected for the case study. In Scenario 2, which adopts the time-of-use tariff, the profits of the load aggregator (LA), the distributed energy operator (DGO), and the distribution network operator (DNO) increase by 268.27 yuan (12.93%), 191.89 yuan (6.56%), and 143.35 yuan (6.9%), respectively, compared with Scenario 1, and the DNO's cost of dispatching the gas turbine is reduced by 23.67 yuan (4.81%). Compared with Scenario 2 (time-of-use tariff) and Scenario 1 (fixed tariff), Scenario 3 with the game bargaining model distributes benefits more reasonably: it not only guarantees the safety and quality of power supply but also incentivizes new energy and other major players to participate in the market, motivates new energy to participate actively in peak shaving, and reduces the risk associated with new energy accommodation in the distribution network.

In addition, the fluctuation range of the generator's optimal offer strategy $$\alpha_{g,t}^{strategy}$$ solved by the MADDPG deep reinforcement learning algorithm used in this paper is [0, 0.47165], and it gradually stabilizes at 0.2534 after about 5,000 training rounds, whereas the strategy solved by the traditional reinforcement learning algorithm fluctuates over the wider range [0, 0.51156] and ultimately stabilizes at 0.2644. Compared with the traditional reinforcement learning algorithm, the strategies learned under the MADDPG algorithm have smaller variance and more stable results and obtain a higher total return per round, reaching a mean value of $2206.52, which proves the effectiveness of the MADDPG deep reinforcement learning algorithm proposed in this paper.

Finally, on the basis of systematically studying the relationship between various types of power transactions and power dispatch, power delivery and settlement, this paper proposes a universal settlement mechanism applicable to various types of power market models.