Research on Reinforcement Learning Based Regulation Scheme for Renewable Energy System in Green Buildings
Published online: 21 Mar 2025
Received: 22 Oct 2024
Accepted: 02 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0607
© 2025 Yin Li et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
In today's society, as environmental problems and the energy crisis grow increasingly severe, the concept of green building is receiving more and more attention. Green building emphasizes maximizing resource conservation, protecting the environment, and reducing pollution throughout the whole life cycle of a building, while providing people with healthy, comfortable, and efficient usable space [1-4]. The utilization of renewable energy is a crucial part of green building: it not only reduces the building's energy consumption and its dependence on traditional energy sources, but also provides the building with a clean and sustainable energy supply [5-7].
Renewable energy refers to energy sources that can be continuously regenerated and perpetually utilized in nature, such as solar, wind, hydroelectric, and bioenergy. Utilizing renewable energy in green buildings is significant in several respects [8-9]. Reducing energy consumption and greenhouse gas emissions: traditional buildings often rely on non-renewable sources such as fossil fuels, whose combustion produces large amounts of carbon dioxide, sulfur dioxide, and other greenhouse gases and pollutants, causing serious environmental damage. Using renewable energy can effectively reduce a building's energy demand, cut greenhouse gas emissions, and relieve the pressure of global climate change [10-13]. Reducing building operating costs: although the initial investment in renewable energy may be relatively high, its operation and maintenance costs are low in the long run. As the technology advances and costs gradually fall, renewable energy can save building owners substantial energy costs and improve the economic efficiency of buildings [14-17]. In addition, renewable energy utilization can provide buildings with a more stable and comfortable indoor environment [18].
Literature [19] reveals the opportunities and challenges of renewable energy in green buildings, the most significant challenge being the high upfront cost of renewable energy technologies; it also points out that the reliability issues posed by renewable energy require effective energy storage solutions and grid integration strategies. Literature [20] reviews emerging practices for integrating renewable energy in the building sector and, based on a case study, notes that integrating renewable energy into buildings can meet their energy needs in different respects. Literature [21] presents a literature review on optimization for solving renewable energy problems in green building rating systems, reconstructing the variables in renewable energy optimization and implementing an appropriate CA; by proposing a framework consisting of renewable energy optimization, green innovations, and CA, it links this work to recent reviews of optimization in the literature. Literature [22] emphasizes the importance of energy savings through building energy efficiency by describing several key aspects of building energy efficiency and exploring their economic and environmental impacts, and realizes buildings with integrated renewable systems such as hot-water heating and solar photovoltaic electrification. Literature [23] develops a systematic analysis of HRES, stating that such a system can transform a facility into a green building and reduce dependence on conventional energy sources by generating clean energy with close to zero GHG emissions, and its effectiveness is verified in simulation experiments. Literature [24] introduces the latest green building evaluation standards of China, the United Kingdom, and the United States, compares them in terms of energy efficiency and indoor and outdoor environmental quality, outlines the characteristics of each standard system, and puts forward suggestions to improve China's green building evaluation standards. Literature [25] aims to develop a new methodology for designing and analyzing the effectiveness of RES-based energy supply strategies for green buildings, and determines the feasibility and advantages of the proposed hybrid system through a comparison with conventional energy supply systems. Literature [26] applies a powerful reinforcement learning control methodology to minimize energy and power losses in the distribution network, an optimization embodied in the BEEL system; the comparison shows that the proposed method outperforms other methods. Literature [27] examines the economic impact of increasing energy expenditure; based on economies-of-scale theory and model simulation, it concludes that there is a large gap between energy generation and use, recommends increasing energy production to reduce this gap, and emphasizes the urgency of investing in renewable energy projects.
In this paper, a green building renewable energy system comprising photovoltaics, the power grid, and other components is constructed, and the equipment models within the system, such as the fan coil unit, the ground source heat pump unit, and the battery, are studied. Based on reinforcement learning, the Deep Deterministic Policy Gradient (DDPG) algorithm is adopted to obtain a more effective and attractive regulation strategy for the green building renewable energy system. To handle the continuous state-space problem, the constraints satisfied by the PV power, user load, SOC, and real-time tariff are determined from realistic physical limits, including the maximum and minimum charging and discharging power of the battery. The reward function is set consistently with those of the DQN and Q-learning models, taking into account the user's comprehensive energy cost and battery operation. The winter months of January, November, and December 2021 in Changsha, Hunan Province, China, are set as the simulation scenario, and, after the necessary parameter settings, simulation experiments on the regulation and optimization of the green building renewable energy system are carried out to test the actual regulation effect of the DDPG-based strategy proposed in this paper.
As countries around the world become increasingly concerned about environmental issues, cleaner energy is becoming a common goal, and the use of renewable energy is one of the most important ways to achieve it. Currently, the carbon emissions associated with the energy consumption of the building industry account for about 22% of the global total, and developing renewable energy sources such as photovoltaic (PV) and wind power to meet the growing demand for building energy is an important way to realize the dual-carbon goal.
The renewable energy system for green buildings constructed in this chapter includes a supply side consisting of photovoltaic panels, wind turbines, and the power grid, and a demand side consisting of ground source heat pumps (GSHPs), fan coil units (FCUs), and various appliances. The system uses batteries as energy storage units, and the GSHP is connected to a water tank to ensure a steady supply of heat and cold.
Next, the modeling of the equipment within the system will be further discussed.
The system constructed in this chapter controls the room temperature primarily by varying the amount of cooling/heating provided by the FCU. The cooling/heating output of the FCU is determined by the supply airflow set at each moment, as given in Eqs. (1)-(2).
The power consumption of the FCU at each stage can be calculated from the set airflow at each moment together with the rated airflow and rated power of the FCU, as shown in Eq. (3).
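As a concrete illustration of this relationship, the following sketch computes the FCU power draw from the airflow ratio; the cubic exponent (the common fan affinity law) and the function name are assumptions of this sketch, since the exact form of Eq. (3) is not reproduced here.

```python
def fcu_power(airflow: float, rated_airflow: float, rated_power: float,
              exponent: float = 3.0) -> float:
    """Estimate FCU power consumption from the set airflow.

    The rated power is scaled by the airflow ratio; a cubic exponent
    (the usual fan affinity law) is assumed here and can be replaced by
    the exponent actually used in Eq. (3).
    """
    ratio = max(0.0, min(airflow / rated_airflow, 1.0))  # clamp the ratio to [0, 1]
    return rated_power * ratio ** exponent
```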
When the water in the tank enters the GSHP, the compressor starts when the water temperature is below the heat pump temperature set point (set to 40°C in this study). The water supply temperature to the GSHP can be calculated according to Eq. (4).
The GSHP power consumption can be further calculated based on the coefficient of performance (COP). The COP formula is obtained by fitting the actual operating data, as shown in Eqs. (5)-(6).
Based on the COP and the heating power of the GSHP, the power consumption of the GSHP can be calculated using Eq. (7); by the definition of the COP, the electrical power consumed equals the heating power divided by the COP.
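A minimal sketch of this step is given below; the COP coefficients are placeholders rather than the fitted values behind Eqs. (5)-(6), and only the relation P = Q/COP is taken as given.

```python
def cop_from_supply_temp(t_supply_c: float, a: float = 6.0, b: float = -0.05) -> float:
    """Illustrative linear COP fit, COP = a + b * T_supply; the coefficients a and b
    are placeholders, not the values fitted from the actual operating data."""
    return a + b * t_supply_c

def gshp_power(heating_power_kw: float, cop: float) -> float:
    """Electrical power drawn by the GSHP: by the definition of the coefficient of
    performance, P_el = Q_heat / COP (the relation expressed by Eq. (7))."""
    return heating_power_kw / cop
```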
In this study, the capacity state of the battery is represented by the State of Charge (SOC). The model divides a day into a number of equal time steps and updates the SOC at each step according to the charging/discharging power and the charging/discharging efficiency, as shown in Eq. (8).
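The following sketch illustrates one step of such an SOC update in the standard energy-balance form; the function and parameter names are illustrative, and the 90% charge/discharge efficiency matches the storage device parameters given later.

```python
def soc_update(soc: float, power_kw: float, dt_h: float, capacity_kwh: float,
               eta_ch: float = 0.9, eta_dis: float = 0.9) -> float:
    """One-step SOC update: power_kw > 0 means charging, power_kw < 0 discharging.

    The charging efficiency reduces the energy actually stored, while the
    discharging efficiency increases the energy drawn from the battery
    for a given delivered output.
    """
    if power_kw >= 0:
        delta = eta_ch * power_kw * dt_h / capacity_kwh
    else:
        delta = power_kw * dt_h / (eta_dis * capacity_kwh)
    return min(max(soc + delta, 0.0), 1.0)  # keep SOC within [0, 1]
```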
The charging and discharging process causes a loss in battery capacity. When the charge/discharge cycle depth of the battery is ΔSOC, the maximum number of charge/discharge cycles before failure is given by Eq. (9), which characterizes the cycle-life curve of the battery.
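A commonly used empirical form for this cycle-life relationship, given here only to illustrate the dependence on cycle depth (the fitted constants in Eq. (9) may differ), is the power law

$$N_{\max}(\Delta SOC)=N_{100}\cdot(\Delta SOC)^{-k_{p}},$$

where $N_{100}$ is the number of cycles to failure at full cycle depth and $k_{p}$ is a fitted exponent; the deeper each cycle, the fewer cycles the battery can sustain.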
The renewable energy generation data are actual measured data. The installed capacity of the wind turbine is 3 kW, the installed capacity of the photovoltaics is 3.9 kW, and the generation data are recorded at minute-level time steps.
Data-driven reinforcement learning algorithms are widely used in energy system regulation because of their strong adaptability and low requirements on model accuracy. In this chapter, a reinforcement learning algorithm is used to obtain a more effective and attractive regulation strategy for the green building renewable energy system, coordinating the building's renewable energy across different time periods so as to raise the level of renewable energy consumption, improve the balance between supply and demand in the energy system, and maintain the stability of the power grid.
Reinforcement Learning (RL) is a machine learning approach within artificial intelligence concerned with building programs that can solve problems requiring intelligence. RL is distinctive in that it learns by trial and error from feedback that is sequential, evaluative, and sampled, and it can employ powerful nonlinear function approximators. In other words, an RL program learns to perform tasks or solve problems better through repeated trial and error. Owing to its strong generality, reinforcement learning has a wide range of applications in many fields.
Reinforcement learning is in essence the process by which an agent learns to make optimal decisions while interacting with its environment. The rule governing how actions are chosen based on states and rewards is called the policy π.
The Markov Decision Process (MDP) is based mainly on the Markov process and dynamic programming theory; it provides a mathematical framework for sequential decision making that can represent and handle decision problems with uncertainty and delayed feedback [28]. In general, a Markov decision process is defined by a five-tuple consisting of the state space, the action space, the state transition probability, the reward function, and the discount factor.
An agent interacts with its environment (the MDP) at discrete time steps by executing a policy π: at each step it observes the current state, selects an action, and receives a reward together with the next state from the environment.
The learning objective of reinforcement learning in a Markov decision process is to find the optimal policy π* that maximizes the expected discounted cumulative reward.
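In standard notation (which may differ slightly from the symbols used elsewhere in this paper), this objective can be written as

$$\pi^{*}=\arg\max_{\pi}\;\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right],\qquad 0\le\gamma<1,$$

where $r_t$ is the reward at step $t$ and $\gamma$ is the discount factor.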
In a Markov decision process, each action lasts for one discrete time step, whereas in a semi-Markov decision process each temporally abstracted action lasts a finite number of time steps. A Markov decision process equipped with a set of options becomes a semi-Markov decision process, and the optimal option value function over the available options is used to select the optimal option, whose internal policy is then executed. Formally, the option value function is an estimate of the long-term cumulative return obtained by initiating the option in a given state and following the policy thereafter.
Q-Learning is a value-based reinforcement learning algorithm. The Q-value refers to the expected gain obtained by taking a given action in a given state at a certain moment, and the environment feeds back the corresponding reward according to the agent's action. The main idea of the algorithm is to store the gain Q of each state-action pair in a Q-table and then select the action with the largest Q-value. When updating the Q-function, Q-Learning usually adopts an ε-greedy strategy to balance exploration and exploitation.
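The standard tabular Q-Learning update, written here in common notation, is

$$Q(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha\left[r_{t}+\gamma\max_{a'}Q(s_{t+1},a')-Q(s_{t},a_{t})\right],$$

where $\alpha$ is the learning rate; the ε-greedy rule selects a random action with probability ε and the action with the largest Q-value otherwise.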
The Deep Q-Network (DQN) algorithm combines Deep Neural Networks (DNNs) with Q-Learning. Its target network mechanism refers to the use of two neural networks with the same structure but different parameters: the Q-eval network holds the most recent parameters, while the Q-target network uses parameters that are several steps old [30]. The Q-function parameterized by the deep neural network is denoted as Q(s, a; θ).
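The role of the two networks can be summarized by the standard DQN target and loss (given here in their usual form, not reproduced from the paper):

$$y_{t}=r_{t}+\gamma\max_{a'}Q(s_{t+1},a';\theta^{-}),\qquad L(\theta)=\mathbb{E}\left[\left(y_{t}-Q(s_{t},a_{t};\theta)\right)^{2}\right],$$

where $\theta$ are the Q-eval network parameters and $\theta^{-}$ the periodically copied Q-target parameters.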
Outputting actions directly from the current state leads to another very important class of algorithms in reinforcement learning, namely policy gradient methods, which parameterize the policy directly and update its parameters along the gradient of the expected return.
The way the trust region policy optimization algorithm enforces its policy constraint is cumbersome to implement. To further improve the efficiency of policy updates, the Proximal Policy Optimization (PPO) algorithm proposes a new way of constraining the difference between the old and new policies during the update, eliminating the effect of drastic policy changes by clipping the magnitude of the policy update.
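The clipping idea is captured by PPO's standard clipped surrogate objective, stated here in its usual form rather than the paper's exact notation:

$$L^{\mathrm{CLIP}}(\theta)=\mathbb{E}_{t}\left[\min\left(r_{t}(\theta)\hat{A}_{t},\;\mathrm{clip}\left(r_{t}(\theta),1-\epsilon,1+\epsilon\right)\hat{A}_{t}\right)\right],\qquad r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}\mid s_{t})},$$

where $\hat{A}_t$ is the advantage estimate and $\epsilon$ the clipping range.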
Actor-Critic reinforcement learning algorithms, another class of policy-based methods, integrate value-function-based and policy-based approaches: the Actor directly optimizes the policy parameters, while the Critic estimates the value function used to evaluate the Actor's policy and reduce the variance of the policy gradient.
Unlike stochastic policy gradient algorithms, whose output policy is a probability distribution over the action space, the deterministic policy gradient (DPG) algorithm optimizes a deterministic policy that maps each state directly to a single action.
The output of the deterministic policy gradient algorithm is a deterministic action, so it performs better on problems in the continuous action space, and is well able to solve continuous control problems such as robotics.
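The update direction used by DPG is given by the deterministic policy gradient theorem, stated here in its standard form:

$$\nabla_{\theta}J(\mu_{\theta})=\mathbb{E}_{s\sim\rho^{\mu}}\left[\nabla_{\theta}\mu_{\theta}(s)\,\nabla_{a}Q^{\mu}(s,a)\big|_{a=\mu_{\theta}(s)}\right],$$

where $\mu_{\theta}$ is the deterministic policy and $\rho^{\mu}$ the state distribution induced by it.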
Photovoltaic generation cuts the daytime peak load of the zero-energy residence, so the net load shows a "duck curve" pattern, and the user load peaks shift to 3:00-6:00 and 19:00-22:00, i.e., the periods in which heat pump operation and heating power demand are concentrated. The basic rule for controlling the battery storage system is therefore as follows (see the sketch below): during the peak hours of 3:00-6:00 and 19:00-22:00 the battery discharges according to the load demand, during 6:00-19:00 it is charged with photovoltaic power, and in the remaining hours the battery stays idle, neither charging nor discharging.
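A minimal sketch of this time-of-day rule is given below; the half-open treatment of the hour boundaries is an assumption, as the rule above does not specify it.

```python
def rule_based_battery_action(hour: int) -> int:
    """Return the battery action for a given hour of day:
    -1 = discharge, +1 = charge from PV, 0 = idle."""
    if 3 <= hour < 6 or 19 <= hour < 22:
        return -1   # peak load periods: discharge according to load demand
    if 6 <= hour < 19:
        return 1    # daytime: charge with photovoltaic power
    return 0        # remaining hours: neither charge nor discharge
```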
Before DQN was proposed, it was widely held in the academic community that value function approximation was difficult to use; the introduction of the experience replay buffer and the dual-network structure was a major innovation, and the Deep Deterministic Policy Gradient (DDPG) algorithm retains the experience replay buffer and dual-network structure of the DQN algorithm [31]. The DDPG algorithm thus consists of four main neural networks: an actor network, a critic network, a target actor network, and a target critic network. The actor network outputs a deterministic action for the current state, the critic network evaluates that action, and the target networks are updated slowly toward their online counterparts; the discount factor γ satisfies 0 ≤ γ ≤ 1.
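The two mechanisms DDPG inherits from this structure, the bootstrapped critic target and the slow (soft) target-network update, are sketched below in PyTorch; the module names and the τ = 0.002 rate (quoted later as the network update rate) are assumptions of this sketch.

```python
import torch

def soft_update(target: torch.nn.Module, online: torch.nn.Module, tau: float = 0.002) -> None:
    """Soft target-network update: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for tp, op in zip(target.parameters(), online.parameters()):
        tp.data.copy_(tau * op.data + (1.0 - tau) * tp.data)

def critic_target(reward, next_state, done, target_actor, target_critic, gamma: float = 0.99):
    """Bootstrapped target y = r + gamma * Q'(s', mu'(s')) used to train the critic."""
    with torch.no_grad():
        next_action = target_actor(next_state)
        return reward + gamma * (1.0 - done) * target_critic(next_state, next_action)
```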
The DDPG algorithm can handle continuous state-space problems, so only the components of the state space S need to be specified. No additional constraints are imposed on the individual components; according to realistic physical constraints, the PV power, user load, SOC, and real-time tariff satisfy Eqs. (26)-(28), respectively.
In the DDPG model, the current state consists of the PV power, the user load, the battery SOC, and the real-time electricity price at the current time step.
The purpose of this paper is to examine the strengths and weaknesses of various reinforcement learning algorithms for ZEH energy system management, ultimately leading to a more efficient and appealing regulation strategy. In the DDPG model, the reward function mainly takes into account the user's comprehensive energy cost and battery operation, consistent with the reward functions in the DQN and Q-learning models, which makes it easy to compare and evaluate the regulation effect in the specific case analysis.
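A minimal sketch of a reward with this structure is shown below; the two cost terms and their equal weighting are assumptions, since the paper's exact reward expression is not reproduced here.

```python
def reward(energy_cost_yuan: float, battery_wear_yuan: float) -> float:
    """Negative of the user's comprehensive energy cost plus a battery-operation
    penalty, so that lower cost and gentler battery cycling yield higher reward."""
    return -(energy_cost_yuan + battery_wear_yuan)
```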
In this chapter, the effectiveness of the proposed regulation strategy for renewable energy systems in green buildings will be evaluated.
For the optimization of the regulation strategy of the green building renewable energy system under winter conditions, the simulation scenario in this chapter corresponds to the winter months of January, November, and December 2021 in Changsha City, Hunan Province, China. The meteorological data of Changsha are used, the area of the PV panels is set to 40 m², and the winter PV generation calculated from the meteorological data is shown in Fig. 1. The energy storage device is of model 6-GFMJ-200 with a capacity of 7 kW·h, a charging/discharging efficiency of 90%, a charging/discharging power of 1.44 kW, and maximum/minimum states of charge of 0.9/0.2.

Photovoltaic power generation in winter
To account for the influence of real-time electricity prices on the system strategy, real-time winter electricity price data under similar climatic conditions from the Australian energy market website are used; the peak and trough electricity prices are set to 0.8 and 0.3 yuan/(kW·h), respectively, and the surplus-electricity feed-in tariff is 0.4536 yuan/(kW·h). The parameters of the electric heat pump and the building are shown in Table 1, and the upper and lower limits of the indoor comfort temperature are set to 21°C and 17°C, respectively.
Related parameters
| Window size/m² | Building heat capacity/(J·K⁻¹) | Thermal resistance between indoor and environment/(K·W⁻¹) | ASHP heating power/kW |
| --- | --- | --- | --- |
| 10 | 7453000 | 5.28×10⁻³ | 2 |
For the green building renewable energy system in this paper, the following control strategy is proposed as a benchmark model. The electric heat pump and the energy storage system control the system's operation by adjusting the operating power and the charging/discharging state, respectively. The operating power of the heat pump is determined by the current indoor temperature and electricity price, while the charging/discharging state, kept within the specified battery charging range, is determined by the current electricity price and PV generation. The advantage of the benchmark model is thus that it gives a definite control strategy from the current environmental parameters and adjusts dynamically in time to cope with environmental changes, so as to meet the user's comfort and economic needs. The specific control strategies are shown in Table 2.
Operation strategy of heat pump and energy storage
Operation strategy of heat pump

| Room temperature | Electricity price: low | Electricity price: medium (0.35 < price) | Electricity price: high |
| --- | --- | --- | --- |
| Low | 1 | 1 | 0.75 |
| Medium | 0.75 | 0.75 | 0.5 |
| High | 0.25 | 0 | 0 |

Operation strategy of energy storage

| Photovoltaic power generation | Electricity price: low | Electricity price: medium (0.35 < price) | Electricity price: high |
| --- | --- | --- | --- |
| Low | -1 | -1 | -1 |
| Medium | 1 | 0 | 1 |
| High | 1 | 0 | 1 |
The minimum optimization step is set to 15 min and the optimization horizon to 31 days, i.e., there are 2976 optimization periods. The system model is trained on the November and December datasets for 1500 episodes, with 31 consecutive days selected at random during training, while the January dataset is used to validate the performance of the DDPG algorithm. The Q-network and the target network each contain three fully connected hidden layers with 128, 256, and 256 neurons, respectively; rectified linear units are used as the activation function of the hidden layers, and the Adam optimizer is used to update the network weights. The main hyperparameters include a learning rate of 0.0001, a discount factor of 0.99, a mini-batch size of 32, and a network update rate of 0.002.
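For reference, a sketch of a critic network and optimizer consistent with the settings listed above is given below (PyTorch assumed); the state and action dimensions are placeholders, with the state taken to contain the PV power, load, SOC, and tariff described earlier.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q-network with three fully connected hidden layers (128, 256, 256) and ReLU."""
    def __init__(self, state_dim: int = 4, action_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

critic = Critic()
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-4)  # learning rate 0.0001
GAMMA, BATCH_SIZE, TAU = 0.99, 32, 0.002  # discount factor, mini-batch size, update rate
```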
In order to evaluate the performance of the Deep Deterministic Policy Gradient (DDPG) algorithm in the proposed green building renewable energy system regulation strategy in this paper, the following three system regulation schemes are used for performance comparison.
1) Scheme I. This scheme does not use any energy storage system, but uses an ON/OFF strategy to control the heat input power to satisfy heating demands.
2) Scheme II. The electrical and thermal energy flows are scheduled without coordination in this scheme.
3) Scheme III. This scheme assumes that all energy storage systems are dispatched with full knowledge of all uncertain parameters. Although such information is difficult to obtain in practice because of the stochastic nature of the uncertain parameters, this scheme provides a lower bound on the achievable operating cost and is used here only as a reference for optimal performance.
The convergence process of the DDPG algorithm in this paper is shown in Fig. 2. Because of the exploration probability and the stochastic system parameters, the reward obtained in each episode fluctuates within a certain range. To show the trend of the episode rewards more clearly, the average reward of the preceding 200 episodes is plotted at each episode as the vertical coordinate, giving the average reward curve, i.e., the orange curve in the figure. As the number of episodes increases, the average reward gradually rises and becomes increasingly stable, which indicates the good convergence of the proposed algorithm.

The convergence process of DDPG algorithm
The operational effectiveness of each system regulation scheme is shown in Figure 3. Figure (a) shows the operating cost of each scheme, while Figure (b) shows the change in indoor temperature under each scheme. From the figure, it can be seen that the DDPG algorithm in this paper achieves lower operating costs than Schemes I and II; specifically, it reduces the operating cost by 24.63% and 5.07% compared to Schemes I and II, respectively. Although the proposed algorithm has a higher running cost than Scheme III, the relative difference between the two is less than 8.05%. Since it is not practical to obtain the perfect information assumed in Scheme III, the system regulation strategy proposed in this paper has the best practical utility and achieves near-optimal performance without needing to know an explicit building thermal dynamics model. In addition, according to the indoor temperature variation in Figure 3(b), the DDPG algorithm achieves smaller temperature deviations than Schemes I and II.

The operation effectiveness of each system control scheme
To further illustrate the effectiveness of the DDPG algorithm, more simulation results are given next. The electricity price curve together with the energy storage levels of the PV storage system and the electrical energy storage system under the DDPG-based regulation strategy is shown in Fig. 4, where both storage systems respond to price fluctuations and operate dynamically. Figure (a) is the electricity price curve, while Figures (b) and (c) show the energy storage levels of the PV storage system and the electrical energy storage system, respectively. Specifically, when the electricity price is high the PV storage system and the electrical energy storage system work in discharge mode, while when the electricity price is low they work in charge mode. The DDPG algorithm can therefore exploit the dynamic operation of the PV storage system and the electrical energy storage system to reduce the operating cost of the building's PV-containing multi-energy system.

Simulation results of system regulation
The thermal energy supply and the energy storage level of the thermal energy storage system under the DDPG-based regulation strategy are shown in Fig. 5. Combined with the PV storage system results above, when the energy storage level of the PV storage system decreases, the thermal energy supply used to satisfy the heating demand and the energy stored in the thermal energy storage system under the DDPG algorithm adjust in response. For example, at time slot 150 the thermal energy supply output of the DDPG algorithm drops to nearly 0 kWh, while the energy storage level of the thermal energy storage system rises markedly to about 21.2 kWh. This suggests that the thermal energy required for space heating is primarily supplied by PV, which effectively reduces the reliance of the building's PV-containing multi-energy system on natural gas boilers. In contrast, the thermal energy used to satisfy the heat demand in Schemes I and II cannot change with the PV output. The DDPG algorithm therefore achieves coordinated operation between the electrical energy flow and the thermal energy flow.

Heat supply and Thermal storage system energy storage level
The robustness of the DDPG algorithm is examined for the three cases of 0.9, 1.8, and 2.4°F, as shown in Table 3, where ATV denotes the average temperature deviation. The operating costs of the DDPG algorithm for the 0.9, 1.8, and 2.4°F cases are 2181.3 yuan, 2284.4 yuan, and 2284 yuan, respectively, while the corresponding average temperature deviations are 0.003°C, 0.005°C, and 0.075°C. Compared with Schemes I and II, the DDPG algorithm proposed in this paper achieves lower operating costs as well as smaller temperature deviations; compared with Scheme III, the DDPG algorithm can sometimes trade a slightly larger temperature deviation for a lower operating cost.
Robustness of DDPG
| Scheme | Operating cost (RMB), 0.9°F | Operating cost (RMB), 1.8°F | Operating cost (RMB), 2.4°F | ATV (°C), 0.9°F | ATV (°C), 1.8°F | ATV (°C), 2.4°F |
| --- | --- | --- | --- | --- | --- | --- |
| Scheme I | 3023 | 3034 | 3029 | 0.182 | 0.186 | 0.235 |
| Scheme II | 2392 | 2406 | 2407 | 0.182 | 0.186 | 0.235 |
| Scheme III | 2204.34 | 2263.91 | - | 0 | 0 | - |
| DDPG | 2181.3 | 2284.4 | 2284 | 0.003 | 0.005 | 0.075 |
Facing the industry trend toward green building popularization and clean energy in the construction industry, this paper sets up a green building renewable energy system and proposes a DDPG-based system regulation strategy grounded in reinforcement learning, providing ways and means for the fine management and real-time control of green buildings. To test the regulation effect of the proposed DDPG-based strategy, simulation experiments are carried out with the winter months of January, November, and December 2021 in Changsha, Hunan Province, China, as the simulation scenario. Scheme I, which is controlled only by an ON/OFF strategy, Scheme II, which is scheduled without coordination, and Scheme III, which is scheduled with all uncertain parameter information available, are set as comparison objects. As the number of episodes increases, the average reward of the DDPG algorithm gradually rises and stabilizes, demonstrating good convergence. In terms of effectiveness, the DDPG algorithm reduces the running cost by 24.63% and 5.07% compared with Schemes I and II, respectively, while the relative difference with Scheme III, which assumes perfect and therefore impractical information, is less than 8.05%, and it achieves smaller temperature deviations. In addition, the DDPG algorithm reduces the operating cost of the PV-containing building multi-energy system through the dynamic operation of the PV storage system and the electrical energy storage system, and realizes coordinated operation between the electrical energy flow and the thermal energy flow. The robustness of the proposed regulation strategy is examined for the 0.9, 1.8, and 2.4°F cases: compared with Schemes I and II it maintains the lowest operating cost, at 2181.3 yuan, 2284.4 yuan, and 2284 yuan for the three cases, respectively, and compared with Scheme III it can sometimes trade a slightly larger temperature deviation for a lower operating cost.