
A new strategy for power monitoring data collection based on data mining and its role in improving prediction accuracy

Mar 19, 2025


Introduction

The operation of a power system is cumbersome and complex. Real-time monitoring of grid operating status and its changes requires that the relevant data be recorded and analyzed, which means making full use of terminal equipment to measure, summarize, and collate power system operating data [1-3]. Power personnel should therefore attach great importance to collecting and inspecting power data throughout the input and output process, help the grid keep its data under control, discover problems and failures in system operation in a timely manner, and apply effective corrective measures [4-6]. Strengthening the collection and collation of power monitoring data also helps optimize grid operation, enlarges the grid's resource reserves, and improves the integrity and effectiveness of grid resources. At the same time, power monitoring data collection, as the information source of the power system, can provide accurate data support for system operation [7-9]. The main tasks of monitoring data collection are to ensure effective data selection and collection at the power site and to control and optimize the input and output of monitoring data. Because the power system is special, information must be input and processing results output in the field, yet past data management models can no longer meet current needs [10-12]. With the continuous development of computer technology, and in particular the continuous innovation of computer-based data mining methods, the large volumes of data generated by power systems can be analyzed with data mining techniques to draw valuable conclusions [13-14]. Decision trees, clustering, classification, and regression analysis are the data mining methods most commonly applied to power monitoring data collection; some describe the current status of power monitoring, while others predict future monitoring data, providing valuable assistance in formulating collection strategies and improving prediction accuracy [15-18].

This paper proposes a method for analyzing the operation data of power equipment based on big data mining. First, the data generated during power monitoring are collected and preprocessed so that they are suitable for later detection and analysis. Next, the LOF algorithm and the iForest ensemble method are used to detect abnormal data generated during the monitoring process. Finally, the processed data are fed into an improved Transformer model to evaluate the prediction accuracy of the power monitoring system.

Overview

In the development of power systems, updated strategies for power monitoring data collection allow remote information to be processed and refreshed in real time, which helps promote the long-term development of power systems. To simplify grid information monitoring, literature [19] introduced data mining technology into an intelligent monitoring system for grid monitoring information so as to monitor grid operation in real time, and proposed an effective data mining algorithm for information monitoring. Literature [20] designed a data-mining-based framework for daily power usage pattern recognition and anomaly detection in building power usage data, and verified its validity and feasibility on time-series power usage data from three actual office buildings in Chongqing, providing technical support for understanding building energy usage patterns and improving building energy management. Literature [21] designed a framework for identifying the operation strategies of building power metering systems based on Classification and Regression Tree (CART) and Weighted Association Rule Mining (WARM) methods; on-site investigations of three buildings in Shanghai confirmed that the framework can accurately and automatically identify building operation strategies and improve the efficiency of strategy identification work. Literature [22] introduced data mining and IoT technology into Industry 4.0 smart grid monitoring and energy management, designed a smart grid monitoring platform integrating both technologies, and verified through empirical analysis that the platform achieves real-time monitoring and feedback of grid data. Literature [23] proposed a method for detecting the causes of power outages in distribution systems based on association rules mined with the Apriori technique, and verified through experiments that the method effectively identifies factors related to outage events, offering reference value for future distribution network planning and operation. In response to power electronic system faults that affect power detection data collection, literature [24] reviewed recent work on PES fault detection and analyzed the data mining techniques applied therein, such as artificial neural networks, machine learning, and deep learning algorithms; the results showed that deep-learning-based techniques extract features from measured signals more effectively than other methods, which helps achieve reliable maintenance of power electronic systems. Literature [25] proposed a fault detection and operation optimization method for centralized heating substations based on data mining techniques, and verified its scientific soundness and applicability through empirical analyses, showing that it can effectively extract potentially useful knowledge and thus provide reference value for substation fault detection and operation optimization.

Electricity monitoring data collection and prediction model construction
Big data analysis process
Electricity big data collection

The collection of big data has grown by leaps and bounds thanks to the widespread use of sensors and electronic components. With the rapid development of computers and the Internet of Things (IoT), connected devices are even taking over data collection from computers. Traditional identification technologies such as barcodes, QR codes, and biometrics are also contributing to the development of big data collection.

The structural design of the data collection tool is divided into three parts: the physical layer, the access layer, and the data collection layer. The physical layer is the source of the collected data, namely the power black box. In the access layer, the communication link between the host computer and the black box is realized through a serial protocol; the designated transmission protocols specify the standards for transmitting data between modules and serve the data collection layer above. Because address resources at the power site where the black box is located are limited, black boxes are generally identified by dynamically assigned addresses: once every black box has completed one round of address listening, one complete data collection pass is finished, and after the confirmation command is received, a second round of listening for power equipment operation data begins.

Power data preprocessing

Data cleanup: power equipment operation data are multi-faceted, including inspection data, current data, voltage data, temperature data, and so on, all of which are recorded. In the early stage, the system performs a preliminary cleanup of the raw data so that only useful data are carried into the next step. The specific workflow for cleaning the data is shown in Figure 1.

Figure 1.

Specific workflow diagram for clearing data

1) Locate default values: default values are the null or empty values that appear when retrieving data.

2) Extract the replacement value: this step uses the K-nearest-neighbor algorithm, which selects the K sample instances closest to the default value as a class; after the default value is removed, the value occurring most frequently among these neighbors is taken as a reasonable replacement.

3) Replace the default value: the default value is replaced by the replacement value obtained in the previous step.

4) Locate outliers: since all raw data are assumed to follow a normal distribution, the arithmetic mean of the entire data set is calculated first. The residual error is $v_i = x_i - \bar{x}$, and the standard error $\sigma$ is calculated according to the Bessel formula. If the residual error of a measured value is greater than three times the standard error, the Rajda criterion is satisfied, expressed as $\left| v_b \right| = \left| x_b - \bar{x} \right| > 3\sigma$. Any value satisfying this inequality is defined as an outlier.

5) Repeat steps 2) and 3) to extract corresponding replacement values and replace the outliers.

6) Secondary detection: after repeated replacement of default and abnormal values, the state of the whole data set is disturbed, so the determination and replacement of abnormal values must be carried out once more to ensure the work proceeds correctly.

7) After processing, the remaining useful data are stored in the database. The new data do not overwrite the original data, so old and new data can be compared with each other and subsequent data analysis can proceed normally. A minimal code sketch of this workflow follows.
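The sketch below illustrates the cleanup workflow above in Python under simplifying assumptions: the records are held in a pandas DataFrame with hypothetical column names, and the mode-based neighbor replacement of step 2) is approximated with scikit-learn's mean-based KNNImputer.

```python
# A minimal sketch of the cleanup workflow, assuming monitoring records
# arrive as a pandas DataFrame; column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def clean_monitoring_data(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    # Steps 1)-3): locate default (null) values and replace them with
    # values derived from the k nearest neighboring samples.
    imputer = KNNImputer(n_neighbors=k)
    cleaned = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

    # Step 4): locate outliers with the Rajda (3-sigma) criterion,
    # assuming the readings are approximately normally distributed.
    for col in cleaned.columns:
        residual = cleaned[col] - cleaned[col].mean()
        sigma = cleaned[col].std(ddof=1)          # Bessel-corrected std
        cleaned.loc[residual.abs() > 3 * sigma, col] = np.nan

    # Step 5): re-impute the removed outliers; step 6) would repeat the
    # detection pass on the now-perturbed data set.
    return pd.DataFrame(imputer.fit_transform(cleaned), columns=cleaned.columns)

# Usage with hypothetical current/voltage readings:
df = pd.DataFrame({"current": [5.1, 5.0, np.nan, 50.0, 5.2],
                   "voltage": [220.0, 219.5, 221.0, 220.3, np.nan]})
print(clean_monitoring_data(df, k=2))
```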

Data normalization [26]: three methods are used, namely the standard transformation, the polar (range) deviation normalization transformation, and the square root standard method. Their implementations are described below.

Standard transformation: a vector of dimension p is denoted by X = (X1, X2, ⋯, Xp), and the observation matrix is as follows: \[X=\left[ \begin{matrix} {{x}_{11}} & {{x}_{12}} & \cdots & {{x}_{1p}} \\ {{x}_{21}} & {{x}_{22}} & \cdots & {{x}_{2p}} \\ \vdots & \vdots & \ddots & \vdots \\ {{x}_{n1}} & {{x}_{n2}} & \cdots & {{x}_{np}} \\ \end{matrix} \right]\]

After the standardized transformation: \[{{X}^{*}}=\left[ \begin{matrix} x_{11}^{*} & x_{12}^{*} & \cdots & x_{1p}^{*} \\ x_{21}^{*} & x_{22}^{*} & \cdots & x_{2p}^{*} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1}^{*} & x_{n2}^{*} & \cdots & x_{np}^{*} \\ \end{matrix} \right]\] \[x_{ij}^{*}=\frac{{{x}_{ij}}-\overline{{{x}_{i}}}}{\sqrt{{{s}_{ij}}}},\quad i=1,2,\cdots,n;\ j=1,2,\cdots,p\] \[\sqrt{{{s}_{ij}}}=\sqrt{\frac{1}{p-1}\sum\limits_{j=1}^{p}{{{\left( {{x}_{ij}}-\overline{{{x}_{i}}} \right)}^{2}}}}\] where $\overline{{{x}_{i}}}$ represents the mean of the observations of variable $X_i$, $s_{ij}$ is the variance of variable $X_i$, and $\sqrt{{{s}_{ij}}}$ its standard deviation. After the normalization transformation, each row of the original matrix has mean 0 and standard deviation 1.

Polar deviation normalization transformation: applying the polar (range) deviation normalization to the observation matrix X in equation (1) yields: \[{{X}^{*}}=\left[ \begin{matrix} x_{11}^{R} & x_{12}^{R} & \cdots & x_{1p}^{R} \\ x_{21}^{R} & x_{22}^{R} & \cdots & x_{2p}^{R} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1}^{R} & x_{n2}^{R} & \cdots & x_{np}^{R} \\ \end{matrix} \right]\] \[x_{ij}^{R}=\frac{{{x}_{ij}}-\underset{1\le k\le p}{\min}\,{{x}_{ik}}}{\underset{1\le k\le p}{\max}\,{{x}_{ik}}-\underset{1\le k\le p}{\min}\,{{x}_{ik}}},\quad i=1,2,\cdots,n;\ j=1,2,\cdots,p\] where $\underset{1\le k\le p}{\min}\,{{x}_{ik}}$ denotes the minimum of the observations of variable $x_i$, and $\underset{1\le k\le p}{\max}\,{{x}_{ik}}-\underset{1\le k\le p}{\min}\,{{x}_{ik}}$ denotes the range of the observations of $x_i$. After the range normalization transformation, all elements of the matrix take values between 0 and 1.

Square root standard method: the square root standard method is applied to the observation matrix X in equation (1):

\[{{X}^{*}}=\left[ \begin{matrix} x_{11}^{s} & x_{12}^{s} & \cdots & x_{1p}^{s} \\ x_{21}^{s} & x_{22}^{s} & \cdots & x_{2p}^{s} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1}^{s} & x_{n2}^{s} & \cdots & x_{np}^{s} \\ \end{matrix} \right]\]

where $x_{ij}^{s}=\frac{{{x}_{ij}}}{\sqrt{x_{i1}^{2}+x_{i2}^{2}+\cdots +x_{ip}^{2}}}$, i = 1,2,⋯,n; j = 1,2,⋯,p. After the above operation, the value of each element of matrix $X^*$ lies between 0 and 1.
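As a concrete illustration, the following numpy sketch applies the three transformations row-wise, matching the convention above that each row holds the observations of one variable:

```python
# A minimal numpy sketch of the three normalization transforms,
# applied row-wise (each row holds one variable's observations).
import numpy as np

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])

# Standard transformation: each row gets zero mean, unit standard deviation.
X_std = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

# Polar (range) deviation normalization: each row rescaled into [0, 1].
mins = X.min(axis=1, keepdims=True)
maxs = X.max(axis=1, keepdims=True)
X_range = (X - mins) / (maxs - mins)

# Square root standard method: each row divided by its Euclidean norm.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_std, X_range, X_norm, sep="\n\n")
```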

Electricity big data statistics and analysis

In the condition monitoring of power equipment, a huge amount of historical condition data gradually accumulates over time, and these data must be analyzed quickly and effectively for condition assessment. Users can import the data into the database according to their own needs, and the system carries out the calculation and analysis.

System algorithm realization process

1) Data preparation: prepare the power equipment operation status signal data, stored in an HBase table, together with a small amount of sample data of known categories collected in the laboratory environment and stored locally.

2) Signal feature extraction: perform feature extraction on the operation status signal data and store the results in a SequenceFile.

3) Clustering centers: extract features from the small number of samples of known categories, then find each clustering center through the formula: \[Cente{{r}_{j}}=\frac{{{x}_{1}}+{{x}_{2}}+{{x}_{3}}+\cdots +{{x}_{m}}}{m}\] where $Center_j$ is the clustering center of category $C_j$, $x_i$ ranges over all samples of category $C_j$, and m is the number of samples. The estimated clustering centers serve as the centers required for the first iteration and are stored in a SequenceFile.

4) Perform clustering: specify the paths of the extracted state-signal features and the clustering centers, and run the clustering process; the results are output to HDFS, also in SequenceFile storage.

5) Apply the KMeans output model to power equipment operation state evaluation: retrain the model with the latest operation data so that it accurately reflects the equipment's operating state in a timely manner. A minimal sketch of steps 3) to 5) follows.
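The following Python sketch illustrates steps 3) to 5) under simplifying assumptions: the feature vectors are synthetic numpy arrays rather than SequenceFiles read from HDFS, and scikit-learn's KMeans is seeded with the centers estimated from the labeled samples.

```python
# A minimal sketch of steps 3)-5); data here are synthetic stand-ins
# for the feature vectors the paper stores in SequenceFiles.
import numpy as np
from sklearn.cluster import KMeans

# Step 3): labeled laboratory samples -> one initial center per category.
labelled = {0: np.random.rand(20, 8), 1: np.random.rand(15, 8) + 2.0}
centers = np.vstack([samples.mean(axis=0) for samples in labelled.values()])

# Step 4): cluster the (unlabeled) field signal features, seeding KMeans
# with the centers estimated above instead of random initialization.
features = np.vstack([np.random.rand(100, 8), np.random.rand(80, 8) + 2.0])
model = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(features)

# Step 5): the fitted model assigns an operating-state category to new data;
# retraining on the latest data keeps the assessment current.
print(model.predict(np.random.rand(3, 8)))
```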

Power anomaly detection method based on LOF algorithm

Cluster analysis [27] is an important tool in data mining. It can be used for data analysis directly, or to pre-process data for other algorithms, improving the accuracy of subsequent processing.

The main process of clustering generally includes the following aspects:

1) Data preparation: the main tasks include feature normalization and dimensionality reduction.

2) Feature selection and representation: the main purpose is to select the most effective features and store them as vectors.

3) Feature extraction: convert the selected features into new, salient features.

4) Clustering: first select a distance function suited to the characteristics of the data type as the similarity measure, then cluster or group the data.

5) Evaluation of clustering results: There are three main methods for evaluating clustering results: relevance test evaluation, internal usefulness evaluation, and external usefulness evaluation.

Power anomaly monitoring methods

The density-based approach to power anomaly detection [28] computes, for each data point, a numerical value indicating its degree of outlierness, similar to distance-based approaches. For a given dataset, the algorithm considers a data point normal if its local neighborhood is dense, while an outlier is a data point far from the neighbors of normal points, with a threshold usually defining the distance. Among density-based anomaly detection methods, the most typical is the local outlier factor (LOF) method [29].

Five concepts underlying the LOF algorithm must be mastered first: the k-distance, the k-distance neighborhood, the reachable distance, the local reachable density, and the LOF itself.

1) k-distance: for any point q in the data set, the distance from q to its k-th nearest point is called the k-distance of q, denoted $k\text{-}distance(q)$; the distance referred to here is the Euclidean distance.

2) k-distance neighborhood: for any point q in the data set, the neighborhood formed by all data points whose distance from q is not greater than the k-distance of q is called the k-distance neighborhood of q.

3) Reachable distance: let p, q be any two data points in the data set; the reachable distance from data point q to data point p is defined as: \[reach\text{-}dist_k\left( q,p \right)=\max \left\{ d\left( q,p \right),\ k\text{-}distance\left( p \right) \right\}\] where $d(q,p)$ denotes the Euclidean distance between points q and p, and $k\text{-}distance(p)$ is the k-distance of data point p.

4) Local reachable density: this measures the local density of point q and is usually denoted lrd, defined as follows: \[lr{{d}_{k}}\left( q \right)=\frac{\left| {{N}_{k}}\left( q \right) \right|}{\sum\nolimits_{p\in {{N}_{k}}\left( q \right)}{reach\text{-}dist_k\left( q,p \right)}}\] where $N_k(q)$ denotes the set of the k points nearest to data point q and $|N_k(q)|$ is the number of points in the k-neighborhood of q. The local reachable density $lrd_k(q)$ measures how densely q sits among its k nearest points: if $lrd_k(q)$ is large, q is densely surrounded and hence a normal point; conversely, if $lrd_k(q)$ is small, q is sparsely distributed among the k points and is an outlier.

5) Local outlier factor (LOF): the local outlier factor characterizes the degree of outlierness of a data point, i.e., the likelihood that the point is an outlier, and is defined as follows: \[LO{{F}_{k}}\left( q \right)=\frac{\sum\nolimits_{p\in {{N}_{k}}\left( q \right)}{\frac{lr{{d}_{k}}\left( p \right)}{lr{{d}_{k}}\left( q \right)}}}{\left| {{N}_{k}}\left( q \right) \right|}\]

The outlier factor LOF represents a density contrast between data point q and its surroundings. Numerous studies have shown that if the LOF value is much greater than 1, the density of point q differs considerably from the overall density of the data, so point q is considered an outlier. If the LOF value is close to 1, the difference between point q and the rest of the data is small, so q can be considered a normal point. If the ratio is smaller than 1, point q is denser than its neighbors and is a dense point.

By defining density in this way, the algorithm can detect local outlier points with good accuracy. However, its time complexity is $O(N^2)$, which is relatively inefficient, and it remains sensitive to its parameters.
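For illustration, the scikit-learn sketch below runs LOF on synthetic generation values; LocalOutlierFactor exposes the negated LOF score, so its sign is flipped to recover $LOF_k(q)$.

```python
# A minimal sketch of LOF-based anomaly detection with scikit-learn;
# the generation values are synthetic stand-ins for monitoring data.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(loc=100.0, scale=5.0, size=(200, 1))   # dense cluster
outliers = np.array([[150.0], [40.0], [170.0]])            # sparse points
X = np.vstack([normal, outliers])

lof = LocalOutlierFactor(n_neighbors=20)       # size of the k-neighborhood
labels = lof.fit_predict(X)                    # -1 marks outliers, 1 normal
scores = -lof.negative_outlier_factor_         # LOF_k(q); >> 1 means outlier

print(np.where(labels == -1)[0])               # indices flagged as anomalous
print(scores[labels == -1])
```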

Integrated approach to power anomaly detection

iForest (isolation forest) [30] is a fast anomaly detection method based on ensemble learning. iForest does not use distance or density measures for anomaly detection, which eliminates a large number of computations; it has linear time complexity and low memory requirements while still achieving high accuracy.

iForest contains a number of binary trees called isolation trees, or iTrees for short. iTrees are not exactly like decision trees: they randomly select attributes and partition values to construct subspaces on their branches.

Definition (isolation tree): let T be an isolation tree. T is either an external node with no children or an internal node with two children $(T_l, T_r)$. An attribute q and a split value p are specified at random, with p lying between the maximum and minimum values of attribute q, and the data points are then divided into $T_l$ and $T_r$.

Given a set X = {x1, x2, ⋯, xn} of n values, isolation trees are built by recursively splitting X on a randomly chosen attribute q and split value p until (1) the tree reaches its height limit, or (2) only one data point remains in the child node. An iTree is a binary tree in which each node has zero or two children. Assuming all instances are distinct, each data object is isolated at an external node once the iTree is fully grown; in this case there are n external nodes, n − 1 internal nodes, and 2n − 1 nodes in total, so the space complexity is linear.

Anomaly detection requires an anomaly score that reflects the degree of anomaly. The iForest algorithm calculates the anomaly score on the basis of path length and ranks the data points by this score; the top-ranked points are the anomalies. The path length and anomaly score are defined as follows:

Path length: the path length h(x) of point x is the number of edges traversed in the iTree from the root node until an external node is reached.

Anomaly score: since an iTree has the same structure as a binary search tree, the path length of sample x is equal to the path length of an unsuccessful query in a binary search tree. For a given dataset D with n samples, the literature [28] gives the average path length of an unsuccessful query as: \[c\left( n \right)=2H\left( n-1 \right)-\frac{2\left( n-1 \right)}{n}\] where $H(i)$ is $\ln (i)+\gamma$ ($\gamma$ is Euler's constant) and $c(n)$, the average of $h(x)$ given n, is used to normalize $h(x)$. The anomaly score s of instance data x is defined as: \[s\left( x,n \right)={{2}^{-\frac{E\left( h\left( x \right) \right)}{c\left( n \right)}}}\] where $E(h(x))$ is the average of $h(x)$ over the set of iTrees. When $E(h(x)) \to c(n)$, $s \to 0.5$; when $E(h(x)) \to 0$, $s \to 1$; and when $E(h(x)) \to n-1$, $s \to 0$, with $0 < s < 1$ and $0 < h(x) < n-1$. Using the anomaly score, the following assessment can be made: (a) if the anomaly score s of an instance is close to 1, it can be considered anomalous; (b) if s is much less than 0.5, the instance is very likely a normal value; (c) if s is approximately equal to 0.5, the abnormality of the sample is not significant.
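A minimal usage sketch with scikit-learn's IsolationForest is given below; note that score_samples returns the opposite of the anomaly score s defined above, so its sign is flipped.

```python
# A minimal sketch of iForest-based detection with scikit-learn; the
# load values are synthetic stand-ins for power monitoring data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(100.0, 5.0, size=(200, 1)),   # normal load values
               [[170.0], [35.0]]])                      # injected anomalies

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = forest.predict(X)                # -1 anomalous, 1 normal
s = -forest.score_samples(X)              # anomaly score s(x, n) in (0, 1)

print(np.where(labels == -1)[0])          # indices flagged as anomalous
print(s[labels == -1])                    # their scores, close to 1
```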

Load Forecasting Method Based on Improved Transformer Modeling
Algorithm flow

The load forecasting process based on the improved Transformer model is shown in Figure 2. The collected power load data of substations are preprocessed; after normalization and sliding-window sampling, samples and labels are formed and input into the improved Transformer model [31] for tuning, training, forecasting, and model evaluation.

Figure 2.

The load prediction flow chart based on the improved transformer model

Encoder Structure and Design Ideas

1) Overall design. As shown in Figure 2 above, the encoding layer of the improved Transformer model consists of a CNN feature extractor, a position information generator, a mask matrix, and a multilayer multi-head attention unit. The design ideas behind these structures are introduced one by one below.

2) CNN feature extractor. Applying the Transformer model to power data faces the following limitations: (a) the position distribution of traditional word vectors in semantic space implies certain semantic information, whereas power load data carry no semantics, so position encoding based on sine and cosine functions is not interpretable in a semantic space; (b) the native model sheds recursive and convolutional structures when extracting sequence features, which inevitably fragments information and weakens the model's ability to capture local information and long-distance dependencies.

To solve these problems, a convolutional-neural-network-based feature extractor is introduced for word embedding, improving the model's ability to fit local dependencies. A convolution kernel of 3 rows and 1 column with 1 unit of edge padding is used; the number of convolution kernels (the word-vector dimensionality) is an adjustable hyperparameter. For an m-dimensional grayscale map, an n-dimensional convolution kernel slides over the input, and the pixel value of the feature map at position (u,v) is: \[z_{u,v}^{\left( l \right)}=\sigma \left( \sum\limits_{i=0}^{n}{\sum\limits_{j=0}^{n}{x_{i+u,j+v}^{\left( l-1 \right)}\cdot k_{i,j}^{\left( l \right)}+{{b}^{\left( l \right)}}}} \right)\]

The convolutional structure has the following benefits: (a) the multidimensional convolution kernels with shared weights not only notice the simple "near large, far small" feature of power loads, but also capture common patterns among neighboring positions; (b) the convolutional structure can recognize different local patterns in a multi-channel way and output them as multiple feature maps, giving it a strong ability to extract local information.
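A minimal PyTorch sketch of such a feature extractor, under the assumption of a univariate load series, is given below; the 3-row, 1-column kernel with 1 unit of padding maps the series to word vectors whose dimensionality equals the number of kernels.

```python
# A minimal sketch of the word-embedding CNN described above; the
# d_model value and input sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CNNFeatureExtractor(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        # kernel of 3 rows x 1 column, 1 unit of edge padding;
        # d_model kernels give d_model-dimensional "word vectors".
        self.conv = nn.Conv1d(in_channels=1, out_channels=d_model,
                              kernel_size=3, padding=1)

    def forward(self, load: torch.Tensor) -> torch.Tensor:
        # load: (batch, seq_len) -> embeddings: (batch, seq_len, d_model)
        x = self.conv(load.unsqueeze(1))        # (batch, d_model, seq_len)
        return x.transpose(1, 2)

embed = CNNFeatureExtractor(d_model=64)
print(embed(torch.randn(8, 96)).shape)          # torch.Size([8, 96, 64])
```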

3) Discarding the padding mask structure. The native Transformer model uses a padding mask to handle input sequences of inconsistent length; since this study avoids that problem by fixing the window length during data segmentation and labeling, the structure can be discarded.

4) Improvement of the timing mask. The improved algorithm removes the timing mask from the decoding layer and moves it to the encoding layer. With this design, the model automatically masks information after the current processing time point during encoding, which brings the input space closer to the real application scenario.

The mask matrix is: \[\left( \begin{matrix} {{a}_{11}} & \cdots & {{a}_{1n}} \\ \vdots & \ddots & \vdots \\ {{a}_{n1}} & \cdots & {{a}_{nn}} \\ \end{matrix} \right),\quad {{a}_{i,j}}=\left\{ \begin{matrix} 0, & i\ge j \\ -\infty, & i<j \\ \end{matrix} \right.\]
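A one-function PyTorch sketch of this timing mask:

```python
# A minimal sketch of the timing mask above: zeros on and below the
# diagonal, -inf above it, so attention cannot see future time steps.
import torch

def timing_mask(n: int) -> torch.Tensor:
    future = torch.triu(torch.ones(n, n), diagonal=1).bool()  # i < j
    return torch.zeros(n, n).masked_fill(future, float("-inf"))

print(timing_mask(4))
```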

5) Position encoding

The role of position encoding is to let the input sequence carry position information so that the model can automatically capture position-related local dependencies. The position encoding is computed as: \[PE\left( pos,2i \right)=\sin \left( {pos}/{{{10000}^{{2i}/{d}\;}}}\; \right)\] \[PE\left( pos,2i+1 \right)=\cos \left( {pos}/{{{10000}^{{2i}/{d}\;}}}\; \right)\] where PE denotes the positional encoding, pos the offset position, and d the word-embedding dimension.
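The formulas translate directly into the following PyTorch sketch (assuming an even embedding dimension d):

```python
# A minimal sketch of the sinusoidal position encoding above;
# pos is the offset position, d the word-embedding dimension (even).
import torch

def position_encoding(seq_len: int, d: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d, 2, dtype=torch.float32)
    angle = pos / torch.pow(torch.tensor(10000.0), i / d)
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(angle)       # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angle)       # PE(pos, 2i+1)
    return pe

print(position_encoding(96, 64).shape)   # torch.Size([96, 64])
```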

6) Multihead multilayer self-attention unit.

(1) Working mechanism of the attention unit. First, the data are input to the multi-head self-attention unit; the output is then subjected to residual correction and layer normalization; the processed data are fed into a multilayer perceptron, whose output undergoes residual correction and layer normalization once more to produce the output of the self-attention unit. After the input data pass through this multilayer processing structure, the final result is output.

(2) Calculation formula of self-attention score:

Input: data D, query matrix parameters $W^q$, key matrix parameters $W^k$, and value matrix parameters $W^v$, with dimensions $D\in {{R}^{Dayz\times batchsize\times {{d}_{model}}}}$, ${{W}^{q}}\in {{R}^{{{d}_{model}}\times {{d}_{model}}}}$, ${{W}^{k}}\in {{R}^{{{d}_{model}}\times {{d}_{model}}}}$, ${{W}^{v}}\in {{R}^{{{d}_{model}}\times {{d}_{model}}}}$.

Compute the query matrix Q, key matrix K, and value matrix V: \[{{Q}^{i}}=D\times {{W}^{q}},\quad {{K}^{i}}=D\times {{W}^{k}},\quad {{V}^{i}}=D\times {{W}^{v}}\]

Output: \[{D}'=\operatorname{softmax}\left( {\left( {{Q}^{i}}\times {{\left( {{K}^{i}} \right)}^{T}}+MASK \right)}/{\sqrt{{{d}_{model}}}}\; \right)\times {{V}^{i}}\in {{R}^{Dayz\times batchsize\times {{d}_{model}}}}\]
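Putting the mask and the formulas together, the following PyTorch sketch computes a single-head self-attention pass for one batch; the tensor layout (batch, seq_len, d_model) and the sizes are illustrative assumptions.

```python
# A minimal sketch of the self-attention computation above; D is a
# (batch, seq_len, d_model) tensor, MASK the timing mask from earlier.
import math
import torch
import torch.nn as nn

d_model, seq_len, batch = 64, 96, 8
D = torch.randn(batch, seq_len, d_model)
W_q = nn.Linear(d_model, d_model, bias=False)   # query parameters W^q
W_k = nn.Linear(d_model, d_model, bias=False)   # key parameters W^k
W_v = nn.Linear(d_model, d_model, bias=False)   # value parameters W^v

Q, K, V = W_q(D), W_k(D), W_v(D)
MASK = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# softmax((Q K^T + MASK) / sqrt(d_model)) V, as in the formula above
scores = (Q @ K.transpose(-2, -1) + MASK) / math.sqrt(d_model)
D_out = torch.softmax(scores, dim=-1) @ V
print(D_out.shape)                              # torch.Size([8, 96, 64])
```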

(3) Benefits of multi-head multilayer attention cell design:

Supports parallel computation: the self-attention score can be solved in one step by matrix multiplication.

Retains memory intact: attention weights are computed between every pair of temporal features, so the model fully learns both long-distance dependencies and the local dependencies characterized by "near large, far small".

Shorter total signal path: compared with RNN and CNN networks, the self-attention network has the shortest paths between units, so more effective gradient information is retained, which alleviates gradient vanishing and gradient explosion to a certain extent.

Random deactivation (dropout) reduces structural risk, and the residual structure prevents accuracy from gradually decreasing as the number of network layers grows beyond saturation.

Transformer model for power monitoring and its prediction accuracy analysis
Power anomaly monitoring effectiveness test
Algorithm validity test

To verify the effectiveness of the proposed algorithm, it is applied to another set of actual data. The sample set comes from wind power generation in Province X between 2019 and 2023, sampled monthly, for a total of 220 sample data points with 12 points per annual load curve; the distribution curve of the 220 sample points is shown in Figure 3. Most of the data are concentrated in a certain region, except for one obvious peak that clearly differs from the normal data. Wind power generation is seasonal and strongly affected by the geographical environment. Province X has ample wind energy in winter, but wind abandonment and power curtailment occur. So-called wind abandonment and power curtailment refers to suspending some wind turbines when wind energy is plentiful but grid acceptance capacity is insufficient, the electricity load is low, and the wind power is unstable.

Figure 3.

The distribution curve of 220 sample points

Comparison of decision map results for fast peaking algorithms

The decision diagram of the traditional fast peak algorithm is shown in Figure 4. Two sample points in the upper right have both large relative density and large distance, and a nesting phenomenon occurs, while the remaining sample points are distributed over a smaller range. This indirectly indicates that the traditional algorithm has certain limitations when dealing with power data, a data type with large local density variations.

Figure 4.

Traditional rapid peak algorithm decision diagram

The decision diagram of the improved fast density peak clustering algorithm is shown in Figure 5. As the figure shows, the nesting phenomenon in the upper right corner has disappeared, the clustering results are clearly better than before the improvement, and the clustering centers are more distinctly characterized.

Figure 5.

Improved fast density peak clustering algorithm decision diagram

Power outlier test based on LOF algorithm

Since the outliers in this dataset are distributed across different years, they cannot be represented visually with curves; the distribution of outlier data is shown in Table 1. The results show that the algorithm detected all the outliers in the dataset, 9 in total. The algorithm focuses on the longitudinal comparison of generation data for the same period across different years. In this dataset, generation from January to February of every year is very small, even less than half of the highest monthly generation, which is anomalous within the whole year; yet the algorithm does not treat the January and February data as anomalies, because it focuses on local changes in the data. Simulation and analysis of different power data prove that the proposed LOF method detects outliers well, illustrating the effectiveness of the algorithm.

Table 1. Anomalous data distribution

Abnormal value label Date
5 2022/05
21 2019/12
63 2019/10
96 2019/07
108 2019/05
126 2021/09
143 2022/12
187 2023/10
202 2023/06
Effectiveness of Deep Learning-based Electricity Load Forecasting Model Application
Comparative analysis of different models

Comparing against different models reveals each model's advantages and disadvantages, so this section compares the proposed method with LSTM and BiLSTM-Attention and verifies the accuracy of the experiments by comparing the evaluation indexes of the different models. The comparison models are as follows:

1) LSTM method: the classical recurrent neural network LSTM solves the gradient vanishing and explosion problems of RNNs during training.

2) BiLSTM-Attention method: the model is based on BiLSTM, which is highly robust for modeling load sequence data; the attention mechanism then highlights key features that play an important role in load forecasting.

The prediction results of the different models on public dataset 1 are shown in Figure 6. The figure shows that the improved Transformer proposed in this paper tracks the real power data more closely than the other two models, so it can be judged that the model has better prediction results; the next step is to judge the predictions quantitatively by comparing the evaluation indexes of each model.

Figure 6.

Predictions of different models on public dataset 1

The evaluation metrics of the different models on public dataset 1 are shown in Table 2. The MAPE of this paper's model is 1.03%, an improvement of 65.2% and 61.13% over the other two models, respectively. The R2 of this experiment reaches 99.84%, very close to 1. All the evaluation metrics show that the proposed model maintains superior prediction results.
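For reference, the four metrics in the tables can be computed with scikit-learn as in the sketch below; the y_true and y_pred arrays are hypothetical placeholders, not the paper's data.

```python
# A minimal sketch of the four evaluation metrics reported in the
# tables; y_true and y_pred are hypothetical load values in MW.
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

y_true = np.array([5200.0, 5350.0, 5100.0, 4980.0])
y_pred = np.array([5150.0, 5400.0, 5080.0, 5010.0])

mape = mean_absolute_percentage_error(y_true, y_pred) * 100   # percent
rmse = np.sqrt(mean_squared_error(y_true, y_pred))            # MW
mae = mean_absolute_error(y_true, y_pred)                     # MW
r2 = r2_score(y_true, y_pred)

print(f"MAPE={mape:.2f}% RMSE={rmse:.2f}MW MAE={mae:.2f}MW R2={r2:.4f}")
```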

Table 2. Evaluation indicators of different models on public dataset 1

Model MAPE(%) RMSE(MW) MAE(MW) R2
LSTM 2.96 281.0886 183.2736 0.9542
BiLSTM-Attention 2.65 227.4795 161.2114 0.9701
Improved Transformer 1.03 85.4571 62.3925 0.9984
Comparative analysis of different data sets

The comparison across models above verified the high accuracy of the experiments. To further verify the robustness and generalization ability of the method, another public dataset is selected to validate the proposed model: in this subsection, the power load dataset from public dataset 2 of the 9th Electrician's Attribute Modeling Competition is used to validate the accuracy and generalization ability of the proposed model. The same comparison models as for public dataset 1 are trained and used for prediction on the load data of dataset 2. The prediction results of the different models on public dataset 2 are shown in Figure 7. The results show that the predictions of the improved Transformer method nearly coincide with the true values, while the other two comparison models fit slightly worse than the proposed method.

Figure 7.

Prediction results of different models on public dataset 2

The evaluation metrics of the different models on public dataset 2 are shown in Table 3. On dataset 2 the proposed method achieves MAPE 1.4%, RMSE 124.5055 MW, and MAE 84.5468 MW, and the R2 of the predictions reaches 99.63%. Each evaluation index verifies that the model's accuracy remains excellent even when the dataset is replaced.

Table 3. Evaluation indicators of different models on public dataset 2

Model MAPE(%) RMSE(MW) MAE(MW) R2
LSTM 3.67 320.4859 235.0074 0.9397
BiLSTM-Attention 3.02 235.8231 171.6748 0.9762
Improved Transformer 1.40 124.5055 84.5468 0.9963
Validation experimental analysis of real datasets

The prediction results on the real dataset are shown in Figure 8. Comparing the evaluation metrics of the different models with the real load values shows that the proposed model maintains accurate predictions on the real dataset as well, whereas the other two models fit the real values less closely.

Figure 8.

Prediction results on the real dataset

The evaluation metrics of the different models on the real dataset are shown in Table 4. The proposed model achieves MAPE 4.15%, RMSE 496.1061 MW, and MAE 356.6518 MW, with an R2 of 97.71%. All indicators are better than those of the other models, so the model can be applied to actual power load forecasting problems and assist the power system in making scheduling plans and decisions.

Table 4. Evaluation indicators of different models on the real dataset

Model MAPE(%) RMSE(MW) MAE(MW) R2
LSTM 5.23 683.7404 473.8774 0.9483
BiLSTM-Attention 4.57 596.3083 426.4559 0.9583
Improved Transformer 4.15 496.1061 356.6518 0.9771
Conclusion

Against the background of data mining, this paper proposes a method for collecting and detecting data in the power monitoring process, constructs an improved Transformer model for power load forecasting on that basis, and verifies the model's accuracy. The primary conclusions are as follows:

1) The clustering result of the improved LOF algorithm is clearly better than before the improvement, and the clustering centers are more distinctly characterized. The algorithm detects all the abnormal power values in the dataset, which proves that the improved LOF algorithm proposed in this paper detects abnormal values well and demonstrates its effectiveness.

2) Comparison of the models reveals that the proposed prediction model has the lowest MAPE, improved by 65.2% and 61.13% over the other models, respectively, with an R2 close to 1 (99.84%), maintaining excellent prediction results. Comparison across datasets shows that on dataset 2 the MAPE and MAE of the proposed method are smaller than those of the other models, and the R2 of the predictions reaches 99.63%, indicating that the model's accuracy remains extremely high on different datasets. The validation experiments on the real dataset yield similar comparisons, with the proposed model again outperforming the other models. It can therefore be applied to actual power load forecasting and help the power system make scheduling plans and decisions.
