Design and Performance Evaluation of Efficient Clustering Algorithms for Big Data Applications
Published Online: Feb 05, 2025
Received: Sep 22, 2024
Accepted: Dec 31, 2024
DOI: https://doi.org/10.2478/amns-2025-0056
© 2025 Ping Dai et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
With advances in digital sensors, communication and storage technologies, huge amounts of data are collected and stored every day, leading to an explosion of data. In particular, data related to the Internet and the mobile Internet, as well as data generated by sensor networks, can often reach terabytes or even petabytes, and such data continues to be generated in large quantities every minute of every day. The only way to make big data truly meaningful is to process it with technology and turn it into data that people can read and understand as they need [1-2].
While a database is a collection of data records in which each data point represents a record and each record contains a number of attributes, multidimensional data is a collection of data records in a high-dimensional space, where the points can represent locations in space and, more generally, each attribute of a record corresponds to a dimension [3-5]. Using data mining techniques, valid, novel and understandable patterns can be obtained from the large amounts of data stored in databases [6-7]. Clustering is the process of dividing data into groups or sets such that objects within the same set have high similarity while objects in different sets are distinct from each other [8-10]. Cluster analysis can be used both as a standalone data mining tool to obtain the distribution of the data and as a preprocessing step for other data mining algorithms [11-13]. How to improve traditional clustering algorithms to ensure the quality and efficiency of clustering in the context of big data has become an important research topic in artificial intelligence and big data processing.
Clustering algorithms can be broadly grouped into five categories: division-based clustering, density-based clustering, grid-based clustering, hierarchy-based clustering and graph-theory-based clustering [14-15]. Literature [16] showed that the performance of division-based clustering algorithms depends strongly on the initial clustering centres, so genetic algorithms (GA) and teaching-learning-based optimization (TLBO) were used to address convergence to local minima and to compare the clustering performance improvements obtained with the two methods. Literature [17] explained that in density-based clustering, regions of high-density objects in the data space are separated by neighbouring regions of low-density objects, thereby forming clusters; on the basis of this concept several related density-based algorithms are derived, and the open challenges faced in designing such algorithms are presented. Literature [18] introduced a grid-based clustering algorithm that divides the data space into a limited number of grid-structured cells, which provides better clustering performance for large-scale multidimensional datasets, although the algorithm is severely limited by the grid size, the boundaries and the density thresholds of the important cells. Literature [19] shows that hierarchical clustering methods can give multiple consistent divisions of the same data at different levels without re-running the clustering, so this type of algorithm can better analyse the complex structure of the data, although different similarity metrics may lead to large differences in the hierarchical clustering results. Literature [20] investigated graph-theory-based multi-view clustering; the established generalised graph-based multi-view clustering system (GBS) generates a unified graph matrix by extracting the data feature matrices of the views and fusing them to form the final clustering results.
Clustering algorithm design is a dynamic research area in data mining that requires scholars to develop and improve on the original clustering algorithms. Literature [21] reviewed the research on the improvement of clustering algorithms based on Hadoop distributed platform with MapReduce programming paradigm, which gives the traditional division-based clustering algorithms the ability to obtain different clustering objectives on different datasets in order to reduce the computation time. Literature [22] proposed a two-stage clustering method, KMDD, to address the inefficiency exhibited by spatial data clustering methods on larger scale datasets by clustering the dataset into sub-clusters through a division-based algorithm, and then assigning a local density to each sub-cluster, and merging the sub-clusters under a density-based algorithm. Literature [23] proposed a Random Sample Partitioning-based Clustering Integration Algorithm (RSP-CE), which generates base clustering results on the entire large dataset and performs the Maximum Mean Difference Criterion (MMCDC) harmonisation and simplification, so that the base clustering results on the different dataset subsets can approximate the clustering results on the entire large dataset. Literature [24] states that Density Based Clustering Algorithms (DBCLAs) are capable of identifying clusters of arbitrary shapes and sizes with different densities, and can also be subdivided according to the density definition, parameter sensitivity, execution patterns and data properties for better application in various domains. Literature [25] highlights the advantages of grid-based clustering algorithms with fast processing time and the ability to discover clusters of arbitrary shapes, and for the problem of proper selection of the number of grid cells in the dataset, an optimisation scheme based on the k-distance function is proposed to improve the performance of grid-based clustering algorithms. Literature [26] discusses how to use MapReduce programming model to improve the operation efficiency of density-based hierarchical clustering algorithm on large datasets, and proposes a single-chain parallel clustering scheme based on HDBSCAN* algorithm combined with data bubble summarisation technique, which is validated by experimental evaluation for its effectiveness and scalability. Literature [27] combines data dimensionality reduction techniques with graph theory tools to propose a local graph-based relevance clustering method (LGBACC), which combines hierarchical clustering with PCA to discover complex hierarchical structures in high-dimensional data, and uses a graph model to visualise and display high-quality clustering results.
This article optimises and improves the traditional clustering algorithm. First, the data attributes are initially screened based on their information entropy values; after the redundant attributes are reduced, kernel principal component analysis is applied to the extracted attributes to achieve dimensionality reduction, and the k-means algorithm is then run on the dimensionality-reduced data. The article also introduces the entropy method to calculate attribute weights, and thus proposes a weighted k-means algorithm based on optimised initial clustering centres. The performance of the designed algorithm is evaluated at three levels: efficiency, scalability and parameter sensitivity. Finally, the clustering algorithm designed in this paper is used to cluster and analyse the population density of city A, which provides data support for the urban planning of city A and at the same time tests the scientific soundness and feasibility of the proposed algorithm.
Every day we receive a variety of information through phone calls, tweets, e-mails, books, and so on, and different pieces of information carry very different value. The primary function of information is to eliminate uncertainty, and the amount of information determines its value. The amount of information carried by an event is logarithmically related to its probability: the self-information of an event $x$ with probability $p(x)$ is $I(x)=\log_2\frac{1}{p(x)}=-\log_2 p(x)$.
Information entropy was introduced by Shannon in 1948 and is often used to quantify the amount of information contained in a system; it describes how complex or chaotic the system is. For a discrete random variable $X$ that takes the values $x_1, x_2, \ldots, x_n$ with probabilities $p(x_i)$, the information entropy is defined as
$$H(X) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i),$$
where $p(x_i)$ is the probability that $X$ takes the value $x_i$.
In addition, information entropy has three important properties that matter in practical applications: monotonicity, non-negativity and additivity. Monotonicity means that the smaller the probability of an event, the greater the amount of information it carries. Non-negativity means that information entropy can never be negative. Additivity means that the total uncertainty of several independent random events occurring in a system equals the sum of the uncertainties of the individual events.
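As a minimal illustration of the entropy measure used later for attribute screening, the sketch below estimates the information entropy of each attribute column from the empirical frequencies of its values; the column names and toy data are assumptions for the example, not values from the paper.

```python
import numpy as np
import pandas as pd

def attribute_entropy(column: pd.Series) -> float:
    """Shannon entropy H = -sum(p * log2 p) of one attribute,
    estimated from the empirical frequencies of its values."""
    probs = column.value_counts(normalize=True).to_numpy()
    probs = probs[probs > 0]                 # avoid log2(0)
    return float(-(probs * np.log2(probs)).sum())

# Hypothetical toy data; in practice this would be the attribute matrix of the dataset.
df = pd.DataFrame({
    "attr_a": [1, 1, 2, 2, 3, 3, 3, 3],
    "attr_b": [0, 0, 0, 0, 0, 0, 0, 1],
})
entropies = {name: attribute_entropy(df[name]) for name in df.columns}
print(entropies)  # attr_a carries more information (higher entropy) than attr_b
```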
Because the large amount of data makes the calculations cumbersome, we choose principal component analysis (PCA) to process the data. The basic idea is to replace multiple correlated indicators with a smaller set of composite indicators whose information does not overlap, so that they reflect the information carried by the original indicators as fully as possible, i.e., to compress the data. Since Z-Score standardisation of the original data in the traditional PCA algorithm cannot fully reveal the differences between the variables, we standardise the sample data with Min-Max normalisation to improve the dimensionality-reduction effect. The steps of principal component analysis are as follows:
Assume that the input data is a matrix of $n$ samples and $p$ attributes. Set the data matrix:
$$X=\left(x_{ij}\right)_{n\times p},\quad i=1,2,\ldots,n,\; j=1,2,\ldots,p.$$
Min-Max normalisation of the $j$th attribute is
$$x'_{ij}=\frac{x_{ij}-\min_{1\le i\le n} x_{ij}}{\max_{1\le i\le n} x_{ij}-\min_{1\le i\le n} x_{ij}},$$
where $\min_i x_{ij}$ and $\max_i x_{ij}$ are the minimum and maximum values of the $j$th attribute. The correlation coefficient matrix is
$$R=\left(r_{kl}\right)_{p\times p},\quad r_{kl}=\frac{\sum_{i=1}^{n}\left(x'_{ik}-\bar{x}'_{k}\right)\left(x'_{il}-\bar{x}'_{l}\right)}{\sqrt{\sum_{i=1}^{n}\left(x'_{ik}-\bar{x}'_{k}\right)^{2}}\sqrt{\sum_{i=1}^{n}\left(x'_{il}-\bar{x}'_{l}\right)^{2}}}.$$
The eigenvalues of matrix $R$ are ordered as $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p\ge 0$, with corresponding unit eigenvectors $e_1,e_2,\ldots,e_p$. The cumulative contribution of the first $m$ principal components is
$$\frac{\sum_{j=1}^{m}\lambda_j}{\sum_{j=1}^{p}\lambda_j}.$$
Generally, the eigenvalues whose cumulative contribution rate reaches 85% or more are retained, giving $m$ principal components. The principal component loading is
$$l_{jk}=\sqrt{\lambda_k}\,e_{jk},$$
where $e_{jk}$ is the $j$th component of the $k$th eigenvector. The score of the $k$th principal component for sample $i$ is
$$F_{ik}=\sum_{j=1}^{p} e_{jk}\,x'_{ij},\quad k=1,2,\ldots,m,$$
i.e., the projection of the normalised data onto the retained eigenvectors.
Principal component analysis was performed on the feature matrix, resulting in a final matrix of 2347 × 3.
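The following sketch mirrors the steps above: Min-Max normalisation followed by PCA, retaining the smallest number of components whose cumulative contribution reaches 85%. It uses scikit-learn for brevity; the synthetic matrix shape is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(2347, 10))              # hypothetical n x p attribute matrix

X_scaled = MinMaxScaler().fit_transform(X)   # Min-Max instead of Z-Score

pca = PCA().fit(X_scaled)
cum_contrib = np.cumsum(pca.explained_variance_ratio_)
m = int(np.searchsorted(cum_contrib, 0.85) + 1)   # smallest m with >= 85% contribution

scores = PCA(n_components=m).fit_transform(X_scaled)  # principal component scores
print(m, scores.shape)
```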
The principal component analysis discussed above is the main dimensionality-reduction method used in the academic community. Although it can alleviate part of the "curse of dimensionality", we should note that, in addition to linear relationships, there are nonlinear relationships between the indicators selected for this study, which calls for a nonlinear dimensionality-reduction method: KPCA (kernel principal component analysis). KPCA can be regarded as a natural extension of ordinary principal component analysis; its basic principle is to map variables that are not linearly separable into a high-dimensional feature space through a kernel function transformation [29]. The specific procedure of kernel principal component analysis is as follows:
Assume first that there are $n$ observations $x_1,x_2,\ldots,x_n$ mapped into a feature space by a nonlinear mapping $\phi$. Assume that the mapped observations are already centred, i.e.,
$$\sum_{i=1}^{n}\phi(x_i)=0.$$
Secondly, the covariance matrix of the mapped data in the feature space is
$$\bar{C}=\frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\phi(x_i)^{T}.$$
Next, find the eigenvalue $\lambda$ and eigenvector $v$ of $\bar{C}$ satisfying
$$\bar{C}v=\lambda v.$$
Since every eigenvector $v$ with $\lambda\neq 0$ lies in the span of $\phi(x_1),\ldots,\phi(x_n)$, there exists a coefficient vector $\alpha=(\alpha_1,\ldots,\alpha_n)^{T}$ such that
$$v=\sum_{i=1}^{n}\alpha_i\,\phi(x_i).$$
Comprehensively collating the previous equations (substituting $v$ into the eigenvalue equation and taking inner products with each $\phi(x_k)$) gives
$$n\lambda\sum_{i=1}^{n}\alpha_i\,\langle\phi(x_k),\phi(x_i)\rangle=\sum_{i=1}^{n}\alpha_i\Big\langle\phi(x_k),\sum_{j=1}^{n}\phi(x_j)\,\langle\phi(x_j),\phi(x_i)\rangle\Big\rangle,\quad k=1,\ldots,n.$$
Based on the above equation, define the kernel function $k(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle$ and the $n\times n$ kernel matrix
$$K=\left(K_{ij}\right)_{n\times n},\quad K_{ij}=k(x_i,x_j).$$
Then the above equation can be expressed as
$$K^{2}\alpha=n\lambda K\alpha .$$
Since $K$ is symmetric and positive semi-definite, it suffices to solve the simpler eigenvalue problem
$$K\alpha=n\lambda\alpha .$$
Combining the above, if the eigenvalues and eigenvectors of $K$ (equivalently of $K/n$) are computed and the eigenvectors $\alpha^{(m)}$ are normalised so that $\lambda_m\,\langle\alpha^{(m)},\alpha^{(m)}\rangle=1$, then in the feature space the projection of an observation $x$ onto the $m$th principal component is
$$\langle v^{(m)},\phi(x)\rangle=\sum_{i=1}^{n}\alpha_i^{(m)}\,k(x_i,x).$$
Finally, the steps of kernel principal component analysis are summarised as follows: a. Calculate the kernel matrix $K$. b. Calculate the eigenvalues and corresponding eigenvectors of $K/n$ and extract the top eigenvectors corresponding to the largest eigenvalues (chosen, for example, by the cumulative contribution rate). c. Normalise the extracted eigenvectors. d. Compute the projections of the data onto the normalised eigenvectors to obtain the dimensionality-reduced representation.
The kernel function can be regarded as an inner product in the feature space, which has the advantage that we do not need to know the exact expression of the mapping $\phi$. Commonly used kernel functions include the following.

Linear kernel function: $k(x,y)=x^{T}y$.

Polynomial kernel function: $k(x,y)=\left(x^{T}y+c\right)^{d}$, with $c\ge 0$ and integer degree $d$.

Gaussian kernel function: $k(x,y)=\exp\!\left(-\dfrac{\|x-y\|^{2}}{2\sigma^{2}}\right)$.

Sigmoid kernel function: $k(x,y)=\tanh\!\left(\beta\,x^{T}y+\theta\right)$.
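As a sketch of how the Gaussian (RBF) kernel matrix and its centred (modified) version used in KPCA can be computed, the snippet below assumes a bandwidth parameter sigma chosen only for the example.

```python
import numpy as np

def rbf_kernel_matrix(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-np.clip(sq_dists, 0, None) / (2 * sigma**2))

def center_kernel_matrix(K: np.ndarray) -> np.ndarray:
    """Centred kernel: K' = K - 1n K - K 1n + 1n K 1n, with 1n = ones(n, n) / n."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

X = np.random.default_rng(1).normal(size=(50, 4))   # hypothetical sample
K = rbf_kernel_matrix(X, sigma=2.0)
Kc = center_kernel_matrix(K)
eigvals, eigvecs = np.linalg.eigh(Kc)               # eigen-decomposition of the centred kernel
```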
Main idea of the k-means clustering algorithm based on dimensionality reduction. Kernel principal component analysis (KPCA) is a commonly used method for reducing the dimensionality of high-dimensional data. However, because the computational complexity of KPCA does not grow with the dimension of the transformed feature space but depends on the dimension of the input space (and on the number of samples), the amount of computation increases greatly when dealing with large-scale, high-dimensional data. This paper therefore improves it as follows: first, based on the information entropy value of each attribute and a specified threshold, the attributes whose entropy value is greater than the threshold are retained in an initial screening, attributes with small contribution are removed, and the useful attribute information is extracted; then kernel principal component analysis is applied to the extracted attributes to reduce the dimensionality of the data; finally, the k-means algorithm is run on the dimensionality-reduced data, thereby improving computational efficiency.

Steps of the k-means clustering algorithm based on dimensionality reduction. Let the dataset to be clustered contain $n$ data objects, each described by $p$ attributes. According to the above discussion, the steps of the algorithm are as follows:

Step 1: Calculate the degree of fluctuation of the data in each attribute according to the information entropy formula.

Step 2: According to the specified threshold, retain the attributes whose entropy value is greater than the threshold and remove the attributes with small contribution.

Step 3: Reduce the dimensionality of the data using the kernel principal component analysis algorithm: a. arrange the retained attributes of the input data into a matrix; b. select the Gaussian radial basis kernel function according to the formula above and compute the kernel matrix $K$; c. compute the modified (centred) kernel matrix; d. compute its eigenvalues and eigenvectors and orthogonalise the eigenvectors to unit length; e. calculate the cumulative contribution rate and keep the leading components; f. calculate the projection of the data onto the retained components to obtain the reduced data.

Step 4: Execute the k-means algorithm on the dimensionality-reduced data.
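A minimal end-to-end sketch of the pipeline described above, using scikit-learn's KernelPCA and KMeans as stand-ins; the entropy threshold, the binning used to estimate attribute entropy, the kernel parameter and the synthetic data are assumptions for illustration, not values from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

def attribute_entropies(df: pd.DataFrame, bins: int = 10) -> pd.Series:
    """Estimate the information entropy of each attribute after discretising it into bins."""
    ent = {}
    for col in df.columns:
        p = pd.cut(df[col], bins=bins).value_counts(normalize=True).to_numpy()
        p = p[p > 0]
        ent[col] = float(-(p * np.log2(p)).sum())
    return pd.Series(ent)

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 12)),
                  columns=[f"attr_{i}" for i in range(12)])   # hypothetical dataset

# Steps 1-2: keep attributes whose entropy exceeds a specified (assumed) threshold.
threshold = 2.0
ent = attribute_entropies(df)
kept = ent[ent > threshold].index
X = df[kept].to_numpy()

# Step 3: kernel PCA with a Gaussian (RBF) kernel.
X_reduced = KernelPCA(n_components=3, kernel="rbf", gamma=0.1).fit_transform(X)

# Step 4: k-means on the dimensionality-reduced data.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:10])
```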
Because the selection of the initial clustering centres is random and sensitive, the traditional k-means clustering algorithm is prone to falling into local optima and producing unstable clustering results, and it consumes a large amount of time and computation when processing large-scale, high-dimensional data. In this paper, we first use simple random sampling to simplify the dataset, obtaining a small sample whose distribution is basically the same as that of the original dataset; according to the tightness of the spatial distribution of the sampled points, minimum-variance optimisation is then used to select the initial clustering centres for the k-means algorithm, which reduces the adverse effect of uncertain factors such as anomalous points on the initial centres and strengthens the clustering results. Secondly, to overcome the influence that the different attributes of the sample data have on the clustering results during the clustering calculation, the entropy method is introduced to calculate attribute weights and improve clustering accuracy. On this basis, a weighted k-means algorithm based on optimised initial clustering centres is proposed, and numerical experiments are analysed and compared to verify the feasibility and validity of the proposed algorithm.
Data sampling preprocessing. This paper uses the simple random sampling method, because the overall distribution of the samples obtained by sampling is similar to that of the original data, so the resulting sample can represent the distribution of the original data well. Since the number of anomalous points and outliers in the data is very small, the probability that such points are drawn is very low, which effectively avoids selecting outliers as initial clustering centres and greatly reduces their impact on the clustering results. Simple random sampling draws a fixed number of objects from the original dataset, each object being drawn with equal probability.

Basic concepts. Let the sampled dataset be $S=\{x_1,x_2,\ldots,x_m\}$. Define the distance between two samples $x_i$ and $x_j$ as the Euclidean distance $d(x_i,x_j)=\|x_i-x_j\|$.

Data distribution and sample data distribution
Denote by $\bar{d}$ the average distance between all pairs of samples in $S$. Define the variance of a sample as the variance of its distances to all the other samples; a small variance indicates that the samples around it are tightly distributed.

Steps of the algorithm for optimising the initial clustering centres. Based on the previous description, the specific steps of the algorithm are as follows:

Step 1: Calculate the distance $d(x_i,x_j)$ between every pair of samples in the sampled dataset.

Step 2: Calculate the average distance $\bar{d}$ between all samples.

Step 3: Calculate the variance of each candidate sample.

Step 4: Find the sample with the smallest variance in the candidate set, take it as an initial clustering centre, and remove it from the candidate set.

Step 5: Repeat steps 3 and 4 until $k$ initial clustering centres are obtained.
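The fragmentary steps above leave the exact formulas unspecified; the sketch below is one plausible reading, in which the "variance" of a sampled point is taken to be the variance of its distances to the other sampled points, and the points with the smallest variance become the initial centres. Both that interpretation and the toy data are assumptions, not the paper's definitive procedure.

```python
import numpy as np

def min_variance_initial_centers(sample: np.ndarray, k: int) -> np.ndarray:
    """Pick k initial centres from a random sample: one plausible reading of the
    minimum-variance rule, where a point's 'variance' is the variance of its
    distances to all other sampled points (small variance = tight neighbourhood)."""
    dists = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=-1)
    variances = dists.var(axis=1)            # per-point variance of distances
    order = np.argsort(variances)            # smallest variance first
    return sample[order[:k]]

rng = np.random.default_rng(2)
sample = rng.normal(size=(100, 2))           # simple random sample of the dataset
centers = min_variance_initial_centers(sample, k=3)
print(centers)
```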
In this paper, we propose a weighted Euclidean distance that assigns a weight to each attribute involved in clustering according to its importance, thereby reducing the impact of factors such as outliers on the clustering results [30].
Define the weight vector $w=(w_1,w_2,\ldots,w_p)$, where $w_t\ge 0$ is the weight of the $t$th attribute and $\sum_{t=1}^{p}w_t=1$. With the weights introduced, the weighted Euclidean distance between a sample $x_i$ and a cluster centre $c_j$ is
$$d_w(x_i,c_j)=\sqrt{\sum_{t=1}^{p}w_t\left(x_{it}-c_{jt}\right)^{2}}.$$
Therefore the improved objective function is
$$J=\sum_{j=1}^{k}\sum_{x_i\in C_j}\sum_{t=1}^{p}w_t\left(x_{it}-c_{jt}\right)^{2},$$
where $C_j$ denotes the $j$th cluster and $c_{jt}$ the $t$th attribute value of its centre.

Entropy value method for calculating attribute weights. Next, the entropy value method is used to calculate the weight of each attribute according to its contribution.

Data normalisation. Construct the attribute value matrix $X=(x_{ij})_{n\times p}$ of the clustering dataset to be tested and apply Min-Max normalisation:
$$r_{ij}=\frac{x_{ij}-\min_{1\le i\le n}x_{ij}}{\max_{1\le i\le n}x_{ij}-\min_{1\le i\le n}x_{ij}},$$
where $\min_i x_{ij}$ and $\max_i x_{ij}$ are the minimum and maximum values of the $j$th attribute. The proportion of the $i$th object under the $j$th attribute is
$$p_{ij}=\frac{r_{ij}}{\sum_{i=1}^{n}r_{ij}}.$$

Calculate the entropy value of the attribute. The entropy value of the attribute in dimension $j$ is
$$e_j=-\frac{1}{\ln n}\sum_{i=1}^{n}p_{ij}\ln p_{ij},$$
where $0\le e_j\le 1$.

Calculate the coefficient of variation of the attribute. The coefficient of variation of the attribute in dimension $j$ is
$$g_j=1-e_j,$$
where a larger $g_j$ indicates that the attribute contributes more to distinguishing the data objects.

Calculate the weights of attributes. The weight of the attribute in dimension $j$ is
$$w_j=\frac{g_j}{\sum_{j=1}^{p}g_j}.$$
The Euclidean distance with these attribute weights is then used throughout the clustering process.

Basic steps of the weighted k-means algorithm based on optimised initial clustering centres. In summary, the steps of the weighted k-means clustering algorithm based on optimised initial centres are as follows:

Step 1: Apply simple random sampling to the dataset to be clustered to obtain a sample set.

Step 2: Apply the minimum-variance initial clustering centre algorithm to the sampled data to obtain $k$ initial clustering centres.

Step 3: Calculate the weight of each attribute of the data objects according to the entropy value formulas.

Step 4: Assign each data sample to the cluster of its closest centre according to the weighted distance formula.

Step 5: Recalculate the centre of each cluster as the mean of the samples assigned to it.

Step 6: Check whether the algorithm stops: if the objective function value is minimised or remains unchanged, the iteration ends; otherwise, return to step 4.
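A compact sketch of the entropy weight method and the weighted distance it feeds into; the small toy matrix is an assumption, and the tiny constant added before taking proportions (to avoid the logarithm of zero) is a common practical adjustment rather than part of the formulas above.

```python
import numpy as np

def entropy_weights(X: np.ndarray) -> np.ndarray:
    """Entropy weight method: Min-Max normalise, compute per-attribute entropy e_j,
    coefficient of variation g_j = 1 - e_j, and weights w_j = g_j / sum(g)."""
    n, p = X.shape
    r = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    prop = (r + 1e-12) / (r + 1e-12).sum(axis=0)        # proportions p_ij
    e = -(prop * np.log(prop)).sum(axis=0) / np.log(n)  # entropy of each attribute
    g = 1.0 - e                                         # coefficient of variation
    return g / g.sum()

def weighted_distance(x: np.ndarray, c: np.ndarray, w: np.ndarray) -> float:
    """Weighted Euclidean distance between a sample and a cluster centre."""
    return float(np.sqrt((w * (x - c) ** 2).sum()))

rng = np.random.default_rng(3)
X = rng.uniform(size=(200, 4))                # hypothetical data
w = entropy_weights(X)
centres = X[rng.choice(len(X), size=3, replace=False)]
# Assign each sample to the closest centre under the weighted distance.
labels = np.array([np.argmin([weighted_distance(x, c, w) for c in centres]) for x in X])
print(w, np.bincount(labels))
```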
Based on seven datasets of quite different sizes (from 82MB to 2560MB), the algorithm proposed in this paper, the Par2PK-Means algorithm and the ParCLARA algorithm are executed in a parallel environment (a Hadoop cluster consisting of 8 nodes), while the K-means algorithm is run in a stand-alone environment, in order to evaluate whether the proposed algorithm can handle larger datasets with greater efficiency. The comparison of execution times on the seven datasets is shown in Table 1. From the table, it can be seen that the serial K-means algorithm in a centralised single-machine environment cannot handle datasets larger than 1280MB; otherwise it suffers from "memory overflow". In a distributed environment (i.e., a "big data" environment), however, the parallel algorithm in this paper can effectively handle large-scale datasets exceeding 1280MB. In particular, this algorithm is more efficient than the ParCLARA and Par2PK-Means algorithms, because the clustering initialisation strategy of K-means is optimised in this paper's algorithm. Meanwhile, the execution time of the K-means algorithm is shorter than that of this paper's algorithm on small datasets, because communication and interaction between the nodes of the distributed parallel environment take up a considerable part of the total execution time, making the overall running time longer than the computation time of the actual task. More importantly, as the size of the dataset continues to grow, the efficiency advantage of this paper's algorithm becomes increasingly pronounced.
Table 1 Execution time comparison on seven data sets (execution time in seconds)

| Data set | Size (MB) | K-means | ParCLARA | Par2PK-Means | Ours |
|---|---|---|---|---|---|
| Taxi trajectory | 80 | 78 | 658 | 432 | 310 |
| | 150 | 100 | 755 | 480 | 375 |
| | 310 | 190 | 892 | 578 | 400 |
| | 650 | 1120 | 1175 | 830 | 615 |
| | 1280 | 2365 | 1665 | 1350 | 1025 |
| | 2560 | - | 2735 | 2412 | 1875 |
| Iris | 80 | 30 | 610 | 300 | 210 |
| | 150 | 55 | 690 | 375 | 275 |
| | 310 | 120 | 815 | 450 | 300 |
| | 650 | 680 | 1035 | 640 | 405 |
| | 1280 | 1750 | 1200 | 1001 | 765 |
| | 2560 | - | 2455 | 2015 | 1420 |
| Haberman’s survival | 80 | 50 | 576 | 312 | 295 |
| | 150 | 65 | 677 | 400 | 310 |
| | 310 | 125 | 834 | 480 | 375 |
| | 650 | 715 | 981 | 675 | 425 |
| | 1280 | 2020 | 1325 | 1250 | 975 |
| | 2560 | - | 2445 | 2200 | 1725 |
| Ecoli | 80 | 35 | 620 | 350 | 281 |
| | 150 | 60 | 715 | 400 | 326 |
| | 310 | 125 | 842 | 455 | 370 |
| | 650 | 755 | 1123 | 670 | 485 |
| | 1280 | 1925 | 1635 | 1230 | 765 |
| | 2560 | - | 2578 | 2210 | 1420 |
| Hayes-roth | 80 | 40 | 567 | 312 | 275 |
| | 150 | 55 | 640 | 390 | 340 |
| | 310 | 120 | 725 | 470 | 385 |
| | 650 | 725 | 962 | 650 | 435 |
| | 1280 | 1985 | 1475 | 1200 | 735 |
| | 2560 | - | 2430 | 2100 | 1577 |
| Lenses | 80 | 30 | 620 | 300 | 295 |
| | 150 | 62 | 705 | 385 | 325 |
| | 310 | 135 | 925 | 450 | 375 |
| | 650 | 720 | 1052 | 540 | 430 |
| | 1280 | 1890 | 1345 | 1080 | 815 |
| | 2560 | - | 2375 | 2085 | 1475 |
| Wine | 80 | 55 | 625 | 360 | 310 |
| | 150 | 76 | 700 | 410 | 355 |
| | 310 | 125 | 834 | 460 | 400 |
| | 650 | 732 | 1000 | 600 | 425 |
| | 1280 | 2200 | 1345 | 1240 | 845 |
| | 2560 | - | 2564 | 2250 | 1625 |
To verify the scalability of the algorithm proposed in this paper, we tested whether it can handle larger datasets when more nodes participate in the computation. For this purpose, in the scalability experiment, the algorithm was compared on different datasets while increasing the dataset size (from 150MB to 1280MB) and the number of nodes (from 1 node to 8 nodes) in the same proportion, as shown in Fig. 2(a). Meanwhile, the scalability of this paper's algorithm is further verified by comparing it with the ParCLARA and Par2PK-Means algorithms on the Iris dataset (from 150MB to 1280MB); the comparison is shown in Fig. 2(b). From Fig. 2(a), it can be observed that as the number of nodes increases in the same proportion as the dataset size, the scalability values of this paper's algorithm are less than 1 but all higher than 0.62. In addition, Fig. 2(b) shows that this paper's algorithm achieves better scalability than the Par2PK-Means and ParCLARA algorithms (the scalability values of this paper's algorithm, Par2PK-Means and ParCLARA are higher than 0.66, 0.59 and 0.53, respectively). These results show that the parallel algorithm proposed in this paper has excellent scalability and strong adaptability when processing large-scale datasets on the Hadoop platform with the MapReduce framework.

Expansionary assessment
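The scalability (scaleup) values quoted above are commonly computed as the ratio of the single-node running time on the base dataset to the running time when both the data size and the number of nodes are multiplied by the same factor; the paper does not spell this out, so the standard definition is stated here as an assumption:
$$\mathrm{Scaleup}(m)=\frac{T_{1}(D)}{T_{m}(m\cdot D)},$$
where $T_{1}(D)$ is the execution time on one node with dataset $D$ and $T_{m}(m\cdot D)$ is the execution time on $m$ nodes with an $m$-times larger dataset; values closer to 1 indicate better scalability.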
Before executing the algorithm, two important parameters need to be set in advance; the sensitivity of the algorithm to these parameters is analysed below (Fig. 3).

Parametric sensitivity analysis
This section applies the improved algorithm of this paper to an actual big data analysis task: the GPS location information of mobile users in a city is analysed, mainly in terms of time and location, to determine which areas the citizens of this city like to visit during certain periods and which areas they visit frequently. The results can support better planning for the development of the city and the improvement and implementation of traffic and safety measures in commercial areas, providing a better living environment for the citizens.
The available data left after screening mainly contains four columns: ID number, time of data creation (upload), and latitude and longitude. The data structure is shown in Table 2.
Table 2 Data structure
Name | Type | Remark |
---|---|---|
Id_Num | String | Id Number |
Create_Time | Date | Create (Upload) Time |
Lng | Double | Longitude |
Lat | Double | Latitude |
Because some citizens set their phone's GPS function not to connect to the Internet, accurate data cannot be uploaded, and it is also possible that the system automatically uploads default GPS coordinates; as a result, a lot of the location information is positioned in other regions of the country or abroad. Points like these, which are not within the area of this city, need to be cleaned out in advance, both to avoid unnecessary waste of computing resources and to improve the efficiency of the analysis.
The original data contains too many fields; since this paper does not need all of them, only the fields listed in the table are kept, and more fields can be extracted if the data is analysed further in the future. The data is mainly stored on HDFS, with the four columns in each row separated by commas.
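A small sketch of how the four comma-separated columns could be loaded and pre-cleaned once pulled from HDFS; the file name and the bounding box of city A are placeholders for illustration, not values from the paper.

```python
import pandas as pd

# Hypothetical local path and bounding box; real values depend on the HDFS layout and city A.
COLUMNS = ["Id_Num", "Create_Time", "Lng", "Lat"]
CITY_BOUNDS = {"lng_min": 113.0, "lng_max": 114.5, "lat_min": 22.5, "lat_max": 23.5}

df = pd.read_csv("gps_points.csv", names=COLUMNS,
                 dtype={"Id_Num": str}, parse_dates=["Create_Time"])

# Drop points positioned outside the city (e.g. default or mis-reported GPS data).
in_city = (df["Lng"].between(CITY_BOUNDS["lng_min"], CITY_BOUNDS["lng_max"]) &
           df["Lat"].between(CITY_BOUNDS["lat_min"], CITY_BOUNDS["lat_max"]))
df = df[in_city].dropna()
```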
This section carries out a series of calculations based on the data described in the previous subsection and the data analysis programme, and presents the results in the form of charts and graphs so that they can serve as a reference for urban planning. The analysis focuses on the density of population activity in the urban area over a long period of time and on special festival days.
The geographical locations corresponding to the centre points of the largest clusters produced by the improved clustering algorithm of this paper were identified, and the top six areas of population gathering density in city A are shown in Fig. 4. The proportions in the figure are based on statistics over more than 20 million GPS records. The top six areas were chosen because the number of people in them is large and relatively concentrated. As can be seen from the figure, the highest proportion belongs to area A, which accounts for thirty-one per cent of the six areas, showing that the amount of user activity in this area is very large. The recommendation to the municipal government is to treat this area as a priority in traffic contingency planning and as a key target for fire and disaster prevention. Such figures are the result of clustering statistics over a large amount of data and cannot reflect the real situation with complete accuracy, but as statistical conclusions they have relative reference value, and they can be combined with other indicators to help decision-makers make reasonable decisions.

A region of the first six of the population of a city
Since the amount of user activity in area A is relatively large, the 24-hour data for area A is analysed further. The day is divided into 24 parts by time, and the records are aggregated by hour: only the two characters representing the hour are taken from the "creation time" field and the other characters are filtered out, so that the data contains only the 24 values 00-23; the records are then grouped by these 24 values and the number in each group is counted. The twenty-four-hour population change in area A is shown in Figure 5. From the figure, it can be seen that the number of people is higher in two time periods every day, from ten to eleven and from thirteen to seventeen. Based on such statistics, it can be known at which times the number of people in area A is higher or lower, which is also a useful reference for safe travel in the city.

The change of people’s mouth in a region
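The hour-of-day statistics described above amount to a simple group-by on the hour extracted from the creation time; the sketch below uses a tiny toy DataFrame standing in for the cleaned GPS data from the earlier sketch.

```python
import pandas as pd

# Toy stand-in for the cleaned GPS records of area A.
df = pd.DataFrame({"Create_Time": pd.to_datetime(
    ["2023-08-22 10:15:00", "2023-08-22 10:40:00", "2023-08-22 14:05:00"])})

# Keep only the two-character hour (00-23) and count records per hour.
df["hour"] = df["Create_Time"].dt.strftime("%H")
hourly_counts = df.groupby("hour").size().sort_index()
print(hourly_counts)   # number of GPS points in area A for each hour of the day
```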
The user flow in area A on a special date was then analysed. The twenty-four-hour change in the number of people in area A on the 2023 Qixi Festival is shown in Figure 6. As can be seen from the figure, from zero to six o'clock there is basically no user flow, since users are still asleep, which is consistent with real life. From seven o'clock activity begins to increase slowly, until the number of people reaches its morning peak at eleven. Around noon the number of people declines for a while, probably because users who have finished their morning activities feel somewhat tired and choose to go home or elsewhere. After that, the number of users rises again; the afternoon is the most active period, until around seventeen o'clock, when user activity gradually declines. This chart fully demonstrates the activity pattern of the crowd in area A; the municipal government, the public security bureau and the fire department can use such data to arrange personnel reasonably, organise evacuations and guard against security hazards.

The number of people in a region of 2023
To show the distribution of the cluster centroids on the map, only the centroids of the first ten of the fifty clusters were selected and placed on the map; the resulting cluster centres are displayed as shown in Figure 7.

The cluster center is shown on the map
In the era of "Internet+" and "big data", big data has attracted great attention from the scientific and technological community, industry and government departments. This paper examines efficient clustering algorithms for big data applications, and the conclusions are as follows:
In the parameter sensitivity analysis, as the set minimum number of trajectory clusters increases, the running time of the algorithm designed in this paper decreases slightly, and its running time is much smaller than that of the comparison CLUR algorithm, which indicates that the algorithm in this paper is less sensitive to this parameter than the CLUR algorithm.
In the cluster analysis of human flow in the city, it was found that the two time periods of ten to eleven and thirteen to seventeen have the most human flow in area A.