Design and Performance Evaluation of Efficient Clustering Algorithms for Big Data Applications
Published Online: Feb 05, 2025
Received: Sep 22, 2024
Accepted: Dec 31, 2024
DOI: https://doi.org/10.2478/amns-2025-0056
© 2025 Ping Dai et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
With advances in digital sensors, communication and storage technologies, huge amounts of data are collected and stored every day, leading to an explosion of data. In particular, data related to the Internet and the mobile Internet, as well as data generated by sensor networks, can often reach terabytes or even petabytes, and such data continues to be generated in large quantities every minute of every day. The only way to make big data truly meaningful is to process it with technology and turn it into data that people can read and understand as they need [1-2].
While a database is a collection of data records in which each data point represents a record and each record contains a number of attributes, multidimensional data is a collection of data records in a high-dimensional space, where the points can represent locations in space and, more generally, each attribute of a record corresponds to a dimension [3-5]. Using data mining techniques, valid, novel and understandable patterns can be obtained from the large amounts of data stored in databases [6-7]. Clustering is the process of dividing data into groups or sets such that objects within the same set have high similarity while objects in different sets are distinct from each other [8-10]. Cluster analysis can be used both as a standalone data mining tool to obtain the distribution of the data and as a preprocessing step for other data mining algorithms [11-13]. How to improve traditional clustering algorithms to ensure the quality and efficiency of clustering in the context of big data has become an important research topic in artificial intelligence and big data processing.
Clustering algorithms can be broadly grouped into five categories: division-based clustering, density-based clustering, grid-based clustering, hierarchy-based clustering and graph-theory-based clustering [14-15]. Literature [16] showed that the performance of division-based clustering algorithms depends strongly on the initial clustering centres, so genetic algorithms (GA) and teaching-learning-based optimization (TLBO) were used to address convergence to local minima and to compare the clustering performance improvements obtained with the two methods. Literature [17] explained that in density-based clustering, regions of high-density objects in the data space are separated by neighbouring regions of low-density objects, thereby forming clusters; on the basis of this concept several related density-based algorithms are derived, and the open challenges faced in designing such algorithms are presented. Literature [18] introduced a grid-based clustering algorithm that divides the data space into a limited number of grid-structured cells, which provides better clustering performance for large-scale multidimensional datasets, although the algorithm is severely limited by the grid size, the boundaries and the density thresholds of the important cells. Literature [19] shows that hierarchical clustering methods can give multiple consistent divisions of the same data at different levels without re-running the clustering, so this type of algorithm can better analyse the complex structure of the data, although different similarity metrics may lead to large differences in the hierarchical clustering results. Literature [20] investigated graph-theory-based multi-view clustering; the established generalised graph-based multi-view clustering system (GBS) generates a unified graph matrix by extracting the data feature matrices of the views and fusing them to form the final clustering results.
Clustering algorithm design is a dynamic research area in data mining that requires scholars to develop and improve on the original clustering algorithms. Literature [21] reviewed the research on the improvement of clustering algorithms based on Hadoop distributed platform with MapReduce programming paradigm, which gives the traditional division-based clustering algorithms the ability to obtain different clustering objectives on different datasets in order to reduce the computation time. Literature [22] proposed a two-stage clustering method, KMDD, to address the inefficiency exhibited by spatial data clustering methods on larger scale datasets by clustering the dataset into sub-clusters through a division-based algorithm, and then assigning a local density to each sub-cluster, and merging the sub-clusters under a density-based algorithm. Literature [23] proposed a Random Sample Partitioning-based Clustering Integration Algorithm (RSP-CE), which generates base clustering results on the entire large dataset and performs the Maximum Mean Difference Criterion (MMCDC) harmonisation and simplification, so that the base clustering results on the different dataset subsets can approximate the clustering results on the entire large dataset. Literature [24] states that Density Based Clustering Algorithms (DBCLAs) are capable of identifying clusters of arbitrary shapes and sizes with different densities, and can also be subdivided according to the density definition, parameter sensitivity, execution patterns and data properties for better application in various domains. Literature [25] highlights the advantages of grid-based clustering algorithms with fast processing time and the ability to discover clusters of arbitrary shapes, and for the problem of proper selection of the number of grid cells in the dataset, an optimisation scheme based on the k-distance function is proposed to improve the performance of grid-based clustering algorithms. Literature [26] discusses how to use MapReduce programming model to improve the operation efficiency of density-based hierarchical clustering algorithm on large datasets, and proposes a single-chain parallel clustering scheme based on HDBSCAN* algorithm combined with data bubble summarisation technique, which is validated by experimental evaluation for its effectiveness and scalability. Literature [27] combines data dimensionality reduction techniques with graph theory tools to propose a local graph-based relevance clustering method (LGBACC), which combines hierarchical clustering with PCA to discover complex hierarchical structures in high-dimensional data, and uses a graph model to visualise and display high-quality clustering results.
This article optimises and improves the traditional clustering algorithm. First, the data attributes are initially screened based on their information entropy values; after the redundant attributes are reduced, kernel principal component analysis is applied to the extracted attributes to achieve dimensionality reduction, and the k-means algorithm is then run on the dimensionality-reduced data. The article also introduces the entropy method to calculate attribute weights, and thus proposes a weighted k-means algorithm based on optimised initial clustering centres. The performance of the designed algorithm is evaluated at three levels: efficiency, scalability and parameter sensitivity. Finally, the clustering algorithm designed in this paper is used to cluster and analyse the population density of city A, which provides data support for the urban planning of city A and at the same time tests the scientific soundness and feasibility of the proposed algorithm.
Every day we receive a variety of information through phone calls, tweets, e-mails, books, and so on, and different pieces of information carry very different value. The primary function of information is to eliminate uncertainty, and the amount of information determines its value. The amount of information carried by an event is logarithmically related to its probability: the self-information of an event $x$ with probability $p(x)$ is $I(x)=\log_2\frac{1}{p(x)}=-\log_2 p(x)$.
Information entropy was introduced by Shannon in 1948 and is often used to quantify the amount of information contained in a system; it describes how complex or chaotic the system is. For a discrete random variable $X$ that takes the values $x_1, x_2, \ldots, x_n$ with probabilities $p(x_i)$, the information entropy is defined as
$$H(X) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i),$$
where $p(x_i)$ is the probability that $X$ takes the value $x_i$.
In addition, information entropy has three important properties that matter in practical applications: monotonicity, non-negativity and additivity. Monotonicity means that the smaller the probability of an event, the greater the amount of information it carries. Non-negativity means that information entropy can never be negative. Additivity means that the total uncertainty of several independent random events occurring in a system equals the sum of the uncertainties of the individual events.
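As a minimal illustration of the entropy measure used later for attribute screening, the sketch below estimates the information entropy of each attribute column from the empirical frequencies of its values; the column names and toy data are assumptions for the example, not values from the paper.

```python
import numpy as np
import pandas as pd

def attribute_entropy(column: pd.Series) -> float:
    """Shannon entropy H = -sum(p * log2 p) of one attribute,
    estimated from the empirical frequencies of its values."""
    probs = column.value_counts(normalize=True).to_numpy()
    probs = probs[probs > 0]                 # avoid log2(0)
    return float(-(probs * np.log2(probs)).sum())

# Hypothetical toy data; in practice this would be the attribute matrix of the dataset.
df = pd.DataFrame({
    "attr_a": [1, 1, 2, 2, 3, 3, 3, 3],
    "attr_b": [0, 0, 0, 0, 0, 0, 0, 1],
})
entropies = {name: attribute_entropy(df[name]) for name in df.columns}
print(entropies)  # attr_a carries more information (higher entropy) than attr_b
```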
Because the large amount of data makes the calculations cumbersome, we choose principal component analysis (PCA) to process the data. The basic idea is to replace multiple correlated indicators with a smaller set of composite indicators whose information does not overlap, so that they reflect the information carried by the original indicators as fully as possible, i.e., to compress the data. Since Z-Score standardisation of the original data in the traditional PCA algorithm cannot fully reveal the differences between the variables, we standardise the sample data with Min-Max normalisation to improve the dimensionality-reduction effect. The steps of principal component analysis are as follows:
Assume that the input data is a matrix of $n$ samples and $p$ attributes. Set the data matrix:
$$X=\left(x_{ij}\right)_{n\times p},\quad i=1,2,\ldots,n,\; j=1,2,\ldots,p.$$
Min-Max normalisation of the $j$th attribute is
$$x'_{ij}=\frac{x_{ij}-\min_{1\le i\le n} x_{ij}}{\max_{1\le i\le n} x_{ij}-\min_{1\le i\le n} x_{ij}},$$
where $\min_i x_{ij}$ and $\max_i x_{ij}$ are the minimum and maximum values of the $j$th attribute. The correlation coefficient matrix is
$$R=\left(r_{kl}\right)_{p\times p},\quad r_{kl}=\frac{\sum_{i=1}^{n}\left(x'_{ik}-\bar{x}'_{k}\right)\left(x'_{il}-\bar{x}'_{l}\right)}{\sqrt{\sum_{i=1}^{n}\left(x'_{ik}-\bar{x}'_{k}\right)^{2}}\sqrt{\sum_{i=1}^{n}\left(x'_{il}-\bar{x}'_{l}\right)^{2}}}.$$
The eigenvalues of matrix $R$ are ordered as $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p\ge 0$, with corresponding unit eigenvectors $e_1,e_2,\ldots,e_p$. The cumulative contribution of the first $m$ principal components is
$$\frac{\sum_{j=1}^{m}\lambda_j}{\sum_{j=1}^{p}\lambda_j}.$$
Generally, the eigenvalues whose cumulative contribution rate reaches 85% or more are retained, giving $m$ principal components. The principal component loading is
$$l_{jk}=\sqrt{\lambda_k}\,e_{jk},$$
where $e_{jk}$ is the $j$th component of the $k$th eigenvector. The score of the $k$th principal component for sample $i$ is
$$F_{ik}=\sum_{j=1}^{p} e_{jk}\,x'_{ij},\quad k=1,2,\ldots,m,$$
i.e., the projection of the normalised data onto the retained eigenvectors.
Principal component analysis was performed on the feature matrix, resulting in a final matrix of 2347 × 3.
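The following sketch mirrors the steps above: Min-Max normalisation followed by PCA, retaining the smallest number of components whose cumulative contribution reaches 85%. It uses scikit-learn for brevity; the synthetic matrix shape is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(2347, 10))              # hypothetical n x p attribute matrix

X_scaled = MinMaxScaler().fit_transform(X)   # Min-Max instead of Z-Score

pca = PCA().fit(X_scaled)
cum_contrib = np.cumsum(pca.explained_variance_ratio_)
m = int(np.searchsorted(cum_contrib, 0.85) + 1)   # smallest m with >= 85% contribution

scores = PCA(n_components=m).fit_transform(X_scaled)  # principal component scores
print(m, scores.shape)
```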
The principal component analysis discussed above is the main dimensionality-reduction method used in the academic community. Although it can alleviate part of the "curse of dimensionality", we should note that, in addition to linear relationships, there are nonlinear relationships between the indicators selected for this study, which calls for a nonlinear dimensionality-reduction method: KPCA (kernel principal component analysis). KPCA can be regarded as a natural extension of ordinary principal component analysis; its basic principle is to map variables that are not linearly separable into a high-dimensional feature space through a kernel function transformation [29]. The specific procedure of kernel principal component analysis is as follows:
Assume first that there are $n$ observations $x_1,x_2,\ldots,x_n$ mapped into a feature space by a nonlinear mapping $\phi$. Assume that the mapped observations are already centred, i.e.,
$$\sum_{i=1}^{n}\phi(x_i)=0.$$
Secondly, the covariance matrix of the mapped data in the feature space is
$$\bar{C}=\frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\phi(x_i)^{T}.$$
Next, find the eigenvalue $\lambda$ and eigenvector $v$ of $\bar{C}$ satisfying
$$\bar{C}v=\lambda v.$$
Since every eigenvector $v$ with $\lambda\neq 0$ lies in the span of $\phi(x_1),\ldots,\phi(x_n)$, there exists a coefficient vector $\alpha=(\alpha_1,\ldots,\alpha_n)^{T}$ such that
$$v=\sum_{i=1}^{n}\alpha_i\,\phi(x_i).$$
Comprehensively collating the previous equations (substituting $v$ into the eigenvalue equation and taking inner products with each $\phi(x_k)$) gives
$$n\lambda\sum_{i=1}^{n}\alpha_i\,\langle\phi(x_k),\phi(x_i)\rangle=\sum_{i=1}^{n}\alpha_i\Big\langle\phi(x_k),\sum_{j=1}^{n}\phi(x_j)\,\langle\phi(x_j),\phi(x_i)\rangle\Big\rangle,\quad k=1,\ldots,n.$$
Based on the above equation, define the kernel function $k(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle$ and the $n\times n$ kernel matrix
$$K=\left(K_{ij}\right)_{n\times n},\quad K_{ij}=k(x_i,x_j).$$
Then the above equation can be expressed as
$$K^{2}\alpha=n\lambda K\alpha .$$
Since $K$ is symmetric and positive semi-definite, it suffices to solve the simpler eigenvalue problem
$$K\alpha=n\lambda\alpha .$$
Combining the above, if the eigenvalues and eigenvectors of $K$ (equivalently of $K/n$) are computed and the eigenvectors $\alpha^{(m)}$ are normalised so that $\lambda_m\,\langle\alpha^{(m)},\alpha^{(m)}\rangle=1$, then in the feature space the projection of an observation $x$ onto the $m$th principal component is
$$\langle v^{(m)},\phi(x)\rangle=\sum_{i=1}^{n}\alpha_i^{(m)}\,k(x_i,x).$$
Finally, the steps of kernel principal component analysis are summarised as follows: a. Calculate the kernel matrix $K$. b. Calculate the eigenvalues and corresponding eigenvectors of $K/n$ and extract the top eigenvectors corresponding to the largest eigenvalues (chosen, for example, by the cumulative contribution rate). c. Normalise the extracted eigenvectors. d. Compute the projections of the data onto the normalised eigenvectors to obtain the dimensionality-reduced representation.
The kernel function can be regarded as an inner product in the feature space, which has the advantage that we do not need to know the exact expression of the mapping $\phi$. Commonly used kernel functions include the following.

Linear kernel function: $k(x,y)=x^{T}y$.

Polynomial kernel function: $k(x,y)=\left(x^{T}y+c\right)^{d}$, with $c\ge 0$ and integer degree $d$.

Gaussian kernel function: $k(x,y)=\exp\!\left(-\dfrac{\|x-y\|^{2}}{2\sigma^{2}}\right)$.

Sigmoid kernel function: $k(x,y)=\tanh\!\left(\beta\,x^{T}y+\theta\right)$.
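As a sketch of how the Gaussian (RBF) kernel matrix and its centred (modified) version used in KPCA can be computed, the snippet below assumes a bandwidth parameter sigma chosen only for the example.

```python
import numpy as np

def rbf_kernel_matrix(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-np.clip(sq_dists, 0, None) / (2 * sigma**2))

def center_kernel_matrix(K: np.ndarray) -> np.ndarray:
    """Centred kernel: K' = K - 1n K - K 1n + 1n K 1n, with 1n = ones(n, n) / n."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

X = np.random.default_rng(1).normal(size=(50, 4))   # hypothetical sample
K = rbf_kernel_matrix(X, sigma=2.0)
Kc = center_kernel_matrix(K)
eigvals, eigvecs = np.linalg.eigh(Kc)               # eigen-decomposition of the centred kernel
```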
Main idea of the k-means clustering algorithm based on dimensionality reduction. Kernel principal component analysis (KPCA) is a commonly used method for reducing the dimensionality of high-dimensional data. However, because the computational complexity of KPCA does not grow with the dimension of the transformed feature space but depends on the dimension of the input space (and on the number of samples), the amount of computation increases greatly when dealing with large-scale, high-dimensional data. This paper therefore improves it as follows: first, based on the information entropy value of each attribute and a specified threshold, the attributes whose entropy value is greater than the threshold are retained in an initial screening, attributes with small contribution are removed, and the useful attribute information is extracted; then kernel principal component analysis is applied to the extracted attributes to reduce the dimensionality of the data; finally, the k-means algorithm is run on the dimensionality-reduced data, thereby improving computational efficiency.

Steps of the k-means clustering algorithm based on dimensionality reduction. Let the dataset to be clustered contain $n$ data objects, each described by $p$ attributes. According to the above discussion, the steps of the algorithm are as follows:

Step 1: Calculate the degree of fluctuation of the data in each attribute according to the information entropy formula.

Step 2: According to the specified threshold, retain the attributes whose entropy value is greater than the threshold and remove the attributes with small contribution.

Step 3: Reduce the dimensionality of the data using the kernel principal component analysis algorithm: a. arrange the retained attributes of the input data into a matrix; b. select the Gaussian radial basis kernel function according to the formula above and compute the kernel matrix $K$; c. compute the modified (centred) kernel matrix; d. compute its eigenvalues and eigenvectors and orthogonalise the eigenvectors to unit length; e. calculate the cumulative contribution rate and keep the leading components; f. calculate the projection of the data onto the retained components to obtain the reduced data.

Step 4: Execute the k-means algorithm on the dimensionality-reduced data.
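A minimal end-to-end sketch of the pipeline described above, using scikit-learn's KernelPCA and KMeans as stand-ins; the entropy threshold, the binning used to estimate attribute entropy, the kernel parameter and the synthetic data are assumptions for illustration, not values from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

def attribute_entropies(df: pd.DataFrame, bins: int = 10) -> pd.Series:
    """Estimate the information entropy of each attribute after discretising it into bins."""
    ent = {}
    for col in df.columns:
        p = pd.cut(df[col], bins=bins).value_counts(normalize=True).to_numpy()
        p = p[p > 0]
        ent[col] = float(-(p * np.log2(p)).sum())
    return pd.Series(ent)

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 12)),
                  columns=[f"attr_{i}" for i in range(12)])   # hypothetical dataset

# Steps 1-2: keep attributes whose entropy exceeds a specified (assumed) threshold.
threshold = 2.0
ent = attribute_entropies(df)
kept = ent[ent > threshold].index
X = df[kept].to_numpy()

# Step 3: kernel PCA with a Gaussian (RBF) kernel.
X_reduced = KernelPCA(n_components=3, kernel="rbf", gamma=0.1).fit_transform(X)

# Step 4: k-means on the dimensionality-reduced data.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:10])
```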
Because the selection of the initial clustering centres is random and sensitive, the traditional k-means clustering algorithm is prone to falling into local optima and producing unstable clustering results, and it consumes a large amount of time and computation when processing large-scale, high-dimensional data. In this paper, we first use simple random sampling to simplify the dataset, obtaining a small sample whose distribution is basically the same as that of the original dataset; according to the tightness of the spatial distribution of the sampled points, minimum-variance optimisation is then used to select the initial clustering centres for the k-means algorithm, which reduces the adverse effect of uncertain factors such as anomalous points on the initial centres and strengthens the clustering results. Secondly, to overcome the influence that the different attributes of the sample data have on the clustering results during the clustering calculation, the entropy method is introduced to calculate attribute weights and improve clustering accuracy. On this basis, a weighted k-means algorithm based on optimised initial clustering centres is proposed, and numerical experiments are analysed and compared to verify the feasibility and validity of the proposed algorithm.
Data sampling preprocessing. This paper uses the simple random sampling method, because the overall distribution of the samples obtained by sampling is similar to that of the original data, so the resulting sample can represent the distribution of the original data well. Since the number of anomalous points and outliers in the data is very small, the probability that such points are drawn is very low, which effectively avoids selecting outliers as initial clustering centres and greatly reduces their impact on the clustering results. Simple random sampling draws a fixed number of objects from the original dataset, each object being drawn with equal probability.

Basic concepts. Let the sampled dataset be $S=\{x_1,x_2,\ldots,x_m\}$. Define the distance between two samples $x_i$ and $x_j$ as the Euclidean distance $d(x_i,x_j)=\|x_i-x_j\|$.

Data distribution and sample data distribution
Denote by $\bar{d}$ the average distance between all pairs of samples in $S$. Define the variance of a sample as the variance of its distances to all the other samples; a small variance indicates that the samples around it are tightly distributed.

Steps of the algorithm for optimising the initial clustering centres. Based on the previous description, the specific steps of the algorithm are as follows:

Step 1: Calculate the distance $d(x_i,x_j)$ between every pair of samples in the sampled dataset.

Step 2: Calculate the average distance $\bar{d}$ between all samples.

Step 3: Calculate the variance of each candidate sample.

Step 4: Find the sample with the smallest variance in the candidate set, take it as an initial clustering centre, and remove it from the candidate set.

Step 5: Repeat steps 3 and 4 until $k$ initial clustering centres are obtained.
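The fragmentary steps above leave the exact formulas unspecified; the sketch below is one plausible reading, in which the "variance" of a sampled point is taken to be the variance of its distances to the other sampled points, and the points with the smallest variance become the initial centres. Both that interpretation and the toy data are assumptions, not the paper's definitive procedure.

```python
import numpy as np

def min_variance_initial_centers(sample: np.ndarray, k: int) -> np.ndarray:
    """Pick k initial centres from a random sample: one plausible reading of the
    minimum-variance rule, where a point's 'variance' is the variance of its
    distances to all other sampled points (small variance = tight neighbourhood)."""
    dists = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=-1)
    variances = dists.var(axis=1)            # per-point variance of distances
    order = np.argsort(variances)            # smallest variance first
    return sample[order[:k]]

rng = np.random.default_rng(2)
sample = rng.normal(size=(100, 2))           # simple random sample of the dataset
centers = min_variance_initial_centers(sample, k=3)
print(centers)
```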
In this paper, we propose a weighted Euclidean distance that assigns a weight to each attribute involved in clustering according to its importance, thereby reducing the impact of factors such as outliers on the clustering results [30].
Define the weight vector $w=(w_1,w_2,\ldots,w_p)$, where $w_t\ge 0$ is the weight of the $t$th attribute and $\sum_{t=1}^{p}w_t=1$. With the weights introduced, the weighted Euclidean distance between a sample $x_i$ and a cluster centre $c_j$ is
$$d_w(x_i,c_j)=\sqrt{\sum_{t=1}^{p}w_t\left(x_{it}-c_{jt}\right)^{2}}.$$
Therefore the improved objective function is
$$J=\sum_{j=1}^{k}\sum_{x_i\in C_j}\sum_{t=1}^{p}w_t\left(x_{it}-c_{jt}\right)^{2},$$
where $C_j$ denotes the $j$th cluster and $c_{jt}$ the $t$th attribute value of its centre.

Entropy value method for calculating attribute weights. Next, the entropy value method is used to calculate the weight of each attribute according to its contribution.

Data normalisation. Construct the attribute value matrix $X=(x_{ij})_{n\times p}$ of the clustering dataset to be tested and apply Min-Max normalisation:
$$r_{ij}=\frac{x_{ij}-\min_{1\le i\le n}x_{ij}}{\max_{1\le i\le n}x_{ij}-\min_{1\le i\le n}x_{ij}},$$
where $\min_i x_{ij}$ and $\max_i x_{ij}$ are the minimum and maximum values of the $j$th attribute. The proportion of the $i$th object under the $j$th attribute is
$$p_{ij}=\frac{r_{ij}}{\sum_{i=1}^{n}r_{ij}}.$$

Calculate the entropy value of the attribute. The entropy value of the attribute in dimension $j$ is
$$e_j=-\frac{1}{\ln n}\sum_{i=1}^{n}p_{ij}\ln p_{ij},$$
where $0\le e_j\le 1$.

Calculate the coefficient of variation of the attribute. The coefficient of variation of the attribute in dimension $j$ is
$$g_j=1-e_j,$$
where a larger $g_j$ indicates that the attribute contributes more to distinguishing the data objects.

Calculate the weights of attributes. The weight of the attribute in dimension $j$ is
$$w_j=\frac{g_j}{\sum_{j=1}^{p}g_j}.$$
The Euclidean distance with these attribute weights is then used throughout the clustering process.

Basic steps of the weighted k-means algorithm based on optimised initial clustering centres. In summary, the steps of the weighted k-means clustering algorithm based on optimised initial centres are as follows:

Step 1: Apply simple random sampling to the dataset to be clustered to obtain a sample set.

Step 2: Apply the minimum-variance initial clustering centre algorithm to the sampled data to obtain $k$ initial clustering centres.

Step 3: Calculate the weight of each attribute of the data objects according to the entropy value formulas.

Step 4: Assign each data sample to the cluster of its closest centre according to the weighted distance formula.

Step 5: Recalculate the centre of each cluster as the mean of the samples assigned to it.

Step 6: Check whether the algorithm stops: if the objective function value is minimised or remains unchanged, the iteration ends; otherwise, return to step 4.
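A compact sketch of the entropy weight method and the weighted distance it feeds into; the small toy matrix is an assumption, and the tiny constant added before taking proportions (to avoid the logarithm of zero) is a common practical adjustment rather than part of the formulas above.

```python
import numpy as np

def entropy_weights(X: np.ndarray) -> np.ndarray:
    """Entropy weight method: Min-Max normalise, compute per-attribute entropy e_j,
    coefficient of variation g_j = 1 - e_j, and weights w_j = g_j / sum(g)."""
    n, p = X.shape
    r = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    prop = (r + 1e-12) / (r + 1e-12).sum(axis=0)        # proportions p_ij
    e = -(prop * np.log(prop)).sum(axis=0) / np.log(n)  # entropy of each attribute
    g = 1.0 - e                                         # coefficient of variation
    return g / g.sum()

def weighted_distance(x: np.ndarray, c: np.ndarray, w: np.ndarray) -> float:
    """Weighted Euclidean distance between a sample and a cluster centre."""
    return float(np.sqrt((w * (x - c) ** 2).sum()))

rng = np.random.default_rng(3)
X = rng.uniform(size=(200, 4))                # hypothetical data
w = entropy_weights(X)
centres = X[rng.choice(len(X), size=3, replace=False)]
# Assign each sample to the closest centre under the weighted distance.
labels = np.array([np.argmin([weighted_distance(x, c, w) for c in centres]) for x in X])
print(w, np.bincount(labels))
```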
Based on seven datasets of quite different sizes (from 82MB to 2560MB), the algorithm proposed in this paper, the Par2PK-Means algorithm and the ParCLARA algorithm are executed in a parallel environment (a Hadoop cluster consisting of 8 nodes), while the K-means algorithm is run in a stand-alone environment, in order to evaluate whether the proposed algorithm can handle larger datasets with greater efficiency. The comparison of execution times on the seven datasets is shown in Table 1. From the table, it can be seen that the serial K-means algorithm in a centralised single-machine environment cannot handle datasets larger than 1280MB; otherwise it suffers from "memory overflow". In a distributed environment (i.e., a "big data" environment), however, the parallel algorithm in this paper can effectively handle large-scale datasets exceeding 1280MB. In particular, this algorithm is more efficient than the ParCLARA and Par2PK-Means algorithms, because the clustering initialisation strategy of K-means is optimised in this paper's algorithm. Meanwhile, the execution time of the K-means algorithm is shorter than that of this paper's algorithm on small datasets, because communication and interaction between the nodes of the distributed parallel environment take up a considerable part of the total execution time, making the overall running time longer than the computation time of the actual task. More importantly, as the size of the dataset continues to grow, the efficiency advantage of this paper's algorithm becomes increasingly pronounced.
Table 1 Execution time comparison on seven data sets (execution time in seconds)

| Data set | Size (MB) | K-means | ParCLARA | Par2PK-Means | Ours |
|---|---|---|---|---|---|
| Taxi trajectory | 80 | 78 | 658 | 432 | 310 |
| | 150 | 100 | 755 | 480 | 375 |
| | 310 | 190 | 892 | 578 | 400 |
| | 650 | 1120 | 1175 | 830 | 615 |
| | 1280 | 2365 | 1665 | 1350 | 1025 |
| | 2560 | - | 2735 | 2412 | 1875 |
| Iris | 80 | 30 | 610 | 300 | 210 |
| | 150 | 55 | 690 | 375 | 275 |
| | 310 | 120 | 815 | 450 | 300 |
| | 650 | 680 | 1035 | 640 | 405 |
| | 1280 | 1750 | 1200 | 1001 | 765 |
| | 2560 | - | 2455 | 2015 | 1420 |
| Haberman’s survival | 80 | 50 | 576 | 312 | 295 |
| | 150 | 65 | 677 | 400 | 310 |
| | 310 | 125 | 834 | 480 | 375 |
| | 650 | 715 | 981 | 675 | 425 |
| | 1280 | 2020 | 1325 | 1250 | 975 |
| | 2560 | - | 2445 | 2200 | 1725 |
| Ecoli | 80 | 35 | 620 | 350 | 281 |
| | 150 | 60 | 715 | 400 | 326 |
| | 310 | 125 | 842 | 455 | 370 |
| | 650 | 755 | 1123 | 670 | 485 |
| | 1280 | 1925 | 1635 | 1230 | 765 |
| | 2560 | - | 2578 | 2210 | 1420 |
| Hayes-roth | 80 | 40 | 567 | 312 | 275 |
| | 150 | 55 | 640 | 390 | 340 |
| | 310 | 120 | 725 | 470 | 385 |
| | 650 | 725 | 962 | 650 | 435 |
| | 1280 | 1985 | 1475 | 1200 | 735 |
| | 2560 | - | 2430 | 2100 | 1577 |
| Lenses | 80 | 30 | 620 | 300 | 295 |
| | 150 | 62 | 705 | 385 | 325 |
| | 310 | 135 | 925 | 450 | 375 |
| | 650 | 720 | 1052 | 540 | 430 |
| | 1280 | 1890 | 1345 | 1080 | 815 |
| | 2560 | - | 2375 | 2085 | 1475 |
| Wine | 80 | 55 | 625 | 360 | 310 |
| | 150 | 76 | 700 | 410 | 355 |
| | 310 | 125 | 834 | 460 | 400 |
| | 650 | 732 | 1000 | 600 | 425 |
| | 1280 | 2200 | 1345 | 1240 | 845 |
| | 2560 | - | 2564 | 2250 | 1625 |
To verify the scalability of the algorithm proposed in this paper, we tested whether it can handle larger datasets when more nodes participate in the computation. For this purpose, in the scalability experiment, the algorithm was compared on different datasets while increasing the dataset size (from 150MB to 1280MB) and the number of nodes (from 1 node to 8 nodes) in the same proportion, as shown in Fig. 2(a). Meanwhile, the scalability of this paper's algorithm is further verified by comparing it with the ParCLARA and Par2PK-Means algorithms on the Iris dataset (from 150MB to 1280MB); the comparison is shown in Fig. 2(b). From Fig. 2(a), it can be observed that as the number of nodes increases in the same proportion as the dataset size, the scalability values of this paper's algorithm are less than 1 but all higher than 0.62. In addition, Fig. 2(b) shows that this paper's algorithm achieves better scalability than the Par2PK-Means and ParCLARA algorithms (the scalability values of this paper's algorithm, Par2PK-Means and ParCLARA are higher than 0.66, 0.59 and 0.53, respectively). These results show that the parallel algorithm proposed in this paper has excellent scalability and strong adaptability when processing large-scale datasets on the Hadoop platform with the MapReduce framework.

Expansionary assessment
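The scalability (scaleup) values quoted above are commonly computed as the ratio of the single-node running time on the base dataset to the running time when both the data size and the number of nodes are multiplied by the same factor; the paper does not spell this out, so the standard definition is stated here as an assumption:
$$\mathrm{Scaleup}(m)=\frac{T_{1}(D)}{T_{m}(m\cdot D)},$$
where $T_{1}(D)$ is the execution time on one node with dataset $D$ and $T_{m}(m\cdot D)$ is the execution time on $m$ nodes with an $m$-times larger dataset; values closer to 1 indicate better scalability.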
Before executing the algorithm, two important parameters need to be set in advance; the sensitivity of the algorithm to these parameters is analysed below (Fig. 3).

Parametric sensitivity analysis
This section applies the improved algorithm of this paper to an actual big data analysis task: the GPS location information of mobile users in a city is analysed, mainly in terms of time and location, to determine which areas the citizens of this city like to visit during certain periods and which areas they visit frequently. The results can support better planning for the development of the city and the improvement and implementation of traffic and safety measures in commercial areas, providing a better living environment for the citizens.
The available data left after screening mainly contains four columns: ID number, time of data creation (upload), and latitude and longitude. The data structure is shown in Table 2.
Table 2 Data structure
Name | Type | Remark |
---|---|---|
Id_Num | String | Id Number |
Create_Time | Date | Create (Upload) Time |
Lng | Double | Longitude |
Lat | Double | Latitude |
Because some citizens set their phone's GPS function not to connect to the Internet, accurate data cannot be uploaded, and it is also possible that the system automatically uploads default GPS coordinates; as a result, a lot of the location information is positioned in other regions of the country or abroad. Points like these, which are not within the area of this city, need to be cleaned out in advance, both to avoid unnecessary waste of computing resources and to improve the efficiency of the analysis.
The original data contains too many fields; since this paper does not need all of them, only the fields listed in the table are kept, and more fields can be extracted if the data is analysed further in the future. The data is mainly stored on HDFS, with the four columns in each row separated by commas.
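A small sketch of how the four comma-separated columns could be loaded and pre-cleaned once pulled from HDFS; the file name and the bounding box of city A are placeholders for illustration, not values from the paper.

```python
import pandas as pd

# Hypothetical local path and bounding box; real values depend on the HDFS layout and city A.
COLUMNS = ["Id_Num", "Create_Time", "Lng", "Lat"]
CITY_BOUNDS = {"lng_min": 113.0, "lng_max": 114.5, "lat_min": 22.5, "lat_max": 23.5}

df = pd.read_csv("gps_points.csv", names=COLUMNS,
                 dtype={"Id_Num": str}, parse_dates=["Create_Time"])

# Drop points positioned outside the city (e.g. default or mis-reported GPS data).
in_city = (df["Lng"].between(CITY_BOUNDS["lng_min"], CITY_BOUNDS["lng_max"]) &
           df["Lat"].between(CITY_BOUNDS["lat_min"], CITY_BOUNDS["lat_max"]))
df = df[in_city].dropna()
```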
This section carries out a series of calculations based on the data described in the previous subsection and the data analysis programme, and presents the results in the form of charts and graphs so that they can serve as a reference for urban planning. The analysis focuses on the density of population activity in the urban area over a long period of time and on special festival days.
The geographical locations corresponding to the centre points of the largest clusters produced by the improved clustering algorithm of this paper were identified, and the top six areas of population gathering density in city A are shown in Fig. 4. The proportions in the figure are based on statistics over more than 20 million GPS records. The top six areas were chosen because the number of people in them is large and relatively concentrated. As can be seen from the figure, the highest proportion belongs to area A, which accounts for thirty-one per cent of the six areas, showing that the amount of user activity in this area is very large. The recommendation to the municipal government is to treat this area as a priority in traffic contingency planning and as a key target for fire and disaster prevention. Such figures are the result of clustering statistics over a large amount of data and cannot reflect the real situation with complete accuracy, but as statistical conclusions they have relative reference value, and they can be combined with other indicators to help decision-makers make reasonable decisions.

A region of the first six of the population of a city
Since the amount of user activity in area A is relatively large, the 24-hour data for area A is analysed further. The day is divided into 24 parts by time, and the records are aggregated by hour: only the two characters representing the hour are taken from the "creation time" field and the other characters are filtered out, so that the data contains only the 24 values 00-23; the records are then grouped by these 24 values and the number in each group is counted. The twenty-four-hour population change in area A is shown in Figure 5. From the figure, it can be seen that the number of people is higher in two time periods every day, from ten to eleven and from thirteen to seventeen. Based on such statistics, it can be known at which times the number of people in area A is higher or lower, which is also a useful reference for safe travel in the city.

The change of people’s mouth in a region
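The hour-of-day statistics described above amount to a simple group-by on the hour extracted from the creation time; the sketch below uses a tiny toy DataFrame standing in for the cleaned GPS data from the earlier sketch.

```python
import pandas as pd

# Toy stand-in for the cleaned GPS records of area A.
df = pd.DataFrame({"Create_Time": pd.to_datetime(
    ["2023-08-22 10:15:00", "2023-08-22 10:40:00", "2023-08-22 14:05:00"])})

# Keep only the two-character hour (00-23) and count records per hour.
df["hour"] = df["Create_Time"].dt.strftime("%H")
hourly_counts = df.groupby("hour").size().sort_index()
print(hourly_counts)   # number of GPS points in area A for each hour of the day
```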
The user flow in area A on a special date was then analysed. The twenty-four-hour change in the number of people in area A on the 2023 Qixi Festival is shown in Figure 6. As can be seen from the figure, from zero to six o'clock there is basically no user flow, since users are still asleep, which is consistent with real life. From seven o'clock activity begins to increase slowly, until the number of people reaches its morning peak at eleven. Around noon the number of people declines for a while, probably because users who have finished their morning activities feel somewhat tired and choose to go home or elsewhere. After that, the number of users rises again; the afternoon is the most active period, until around seventeen o'clock, when user activity gradually declines. This chart fully demonstrates the activity pattern of the crowd in area A; the municipal government, the public security bureau and the fire department can use such data to arrange personnel reasonably, organise evacuations and guard against security hazards.

The number of people in a region of 2023
To show the distribution of the cluster centroids on the map, only the centroids of the first ten of the fifty clusters were selected and placed on the map; the resulting cluster centres are displayed as shown in Figure 7.

The cluster center is shown on the map
In the era of "Internet+" and "big data", big data has attracted great attention from the scientific and technological community, industry and government departments. This paper examines efficient clustering algorithms for big data applications, and the conclusions are as follows:
In the parameter sensitivity analysis, as the set minimum number of trajectory clusters increases, the running time of the algorithm designed in this paper decreases slightly, and its running time is much smaller than that of the comparison CLUR algorithm, which indicates that the algorithm in this paper is less sensitive to this parameter than the CLUR algorithm.
In the cluster analysis of human flow in the city, it was found that the two time periods of ten to eleven and thirteen to seventeen have the most human flow in area A.