Research on Computational Load Balancing for Massively Parallel Tasks Based on Adaptive Iterative Algorithm
Published Online: Sep 29, 2025
Received: Feb 06, 2025
Accepted: May 10, 2025
DOI: https://doi.org/10.2478/amns-2025-1104
© 2025 Wenwen Huang, Xiaofang Hu and Fengfei Yang, published by Sciendo.
This work is licensed under the Creative Commons Attribution 4.0 International License.
With the development of computing technology, computers have been applied to an increasingly wide range of fields. Beyond the everyday automation applications we commonly see, there are also very complex domains, such as large-scale scientific computing and massive data processing, which place far higher demands on processing power than a commodity PC can meet [1-3]. People must therefore build ever more powerful computer systems to solve these complex problems. Applications, however, are endless: once one complex problem is solved, still more complex problems emerge, placing even higher requirements on computers [4-6].
Throughout the history of computing, people have constantly pursued higher computer performance. Over the past forty to fifty years, a large number of hardware design techniques have appeared to improve computer speed: increasing CPU frequency, using multi-level caching and instruction prediction, improving memory and disk access speed, and refining fabrication processes to increase integration and reduce power consumption. However, because of the physical characteristics of the hardware itself, such improvements are not only costly but are also bound to reach their limits [7-10]. Meanwhile, with the development of Ethernet, a large amount of idle computing resources exists on the network. A more feasible approach is therefore to use load balancing techniques to reschedule the tasks already assigned to each node: through process migration (also known as task migration), the load of each node is made roughly equal, so that the processors of all nodes can execute tasks continuously. This avoids the coexistence of idle nodes and busy waiting, effectively improves resource utilization, decreases the average response time of tasks, and realizes the goal of minimizing execution time [11-14]. Its greatest significance lies in fully utilizing the idle CPUs already present in the network, providing computing power far exceeding that of an equivalent serial computer at a price lower than that of an equivalent minicomputer. Run properly, very good results can be obtained [15-17].
Writing parallel programs in a cluster environment is different from programming on a single machine, and the programmer has to deal with a great deal of application-independent drudgery. To make parallel programs run efficiently on clusters, a very critical issue is to ensure that all computers get the right amount of load, avoiding the situation where some computers are overloaded while others are idle [18-20]. In professional terms, this means that load balancing is required. The impact of load balancing on the overall performance of the cluster is very obvious, and exploring reasonable load balancing strategies is very important for improving cluster efficiency and shortening parallel computing time [21-23].
In this paper, the structural model and key techniques of task scheduling in a cloud computing environment are first described, after which a mathematical model of task scheduling is built in terms of tasks and resource nodes, resource load, and task scheduling time span. An improved adaptive iterative algorithm (AGAPD) is then obtained by designing the fitness function and the selection, crossover, and mutation operators. The convergence of the AGAPD algorithm is analyzed by modeling the equilibrium state of a massively parallel task as a Markov chain, and the process of solving the parallel test task scheduling problem with the AGAPD algorithm is demonstrated. Finally, the solution time and CPU occupancy of the two traditional algorithms and this paper's algorithm are tested on small- and large-scale use cases, and the load balancing performance of the algorithm is analyzed on the large-scale use cases.
The cloud computing environment based on virtualization technology is transforming the cluster of physical nodes into a cluster of resource nodes, and then the attribute information of the resource nodes is handed over to the data center agent. The users submit their respective tasks to the data center agent through the user interface and the data center agent is responsible for the specific scheduling process of the tasks. Task scheduling in cloud computing environment is to schedule different tasks to the appropriate resource nodes for execution and evaluate the advantages and disadvantages of the scheduling scheme with the overall execution benefits.
Tasks in a cloud computing environment are generally categorized into tasks that depend on each other and tasks that are mutually independent. Measurement indexes of resource capacity on resource nodes generally include computing capacity, storage capacity, transmission capacity, and so on. To ensure that the overall execution time of the system is minimized and the load of the cloud computing system is balanced, a resource node should be described by at least its computing capacity and its load rate. To simplify the problem, this paper designs the task scheduling model from these two aspects and assumes that the tasks to be scheduled are independent meta-tasks, i.e., each task is an indivisible atomic task.
Representation model of tasks and resource nodes. Consider n mutually independent meta-tasks to be scheduled for execution on m resource nodes; a scheduling scheme assigns each task to exactly one node.

Resource load balancing model. The load rate of each resource node is determined by the tasks assigned to it, from which the overall load distribution of the system can be deduced. A fitness function is defined for the resource load rate, so that a more evenly balanced load yields a larger fitness value.

Task scheduling time span model. Assuming that the execution time of each task on each resource node is known, the time span of a scheduling scheme is the completion time of the node that finishes last, and a corresponding fitness function is defined. This definition indicates that the shorter the task execution time, the greater the fitness value.
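To make the two models concrete, the following sketch evaluates a scheduling scheme by its makespan and load balance. The array names, the standard-deviation balance measure, and the reciprocal fitness forms are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

def evaluate(schedule, exec_time):
    """Evaluate a scheduling scheme for n independent meta-tasks on m nodes.

    schedule  : length-n sequence, schedule[i] = node that task i runs on
    exec_time : n x m matrix, exec_time[i][j] = run time of task i on node j
    (Both names are illustrative stand-ins for the symbols in the model.)
    """
    m = exec_time.shape[1]
    # Completion time of each node = sum of its assigned tasks' run times.
    node_time = np.zeros(m)
    for task, node in enumerate(schedule):
        node_time[node] += exec_time[task, node]
    makespan = node_time.max()                        # time span of the scheme
    load_rate = node_time / makespan                  # per-node load rate
    balance_fitness = 1.0 / (1.0 + load_rate.std())   # more balanced -> larger
    time_fitness = 1.0 / makespan                     # shorter span -> larger
    return makespan, balance_fitness, time_fitness
```

A scheme that spreads the load evenly therefore scores higher on both fitness values at once.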
Coding and population initialization. Individuals are generated by randomly ordering the genes, and the population is initialized by generating individuals repeatedly until the population size is reached. The random ordering makes the individuals as random as possible, so that the initial population is uniformly distributed over the whole solution space.

Design of fitness function. The fitness function [24] varies with the optimization direction of the specific problem. The goal of the parallel test task scheduling research in this paper is to determine the scheduling plan with the shortest total test time, so the total test time is used as the criterion for evaluating the strengths and weaknesses of individuals.

Selection operator. The elite retention strategy of the selection operation copies a portion of the best individuals directly into the offspring, solving the problem that, with the roulette algorithm alone, excellent individuals may fail to be selected or may be destroyed by the crossover and mutation operations. The selection probability of each individual, obtained with the roulette algorithm, is proportional to its fitness.
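The selection step described above can be sketched as follows; the `elite_frac` parameter and helper names are illustrative choices, not taken from the paper.

```python
import random

def select(population, fitness, elite_frac=0.1):
    """Roulette-wheel selection combined with an elite-retention strategy."""
    size = len(population)
    n_elite = max(1, int(size * elite_frac))
    # Elite retention: copy the best individuals straight into the offspring.
    ranked = sorted(range(size), key=lambda i: fitness[i], reverse=True)
    offspring = [population[i][:] for i in ranked[:n_elite]]
    # Roulette wheel: selection probability proportional to fitness.
    total = sum(fitness)
    probs = [f / total for f in fitness]
    while len(offspring) < size:
        r, acc = random.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                offspring.append(population[i][:])
                break
        else:
            offspring.append(population[-1][:])  # guard against rounding
    return offspring
```

The elite copies bypass the wheel entirely, so the best solutions found so far can never be lost to sampling noise.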
Crossover operator. The crossover operation produces offspring by exchanging part of the gene segments of two parents, so that each parent passes part of its characteristics to the offspring, increasing the diversity of the population and improving global search ability. A two-point crossover method is adopted, with the following steps:
Step 1: Randomly select the two parents to be crossed and the crossover gene fragments.
Step 2: Exchange the gene fragments, using "X" to occupy the gene positions that do not need to be exchanged.
Step 3: To prevent coding conflicts, delete from each parent the genes duplicated by the incoming fragment.

Mutation operator. The mutation operation produces offspring by changing some genes in the parent, increasing the diversity of the population and avoiding local optima to some extent. Reverse-sequence mutation is used, with the following steps:
Step 1: Randomly select the parent and the gene fragment to be mutated.
Step 2: Generate the offspring by reversing the order of the gene fragment.

Adaptive crossover and mutation probability. In this paper, population diversity is expressed by the degree of dissimilarity, computed from the differences among the individuals of the current population, with the dissimilarity of the initial population serving as the reference value.

Individuals are categorized into inferior and superior individuals based on the average fitness value of the current population. Inferior individuals, i.e., those with fitness values below the average, are assigned crossover and mutation probabilities at their maximum values.
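A permutation-coded version of the two operators described above might look like the following sketch; the fragment-reinsertion repair replaces the "X" placeholder bookkeeping but achieves the same conflict-free result.

```python
import random

def two_point_crossover(p1, p2):
    """Two-point crossover for permutation-coded individuals, with the
    conflict-repair step from the text: genes duplicated by the incoming
    fragment are deleted, then the fragment is inserted at the cut point."""
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))

    def make_child(host, frag):
        # Remove genes duplicated by the incoming fragment, keeping order.
        rest = [g for g in host if g not in frag]
        # Genes before the slot, the exchanged fragment, then the remainder.
        return rest[:a] + list(frag) + rest[a:]

    return make_child(p1, p2[a:b]), make_child(p2, p1[a:b])

def reverse_mutation(ind):
    """Reverse-sequence mutation: invert a randomly chosen gene fragment."""
    n = len(ind)
    a, b = sorted(random.sample(range(n), 2))
    return ind[:a] + ind[a:b][::-1] + ind[b:]
```

Because both operators only rearrange genes, every offspring remains a valid permutation, i.e., a legal scheduling order.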
According to equations (15) and (16), the curves of the self-adaptive crossover [25-26] and mutation probabilities are plotted in Figures 1 and 2. For individuals whose fitness value is smaller than the average fitness value, the crossover and mutation probabilities take their maximum values; for individuals whose fitness value is larger than the average, the probabilities are determined by their fitness value and by the ratio of the current population dissimilarity to the initial dissimilarity.

Adaptive crossover probability

Adaptive mutation probability
During the iteration process, when the diversity of the population is high, the population dissimilarity is large and the crossover and mutation probabilities are kept relatively low, so that good individuals are preserved. When the diversity of the population is low and the population dissimilarity is small, the crossover and mutation probabilities are increased, helping the population escape local optima. The initial population is randomly distributed throughout the solution space, which is when the population is most diverse and the dissimilarity is largest.
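The adaptive rule can be expressed as a small function. Since equations (15) and (16) are not reproduced here, the linear form, the `p_max`/`p_min` bounds, and the parameter names below are illustrative assumptions that merely follow the behavior described in the text.

```python
def adaptive_probability(fitness, avg_fitness, max_fitness,
                         dissimilarity, init_dissimilarity,
                         p_max, p_min):
    """Adaptive crossover/mutation probability (illustrative form).

    Inferior individuals (fitness <= population average) get the maximum
    probability. Superior individuals get a probability scaled both by how
    close they are to the best fitness and by the ratio of the current
    dissimilarity to the initial dissimilarity, so the probability rises
    again as population diversity drops.
    """
    if fitness <= avg_fitness:
        return p_max
    diversity_ratio = dissimilarity / init_dissimilarity   # in (0, 1]
    span = (max_fitness - avg_fitness) or 1e-12            # avoid div by 0
    scale = (max_fitness - fitness) / span                 # better -> smaller
    return p_min + (p_max - p_min) * scale * diversity_ratio
```

This reproduces the two regimes shown in Figures 1 and 2: a constant plateau below the average fitness, and a decreasing curve above it whose height depends on the remaining diversity.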
The flow of solving the parallel test task scheduling problem based on AGAPD algorithm is shown in Fig. 3. The basic steps are as follows:

Flow chart of improved genetic algorithm
Step 1: Initialize the algorithm parameters. Obtain the test task information.
Step 2: Use integer coding to encode the test tasks and generate individuals in a randomized order.
Step 3: Calculate the dissimilarity of the initial population based on the population obtained from initialization.
Step 4: Calculate the fitness value of each individual.
Step 5: Determine whether the termination condition is satisfied; here the termination condition is reaching the maximum number of iterations. If it is satisfied, execute Step 9; otherwise execute Step 6.
Step 6: Use a roulette algorithm with an elite retention strategy to perform a selection operation.
Step 7: First, calculate the dissimilarity degree of the current population; then, calculate the current crossover probability of the population; finally, carry out the crossover operation.
Step 8: First, the dissimilarity degree of the current population is calculated; then, the current mutation probability of the population is calculated; finally, the mutation operation is performed and step 4 is executed.
Step 9: According to the individual fitness value, the individual with the largest fitness value is output as the result of task scheduling.
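Steps 1-9 can be tied together in a compact end-to-end sketch. The decoding rule (deal each task to the currently least-loaded node in permutation order), the set-based diversity proxy, and all constants below are simplifying assumptions, not the paper's exact design.

```python
import random

def agapd_sketch(exec_time, n_nodes=2, pop_size=30, max_iter=100,
                 p_c=0.9, p_m=0.1):
    """Permutation-coded GA loop following the Step 1-9 flow (sketch)."""
    n = len(exec_time)

    def fitness(ind):
        load = [0.0] * n_nodes
        for t in ind:                          # list-scheduling decode
            load[load.index(min(load))] += exec_time[t]
        return 1.0 / max(load)                 # shorter makespan -> larger

    def dissimilarity(pop):                    # crude diversity proxy
        return len({tuple(p) for p in pop}) / len(pop)

    # Steps 1-3: parameters, integer coding, randomly ordered individuals.
    pop = [random.sample(range(n), n) for _ in range(pop_size)]
    d0 = dissimilarity(pop) or 1.0

    for _ in range(max_iter):                  # Step 5: iteration limit
        fit = [fitness(p) for p in pop]        # Step 4: fitness values
        avg = sum(fit) / len(fit)
        elite = pop[fit.index(max(fit))][:]    # Step 6: elite retention
        total = sum(fit)

        def pick():                            # roulette-wheel selection
            r, acc = random.uniform(0, total), 0.0
            for p, f in zip(pop, fit):
                acc += f
                if r <= acc:
                    return p[:]
            return pop[-1][:]

        d = dissimilarity(pop) / d0            # Steps 7-8: adaptive rates
        nxt = [elite]
        while len(nxt) < pop_size:
            child = pick()
            if random.random() < (p_c if fitness(child) <= avg else p_c * d):
                i, j = sorted(random.sample(range(n), 2))
                frag = pick()[i:j]             # fragment-based crossover
                rest = [g for g in child if g not in frag]
                child = rest[:i] + frag + rest[i:]
            if random.random() < (p_m if fitness(child) <= avg else p_m * d):
                i, j = sorted(random.sample(range(n), 2))
                child = child[:i] + child[i:j][::-1] + child[j:]
            nxt.append(child)
        pop = nxt

    best = max(pop, key=fitness)               # Step 9: best individual
    return best, 1.0 / fitness(best)
```

Running the sketch on a handful of task run times returns the best task order found and the makespan of its decoded schedule.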
In this paper, the performance of the algorithms is compared through small-scale and large-scale use cases, respectively. The small-scale case is used to test the algorithm’s ability to handle general test task scheduling, while the large-scale case is used to explore the algorithm’s extreme performance. The test task completion time and CPU utilization rate are used as the evaluation indexes of the algorithms.
The small-scale test case 1 is shown in Table 1; its scale is 7 test tasks and 7 resources. The case is solved by the integer programming exact algorithm (Method 1), the genetic algorithm (Method 2), and this paper's AGAPD algorithm (Method 3). All three algorithms find the optimal solution of the case, and the differences in their solving efficiency are small.
Small-scale test case 1
Test number | R1 | R2 | R3 | R4 | R5 | R6 | R7 | Ri |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 6 |
2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 15 |
3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 5 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4 |
5 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 22 |
6 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 7 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 43 |
8 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 14 |
The solution results of the 3 algorithms are shown in Fig. 4.

The solution of the three algorithms
The solution time and CPU occupancy of the three algorithms are shown in Fig. 5. The solving times of Method 1, Method 2, and Method 3 are 1.67 s, 0.85 s, and 0.56 s respectively, and their CPU occupancies are 55%, 68%, and 40% respectively. Method 3 has both the lowest solving time and the lowest CPU occupancy and thus performs best.

The solution time and CPU occupancy rate of the three algorithms
Large-scale example: the number of test tasks is 150 and the number of resources is 15. The results of the large-scale case test are shown in Fig. 6. When the test scale is enlarged to this degree, the integer programming exact algorithm and the genetic algorithm can barely accomplish the task: their solution times reach 5944 s and 3782 s respectively, and their CPU occupancy rates are as high as 93.5% and 84.6% respectively.

Test results for large-scale use cases
In contrast, the AGAPD algorithm proposed in this paper searches out the optimal solution in only 357 seconds, with a CPU occupancy rate of only 49.3%. Thus, compared with the traditional algorithms, the AGAPD algorithm has a great advantage on large-scale testing tasks.
Figure 7 shows the input task statistics of this experiment: the system randomly generates 2000 tasks with loads between 150 and 650 as test input. The dataset generated in the simulated environment is used uniformly throughout this paper. Although this does not guarantee that the AGAPD algorithm performs identically in a real environment, it ensures comparability when only computational performance is considered, and lays the foundation for more in-depth research in the future.

Load statistics for tasks
The experiments in this subsection examine the convergence of the algorithm, i.e., whether the load balancing metric shows a decreasing trend as the number of iterations grows, and compare the convergence speeds at different diffusion radii. The convergence of the degree of load imbalance is shown in Figure 8. The stability of the algorithm still needs improvement, especially near the end of a run: many processes are already idle, the processes that have not yet finished have few remaining tasks, and although tasks can still be relocated to some extent, this can no longer change the load imbalance at that point.

Convergence of the degree of load imbalance
In the early stage, the algorithm declines steadily and then tends toward a stable state; the difference is that the larger the diffusion radius (e.g., a radius of 5), the sooner the system stabilizes. Because the processes start in a state of information scarcity, in the first few iterations the information diffusion radius determines how quickly the system can make load balancing decisions.

By the middle stage, most processes in the system have obtained information about the other processes; the only difference is whether this information is up to date, which does not affect the execution of the optimization algorithm, since its design is adapted to asynchronous updates. The system is in a steady state at this point.

In the later stage, little further balancing is needed on top of the existing load distribution; each process simply completes its remaining tasks. The evaluation index therefore tends to increase, but since migration has little effect at this time, the increase in the degree of load imbalance during this period can be ignored.
The experiments in this subsection test the propagation efficiency of the message diffusion protocol, measured as the number of diffusion rounds required for a message to reach 99% coverage in heterogeneous cluster systems of different sizes and with different diffusion radii. The number of rounds required for a message to diffuse to 99% of the processes is shown in Figure 9. As the system size increases, the number of rounds grows extremely slowly, which indicates that the message diffusion protocol used in the AGAPD algorithm has good scalability: the number of rounds stays around a stable value no matter how the system grows. A larger diffusion radius requires fewer rounds, since more processes exchange information in each round, so fewer rounds suffice to reach 99% coverage.
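The slow growth of the round count can be reproduced with a simple push-style gossip simulation. This is a simplified stand-in for the paper's diffusion protocol, not its actual implementation: each informed process forwards the message to `radius` random peers per round.

```python
import random

def rounds_to_coverage(n_procs, radius, target=0.99):
    """Count the rounds a push-gossip message needs to reach `target`
    coverage: every informed process contacts `radius` random peers
    per round (peers are chosen uniformly, duplicates allowed)."""
    informed = {0}                      # process 0 originates the message
    rounds = 0
    while len(informed) < target * n_procs:
        rounds += 1
        for _ in range(len(informed) * radius):
            informed.add(random.randrange(n_procs))
    return rounds
```

Because the informed set grows roughly geometrically per round, the round count scales with the logarithm of the system size, which matches the "extremely slow growth" observed in Figure 9, and a larger radius shortens it further.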

Number of diffusion rounds required to reach 99% of the processes
For the load balancing algorithms in the simulation experiments, overhead is measured simply as the amount of tasks migrated. A more accurate measurement would also include the messages sent and received by the information diffusion, but since the information diffusion of the AGAPD algorithm is designed to be asynchronous and its communication delay has little impact on the final result, it is not included in the measurement. The additional overhead of the different algorithms at different system sizes is shown in Figure 10. The additional overhead of the AGAPD algorithm remains very low (0.18), while the other algorithms incur more system overhead. This may be because the AGAPD algorithm fully accounts for the possibility of repeated task migrations and learns the structure of the probabilistic graph by reorganizing the system into a bipartite graph and a neutral set of nodes, thus reducing unnecessary task migrations.

Additional overhead of the different algorithms at different system sizes
The experiments in this subsection test the speedup improvement that the AGAPD algorithm brings to the application, compared with other algorithms, under heterogeneous cluster systems of different sizes. The speedup ratios of the different algorithms at different system sizes are shown in Fig. 11. The results show that the speedup ratio increases with system size and that, on the whole, the AGAPD algorithm brings better acceleration than the other algorithms, with the advantage growing gradually as the number of processes increases. This shows that a good load balancing algorithm can indeed improve the efficiency of parallel computing applications. The other algorithms perform modestly because the scope over which they collect load information is limited to each process's local view of the system; they achieve local load balancing, but the global effect is not ideal.

Speedup ratios of the different algorithms at different system sizes
To address the efficiency bottleneck caused by dynamic load fluctuations in massively parallel computing, this study proposes a task scheduling algorithm based on an improved adaptive iterative algorithm (AGAPD). The algorithm is compared with traditional algorithms, and its performance is verified through experiments. The results show that the proposed method saves 90.56%-93.99% of solution time and reduces hardware resource usage by 41.73%-47.27% compared with the traditional solvers, improving the efficiency of solving this type of problem. Heterogeneous systems are more prone to load imbalance, and information about computational performance is crucial for load balancing algorithms, so the remaining processes must be notified in a timely manner. The AGAPD algorithm ensures sufficient information while reducing communication overhead.