Distributed Database Optimization Techniques Combining Computer Network and Algorithm Design
Published Online: Mar 21, 2025
Received: Oct 13, 2024
Accepted: Feb 09, 2025
DOI: https://doi.org/10.2478/amns-2025-0611
© 2025 Lihua Pan et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
With the rapid development of Internet technology and the rise of the smart city concept, the volume of data produced in production, daily life and work has grown exponentially [1-2]. In particular, the rapid development of social networks and video-sharing platforms, together with the widespread deployment of urban video surveillance networks, has pushed the form of data far beyond simple text [3-4]. Traditional relational databases manage data in rows and columns: the structure of each table is predefined, and data are stored according to the fields of the table [5-6]. Modern data, by contrast, include not only traditional text but also images, video, audio and diverse data from various sensor networks. These are usually classified as unstructured data; through embedding techniques or other transformations they can be converted into more structured vector data, which then enables efficient similarity retrieval of images, audio and other content [7-9].
In today's data processing industry, the rapid growth and diversity of data place higher requirements on distributed database systems, making them indispensable. In this setting, vector indexing techniques have attracted attention for their significant role in improving the efficiency of data retrieval, especially in image search, natural language processing and recommender systems [10-13]. Integrating this technique into distributed database systems not only improves query speed but also accelerates data retrieval and analysis, which is crucial for handling huge datasets and meeting the demands of real-time applications. The applications of vector data in fields such as similarity search, recommender systems, natural language understanding and computer vision demonstrate its wide impact and important role in current data processing [14-16]. Therefore, exploring and optimizing vector indexing in distributed databases not only has broad application potential but is also of key significance for realizing the efficient operation of vector indexing in database systems, providing strong technical support for the complex challenges faced in data processing today [17-18].
A distributed database system can be obtained by taking centralized database technology as a base and combining it with computer networks. The difference between the data in a distributed database and in a centralized database is that the data are stored in a decentralized manner at different sites of the network, and the database at each site can be processed independently [19-20]. Each site also participates in the execution of global applications, which use the existing network topology to communicate and to access the data dispersed across the sites [21-22]. In actual applications, however, users do not perceive the distributed network; operations still appear to belong to a single database system. Thus, although the data of a distributed database are physically dispersed across the sites, logically they still form the dataset of one database system, which introduces a certain degree of complexity in query processing [23-25].
This article first gives a brief introduction to the definition of a distributed database system and its structural characteristics. It then analyzes the cost estimation of system query optimization on the basis of the objective of distributed data query optimization, and elaborates the distributed database query optimization process at the global optimization level from the perspectives of the query search space and the query search strategy. Next, biological immune properties are combined with a genetic algorithm to propose a new distributed database join query optimization algorithm based on an improved immunogenetic algorithm. In the experimental tests, the proposed algorithm is applied to a computer network system and analyzed and compared in simulation experiments. On random network topologies with 30 and 58 nodes, the immunogenetic algorithm and the basic algorithm are both used to find the optimal query path; the convergence curves, the final communication costs, and the mean and standard deviation over the same number of runs are compared to verify the superiority of the proposed algorithm.
A distributed database system usually consists of a number of smaller, independent computer systems distributed over several sites of a computer network. Each computer system is a relatively independent, complete whole that can be regarded as a centralized database, with its own database, database management system, underlying hardware and software, and so on, and is able to complete the work of its local site independently. The computer systems at the sites are connected through the computer network to form a logical whole that operates much like a centralized database.
Distributed database systems have the following characteristics:
In a distributed database system, each site has a local data management system, CPU, memory, hard disk and other equipment; in addition to storing the corresponding data, it can independently handle the corresponding transactions and manage the data at its site.
The data distribution of a distributed database is mainly reflected in the fact that the functional components and their data are scattered across different sites; these nodes are connected by the computer network, and a master node is responsible for management and control. When operating the distributed database, users cannot see in which subsystems the data are actually stored or how the data are distributed internally, nor do they know which operating system is used at each site. Users cannot perceive this physical decentralization of the data, which are shared through the network, so the experience is no different from operating a centralized database.
According to user needs and developers' practical considerations, the data of a distributed database are scattered across different network sites. Each site resembles a subsystem of a centralized database, with its own local data management system, CPU, memory, hard disk and other equipment; it is a relatively independent database system that can not only manage and maintain itself but also handle data transmission between sites. To ensure the consistency and integrity of the data, the database management system manages the subsystems distributed at the sites in a unified way through network communication [28]. The network structure of the distributed database system can be seen in Fig. 1.

Network structure of the distributed database system
Data redundancy is an issue specific to distributed databases that centralized databases do not have: because a distributed database stores data at different sites of the network according to requirements and connects the sites through the computer network under unified control, data redundancy is a matter that distributed databases must consider carefully. To a certain extent, data redundancy enhances the reliability of a distributed database: when a problem at one network site prevents the system from running normally, the system can obtain the relevant data from other sites through network communication to keep the whole system operating. When a user accesses data, the system can select an appropriate data copy according to the query strategy, improving data access efficiency. The network architecture of the distributed database system is shown in Figure 2.

Network architecture of the distributed database system
In a distributed database system, the databases at the different network sites store different data, and a query generally involves data transfer across sites, so the efficiency of distributed query optimization can be measured in two ways. On the one hand, it can be measured by the query cost incurred by the query execution plan, i.e., the communication cost of inter-site data transfer together with the site I/O cost and CPU cost. On the other hand, it can be measured by the response time of the user's query request, i.e., the time from the moment the user issues a query request to the moment the query plan completes and returns the results; in general, the shorter the query response time, the higher the query efficiency. This evaluation criterion has important practical significance and is usually what users care about most.
In distributed databases, the cost of communication between sites is generally high, so minimizing the total query cost and the total response time requires focusing on the communication cost.
In distributed database query operations, the query cost that affects the query response time is usually:

$$C_{total} = C_{I/O} + C_{CPU} + C_{com}$$

The communication cost of inter-site data transfer is usually:

$$C_{com}(X) = C_{0} + C_{1} \cdot X$$

where $C_{0}$ is the fixed cost of initiating a data transmission between two sites, $C_{1}$ is the transmission cost per unit of data, and $X$ is the amount of data transferred.
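For illustration, a minimal Python sketch of this cost model is given below; the function names and the sample values of $C_0$ and $C_1$ are assumptions made for the example, not values taken from the paper.

```python
# Illustrative sketch of the classic distributed-database cost model:
# total cost = CPU cost + I/O cost + communication cost, where the
# communication cost of a single transfer is modelled as C0 + C1 * X.

def communication_cost(x_units, c0=1.0, c1=0.01):
    """Cost of shipping x_units of data between two sites: C0 + C1 * X."""
    return c0 + c1 * x_units

def total_query_cost(cpu_cost, io_cost, transfers, c0=1.0, c1=0.01):
    """Total plan cost = CPU cost + I/O cost + sum of communication costs.

    `transfers` lists the data volume of each inter-site transfer in the plan.
    """
    comm = sum(communication_cost(x, c0, c1) for x in transfers)
    return cpu_cost + io_cost + comm

# Example: a plan with two inter-site transfers of 5,000 and 20,000 data units.
print(total_query_cost(cpu_cost=3.0, io_cost=12.0, transfers=[5000, 20000]))
```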
In the distributed database data query process, a query optimization process often includes four parts: data query decomposition, data localization, data global optimization and data local optimization. The distributed database query optimization structure is shown in Figure 3.

Distributed database query optimization layer structure
The cost model for data query optimization is set according to user requirements and network conditions. With this cost model, a quantitative cost can be computed for every candidate plan in the search space, providing the data basis for selecting a better query execution plan. The search strategy of the query execution process determines the final order of the query execution plan on the basis of the search space and specifies the data processing strategy, so as to reduce the query execution cost and shorten the query response time. The distributed database query optimization process is shown in Figure 4.

Query optimization process of distributed database
A genetic algorithm is an adaptive, global, probabilistic search algorithm. It searches for the optimum using probabilistic methods, can automatically acquire and exploit knowledge about the search space, and adaptively adjusts the search direction according to the environment; it is a bottom-up optimization method. Genetic algorithms represent complex problems with simple encodings, operate on the encodings of a set of variables, and guide learning and determine the search direction through genetic operations on the encoded representations and natural selection of the fittest.
The immunogenetic algorithm is an optimization algorithm that combines the advantages of the basic genetic algorithm with the characteristics of biological immunity theory, integrating knowledge from multiple disciplines and fields. The process of the immunogenetic algorithm is as follows (a code sketch of the loop is given after the steps):
1) Set the initialization parameters and randomly generate a certain number of individuals to form the initial parent population.
2) Based on prior knowledge, select a certain number of individuals with high fitness from the initial population for vaccine extraction.
3) Judge whether the current population contains the best individual; if so, the algorithm ends. Otherwise, continue with the following steps.
4) According to the concentration and fitness of the individuals in the population, replicate the high-quality individuals into the next generation.
5) Perform crossover operations on individuals in the current population according to the crossover probability.
6) Perform mutation operations on individuals in the population according to the mutation probability, and return to step 3).
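To make the flow concrete, the following Python sketch outlines one possible shape of this loop. The operator functions (extract_vaccines, vaccinate, select, crossover, mutate) are placeholders for the mechanisms described in the remainder of this section; the sketch is an illustrative assumption, not the authors' implementation.

```python
import random

def immunogenetic_search(init_population, fitness, extract_vaccines, vaccinate,
                         select, crossover, mutate,
                         max_generations=100, pc=0.6, pm=0.08):
    """Sketch of the immunogenetic loop: selection, crossover, mutation, vaccination."""
    population = list(init_population)                 # step 1: initial parent population
    vaccines = extract_vaccines(population, fitness)   # step 2: vaccine extraction
    best = max(population, key=fitness)
    for _ in range(max_generations):                   # step 3: termination (bounded here)
        population = select(population, fitness)       # step 4: concentration/fitness-based copy
        offspring = []
        for a, b in zip(population[::2], population[1::2]):
            if random.random() < pc:                   # step 5: crossover with probability pc
                a, b = crossover(a, b)
            offspring += [a, b]
        if len(population) % 2:
            offspring.append(population[-1])
        population = [mutate(ind) if random.random() < pm else ind   # step 6: mutation
                      for ind in offspring]
        population = [vaccinate(ind, vaccines, fitness) for ind in population]
        best = max(population + [best], key=fitness)   # keep the best individual found so far
    return best
```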
The ordered string encoding (real number encoding) method operates by first numbering the n relations contained in the query starting from 1, and then forming the chromosome in the numbering order of the bottom-up leaf nodes of the left linear tree [30]. The antibody coding diagram is shown in Fig. 5. The left linear tree as shown in the figure is genetically encoded as (R2, R3, R1, R4).

Antibody coding diagram
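In code, such an ordered-string chromosome can be represented simply as a list of relation identifiers; the snippet below mirrors the example encoding above and is purely illustrative.

```python
# Ordered-string (left-deep join tree) chromosome from the example above:
# join R2 and R3 first, then join the result with R1, and finally with R4.
chromosome = ["R2", "R3", "R1", "R4"]
```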
Setting the antibody evaluation criterion focuses on how well a solution fits the problem. When using an immunogenetic algorithm, it is very important to select suitable antibodies and to set a realistic antibody expectation; here the sum of the record counts of the operation results is used to determine the cost of a query execution plan. For this problem, the following fitness function is used:

$$f(x) = \frac{1}{C(x)}$$

where $C(x)$ is the cost of execution plan $x$, i.e., the sum of the numbers of records produced by its intermediate operation results; the lower the cost, the higher the fitness.
When the immune system is invaded by an antigen, B cells secrete the corresponding antibodies to defend against and eventually destroy the antigen, and this stress response is regulated by the concentration of antibodies. Accordingly, we map the antigen, B cells and antibodies onto the problem to be solved: the antigen corresponds to the problem itself, and an antibody corresponds to a feasible solution of the problem.
From the relationship between concentration and vector distance, the concentration of an antibody is defined by the proportion of antibodies in the population whose distance to it lies within a given threshold, i.e., by how many similar antibodies the population contains.
The selection probability based on antibody concentration can then be derived from this definition: antibodies with higher fitness and lower concentration receive a larger selection probability, which suppresses over-represented antibodies and preserves population diversity.
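As an illustration only (the paper's exact concentration and selection formulas are not reproduced above), one way to realize such concentration-based selection is sketched below; the similarity measure, the threshold, and the mixing weight alpha are all assumptions made for the example.

```python
import random

def concentration(population, antibody, threshold=0.8):
    """Fraction of the population whose gene-wise similarity to `antibody` exceeds a threshold."""
    def similarity(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a)
    return sum(similarity(antibody, other) >= threshold for other in population) / len(population)

def concentration_based_select(population, fitness, alpha=0.7, threshold=0.8):
    """Roulette-wheel selection that rewards high fitness and penalizes high concentration."""
    fit = [fitness(ind) for ind in population]
    conc = [concentration(population, ind, threshold) for ind in population]
    total_fit = sum(fit) or 1e-9
    total_div = sum(1 - c for c in conc) or 1e-9
    weights = [alpha * f / total_fit + (1 - alpha) * (1 - c) / total_div
               for f, c in zip(fit, conc)]
    return random.choices(population, weights=weights, k=len(population))
```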
Genetic operators mimic the process of biological inheritance and natural evolution to achieve the superiority of individuals in a population. There are mainly 3 kinds of genetic operators: selection, crossover and mutation. The following are the arithmetic rules of the 3 types of operators designed for multi-connected query trees.
1) Selection operator
It has been shown that a genetic algorithm that preserves the best individual converges to the optimal solution with probability 1. With the optimal-preservation selection strategy, the least costly join tree in the current population does not participate in the crossover and mutation operations, and the most costly individuals are eliminated [31]. This ensures that the pattern of the current best individual is not destroyed by crossover and mutation, which guarantees the convergence of the immunogenetic algorithm.
2) Crossover operator
The crossover operator recombines the genes of the two parents chosen in the selection operation. New individuals obtained through crossover combine the characteristics of their parents; crossover thus embodies the idea of information exchange. A crossover operation generates new individuals by exchanging part of the genes of two parent individuals with a certain probability.
In this paper, two-point crossover is used: when two parents are crossed, the offspring is generated by selecting a part of parent 1 while preserving the relative order of the join tables in parent 2.
For example, there are two parent individuals P1 and P2 encoded using the ordered string coding scheme, and two crossover points “|” are randomly selected.
First, the segment before the first crossover point in each parent is kept unchanged. Then parent 1 is read cyclically starting from the first position after the first crossover point, continuing past the second crossover point to the end of the string and then wrapping around to the beginning, giving the arrangement R5-R6-R7-R8-R9-R1-R2-R3-R4. For parent individual 2, the join-table codes R4, R8, R9 and R1 are already present; removing them from parent 1's arrangement gives R5-R6-R7-R2-R3, which is then copied into child 1 starting from the first position after the first crossover point, thereby determining the unknown codes X in the corresponding positions of child 1. Child 2 can be generated in the same manner.
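Because the parent and child encodings in the worked example above were lost in typesetting, the sketch below shows a generic two-point, order-preserving crossover for permutation-encoded join orders. The cut-point handling and the sample parents are assumptions for illustration, not the paper's exact operator.

```python
import random

def order_crossover(parent1, parent2):
    """Two-point order-preserving crossover for permutation-encoded join orders.

    The segment between the two cut points is copied from parent1; the remaining
    positions are filled with parent2's genes in the order they appear in parent2,
    skipping genes that are already present, so each relation occurs exactly once.
    Calling it again with the parents swapped yields the second child.
    """
    n = len(parent1)
    i, j = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = parent1[i:j + 1]                 # keep parent1's middle segment
    kept = set(child[i:j + 1])
    fill = [g for g in parent2 if g not in kept]      # parent2's genes, relative order kept
    for pos in (p for p in range(n) if child[p] is None):
        child[pos] = fill.pop(0)
    return child

# Hypothetical parents (not the paper's example, which was not fully reproduced):
p1 = ["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8", "R9"]
p2 = ["R4", "R8", "R9", "R1", "R5", "R2", "R6", "R3", "R7"]
print(order_crossover(p1, p2), order_crossover(p2, p1))
```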
Although replication and crossover can produce new gene strings, they cannot produce new genes: if every gene string is identical at a given position, the characteristic encoded by that gene does not change.
The probability formula for crossover is:
where
3) Mutation operator: the introduction of the mutation operator compensates for the fact that selection and crossover can only produce new gene strings but not new genes, and it ensures the continued evolution of the population. The main mutation schemes are uniform mutation, single-point random mutation, and transposition of multiple pairs of loci. In this paper, uniform mutation is adopted. The mutation probability function is:
In this paper, single-point gene transposition is used for gene mutation. The single-point gene transposition operation takes two positive integers $i$ and $j$ at random, with $1 \le i < j \le n$, and exchanges the genes at positions $i$ and $j$ of the chromosome.
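A minimal sketch of the transposition mutation just described, assuming a permutation-encoded chromosome:

```python
import random

def transposition_mutation(chromosome):
    """Single-point gene transposition: swap the genes at two randomly chosen positions."""
    i, j = random.sample(range(len(chromosome)), 2)
    mutated = list(chromosome)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

print(transposition_mutation(["R2", "R3", "R1", "R4"]))
```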
The basic idea of vaccination is to immunize the antibody population and apply immune selection on the basis of reasonably extracted vaccines, so as to overcome the blindness of the crossover and mutation operations of the genetic algorithm, improve antibody fitness, and inhibit population degradation. To suppress degradation during evolution and speed up convergence, the immunogenetic algorithm in this study is based on the vaccine mechanism of the IGA, with three core steps: vaccine extraction, vaccination, and immune selection.
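The sketch below illustrates one possible reading of these three steps. Representing a vaccine as (position, gene) pairs taken from elite individuals, and accepting an injected gene only if fitness does not decrease, are assumptions made for illustration rather than the authors' implementation.

```python
import random

def extract_vaccines(population, fitness, k=3, genes_per_elite=2):
    """Extract vaccines: (position, gene) pairs sampled from the k fittest individuals."""
    elite = sorted(population, key=fitness, reverse=True)[:k]
    return [(i, ind[i]) for ind in elite
            for i in random.sample(range(len(ind)), genes_per_elite)]

def vaccinate(individual, vaccines, fitness):
    """Vaccination plus immune selection: inject one vaccine gene, keep the change
    only if it does not reduce fitness (otherwise the original antibody survives)."""
    pos, gene = random.choice(vaccines)
    candidate = list(individual)
    if gene in candidate:                 # keep the permutation valid: swap, don't overwrite
        j = candidate.index(gene)
        candidate[pos], candidate[j] = candidate[j], candidate[pos]
    return candidate if fitness(candidate) >= fitness(individual) else list(individual)
```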
The execution efficiency of the experimental system varies with the operating environment. The experimental environment used in this paper consists of the main computer running the system (Server1) and two database servers (Server2 and Server3); the three machines are interconnected to form a small local area network (LAN).
For the immunogenetic algorithm, as for other non-exhaustive search strategies, the parameter settings have a great impact on performance, so an optimization strategy based on a genetic algorithm first needs to tune the algorithm's own parameters. We therefore first examine, for different parameter settings, the average of the optimal solutions over the last 10 generations when the algorithm has evolved for the corresponding number of generations.
First, we investigate the performance of the algorithm under different crossover rates. With the crossover rate Pc of the immunogenetic algorithm set to 0.4, 0.6 and 0.8 and the number of joins set to 8, the performance under the different crossover rates is shown in Fig. 6. When Pc = 0.4, the algorithm converges well. When Pc = 0.8, the average value of the optimal solution over the last 200 generations is worse than in the other two cases and convergence is slower. When Pc = 0.6, both the average value of the optimal solution and the convergence speed are better than in the other two cases. Taking both convergence speed and the average value of the optimal solution into account, Pc = 0.6 is the best choice.

The performance of the immune genetic algorithm in different cross-rates
Next, we examine the performance of the algorithm for different initial population sizes pop_size. In general, a small initial population improves the running efficiency of the immunogenetic algorithm, but the reduced population diversity may cause premature convergence; a larger initial population reduces efficiency but allows the algorithm to search a larger space. With the number of joins set to 8, the initial population size is set to 30, 50 and 70 in turn, and the execution cost of the algorithm is shown below. The performance of the immunogenetic algorithm with different initial population sizes is shown in Fig. 7. The algorithm performs better with an initial population size of 30 than with 50 or 70, and the performance with a population size of 50 lies between that with 30 and 70. An initial population size of 30 is therefore the better choice.

The immune genetic algorithm performs in different initial population sizes
In this paper, the performance of the dynamic programming algorithm and the immunogenetic algorithm is compared in simulation experiments. The control parameters are set to the values obtained in the previous experiments: initial population pop_size = 30, crossover rate Pc = 0.6, and mutation rate Pm = 0.08. These parameters are fed into the genetic algorithm program, each query is executed 10 times, and the query execution time is taken as the average over the 10 runs. When the number of relations to be optimized is 4, the genetic algorithm and the dynamic programming algorithm obtain average execution times of 5 and 6 seconds respectively, which is not a large difference. From 12 relations onward, however, there is a significant difference in query execution cost; the comparison of the optimization performance of the immunogenetic algorithm and the dynamic programming algorithm is shown in Figure 8. According to the results in the figure, when the number of relations joined by the query is less than 10, the cost of the two algorithms is basically the same. When the number of joined relations is greater than 10, the immunogenetic algorithm is clearly better: with 16 and 20 joined relations, the execution cost of the plan produced by the immunogenetic algorithm is 22% and 37.5% lower, respectively, than that produced by the dynamic programming algorithm.

The optimization performance comparison
Although the experiments in this paper mainly focus on the query optimization problem for distributed databases, owing to the limitations of the experimental environment the simulation experiments were ultimately conducted on a single computer.
This simulation experiment focuses on the network cost characteristics of the algorithm and its stability on the distributed database query optimization problem. Tests are carried out on network topologies with 30 and 58 nodes, respectively. It is assumed that the node issuing the query command is node 3, i.e., the source node S, and that the data required to complete the query are distributed over five nodes, 4, 6, 8, 10 and 12, i.e., the target nodes E. The focus of the experiment is how to connect the source node S to the target nodes along a path with the lowest possible communication cost. To allow a performance comparison, the basic ant colony algorithm is also simulated in the same environment and on the same network topology. The simulation results are as follows:

Network topology diagram
The optimal crawling path (30 nodes) for the immunogenetic algorithm is shown in Figure 10. As can be seen from the figure, the solid dots indicate the target nodes, and the thicker lines connected together represent one of the best paths searched by the algorithm.

The optimal crawling path of immune genetic algorithm
To demonstrate the advantage of the algorithm, simulation experiments with the basic ant colony algorithm were also conducted in the same environment for comparison. The optimal solution crawling path of the basic algorithm is shown in Fig. 11.

The basic immune genetic algorithm is optimal
Both algorithms were used to find the target nodes in the same environment, so how do they compare in terms of communication cost? Since this simulation is run on a single computer, the I/O cost and CPU cost are the same for both algorithms, so only the communication cost is compared. The communication cost here refers to the communication time, including the time within each node and the time on each link. When the network topology is generated, the communication time on every path is set randomly, and the communication time at each node is likewise set, in milliseconds. The immunogenetic algorithm is run for 50 effective iterations. Its query convergence curve (30 nodes) is shown in Fig. 12, and the optimal communication cost obtained during its optimization is ans = 359.0032. The query convergence curve of the basic algorithm (30 nodes) is shown in Fig. 13, and the optimal communication cost it obtains is ans = 475.1463. From the figures it is clear that the optimal solution found by the immunogenetic algorithm has a lower communication cost than that found by the basic algorithm. The basic ant colony algorithm starts with a communication cost of about 560, which decreases relatively quickly in the early iterations, but after four iterations the rate of decrease slows markedly; by about generation 5 it is close to convergence, falls into a local optimum, and the communication cost no longer decreases, remaining at about 472. The communication cost of the immunogenetic algorithm decreases more slowly at first, and the algorithm stagnates briefly in the 3rd generation; with the smoothing mechanism and learning mechanism it escapes this stagnation, continues searching, and begins to converge at about generation 10, by which time the communication cost has been reduced to 361. In this respect its performance is better than that of the basic algorithm.

Immune genetic algorithm query convergence curve

Basic algorithm query convergence curve
Using the same algorithm and the same parameters, the same simulation was conducted on the simulated network topology of 58 nodes, and the results are as follows. The network topology relation diagram (58 nodes) is shown in Fig. 14.

Network topology diagram
The final solution found by the immunogenetic algorithm is shown in the figure below. The optimal solution crawling path (58 nodes) of the immune genetic algorithm is shown in Figure 15.

The immune genetic algorithm is optimal
Similarly, the basic genetic algorithm has been used for the related simulations. The basic genetic algorithm optimal solution crawling path is shown in Fig. 16.

The basic algorithm is optimal
How does the convergence performance of the two algorithms compare for the network topology with 58 nodes? The query convergence curves of the immunogenetic algorithm and the basic genetic algorithm (58 nodes) are shown in Figures 17 and 18. The optimal communication cost obtained by the immunogenetic algorithm is ans = 455.5255, and that obtained by the basic genetic algorithm is ans = 463.2485. It can be seen from the figures that the basic algorithm performs better at the start of convergence, but this is because it falls into a local optimum too early. The immunogenetic algorithm obtains the optimal solution in the 23rd generation, after short periods of stagnation in the 5th, 10th and 15th generations. In terms of the communication cost of the final optimal path, the immunogenetic algorithm performs better, although the advantage is not particularly large.

Immune genetic algorithm query convergence curve

Basic algorithm query convergence curve
The stability of the immunogenetic algorithm and the basic algorithm was then tested and compared in the same environment with 60 nodes. Both algorithms were run 10 times; since both select routes with a certain probability when searching, the optimal paths found may differ even over 10 runs in the same environment. The stability of the two algorithms is compared using the mean and the standard deviation. The standard deviation and mean of the running results of the immunogenetic algorithm and the basic ant colony algorithm are compared in Fig. 19 and Fig. 20. A smaller standard deviation indicates less fluctuation in the objective function value obtained in each run and hence greater stability, while a smaller mean indicates better optimization performance. The figures clearly show that the immunogenetic algorithm is more stable than the basic algorithm, and its optimization performance is also slightly better. Although the network topology is generated randomly and the performance of the two algorithms varies somewhat from run to run, overall the immunogenetic algorithm outperforms the basic genetic algorithm in communication cost and is more stable.

The algorithm runs the comparison of the standard deviation

The algorithm runs the mean comparison
In this paper, the application of the immunogenetic algorithm to the multi-join query optimization problem has been studied in depth, and the simulation experiments and performance analysis show that the proposed algorithm has certain advantages over the basic algorithms.
In the simulation environment, a set of parameter values suitable for the computer network system was obtained through repeated experiments; for example, the crossover rate Pc = 0.6 is the optimal choice when both the convergence speed and the average value of the optimal solution are considered. Applying the experimentally derived parameter values to optimize distributed multi-join queries achieves the expected optimization effect.
Under the same conditions, the basic algorithm was also tested: the optimal communication cost obtained by the immunogenetic algorithm is 359.0032, while that obtained by the basic algorithm is 475.1463. It can be concluded that the algorithm designed in this paper finds better solutions and is more stable.
