Construction of data security protection model in archive informatization system based on deep learning
Published Online: Mar 21, 2025
Received: Nov 06, 2024
Accepted: Feb 09, 2025
DOI: https://doi.org/10.2478/amns-2025-0597
© 2025 Min Feng, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Data security refers to the set of measures that protect data from illegal access, tampering, destruction, or leakage. In modern society, data has become an indispensable resource for enterprises and individuals, so the importance of data security is increasingly emphasized. With the popularization of the Internet and the development of informatization, data security and personal information security have become topics of great concern [1-4]. In this information age, leaks of data and personal information have become common, causing great trouble in people's lives and work. Data security protection has therefore become a critical task [5-8].
During the circulation and dissemination of archival information, network insecurity and information system vulnerabilities give hackers opportunities for intrusion, virus infection, and illegal theft of information, leaving archival information security extremely vulnerable and causing great losses to national security and the interests of the people [9-12]. Archive management security and confidentiality work relies mainly on information technology, computer technology, and network technology: it takes archival resources as its object, is guided by relevant archive management theory, and, in combination with national archive management requirements and those of information-society authorities, carries out the collection, sorting, development, preservation, and use of archives [13-16]. To effectively strengthen the security and confidentiality management of data in the archive informatization system, it is necessary to improve the professional quality of management personnel, perfect the security management system, and apply informatization technology to ensure the data security of the archive informatization system [17-19].
Literature [20] discusses the definition, classification, and application of cloud storage, analyzes the challenges and requirements faced by cloud storage systems in terms of data security and privacy protection, and outlines data encryption techniques and protection methods. Literature [21] reveals the many challenges faced by document management systems and emphasizes the importance of security protection for computer-aided management information systems; it points out multiple aspects of the security of computer-aided records management information systems, including theft and destruction of computer information and computer viruses, and emphasizes different measures for different situations. Literature [22] proposes a unique security system that combines multiple encryption algorithms with steganography, developing effective methods to encrypt and decrypt data using fast and secure symmetric key algorithms designed to enhance data security. Literature [23] proposes a series of systematic protection measures against security threats such as data leakage and cyber-attacks, aiming to provide a safe and reliable data environment for universities; it notes that the computerization of archives helps safeguard the integrity and availability of information resources, thus improving the efficiency of university management. Literature [24] builds on existing literature research, examines the importance of enterprise archive management in the context of informatization, and puts forward optimization strategies for its problems, providing theoretical and practical references for the digital transformation of enterprise archives.
Literature [25] introduces the important role played by computers in the field of document management, describes the many security problems exposed by document management systems in the computer environment, and shows that the security maintenance of computerized document management systems has become an important issue. Literature [26] discusses the advantages of applying blockchain technology to the security protection of digital archive information, and provides security protection strategies for digital archives from the aspects of regulations, standards, and facility construction in order to improve the intelligence level of digital archive resource management. Literature [27], based on the importance of optimizing the information services of college archives, elucidates the difficulties colleges and universities face in optimizing archive information services in the information age and puts forward optimization initiatives, with a view to providing references for the development of archive management in colleges and universities.
In this paper, we first provide an overview of the basics of deep learning and the nature of privacy protection. Then, starting from the security protection of data in the archive information system, we introduce a key conversion server and apply homomorphic re-encryption in the asynchronous stochastic gradient descent process of deep model training, in order to address the possible leakage of the data privacy of federated learning participants during model training. Finally, a deep federated learning model (PDEC) training technique is constructed that can protect the data security of the other participants even if a federated learning participant colludes with the parameter server, providing technical support for the security protection of data in the archive informatization system.
Deep learning is a machine learning method that achieves automatic learning and prediction of complex patterns and relationships by simulating the human brain’s learning process through deep network models. The workflow of deep learning includes two main phases: training and prediction.
Training phase: the process in which the deep learning model learns patterns and relationships from a large amount of labeled data. In this phase, the model gradually improves its performance by iteratively adjusting its weights to minimize the difference between the predicted output and the true labels. Iterative training and optimization enable the model to make accurate predictions on diverse input data.
Prediction phase: the process in which the deep learning model [28] applies what it has learned to make inferences and predictions about new data. In this phase, the model no longer adjusts its weights but uses the patterns and relationships obtained during training to generate output results directly. The powerful inference capability of deep learning models enables them to process unseen data accurately.
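As a minimal, self-contained sketch of these two phases (the linear model, data, and learning rate below are illustrative, not from the paper):

```python
import numpy as np

# Minimal illustration of the two phases on a linear model y = w * x.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, 100)
y = 3.0 * X  # labeled data generated with true weight 3

# Training phase: iteratively adjust the weight to minimize the
# squared difference between predictions and labels.
w = 0.0
for _ in range(200):
    grad = 2.0 * np.mean((w * X - y) * X)
    w -= 0.5 * grad

# Prediction phase: apply the learned weight to new data without
# any further weight updates.
y_new = w * np.array([0.5, -0.25])
```

After training, the weight converges to the true value and the prediction phase is a single forward computation.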
A neural network mainly comprises convolutional layers, activation layers, pooling layers, and fully connected layers, which are described in detail below.
Convolutional Layer. The convolutional layer is a commonly used layer in deep learning and is particularly suited to processing image data. It performs a convolution at each location by sliding a convolution kernel over the input data, helping the network learn the spatial hierarchy and features of the input. Assuming that the input data are represented as X and the convolution kernel as W, the output at position (i, j) can be written as

Y(i, j) = Σ_m Σ_n X(i+m, j+n)·W(m, n) + b,

where b is the bias term and the summations run over the spatial extent of the kernel.
Activation Layer. The activation layer introduces nonlinear transformations to increase the expressive power of the network. The activation function is usually applied to the output of each neuron, mapping the result of a linear combination into a nonlinear range. In neural networks, the activation layer typically follows a convolutional or fully connected layer and performs an activation operation on its output. A common activation function is the ReLU function, which is mathematically represented as

ReLU(x) = max(0, x).

In addition, the Sigmoid function [29] is also commonly used as an activation function, which is mathematically represented as

σ(x) = 1 / (1 + e^(−x)).
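The two activation functions above can be sketched in NumPy (an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x), applied element-wise.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid: 1 / (1 + e^(-x)), maps any input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))
```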
Pooling Layer. The pooling layer extracts the main features of the input data, reduces its spatial dimensions, and lowers the computational complexity while retaining the main features. Common pooling operations include maximum pooling and average pooling. Let the input data be X. The maximum pooling operation can be expressed as

y(i, j) = max_{(m, n) ∈ R(i, j)} X(m, n),

where R(i, j) denotes the pooling window corresponding to output position (i, j). Accordingly, the average pooling can be expressed as

y(i, j) = (1 / |R(i, j)|) Σ_{(m, n) ∈ R(i, j)} X(m, n).
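Both pooling operations can be sketched as a single NumPy routine over non-overlapping windows (the window size and layout here are illustrative):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    # Non-overlapping pooling over size x size windows of a 2-D input.
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # trim so windows tile evenly
    windows = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))   # maximum pooling
    return windows.mean(axis=(1, 3))      # average pooling
```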
Fully Connected Layer. The fully connected layer is usually the last layer of a neural network and maps the learned features to the final output space, such as the category scores of a classification problem or the predicted values of a regression problem. Let the output of the previous layer be x; the fully connected layer computes

y = W·x + b,

where · denotes matrix multiplication, W is the weight matrix, and b is the bias vector.
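The fully connected mapping y = W·x + b is a one-liner; a tiny sketch:

```python
import numpy as np

def dense(x, W, b):
    # Fully connected layer: y = W · x + b.
    return W @ x + b
```

For a classification head, the resulting vector would typically be passed through a softmax to obtain class scores.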
The Transformer is an encoder-decoder architecture [30] in which the two parts have a similar structure, so the following description focuses on the encoder. The encoder consists of a stack of identical blocks, each containing two sub-layers: a multi-head self-attention function and a feedforward network. Residual connections and layer normalization are employed around these two sub-layers. The specific architecture and operations within each encoder block are described below.
Self-attention function. An attention function can be described as mapping a query Q and a set of key-value pairs (K, V) to an output:

Attention(Q, K, V) = softmax(QK^T / √d_k)·V,

where d_k is the dimension of the keys and the scaling factor 1/√d_k keeps the dot products from growing too large. Multi-head attention projects Q, K, and V h times with different learned projections, applies the attention function in parallel, and concatenates the results:

MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W^O,

where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V) and W^O is the output projection matrix.
Feedforward network. The fully connected feedforward network consists of two linear transformations bridged by the GELU activation function, where GELU is the Gaussian Error Linear Unit. The network can be formally represented as

FFN(x) = W_2·GELU(W_1·x + b_1) + b_2,

where W_1, W_2 are weight matrices and b_1, b_2 are bias vectors.
In addition to the encoder-decoder structure described above, the Transformer model uses an embedding layer at the beginning to transform the input tokens into continuous vector representations, to which positional encodings are added.
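The scaled dot-product attention at the core of each encoder block can be sketched as follows (single head, no learned projections, for illustration only):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V
```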
Differential privacy distinguishes itself from previous privacy-preserving schemes in that its main contribution is a mathematical definition of individual privacy leakage: it maximizes the usability of query results while ensuring that an individual user's privacy leakage does not exceed a predefined threshold.
Norms are usually used to measure the length or size of an object in a vector space [31], and they are widely used in differentially private deep learning, e.g., norm clipping of the model weight gradients. The definition of the L_p norm is given below.

Definition 1 (L_p norm). For a vector x = (x_1, x_2, …, x_n), its L_p norm is defined as

║x║_p = (Σ_i |x_i|^p)^(1/p),

where p ≥ 1. A norm satisfies the following properties:

Positive definiteness: ║x║ ≥ 0, and ║x║ = 0 if and only if x = 0.
Absolute homogeneity: ║αx║ = |α|·║x║ for any scalar α.
Subadditivity (triangle inequality): ║x + y║ ≤ ║x║ + ║y║.
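A small sketch of the L_p norm and of the gradient norm clipping mentioned above (the clipping threshold is illustrative):

```python
import numpy as np

def lp_norm(x, p=2):
    # L_p norm: (sum_i |x_i|^p)^(1/p).
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

def clip_gradient(g, clip_norm):
    # Norm clipping as used in differentially private deep learning:
    # rescale g so that its L2 norm is at most clip_norm.
    n = lp_norm(g, 2)
    return g * min(1.0, clip_norm / n) if n > 0 else g
```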
In differential privacy, the L_1 norm is used to measure the distance between databases, as defined below.
Definition 2 (Database distance). Let D be a database whose records come from a universe 𝒳, represented as a histogram D ∈ ℕ^|𝒳|, where ℕ denotes the non-negative integers. ║D║_1 describes the size of database D, i.e., the number of records contained in the database.

Definition 3 (Neighboring databases). For any two databases D_1, D_2 ∈ ℕ^|𝒳|, if

║D_1 − D_2║_1 ≤ 1

is satisfied, then D_1 and D_2 are said to be neighboring databases. ║D_1 − D_2║_1 denotes the distance between D_1 and D_2, i.e., they differ in at most one record.
Differential privacy [32] is able to maximize the accuracy of data queries while limiting the leakage of private personal data, and it defends against an attacker with maximal background knowledge. It is defined below.
Definition 4 ((ε, δ)-differential privacy). Let M be a randomized algorithm. If for any two neighboring databases D_1 and D_2 and any subset S of the output range of M,

Pr[M(D_1) ∈ S] ≤ e^ε·Pr[M(D_2) ∈ S] + δ,

then the randomized algorithm M satisfies (ε, δ)-differential privacy, where ε is the privacy budget and δ is the permitted failure probability.
The properties of the differential privacy preserving model are described below.
Property 1 (Post-processing). Let a randomized algorithm M satisfy (ε, δ)-differential privacy. Then for any (possibly randomized) function f, the composition f∘M also satisfies (ε, δ)-differential privacy.

The post-processing property shows that if a randomized mechanism satisfies (ε, δ)-differential privacy, then no function applied to its output can weaken this privacy guarantee.
Property 2 (Group privacy). If a randomized mechanism M satisfies (ε, δ)-differential privacy, then for any two databases differing in at most k records, M satisfies (kε, k·e^((k−1)ε)·δ)-differential privacy. Group privacy thus degrades the guarantee gracefully as the size of the protected group grows.
Property 3 (Sequential composition). If randomized algorithms M_1, …, M_k satisfy (ε_1, δ_1)-, …, (ε_k, δ_k)-differential privacy respectively, then their sequential combination on the same database satisfies (Σ_i ε_i, Σ_i δ_i)-differential privacy.
Property 4 (Parallel composition). If randomized algorithms M_1, …, M_k satisfy (ε_1, δ_1)-, …, (ε_k, δ_k)-differential privacy and operate on pairwise disjoint subsets of the database, then their combination satisfies (max_i ε_i, max_i δ_i)-differential privacy.
The advantage of differential privacy is that these composition properties automatically preserve privacy without any special operations by the database administrator, so composed algorithms retain a quantifiable privacy guarantee.
The data processed by differential privacy techniques for data protection falls into two main categories: numerical and non-numerical queries. The elements of the differential privacy mechanism are defined next.
Definition 4 (Sensitivity). For any two neighboring datasets D_1, D_2 ∈ 𝒟, the sensitivity of a deterministic real-valued function f: 𝒟 → ℝ is defined as Δf = max_{D_1, D_2} |f(D_1) − f(D_2)|.
Definition 5 (Laplace mechanism). For queries whose results are numerical, any function f: 𝒟 → ℝ can be protected by the Laplace mechanism

M(D) = f(D) + Lap(Δf/ε),

where Lap(b) denotes random noise drawn from the Laplace distribution with location 0 and scale b, and Δf is the sensitivity of f.
Definition 6 (Gaussian mechanism). For any function f: 𝒟 → ℝ, the Gaussian mechanism is defined as

M(D) = f(D) + N(0, σ²),

where N(0, σ²) denotes Gaussian noise with standard deviation σ ≥ √(2 ln(1.25/δ))·Δf/ε; with this choice of σ, the mechanism satisfies (ε, δ)-differential privacy.
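Both noise mechanisms can be sketched directly from Definitions 5 and 6 (the query value, sensitivity, and privacy budget below are illustrative):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    # Definition 5: add Laplace noise with scale Δf / ε.
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

def gaussian_mechanism(true_value, sensitivity, epsilon, delta, rng):
    # Definition 6: add Gaussian noise with
    # σ = sqrt(2 ln(1.25/δ)) · Δf / ε.
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon
    return true_value + rng.normal(0.0, sigma)
```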
The Laplace and Gaussian mechanisms achieve differential privacy protection by adding artificial random noise to the numerical output of the query. However, for non-numerical data, the output of the query is an element of a discrete candidate set, and the exponential mechanism defined below is used instead.
Definition 7 (Exponential mechanism). For non-numeric data, the exponential mechanism M(D, u, R) selects an output r from the candidate set R with probability proportional to exp(εu(D, r)/(2Δu)), where u(D, r) is the utility function scoring candidate r on database D and Δu is the sensitivity of u.
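A sketch of the exponential mechanism (candidates and utility scores are illustrative):

```python
import numpy as np

def exponential_mechanism(candidates, utilities, sensitivity, epsilon, rng):
    # Definition 7: select r with probability ∝ exp(ε u(D, r) / (2 Δu)).
    scores = epsilon * np.asarray(utilities, dtype=float) / (2.0 * sensitivity)
    scores -= scores.max()  # numerical stability before exponentiation
    probs = np.exp(scores)
    probs /= probs.sum()
    return rng.choice(candidates, p=probs)
```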
Measuring the dissimilarity between two distributions is one of the core topics in machine learning. Since estimating the distance between two distributions with the Wasserstein Distance (WD) is more stable for model training than earlier methods, this paper uses the WD-based method to calculate the distance between two distributions.
Definition 8 (Wasserstein distance). Let P_1 and P_2 be the probability distributions of any two random variables. The Wasserstein distance between them is defined as

W(P_1, P_2) = inf_{γ ∈ Π(P_1, P_2)} E_{(x, y)∼γ}[║x − y║],

where Π(P_1, P_2) denotes the set of all joint distributions γ whose marginals are P_1 and P_2.
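For one-dimensional empirical distributions with equally many samples, the Wasserstein distance reduces to the mean absolute difference of the sorted samples, since the optimal coupling matches order statistics; a minimal sketch:

```python
import numpy as np

def wasserstein_1d(samples_p, samples_q):
    # 1-Wasserstein distance between two equal-size 1-D empirical
    # distributions: mean |p_(i) - q_(i)| over sorted samples.
    p = np.sort(np.asarray(samples_p, dtype=float))
    q = np.sort(np.asarray(samples_q, dtype=float))
    return float(np.mean(np.abs(p - q)))
```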
Definition 9 (Lipschitz continuity). For any two metric spaces (X, d_X) and (Y, d_Y), a function f: X → Y is Lipschitz continuous if there exists a constant K ≥ 0 such that for all x_1, x_2 ∈ X,

d_Y(f(x_1), f(x_2)) ≤ K·d_X(x_1, x_2)

holds; the smallest such constant K is called the Lipschitz constant of f.
This section proposes a deep federated learning modeling approach (PDEC) for efficient privacy-preserving data collaboration based on the federated learning framework. Each learning participant performs the following operations:
1) Generate its own key pair and disclose its public key to the other members of the scheme.
2) Randomly select a certain number of samples from the local training dataset as the training data for this round.
3) Download the encrypted weights updated in the previous round from the DSP.
4) Decrypt the weight ciphertext with its own private key.
5) Calculate the new gradient based on the data obtained in step 2) and the weights obtained in step 4).
6) Encrypt the newly computed gradient with its public key and send the ciphertext to the KTS.
The key transformation server KTS performs the following main operations:
1) Select a security parameter and compute the public parameters.
2) Generate the KTS's own key pair and then negotiate the Diffie-Hellman key with the DSP.
3) Receive ciphertexts from the learning participants and perform FPRE on these ciphertexts.
4) Send the FPRE-processed ciphertexts to the corresponding computational units of the DSP.
The data service provider DSP performs the following operations:
1) Generate the DSP's own key pair and then negotiate the Diffie-Hellman key with the KTS.
2) Receive ciphertexts from the KTS and perform SPRE on these ciphertexts.
3) Use the encrypted gradients obtained in step 2) to update the weights.
4) Store the updated encrypted weights in the corresponding computation units.
Parameter generation. The KTS first chooses a security parameter and computes the public parameters of the scheme.

Key generation. Each learning participant generates its own key pair, and the KTS and DSP also generate their key pairs. The KTS and DSP then negotiate a shared Diffie-Hellman key.
Since the scheme presented in this chapter uses model parallelism, the deep neural network is split into several parts, each of which is handled by a corresponding computational unit.
Before encrypting the weights and gradient vectors with homomorphic re-encryption, they need to be encoded. A real number is converted to an integer by multiplying it by a fixed scaling factor and rounding, so that it can be processed by the integer-based homomorphic scheme.
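A minimal sketch of such a fixed-point encoding (the precision parameter is illustrative; the paper's exact encoding is not specified here):

```python
def encode(x, precision=6):
    # Map a real number to an integer by scaling with 10^precision,
    # so it can be handled by an integer homomorphic scheme.
    return int(round(x * 10 ** precision))

def decode(n, precision=6):
    # Inverse mapping back to a real number.
    return n / 10 ** precision
```

Homomorphic addition of ciphertexts then corresponds to integer addition of the encodings, provided all parties share the same scaling factor.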
Take the first round of weight updates as an example.
First-Phase Re-Encryption (FPRE). The KTS performs first-phase re-encryption on the ciphertexts received from the learning participants. The KTS then sends the re-encrypted ciphertexts to the corresponding computational units of the DSP.
Second-Phase Re-Encryption (SPRE) and Homomorphic Addition. The DSP receives the re-encrypted ciphertexts from the KTS, performs second-phase re-encryption on them, and aggregates the gradient ciphertexts by homomorphic addition.
It is worth noting that the homomorphic addition is carried out entirely over ciphertexts, so the DSP never observes any plaintext gradient. The final aggregated encrypted weights are stored in the corresponding computation units of the DSP.
From the second round of the weight update process onward, each learning participant downloads the updated weight vector from the corresponding computational unit of the DSP and decrypts it with its own private key to obtain the plaintext weights for the next round of local training.
When the training has not yet reached the end condition, each learning participant repeats the download-decrypt-compute-encrypt-upload steps above to iterate the weight update process until the model converges.
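The iterated weight update can be sketched in the clear (a plaintext stand-in: the encryption, FPRE/SPRE steps, and the model-specific gradient are all replaced by illustrative placeholders):

```python
import numpy as np

def local_gradient(w, X, y):
    # Stand-in for a participant's local gradient computation
    # (least-squares loss on the local batch).
    return 2.0 * X.T @ (X @ w - y) / len(y)

def federated_round(w, participants, lr=0.1):
    # One aggregation round: in the real scheme each gradient is
    # encrypted by its owner, re-encrypted by the KTS (FPRE), and
    # aggregated homomorphically by the DSP (SPRE); here the
    # aggregation is done in plaintext for illustration.
    grads = [local_gradient(w, X, y) for X, y in participants]
    return w - lr * np.mean(grads, axis=0)
```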
The security of the scheme proposed in this paper rests mainly on the computational hardness of the decisional Diffie-Hellman (DDH) problem on the underlying group.
It is assumed here that the KTS and DSP do not collude, but the KTS or DSP may collude with one of the federated learning [33] participants. Next, we first prove that the scheme proposed in this chapter is secure in the presence of a non-colluding semi-honest adversary.
This section analyzes and evaluates the performance of the KTS scheme in terms of functionality, accuracy, and resource overhead. In this experiment, a Lenovo server running Ubuntu 18.04 is used to simulate a cloud server, with hardware configuration "RAM: 16 GB, SSD: 256 GB, CPU: 2.10 GHz"; the system is implemented in Java. A 5-layer fully connected neural network, consisting of one input layer, three hidden layers, and one output layer, is used to train and test the model on data from the dataset 1 database. In addition, this scheme uses the homomorphic re-encryption primitives described in the previous section.
In the experimental analysis in this section, the user-side resource overhead is analyzed around three important parameters: the total number of users participating in gradient aggregation, the number of gradients held by a single user, and the user drop rate.
Analyzed from the perspective of time complexity, the computational overhead of a single user is divided into four parts, including generating a signature and encrypting the gradients; each part is dominated by modular exponentiation operations.
The variation of single-user elapsed time with the growth of the number of users is shown in Fig. 1. As the number of users increases, the computational overhead of a single user tends to increase linearly. This is because, in the process of signature generation and decryption, the number of users enters the modular exponentiation operations as part of the exponent.

Single-user elapsed time versus the number of users
The variation of single-user elapsed time with the growth of the number of single-user gradients is shown in Fig. 2. The computational overhead of a single user increases linearly with the number of single-user gradients. The reason is that, during the encryption process, each of a user's gradients must be encrypted individually, so the encryption cost grows with the number of gradients.

Single-user elapsed time versus the number of gradients per user
Fixing the number of users and the number of single-user gradients, the variation of single-user elapsed time with the drop rate is shown in Fig. 3.

Single-user elapsed time versus the drop rate
Fig. 4, Fig. 5 and Fig. 6 show the variation of single-user data transmission with the growth of the number of users, the number of single-user gradients, and the drop rate, respectively. When the number of users or the number of single-user gradients increases, the single-user data transmission grows linearly, while it decreases as the drop rate rises.

Single-user data transmission versus the number of users

Single-user data transmission versus the number of gradients per user

Single-user data transmission versus the drop rate
Computational overhead
In terms of computational complexity, the computational overhead of the cloud server includes implementing the secure aggregation of gradients in ciphertext form and decrypting the aggregated result, both dominated by modular exponentiation operations.
The results of the computational overhead of the cloud server are shown in Table 1. The computational overhead of the server tends to increase linearly as the number of users increases. This is because the more users there are, the more gradients are aggregated in the cloud server. In addition, the number of users also participates in the modulo exponential operation in the decryption process as part of the exponent. In addition, the computational overhead of the server tends to increase linearly as the number of gradients per user increases. The reason for this is that the more gradients a single user has, the more gradients the cloud server needs to aggregate. The computational overhead of the server decreases linearly with the increase in the drop rate. This is because the greater the dropout rate, the fewer users are online, which causes the cloud server to aggregate fewer gradients in the aggregation process.
Calculation overhead of cloud server

| Number of users | Overall running time (×10^5 ms) | Gradients per user | Overall running time (×10^5 ms) | Drop rate | Overall running time (×10^4 ms) |
|---|---|---|---|---|---|
| 100 | 0.501 | 1000 | 0.498 | 0.00 | 5.033 |
| 150 | 0.712 | 1500 | 0.738 | 0.05 | 4.731 |
| 200 | 0.900 | 2000 | 0.958 | 0.10 | 4.459 |
| 250 | 1.134 | 2500 | 1.191 | 0.15 | 4.218 |
| 300 | 1.346 | 3000 | 1.461 | 0.20 | 4.021 |
| 350 | 1.565 | 3500 | 1.638 | 0.25 | 3.765 |
| 400 | 1.822 | 4000 | 1.883 | 0.30 | 3.507 |
| 450 | 2.011 | 4500 | 2.116 | 0.35 | 3.221 |
| 500 | 2.222 | 5000 | 2.400 | 0.40 | 2.995 |
Communication overhead
From the perspective of communication complexity, the communication overhead of the cloud server consists of three parts: receiving the proofs of correctness from the users, receiving the users' encrypted gradients, and returning the aggregated result.
The results of the communication overhead of the cloud server are shown in Table 2. As the number of users increases, the server communication overhead shows a linear increase. This is because the higher the number of users, the higher the overall amount of messages sent and received by the cloud server. The communication overhead increases linearly as the number of gradients per user increases. This is because the more the number of gradients for a single user, the more the overall amount of messages sent and received by the cloud server. The communication overhead decreases linearly as the drop rate increases. This is due to the fact that the higher the dropout rate, the fewer the remaining users, and therefore, the smaller the total amount of messages sent and received by the cloud server.
Communication overhead of cloud servers

| Number of users | Total transmitted data (×10^8 Byte) | Gradients per user | Total transmitted data (×10^8 Byte) | Drop rate | Total transmitted data (×10^7 Byte) |
|---|---|---|---|---|---|
| 100 | 0.707 | 1000 | 0.738 | 0.00 | 7.812 |
| 150 | 1.089 | 1500 | 1.100 | 0.05 | 7.532 |
| 200 | 1.422 | 2000 | 1.463 | 0.10 | 7.275 |
| 250 | 1.805 | 2500 | 1.785 | 0.15 | 6.995 |
| 300 | 2.167 | 3000 | 2.147 | 0.20 | 6.739 |
| 350 | 2.510 | 3500 | 2.510 | 0.25 | 6.459 |
| 400 | 2.872 | 4000 | 2.882 | 0.30 | 6.217 |
| 450 | 3.255 | 4500 | 3.265 | 0.35 | 5.938 |
| 500 | 3.597 | 5000 | 3.597 | 0.40 | 5.674 |
Since the data held by the devices involved in data collaboration are usually non-independently and identically distributed (non-IID), local updates from some devices may become outliers that deviate from the global convergence trend. One of the innovations of this method is a gradient space sparsification strategy, which removes irrelevant local updates in the upload phase so that the model reduces unnecessary communication overhead without losing accuracy. To verify the effectiveness of this strategy on non-IID data, the following experimental setup is used: (a) another version of the method, called PDEC-C, is obtained by omitting the gradient space sparsification step while keeping all other steps unchanged; (b) in addition to the non-IID versions, datasets 1 and 2 are partitioned homogeneously and allocated to 100 devices to obtain IID versions; (c) the communication overhead and accuracy of PDEC and PDEC-C are compared on both the IID and non-IID versions of the datasets.
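The relevance test itself is not spelled out above; a plausible sketch uses the cosine similarity between a local update and the current global update direction (the threshold and the aggregation rule are illustrative assumptions, not the paper's exact criterion):

```python
import numpy as np

def is_relevant(local_update, global_update, threshold=0.0):
    # Keep a local update only if it points roughly in the same
    # direction as the global update; otherwise drop it before upload.
    num = float(np.dot(local_update, global_update))
    den = float(np.linalg.norm(local_update) * np.linalg.norm(global_update))
    return den > 0 and num / den > threshold

def sparsified_aggregate(local_updates, global_update):
    # Aggregate only the updates judged relevant.
    kept = [u for u in local_updates if is_relevant(u, global_update)]
    return np.mean(kept, axis=0) if kept else np.zeros_like(global_update)
```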
The comparison results of PDEC and PDEC-C on different datasets are shown in Fig. 7, where (a) and (b) represent dataset 1 and dataset 2, respectively. Both PDEC and PDEC-C perform better on the IID datasets than on the non-IID data. For example, a single device in PDEC needs to consume 892 MB to achieve 99.39% accuracy on the IID dataset, while it needs 969 MB to achieve 99.12% accuracy on the non-IID data; similar experimental results are observed on dataset 2. This is mainly because the data held by each device in the non-IID setting is highly heterogeneous in terms of sample types and numbers, so more rounds of co-training among the devices are needed before the model finally converges.

Comparison of PDEC and PDEC-C on different datasets
In addition, observing the training curves on dataset 1 and dataset 2, the communication overhead and model accuracy of PDEC and PDEC-C are relatively similar on IID data, while they differ greatly on non-IID data. For example, on the IID version of dataset 2, PDEC consumes 1621 MB to reach 83.98% accuracy and PDEC-C consumes 1680 MB to reach 81.6% accuracy, which are similar. On the non-IID version of dataset 2, however, PDEC consumes 2114 MB to reach 79.45% accuracy, while PDEC-C consumes 2265 MB to reach only 75.6% accuracy. This indicates that PDEC significantly outperforms PDEC-C on non-IID data. The main reason is that non-IID data produce a large number of irrelevant local updates, which seriously affect model performance. For this reason, the number of devices whose local updates are frequently judged irrelevant during training was counted. For dataset 1, a total of 6 devices on the IID version have more than 200 rounds of local updates judged irrelevant during training, while this number rises to 17 devices on the non-IID version. For dataset 2, a total of 9 devices on the IID version have more than 1000 rounds of local updates judged irrelevant, rising to 25 devices on the non-IID version. By removing these irrelevant updates, PDEC achieves lower communication overhead and higher model accuracy than PDEC-C.
This section compares the convergence speed of PDEC with the other three methods on the two datasets; the results are shown in Fig. 8. (a) CPFED has the fastest convergence, converging after 400 and 1800 rounds of training on dataset 1 and dataset 2, respectively. However, CPFED accelerates convergence by increasing the amount of local computation on the device side, so it is not applicable to edge devices with limited computational performance. Moreover, CPFED finally achieves only 98.23% and 73.45% accuracy on dataset 1 and dataset 2, respectively, lower than the 99.57% and 79.03% of PDEC and the 98.27% and 74.97% of FEDOPT, because CPFED adds differential privacy noise to the model parameters, which reduces model accuracy. (b) PDEC outperforms FEDOPT in terms of both model accuracy and convergence speed, achieving 99.75% accuracy after 500 rounds of training on dataset 1 and 79.22% on dataset 2 after 2748 rounds. This is because the variance of the compression operator used by PDEC is smaller than that of the operator used by FEDOPT, and PDEC also filters out irrelevant gradient updates that would otherwise affect model accuracy. (c) Comparing the two datasets, all methods converge more slowly on dataset 2 because the LSTM has a more complex network structure and a more uneven distribution of training data. Even so, PDEC is still able to achieve higher accuracy at a convergence rate close to that of the benchmark method PPNPC.

Comparison of convergence speed on different datasets
For federated learning-based data collaboration, numerous scholars have studied the communication efficiency and privacy protection issues involved in recent years. In this section, three representative methods, PPNPC, CPFED, and FEDOPT, are compared with PDEC in six aspects, including communication efficiency, privacy protection, and resistance to collusion attacks; the comparison results are shown in Table 3. PPNPC adopts secure multi-party computation to protect data privacy, but takes no measures to address its high communication overhead. CPFED reduces the number of required training rounds through periodic averaging, which lowers the overall communication overhead, but its secure aggregation protocol fails once some devices drop out during training, and the differential privacy it uses reduces the accuracy of the data collaboration. The working principle of FEDOPT is most similar to that of PDEC: it improves communication efficiency through compression operators and achieves privacy protection with a homomorphic encryption protocol. However, the experimental comparison shows that PDEC performs better than FEDOPT in terms of communication efficiency and convergence speed. In addition, PPNPC, CPFED, and FEDOPT cannot be applied to non-independently and identically distributed (non-IID) datasets because they do not take into account the impact of non-IID data on device model updates. PDEC, on the other hand, identifies and removes irrelevant local updates, which mitigates the adverse effects of non-IID data on model accuracy and convergence speed.
Comparison of the characteristics of several methods

| Characteristic | PPNPC | CPFED | FEDOPT | PDEC |
|---|---|---|---|---|
| Low communication cost | | + | + | + |
| Privacy protection | + | + | + | + |
| Resistance to collusion attacks | + | | + | + |
| Applicable to non-IID data | | | | + |
| Tolerates device dropout | + | | + | + |
| Fast convergence | | + | + | + |
In this paper, in order to solve the problem of data security protection in archive information systems, a deep federated learning model for archive information security is constructed by combining a deep learning model with differential privacy techniques. The results show that the computational overhead of a single user increases linearly with the number of users and the number of single-user gradients, while an increase in the drop rate reduces the computational and communication overhead of the cloud server. The proposed PDEC scheme maintains model accuracy and convergence speed while protecting the data security of the participants, thus providing technical support for the security protection of data in the archive informatization system.