Open Access

Construction of data security protection model in archive informatization system based on deep learning

  
Mar 21, 2025


Introduction

Data security refers to the set of measures that protect data from illegal access, tampering, destruction, or leakage. In modern society, data has become an indispensable resource for enterprises and individuals, so the importance of data security is increasingly emphasized. With the popularization of the Internet and the development of informatization, data security and personal information security have become topics of great concern [1-4]. In this information age, leaks of data and personal information have become commonplace, causing great trouble in people's lives and work. Data security protection has therefore become a critically important task [5-8].

As archival information flows and is disseminated, network insecurity and information system vulnerabilities open the door to hacker intrusion, virus infection, and illegal theft of information, leaving the information security of archives extremely fragile and causing great losses to national security and the public interest [9-12]. Security and confidentiality work in archives management relies mainly on information technology, computer technology, and network technology: it takes archival resources as its object, is guided by the relevant theories of archives management, and, in accordance with the requirements of the national archives administration and of the relevant information-society authorities, carries out the collection, organization, development, preservation, and use of archives [13-16]. To effectively strengthen the security and confidentiality management of data in the archive informatization system, it is necessary to improve the professional competence of management personnel, perfect the security management system, and apply informatization technology to ensure the data security of the archive informatization system [17-19].

Literature [20] discusses the definition, classification, and application of cloud storage, analyzes the challenges and requirements that cloud storage systems face in data security and privacy protection, and outlines data encryption techniques and protection methods. Literature [21] reveals the many challenges faced by document management systems and emphasizes the importance of security protection for computer-aided management information systems; it points out multiple aspects of the security of computer-aided records management information systems, including theft and destruction of computer information and computer viruses, and emphasizes different measures for different situations. Literature [22] proposes a unique security system that combines multiple encryption algorithms with steganography, developing effective methods to encrypt and decrypt data using fast and secure symmetric key algorithms designed to enhance data security. Literature [23] proposes a series of systematic protection measures against security threats such as data leakage and cyber-attacks, aiming to provide a safe and reliable data environment for universities; it notes that the computerization of archives helps safeguard the integrity and availability of information resources and thus improves the efficiency of university management. Literature [24] builds on existing research to examine the importance of enterprise archive management in the context of informatization and puts forward optimization strategies for its problems, providing theoretical and practical references for the digital transformation of enterprise archives. Literature [25] introduces the important role played by computers in document management, describes the many security problems exposed by document management systems in the computer environment, and shows that the security maintenance of computerized document management systems has become an important issue. Literature [26] discusses the advantages of applying blockchain technology to the security protection of digital archive information and provides security protection strategies for digital archives in terms of regulations, standards, and facility construction, in order to raise the intelligence level of digital archive resource management. Literature [27], starting from the importance of optimizing the information services of college archives, elucidates the difficulties that colleges and universities face in optimizing archival information services in the information age and puts forward optimization measures, with a view to providing references for the development of archive management in colleges and universities.

In this paper, we first provide an overview of the basics of deep learning and of privacy protection. Then, starting from the security protection of data in the archive informatization system, we introduce a key transformation server and apply homomorphic re-encryption in the asynchronous stochastic gradient descent process of deep model training, in order to address the possible leakage of the federated learning participants' data privacy during deep learning model training. Finally, we construct a deep federated learning model (PDEC) training technique that protects the data security of the other participants even if one federated learning participant colludes with the parameter server, providing technical support for the security protection of data in the archive informatization system.

Deep federated learning model construction based on archive information security protection
Deep Learning Basics
Deep Learning Workflow

Deep learning is a machine learning method that achieves automatic learning and prediction of complex patterns and relationships by simulating the human brain’s learning process through deep network models. The workflow of deep learning includes two main phases: training and prediction.

Training phase: the process in which the deep learning model learns patterns and relationships from a large amount of labeled data. In this phase, the model gradually improves its performance by iteratively adjusting its weights to minimize the difference between the predicted output and the true labels. Iterative training and optimization enable the model to make accurate predictions on diverse input data.

Prediction phase: the process in which the deep learning model [28] applies what it has learned to make inferences and predictions about new data. In this phase, the model no longer adjusts its weights but uses the patterns and relationships obtained from training to generate outputs directly. The powerful inference capability of deep learning models enables them to process unseen data accurately.

Neural network model architecture

A neural network mainly comprises convolutional layers, activation layers, pooling layers, and fully connected layers. They are described in detail below.

Convolutional Layer. The convolutional layer is one of the most commonly used layers in deep learning and is particularly suited to processing image data. It performs a convolution at each location by sliding a convolutional kernel over the input, which helps the neural network learn the spatial hierarchy and features of the input data. Assuming the input is represented as X, the output Y of the convolutional layer is given by: $Y[i,j,k]=\sum_{a=1}^{F_1}\sum_{b=1}^{F_2}\sum_{c=1}^{C} W[a,b,c,k]\, X[i+a-1,\, j+b-1,\, c]+b[k]$

where Y[i,j,k] is the (i,j,k)-th element of the convolutional layer's output, W is the convolutional kernel parameter, b is the bias term, $F_1$ and $F_2$ are the height and width of the convolutional kernel, respectively, and C is the number of channels of the input data.
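To make the convolution formula concrete, the following is a minimal NumPy sketch (zero-based indexing, stride 1, no padding); it is illustrative rather than an optimized implementation.

```python
import numpy as np

def conv2d(X, W, b):
    """Naive convolution implementing the formula above.

    X: input of shape (H, W_in, C); W: kernels of shape (F1, F2, C, K);
    b: bias of shape (K,). Returns Y of shape (H-F1+1, W_in-F2+1, K).
    """
    H, W_in, C = X.shape
    F1, F2, _, K = W.shape
    Y = np.zeros((H - F1 + 1, W_in - F2 + 1, K))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            for k in range(K):
                # Y[i,j,k] = sum_{a,b,c} W[a,b,c,k] * X[i+a, j+b, c] + b[k]
                Y[i, j, k] = np.sum(W[:, :, :, k] * X[i:i+F1, j:j+F2, :]) + b[k]
    return Y

X = np.random.randn(8, 8, 3)              # toy 8x8 input with 3 channels
W = np.random.randn(3, 3, 3, 4)           # four 3x3 kernels
Y = conv2d(X, W, np.zeros(4))             # output shape (6, 6, 4)
```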

Activation Layer. The activation layer introduces nonlinear transformations to increase the expressive power of the network. The activation function is usually applied to the output of each neuron, mapping the result of a linear combination into a nonlinear range. In neural networks, the activation layer typically follows a convolutional or fully connected layer and applies the activation to its output. A common activation function is the ReLU function: $\mathrm{ReLU}(x)=\max(0,x)$

In addition, the Sigmoid function [29] is also commonly used as an activation function: $\mathrm{Sigmoid}(x)=\dfrac{1}{1+e^{-x}}$

Pooling Layer. The pooling layer extracts the main features of the input data, reducing its spatial dimensions and computational complexity while retaining the principal features. Common pooling operations include max pooling and average pooling. Let the input be X; the max pooling operation can be expressed as: $\mathrm{MaxPooling}(X)_{i,j,k}=\max_{a,b}\, X_{i\times s+a,\; j\times s+b,\; k}$

where s is the stride of the pooling operation, i and j are the coordinates of the output, and k is the channel index. Max pooling selects the maximum value of the input data within the pooling window as the output.

Accordingly, average pooling can be expressed as: $\mathrm{AveragePooling}(X)_{i,j,k}=\frac{1}{s^2}\sum_{a,b} X_{i\times s+a,\; j\times s+b,\; k}$

Fully Connected Layer. The fully connected layer is usually the last layer of a neural network and maps the learned features to the final output space, such as category scores for a classification problem or predicted values for a regression problem. Let the output of the previous layer be X, the weight of the fully connected layer be W, and the bias term be b. The output Y of the fully connected layer can be expressed as: $Y=W\cdot X+b$

where · denotes matrix multiplication, W is the weight matrix, and b is the bias vector. By learning the weights and biases, the fully connected layer combines the features of the previous layer into a higher-level representation suited to the target task.
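To tie these layers together, here is a toy forward pass in NumPy; random values stand in for a convolutional layer's output, so the sketch is purely illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                      # ReLU(x) = max(0, x)

def max_pool(X, s):
    """Max pooling with an s-by-s window and stride s over (H, W, C) input."""
    H, W, C = X.shape
    out = np.zeros((H // s, W // s, C))
    for i in range(H // s):
        for j in range(W // s):
            out[i, j, :] = X[i*s:(i+1)*s, j*s:(j+1)*s, :].max(axis=(0, 1))
    return out

def fully_connected(X, W, b):
    return W @ X + b                               # Y = W·X + b

feat = relu(np.random.randn(8, 8, 4))              # stand-in for a conv output
pooled = max_pool(feat, s=2)                       # shape (4, 4, 4)
flat = pooled.reshape(-1)                          # flatten to a feature vector
W_fc = np.random.randn(10, flat.size) * 0.01
scores = fully_connected(flat, W_fc, np.zeros(10)) # 10 class scores
```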

Transformer Model Architecture

Transformer is an encoder-decoder architecture [30] in which the two parts have a similar structure, so the following description focuses on the encoder. The encoder consists of a stack of identical blocks, each containing two sub-layers: a multi-head self-attention function and a feedforward network. In addition, residual connections and layer normalization are employed around these two sub-layers. The architecture and operations within each encoder block are described below.

Self-attention function. An attention function maps a query XQ and a set of key-value pairs (XK, XV) to a weighted sum of the values, where the weights are computed from the compatibility of the query with the corresponding keys. Formally: $\mathrm{Attention}(X_Q,X_K,X_V)=\mathrm{Softmax}\!\left(X_Q X_K^{T}/\sqrt{d}\right)X_V$

where $X_Q, X_K, X_V$ are different linear projections of the input X, i.e., $X_Q = X\cdot W_Q$, $X_K = X\cdot W_K$, $X_V = X\cdot W_V$, and d is the dimension of the query vectors. Rather than performing a single attention function, existing Transformer-based models use a variant called the multi-head attention mechanism, which extends the above attention function to H parallel attention layers. Specifically: $\mathrm{MultiHeadAtten}=\mathrm{Concat}\!\left(\mathrm{Attention}(X_{Q,j},X_{K,j},X_{V,j}),\; j\in[H]\right)W_O$

where H is the number of heads and, for j ∈ [H], $X_{Q,j} = X_Q\cdot W_{Q,j}$, $X_{K,j} = X_K\cdot W_{K,j}$, and $X_{V,j} = X_V\cdot W_{V,j}$. The main intuition behind the mechanism is that multi-head attention allows the model to attend simultaneously to information from different representation subspaces and from different positions.

Feedforward network. The fully connected feedforward network consists of two linear transformations bridged by the GELU activation function, where GELU is the Gaussian Error Linear Unit. Formally: $\mathrm{FeedForward}(X)=\mathrm{GELU}(XW_1+b_1)W_2+b_2$

In addition to the encoder-decoder structure described above, the Transformer model uses an embedding layer at the beginning to transform the input $X_{input}$ into a continuous feature vector representation. This layer can be formalized as $X = X_{input}\cdot W_E$, where $W_E$ is the embedding lookup table.
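The following NumPy sketch implements the attention function and its multi-head extension with toy dimensions; it is illustrative, not an optimized implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)        # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(XQ, XK, XV):
    """Attention(XQ, XK, XV) = Softmax(XQ XK^T / sqrt(d)) XV."""
    d = XQ.shape[-1]
    return softmax(XQ @ XK.T / np.sqrt(d)) @ XV

def multi_head_attention(X, WQ, WK, WV, WO):
    """Concatenate H parallel attention heads, then project with WO."""
    heads = [attention(X @ WQ[j], X @ WK[j], X @ WV[j])
             for j in range(len(WQ))]
    return np.concatenate(heads, axis=-1) @ WO

# Toy sizes: sequence length 5, model width 16, H = 4 heads of width 4
n, d_model, H, d_head = 5, 16, 4, 4
X = np.random.randn(n, d_model)
WQ = [np.random.randn(d_model, d_head) for _ in range(H)]
WK = [np.random.randn(d_model, d_head) for _ in range(H)]
WV = [np.random.randn(d_model, d_head) for _ in range(H)]
WO = np.random.randn(H * d_head, d_model)
out = multi_head_attention(X, WQ, WK, WV, WO)      # shape (5, 16)
```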

Differential Privacy Basis

Differential privacy distinguishes itself from earlier privacy-preserving schemes in that it gives a mathematical definition of individual privacy leakage, maximizing the usability of query results while ensuring that an individual user's privacy leakage does not exceed a predefined threshold.

Definition of Norm Distance

Norms are usually used to measure the length or size of an object in a vector space [31], and they are widely used in differentially private deep learning, e.g., norm clipping of the model's weight gradients. The definition of the p-norm is given below.

Definition 1 (p-norm). Let x be an n-dimensional vector over the real field; the p-norm ║·║p of x is defined as: $\|x\|_p=\left(\sum_{i=1}^{n}|x_i|^{p}\right)^{1/p}$

where p is any positive integer.

When p is taken to be 1, 2, or ∞, the p-norm degenerates to the sum of the absolute values of the components, the Euclidean distance, and the maximum absolute value of the components of x, respectively. The p-norm has the following properties:

Positive definiteness: ║x║ ≥ 0, and ║x║ = 0 ⇔ x = 0;

Positive homogeneity: ║cx║ = |c|║x║;

Subadditivity: ║x+y║≤║x║+║y║.
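To make these definitions concrete, here is a short sketch computing p-norms and the gradient norm clipping mentioned above (the clipping threshold C is an illustrative parameter).

```python
import numpy as np

def p_norm(x, p):
    """||x||_p = (sum_i |x_i|^p)^(1/p)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def clip_by_l2_norm(grad, C):
    """Scale grad so its 2-norm is at most C (standard DP gradient clipping)."""
    return grad * min(1.0, C / p_norm(grad, 2))

g = np.array([3.0, -4.0])
print(p_norm(g, 1), p_norm(g, 2))        # 7.0 (sum of |.|), 5.0 (Euclidean)
print(clip_by_l2_norm(g, 1.0))           # rescaled so that ||g||_2 = 1
```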

In differential privacy, the 1-norm distance is commonly used to measure the difference between two databases D and D′. Before presenting the basic theory of differential privacy, the definitions of database distance and neighboring databases are introduced.

Definition 2 (Database distance). Let D be a database; its 1-norm ║D║1 is defined as: $\|D\|_1=\sum_{i=1}^{|D|}|d_i|$

where database D is regarded as a collection drawn from the full set 𝒟, D ∈ ℝ|D|, with ℝ here denoting the non-negative integers. ║D║1 describes the size of database D, i.e., the number of records it contains.

Definition 3 (Neighboring databases). For any two databases D1 and D2, D1, D2 ∈ ℝ|D|, if they satisfy: $\|D_1-D_2\|_1=1$

then D1 and D2 are said to be neighboring databases. ║D1–D2║1 denotes the distance between D1 and D2; neighboring databases differ in exactly one record.

Differential Privacy Definition

Differential privacy [32] maximizes the accuracy of data queries while limiting the leakage of private personal data, and it defends against an attacker with maximal background knowledge. It is defined below.

Definition 4 ((ε,δ)–Differential privacy). For a randomized algorithm M with domain ℝ|D|, if for any S ⊆ Range(M) and any D1, D2 satisfying ║D1–D2║1 ≤ 1: $\Pr[M(D_1)\in S]\le \exp(\varepsilon)\cdot\Pr[M(D_2)\in S]+\delta$

then the randomized algorithm M satisfies (ε,δ)–differential privacy. Here ε is a non-negative real number called the differential privacy budget, and δ ensures that for all neighboring databases D1, D2 the absolute value of the privacy loss is bounded by ε with probability (1–δ). If δ = 0, the randomized algorithm M satisfies (ε,0)–differential privacy.

The properties of the differential privacy preserving model are described below.

Property 1 (Post-processing). Let a randomized algorithm M satisfy (ε,δ)–differential privacy, M: ℝ|D|→R. If g: R→R′ is an arbitrary random mapping, then g∘M: ℝ|D|→R′ also satisfies (ε,δ)–differential privacy.

The post-processing property shows that if a randomized mechanism satisfies (ε,δ)–differential privacy, the result obtained still satisfies (ε,δ)–differential privacy regardless of any subsequent operations performed on it by the attacker.

Property 2 (Group privacy). If a randomized mechanism M satisfies (ε,δ)–differential privacy, then for groups of size k, i.e., ║D1–D2║1 ≤ k, M satisfies (kε, ke^{(k−1)ε}δ)–differential privacy:

$\Pr[M(D_1)\in S]\le \exp(k\varepsilon)\cdot\Pr[M(D_2)\in S]+k\,e^{(k-1)\varepsilon}\delta$

That is, a randomized mechanism M satisfies (kε, ke^{(k−1)ε}δ)–differential privacy for group size k; if δ = 0, then M satisfies (kε, 0)–differential privacy.

Property 3 (Sequential composition). If randomized algorithms Mi: ℝ|D|→Ri satisfy (εi,δi)–differential privacy, then the sequential composition M(D) = (M1(D), M2(D), …, Mk(D)) acting on the same dataset D satisfies $\left(\sum_{i=1}^{k}\varepsilon_i,\ \sum_{i=1}^{k}\delta_i\right)$–differential privacy.

Property 4 (Parallel composition). If randomized algorithms Mi: ℝ|D|→Ri satisfy (εi,0)–differential privacy, then for any partition of database D such that $D=\bigcup_{i=1}^{k}D_i$ and Di∩Dj = ∅ for all i, j ≤ k, i ≠ j, the parallel composition M(D) = (M1(D1), …, Mk(Dk)) satisfies (maxi{εi},0)–differential privacy.

The advantage of differential privacy is that compositions of algorithms automatically preserve privacy without any special intervention by the database administrator, so data retain a quantifiable degree of privacy protection under algorithmic composition.

Differential Privacy Mechanisms

The data processed by differential privacy techniques fall into two main categories: numerical and non-numerical queries. The elements of the differential privacy mechanisms are defined next.

Definition 5 (Sensitivity). For any two neighboring datasets D1, D2 ∈ 𝒟 and a deterministic real-valued function f: 𝒟→ℝ, the sensitivity sf of f is defined as: $s_f=\max_{D_1,D_2\in \mathcal{D}}|f(D_1)-f(D_2)|$

Definition 6 (Laplace mechanism). For queries whose results are numerical, and any function f: 𝒟→ℝn, the Laplace mechanism M is defined as the triple ML(D, f(·), ε): $M_L(D,f(\cdot),\varepsilon)=f(D)+(Z_1,\ldots,Z_n)$

where the Zi ~ Lap(sf/ε) are independently and identically distributed random variables. The scale parameter of the Laplace distribution is b = sf/ε; with ε held constant, sf is proportional to the noise scale b, and given any two of the three quantities, the third is determined.

Definition 7 (Gaussian mechanism). For any function f: 𝒟→ℝn, the Gaussian mechanism M is defined as the triple MG(D, f(·), ε): $M_G(D,f(\cdot),\varepsilon)=f(D)+(Z_1,\ldots,Z_n)$

where the $Z_i \sim N(0,\, s_f^2\sigma^2)$ are independently and identically distributed random variables.
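A minimal sketch of the two noise mechanisms, assuming a scalar counting query with sensitivity $s_f = 1$; the query value and parameters are illustrative.

```python
import numpy as np

def laplace_mechanism(f_value, sensitivity, epsilon):
    """M_L(D, f, eps) = f(D) + Lap(s_f / eps) noise per coordinate."""
    b = sensitivity / epsilon                    # scale parameter b = s_f / eps
    return f_value + np.random.laplace(0.0, b, size=np.shape(f_value))

def gaussian_mechanism(f_value, sensitivity, sigma):
    """M_G(D, f, eps) = f(D) + N(0, s_f^2 * sigma^2) noise per coordinate."""
    return f_value + np.random.normal(0.0, sensitivity * sigma,
                                      size=np.shape(f_value))

true_count = 1042.0                              # e.g. a counting query, s_f = 1
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
print(gaussian_mechanism(true_count, sensitivity=1.0, sigma=2.0))
```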

The Laplace and Gaussian mechanisms achieve differential privacy by adding artificial random noise to numerical query outputs. For non-numerical data, however, the query output is an element of the discrete set {R1, R2, …, Rn}. An exponential mechanism is therefore introduced to achieve privacy protection for non-numerical data. Instead of deterministically outputting the query result Ri, the exponential mechanism outputs it with a probability determined by a scoring function.

Definition 8 (Exponential mechanism). For non-numerical data, the exponential mechanism ME(D, q, R) outputs Ri with probability proportional to: $M_E(D,q,R)\sim \exp\!\left(\frac{\varepsilon\, q(D,R_i)}{2 s_q}\right)$

where q(D, Ri) is the scoring function, D is the specified dataset, q(D, Ri) denotes the score of a particular output Ri, and the sensitivity of the scoring function is $s_q=\max_{D_1,D_2}\|q(D_1,R_i)-q(D_2,R_i)\|_1$.
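A sketch of the exponential mechanism as defined above; the candidate set, scores, and parameters are illustrative.

```python
import numpy as np

def exponential_mechanism(candidates, scores, epsilon, sensitivity):
    """Sample R_i with probability proportional to exp(eps*q(D,R_i)/(2*s_q))."""
    weights = np.exp(epsilon * np.asarray(scores) / (2.0 * sensitivity))
    return np.random.choice(candidates, p=weights / weights.sum())

# Toy example: pick a "best" category while protecting the underlying counts
options = ["R1", "R2", "R3"]
q = [10.0, 8.0, 3.0]                      # scoring function values q(D, R_i)
print(exponential_mechanism(options, q, epsilon=1.0, sensitivity=1.0))
```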

Optimal transmission distance

Measuring the dissimilarity between two distributions is one of the core topics in machine learning. Since estimating the distance between two distributions via the Wasserstein Distance (WD) is more stable for model training than earlier methods, this paper uses a WD-based method to compute the distance between two distributions.

Definition 9 (Wasserstein distance). Let the probability distribution functions of two random variables be P(x) and Q(y), and let Π(P,Q) be the set of their joint distributions. For sample pairs (x,y) ~ γ drawn from any element γ of this set, with sample distance d(x,y), the Wasserstein distance is: $W(P,Q)=\inf_{\gamma\in\Pi(P,Q)}\int \gamma(x,y)\,d(x,y)\,\mathrm{d}x\,\mathrm{d}y=\inf_{\gamma\in\Pi(P,Q)}\mathbb{E}_{(x,y)\sim\gamma}[d(x,y)]$

where d(x,y) denotes the cost function. For any joint distribution γ, by sampling pairs (x,y) ~ γ one can compute the cost matrix d(x,y) between sample pairs and, further, the expected cost under γ, $\mathbb{E}_{(x,y)\sim\gamma}[d(x,y)]$. By optimization, one obtains the effective lower bound of this expected value over all possible joint distributions.
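For the 1-D case with cost d(x,y) = |x − y|, the Wasserstein distance between empirical samples can be computed directly, assuming SciPy is available:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Empirical 1-D Wasserstein distance between samples from P and Q,
# with d(x, y) = |x - y| as the cost function.
x = np.random.normal(0.0, 1.0, size=5000)    # samples from P
y = np.random.normal(0.5, 1.0, size=5000)    # samples from Q
print(wasserstein_distance(x, y))            # close to the mean shift 0.5
```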

Definition 10 (Lipschitz continuity). For any two metric spaces (ℝ1,d1) and (ℝ2,d2), if the function f: ℝ1→ℝ2 satisfies: $d_2(f(x),f(y))\le K\cdot d_1(x,y),\quad \forall x,y\in \mathbb{R}_1$

and K is bounded, then f is said to be Lipschitz continuous, where K is called the Lipschitz constant.

Deep Federated Learning Model for Archival Information Security
System model

This section proposes a deep federated learning modeling approach (PDEC) for efficient privacy-preserving data collaboration based on the federated learning framework, in which each learning participant (N participants are assumed here) is responsible for performing the following steps (a toy sketch of the full round structure follows the three role descriptions below):

Generate its own key pair and disclose its public key to the other members of the scheme.

Randomly select a certain number of samples from the local training dataset as the training data for this loop.

Download the encrypted weights updated in the previous round from the DSP.

Decrypt the weights ciphertext with its own private key.

Calculate the new gradient based on the data obtained in step 2) and the weights obtained in step 4).

Encrypt the newly computed gradient with PK and send the ciphertext to KTS.

The key transformation server KTS performs the following main operations:

Selects a security parameter k and randomly chooses two large prime numbers p and q such that Len(p) = Len(q) = k (Len(x) is the bit length of x).

Computes n = p·q and chooses a generator g of maximal order.

First generate the KTS’s own key pair and then negotiate the Diffie-Hellman key PK with the DSP.

The KTS receives ciphertexts from the learning participants and performs FPRE on these ciphertexts.

Send the FPRE-processed ciphertexts to the corresponding computational units CUi, (i = 1,2,…,N) of the DSP.

The data service provider DSP performs the following operations:

First generate the DSP’s own key pair and then negotiate the Diffie-Hellman key PK with the KTS.

The DSP receives ciphertexts from the KTS and performs SPRE on these ciphertexts.

Use the encrypted gradient obtained in step 2) to update the weights.

Store the updated encryption weights to the corresponding computation unit CUi,(i = 1,2,…,N) of the DSP.
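To make the division of labor concrete, here is a toy, non-cryptographic simulation of one weight-update round: encryption and re-encryption are identity stubs and all helper names are illustrative, so only the data flow between participant, KTS, and DSP matches the scheme.

```python
import numpy as np

def encrypt(v, key):  return v        # stand-in for E_PK(.)
def decrypt(c, key):  return c        # stand-in for decryption with sk_i
def fpre(c):          return c        # stand-in for first-phase re-encryption
def spre(c):          return c        # stand-in for second-phase re-encryption

N, dim, alpha = 3, 4, 0.1
weights = [np.zeros(dim) for _ in range(N)]         # W_i held by DSP's CU_i

def participant_step(i):
    w = decrypt(weights[i], key=f"sk_{i}")          # steps 3-4: download, decrypt
    grad = np.random.randn(dim)                     # step 5: gradient on a local batch
    return encrypt(-alpha * grad, key="PK")         # step 6: E_PK(alpha*G_i) -> KTS

for _ in range(5):                                  # the weight-update loop
    for i in range(N):
        ct = fpre(participant_step(i))              # KTS: FPRE, forward to CU_i
        weights[i] = weights[i] + spre(ct)          # DSP: SPRE + homomorphic add
```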

Deep federated learning model training program
Program description

Parameter generation: (k,p,q)→(g,n)

First, KTS chooses a security parameter k and randomly selects two large prime numbers p and q such that Len(p) = Len(q) = k (Len(x) is the bit length of x). Then KTS computes n = p·q and chooses a generator g of maximal order. Finally, KTS publishes the public system parameters (g, n) to all members of the system.

Key generation: (g,n)→((sk, pk),PK)

Each learning participant generates its key pair (ski, pki) (i = 1,2,…,N) as follows: $pk_i=g^{sk_i} \bmod n^2,\quad i=1,2,\ldots,N$

KTS and DSP also generate their key pairs (skKTS, pkKTS) and (skDSP, pkDSP), respectively: $pk_{KTS}=g^{sk_{KTS}} \bmod n^2=g^{a} \bmod n^2$ $pk_{DSP}=g^{sk_{DSP}} \bmod n^2=g^{b} \bmod n^2$

Then KTS and DSP negotiate the Diffie-Hellman key PK: KTS sends its public key pkKTS = g^a mod n² to DSP, and DSP sends its public key pkDSP = g^b mod n² to KTS. KTS and DSP can then each compute PK, and finally all generated public keys are published to all members of the system: $PK=pk_{DSP}^{\,sk_{KTS}}=(g^{b} \bmod n^2)^{a}=g^{ab} \bmod n^2$ $PK=pk_{KTS}^{\,sk_{DSP}}=(g^{a} \bmod n^2)^{b}=g^{ab} \bmod n^2$
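A minimal sketch of this negotiation over $\mathbb{Z}_{n^2}^{*}$, using SymPy's randprime to generate toy 64-bit primes for readability; g = 2 is an illustrative stand-in for the maximal-order generator the scheme prescribes, and none of this is cryptographically secure.

```python
import secrets
from sympy import randprime

k = 64
p = randprime(2**(k - 1), 2**k)
q = randprime(2**(k - 1), 2**k)
n = p * q
g = 2                                   # toy stand-in generator

a = secrets.randbelow(n) + 1            # sk_KTS
b = secrets.randbelow(n) + 1            # sk_DSP
pk_kts = pow(g, a, n * n)               # g^a mod n^2, sent to DSP
pk_dsp = pow(g, b, n * n)               # g^b mod n^2, sent to KTS

PK_at_kts = pow(pk_dsp, a, n * n)       # (g^b)^a = g^{ab} mod n^2
PK_at_dsp = pow(pk_kts, b, n * n)       # (g^a)^b = g^{ab} mod n^2
assert PK_at_kts == PK_at_dsp           # both sides derive the same PK
```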

Initialization

Since the scheme presented in this chapter uses model parallelism, the deep neural network is split into N parts, with one machine responsible for training one part. Here $W_i^{(j)}$ denotes the weight vector updated by machine i in the j-th round of the weight-update loop, and $G_i^{(j)}$ denotes the gradient vector generated by federated learning participant i in the j-th round, i.e., the gradient used to update $W_i^{(j)}$. The initial weight vectors are denoted $W_1^{(0)}, W_2^{(0)}, \ldots, W_N^{(0)}$. The cross weights connecting machine i and machine k (k > i) are assigned to machine k.

Data encoding

Before weights and gradient vectors are encrypted with homomorphic re-encryption, they must be encoded. A real number x ∈ ℝ can be expressed as an integer ⌊x·2^n⌋ with a precision of n bits; rounding x·2^n down is used as the encoding of the real number x.
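A minimal sketch of this fixed-point encoding; math.floor matches the rounding-down in the text, and the 32-bit precision is an illustrative choice.

```python
import math

def encode(x: float, n_bits: int = 32) -> int:
    """Encode real x as floor(x * 2^n_bits), per the scheme's data encoding."""
    return math.floor(x * (1 << n_bits))

def decode(e: int, n_bits: int = 32) -> float:
    return e / (1 << n_bits)

w = 0.123456789
e = encode(w)
print(e, decode(e))                     # recovers w up to 2^-32 precision
```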

Gradient generation and encryption

Take the first round of weight updates (j = 1) as an example. First, each federated learning participant (participant i, say) randomly selects a small batch of data from its local dataset and computes a gradient vector $G_i^{(1)}$ from it and the initial weight vector $W_i^{(0)}$. It then computes $\alpha G_i^{(1)}$ and encodes the result of the computation, and the individual elements of the weight vector $W_i^{(0)}$, as integers. Finally, each federated learning participant encrypts the elements of its encoded vector $\alpha G_i^{(1)}$ with the key PK and sends the ciphertext to the KTS. The KTS encrypts each element of the initial weight vector $W_i^{(0)}$ with PK.

First Phase Re-Encryption (FPRE). After receiving the ciphertexts $E_{PK}(\alpha G_1^{(1)}), E_{PK}(\alpha G_2^{(1)}), \ldots, E_{PK}(\alpha G_N^{(1)})$ from federated learning participants 1, 2, …, N, the KTS performs first-phase re-encryption on them. Taking the ciphertext received from participant i as an example, the first-phase re-encryption function is: $h_1=H\!\left((pk_i)^{sk_{KTS}}\right)$ $\left[E_{PK}(W_i^{(0)})\right]^{+}=\{\hat{W}_i,\hat{W}_i'\}=\{W_i,(W_i')^{sk_{KTS}}\cdot g^{h_1}\}$ $\left[E_{PK}(\alpha G_i^{(1)})\right]^{+}=\{\hat{T}_i,\hat{T}_i'\}=\{T_i,(T_i')^{sk_{KTS}}\cdot g^{h_1}\}$

Eventually the KTS sends the re-encrypted ciphertexts $[E_{PK}(W_i^{(0)})]^{+}$ and $[E_{PK}(\alpha G_i^{(1)})]^{+}$ to the DSP.

Second Phase Re-Encryption (SPRE) and Homomorphic Addition. The DSP receives $[E_{PK}(W_i^{(0)})]^{+}$ and $[E_{PK}(\alpha G_i^{(1)})]^{+}$ (1 ≤ i ≤ N) from the KTS and stores them in the corresponding computational units CUi. Each computational unit CUi performs SPRE separately; the i-th computational unit CUi, for example, computes: $h_2=H\!\left((pk_i)^{sk_{DSP}}\right)$ $E_{pk_i}(W_i^{(0)})=\{\bar{W},\bar{W}'\}=\{\hat{W},(\hat{W}')^{sk_{DSP}}\cdot g^{h_2}\}$ $E_{pk_i}(\alpha G_i^{(1)})=\{\bar{T},\bar{T}'\}=\{\hat{T},(\hat{T}')^{sk_{DSP}}\cdot g^{h_2}\}$

It is worth noting that $E_{pk_i}(W_i^{(0)})$ and $E_{pk_i}(\alpha G_i^{(1)})$ now have the additive homomorphism property, so the computational unit CUi of the DSP can update the weight variables as: $E_{pk_i}(W_i^{(1)})=E_{pk_i}(W_i^{(0)})\ast E_{pk_i}(\alpha G_i^{(1)}),\quad 1\le i\le N$

The resulting $E_{pk_i}(W_i^{(1)})$ is stored in the corresponding computational unit CUi of the DSP.

Decryption

From the second round of the weight-update process onward, each learning participant downloads the updated weight vector from the corresponding computational unit of the DSP. That is, participant i downloads $E_{pk_i}(W_i^{(j)})$ ($1\le j\le N_{wu}$, $1\le i\le N$) from CUi of the DSP and then decrypts it with its private key ski. The decryption function is: $h_1=H\!\left((pk_{KTS})^{sk_i}\right)$ $h_2=H\!\left((pk_{DSP})^{sk_i}\right)$ $W_i^{(j)}=L\!\left(\bar{T}'\cdot pk_{DSP}^{\,h_1}\cdot g^{h_2}\,/\,\bar{T}^{\,sk_i} \bmod n^2\right)$

where L(x) = (x – 1)/n. In the end, the DSP update $W_i^{(j)}$ is decrypted and can be used for gradient generation in the next round of the loop or, when training is over, for configuring the deep learning model.

The loop

When the training end condition has not yet been reached, each learning participant repeats steps 4) through 8) above to iterate the weight-update process; that is, participant i computes the gradient vector $G_i^{(j+1)}$ using another small subset of its local dataset together with $W_i^{(j)}$. At the end of training, all learning participants obtain the final weight vector $W=(W_1^{(N_{wu})}, W_2^{(N_{wu})}, \ldots, W_N^{(N_{wu})})$ by downloading and decrypting $E_{pk_i}(W_i^{(N_{wu})})$ from the CUi (1 ≤ i ≤ N) of the DSP.

Security Discussion

The scheme proposed in this paper rests mainly on the computational hardness of the DDH problem on $\mathbb{Z}_{n^2}^{*}$: for every probabilistic polynomial-time function F there exists a negligible function negl(·) such that, for sufficiently large l, the following equation holds. That is, given $g^x$ and $g^y$, the probability that an adversary can distinguish $g^z$ from $g^{xy}$ is negligible: $$\Pr\!\left[F(n,X,Y,Z_b)=b \;\middle|\; \begin{array}{l} p,q\leftarrow SP(0.5l);\ n=pq;\ g\leftarrow G;\\ x,y,z\leftarrow[1,\mathrm{ord}(G)];\\ X=g^{x}\bmod n^2;\ Y=g^{y}\bmod n^2;\\ Z_0=g^{z}\bmod n^2;\ Z_1=g^{xy}\bmod n^2;\\ b\leftarrow\{0,1\} \end{array}\right]=\mathrm{negl}(l)+0.5$$

It is assumed here that KTS and DSP do not collude, but KTS or DSP may collude with one of the federated learning [33] participants. Next, we first prove that the proposed scheme is secure in the presence of non-colluding semi-honest adversaries (ALP, AKTS, ADSP), and then analyze its security when a colluding adversary exists.

Results of archival information security protection based on deep federated learning
Data security protection in archive informatization system

This section analyzes and evaluates the performance of the proposed scheme in terms of functionality, accuracy, and resource overhead. In this experiment, a Lenovo server running Ubuntu 18.04 simulates the cloud server, with hardware configuration "RAM: 16 GB, SSD: 256 GB, CPU: 2.10 GHz"; the system is built in Java. A 5-layer fully connected neural network, consisting of one input layer, one output layer, and three hidden layers, is trained and tested on data from the dataset 1 database. The scheme uses an (N, N/2)-threshold Paillier encryption system to construct the security framework. In addition, to illustrate the superiority of this scheme, experimental comparisons are made between the KTS and DSP schemes in terms of functionality, resource overhead, and accuracy; for these two schemes, an experimental environment with the same conditions is set up, using the same neural network and dataset. In terms of resource overhead, for different numbers of users (100-500, interval 50), different numbers of single-user gradients (1000-5000, interval 500), and different user drop rates (0-0.4, interval 0.05), this paper evaluates the computational and communication overheads on the user side and the cloud server side, respectively. Notably, the single-instruction multiple-data technique of this method packs multiple compressed gradients into a single message for batch encryption, thus speeding up the encryption operation in the collusion-resistant privacy-preserving protocol.

Single-user overhead

In this section, the user-side resource overhead is analyzed around three important parameters: the total number of users participating in gradient aggregation |U| = E = 3N/4, the number of gradients of a single user |G| = L, and the drop rate R. The performance of the scheme on the user side is evaluated in terms of both computational and communication overhead for different values of these three parameters. In each case, the impact of one parameter is evaluated by varying it while the other two are held fixed.

Computational overhead

From the perspective of time complexity, the computational overhead of a single user divides into four parts: (1) generating a signature, which costs O(EL); (2) encrypting the local gradients with the public key, which costs O(L); (3) operating on the gradient ciphertext with the private key to complete decryption, which costs O(EL); and (4) user authentication, which costs O(L). The overall cost is O(EL+L).

The variation of single-user elapsed time with the number of users is shown in Fig. 1. As the number of users increases, the computational overhead of a single user grows roughly linearly. This is because, during signature generation and decryption, the parameter |U| enters the modular exponentiation as part of the exponent, so the more users are involved, the longer the computation takes. Specifically, in the authentication process the user needs to compute $x_i' = x_i^{2\Delta^2 sk_i}$ and $\tilde{x}_i = x_i^{4\Delta^2}$, and in the decryption process the user needs to compute $C_{plus_i}=C_{plus}^{2\Delta k_i}$.

Figure 1. Single-user computation time versus number of users

The variation of single-user elapsed time with the number of single-user gradients is shown in Fig. 2. The computational overhead of a single user increases linearly with the number of its gradients. The reason is that during encryption each of the user's gradients Gi is used in an exponentiation; specifically, a single user computes $C_{G_i}=\mathrm{Enc}_{pk}(G_i)=(1+n)^{G_i}\cdot r_i^{n} \bmod n^2$. The more gradients a single user has, the more exponentiations are performed. Since each user holds L gradients, the computational overhead is proportional to L.
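The ciphertext form $(1+n)^{G_i}\cdot r_i^{n} \bmod n^2$ is Paillier encryption with generator 1+n; the following toy sketch checks the additive homomorphism that the encrypted weight update relies on, using small 32-bit primes for speed (not secure) and SymPy for prime generation.

```python
import secrets
import math
from sympy import randprime

p, q = randprime(2**31, 2**32), randprime(2**31, 2**32)
n, n2 = p * q, (p * q) ** 2

def enc(m: int) -> int:
    """Enc(m) = (1+n)^m * r^n mod n^2."""
    r = secrets.randbelow(n - 1) + 1
    return pow(1 + n, m, n2) * pow(r, n, n2) % n2

lam = math.lcm(p - 1, q - 1)            # Carmichael function of n
def dec(c: int) -> int:
    """Standard Paillier decryption: m = L(c^lam mod n^2) * lam^{-1} mod n."""
    x = pow(c, lam, n2)
    return ((x - 1) // n) * pow(lam, -1, n) % n

m1, m2 = 1234, 5678
assert dec(enc(m1) * enc(m2) % n2) == m1 + m2   # ciphertext product adds plaintexts
```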

Figure 2. Single-user computation time versus number of single-user gradients

Fixing |U| = 100 and |G| = 1000, the variation of single-user elapsed time with the drop rate is shown in Fig. 3: an increase in the drop rate R does not affect the computational overhead of a single user. This is because an increase in the drop rate only changes the numbers of online and offline users, and the computational overhead of a single user depends on neither of these values, only on the number of users E.

Figure 3. Single-user computation time versus drop rate

Communication overhead

Fig. 4, Fig. 5 and Fig. 6 show the variation of single-user data transmission with the number of users, with the number of single-user gradients, and with the drop rate, respectively. When the number of users E or the drop rate R grows, the communication overhead remains constant; as the number of single-user gradients grows, however, the single-user communication overhead increases linearly. The reason is that the single-user overhead does not depend on the number of users or the drop rate, but on the number of single-user gradients: the more gradients a single user has, the more data it sends to and receives from the cloud server, and hence the higher its communication overhead.

Figure 4. Single-user data transmission versus number of users

Figure 5. Single-user data transmission versus number of single-user gradients

Figure 6. Single-user data transmission versus drop rate

Server overhead

Computational overhead

In terms of computational complexity, the computational overhead of the cloud server comprises: secure aggregation of the gradients in ciphertext, costing O(EL); decryption to obtain the final gradient sum, costing O(E); and user authentication, costing O(EL); for an overall cost of O(EL+E).

The results for the computational overhead of the cloud server are shown in Table 1. The computational overhead of the server grows linearly as the number of users increases, because more users mean more gradients to aggregate in the cloud server; in addition, the number of users enters the modular exponentiation in the decryption process as part of the exponent. The computational overhead of the server also grows linearly as the number of gradients per user increases, since the more gradients a single user has, the more gradients the cloud server must aggregate. Finally, the computational overhead of the server decreases linearly as the drop rate increases: the greater the drop rate, the fewer users are online, so the cloud server aggregates fewer gradients.

Table 1. Computational overhead of the cloud server

    |G| = 1000, R = 0.1                  |U| = 100, R = 0.1                      |U| = 100, |G| = 1000
    Users  Running time (×10^5 ms)       Gradients  Running time (×10^5 ms)      Drop rate  Running time (×10^4 ms)
    100    0.501                         1000       0.498                        0.00       5.033
    150    0.712                         1500       0.738                        0.05       4.731
    200    0.900                         2000       0.958                        0.10       4.459
    250    1.134                         2500       1.191                        0.15       4.218
    300    1.346                         3000       1.461                        0.20       4.021
    350    1.565                         3500       1.638                        0.25       3.765
    400    1.822                         4000       1.883                        0.30       3.507
    450    2.011                         4500       2.116                        0.35       3.221
    500    2.222                         5000       2.400                        0.40       2.995

Communication overhead

From the perspective of communication complexity, the communication overhead of the cloud server consists of three parts: receiving proofs of correctness from E users and sending the verification result to all users, costing O(EL); receiving the encrypted gradients $C_{G_i}$ from N/2 users and sending the encrypted aggregate to E users, costing O(EL); and receiving the aggregated ciphertext value $C_{plus_i}$ and sending the updated weights, costing O(EL). The overall cost is O(EL).

The results for the communication overhead of the cloud server are shown in Table 2. As the number of users increases, the server communication overhead grows linearly, because more users mean a larger total volume of messages sent and received by the cloud server. The communication overhead also grows linearly with the number of gradients per user, for the same reason. The communication overhead decreases linearly as the drop rate increases: the higher the drop rate, the fewer users remain, and therefore the smaller the total volume of messages sent and received by the cloud server.

Table 2. Communication overhead of the cloud server

    |G| = 1000, R = 0.1                  |U| = 100, R = 0.1                      |U| = 100, |G| = 1000
    Users  Transmitted data (×10^8 B)    Gradients  Transmitted data (×10^8 B)   Drop rate  Transmitted data (×10^7 B)
    100    0.707                         1000       0.738                        0.00       7.812
    150    1.089                         1500       1.100                        0.05       7.532
    200    1.422                         2000       1.463                        0.10       7.275
    250    1.805                         2500       1.785                        0.15       6.995
    300    2.167                         3000       2.147                        0.20       6.739
    350    2.510                         3500       2.510                        0.25       6.459
    400    2.872                         4000       2.882                        0.30       6.217
    450    3.255                         4500       3.265                        0.35       5.938
    500    3.597                         5000       3.597                        0.40       5.674

Impact of data distribution

Since the data held by the devices involved in data collaboration are usually non-independently and identically distributed (non-IID), local updates from some devices may become outliers that deviate from the global convergence trend. One innovation of this method is a gradient space sparsification strategy, which removes irrelevant local updates in the upload phase so that the model reduces unnecessary communication overhead without losing accuracy. To verify the effectiveness of this strategy on non-IID data, the following experimental setup is used: (a) first, a variant of this method, called PDEC-C, is obtained by omitting the gradient space sparsification step while keeping all other steps unchanged; (b) in addition to the datasets satisfying the non-IID distribution, IID counterparts are obtained by partitioning datasets 1 and 2 uniformly across 100 devices; (c) the communication overhead and accuracy of PDEC and PDEC-C are compared on both the IID and the non-IID datasets.
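The exact relevance test and sparsification rule used by PDEC are not spelled out in this section; the following is a generic top-k sparsification sketch with a hypothetical relevance criterion, intended only to illustrate the kind of filtering described.

```python
import numpy as np

def sparsify_top_k(grad: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude components of a local update;
    a generic top-k sketch standing in for PDEC's sparsification rule."""
    out = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    out[idx] = grad[idx]
    return out

def is_relevant(local_update, global_update, threshold=0.0):
    """Hypothetical criterion: a local update counts as relevant if it
    points roughly along the global trend (positive inner product)."""
    return float(np.dot(local_update, global_update)) > threshold

g = np.random.randn(10)
print(sparsify_top_k(g, k=3))           # only 3 nonzero entries are uploaded
```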

The comparison of convergence speed across the datasets is shown in Fig. 7, where (a) and (b) correspond to dataset 1 and dataset 2, respectively. Both PDEC and PDEC-C perform better on the IID datasets than on the non-IID data. For example, a single device in PDEC needs to consume 892 MB to reach 99.39% accuracy on the IID dataset, but 969 MB to reach 99.12% accuracy on the non-IID data; the results on dataset 2 are similar. This is mainly because, on non-IID data, the data held by each device are highly heterogeneous in the type and number of samples, so more rounds of co-training among the devices are needed before convergence.

Figure 7. Comparison of PDEC and PDEC-C on different datasets

In addition, observing the training curves on dataset 1 and dataset 2, the communication overhead and model accuracy of PDEC and PDEC-C are relatively similar on IID data but differ greatly on non-IID data. For example, on the IID version of dataset 2, PDEC consumes 1621 MB to reach 83.98% accuracy and PDEC-C consumes 1680 MB to reach 81.6% accuracy, which are similar; on the non-IID version of dataset 2, PDEC consumes 2114 MB to reach 79.45% accuracy, while PDEC-C consumes 2265 MB to reach only 75.6% accuracy. PDEC thus significantly outperforms PDEC-C on non-IID data. The main reason is that non-IID data produce a large number of irrelevant local updates, which seriously degrade model performance. To confirm this, the number of devices whose local updates are frequently deemed irrelevant during training was counted. For dataset 1, 6 devices on the IID version have local updates judged irrelevant in more than 200 training rounds, rising to 17 devices on the non-IID version. For dataset 2, 9 devices on the IID version have local updates judged irrelevant in more than 1000 rounds, rising to 25 devices on the non-IID version. By removing the irrelevant updates from these devices, PDEC achieves lower communication overhead and higher model accuracy than PDEC-C.

Performance Evaluation of Deep Federated Learning Models
Convergence speed comparison

This section compares the convergence speed of PDEC with the other three methods on two datasets; the results are shown in Fig. 8. It can be seen that: (a) CPFED has the fastest convergence, converging after 400 and 1800 rounds of training on dataset 1 and dataset 2, respectively. However, CPFED accelerates convergence by increasing the amount of local computation on the device side and is therefore unsuitable for edge devices with limited computational capability. Moreover, CPFED finally reaches 98.23% and 73.45% accuracy on dataset 1 and dataset 2, lower than the 99.57% and 79.03% of PDEC and the 98.27% and 74.97% of FEDOPT, because CPFED adds differential privacy noise to the model parameters, which reduces model accuracy. (b) PDEC outperforms FEDOPT in both model accuracy and convergence speed, reaching 99.75% accuracy after 500 rounds of training on dataset 1 and 79.22% after 2748 rounds on dataset 2. This is because the compression operator used by PDEC has a smaller variance than that used by FEDOPT, and PDEC also filters out irrelevant gradient updates that would affect model accuracy. (c) Comparing the two datasets, all these methods converge more slowly on dataset 2, because the LSTM has a more complex network structure and a more uneven distribution of training data. Even in this case, PDEC still achieves higher accuracy at a convergence rate close to that of the benchmark method PPNPC.

Figure 8. Comparison of convergence speed on different datasets

Comparison of characteristics

For federated learning-based data collaboration, many scholars have studied the associated communication efficiency and privacy protection issues in recent years. In this section, three representative methods, PPNPC, CPFED, and FEDOPT, are compared with PDEC in six respects, including communication efficiency, privacy protection, and resistance to collusion attacks; the comparison results are shown in Table 3. PPNPC adopts secure multi-party computation to protect data privacy but takes no measures against high communication overhead. CPFED reduces the number of required training rounds through periodic averaging, which lowers the overall communication overhead, but its secure aggregation protocol fails once some devices drop out during training, and the differential privacy it uses reduces the accuracy of the data collaboration. FEDOPT is the method whose working principle is most similar to PDEC's: it improves communication efficiency through a compression operator and achieves privacy protection with a homomorphic encryption protocol; however, as the experimental comparison shows, PDEC outperforms FEDOPT in communication efficiency and convergence speed. In addition, PPNPC, CPFED, and FEDOPT cannot be applied to non-independently and identically distributed datasets because they do not take into account the impact of non-IID data on the devices' model updates. PDEC, by contrast, identifies and removes irrelevant local updates, mitigating the adverse effects of non-IID data on model accuracy and convergence speed.

Table 3. Comparison of the characteristics of several methods

    Characteristic                 PPNPC   CPFED   FEDOPT   PDEC
    Low communication cost         −       +       +        +
    Privacy protection             +       +       +        +
    Collusion resistance           +       +       −        +
    Applicable to non-IID data     −       −       −        +
    Robust to device dropout       +       −       +        +
    Fast convergence               −       +       +        +
Conclusion

In this paper, to solve the problem of data security protection in the archive informatization system, a deep federated learning model based on archive information security is constructed by combining deep learning models with differential privacy technology. The results show that the computational overhead of a single user increases linearly with the number of users and with the number of single-user gradients, but an increase in the drop rate R has no effect on the computational overhead of a single user. The computational overhead of the server increases linearly with the number of users and the number of user gradients, and decreases linearly as the drop rate increases. On non-IID data, the data held by each device are highly heterogeneous in the type and number of samples, which requires more rounds of co-training among devices before final convergence. Even so, by removing irrelevant updates from the devices, the PDEC algorithm achieves lower communication overhead and higher model accuracy than the PDEC-C algorithm.
