
Research on upgrading the informationization of college archives management based on information technology



Introduction

Archive management is an important component of the operation and management of colleges and universities: as an information source, it provides strong support for teaching, research, daily administration and other activities. In the context of the digital era, colleges and universities have begun to emphasize information-based archive management, that is, building on existing archive resources and intensifying their excavation and development so as to move archive management online, facilitate efficient retrieval and utilization, and support the sound development of the archive business [1-2]. Making full use of the advantages of information technology and accelerating the informatization of archive management are demands of the times, of society and of education.

Many social organizations, enterprises and institutions still follow traditional methods in archive management, with paper as the main archive carrier [3]. Paper archives, however, are easily damaged during long-term preservation; improper storage exposes them to moths, oxidation and corrosion, compromising their integrity [4-5]. Informatization can improve this situation: converting paper archives into electronic form and storing them in a dedicated system takes up no physical space, avoids environmental damage, and allows rapid retrieval by simply entering keywords [6-9]. In addition, users logging into the system are authenticated and their browsing leaves traces, greatly enhancing the security of archival resources [10-11]. Building college archive informatization on modern network technology means continuously improving hardware and software while developing a one-stop archive data management platform to automate the transformation of data and information [12-13]. Actively pursuing informatization helps ensure the timely collection and integration of archive resources, keeps them in step with the development of the times, provides social organizations, enterprises and institutions with a strong basis for management decisions, enables high-speed sharing and interconnection of limited archive resources, and better adapts archive work to the trends of the digital era [14-15].

The article first adopts a B/S three-layer architecture model for the design of the college archive management system, divided into a presentation layer, a logic processing layer and an underlying data processing layer, and on this basis designs the system's functions and database in detail. The functions of the archive management system are divided into five major modules: teacher archive management, student archive management, student registration information updating, archive data maintenance and archive information mining, and a new binary-based dichotomous data analysis algorithm is proposed. After designing the university archive management system, the article designs a traceable digital archive sharing scheme based on blockchain, which shares digital archives securely within a controlled scope, extracts archive sharing metadata from the sharing process and stores it on the blockchain, making the sharing process more secure and traceable. A Byzantine fault-tolerant consensus algorithm based on credit partitioning is then proposed, which improves consensus efficiency by partitioning network nodes for autonomy and reducing the number of consensus nodes. The classical C4.5 algorithm is analyzed and its attribute selection calculation is improved, using Taylor's formula and the Maclaurin expansion to obtain an approximate conversion of the information gain rate formula. Data mining technology is applied to archive management: an archive information decision-making model is established to analyze and support decisions, and information technology is used to process college archive services at a deeper level, mining information resources that are valuable for the development of college archives. Finally, the C4.5 algorithm is applied to university archive management to mine the relationship between the personal information of students and faculty and the access frequency of university digital archives, so as to accurately distinguish frequent from infrequent visitors, and the proposed consensus algorithm is compared with the traditional consensus algorithm in terms of communication complexity.

Design of a university file management system based on B/S architecture
Brief description of the system design
System design fundamentals

The quality of the system design directly determines the quality of the software. The design needs to follow certain principles:

Modular design refers to dividing the system into several large modules; each module is designed independently, with no logical dependence on the others, and the modules are connected only through the database, so that the system functions as a whole. Because the modules are independent of each other, the system is more secure and reliable and its structure is more complete.

Because the system has a certain degree of abstraction, its designers have fewer factors to consider and can concentrate on the main ones, making the design simpler and more convenient.

Detailing. After the key problems of the system are solved, its detailed problems are addressed step by step.

Information localization and hiding. Localization refers to combining elements with similar structure and function so that information is kept local. Hiding refers to the fact that, because the modules are independent, information can only be used within its own module and is not shared between modules.

Independence between system modules and between system functions facilitates the operation and maintenance of the system; the design needs to achieve "high cohesion and low coupling".

System design principles

In addition to following the above design principles, the design must also take the actual situation into account, mainly in the following respects:

Practicality

Practicality means that the system should be designed around the school's real needs, so as to facilitate the management of teacher and student records, achieve a high degree of data informatization, and enhance the usefulness of the system.

Stability

Stability means designing the system so that it can run stably over a long period and adapt to unforeseen events. Since archive management is a very important school management business, the stability of the system must be considered in its design.

Security

Although the system brings great convenience to the school's archive management, it also introduces corresponding risks, the most important being data security. Student information must be kept strictly confidential to prevent data leakage from affecting the school.

Overall system design objectives

Based on the urgent demand for a networked university archive management system, the design takes stability, security and data protection into account, aiming for advancement, reliability, scalability and flexibility. The system mainly realizes the archive management of teachers and students, the management of teaching business, the management of scientific research projects, and system management.

Overall system architecture

This system adopts the MVC idea, a three-layer model that divides the system into an operation interface layer, a logic processing layer and a data access processing layer. The operation interface layer integrates the relevant operation controls through which users operate the system to achieve the desired purpose. During operation the user cares only about the function realized, not the logic behind it, so everything below the operation interface layer is hidden from the outside. The logic processing layer parses the operation commands sent from the interface layer, compiles them into the required format, calls the relevant operation logic to execute them, and forwards them to the data access processing layer; it is thus the middle part of the system, the link between the upper and lower layers, and the core of the system [16]. The data access processing layer receives commands from the logic processing layer, filters the data required by the commands, and returns the data to the logic processing layer, which processes it into a format that the operation interface layer can display directly. The overall technical architecture of the system is shown in Figure 1, and a minimal sketch of this layering follows the figure.

Figure 1.

Overall technical architecture of the system
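To make the separation between the three layers concrete, the following is a minimal Go sketch of the layering described above; the type and method names are illustrative and are not the system's actual API.

```go
// Minimal sketch of the three-layer separation described above.
// All type and method names here are illustrative, not the system's actual API.
package main

import "fmt"

// Data access layer: returns raw records for the logic layer.
type ArchiveRepository interface {
	FindByKeyword(keyword string) []string
}

type inMemoryRepository struct{ records []string }

func (r *inMemoryRepository) FindByKeyword(keyword string) []string {
	var hits []string
	for _, rec := range r.records {
		if rec == keyword { // placeholder match; a real system would query a database
			hits = append(hits, rec)
		}
	}
	return hits
}

// Logic layer: interprets requests from the interface layer and calls the repository.
type ArchiveService struct{ repo ArchiveRepository }

func (s *ArchiveService) Search(keyword string) []string {
	return s.repo.FindByKeyword(keyword)
}

// Interface (presentation) layer: only talks to the logic layer.
func main() {
	svc := &ArchiveService{repo: &inMemoryRepository{records: []string{"teacher-file-001"}}}
	fmt.Println(svc.Search("teacher-file-001"))
}
```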

System Functional Module Division

The archive management system researched and developed in this paper mainly processes and standardizes the daily work of archive management. Combined with the school's business needs, the system is divided into six major functional modules: a teacher archive management module, a student archive management module, a teaching business management module, a scientific research project management module, a system maintenance module and a data maintenance module, each independent of the others. The overall functional structure of the system is shown in Figure 2.

Figure 2.

Overall functional structure of the system

System security system design
Centralized authentication of identity

The digital certificate of the management system is used to securely authenticate the user's identity; its main function is to safeguard the security of archives during transmission and of the parties to the transmission. By using digital certificates that uniquely identify users, the system can guarantee its security.

System authentication system

Various operations by university staff in the archive management system require authentication, so new CA authentication support is needed. Middleware technology (ASP, XML, WebService, CORBA, COM components, etc.) shields the differences in the underlying security hardware and the cryptographic logic; users only need to add the required security functions to their business and perform a simple configuration to realize PKI-based security applications, which greatly reduces development cost and maximizes development efficiency [17]. The existing extranet PKI security services middleware platform has already been built and adopted by a large number of universities and colleges, and this system can be implemented on that platform.

System data encryption and access control

In order to ensure database security and prevent information leakage, this system applies MD5 hashing to key information such as ID card numbers, home addresses and other sensitive fields. To distinguish between different account roles, the system also introduces a privilege control function, which strictly separates the functional privileges of users with different roles to ensure the security and stability of the system framework.
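As an illustration of the protection described above, the following Go sketch digests a sensitive field with MD5 and checks a role's functional privilege; the field names, roles and permissions are assumed for the example only.

```go
// Sketch of the field-level protection and role check described above.
// Field and role names are illustrative; the paper does not specify them.
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// digest stores only an MD5 digest of a sensitive field such as an ID card number,
// so the plaintext never sits in the archive table (MD5 is a one-way digest).
func digest(sensitive string) string {
	sum := md5.Sum([]byte(sensitive))
	return hex.EncodeToString(sum[:])
}

// rolePermissions is a minimal privilege-control table: each role maps to the
// functions it may call.
var rolePermissions = map[string][]string{
	"archivist": {"read", "write"},
	"teacher":   {"read"},
}

func allowed(role, action string) bool {
	for _, a := range rolePermissions[role] {
		if a == action {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(digest("110101199001011234")) // store the digest, not the ID number
	fmt.Println(allowed("teacher", "write"))  // false: teachers cannot modify archives
}
```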

Technology related to upgrading the informationization of university records management
Blockchain-based traceability sharing scheme for digital archives
Introduction to Shared Architecture

In this section, we present the architecture of the blockchain-based traceable digital archive sharing scheme. The system framework is shown in Fig. 3, and the architecture consists of a user interface layer, a smart contract layer, and a storage layer.

Figure 3.

System framework

Shared program design

Extraction and protection of archive sharing metadata

Extraction of archive sharing metadata

Archive sharing metadata is a data structure extracted from the archive data sharing process, containing data description information, data operation records and security control information, which is stored in the form of transactions in the distributed ledger of the blockchain.

KeyId is the unique identifier of the archive sharing metadata, obtained by calculating the hash value of the privacy-preserving metadata topic. The calculation formula is shown in (1):
$$KeyId = HASH(ParentKeyId \,\|\, SenderId \,\|\, RecipientId \,\|\, ArchiveAttribution)$$

DigitalSignature is used to ensure the integrity and non-repudiation of the archive sharing metadata by signing it with the sender's private key. The calculation formula is shown in (2):
$$DigitalSignature = SIGN(KeyId,\, SenderPrivateKey)$$

File sharing metadata storage

The archive sharing metadata is encrypted, packaged and stored on the alliance chain through a smart contract to ensure that it is tamper-resistant and traceable. The hash of the archive sharing metadata (the archive sender Id, archive receiver Id, archive attribute data, etc.) yields the unique identifier keyId of the current data block; keyId is then signed with the archive sender's private key (senderPrivateKey) to generate a digital signature, and keyId is sent to the archive receiver, who uses it to query the shared metadata on the alliance chain. Finally, the uplink method of the alliance chain is called to put the generated archive sharing metadata on the chain.
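The following Go sketch illustrates formulas (1) and (2) under the assumption that HASH is SHA-256 and SIGN is ECDSA; the paper does not name the concrete primitives, and all identifiers used are illustrative.

```go
// Sketch of the metadata identifier and signature described above, assuming SHA-256
// for HASH and ECDSA for SIGN; the paper does not name the concrete primitives.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"strings"
)

// keyID implements KeyId = HASH(ParentKeyId || SenderId || RecipientId || ArchiveAttribution).
func keyID(parentKeyID, senderID, recipientID, archiveAttribution string) []byte {
	sum := sha256.Sum256([]byte(strings.Join(
		[]string{parentKeyID, senderID, recipientID, archiveAttribution}, "|")))
	return sum[:]
}

func main() {
	priv, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader) // sender's key pair
	if err != nil {
		panic(err)
	}
	id := keyID("root", "archive-A", "archive-B", "student-records")

	// DigitalSignature = SIGN(KeyId, SenderPrivateKey): the receiver can verify it
	// with the sender's public key before trusting the shared metadata.
	sig, err := ecdsa.SignASN1(rand.Reader, priv, id)
	if err != nil {
		panic(err)
	}
	fmt.Printf("keyId=%x valid=%v\n", id, ecdsa.VerifyASN1(&priv.PublicKey, id, sig))
}
```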

Traceable digital archive sharing mechanism across libraries

Identity registration and authentication: each digital archive joining the alliance chain must register its identity and generate a digital identity, which is the basis for operating in the system. The archive requiring registration first generates a public/private key pair locally and keeps the private key locally. The public key is then sent to all alliance members through the network; on receiving the request, the members execute the identity registration smart contract to initiate a registration vote, and when more than half of the nodes agree, registration succeeds and the public key is added to the local alliance chain configuration list. The authentication module includes user identity authentication and system authority authentication based on the alliance chain.
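A minimal Go sketch of the registration vote described above is given below; the vote inputs are simulated here, whereas in the scheme they would come from the identity registration smart contract.

```go
// Sketch of the registration vote described above: a new archive's public key is
// admitted once more than half of the alliance members approve.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"fmt"
)

// registerIdentity returns true when strictly more than half of the members vote yes.
func registerIdentity(votes []bool) bool {
	approvals := 0
	for _, v := range votes {
		if v {
			approvals++
		}
	}
	return approvals*2 > len(votes)
}

func main() {
	// The joining archive generates its key pair locally and keeps the private key.
	priv, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	pub := priv.PublicKey

	votes := []bool{true, true, true, false, true} // 4 of 5 members approve (simulated)
	if registerIdentity(votes) {
		// On success the public key is appended to the local alliance-chain config list.
		fmt.Printf("registered public key: (%v, %v)\n", pub.X, pub.Y)
	}
}
```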

Data transmission: the data transmission module consists of archive sending and archive receiving. A user joining an alliance chain node holds the network addresses and ports of the other users on the chain. To share data archives, the sharing requester first initiates a sharing request through the network; the requested user then determines whether to perform a sending or receiving operation by analyzing the specific request, and identifies the specific archive data to be shared from the request. The archive sharing process is shown in Figure 4.

Figure 4.

File sharing process

Archive leakage tracking

Storing the archive sharing metadata on the blockchain ensures that data sharing is transparent and tamper-proof, which solves the traceability problem of the archive sharing process. The relationship between the archive sharing metadata and the blockchain ledger transaction record structure is shown in Figure 5.

Figure 5.

The relationship between the metadata and the record structure of the account

Optimized Consensus Algorithm for Shared Coalition Chains

Practical Byzantine Fault Tolerant Consensus Algorithm

The PBFT consensus algorithm provides a high level of security. It proves that, as long as the total number of nodes N in the network satisfies N ≥ 3f + 1, the eventual consistency of the system can be guaranteed even when there are f malicious nodes. To keep the nodes of the blockchain network in agreement, PBFT implements a consistency protocol, a view change protocol and a checkpoint protocol. The consistency protocol ensures that all nodes save correct and consistent messages through peer-to-peer communication and reply confirmation among system nodes, and it is the key process for reaching consensus in PBFT.
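The fault-tolerance condition can be illustrated with a short Go sketch that, for a given network size N, reports the largest tolerable number of Byzantine nodes f under N ≥ 3f + 1 and the 2f + 1 quorum commonly used in PBFT implementations.

```go
// Sketch of the N >= 3f + 1 condition: for a given network size it reports how many
// Byzantine nodes the consistency protocol can tolerate and the 2f + 1 quorum size
// commonly used in PBFT implementations.
package main

import "fmt"

func tolerableFaults(n int) int { return (n - 1) / 3 } // largest f with n >= 3f + 1

func main() {
	for _, n := range []int{4, 10, 100} {
		f := tolerableFaults(n)
		fmt.Printf("N=%d tolerates f=%d faulty nodes, quorum=%d\n", n, f, 2*f+1)
	}
}
```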

CP-PBFT improvement program

Node credit evaluation model

The PBFT consensus algorithm lacks a reward and punishment mechanism for nodes, so nodes lack motivation. In CP-PBFT, a node's ability to reach consensus is judged by calculating its credit value: nodes with high credit values are more likely to act as the master node or as consensus nodes, while nodes with low credit values can only act as listener nodes. Because the indexes below differ in scale and magnitude, the original indexes are processed as follows to transform them into comparable standard values:

Latency index, representing the response rate of a node to various types of requests during the consensus process. The delay index $\sigma_i$ is denoted as:
$$\sigma_i = 1 - \left(\frac{d_i}{d_{max}}\right)^3, \quad \sigma_i \in (0,1)$$

where $d_{max}$ denotes the maximum delay allowed by the system; a communication delay exceeding this value will cause the task node to go down.

Completion index, referring to the success rate with which a node completes message consensus during the consensus period. The completion index $\tau_i$ is denoted as:
$$\tau_i = \frac{n_i}{v_i}, \quad \tau_i \in (0,1]$$

where $v_i$ represents the total number of times the node participates in consensus. The more times a node participates in consensus, the more credible it is and the higher its completion degree will be, but the more often it submits incorrect data, the lower its completion degree.

The node activity index refers to the ratio of a node's online participation during the consensus process. The activity index $\mu_i$ is expressed as
$$\mu_i = e^{\,t_i - w_i}, \quad \mu_i \in (0,1)$$

From this index, the longer a node is online the greater its activity index; conversely, the longer it is offline the lower the index, and more active nodes are preferred when selecting consensus nodes.

The credit value is calculated from the node credit evaluation indexes as shown in equation (6):
$$R_i = \omega_1 \sigma_i + \omega_2 \tau_i + \omega_3 \mu_i$$
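The following Go sketch evaluates equation (6) using the index forms given above (delay, completion and activity indexes); the weights and input values are illustrative assumptions, not values from the paper.

```go
// Sketch of the node credit evaluation, using the index forms above:
// delay index 1 - (d_i/d_max)^3, completion index n_i/v_i, activity index e^(t_i - w_i).
// The weights are illustrative; they only need to sum to 1.
package main

import (
	"fmt"
	"math"
)

type nodeStats struct {
	delay, maxDelay       float64 // d_i and d_max, e.g. in milliseconds
	completed, attempted  float64 // n_i successful consensus rounds out of v_i attempts
	onlineTime, totalTime float64 // t_i online time within observation window w_i
}

func credit(s nodeStats, w1, w2, w3 float64) float64 {
	sigma := 1 - math.Pow(s.delay/s.maxDelay, 3)
	tau := s.completed / s.attempted
	mu := math.Exp(s.onlineTime - s.totalTime)
	return w1*sigma + w2*tau + w3*mu
}

func main() {
	s := nodeStats{delay: 40, maxDelay: 200, completed: 95, attempted: 100, onlineTime: 0.9, totalTime: 1.0}
	fmt.Printf("credit value R_i = %.3f\n", credit(s, 0.4, 0.3, 0.3))
}
```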

Node Partitioning Mechanism

In the CP-PBFT improved consensus algorithm, let the total number of nodes be N and the maximum number of Byzantine nodes the system can withstand be f; N ≥ 3f + 1 is required for the system to reach consensus correctly. This paper introduces the Sturges empirical formula as the network grouping method. The Sturges formula is widely used in statistics to control the number of groups: the number of groups should be large enough to reflect the distribution of the data but not so large that the distribution becomes unclear, so grouping by credit value with this formula adequately reflects the overall credit situation. Let the number of partitions be k, with $k = \mathrm{Floor}(1 + 3.3 \log_{10} N)$, and let $p_i \in \{p_1, p_2, p_3, \ldots, p_k\}$ denote the node partitions in the blockchain network. Nodes are distributed evenly according to their credit values so that the credit situation of every partition group remains consistent, preventing any partition from failing to reach consensus because it contains too many Byzantine nodes. Let the sequence of node credit values be $r_i \in \{r_1, r_2, r_3, \ldots, r_N\}$, with credit values increasing from $r_1$ to $r_N$; the S-shaped credit value filling method then ensures that the credit distribution of each partition is similar.

After the network nodes are partitioned, the average number of nodes in each partition, $num_{p_i}$, is obtained by rounding $N / \mathrm{Floor}(1 + 3.3 \log_{10} N)$. In this paper the partitioning is carried out according to credit value: each credit value corresponds to one node, and a credit value belonging to a partition means the corresponding node belongs to that partition.
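A minimal Go sketch of the partitioning step is given below: it computes k with the Sturges formula, sorts nodes by credit value, and deals them into partitions in an S-shaped (serpentine) order; the credit values are invented for the example.

```go
// Sketch of the partitioning step: k = Floor(1 + 3.3*log10(N)) groups, with nodes
// sorted by credit value and dealt out in an S-shaped (serpentine) order so that
// every partition holds a similar spread of credit values.
package main

import (
	"fmt"
	"math"
	"sort"
)

func partitionByCredit(credits []float64) [][]float64 {
	n := len(credits)
	k := int(math.Floor(1 + 3.3*math.Log10(float64(n))))
	sort.Float64s(credits) // ascending: r_1 <= ... <= r_N

	groups := make([][]float64, k)
	for i, c := range credits {
		round, pos := i/k, i%k
		if round%2 == 1 { // reverse direction on every other pass: the "S" shape
			pos = k - 1 - pos
		}
		groups[pos] = append(groups[pos], c)
	}
	return groups
}

func main() {
	credits := []float64{0.91, 0.15, 0.42, 0.77, 0.63, 0.05, 0.88, 0.33, 0.58, 0.71}
	for i, g := range partitionByCredit(credits) {
		fmt.Printf("partition %d: %v\n", i+1, g)
	}
}
```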

Improved C4.5 classification algorithm
The classical C4.5 algorithm

The C4.5 algorithm is a mature, mainstream decision tree algorithm characterized by fast classification speed and high classification accuracy. It is an improvement on the ID3 algorithm: compared with ID3, it adds the handling of continuous attributes and of missing attribute values and greatly improves efficiency. The idea of C4.5 is as follows: let S be the training sample set; to build a decision tree for S, the attribute with the largest GainRatio(x) is selected to split the node, and by this criterion S is divided into n subsets. If the tuples contained in the ith subset Si all belong to the same category, this node becomes a leaf of the decision tree and splitting stops. For those Si that do not satisfy this condition, the tree is generated recursively by the same method until all tuples in all subsets belong to the same category [18]. It is based on the following principles:

Definition 1 Category information entropy: let the training set S contain s samples divided into m classes, let the number of instances in the ith class be si, and let si/s be the probability pi. Info(S) is the category information entropy, calculated as:
$$\mathrm{Info}(S) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Definition 2 Conditional information entropy: if attribute A is chosen to divide the training set S, and S is divided into k subsets $\{S_1, S_2, \ldots, S_k\}$ corresponding to the k different values $\{a_1, a_2, \ldots, a_k\}$ of attribute A, let the number of training instances belonging to the ith class in subset Sj be sij. Then InfoA(S), the conditional information entropy of attribute A, i.e. the information entropy of the subsets obtained by dividing S with A, is given by Eq. (8):
$$\mathrm{Info}_A(S) = \sum_{j=1}^{k} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s} \times \mathrm{Info}(S_j)$$

where $\mathrm{Info}(S_j) = -\sum_{i=1}^{m} p_{ij}\log_2(p_{ij})$ and $p_{ij} = \frac{s_{ij}}{s_j}$ is the sample probability of class i in Sj.

Definition 3: the information gain for classifying by attribute A is calculated as:
$$\mathrm{Gain}(A, S) = \mathrm{Info}(S) - \mathrm{Info}_A(S)$$

Definition 4 Split information entropy: let attribute A have k different values, so that the sample set S can be divided into k subsets using attribute A, where subset Sj contains the samples of S whose value on attribute A is aj. If the samples are partitioned by the value of attribute A, Info(A) is the split information entropy of attribute A, as in equation (10):
$$\mathrm{Info}(A) = -\sum_{j=1}^{k} p_j \log_2(p_j)$$

Definition 5: the information gain rate for attribute A is:
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A, S)}{\mathrm{Info}(A)}$$
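Definitions 1 to 5 can be illustrated with a short Go sketch that computes Info(S), InfoA(S), Gain(A, S) and GainRatio(A) for a toy two-class split; the counts are invented for the example.

```go
// Sketch of Definitions 1-5: entropy, conditional entropy, gain and gain ratio,
// computed on a toy two-class data set split by one attribute.
package main

import (
	"fmt"
	"math"
)

// entropy computes Info(S) = -sum p_i * log2(p_i) from class counts.
func entropy(counts []float64) float64 {
	total, h := 0.0, 0.0
	for _, c := range counts {
		total += c
	}
	for _, c := range counts {
		if c > 0 {
			p := c / total
			h -= p * math.Log2(p)
		}
	}
	return h
}

// gainRatio computes GainRatio(A) for a split of S into subsets, where each subset
// is given as its per-class counts.
func gainRatio(subsets [][]float64) float64 {
	total := 0.0
	classTotals := make([]float64, len(subsets[0]))
	for _, sub := range subsets {
		for i, c := range sub {
			classTotals[i] += c
			total += c
		}
	}
	infoS := entropy(classTotals)

	infoA, splitInfo := 0.0, 0.0
	for _, sub := range subsets {
		size := 0.0
		for _, c := range sub {
			size += c
		}
		infoA += size / total * entropy(sub)          // conditional entropy Info_A(S)
		splitInfo -= size / total * math.Log2(size/total) // split entropy Info(A)
	}
	return (infoS - infoA) / splitInfo
}

func main() {
	// Attribute A splits S into two subsets with class counts {yes, no}.
	fmt.Printf("GainRatio(A) = %.4f\n", gainRatio([][]float64{{3, 1}, {1, 3}}))
}
```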

Improved calculation of information gain rate

Theoretical foundation

The Taylor series represents a function as a series, that is, a sum of infinitely many terms computed from the derivatives of the function at a single point. Taylor's mean value theorem is stated as follows:

If a function f(x) has derivatives up to order (n + 1) in an open interval (a, b) containing x0, then for any x ∈ (a, b):
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n + \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - x_0)^{n+1}$$

The ξ in the above equation refers to a value between x0 and x.

In Taylor's formula, let x0 = 0; then ξ lies between 0 and x. Let ξ = θx (0 < θ < 1), and Taylor's formula takes the simple form often referred to as the Maclaurin formula with Lagrange remainder:
$$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \cdots + \frac{f^{(n)}(0)}{n!}x^n + \frac{f^{(n+1)}(\theta x)}{(n+1)!}x^{n+1}, \quad 0 < \theta < 1$$

Removing the Lagrange remainder term gives the approximate formula:
$$f(x) \approx f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \cdots + \frac{f^{(n)}(0)}{n!}x^n$$

Improvement Ideas

From the basic principle of the C4.5 algorithm, attribute selection during decision tree generation is based on information theory. Because the information gain rate formula involves multiple logarithmic operations, the library function must be called several times during calculation, which greatly increases computation time. To address this problem, an improvement of the information gain rate formula is proposed: Taylor's formula and the Maclaurin formula are used to simplify the calculation of the information gain rate in C4.5, which reduces the computational complexity of the algorithm to a large extent. The improved algorithm is named the TAM-C4.5 algorithm.

Since the derivative of ln(x) is not defined at x0 = 0, and the probabilities appearing in the information gain rate formula take values in [0, 1], this paper uses the Maclaurin expansion of ln(x + 1) to improve the information gain rate formula of traditional C4.5, as in Eq. (15):
$$\ln(x + 1) \approx x - \frac{1}{2}x^2 + \frac{1}{3}x^3 - \cdots + (-1)^{n-1}\frac{1}{n}x^n$$

So that:
$$\ln(x) \approx (x - 1) - \frac{1}{2}(x - 1)^2 + \frac{1}{3}(x - 1)^3 - \cdots + (-1)^{n-1}\frac{1}{n}(x - 1)^n$$

When x ∈ (0, 1):
$$\ln(x) \approx (x - 1) - \frac{1}{2}(x - 1)^2 + \frac{1}{3}(x - 1)^3$$

Through this approximate simplification, logarithmic operations are converted into non-logarithmic operations, eliminating the complex logarithms in the information gain rate formula and thus simplifying the calculation and improving tree-building efficiency. Moreover, this formula is more accurate than the equivalent infinitesimal simplification ln(1 + x) ≈ x.

The transformation of the category information entropy is shown in (18):
$$\begin{aligned} \mathrm{Info}(S) &= -\sum_{i=1}^{m} p_i \log_2(p_i) = -\sum_{i=1}^{m} \frac{s_i}{s}\log_2\!\left(\frac{s_i}{s}\right) = -\frac{1}{\ln 2 \cdot s}\sum_{i=1}^{m} s_i \ln\!\left(\frac{s_i}{s}\right) \\ &\approx -\frac{1}{\ln 2 \cdot s}\sum_{i=1}^{m} s_i\!\left[\left(\frac{s_i}{s}-1\right) - \frac{1}{2}\left(\frac{s_i}{s}-1\right)^2 + \frac{1}{3}\left(\frac{s_i}{s}-1\right)^3\right] \\ &= -\frac{1}{\ln 2 \cdot s}\sum_{i=1}^{m} \frac{s_i(s_i - s)\left(11s^2 + 2s_i^2 - 7s_i s\right)}{6s^3} \end{aligned}$$

Similarly, the transformations of the conditional information entropy and the split information entropy are shown in (19) and (20):
$$\mathrm{Info}_A(S) \approx -\frac{1}{\ln 2 \cdot s}\sum_{j=1}^{k}\sum_{i=1}^{m} \frac{s_{ij}(s_{ij} - s_j)\left(11s_j^2 + 2s_{ij}^2 - 7s_{ij} s_j\right)}{6s_j^3}$$
$$\mathrm{Info}(A) \approx -\frac{1}{\ln 2 \cdot s}\sum_{j=1}^{k} \frac{s_j(s_j - s)\left(11s^2 + 2s_j^2 - 7s_j s\right)}{6s^3}$$

Therefore, the transformed information gain rate is shown in (21):
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Info}(S) - \mathrm{Info}_A(S)}{\mathrm{Info}(A)}$$
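The following Go sketch contrasts the exact Info(S) with the log-free approximation used by TAM-C4.5, based on the three-term expansion of ln(x) above; the class counts are illustrative.

```go
// Sketch of the TAM-C4.5 simplification: Info(S) is evaluated without calling the
// logarithm, using the three-term expansion ln(x) ≈ (x-1) - (x-1)^2/2 + (x-1)^3/3,
// and compared against the exact value.
package main

import (
	"fmt"
	"math"
)

// approxInfo computes Info(S) from class counts s_i with total s, replacing ln(s_i/s)
// by its three-term expansion, so only arithmetic operations remain.
func approxInfo(counts []float64) float64 {
	s := 0.0
	for _, c := range counts {
		s += c
	}
	sum := 0.0
	for _, si := range counts {
		u := si/s - 1 // x - 1 with x = s_i/s in (0, 1]
		sum += si * (u - u*u/2 + u*u*u/3)
	}
	return -sum / (math.Ln2 * s)
}

// exactInfo computes Info(S) = -sum p_i * log2(p_i) directly.
func exactInfo(counts []float64) float64 {
	s, h := 0.0, 0.0
	for _, c := range counts {
		s += c
	}
	for _, si := range counts {
		p := si / s
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	counts := []float64{6, 3, 1} // three classes
	fmt.Printf("exact=%.4f approx=%.4f\n", exactInfo(counts), approxInfo(counts))
}
```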

Classification of archival information based on the decision tree C 4.5 algorithm
Categorize archival information

The archival information is divided into 12 major categories according to the archival classification standards and presented as primary, secondary and tertiary categories; the classification is shown in Table 1.

Archives information classification

Serial number Primary class Secondary class Tertiary class
1 DQ Party group Year
2 XZ Administration Year
3 JX Teaching Year
4 KY Scientific research Year
5 JJ Infrastructure Year
6 SB Equipment Year
7 CB Publish Year
8 WS Foreign affairs Year
9 CK Finance council Year
10 SX Acoustic image Year
11 YJ performance Year
12 CP Product production research development Year
Grouping of archival information

For example, a university has a total of 34 departments, of which 17 are administrative departments, 14 are teaching departments and 3 are teaching-support departments. Considering this situation and the different division of labor in each department, three meaningful attribute groups are selected: the administrative department group, the teaching department group and the teaching-support department group. The departmental grouping information is shown in Table 2.

Department information

Representative department Related department 1 Related department 2 Related department 3 Related department 4
Administration department Dean’s office Personnel department Finance office Asset facilities
Teaching department Provost Tourist school Foreign language institute Institute of financial management
Auxiliary department group Library Education technology center Experimental center
Data processing on the level of confidentiality of archival information

First, organize the raw data. After the raw data have been collected, they need to be elaborated more specifically to produce the corresponding answers. The results are then stored together with the relevant archival text, picture and video information of the various departments in the system, and this information is combined to determine the corresponding data and generate charts.

Second, data pre-processing. The data need to be processed in advance to produce usable data: impurities in the data are handled so that they disappear or become negligible, since their presence would affect the whole decision tree, making it less accurate and introducing erroneous information. For example, if a department has no archival information for some reason, the corresponding entries in the data set can be initialized as none; individual missing values are filled in manually.

Third, data conversion. In general, the decision tree algorithm works on discrete values, so the data need to be discretized. For departmental archive information, the confidentiality of the archive information is divided into three levels, top secret, confidential and secret, where the top secret attribute is set as A, the confidential attribute as B and the secret attribute as C. The discretization of the department groups is shown in Table 3.

Department group discretization

Attribute setting Attribute value Meaning
File information A Consultation limited to school leaders
File information B Consultation limited to school leaders and middle managers
File information C May be consulted by anyone on campus
Experiments and analysis of results
Application of C4.5 Algorithm in College Records Management
Problem definition and data preprocessing

In this paper, the C4.5 algorithm is used to classify the visitors of college archives accurately according to whether their access frequency is frequent. Whether a visitor's access frequency is frequent should be assessed in combination with each customer's specific behavior and actual situation, including specialty, age and so on. Based on each customer's identity (academic background), gender, age and specialty, a C4.5 decision tree classification model is established to evaluate the frequency of visits to the college archives. In essence, the C4.5 algorithm is used to mine the data for classification regularities, that is, to obtain the relationship between visitors' personal information and their visit frequency and to derive classification rules, yielding an intelligent model for evaluating the frequency of customers' visits to college archives.

Taking the information of 2023 customers visiting the archives of a certain college as an example, a data table is established with the fields customer number, identity, gender, age, major and access frequency. Data conversion, integration and related techniques are used to remove noise and other irrelevant information and to transform the data types and values of the data sources into a unified standard format. For clients who are both teachers and postgraduate students, the identity is recorded as teacher. The visit frequency attribute is treated as follows: ① clients who visit more than 3 times a week are designated frequent visitors; ② the remaining clients are designated infrequent visitors. After preprocessing, the converted customer information table is obtained; the personal information feature set after data conversion is shown in Table 4, and a minimal preprocessing sketch follows the table.

Customer personal information feature set after data conversion

Customer Identity Gender Age Major Frequent visits
1 Teacher Female ⩾40 Mathematics N
2 Undergraduate Male <20 Chemistry Y
3 Graduate student Female 20~40 Physics Y
4 Teacher Female ⩾40 Chemistry N
5 Graduate student Male ⩾40 Physics Y
6 Undergraduate Male <20 Physics N
7 Graduate student Female 20~40 Mathematics Y
8 Teacher Female 20~40 Mathematics N
9 Graduate student Male <20 Chemistry Y
10 Graduate student Female ⩾40 Physics N
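The preprocessing rules above (dual-identity clients recorded as teachers; more than 3 visits per week labelled frequent) can be sketched in Go as follows; the field names are assumptions made for the example.

```go
// Sketch of the preprocessing rules above: dual-identity clients are recorded as
// teachers, and clients with more than 3 visits per week are labelled frequent.
// Field names are illustrative.
package main

import "fmt"

type rawRecord struct {
	isTeacher, isGraduate bool
	visitsPerWeek         int
}

type sample struct {
	identity string
	frequent string // "Y" or "N", the target attribute
}

func preprocess(r rawRecord) sample {
	identity := "undergraduate"
	switch {
	case r.isTeacher:
		identity = "teacher" // a teacher who is also a postgraduate counts as a teacher
	case r.isGraduate:
		identity = "graduate student"
	}
	frequent := "N"
	if r.visitsPerWeek > 3 {
		frequent = "Y"
	}
	return sample{identity: identity, frequent: frequent}
}

func main() {
	fmt.Println(preprocess(rawRecord{isTeacher: true, isGraduate: true, visitsPerWeek: 5}))
	fmt.Println(preprocess(rawRecord{isGraduate: true, visitsPerWeek: 2}))
}
```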
Constructing decision trees

After the data are preprocessed, the attributes identity, gender, age and specialty are taken as the object attributes of the C4.5 algorithm and the attribute "whether visits are frequent" as the target attribute. The attributes are ranked by information gain rate, the attribute with the highest gain rate is selected as the test attribute, and a root node is constructed; using this attribute as a marker, a branch is built for each of its values and the tree is constructed recursively for each branch. Concretely, the attribute "identity" is chosen as the root node of the decision tree because it has the largest information gain rate; for each branch the above steps are repeated to generate the subtree. The final decision tree is shown in Figure 6, and a sketch of the recursive construction follows the figure.

Figure 6.

The resulting decision tree
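The recursive construction described above can be sketched in Go as follows: at each node the attribute with the highest gain ratio is selected, one branch is built per attribute value, and recursion stops on pure subsets. The data are a small invented excerpt, not the paper's data set.

```go
// Sketch of the recursive C4.5 construction step: pick the attribute with the highest
// gain ratio, branch on each of its values, and recurse until a subset is pure.
package main

import (
	"fmt"
	"math"
)

type row struct {
	attrs map[string]string
	label string
}

func entropy(rows []row) float64 {
	counts := map[string]float64{}
	for _, r := range rows {
		counts[r.label]++
	}
	h := 0.0
	for _, c := range counts {
		p := c / float64(len(rows))
		h -= p * math.Log2(p)
	}
	return h
}

func gainRatio(rows []row, attr string) float64 {
	groups := map[string][]row{}
	for _, r := range rows {
		groups[r.attrs[attr]] = append(groups[r.attrs[attr]], r)
	}
	infoA, split := 0.0, 0.0
	for _, g := range groups {
		p := float64(len(g)) / float64(len(rows))
		infoA += p * entropy(g)
		split -= p * math.Log2(p)
	}
	if split == 0 {
		return 0
	}
	return (entropy(rows) - infoA) / split
}

func build(rows []row, attrs []string, indent string) {
	if entropy(rows) == 0 || len(attrs) == 0 { // leaf: pure subset or no attributes left
		fmt.Println(indent+"-> leaf:", rows[0].label)
		return
	}
	best, bestGR := attrs[0], -1.0
	for _, a := range attrs {
		if gr := gainRatio(rows, a); gr > bestGR {
			best, bestGR = a, gr
		}
	}
	rest := []string{}
	for _, a := range attrs {
		if a != best {
			rest = append(rest, a)
		}
	}
	groups := map[string][]row{}
	for _, r := range rows {
		groups[r.attrs[best]] = append(groups[r.attrs[best]], r)
	}
	for v, g := range groups {
		fmt.Printf("%s%s = %s\n", indent, best, v)
		build(g, rest, indent+"  ")
	}
}

func main() {
	data := []row{
		{map[string]string{"identity": "teacher", "gender": "female"}, "N"},
		{map[string]string{"identity": "graduate", "gender": "female"}, "Y"},
		{map[string]string{"identity": "graduate", "gender": "male"}, "Y"},
		{map[string]string{"identity": "teacher", "gender": "male"}, "N"},
	}
	build(data, []string{"identity", "gender"}, "")
}
```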

Extraction of classification rules

IF identity = "undergraduate" AND age < 20 AND major = "chemistry" THEN visits are frequent.

IF identity = "graduate student" AND age < 40 THEN visits are frequent.

IF identity = "graduate student" AND age >= 40 AND major = "physics" AND gender = "male" THEN visits are frequent.

IF identity = "teacher" THEN visits are infrequent.

IF identity = "undergraduate" AND age < 20 AND major = "physics" THEN visits are infrequent.

IF identity = "graduate student" AND age >= 40 AND major = "physics" AND gender = "female" THEN visits are infrequent.
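For reference, the six extracted rules can be restated as a single Go classification function; this is only a restatement of the rules above, not output produced by the algorithm.

```go
// The six extracted rules above, written as one classification function.
package main

import "fmt"

type customer struct {
	identity, major, gender string
	age                     int
}

// frequent returns true when the rules above classify the customer as a frequent visitor.
func frequent(c customer) bool {
	switch c.identity {
	case "teacher":
		return false
	case "undergraduate":
		return c.age < 20 && c.major == "chemistry"
	case "graduate student":
		if c.age < 40 {
			return true
		}
		return c.major == "physics" && c.gender == "male"
	}
	return false
}

func main() {
	fmt.Println(frequent(customer{identity: "graduate student", major: "physics", gender: "male", age: 45})) // true
	fmt.Println(frequent(customer{identity: "teacher", major: "mathematics", gender: "female", age: 42}))    // false
}
```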

From the above rules, the degree of influence of each item of customer information on visit frequency is, in order, "identity", "age", "specialty" and "gender". According to these rules, university archives can classify their clients into frequent and infrequent visitors. In addition, timely information content can be provided according to the customer's personal information, which not only maintains frequent customers' access to the college archives but also attracts infrequent customers to visit, thereby improving the management of college archives and providing customers with better personalized services.

File Encryption Performance Analysis
Communication complexity analysis of the PAPBFT consensus algorithm

Taking the number of nodes n as 100, 200, 400, 500, 800 and 1000, the number of communications of the PBFT consensus algorithm is shown in Table 5. Taking the number of groups of the PAPBFT consensus algorithm as 5, 10, 15, 20 and 25, the number of PAPBFT consensus communications is shown in Table 6. As Tables 5 and 6 show, for n = 100, 200, 400, 500, 800 and 1000 the number of communications of the PAPBFT consensus algorithm is orders of magnitude lower than that of the traditional PBFT consensus algorithm.

The number of communication times of the PBFT consensus algorithm

Node number 100 200 400 500 800 1000
Communication frequency 19752 79635 322500 495000 1296322 1985220

Number of communications for the PAPBFT consensus algorithm

Node number \ Group number 5 10 15 20 25
100 247 383 600 930 1348
200 434 559 801 1120 1558
400 837 954 1176 1527 1946
500 1019 1160 1398 1714 2132
800 1630 1765 1973 2309 2751
1000 2029 2141 2383 2724 3164

The communication complexity of PBFT is known to be O(n²), and the communication complexity of PAPBFT is calculated to be $O\!\left(\frac{n}{k}\log_k n\right)$, where n is the total number of nodes in the network and k is the number of groups. The comparison of communication complexity is shown in Fig. 7, and a small counting sketch is given after the figure. From this comparison, when the number of nodes is small the communication complexity of PAPBFT is close to that of PBFT; as the number of nodes increases, the communication complexity of PAPBFT becomes much smaller than that of traditional PBFT, grows more slowly, and the communication remains stable. Thus, when the number of nodes is large, PAPBFT significantly reduces the system's communication complexity and improves its efficiency. In the university archive management framework, a large number of school nodes no longer need to maintain P2P links with all other school nodes; they only need to maintain a network connection with the corresponding node of the consensus cluster, which reduces the hardware and network resource consumption of the school nodes.

Figure 7.

Comparison chart of communication complexity
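The comparison can be reproduced approximately with the following Go sketch, assuming the commonly cited ~2n² message count for one PBFT consensus round (which matches Table 5 to within a few percent); the PAPBFT figures are quoted directly from Table 6 for k = 5.

```go
// Sketch comparing message counts: a standard PBFT round needs on the order of
// 2*n^2 messages (close to the Table 5 figures), while the PAPBFT values below
// are quoted directly from Table 6 for k = 5 groups.
package main

import "fmt"

func pbftMessages(n int) int { return 2 * n * n } // approximate count for one consensus round

func main() {
	papbftK5 := map[int]int{100: 247, 200: 434, 400: 837, 500: 1019, 800: 1630, 1000: 2029}
	for _, n := range []int{100, 200, 400, 500, 800, 1000} {
		fmt.Printf("n=%4d  PBFT≈%8d  PAPBFT(k=5)=%5d\n", n, pbftMessages(n), papbftK5[n])
	}
}
```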

Transaction Throughput Analysis of PAPBFT Consensus Algorithm

The experiment uses Docker containers as the isolation technology; thanks to their lightweight nature, thousands of containers can run in a single-machine environment. Multiple node containers are started from a customized Docker image using Docker commands; each container represents a consensus node, and together they constitute the consensus domain. The experiment is implemented in the Golang programming language, which provides efficient concurrent programming and supports multi-threaded working modes convenient for implementing distributed consensus algorithms, and simulation tests are conducted with JetBrains GoLand version 2023.1.2. Multi-machine network communication within the same LAN is used to simulate the consensus process. This experiment compares the throughput of the traditional PBFT algorithm with that of PAPBFT under different numbers of groups and nodes: the number of nodes is taken as 25, 50, 125 and 200 and the PAPBFT grouping as 5, 10, 15 and 20, and histograms are drawn for comparative analysis. The throughput for different values of N and K is shown in Fig. 8. As shown, the throughput of the PAPBFT consensus algorithm increases as the number of nodes N increases; the larger the number of consensus nodes, the larger the system throughput, which greatly increases the scalability of the university archive management framework. As can also be seen from the figure, the throughput of the traditional PBFT consensus algorithm gradually decreases as the number of nodes increases: when N = 25 the PBFT throughput is 1250, and when N = 200 it is 215. The improved consensus algorithm effectively alleviates the obvious drop in throughput caused by the continuously increasing number of nodes and is suitable for application in the university archive management framework.

Figure 8.

The throughput of N and K at different values

Conclusion

The article applies the credit-partitioned consensus algorithm and the C4.5 decision tree classification algorithm to the design of the college archive management system, which can accurately classify the visitors of the college archives and lay the foundation for providing personalized, intelligent services. The following conclusions are drawn from the experiments:

Applying the C4.5 algorithm to the college archive system and mining the relationship between customers' personal information and the frequency of access to the college digital archives, the rules show that the degree of influence of customers' individual information on access frequency is, in order, "identity", "age", "specialty" and "gender". The university archive management system can therefore use these rules to classify visitors into frequent and infrequent visitors and then provide better personalized services for customers.

Comparing the PAPBFT consensus algorithm with the traditional PBFT consensus algorithm in terms of the number of communications, the experimental results show that PAPBFT greatly reduces the number of node communications; when N = 200 the throughput of PAPBFT (k = 5) is 2365, which can effectively support the expansion of the alliance chain network and realize the upgrading of the informatization of university archive management.
