Research on upgrading the informationization of college archives management based on information technology
Published online: 29 Sep 2025
Received: 30 Jan 2025
Accepted: 11 May 2025
DOI: https://doi.org/10.2478/amns-2025-1093
© 2025 Mingqi Tang and Lei Xin, published by Sciendo.
This work is licensed under the Creative Commons Attribution 4.0 International License.
Archive management is an important component of the operation and management of colleges and universities. As an information source, archives provide strong support for teaching, research, daily management and other activities. In the digital era, colleges and universities have begun to emphasize an information-based archive management model: building on existing archive resources, they increase efforts in excavation and development and migrate archive management work online, which facilitates efficient retrieval and utilization and supports the sound development of the archive business [1-2]. Making full use of the advantages of information technology and accelerating the informatization of archive management meets the needs of the times, of society and of education.
Many social organizations, enterprises and institutions still follow traditional management methods when carrying out archive management, and their archives are paper-based [3]. Paper archives are easily damaged during long-term preservation: if the preservation method is inappropriate, they are subject to moths, oxidation, corrosion and similar hazards, and the archives lose their integrity [4-5]. Informatization can improve this situation. Paper archives are converted into electronic form and stored in a dedicated system, so they take up no physical space, are not affected by environmental factors, and can be retrieved rapidly by entering keywords in the system [6-9]. In addition, users who log in to the system are authenticated and their browsing leaves traces, which greatly enhances the security of archival resources [10-11]. The informatization of college archives should be built on modern network technology: while the hardware and software of college archives are continuously improved, a one-stop archive data management platform should be developed to automate the transformation of data and information [12-13]. Actively carrying out informatization helps to ensure the timely collection and integration of archive resources and keeps them in step with the times. It provides social organizations, enterprises and institutions with a strong basis for management decisions, allows limited archive resources to be shared rapidly and to interconnect and interoperate, and better adapts archive work to the development trend of the digital era [14-15].
The article first adopts the B/S three-layer architecture model for the design of the college file management system, which is divided into the presentation layer, the logic processing layer, and the underlying data processing layer; on this basis, the system’s functions and database are designed in detail. The functions of the file management system are divided into five major modules: teacher file management, student file management, student registration information update, file data maintenance, and file information mining, and a new binary-based dichotomous data analysis algorithm is proposed. After designing the university archive management system, the article designs a traceable digital archive sharing scheme based on blockchain, which shares digital archives securely within a controlled range, extracts archive sharing metadata from the sharing process, and stores this information on the blockchain, making archive sharing more secure and traceable. A Byzantine fault-tolerant consensus algorithm based on credit partitioning is then proposed; it improves consensus efficiency by partitioning the network nodes for autonomy and reducing the number of consensus nodes. The classical C4.5 algorithm is analyzed and its attribute selection calculation is improved: Taylor’s formula and Maclaurin’s expansion formula are used to obtain an approximate conversion of the information gain rate formula. Data mining technology is applied to archive management by establishing an archive information decision-making model, so that information technology processes college archive services at a deeper level and mines the information resources that are valuable for the development of college archives.
Finally, the C4.5 algorithm is applied to university archive management to mine the relationship between the personal information of students and faculty and the access frequency of university digital libraries, so as to accurately categorize customers into frequent and infrequent visitors. The proposed consensus algorithm is also compared with the traditional consensus algorithm in terms of communication complexity.
The quality of the system design directly determines the quality of the software. The design of the system needs to follow certain principles:
Modularity. The system is divided into several large modules, each designed independently of the others. The modules have no direct logical relationship with one another and are connected only through the database, so that the system works as a whole and its functions are realized more easily. Because the modules are independent, the system is more secure and reliable and its structure is more complete.
Abstraction. The system has a certain degree of abstraction, so its designers have fewer factors to consider and can concentrate on the main ones, which makes the design simpler and more convenient.
Detailing. After the key problems of the system are solved, its detailed problems are solved step by step.
Information localization and hiding. Localization means that elements with similar structure and function are grouped together. Hiding means that, because each module is independent, information is used only within its own module and is not shared between modules.
Independence. Independence between the system modules and between the functions of the system is conducive to the operation and maintenance of the system, and the design needs to achieve “high cohesion and low coupling”.
In addition to following these design principles, the design must take the actual situation into account, mainly in the following aspects:
Practicality. The design should be based on the school’s actual needs, so that it facilitates the management of teacher and student records and achieves a high degree of data informatization.
Stability. The system should run stably for a long time and adapt to unforeseen events. Because file management is a very important part of school management, stability must be considered in the design.
Security. Although the system brings great convenience to the school’s file management, it also gives rise to corresponding risks, the most important of which concerns data security. Student information must be kept strictly confidential to prevent data leakage from harming the school.
Given the urgent demand for a networked university file management system, the design should take the stability and security of the system and the security of its data into consideration and aim for a system that is advanced, reliable, scalable and flexible. The system mainly handles the file management of teachers and students, teaching business management, subject management, system management and other business.
The system design adopts the MVC idea in the form of a three-layer model and divides the system, according to its structure, into an operation interface layer, a logic processing layer, and a data access processing layer. The operation interface layer integrates the relevant operation controls, through which users operate the system to achieve the desired purpose. During operation, the user cares only about the function realized and does not need to pay attention to the underlying logic, so all parts below the operation interface layer are hidden from the outside. The logic processing layer parses the operation commands sent from the operation interface layer, converts them into the required format, calls the relevant operation logic to execute them, and sends commands to the data access processing layer. It is therefore the middle part of the system, the link between the upper and lower layers, and the core of the system [16]. The data access processing layer receives commands from the logic processing layer, filters the data required by the commands, and returns these data to the logic processing layer, where they are processed into a format that can be displayed directly by the operation interface layer. The overall technical architecture of the system is shown in Figure 1.

Overall technical architecture of the system
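The division of responsibilities among the three layers can be illustrated with a minimal Python sketch. The class names, the `find key=value` command format, and the sample records are hypothetical and chosen only for illustration; they are not taken from the system described here.

```python
from dataclasses import dataclass, field


@dataclass
class DataAccessLayer:
    """Bottom layer: holds records and filters them on request."""
    records: list = field(default_factory=list)

    def query(self, **conditions):
        # Return only the records that match every requested field value.
        return [r for r in self.records
                if all(r.get(k) == v for k, v in conditions.items())]


@dataclass
class LogicLayer:
    """Middle layer: parses interface commands and calls the data layer."""
    data: DataAccessLayer

    def handle(self, command: str):
        # Hypothetical command format: "find key=value key=value"
        verb, *pairs = command.split()
        if verb != "find":
            raise ValueError(f"unsupported command: {verb}")
        conditions = dict(p.split("=", 1) for p in pairs)
        rows = self.data.query(**conditions)
        # Convert raw records into display-ready strings for the interface.
        return [f"{r['name']} ({r['role']})" for r in rows]


@dataclass
class InterfaceLayer:
    """Top layer: exposes operations and hides everything beneath it."""
    logic: LogicLayer

    def search(self, **conditions):
        command = "find " + " ".join(f"{k}={v}" for k, v in conditions.items())
        return self.logic.handle(command)


store = DataAccessLayer([
    {"name": "Li", "role": "teacher"},
    {"name": "Wang", "role": "student"},
])
ui = InterfaceLayer(LogicLayer(store))
result = ui.search(role="teacher")
print(result)
```

The caller only touches `InterfaceLayer`; the command parsing and the record filtering stay hidden in the two lower layers, which is the separation the three-layer model aims for.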
The file management system developed in this paper mainly processes and standardizes the daily work of file management. In line with the school’s business needs for file management, the system is organized into six major functional modules: the teachers’ file management module, students’ file management module, teaching business management module, scientific research project management module, system maintenance module and data maintenance module. The modules are independent of one another. The overall functional structure of the system is shown in Figure 2.

Overall functional structure of the system
The management system uses digital certificates to authenticate user identity securely. Their main function is to safeguard the security of files during transmission and to verify the party they are transmitted to. Because each digital certificate uniquely identifies a user, the certificates help guarantee system security.
Authentication is required for the various operations that university staff perform in the records management system, so CA authentication support is needed. Middleware technologies such as ASP, XML, WebService, CORBA and COM components shield the differences in the underlying security hardware and the cryptographic logic. Developers only need to add the security functions required by the business and perform a simple configuration to obtain PKI-based security applications, which greatly reduces development cost and improves development efficiency [17]. An extranet PKI security services middleware platform has already been built in a unified way by a large number of universities and colleges, so this system can be implemented on the existing platform.
To ensure database security and prevent information leakage, the system protects its key information, such as ID card numbers, home addresses and other sensitive information, with MD5. To distinguish between account roles, the system introduces a privilege control function that strictly separates and restricts the functional privileges of users with different roles, which safeguards the security and stability of the system framework.
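The two measures described above can be sketched in a few lines of Python. This is a minimal illustration that uses the standard `hashlib` module; the role names, the permission sets and the sample ID value are assumptions, not details of the actual system. Note that MD5 produces a one-way digest, so a stored value can be compared against a new input but not converted back to the original.

```python
import hashlib


def md5_digest(value: str) -> str:
    """Return the hexadecimal MD5 digest of a sensitive field."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


# Hypothetical role-to-permission table for privilege control.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "maintain"},
    "teacher": {"read", "write"},
    "student": {"read"},
}


def is_allowed(role: str, action: str) -> bool:
    """Check whether a role may perform an action; unknown roles get nothing."""
    return action in ROLE_PERMISSIONS.get(role, set())


# Store the digest instead of the plain ID card number (value is made up).
record = {"name": "Wang", "id_card": md5_digest("hypothetical-id-number")}

print(record["id_card"])
print(is_allowed("teacher", "write"), is_allowed("student", "write"))
```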
In this section, we present the architecture of the blockchain-based traceable digital archive sharing scheme. The system framework is shown in Fig. 3, and the architecture consists of a user interface layer, a smart contract layer, and a storage layer.

System framework
Extraction and protection of archive sharing metadata
Extraction of archive sharing metadata
Archive sharing metadata is a data structure extracted from the archive data sharing process. It contains data description information, data operation records and security control information, and is stored as transactions in the distributed ledger of the blockchain. The identifier keyId is the unique identifier of the archive sharing metadata, obtained by calculating the hash value of the privacy-preserving metadata subject. The calculation formula is shown in (1):
The digital signature (DigitalSignature) ensures the integrity and non-repudiation of archive sharing metadata; it is generated by signing the archive sharing metadata with the sender’s private key. The calculation formula is shown in (2):
Archive sharing metadata storage
The archive sharing metadata is encrypted, packaged and stored on the consortium (alliance) chain through a smart contract to ensure tamper resistance and traceability. A hash function is applied to the file sender Id, file receiver Id, file attribute data and other fields of the metadata to obtain keyId, the unique identifier of the current data block. The keyId is then signed with the file sender’s private key (senderPrivateKey) to generate a digital signature, and the keyId is sent to the file receiver, who uses it to query the archive sharing metadata on the consortium chain. Finally, the on-chain (uplink) method of the consortium chain is called to write the generated archive sharing metadata to the chain.
Traceable digital archive sharing mechanism across libraries
Identity registration and authentication: each digital archive joining the consortium (alliance) chain has to register its identity to generate a digital identity, which is the basis for the archive to operate in the system. An archive that needs to register first generates a public and private key pair locally and stores the private key locally. The public key is then sent to all members of the consortium through the network. On receiving the request, the members execute the identity registration smart contract to initiate a registration vote. When more than half of the nodes agree, the registration succeeds and the public key is added to the local consortium chain configuration list. The authentication module covers user identity authentication and system authority authentication based on the consortium chain.
Data transmission: the data transmission module consists of file sending and file receiving. A user who joins as a consortium chain node holds the network addresses and network ports of the other users on the chain. To share data files, the requester first initiates a sharing request through the network. The requested user then analyzes the request to determine whether to perform a sending or a receiving operation and which file data are to be shared. The file sharing process is shown in Figure 4.

File sharing process
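The registration rule described above (registration succeeds once more than half of the nodes agree) can be sketched as follows. The function names, node names and key strings are hypothetical; in the scheme itself the vote is carried out by a smart contract rather than by local Python code.

```python
def registration_approved(votes: dict) -> bool:
    """Return True when strictly more than half of the voting nodes agree."""
    approvals = sum(1 for agreed in votes.values() if agreed)
    return approvals > len(votes) / 2


# Stand-in for the local consortium chain configuration list.
config_list = {}


def register(archive_id: str, public_key: str, votes: dict) -> bool:
    """Record the public key only if the registration vote passes."""
    if registration_approved(votes):
        config_list[archive_id] = public_key
        return True
    return False


# Two of four approvals is exactly half, so the request is rejected.
tied = register("archive-C", "PUBKEY-C",
                {"n1": True, "n2": True, "n3": False, "n4": False})
# Three of four approvals is more than half, so the request is accepted.
passed = register("archive-D", "PUBKEY-D",
                  {"n1": True, "n2": True, "n3": True, "n4": False})
print(tied, passed, config_list)
```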
Archive leakage tracking
Storing the archive sharing metadata on the blockchain makes data sharing transparent and tamper-proof, which solves the problem of tracing the archive sharing process. The relationship between the archive sharing metadata and the blockchain ledger transaction record structure is shown in Figure 5.

The relationship between the archive sharing metadata and the ledger transaction record structure
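A minimal sketch of how a sharing record might be identified, signed, stored and later traced is given below. Several points are assumptions made only for illustration: SHA-256 is used as the hash function, an in-memory list stands in for the consortium chain ledger, and an HMAC with a shared secret stands in for the private-key signature, because the Python standard library has no asymmetric signing. The field names are hypothetical.

```python
import hashlib
import hmac
import json


def make_key_id(sender_id: str, receiver_id: str, attributes: dict) -> str:
    """Hash sender, receiver and file attributes into a unique identifier."""
    payload = json.dumps(
        {"sender": sender_id, "receiver": receiver_id, "attributes": attributes},
        sort_keys=True,  # stable ordering so equal inputs give equal hashes
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def sign(key_id: str, sender_secret: bytes) -> str:
    """Stand-in for a private-key signature: an HMAC over the keyId."""
    return hmac.new(sender_secret, key_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()


ledger = []  # stand-in for the on-chain transaction list


def share(sender_id, receiver_id, attributes, sender_secret):
    """Build the sharing metadata, append it to the ledger, return keyId."""
    key_id = make_key_id(sender_id, receiver_id, attributes)
    ledger.append({
        "keyId": key_id,
        "sender": sender_id,
        "receiver": receiver_id,
        "attributes": attributes,
        "digitalSignature": sign(key_id, sender_secret),
    })
    return key_id


def trace(key_id):
    """Look up every ledger record that carries the given keyId."""
    return [tx for tx in ledger if tx["keyId"] == key_id]


key_id = share("archive-A", "archive-B",
               {"title": "2023 teaching records"}, b"hypothetical-secret")
print(trace(key_id)[0]["receiver"])
```

The receiver only needs the `keyId` to find the record, and anyone auditing a leak can use the same lookup to see who sent which archive to whom.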
Practical Byzantine Fault Tolerance consensus algorithm
The PBFT consensus algorithm provides a high level of security to the system. It guarantees the eventual consistency of the system when there are no more than $f$ Byzantine nodes among $n \geq 3f + 1$ nodes.
CP-PBFT improvement scheme
Node credit evaluation model
The PBFT consensus algorithm lacks a reward and punishment mechanism, so nodes lack motivation. In CP-PBFT, the ability of a node to reach consensus is judged by its credit value: nodes with a high credit value are more likely to act as the master node or a consensus node, while nodes with a low credit value can only act as listener nodes. Because the evaluation indices differ in scale and magnitude, the original indices are transformed into comparable standard values as follows: Latency index, which represents the response rate of nodes responding to various types of requests during the consensus process. The delay index
Where
Completion index, which refers to the success rate of nodes in completing message consensus during the consensus period. Completion index
Where
The node activity index refers to the ratio of nodes’ online participation in the consensus process. The activity index
The index shows that the longer a node is online, the greater its activity index, and the longer it is offline, the lower its activity index. Consensus node selection therefore favors more active nodes.
The method of calculating credit value according to the node credit evaluation index is shown in equation (6).
Node Partitioning Mechanism In CP-PBFT improved consensus algorithm, let the total number of nodes be After the network node partition, the average number of nodes owned by each node
C4.5 is a mature, mainstream decision tree algorithm characterized by fast classification and high classification accuracy. It is an improvement on the ID3 algorithm: compared with ID3, it adds the handling of continuous attributes and of missing attribute values, and it is considerably more efficient. The idea of the C4.5 algorithm is as follows: assuming that the
Definition 1 Category information entropy: let the training set
Definition 2 Conditional information entropy: If attribute
where
Definition 3: The information gain for classifying attribute
Definition 4 Split information entropy: Let attribute
Definition 5: The formula for dividing the information gain rate for attribute
Theoretical foundation
The Taylor series represents a function as a series, that is, a sum of infinitely many terms computed from the derivatives of the function at a given point. Taylor’s mean value theorem is stated as follows: If for any function
The
In Taylor’s formula, let
Removing the Lagrange residual term gives the approximate formula:
Improvement ideas
By the basic principle of the C4.5 algorithm, attributes are selected during decision tree generation according to information theory. Calculating the information gain rate involves multiple logarithmic operations, each of which calls a library function, and this greatly increases the calculation time. To address this problem, an improved information gain rate formula is proposed: the calculation is simplified using Taylor’s formula and Maclaurin’s formula, which substantially reduces the computational complexity of the algorithm. The improved C4.5 algorithm is named the TAM-C4.5 algorithm. Since the derivative of ln(
So there:
When
Through the above approximation, logarithmic operations are converted into non-logarithmic operations. Applying this transformation eliminates the complex logarithmic operations in the information gain rate formula, which simplifies the calculation and improves the efficiency of tree building. The formula is also more accurate than the simplification based on the equivalent infinitesimal ln(1 +
The transformation formula of the category information entropy is shown in (18):
Similarly, the transformation formulas for the conditional information entropy and the split information entropy are shown in (19):
Therefore, the formula for the transformed information gain rate is shown in (21):
The archival information is divided into 12 major categories according to the archival classification standards and presented in primary, secondary and tertiary categories, and the archival information is categorized as shown in Table 1.
Archives information classification
Serial number | Primary class | Secondary class | Tertiary class |
---|---|---|---|
1 | DQ | Party group | Year |
2 | XZ | Administration | Year |
3 | JX | Teaching | Year |
4 | KY | Scientific research | Year |
5 | JJ | Infrastructure | Year |
6 | SB | Equipment | Year |
7 | CB | Publishing | Year |
8 | WS | Foreign affairs | Year |
9 | CK | Finance and accounting | Year |
10 | SX | Audio-visual | Year |
11 | YJ | Performance | Year |
12 | CP | Product production research development | Year |
For example, the university has 34 departments in total: 17 administrative departments, 14 teaching departments, and 3 teaching support (auxiliary) departments. Taking into account the different division of labor among departments, three meaningful attributes are selected: the administrative department group, the teaching department group, and the teaching support (auxiliary) department group. The departmental grouping information is shown in Table 2.
Department information
Representative department | Related department 1 | Related department 2 | Related department 3 | Related department 4 |
---|---|---|---|---|
Administration department | Dean’s office | Personnel department | Finance office | Asset facilities |
Teaching department | Provost | Tourist school | Foreign language institute | Institute of financial management |
Auxiliary department group | Library | Education technology center | Experimental center | |
First, organize the raw data. After the raw data have been collected, they need to be described more specifically and the corresponding results derived. The results are stored with the relevant archive text, picture and video information of the various departments in the system, and this information is combined to determine the corresponding data and generate charts.
Second, preprocess the data. The purpose is to produce usable data: noise in the data is handled so that it disappears or becomes negligible. Noise affects the entire decision tree, making it less accurate and introducing erroneous information. For example, if a department has no archival information for some reason, the corresponding value in the data set can be initialized as none. Individual missing values are filled in manually.
Third, convert the data. In general, the decision tree algorithm works on discrete data, so the data need to be discretized. For the departmental archive information, confidentiality is divided into three levels (top secret, confidential and secret): top secret is set as attribute value A, confidential as B, and secret as C. The discretization of the department group is shown in Table 3.
Department group discretization
Attribute setting | Attribute value | Meaning |
---|---|---|
File information | A | Access limited to school leaders |
File information | B | Access limited to school leaders and middle managers |
File information | C | Available for consultation by people on campus |
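The preprocessing and conversion steps can be sketched as a small mapping. The A/B/C coding follows the text above; the record layout, the field names and the sample departments are assumptions made for illustration.

```python
# Confidentiality level -> discrete attribute value, as described in the text.
LEVEL_TO_ATTRIBUTE = {"top secret": "A", "confidential": "B", "secret": "C"}


def discretize(records):
    """Map each department's confidentiality level to A, B or C.

    Departments without archival information are initialized as "none".
    """
    converted = []
    for rec in records:
        level = rec.get("level")  # None when the department has no archives
        converted.append({
            "department": rec["department"],
            "file_information": LEVEL_TO_ATTRIBUTE.get(level, "none"),
        })
    return converted


raw = [
    {"department": "Dean's office", "level": "top secret"},
    {"department": "Library", "level": "secret"},
    {"department": "Experimental center"},  # no archival information
]
table = discretize(raw)
for row in table:
    print(row)
```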
In this paper, the C4.5 algorithm is used to classify the customers of college archives according to whether they visit frequently. The assessment should take into account the specific behavior and actual situation of each customer, including specialty, age and so on. Based on the identity (academic background), gender, age and specialty of each customer, a C4.5 decision tree classification model is established to evaluate the frequency of customers’ visits to college archives. In essence, the C4.5 algorithm mines the data to obtain the relationship between the personal information of visiting customers and their visit frequency, and derives classification rules that form an intelligent model for evaluating how frequently customers visit college archives.
Take the information of 2023 customers visiting college archives in a college as an example. A data table is established with the fields customer number, identity, gender, age, major and access frequency. Data conversion, integration and related techniques are used to remove noise and other irrelevant information and to transform the data types and values of the data sources into a unified standard format. For clients who are both teachers and postgraduate students, the identity is defined as teacher. The visit frequency attribute is treated as follows: ① Clients who visit more than 3 times a week are designated as frequent visitors. ② The remaining customers are designated as infrequent visitors. The converted customer information table is obtained after data preprocessing, and the resulting personal information feature set is shown in Table 4.
Customer personal information feature set after data conversion
Customer | Identity | Gender | Age | Major | Frequent visitor |
---|---|---|---|---|---|
1 | Teacher | Female | ⩾40 | Mathematics | N |
2 | Undergraduate | Male | <20 | Chemistry | Y |
3 | Graduate student | Female | 20~40 | Physics | Y |
4 | Teacher | Female | ⩾40 | Chemistry | N |
5 | Graduate student | Male | ⩾40 | Physics | Y |
6 | Undergraduate | Male | <20 | Physics | N |
7 | Graduate student | Female | 20~40 | Mathematics | Y |
8 | Teacher | Female | 20~40 | Mathematics | N |
9 | Graduate student | Male | <20 | Chemistry | Y |
10 | Graduate student | Female | ⩾40 | Physics | N |
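The conversion rules that produce a table of this form can be sketched as three small functions. The function names and the input format (a list of roles, an age in years, a weekly visit count) are assumptions; the thresholds follow the text, and the age bins follow the three groups used in Table 4.

```python
def normalize_identity(roles):
    """A client who is both teacher and postgraduate is treated as a teacher."""
    return "Teacher" if "Teacher" in roles else roles[0]


def age_group(age: int) -> str:
    """Bin an age into the three groups of Table 4 (>=40 is printed as ⩾40)."""
    if age < 20:
        return "<20"
    if age < 40:
        return "20~40"
    return ">=40"


def visit_label(visits_per_week: int) -> str:
    """More than 3 visits a week counts as a frequent visitor (Y)."""
    return "Y" if visits_per_week > 3 else "N"


row = (
    normalize_identity(["Graduate student", "Teacher"]),
    age_group(42),
    visit_label(2),
)
print(row)
```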
After preprocessing, the attributes identity, gender, age and specialty are taken as the candidate attributes of the C4.5 algorithm, and the attribute “whether to visit frequently” is taken as the target attribute. The attributes are ranked by information gain rate, and the attribute with the highest information gain rate is selected as the test attribute to construct the root node. A branch is built for each value of this attribute, and the tree is built recursively to obtain the decision tree. The attribute “identity” is chosen as the root node because it has the largest information gain rate. For each branch, the above steps are repeated, and the final decision tree is shown in Figure 6.

The resulting decision tree
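The root selection step can be checked with a short sketch that computes the standard C4.5 information gain rate (information gain divided by split information) for each attribute. The rows are transcribed from Table 4 with normalized labels; the code uses exact logarithms, not the Taylor-based approximation of TAM-C4.5, and the variable names are chosen for illustration.

```python
from collections import Counter
from math import log2

# (identity, gender, age group, major, frequent visitor) from Table 4.
ROWS = [
    ("Teacher", "Female", ">=40", "Mathematics", "N"),
    ("Undergraduate", "Male", "<20", "Chemistry", "Y"),
    ("Graduate student", "Female", "20~40", "Physics", "Y"),
    ("Teacher", "Female", ">=40", "Chemistry", "N"),
    ("Graduate student", "Male", ">=40", "Physics", "Y"),
    ("Undergraduate", "Male", "<20", "Physics", "N"),
    ("Graduate student", "Female", "20~40", "Mathematics", "Y"),
    ("Teacher", "Female", "20~40", "Mathematics", "N"),
    ("Graduate student", "Male", "<20", "Chemistry", "Y"),
    ("Graduate student", "Female", ">=40", "Physics", "N"),
]
ATTRIBUTES = ["Identity", "Gender", "Age", "Major"]


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(values).values())


def gain_ratio(rows, index):
    """Information gain of attribute `index` divided by its split information."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[index], []).append(r[-1])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split_info = entropy([r[index] for r in rows])
    return (entropy(labels) - conditional) / split_info


ratios = {name: gain_ratio(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
root = max(ratios, key=ratios.get)
for name, value in ratios.items():
    print(f"{name}: {value:.4f}")
print("root attribute:", root)
```

On these ten rows the sketch gives “Identity” the largest gain rate (about 0.30, against roughly 0.13 for gender, 0.08 for age and 0.03 for major), which agrees with the choice of identity as the root node.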
If status = “Undergraduate” AND age < 20 AND major = “Chemistry” THEN visits are frequent.
If status = “Graduate student” AND age < 40 THEN visits are frequent.
If status = “Graduate student” AND age >= 40 AND major = “Physics” AND gender = “Male” THEN visits are frequent.
If status = “Teacher” THEN visits are infrequent.
If status = “Undergraduate” AND age < 20 AND major = “Physics” THEN visits are infrequent.
If status = “Graduate student” AND age >= 40 AND major = “Physics” AND gender = “Female” THEN visits are infrequent.
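The six rules can be written as a single function and checked against the ten customers of Table 4. The function name and the string labels are illustrative; combinations that no rule covers return `None` instead of a guess.

```python
def classify(identity, gender, age, major):
    """Apply the extracted rules; return "Y", "N", or None if no rule fits."""
    if identity == "Teacher":
        return "N"
    if identity == "Undergraduate" and age == "<20":
        if major == "Chemistry":
            return "Y"
        if major == "Physics":
            return "N"
    if identity == "Graduate student":
        if age in ("<20", "20~40"):  # age < 40
            return "Y"
        if age == ">=40" and major == "Physics":
            if gender == "Male":
                return "Y"
            if gender == "Female":
                return "N"
    return None


# (identity, gender, age group, major, expected label) from Table 4.
TABLE_4 = [
    ("Teacher", "Female", ">=40", "Mathematics", "N"),
    ("Undergraduate", "Male", "<20", "Chemistry", "Y"),
    ("Graduate student", "Female", "20~40", "Physics", "Y"),
    ("Teacher", "Female", ">=40", "Chemistry", "N"),
    ("Graduate student", "Male", ">=40", "Physics", "Y"),
    ("Undergraduate", "Male", "<20", "Physics", "N"),
    ("Graduate student", "Female", "20~40", "Mathematics", "Y"),
    ("Teacher", "Female", "20~40", "Mathematics", "N"),
    ("Graduate student", "Male", "<20", "Chemistry", "Y"),
    ("Graduate student", "Female", ">=40", "Physics", "N"),
]

mismatches = [row for row in TABLE_4 if classify(*row[:4]) != row[4]]
print("rows not reproduced by the rules:", len(mismatches))
```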
The above rules show that the customer attributes influence visit frequency in the order “identity”, “age”, “specialty”, “gender”. According to these rules, university archives can classify their clients into frequent and infrequent visitors. In addition, timely information content can be provided according to each customer’s personal information, both to retain frequent visitors and to attract infrequent ones, which improves the management of college archives and provides customers with better personalized service.
With the number of nodes n taken as 100, 200, 400, 500, 800 and 1000, the number of communications of the PBFT consensus algorithm is shown in Table 5. With the number of groups of the PAPBFT consensus algorithm taken as 5, 10, 15, 20 and 25, the number of PAPBFT consensus communications is shown in Table 6. Tables 5 and 6 show that, for every tested value of n, the PAPBFT consensus algorithm needs far fewer communications than the traditional PBFT consensus algorithm: the reduction ranges from roughly one order of magnitude (n = 100, 25 groups) to roughly three orders of magnitude (n = 1000, 5 groups).
The number of communication times of the PBFT consensus algorithm
Node number | 100 | 200 | 400 | 500 | 800 | 1000 |
---|---|---|---|---|---|---|
Communication frequency | 19752 | 79635 | 322500 | 495000 | 1296322 | 1985220 |
Number of communications for the PAPBFT consensus algorithm
Node number (rows) / number of groups (columns) | 5 | 10 | 15 | 20 | 25 |
---|---|---|---|---|---|
100 | 247 | 383 | 600 | 930 | 1348 |
200 | 434 | 559 | 801 | 1120 | 1558 |
400 | 837 | 954 | 1176 | 1527 | 1946 |
500 | 1019 | 1160 | 1398 | 1714 | 2132 |
800 | 1630 | 1765 | 1973 | 2309 | 2751 |
1000 | 2029 | 2141 | 2383 | 2724 | 3164 |
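The size of the reduction can be read directly from the two tables. The short sketch below copies the values of Tables 5 and 6 and computes, for each combination of node number and group count, how many times fewer communications PAPBFT needs; the variable names are chosen for illustration.

```python
# Table 5: communications of traditional PBFT per node number.
PBFT = {100: 19752, 200: 79635, 400: 322500,
        500: 495000, 800: 1296322, 1000: 1985220}

# Table 6: communications of PAPBFT per node number and number of groups.
PAPBFT = {
    100: {5: 247, 10: 383, 15: 600, 20: 930, 25: 1348},
    200: {5: 434, 10: 559, 15: 801, 20: 1120, 25: 1558},
    400: {5: 837, 10: 954, 15: 1176, 20: 1527, 25: 1946},
    500: {5: 1019, 10: 1160, 15: 1398, 20: 1714, 25: 2132},
    800: {5: 1630, 10: 1765, 15: 1973, 20: 2309, 25: 2751},
    1000: {5: 2029, 10: 2141, 15: 2383, 20: 2724, 25: 3164},
}

# Reduction factor = PBFT count / PAPBFT count for each (n, groups) pair.
reduction = {
    (n, k): PBFT[n] / count
    for n, row in PAPBFT.items()
    for k, count in row.items()
}
smallest = min(reduction.values())
largest = max(reduction.values())
print(f"smallest reduction factor: {smallest:.1f}")
print(f"largest reduction factor:  {largest:.1f}")
```

The smallest factor occurs at n = 100 with 25 groups (about 15 times fewer communications) and the largest at n = 1000 with 5 groups (about 978 times fewer).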
The communication complexity of PBFT is known to be $O(n^2)$.

Comparison chart of communication complexity
The experiment uses Docker containers as the isolation technology. Because Docker containers are lightweight, thousands of containers can be run on a single machine. Multiple node containers are started from a customized Docker image using Docker commands; each container represents a consensus node, and together they constitute the consensus domain. The experiment is implemented in the Golang programming language, and simulation tests are conducted using JetBrains GoLand version 2023.1.2. Golang provides efficient concurrent programming and supports multi-threaded work modes, which facilitates the implementation of distributed consensus algorithms. Multi-computer network communication within the same LAN is used to simulate the consensus process. The experiment compares the throughput of the traditional PBFT algorithm with that of PAPBFT under different numbers of groups. The number of nodes N is taken as 25, 50, 125 and 200, and the number of PAPBFT groups K as 5, 10, 15 and 20; the throughput for the different values of N and K is shown as histograms in Fig. 8. The throughput of the PAPBFT consensus algorithm increases as the number of nodes N increases: the larger the number of consensus nodes, the larger the throughput of the system, which greatly increases the scalability of the university file management framework. In contrast, the throughput of the traditional PBFT consensus algorithm gradually decreases as the number of nodes increases: when N=25 the PBFT throughput is 1250, and when N=200 it is 215.
The improved consensus algorithm effectively alleviates the marked decrease in throughput caused by the continuous increase in the number of nodes, and is suitable for application in the university archive management framework.

The throughput of N and K at different values
The article applies a consensus algorithm and the C4.5 decision tree classification algorithm to the design of the college file management system, which can accurately classify the customers who access college files and lays the foundation for personalized and intelligent services. The following conclusions are drawn from the experiments:
The C4.5 algorithm is applied to the college archive system to mine the relationship between customers’ personal information and the frequency of access to the college digital library. The rules show that the attributes influence access frequency in the order “identity”, “age”, “specialty”, “gender”. The university file management system can therefore use the rules to divide customers into frequent and infrequent visitors and provide them with better personalized service.
Comparing the PAPBFT consensus algorithm with the traditional PBFT consensus algorithm in terms of the number of communications, the experimental results show that PAPBFT greatly reduces the number of communications between nodes. When N=200, the throughput of PAPBFT (k=5) is 2,365, which can effectively promote the expansion of the consortium (coalition) chain network and realize the upgrading of the informatization of university archive management.