An Optimization Model of Financial Management Teaching Strategies Based on Reinforcement Learning
Published online: 17 Mar 2025
Received: 17 Oct 2024
Accepted: 26 Jan 2025
DOI: https://doi.org/10.2478/amns-2025-0233
© 2025 Fei Wang, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Financial management is an important part of modern enterprise management and a basic skill that enterprises must master to survive and develop. As an important branch of management, it plays a vital role in enterprises' economic decision-making and resource allocation [1-4]. In institutions of higher education at home and abroad, financial management has become an independent discipline and, gradually, a compulsory course for management majors [5-6].
Financial management has long been regarded as an important business discipline and has cultivated a large number of financial management professionals. Traditional financial management teaching centers on lecturing theoretical knowledge, with students listening passively in the classroom [7-10]. However, learning only at the theoretical level cannot meet the needs of today's society. Moreover, traditional financial management teaching is teacher-led and students are passive recipients, a mode that fails to stimulate students' interest in and enthusiasm for learning [11-14]. Traditional teaching also relies mainly on examination results as the evaluation standard, ignoring students' creative thinking and practical application ability. With the rapid development of the economy and changing social needs, traditional teaching can no longer fully meet students' learning needs [15-18]. To adapt to the changes of the times, teaching strategies for financial management majors also need constant innovation, combining theory with practice, interactive teaching, and diversified evaluation methods, so as to meet the requirements of financial management in the new era and promote students' overall development [19-22].
In this study, reinforcement learning is combined with deep learning to design the state representation model and the decision model of a diversity-optimized recommendation algorithm for financial management teaching. A financial management teaching system based on this reinforcement learning recommendation algorithm is developed by organizing the learner module, the administrator module, and the database of the teaching system. In addition, several experiments are conducted, including recommendation accuracy measurement, performance testing of the teaching system, reward and outcome analysis of the teaching system, and Top-5 recommendation of financial management teaching materials for different students, to evaluate the merits of the proposed model in financial management teaching.
Reinforcement learning is an autonomous learning algorithm that adapts to changes in the environment: the environment provides continuous feedback, and the algorithm improves its performance by learning from updates to the environment state. The agent is continually rewarded or punished by the environment according to its actions and learns a set of behaviors adapted to the environment; this learned behavior is the optimal policy finally obtained by the algorithm. The learning process involves the state of the environment, the choice of actions, and the reward returned as feedback from the environment.
Figure 1 shows the basic framework of a reinforcement learning system. Besides the environment and the agent, a reinforcement learning process also includes the following six elements:
1) State set $S$: represents the possible states of the environment in which the agent is located, where the element $s_t$ denotes the environment state at moment $t$.
2) Action set $A$: represents all possible actions the agent may take.
3) State transfer probability: represents the probability of moving from one state to another, i.e., the probability of transferring to the next state given the current state and action.
4) Reward payoff: denotes the reward returned by the environment after the agent executes an action in the current state.
5) Strategy function: describes the mapping between states and actions, indicating which action the agent should choose in a given state.
6) Value function: also known as the utility function; it evaluates the impact of executing the current action in the current state on the future training process, representing the long-term reward payoff.
Basic reinforcement learning framework
The interaction process between the agent and the environment in reinforcement learning can be modeled as a Markov decision process. Reinforcement learning training must satisfy the Markov property, i.e., the upcoming state depends only on the current state, not on past states. The process can be described as a quadruple $(S, A, T, R)$: $S$ denotes the state set containing all states of the environment model, $A$ denotes the action set containing all actions that may be executed during training, $T$ denotes the state transfer probability matrix, and $R$ denotes the reward function.
The Q-learning algorithm is described as follows [23]: for a finite Markov decision process with state set $S$ and action set $A$, the agent maintains an action-value estimate $Q(s,a)$ for every state-action pair. Selection of actions during training balances exploration and exploitation, for example in an $\epsilon$-greedy manner, exploring a random action with a small probability and otherwise exploiting the action with the largest current $Q$ value. Q-learning adopts the optimal action-value update rule

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor, so that the estimated values are driven toward the optimal action-value function.
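To make the update rule concrete, the following minimal Python sketch implements tabular Q-learning with ε-greedy action selection. The environment interface (`env.reset()`, and `env.step()` returning the next state, reward, and a done flag), the state and action counts, and the hyperparameter values are illustrative assumptions, not settings reported in this paper.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy selection of the next action
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # temporal-difference update toward the optimal action value
            td_target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    return Q
```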
The essential idea of the attention mechanism [24-25] can be described as mapping a Query and the key-value pairs of the Source to an output, where Query, Key, Value, and the output are all vectors. By calculating the similarity or correlation between the Query and each Key, the weight coefficient of each Key's corresponding Value is obtained; the Values are then weighted and summed to obtain the final attention value. The essential idea can be abstracted as in Equation (4):

$$\mathrm{Attention}(Query, Source) = \sum_{i=1}^{L_x} \mathrm{Similarity}(Query, Key_i) \cdot Value_i \tag{4}$$

where $L_x$ denotes the length of the Source.
In a translation task, the Query can be regarded as a sequence of word vectors in the source language, while the Keys and Values can be regarded as sequences of word vectors in the target language. The general attention mechanism can thus be interpreted as calculating the similarity between the Query and the Keys and using this similarity to determine the attention relationship between the Query and the Values.
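As a minimal sketch of this weighted aggregation, the snippet below computes scaled dot-product similarity between Query and Key and uses a softmax over the similarities to weight and sum the Values; the tensor shapes in the example are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attention(query, key, value):
    """Scaled dot-product attention: weight each Value by the
    Query-Key similarity and sum (illustrative sketch)."""
    d_k = query.size(-1)
    # similarity between the Query and every Key
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    # normalize similarities into weight coefficients
    weights = F.softmax(scores, dim=-1)
    # weighted sum over the Values gives the final attention output
    return torch.matmul(weights, value)

# example: a batch of 2 queries attending over 5 key-value pairs of dimension 8
q = torch.randn(2, 1, 8)
k = torch.randn(2, 5, 8)
v = torch.randn(2, 5, 8)
out = attention(q, k, v)   # shape (2, 1, 8)
```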
Gated recurrent neural networks were proposed to better capture dependencies across large time-step distances in time series. They control the flow of information through learnable gates. A commonly used gated recurrent neural network is the gated recurrent unit (GRU) [26-27]; its design is shown in Fig. 2. The GRU replaces the forget and input gates of the LSTM with an update gate, and modifies the computation of the hidden state in recurrent neural networks through its reset gate and update gate.

GRU internal structure
The formulae for the reset gate and the update gate are, respectively:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r), \qquad z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
The candidate memory cell and the memory cell at the current moment are calculated as:

$$\tilde{h}_t = \tanh\!\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right), \qquad h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
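The gate equations above translate directly into code. The sketch below performs a single GRU step with explicit reset and update gates; the weight matrices, their shapes, and the example sizes are illustrative assumptions, and in practice `torch.nn.GRU` provides an equivalent, optimized implementation.

```python
import torch

def gru_step(x_t, h_prev, W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h):
    """One GRU time step with explicit reset/update gates (illustrative sketch)."""
    r_t = torch.sigmoid(x_t @ W_r + h_prev @ U_r + b_r)            # reset gate
    z_t = torch.sigmoid(x_t @ W_z + h_prev @ U_z + b_z)            # update gate
    h_tilde = torch.tanh(x_t @ W_h + (r_t * h_prev) @ U_h + b_h)   # candidate memory cell
    # the update gate interpolates between the previous state and the candidate
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde
    return h_t

# example usage with assumed sizes: input dim 8, hidden dim 16, batch of 4
W_r, W_z, W_h = (torch.randn(8, 16) for _ in range(3))
U_r, U_z, U_h = (torch.randn(16, 16) for _ in range(3))
b_r, b_z, b_h = (torch.zeros(16) for _ in range(3))
h_next = gru_step(torch.randn(4, 8), torch.zeros(4, 16),
                  W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h)
```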
This section introduces a deep reinforcement learning recommendation algorithm optimized for recommendation diversity. Based on the classical reinforcement learning algorithm and combined with the attention mechanism and a recurrent network model, it designs and implements a state representation model and a decision model, and improves the diversity of recommendation results by optimizing and improving the loss function.
The state representation model plays an important role for the agent in reinforcement learning. For the recommendation task in this paper, it provides the agent with dimension compression, context modeling, and generalization capabilities: it extracts the key information from the interaction data and transforms it into vector representations that the reinforcement learning agent can process.
Since the interaction sequence between the user and the agent contains both the agent's recommendation decisions and the user's feedback on the recommendation results, the state representation model needs a data processing layer to preprocess the data. To map the recommendation actions and user feedback rewards into a low-dimensional vector space and represent them as vectors with semantic information, an embedding layer and an encoding layer are designed for the recommendation actions and the user feedback rewards, respectively. A recurrent neural network is chosen as the bottom layer of the state representation model. The structure of the state representation model designed in this paper is shown in Figure 3.

Structure of the state representation model
The state representation model takes the interaction sequence between the user and the agent as input and outputs the state vector used by the agent.
The input interaction sequence first goes through data preprocessing, which separates the action records from the reward records to form an action sequence and a reward sequence.
The action embedding layer maps each recommended action into a low-dimensional dense embedding vector.
The reward coding layer uses one-hot coding to map each reward value into a sparse one-hot vector.
The recurrent neural network used in the state representation model processes the concatenated action and reward vectors step by step, and its hidden state at the final time step serves as the state representation passed to the decision model.
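A minimal PyTorch sketch of a state representation model of this kind is shown below; the embedding dimension, the number of reward levels, the hidden size, and the use of a single GRU layer are illustrative assumptions rather than the exact configuration of the model in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateRepresentation(nn.Module):
    """Maps an (action, reward) interaction sequence to a state vector (sketch)."""

    def __init__(self, n_items, action_dim=32, n_reward_levels=5, hidden_dim=64):
        super().__init__()
        self.action_embedding = nn.Embedding(n_items, action_dim)   # action embedding layer
        self.n_reward_levels = n_reward_levels                      # one-hot reward coding
        self.gru = nn.GRU(action_dim + n_reward_levels, hidden_dim, batch_first=True)

    def forward(self, actions, rewards):
        # actions: (batch, seq_len) item ids; rewards: (batch, seq_len) integer feedback levels
        a = self.action_embedding(actions)                          # (batch, seq_len, action_dim)
        r = F.one_hot(rewards, self.n_reward_levels).float()        # (batch, seq_len, n_reward_levels)
        x = torch.cat([a, r], dim=-1)                               # concatenate per time step
        _, h_n = self.gru(x)                                        # final hidden state
        return h_n.squeeze(0)                                       # state vector (batch, hidden_dim)
```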
The structure of the decision model is shown in Fig. 4. The decision model is the key component through which the agent makes action decisions. In the recommendation task of this paper, it is responsible for recommendation action selection, strategy learning, balancing exploration and exploitation, and personalized adaptation: it decides which items or contents to recommend to the user according to the current environment state and the learned recommendation strategy.

Structure of decision model
The decision model learns and optimizes the recommendation strategy by obtaining reward feedback through interaction with the user, which provides the reference for the agent's recommendation actions. At each moment, the decision model takes the state vector produced by the state representation model as input and outputs the recommendation action.
The second fully connected layer uses the soft-max activation function and is responsible for mapping the previously obtained item preference vector into a probability distribution over the candidate items, from which the recommendation action is selected.
For the recommendation task in this paper, the recommendation results of the agent should cover as many potentially valuable items as possible, rather than only those items to which the user has already given high reward values. Therefore, a dynamic constraint term based on the model entropy method is added to the loss function of the algorithm, as shown in Equation (14).
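The sketch below illustrates a decision model of this kind together with a standard entropy regularizer that penalizes overly concentrated recommendation distributions; the layer sizes are illustrative assumptions, and the regularizer is a stand-in for, not a reproduction of, the dynamic constraint term in Equation (14).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecisionModel(nn.Module):
    """Maps the state vector to a probability distribution over items (sketch)."""

    def __init__(self, state_dim=64, n_items=1000):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)       # first fully connected layer
        self.fc2 = nn.Linear(128, n_items)         # second layer, followed by soft-max

    def forward(self, state):
        preference = torch.relu(self.fc1(state))           # item preference vector
        return F.softmax(self.fc2(preference), dim=-1)     # recommendation probabilities

def diversity_loss(policy_probs, base_loss, beta=0.01):
    """Subtract a scaled policy entropy so that narrow (low-entropy) recommendation
    distributions are discouraged (illustrative stand-in for Equation (14))."""
    entropy = -(policy_probs * torch.log(policy_probs + 1e-8)).sum(dim=-1).mean()
    return base_loss - beta * entropy
```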
The overall architecture of the system is shown in Fig. 5. The architecture contains a network layer, a front-end part, and a back-end part. These three layers are described in detail below:
1) Network layer: as the main component supporting the operation of the course recommendation learning platform, the network layer guarantees user access to the system and mainly consists of Nginx, CDN, and so on.
2) Front-end part: the system is developed with front-end and back-end separation. The front-end part, used directly by users, implements modules such as the financial management course center, user center, course comments, orders, and Q&A.
3) Back-end part: the back-end is developed with the SpringBoot framework and implements modules such as tutor management, course management, user management, and order management.
Overall architecture of the course recommendation learning platform system
The course recommendation learning platform implemented in this research consists of a learner module and an administrator module. After analyzing the requirements and designing the architecture of the required functions, the system was designed with a modular approach: it is divided into a learner part and an administrator part, each of which is further divided into individual sub-functional modules whose design together finalizes the design of the whole system.
The learner module contains several sub-functions. The learner's operating procedure is as follows: the learner registers and logs in according to the system's requirements, after which the system automatically jumps to the home page, where the learner can use the search bar to search for courses or tutors by keyword. Once the system has collected enough information, it recommends courses to the user according to the user's previous learning behavior. Purchased courses can be viewed in the personal center and studied by clicking on them.
The financial management course recommendation learning module consists of several basic pages, including the system home page, the course classification page, the Q&A page, and the personal information settings page. On the basis of these basic pages, the learner function module is developed and designed.
Financial management course learning includes three parts: all courses, recommended courses, and the instructor view. Through "all courses" learners can browse every course. Course recommendations are generated by the system based on the learner's feedback on courses; using the learner's ratings of purchased courses, the system can recommend courses in areas the learner has not yet studied.
The administrator module mainly includes sub-functions such as administrator login, learner information management, tutor management, course management, course uploading, order management and learner comment management. Among them, learner management consists of two parts: learner basic information and recommendation service.
System operation flow for the administrator role: when using the system for the first time, the administrator enters the given login account and password, and the system verifies the login information. If verification passes, the login is successful and the administrator enters the system homepage. System administration includes administrator management and menu settings, the basic settings of the administrator module, through which new administrators can be added and new menus set up. Lecturer management includes the lecturer list, lecturer addition, and lecturer audit: when a new lecturer joins the learning platform, the administrator can add the lecturer's information through the add-lecturer function. The recommendation service mainly covers the system's recommendations for learners: based on a learner's progress in the financial management courses, courses are recommended to the learner, and the administrator can view the recommended course information through this module.
Data redundancy is a frequent problem in database design. In order to build a high-quality database that meets the requirements of the financial management teaching recommendation learning platform, this study takes the logical relationships between the data as the primary concern during database design. Therefore, the database design of the course recommendation learning platform needs to meet the following specifications:
1) The database design must satisfy the data integrity constraints composed of entity integrity, referential integrity, and user-defined integrity.
2) The design should ensure the atomicity of data and satisfy the third normal form, i.e., eliminate partial functional dependencies and transitive dependencies.
3) Naming conventions must be considered, i.e., a uniform style should be maintained. In addition to basic uniformity such as letter case, associated tables are usually named using underscores.
The experimental environment information of this paper is as follows.
Hardware configuration, CPU is Intel(R) Core(TM) i5-7500U; RAM is 32.00GB; graphics card is RTX 2080; hard disk space is 1TB.
For the software configuration, the operating system is Ubuntu 18.04.1; the integrated development environments are PyCharm and VSCode; the development language is Python 3.6; and the deep learning framework is PyTorch 1.6.0.
In this paper, the PyTorch machine learning framework is used to build the proposed TRDV and TRDP models. PyTorch is built with C++ as its base language and offers good ease of use, scalability, and flexibility. PyTorch also supports running models on both CPU and GPU.
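For instance, switching a model between CPU and GPU in PyTorch only requires selecting a device (a routine usage sketch, not code from this paper):

```python
import torch

# run on GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(16, 1).to(device)   # placeholder model moved to the device
x = torch.randn(4, 16, device=device)       # input created on the same device
y = model(x)                                 # forward pass runs on the selected device
```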
In order to study the advantages of the deep-attention-mechanism reinforcement learning recommender system over other mainstream recommender systems, TRDV, MostPop, TRDP, and ItemKNN are chosen as comparison models, and datasets of different sizes (100K and 1M) are used to compare training time and accuracy against the model in this paper, with accuracy as the evaluation metric. Since there is no significant difference in training time between the models on the 100K dataset, this section only shows the training time comparison on the 1M dataset. The training time comparison of the different models on the 1M dataset is shown in Fig. 6.

Training time of different models on the 1M dataset
As can be seen from the figure, the loss function of each model shows a downward trend over training time, and the model in this paper converges faster than the comparison models. At about 17 minutes of training, the loss value of this model converges to about 0.15, whereas the comparison models do not converge until after 30 minutes of training, with convergence values between 0.18 and 0.32.
Figure 7 illustrates the recommendation accuracy of each model on datasets of different sizes. The data show that a given model's recommendation accuracy does not differ significantly across dataset sizes, while there are differences between models. The model in this paper has the highest recommendation accuracy: 0.983, 0.962, and 0.972 on datasets 1, 2, and 3, respectively, which is 50% to 57.03% higher than the comparison models.

Recommendation accuracy of the models on datasets of different sizes
In summary, the advantages of combining the attention mechanism with recurrent neural networks in the autonomous recommendation process enable the reinforcement learning recommendation algorithm to achieve significant results in terms of both accuracy and convergence time.
In this section, the performance test of the financial management instructional system based on reinforcement learning recommendations is conducted with the aim of exploring the performance requirements of the system as well as its compatibility and stability when making instructional recommendations. The performance test results of the financial management teaching system are shown in Table 1.
Performance test results of the financial management teaching system
Project | CPU occupancy rate | Memory occupancy rate | Mean response time
---|---|---|---
Application server | 26.35% | 45.77% | 267 ms
Algorithm server (not running) | 17.62% | 14.96% |
Algorithm server (running) | 49.15% | 53.82% |
After a number of tests and modifications, the system ensures that under normal load the application server has an average CPU occupancy of 26.35% and an average memory occupancy of 45.77% per minute. The algorithm server has an average CPU occupancy of 17.62% and an average memory occupancy of 14.96% per minute when the algorithm is not running, and an average CPU occupancy of 49.15% and an average memory occupancy of 53.82% per minute when the algorithm is running. Correct results are returned for 200 concurrent accesses, with an average response time of 267 ms.
In order to better analyze the impact of the proposed system on the effectiveness of financial management teaching, two classes of sophomore students in the School of Economics and Management of a university were selected for a teaching simulation experiment. There was no significant difference in financial management knowledge between the two classes, and a financial management knowledge level test was conducted after an eight-week teaching experiment. The control class (CK class) was taught with the traditional financial management teaching mode, and the experimental class (T class) was taught using the system proposed in this paper. Figures 8 and 9 show the reward curve and the teaching results based on the financial management teaching recommendation system, respectively.

The reward curve of the teaching recommendation platform

The teaching results of the teaching recommendation platform
As can be seen from Fig. 8, the reward curve shows an upward trend and stabilizes at around 600 iterations, converging to about 1.70. This indicates that the agent of the system constructed in this paper can gradually improve its strategy and obtain higher cumulative rewards during the learning process.
The comparison of teaching results shows that the class taught with the proposed system has a higher passing rate, 38.14% higher than the control class, and 3 more students with high scores (>80 points) than the control class, so the average scores of the two classes differ significantly: 63.89 for the control class and 71.47 for the experimental class. The above experiments also show that the system obtains higher rewards, which is partly because the algorithm uses a large discount factor to raise scores quickly in the early stages and thus increase the overall reward. As the deep-attention-based reinforcement learning recommendation algorithm underlies the teaching system, it can meet the system's needs for the financial management teaching recommendation function.
In order to further verify the superiority of the proposed reinforcement learning recommendation model with deep attention mechanism for the financial management teaching system, five students who used the system were selected for financial management teaching material recommendation, and the recommendation results were visualized.
The financial management teaching materials are selected from a total of 10 titles: Financial Management, Enterprise Financial Diagnosis, Computerized Accounting, Investments, Financial Markets and Financial Institutions, Financial Statement Analysis, Corporate Finance, Management Accounting, Financial Budget Management, and Financial Risk Prevention. The system built in this paper recommends the Top-5 materials with the highest weighted ratings for each student. The Top-5 recommendation results for financial management teaching materials are shown in Figure 10.

Top-5 recommendations of financial management teaching materials
As can be seen from the figure, the weighted scores of the teaching materials recommended by the system differ markedly across students. Taking student 1 as an example, the Top-5 materials recommended by the system are Financial Markets and Financial Institutions, Investments, Corporate Finance, Management Accounting, and Financial Statement Analysis, with weighted scores of 97.89, 96.84, 95.68, 94.54, and 93.77, respectively, which further validates that the system places more relevant materials higher in the recommendation list.
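As an illustration of how such a Top-5 list is formed from weighted ratings, the sketch below simply sorts the candidate materials by score; the five highest values follow the student 1 example above, while the remaining values are placeholder assumptions, not system outputs.

```python
# weighted scores for one student over the ten candidate materials
# (top five taken from the student 1 example; the rest are illustrative placeholders)
scores = {
    "Financial Markets and Financial Institutions": 97.89,
    "Investments": 96.84,
    "Corporate Finance": 95.68,
    "Management Accounting": 94.54,
    "Financial Statement Analysis": 93.77,
    "Financial Management": 90.12,
    "Enterprise Financial Diagnosis": 88.40,
    "Computerized Accounting": 86.95,
    "Financial Budget Management": 85.31,
    "Financial Risk Prevention": 83.76,
}

# select the five materials with the highest weighted scores
top_5 = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
for rank, (title, score) in enumerate(top_5, start=1):
    print(f"{rank}. {title}: {score}")
```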
In this paper, a deep learning recommendation model incorporating the attention mechanism is designed using reinforcement learning and deep learning technology. The model is applied to financial management teaching, and a financial management teaching system based on reinforcement learning is developed. Through the design of its functional modules, the system can provide students with personalized recommendations of financial management teaching materials.
With 17 minutes of training, the loss value of the proposed model converges to about 0.15, and its recommendation accuracy on datasets of different sizes exceeds 0.95. The system returns accurate results to users with a fast average response time of 267 ms. The reward value of the system converges to about 1.7 at 600 iterations, and the passing rate of the financial management knowledge level test in the experimental class using the system is 38.14% higher than that of the control class. In addition, the Top-5 financial management teaching materials recommended by the system for student 1 are Financial Markets and Financial Institutions, Investments, Corporate Finance, Management Accounting, and Financial Statement Analysis, with weighted scores ranging from 93.77 to 97.89.