An Optimization Model of Financial Management Teaching Strategies Based on Reinforcement Learning

  
Introduction

Financial management is an important component of modern enterprise management and a basic skill that enterprises must master to survive and develop. As an important branch of management, it plays a vital role in enterprises' economic decision-making and resource allocation [1-4]. In institutions of higher education at home and abroad, financial management has become an independent discipline and, gradually, a compulsory course for management majors [5-6].

Financial management has long been regarded as an important business discipline and has cultivated a large number of financial management professionals. Traditional financial management teaching centers on the transmission of theoretical knowledge, with students passively listening to lectures in the classroom [7-10]. However, learning at the theoretical level alone cannot meet the needs of today's society. Moreover, traditional financial management teaching is teacher-led and students accept knowledge passively, a mode that fails to stimulate students' interest and enthusiasm for learning [11-14]. Traditional teaching also relies mainly on examination results as the evaluation standard, neglecting students' creative thinking and practical application ability. With the rapid development of the economy and changes in social needs, traditional teaching can no longer fully satisfy students' learning needs [15-18]. To adapt to the changing times, the teaching strategies of financial management programs must be continuously innovated, combining theory with practice, interactive teaching, and diversified evaluation methods, so as to meet the requirements of financial management in the new era and promote students' overall development [19-22].

In this study, reinforcement learning is applied to financial management teaching: a state representation model and a decision model for a diversity-optimized recommendation algorithm are designed by integrating deep learning models. A financial management teaching system based on this reinforcement learning recommendation algorithm is then developed by organizing the learner module, the administrator module, and the database of the teaching system. In addition, several experiments are conducted, including measurement of recommendation accuracy, performance testing of the teaching system, analysis of rewards and teaching outcomes, and Top-5 recommendation of financial management teaching materials for different students, to evaluate the merits of the proposed model in financial management teaching.

System for optimizing financial management teaching based on reinforcement learning
Reinforcement learning techniques for teaching financial management
Reinforcement learning model

Reinforcement learning is an autonomous learning paradigm in which an agent adapts to its environment through continuous feedback: the environment's state is updated, the agent learns from these updates, and the algorithm's performance improves accordingly. The agent is continually rewarded or punished by the environment according to its actions and thereby learns a set of behaviors adapted to the environment, which constitutes the optimal policy finally obtained by the algorithm. This learning process involves the state of the environment, the choice of actions, and the reward returned by the environment as feedback.

Figure 1 shows the basic framework of a reinforcement learning system. Besides the environment and the agent, a reinforcement learning process includes the following six elements:

1) State set: represents the possible states of the environment in which the agent is located, where element $s_t$ denotes the state of the environment at moment t.

2) Action set: represents all possible actions the agent can take.

3) State transition probability: represents the probability of transferring from one state to another, i.e., the probability of moving to the next state given the current state $s_t$ and action $a_t$.

4) Reward payoff: denotes the feedback reward obtained by executing action $a_t$ in state $s_t$ at moment t. It can refer to the immediate reward received at that moment, or to the value of the rewards obtained throughout the subsequent training process as a result of executing the policy in that state.

5) Policy function: the policy function describes the mapping between states and actions. It indicates which action the agent should choose in state $s_t$ at moment t, and can generally be expressed as $a_t = \pi(s_t)$.

6) Value function: the value function, also known as the utility function, evaluates the impact of executing the current action in the current state on the future training process, representing the long-term reward payoff.

Figure 1.

Basic reinforcement learning framework
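
To make these six elements concrete, the following minimal Python sketch runs one episode of agent-environment interaction. The toy state set, action set, environment dynamics, and random placeholder policy are illustrative assumptions, not the teaching environment used later in the paper.

```python
import random

# Hypothetical stand-ins for the elements above: a state set, an action set,
# an environment step function, and a policy function.
STATES = ["s0", "s1", "s2"]      # state set S
ACTIONS = ["a0", "a1"]           # action set A

def env_step(state, action):
    """Environment dynamics: return (next_state, reward) for (state, action).
    Transitions and rewards here are illustrative only."""
    next_state = random.choice(STATES)                          # state transition
    reward = 1.0 if (state, action) == ("s1", "a1") else 0.0    # reward feedback
    return next_state, reward

def policy(state):
    """Policy function a_t = pi(s_t); a random placeholder policy."""
    return random.choice(ACTIONS)

# One episode of agent-environment interaction
state, episode_return = "s0", 0.0
for t in range(10):
    action = policy(state)                    # agent selects an action
    state, reward = env_step(state, action)   # environment returns next state and reward
    episode_return += reward                  # accumulate long-term reward payoff
print("episode return:", episode_return)
```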

Markov decision process

The interaction between the agent and the environment in reinforcement learning training can be modeled as a Markov decision process; that is, the upcoming state depends only on the current state, not on past states. The process can be described as follows: S denotes the state set containing all states of the environment model, A denotes the action set containing all actions that may be executed during training, T denotes the state transition probability matrix, and $P_{sa}$ denotes the probability of selecting action a in state s and then transferring to the next state s', as shown in equation (1): $T(s,a,s') = P_{sa}(s_{t+1}=s' \mid s_t=s, a_t=a)$

Selecting action a in state s and subsequently transferring to state s' also yields an immediate payoff value r = R(s,a,s'), as shown in equation (2): $R(s,a,s') = E\{r_{t+1} \mid s_t=s, a_t=a, s_{t+1}=s'\}$
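
As a concrete illustration of $T(s,a,s')$ and $R(s,a,s')$, the sketch below stores an assumed toy transition model and reward table and samples one transition; all numbers are made up for illustration.

```python
import random

# Illustrative MDP tables (assumed values): T[s][a] maps each next state s' to
# P(s' | s, a), and R[(s, a, s')] gives the immediate payoff for that transition.
T = {
    "s0": {"a0": {"s0": 0.7, "s1": 0.3}, "a1": {"s1": 1.0}},
    "s1": {"a0": {"s0": 0.5, "s1": 0.5}, "a1": {"s1": 1.0}},
}
R = {("s0", "a1", "s1"): 1.0}   # transitions not listed yield reward 0

def sample_transition(s, a):
    """Sample s' from P_sa and look up the immediate reward r = R(s, a, s')."""
    next_states, probs = zip(*T[s][a].items())
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R.get((s, a, s_next), 0.0)

print(sample_transition("s0", "a1"))   # e.g. ('s1', 1.0)
```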

Q-learning algorithm

The Q-learning algorithm is described as follows [23]: for an agent in a finite Markov process, the state set S denotes the agent's possible states in the environment, and the action set A denotes the actions that can be executed in each state. In the starting state s, the agent selects an action a (a ∈ A) to execute based on an action selection policy; through interaction with the environment, the agent's state is transferred from the current state s to the next state s', while it receives an immediate reward r and modifies the Q value according to an update rule.

Q-learning updates using the optimal $Q(s_{t+1}, a)$, which is equivalent to a policy that always takes the action with the maximum Q value. The update formula for the action value function of the Q-learning algorithm is shown in equation (3): $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t)\right]$ where α is the learning rate, indicating the rate at which the original Q value is replaced by the new Q value during updating, with α ∈ [0,1].
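
A minimal tabular sketch of this update rule is given below; the states, actions, and hyperparameters are illustrative placeholders.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch of the update rule in equation (3).
ACTIONS = ["a0", "a1"]
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount factor, exploration rate
Q = defaultdict(float)                    # Q[(state, action)] -> action value

def choose_action(state):
    """Epsilon-greedy action selection policy."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

q_update("s0", choose_action("s0"), 1.0, "s1")
print(dict(Q))
```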

Deep learning network models
Attention mechanisms

The essential idea of the attention mechanism [24-25] can be described as mapping a Query and the Key-Value pairs in the Source of a given task element to an output, where Query, Key, Value and the output are all vectors. By calculating the similarity or correlation between the Query and each Key, the weight coefficient of each Key's corresponding Value is obtained; the Values are then weighted and summed to give the final attention value. This essential idea can be abstracted as in equation (4): $Attention(Query, Source) = \sum_{i=1}^{L_x} Similarity(Query, Key_i) \cdot Value_i$ where $L_x$ denotes the length of the Source.

In a translation task, the Query can be regarded as a sequence of word vectors in the source language, while the Keys and Values can be regarded as sequences of word vectors in the target language. The general attention mechanism can thus be interpreted as calculating the similarity between the Query and each Key and using this similarity to determine the attention relationship between the Query and the Values.
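
The sketch below illustrates equation (4), using a scaled dot product as one possible choice of similarity function; the dimensions are arbitrary example values.

```python
import numpy as np

def attention(query, keys, values):
    """Weighted sum of values, with weights derived from query-key similarity.
    query: (d,), keys: (L, d), values: (L, dv) -> attended vector (dv,)."""
    scores = keys @ query / np.sqrt(query.shape[0])   # Similarity(Query, Key_i)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # normalized weight coefficients
    return weights @ values                           # sum_i weight_i * Value_i

# Illustrative shapes only
q = np.random.randn(8)        # Query vector
K = np.random.randn(5, 8)     # five Key vectors
V = np.random.randn(5, 16)    # five Value vectors
print(attention(q, K, V).shape)   # (16,)
```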

Gated recurrent neural networks

Gated recurrent neural networks were proposed to better capture dependencies over large time-step distances in time series. They control the flow of information through gates that can be learned. One commonly used gated recurrent neural network is the gated recurrent unit (GRU) [26-27], whose internal design is shown in Fig. 2. The GRU replaces the forget and input gates of the LSTM with an update gate, and modifies the computation of hidden states in the recurrent neural network through its reset gate and update gate.

Figure 2.

GRU internal structure

The formulas for the reset gate and the update gate are, respectively: $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$ and $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$

The candidate memory cell and the memory cell at the current moment are calculated as: $\hat{h}_t = \tanh(W_h x_t + U_h(r_t \circ h_{t-1}) + b_h)$ and $h_t = (1 - z_t) \circ \hat{h}_t + z_t \circ h_{t-1}$
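
A minimal NumPy sketch of one GRU step following the reset-gate, update-gate, and candidate-memory formulas above; all weight shapes and inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, Wr, Ur, br, Wz, Uz, bz, Wh, Uh, bh):
    """One GRU time step: reset gate, update gate, candidate cell, new hidden state."""
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)             # reset gate r_t
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)             # update gate z_t
    h_hat = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev) + bh)   # candidate memory cell
    return (1.0 - z_t) * h_hat + z_t * h_prev              # new hidden state h_t

# Illustrative dimensions: input size 3, hidden size 4
n_in, n_h = 3, 4
def gate_params():
    return np.random.randn(n_h, n_in), np.random.randn(n_h, n_h), np.zeros(n_h)

(Wr, Ur, br), (Wz, Uz, bz), (Wh, Uh, bh) = gate_params(), gate_params(), gate_params()
h = np.zeros(n_h)
for x_t in np.random.randn(5, n_in):        # run a short input sequence
    h = gru_cell(x_t, h, Wr, Ur, br, Wz, Uz, bz, Wh, Uh, bh)
print(h)
```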

Recommendation Algorithm for Optimizing the Diversity of Teaching Materials

This section introduces a deep reinforcement learning recommendation algorithm optimized for recommendation diversity. Building on a classical reinforcement learning algorithm and combining the attention mechanism with a recurrent network model, it designs and implements a state representation model and a decision model, and improves the diversity of recommendation results by optimizing the loss function.

State representation model design

The state representation model plays an important role for agents in reinforcement learning. For the recommendation task in this paper, it provides the agent with dimension compression, context modeling, and generalization capabilities, extracting the key information from the interaction data and transforming it into vector representations that the reinforcement learning agent can process.

Since the interaction sequence between the user and the agent contains both the agent's recommendation decisions and the user's feedback on the recommendation results, the state representation model needs a data processing layer to preprocess the data. To map the recommendation actions and user feedback rewards into a low-dimensional vector space and represent them as vectors with semantic information, an embedding layer and an encoding layer are designed for the recommendation actions and the user feedback rewards, respectively. A recurrent neural network is chosen as the underlying layer of the state representation model. The structure of the state representation model designed in this paper is shown in Figure 3.

Figure 3.

Structure of the state representation model

The state representation model takes the interaction sequence $A_0, R_0, A_1, R_1, \ldots, A_{t-1}, R_{t-1}$ as input and outputs a fixed-length vector $S_t$; its role is to compute the state representation vector $S_t$ of the environment at moment t based on the user's interaction records with the recommender system up to moment t-1.

The input interaction sequence first goes through data preprocessing, which separates the action records from the reward records to form an action sequence $A_0, A_1, \ldots, A_{t-1}$ and a reward sequence $R_0, R_1, \ldots, R_{t-1}$. The action sequence and reward sequence are then fed into the action embedding layer and the reward coding layer, respectively, which transform the action and reward information into vector representations.

The action embedding layer maps each recommended action into a k-dimensional vector in the embedding space, as shown in equation (9): $E(A_i) = E_{k \times num\_items} \cdot O_{A_i}, \quad i \in [0, num\_items]$ where $O_{A_i}$ is a one-dimensional vector of length num_items whose element at position $A_i$ is 1 and whose remaining elements are 0.

The reward coding layer uses one-hot coding to map the reward value into an l-dimensional vector of uniform length, as shown in the following equation: $E(R_i) = onehot\left(floor\left(l \cdot \frac{b - R_i}{b - a}\right),\ l\right), \quad i \in [0, num\_rewards]$ where l is the encoding dimension and the length of the output vector, floor(x) rounds x down to the largest integer not greater than x, (a, b] is the range of values of the reward R, and onehot(i, l) outputs an l-dimensional vector whose i-th position is 1 and whose remaining positions are 0.

The recurrent neural network used in the state representation model contains t hidden-layer nodes, corresponding to the t interaction records input to the model over time steps 0 to t-1. Its computation is given by: $h_1 = \tanh(W[h_0, E_{A_0}, E_{R_0}] + b)$, $h_2 = \tanh(W[h_1, E_{A_1}, E_{R_1}] + b)$, $\ldots$, $S_t = h_t = \tanh(W[h_{t-1}, E_{A_{t-1}}, E_{R_{t-1}}] + b)$ where t is the time step of the interaction sequence, tanh(x) is the hyperbolic tangent activation function mapping its input to values between -1 and 1, W is the hidden-layer weight matrix, and b is the bias term. Each hidden-layer node simultaneously receives data from the action embedding layer, the reward coding layer, and the previous node. Taking the t-th hidden-layer node as an example, its inputs are the action vector $E_{A_{t-1}}$, the reward vector $E_{R_{t-1}}$, and the hidden state of the previous node $h_{t-1}$; finally, the recurrent neural network outputs a vector representation of the current environment state, which contains the state information and the context in the interaction record.
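
The following NumPy sketch puts the action embedding, the reward one-hot coding, and the recurrence together. num_items, k, l, d, the reward range, and all weights are assumed example values, and a plain tanh recurrence stands in for the recurrent network used in the system.

```python
import numpy as np

num_items, k = 10, 4          # number of recommendable items, embedding size
l, a, b = 5, 0.0, 5.0         # reward code length and reward value range (a, b]
d = 8                         # length of the environment state vector S_t

E = np.random.randn(k, num_items)      # embedding matrix E_{k x num_items}
W = np.random.randn(d, d + k + l)      # shared hidden-layer weights
b_vec = np.random.randn(d)             # bias term

def embed_action(action_id):
    """Action embedding: multiplying E by a one-hot vector is a column lookup."""
    return E[:, action_id]

def encode_reward(r):
    """Reward coding: map a reward in (a, b] to an l-dimensional one-hot vector."""
    idx = min(int(np.floor(l * (b - r) / (b - a))), l - 1)
    code = np.zeros(l)
    code[idx] = 1.0
    return code

def state_representation(action_seq, reward_seq):
    """h_i = tanh(W [h_{i-1}, E(A_{i-1}), E(R_{i-1})] + b); the last hidden state is S_t."""
    h = np.zeros(d)
    for a_i, r_i in zip(action_seq, reward_seq):
        x = np.concatenate([h, embed_action(a_i), encode_reward(r_i)])
        h = np.tanh(W @ x + b_vec)
    return h

S_t = state_representation([2, 7, 1], [3.5, 1.0, 4.8])
print(S_t.shape)  # (8,)
```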

Decision modeling

The structure of the decision model is shown in Fig. 4. The decision model is the key component through which the agent makes action decisions. In the recommendation task in this paper, it is mainly responsible for recommendation action selection, policy learning, balancing exploration and exploitation, and personalized adaptation, deciding which items or contents to recommend to the user according to the current environment state and the learned recommendation policy.

Figure 4.

Structure of decision model

The decision model learns and optimizes the recommendation policy from the reward feedback obtained through interaction with the user, which in turn guides the agent's recommendation actions. At moment t, the interaction sequence passes through the state representation model to obtain the current environment state $S_t$; the probabilities of executing each recommendation action are then calculated by two fully connected layers. The first fully connected layer contains a number of nodes equal to the number of recommendable items, and the preference vector of candidate recommended items $h_t$ is calculated from the input $S_t$, as shown in equation (12): $h_t = \Theta_{num\_items \times d} \cdot S_t$ where Θ is the weight matrix, d is the length of the environment state representation vector $S_t$, and each element of $h_t$ represents the agent's preference for selecting the corresponding item in the current state $S_t$.

The second fully connected layer applies the softmax activation function, which maps the item preference vector $h_t$ obtained above to the action probability $P_t$, as shown in equation (13): $P_t^{(i)} = \frac{e^{h_t^{(i)}}}{\sum_{j=1}^{num\_items} e^{h_t^{(j)}}}, \quad i \in [1, num\_items]$
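
A hedged PyTorch sketch of the two layers of the decision model: a linear layer producing the preference vector $h_t$ from $S_t$, and a softmax producing the action probabilities $P_t$. The layer sizes are illustrative.

```python
import torch
import torch.nn as nn

d, num_items = 8, 10                                     # example sizes
preference_layer = nn.Linear(d, num_items, bias=False)   # weight matrix Theta

def decide(state_vec):
    """Return the recommendation probability over all items for state S_t."""
    h_t = preference_layer(state_vec)        # preference values of candidate items
    p_t = torch.softmax(h_t, dim=-1)         # action probabilities P_t
    return p_t

S_t = torch.randn(d)
P_t = decide(S_t)
action = torch.multinomial(P_t, 1).item()    # sample the recommended item
print(P_t.sum().item(), action)
```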

For the recommendation task in this paper, we would like the agent's recommendations to cover as many potentially valuable items as possible, rather than only items to which the user has already given high rewards. Therefore, a dynamic constraint term based on the entropy method is added to the loss function of the algorithm, as shown in equation (14): $\theta_{t+1} \leftarrow \theta_t + \alpha\left(\left(R_t + \gamma\hat{\nu}_u(S_{t+1}) - \hat{\nu}_u(S_t)\right)\nabla_\theta \ln \pi_\theta(S_t, A_t) - e_f\sum_{i=1}^{|A|}\nabla_\theta\left(\pi_\theta(A_i, S_t)\ln \pi_\theta(A_i, S_t)\right)\right)$ where $\pi_\theta(S_t, A_t)$ represents the probability of taking action $A_t$ in state $S_t$, $\nabla_\theta$ represents the gradient of the constraint term with respect to θ, and $e_f$ is the weight coefficient. Adding this loss function with a dynamic constraint term makes the agent's recommendation policy tend toward future rewards, thereby improving the diversity of recommendation results.
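
A hedged PyTorch sketch of an update in the spirit of equation (14), written as a loss to minimize: an advantage-weighted log-probability term plus the entropy constraint term weighted by $e_f$. The small linear policy (standing in for the decision model above), the value estimates passed in, and all hyperparameters are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

d, num_items = 8, 10
policy = nn.Linear(d, num_items, bias=False)       # stands in for the decision model
e_f, gamma, lr = 0.01, 0.95, 1e-3                   # entropy weight, discount, learning rate
optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

def update(S_t, A_t, R_t, v_hat_t, v_hat_next):
    """One policy update: advantage * log pi(A_t|S_t) plus the e_f-weighted entropy constraint."""
    probs = torch.softmax(policy(S_t), dim=-1)       # pi_theta(. | S_t)
    log_probs = torch.log(probs + 1e-8)
    advantage = R_t + gamma * v_hat_next - v_hat_t   # R_t + gamma*v(S_{t+1}) - v(S_t)
    entropy_term = (probs * log_probs).sum()         # sum_i pi ln pi (negative entropy)
    loss = -advantage * log_probs[A_t] + e_f * entropy_term
    optimizer.zero_grad()
    loss.backward()                                  # gradients give the update direction
    optimizer.step()

update(torch.randn(d), A_t=3, R_t=1.0, v_hat_t=0.2, v_hat_next=0.4)
```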

Teaching system for financial management based on reinforcement learning recommendations
Overall functional structure of the system

The overall architecture of the system is shown in Fig. 5. The architecture consists of a network layer, a front-end part, and a back-end part, which are described in detail below:

1) Network layer: as the main part supporting the operation of the course recommendation learning platform, the network layer guarantees users' access to the system; it mainly consists of Nginx, a CDN, and related components.

2) Front-end part: the system is developed with front-end and back-end separation. The front-end part, used directly by users, implements modules such as the financial management course center, user center, course comments, orders, and Q&A.

3) Back-end part: the back end is developed with the SpringBoot framework and implements modules such as tutor management, course management, user management, and order management.

Figure 5.

Overall architecture of the course recommendation learning platform system

The course recommendation learning platform implemented in this research consists of a learner module and an administrator module. After analyzing the requirements and designing the architecture of the platform's functions, the system was designed using the idea of modularization: it is divided into a learner part and an administrator part, each of which is further divided into individual sub-functional modules, and the design of each sub-functional module completes the design of the whole system.

Functional design of the learner module

The learner module contains several sub-functions. The learner's operating procedure is as follows: after registering and logging in according to the system's requirements, the learner is automatically taken to the home page, where courses or tutors can be searched by keyword in the search bar. Once the system has collected enough information, it recommends courses to the user according to the user's previous learning behavior; courses the learner has purchased can be viewed in the personal center and studied by clicking on them.

The financial management course recommendation learning module consists of several basic pages, including the system home page, the course classification page, the Q&A page, and the personal information settings page. On the basis of these basic pages, the learner function module is developed and designed.

Financial management course learning includes three parts: all courses, recommended courses, and the instructor view. Through the all-courses part, learners can browse every course. Course recommendations are generated by the system based on the learner's feedback on courses; based on the learner's ratings of purchased courses, the system can recommend courses in areas the learner has not yet studied.

Administrator module functionality design

The administrator module mainly includes sub-functions such as administrator login, learner information management, tutor management, course management, course uploading, order management and learner comment management. Among them, learner management consists of two parts: learner basic information and recommendation service.

Operation flow for the administrator role: when using the system for the first time, the administrator enters the assigned login account and password, and the system verifies the login information. If verification passes, the login succeeds and the administrator enters the system homepage. System administration includes administrator management and menu settings, the basic settings of the administrator module, through which new administrators can be added and new menus set up. Lecturer management includes the lecturer list, lecturer addition, and lecturer audit; when a new lecturer joins the learning platform, the administrator can enter the lecturer's information through the add-lecturer function. The recommendation service covers the system's recommendations to learners: based on each learner's progress in the financial management course, the system provides course recommendation services, and the administrator can view the recommended course information through this module.

Database design

Data redundancy is a frequent problem in database design. To build a high-quality database that meets the requirements of the financial management teaching recommendation learning platform, the logical relationships between the data must be the foremost concern during database design. Therefore, the database design of the course recommendation learning platform needs to meet the following specifications:

1) In the database design, it is necessary to satisfy the data integrity constraints that are jointly composed of entity integrity, referential integrity and user-defined integrity.

2) The database design should ensure the atomicity of data and should also satisfy the third normal form, i.e., eliminate partial functional dependencies and transitive dependencies.

3) Database design also requires attention to naming conventions, i.e., maintaining a uniform style. In addition to basic consistency such as letter case, association tables are usually named after the base tables they link, joined with underscores.

Performance testing of the teaching system and analysis of the effectiveness of recommendations

The experimental environment information of this paper is as follows.

Hardware configuration: the CPU is an Intel(R) Core(TM) i5-7500U, the RAM is 32.00 GB, the graphics card is an RTX 2080, and the hard disk space is 1 TB.

Software configuration: the operating system is Ubuntu 18.04.1, the integrated development environments are PyCharm and VSCode, the development language is Python 3.6, and the deep learning framework is PyTorch 1.6.0.

The PyTorch machine learning framework is used to build the TRDV and TRDP models. PyTorch has C++ as its base language and offers good ease of use, scalability, and flexibility; it also supports running models on both CPU and GPU.

Deep reinforcement learning model training and recommendation accuracy evaluation

To study the advantages of the deep attention mechanism reinforcement learning recommender system over other mainstream recommender systems, this section chooses TRDV, MostPop, TRDP, and ItemKNN as comparison models and uses datasets of different sizes (100K, 1M) to compare them with the model in this paper in terms of training time and accuracy. Since there is no significant difference in training time among the models on the 100K dataset, this section only shows the training time comparison on the 1M dataset. The training time comparison results of the different models on the 1M dataset are shown in Fig. 6.

Figure 6.

Training of different models on the 1M dataset

As can be seen from the figure, the loss function of each model decreases over training time, and the model in this paper converges faster than the comparison models. At about 17 minutes of training, its loss function converges to about 0.15, whereas the comparison models only converge after roughly 30 minutes, with convergence values between 0.18 and 0.32.

Figure 7 illustrates the recommendation accuracy of each model on datasets of different sizes. The data show that the recommendation accuracy of a given model does not differ significantly across dataset sizes, while there are differences between models. The model in this paper achieves the highest recommendation accuracy, 0.983, 0.962, and 0.972 on datasets 1, 2, and 3 respectively, which is 50% to 57.03% higher than the comparison models.

Figure 7.

Recommendation accuracy of the models on datasets of different sizes

In summary, combining the attention mechanism and recurrent neural networks in the autonomous recommendation process enables the reinforcement learning recommendation algorithm to achieve significant results in both accuracy and convergence time.

Financial Management Teaching System Performance Test and Analysis

In this section, the performance test of the financial management instructional system based on reinforcement learning recommendations is conducted with the aim of exploring the performance requirements of the system as well as its compatibility and stability when making instructional recommendations. The performance test results of the financial management teaching system are shown in Table 1.

Table 1. Performance test results of the financial management teaching system

Item                             CPU occupancy rate   Memory occupancy rate   Mean response time
Application server               26.35%               45.77%                  267 ms
Algorithm server (not running)   17.62%               14.96%                  —
Algorithm server (running)       49.15%               53.82%                  —

After a number of tests and modifications, the system ensures that under normal load the application server has an average CPU occupancy of 26.35% per minute and an average memory occupancy of 45.77% per minute. The algorithm server has an average CPU occupancy of 17.62% and an average memory occupancy of 14.96% per minute when the algorithm is not running, and an average CPU occupancy of 49.15% and an average memory occupancy of 53.82% per minute when the algorithm is running. Correct results are returned for 200 concurrent accesses, with an average response time of 267 ms.

Analysis of Rewards and Outcomes Based on the Teaching System

To better analyze the impact of the proposed system on the effectiveness of financial management teaching, two classes of sophomore students in the School of Economics and Management of a university were selected for a teaching simulation experiment. There was no significant difference in the financial management knowledge level between the two classes, and a financial management knowledge test was conducted after an eight-week teaching experiment. The control class (CK class) was taught in the traditional financial management mode, and the experimental class (T class) was taught using the system proposed in this paper. Figures 8 and 9 show, respectively, the reward curve and the teaching results based on the financial management teaching recommendation system.

Figure 8.

The reward curve of the teaching recommendation platform

Figure 9.

The teaching results of the teaching recommendation platform

As can be seen from Fig. 8, the reward curve shows an upward trend, stabilizes after about 600 iterations, and converges to around 1.70. This indicates that the agents of the system constructed in this paper can gradually improve their policies and obtain higher cumulative rewards during the learning process.

The comparison of teaching results shows that the class taught with the proposed system has a higher passing rate, 38.14% higher than that of the control class, and three more students with high scores (>80 points); the average scores therefore differ significantly, at 63.89 for the control class and 71.47 for the experimental class. These experiments show that the proposed system obtains higher rewards, which is attributable to the algorithm using a large discount factor so that scores improve quickly in the early stages and the overall reward increases. The deep attention mechanism based reinforcement learning recommendation algorithm used in the teaching system can thus meet the system's needs for the financial management teaching recommendation function.

Top-5 Recommendation of Financial Management Teaching Materials

To further verify the superiority of the reinforcement learning recommendation model with a deep attention mechanism proposed in this paper for the financial management teaching system, five students who used the system were selected for financial management teaching material recommendation, and the recommendation results were visualized.

Ten financial management teaching materials are available for selection: Financial Management, Enterprise Financial Diagnosis, Computerized Accounting, Investment, Financial Markets and Financial Institutions, Financial Statement Analysis, Corporate Finance, Management Accounting, Financial Budget Management, and Financial Risk Prevention. The system built in this paper recommends to each student the Top-5 materials with the highest weighted ratings. The results of the Top-5 recommendation of financial management teaching materials are shown in Figure 10.

Figure 10.

Top-5 recommendation results of financial management teaching materials

As can be seen from the figure, the weighted scores of the materials recommended by the system differ noticeably between students. Taking student 1 as an example, the Top-5 teaching materials recommended by the system are Financial Markets and Financial Institutions, Investment, Corporate Finance, Management Accounting, and Financial Statement Analysis, with weighted scores of 97.89, 96.84, 95.68, 94.54, and 93.77 in that order, which further validates that the system places more relevant materials higher in the recommendation list.

Conclusion

In this paper, a deep learning recommendation model incorporating the attention mechanism is designed using reinforcement learning and deep learning technology. The model is applied to financial management teaching, and a financial management teaching system based on reinforcement learning is developed. Through the design of its functional modules, the system can provide students with personalized recommendations of financial management teaching materials.

After about 17 minutes of training, the loss function value of the proposed model converges to about 0.15, and its recommendation accuracy exceeds 0.95 on datasets of different sizes. The system returns accurate results to the user with a fast average response time of 267 ms. The reward value of the system converges to about 1.7 after 600 iterations, and the passing rate of the financial management knowledge test in the experimental class using the system is 38.14% higher than that of the control class. In addition, the Top-5 financial management teaching materials recommended by the system for student 1 are Financial Markets and Financial Institutions, Investment, Corporate Finance, Management Accounting, and Financial Statement Analysis, with weighted scores ranging from 93.77 to 97.89.