Research on Dynamic Prediction Model of Consumer Credit Risk under Fintech Innovation

In recent years, commercial banks have stepped up their promotion of consumer loans, and the scale of consumer lending has grown accordingly [1]. China’s economic development has raised household wealth and greatly improved people’s spending power, opening up new room for the consumer credit business. Personal consumer credit will become an important source of income for commercial banks in the future [2]. Consumer credit is currently one of the important businesses within commercial banks’ comprehensive financial services, and it plays an important role in helping banks reduce homogeneous competition and find a differentiated development path [3]. As China’s level of economic development continues to improve, people’s quality of life and disposable income have increased substantially, and the overall demand for consumer credit has grown rapidly, which in turn promotes economic development. The personal credit market allows residents to obtain financial support through consumer credit and thereby further satisfy their consumption needs, greatly improving their consumption [4-5].

Nowadays, the traditional banking business can no longer meet banks’ need for diversified development, and this business approach is in urgent need of transformation. Under enormous competitive pressure, banks have gradually begun to optimize and adjust their business strategies [6]. On the one hand, with continuous economic progress, residents’ income and consumption demand have both increased significantly, and financial institutions have gradually shifted their human, financial, and material resources to the field of personal consumer credit [7]. On the other hand, the low risk of personal consumer credit transactions can effectively improve the asset quality structure of financial institutions [8]. Because the delinquency rate of consumer credit is much lower than that of corporate credit and its profit margins are higher, competition in the consumer credit market is becoming more and more intense. The personal consumer credit business has therefore become a new source of income for commercial banks, and it will be the most important development direction of this field in the next period. The consumer credit business of commercial banks continues to grow [9-10]. Personal consumer credit is one of the important ways for banks to generate profits, and the sound development of this business can substantially increase the profits of Chinese commercial banks, but it also increases their business risks [11]. Generally speaking, carrying out any business brings certain operational risks, and this is especially true of the growing volume of personal credit transactions. Although the total volume of personal consumer credit transactions in China is growing rapidly, the business risks faced are becoming more diverse, which requires commercial banks to improve their risk management awareness [12-13]. For commercial banks, identifying and managing risk is usually the most important problem they face when developing personal consumer credit, and establishing a sound risk early warning indicator system and risk early warning model is the most important task [14].

The existing risk early warning indicator system of commercial banks relies heavily on the People’s Bank of China (PBOC) credit system. However, structural defects limit the coverage of the PBOC credit system [15]. In addition, because of inherent design problems, the system cannot record customers’ recent credit behaviors, so its credit data lag behind and lack timeliness and comprehensiveness. Therefore, relying solely on this traditional credit assessment method can no longer meet the needs of commercial banks for consumer credit risk management in the era of rapid development of big data [16-18].

Consumer credit is developing and spreading at an increasing pace, and the risk assessment of consumer credit is becoming more and more important. Current research on consumer credit risk assessment mainly explores how big data technology, artificial intelligence, and machine learning algorithms can optimize and improve credit risk assessment strategies, and it also keeps trying to discover and improve the index elements of credit risk assessment models. Literature [19] surveyed residents’ income, housing value, education level, and occupational status, examined how well residents’ credit scores and SEP measures predict self-assessed health, found that credit scores and SEP indicators are moderately correlated, and pointed out that credit scores can effectively complement SEP indicators. Literature [20], based on a literature review, reveals that artificial intelligence technology and machine learning algorithms can analyze information asymmetry, adverse selection, and moral hazard among consumer credit customers using public information data. Literature [21] proposed a credit risk assessment model that considers political and economic crisis factors and ran simulation tests on it, which showed that the model’s predictions were basically in line with the actual situation of non-performing loans and gained the unanimous approval of experts. Literature [22] discusses multi-rule based decision making (MRDM), which has great potential in consumer credit assessment, and uses actual cases to compare the roles, advantages, and disadvantages of two MRDM methods in assisting the assessment of consumer credit.
Literature [23] used bibliometric analysis to examine journal articles on the combination of credit risk assessment and big data technologies, found that big data technology in credit assessment practice is a hot and growing research topic in the field, and noted that current research is effective in improving people’s knowledge and understanding of big data in credit risk assessment practice. Literature [24] proposed a two-stage credit risk model that introduces target evolutionary feature selection to minimize the misclassification cost (root-mean-square error) and the number of attributes required for the PD and EAD models. Test experiments showed that the proposed model performs well in terms of prediction accuracy and cost-effectiveness. Literature [25] aims to improve the accuracy of credit risk assessment by designing a modified binary particle swarm optimization algorithm (MBPSO) based on the logic of the GK algorithm and applying it together with the GK algorithm, which yields more robust and accurate credit assessments. Literature [26] explored how consumers’ personal information affects credit risk assessment and showed that consumer information helps predict individuals’ repayment behavior, and that incorporating personal information as a reference in credit risk assessment can improve prediction accuracy.
Literature [27] comprehensively analyzes consumer credit risk assessment from three dimensions (classification algorithms, data features, and machine learning methods), classifies the approaches in each dimension, builds on this foundation a data feature-driven modeling framework based on multiple classifiers, and finally investigates the model’s interpretability, fairness, and multimodality, which makes a positive contribution to the field of credit risk assessment. Literature [28] proposed an online consumer credit risk inference methodology based on data augmentation and model enhancement strategies, which adds consumer profile information, while multi-stage view monitoring based on consumer repayment time enhances the prediction accuracy of credit risk.

Reasonable use and management of consumer credit benefit the sound development of individuals and society, so it is also very meaningful to understand the logic behind the consumer credit boom and to improve the supervision and management of consumer credit. Literature [29] builds a measurement model to assess how material desire affects people’s credit card use, impulsive buying behavior, and compulsive buying behavior. The results show that material desire significantly promotes impulsive buying behavior, and the study suggests that reducing material desire can reduce impulsive and compulsive buying. Literature [30] discusses the international consensus on the regulation of the consumer credit market, which includes appropriate regulation by regulators, mandatory disclosure of information, reasonable cost of credit financing, etc., and concludes with an in-depth analysis of how payment intermediaries can play an active role in cross-border purchasing. Literature [31] identifies a database of loan characteristics in combination with a sample of more than one million personal consumer loans from LendingClub, which effectively improves the accuracy of consumer loan default prediction, and also points out that the regulatory and guidance aspects of loan origination platforms should be made more transparent. Literature [32] quantifies the explanatory power of liquidity constraints and anchoring theory in conjunction with changes in issuers’ minimum payment formulas, pointing out that anchoring a contractual clause facilitates households’ repayment decisions.

Driven by the demand for consumer credit risk prediction in the context of financial technology, this article constructs a consumer credit risk prediction model based on survival analysis. Based on survival time theory, the survival data are divided into categories. The Kaplan-Meier method is used to estimate the survival function. After testing the survival data and checking the model fit, the consumer credit risk prediction model is constructed using the Cox proportional risk model. The Cox proportional risk model is compared with risk prediction models such as RandomForest, XGBoost, and LightGBM, and its risk prediction performance is tested using the ROC curve, KS value, and probability value. The predictive validity of the Cox model is further verified by empirically analyzing whether the borrower is overdue and the overdue time.

2 Consumer credit risk prediction modeling

2.1 Basic concepts

2.1.1 Financial technology

Financial technology refers to a new type of financial business formed through the organic integration of finance with modern science and technology and the in-depth transformation and innovation of financial products and financial enterprise management [33]. It is not just a simple combination of finance and technology. Rather, it uses big data, blockchain, artificial intelligence, cloud computing, and other advanced technologies to make financial services transparent, intelligent, and digital, so as to produce a value-added effect of “1+1>2”. The development of fintech aims to enhance the operational efficiency of the financial industry, optimize customer experience, and give rise to more advanced and convenient financial products and modern financial management systems.

(1) Big Data

Big data technology is an advanced technology for processing and analyzing massive data sets, especially referring to those data that exceed the processing capacity of traditional databases. It relies heavily on sophisticated data processing software that can quickly analyze, process, and extract valuable information.

In the banking industry, the application of big data technology is becoming increasingly broad and deep. First, it plays an important role in risk management, enabling more effective identification and management of credit and market risks. Second, in terms of customer service, big data technology helps banks analyze customer needs and provide more personalized services. In addition, big data technology plays a key role in banks’ marketing strategies. By analyzing customer data, banks can design more accurate marketing campaigns. Also, big data technology is very effective in preventing financial fraud. Finally, big data technology also helps banks improve their internal operational efficiency.

(2) Blockchain

Blockchain technology is a distributed ledger technology that makes data transmission secure and reliable in a decentralized network. In this network, data is stored in the form of blocks connected by encrypted chains, with each new block containing encrypted information from the previous block, thus ensuring data immutability and transparency. This structure makes blockchain an ideal system for recording and sharing data. In the banking industry, blockchain technology is mainly used to improve transaction efficiency and security.

(3) Artificial Intelligence

Artificial Intelligence (AI) technology mimics human thought processes, including the ability to learn, reason, self-correct, and automate decision-making. It utilizes algorithms and large amounts of data to enable machines to solve complex problems and perform specific tasks. In the banking industry, AI is used to improve service efficiency and customer experience.

(4) Cloud Computing

Cloud computing is a technology that provides computing resources and data storage services over the Internet. It allows users to access and use software, storage, and other computing functions located on remote servers over the Internet without having to manage physical servers or run application software locally.

In the banking industry, the use of cloud computing is growing, revolutionizing this traditional industry. First, cloud computing provides powerful data processing capabilities that help banks efficiently handle large amounts of transaction data and enable fast, accurate data analysis. Second, cloud computing helps banks cope with business fluctuations by providing flexible resource allocation. In addition, cloud computing supports innovation in the banking industry. Finally, cloud computing also plays a key role in improving the security and compliance of banking operations.

2.1.2 Consumer Credit and Risks

(1) Consumer credit

Consumer loans specifically refer to loan services provided by banks or other financial institutions to individual consumers with a defined consumption purpose. Such loans are mainly based on the borrower’s credit history and future repayment ability and are used to satisfy their needs for purchasing consumer goods or paying for other personal consumption.

(2) Classification of consumer credit

Thanks to personal consumer credit, public deposit funds can be further circulated, which can effectively alleviate the financial pressure on families or individuals. Under the joint promotion of “consumption upgrading, policy support, and financial technology development”, China’s consumer credit system is becoming increasingly mature. According to the purpose of the loan, personal consumer credit can be categorized into housing and non-housing categories.

(1) Housing Consumer Loans

Housing consumer loans, also known as personal housing loans, provide financial support for borrowers to purchase various types of housing for their own use, including ordinary housing and villas. This type of loan usually has a large amount, up to tens of millions of dollars, and a long loan period, usually 1-30 years, and the sum of the borrower’s age and the loan period does not exceed 30 years. Because the mortgage loan cycle is generally long, the borrower can choose among different repayment methods, such as equal monthly installments of principal and interest, equal monthly installments of principal, and a single repayment of principal and interest at maturity. In order to minimize business risks, commercial banks often require borrowers to provide security in the form of mortgages or pledges. If the borrower fails to repay the loan in accordance with the contract, the commercial bank has the right to dispose of the collateral to recover the loan. In the past, the interest rate for housing loans was fixed. With the introduction of LPR interest rates, borrowers can choose to repay their loans at floating interest rates.

(2) Non-housing consumer loans

Non-housing consumer loans refer to personal consumption loans other than those used for the purchase of a home, which have a wide range of uses, including but not limited to the purchase of automobiles, home renovations, and the purchase of large consumer goods. Based on the length of the repayment period, non-housing consumer loans can be further classified into installment and non-installment modes. The installment mode is usually used for larger amounts of consumption, such as purchasing a car, renovation, etc., while the non-installment mode is usually used for smaller amounts of consumption or emergencies.

(3) Consumer credit risk

Consumer credit risk refers to the possibility that the borrower of personal consumer credit is unable to fulfill the repayment obligation according to the contract due to various uncertainties, thus exposing the commercial bank to the loss of funds [34]. This kind of risk runs through the three links of pre-credit, credit, and post-credit, including but not limited to incomplete pre-credit investigation, non-compliance of credit review, as well as untimely and unstandardized post-credit management. When the borrower defaults, commercial banks will face the risk of non-performing loans, which will affect their asset quality and operational stability.

The characteristics of consumer credit risk mainly include the following. First, the sources of risk are diverse: risk may come from the borrower’s own credit risk but also from the market environment, policy adjustments, and other external factors. Second, the risk is hidden: part of the risk may be difficult to detect at the beginning of the loan and is only gradually exposed over time. Third, risks are contagious: once a risk arises in a certain link, it may spread to the entire credit chain and affect the overall operation of the bank.

2.2 Construction of consumer credit risk prediction model based on survival analysis

2.2.1 Survival time

Survival time refers to the time elapsed between a chosen starting point of the study and the occurrence of an endpoint event in the research object. Survival time can have different units and ways of measurement depending on how it is defined, and its probability distribution can be categorized into continuous and discrete types [35]. These two types are explained in detail below.

(1) Continuous function

Assume that T denotes the survival time of a single individual in a population and is a continuous non-negative random variable taking values in [0, ∞). The cumulative distribution function F(t) of T is given by the following equation: $F(t) = P(T \leq t) = \int_{0}^{t} f(x)\,dx, \quad \forall t \geq 0$ (1)

In contrast, the probability that an individual survives longer than time t is the survival function S(t): $S(t) = P(T > t) = 1 - F(t) = \int_{t}^{\infty} f(x)\,dx, \quad \forall t \geq 0$ (2)

S(t) is a monotonically decreasing continuous function, where S(0) = 1 denotes the probability that an individual survives beyond the starting point 0, and $S(\infty) = \lim_{t \to \infty} S(t) = 0$ denotes the probability that an individual survives for an infinite amount of time.

Differentiating F(t) gives the probability density function f(t) of the survival time T, which describes the probability of death per unit time in an infinitesimal interval from t to t + Δt: $f(t) = \frac{dF(t)}{dt} = -\frac{dS(t)}{dt} = \lim_{\Delta t \to 0^{+}} \frac{P(t < T < t + \Delta t)}{\Delta t}, \quad \forall t \geq 0$ (3)

Another way to describe the probability of death in survival analysis is the risk function h(t), also called the instantaneous mortality rate. Given that the individual is still alive at time t, it expresses the probability of death in the following very small time interval Δt, per unit time: $h(t) = \lim_{\Delta t \to 0^{+}} \frac{P(t < T < t + \Delta t \mid T > t)}{\Delta t} = \frac{f(t)}{S(t)}, \quad \forall t \geq 0$ (4)

From equations (1) and (2) we know that $f(t) = -\frac{dS(t)}{dt}$, so the above equation can be rewritten as: $h(t) = \frac{f(t)}{S(t)} = -\frac{dS(t)/dt}{S(t)} = -\frac{d \log S(t)}{dt}, \quad \forall t \geq 0$ (5)

Integrating the above equation from 0 to t and using S(0) = 1 gives: $\int_{0}^{t} h(x)\,dx = -\log S(t)$ (6)

That is, in exponential form: $S(t) = \exp\left(-\int_{0}^{t} h(x)\,dx\right)$ (7)

Based on the relationship between these functions, the probability density function at time point t is then: $f(t) = h(t) S(t) = h(t) \exp\left(-\int_{0}^{t} h(x)\,dx\right)$ (8)
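As a minimal numerical sketch of Eqs. (7) and (8), the survival function and density can be recovered from a given risk function by numerical integration. The function names and the constant monthly risk of 0.02 below are illustrative assumptions and do not come from the paper's data.

```python
import math

def survival_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral_0^t h(x) dx), Eq. (7), with the trapezoidal rule."""
    if t == 0:
        return 1.0
    dx = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for k in range(1, steps):
        total += hazard(k * dx)
    return math.exp(-total * dx)

def density_from_hazard(hazard, t):
    """f(t) = h(t) * S(t), Eq. (8)."""
    return hazard(t) * survival_from_hazard(hazard, t)

# Hypothetical constant risk of default of 0.02 per month (assumption for illustration).
rate = 0.02

def constant_hazard(x):
    return rate

s12 = survival_from_hazard(constant_hazard, 12.0)  # close to exp(-0.24)
f12 = density_from_hazard(constant_hazard, 12.0)   # close to 0.02 * exp(-0.24)
```

With a constant risk function the result reduces to the exponential distribution, which makes the sketch easy to check by hand.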

(2) Discrete Functions

Similarly, assume that T represents the survival time of a single sample in the population and is a discrete non-negative random variable taking the values t₁, t₂, t₃, ..., where 0 < t₁ < t₂ < t₃ < .... Its probability function is: $P(t_{i}) = P(T = t_{i}), \quad i = 1, 2, 3, \dots$ (9)

At this point, the survival function is represented by the following equation: $S(t) = P(T \geq t) = \sum_{i : t_{i} \geq t} P(t_{i})$ (10)

where S(t) is a monotonically decreasing function, S(0)=1, S(∞)=0.

The risk function at t_i can be expressed as: $h(t_{i}) = P(T = t_{i} \mid T \geq t_{i}) = \frac{P(t_{i})}{S(t_{i})}, \quad i = 1, 2, 3, \dots$ (11)

From Eq. (10), we know that $P(t_{i}) = S(t_{i}) - S(t_{i+1})$.

So equation (11) can be rewritten as: $h(t_{i}) = 1 - \frac{S(t_{i+1})}{S(t_{i})}, \quad i = 1, 2, 3, \dots$ (12)

That is: $S(t) = \prod_{i : t_{i} < t} [1 - h(t_{i})], \quad i = 1, 2, 3, \dots$ (13)
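The discrete relations can be illustrated with a short sketch that builds the survival function from a list of risk values via Eq. (12) and then recovers the probability function. The three risk values are made up for illustration, and the helper names are hypothetical.

```python
def survival_from_discrete_hazard(hazards):
    """Return [S(t_1), S(t_2), ..., S(t_{k+1})] with S(t_1) = 1,
    using S(t_{i+1}) = S(t_i) * (1 - h(t_i)) from Eq. (12)."""
    surv = [1.0]
    for h in hazards:
        surv.append(surv[-1] * (1.0 - h))
    return surv

def pmf_from_survival(surv):
    """P(t_i) = S(t_i) - S(t_{i+1})."""
    return [surv[i] - surv[i + 1] for i in range(len(surv) - 1)]

# Hypothetical risk values h(t_1), h(t_2), h(t_3) (assumption for illustration).
hazards = [0.10, 0.20, 0.50]
surv = survival_from_discrete_hazard(hazards)   # roughly [1.0, 0.9, 0.72, 0.36]
pmf = pmf_from_survival(surv)                   # roughly [0.1, 0.18, 0.36]
# Eq. (11) recovers the original risk values: P(t_i) / S(t_i).
recovered = [p / s for p, s in zip(pmf, surv)]
```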

2.2.2 Survival Data Categories

In survival analysis, the study data are divided into two types, complete data and censored data, according to the final outcome observed for each sample.

If the complete survival time of an individual can be observed during data collection or experimentation (i.e., it can be determined whether the customer is a normal or a past due customer), the data are called complete data. Conversely, if an individual cannot be followed continuously during data collection or experimentation because of some uncertainty, or if the individual has not failed or died by the end of the observation period, whose length has to be fixed in advance because of costs in human resources, material resources, and time, the researcher cannot confirm the true survival time, and the data are referred to as censored data.
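The distinction between complete and censored data can be sketched as a small labeling step. The `Loan` record, its field names, and the three example loans are hypothetical and only illustrate the idea; the event flag follows the usual convention that 1 marks a complete observation and 0 a censored one.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Loan:
    loan_id: str
    months_to_overdue: Optional[int]  # None if no overdue event has been observed
    months_observed: int              # length of the observation window for this loan

def to_survival_record(loan: Loan) -> Tuple[int, int]:
    """Return (time, event): event = 1 for complete data, 0 for censored data."""
    overdue = loan.months_to_overdue
    if overdue is not None and overdue <= loan.months_observed:
        return (overdue, 1)           # the endpoint event was observed
    return (loan.months_observed, 0)  # only a lower bound on the survival time is known

# Illustrative loans (made-up values).
loans = [Loan("A", 5, 12), Loan("B", None, 12), Loan("C", None, 7)]
records = [to_survival_record(loan) for loan in loans]
```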

2.2.3 Survival Analysis Methods

The main method used in the industry is the Kaplan-Meier method, a non-parametric method that draws the survival curve by estimating the survival function. In practice, it makes the turning points of the curve and the degree of variation easy to observe. Because the estimate has the form of a product limit, it is also called the PL method.

The Kaplan-Meier method arranges the survival times of the complete and censored data from smallest to largest. Let n denote the total number of observations and r the rank of a given observation in this ordering (i.e., its position in the ordered data). The number of individuals still at risk at that survival time is then n – r + 1, so the probability of death at that time point may be denoted as $\frac{1}{n - r + 1}$ and the probability of survival as $\frac{n - r}{n - r + 1}$. The probability of surviving through a sequence of complete-data time points is the product of the survival probabilities at each of those time points. The probability of survival and the probability of death are written respectively as: $p = \frac{n - r}{n - r + 1}, \quad q = \frac{1}{n - r + 1}$ (14)

Then, the estimate of the survival function is expressed as: $S(t) = \prod_{i : t_{i} < t} p_{i}$ (15)

where p_i is the survival probability p of Eq. (14) at the i-th complete-data time point t_i.

In this paper, the Kaplan-Meier method will be applied to estimate the cumulative survival rate S₀(t).
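A minimal sketch of the product-limit calculation is given below. It is not the implementation used in the paper: the function name and the five example observations are assumptions, tied times are handled by grouping, and the estimate is reported immediately after each complete-data time point.

```python
def kaplan_meier(times, events):
    """Return [(t, S)] with the product-limit estimate just after each event time.

    times  : observed survival times
    events : 1 for complete data (event observed), 0 for censored data
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations that share the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Without ties this factor equals (n - r) / (n - r + 1) from Eq. (14).
            surv *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

# Illustrative data (made up): five loans, two of them censored.
times = [2, 4, 5, 7, 9]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
```

In this example the factors are 4/5, 2/3, and 1/2, which match $\frac{n - r}{n - r + 1}$ for the ranks r = 1, 3, and 4 of the three complete observations.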

2.2.4 Data testing

When we want to compare n sets of survival data (n ≥ 2), i.e., to test H₀: S₁(t) = S₂(t) = S₃(t) = … = S_n(t), the method generally used is the log-rank test. Its statistic is the square of the difference between the total observed number of events and the total expected number of events over all time points, divided by the total expected number of events, which is expressed in the following formula: $\frac{{[\sum A_{i} - \sum E(A_{i})]}^{2}}{\sum E(A_{i})}, \quad i = 1, 2, 3, \dots$ (16)

where A_i denotes the number of deaths (defaults) at time i and E(A_i) denotes the estimated number of deaths (defaults) at time i.
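The calculation behind Eq. (16) can be sketched as follows. For each group, the expected number of events at every event time is taken as the total number of events at that time multiplied by the group's share of the individuals still at risk. Summing the term of Eq. (16) over the groups gives a commonly used approximate log-rank statistic, which is an assumption of this sketch. The exact variance-based form is not shown, and the function name and example data are hypothetical.

```python
def logrank_statistic(times, events, groups):
    """Approximate log-rank statistic: sum over groups of (O - E)^2 / E.

    Assumes every group has a positive expected number of events.
    """
    labels = sorted(set(groups))
    observed = {g: 0.0 for g in labels}
    expected = {g: 0.0 for g in labels}
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = {g: 0 for g in labels}
        deaths = {g: 0 for g in labels}
        for ti, ei, gi in zip(times, events, groups):
            if ti >= t:
                at_risk[gi] += 1
            if ti == t and ei == 1:
                deaths[gi] += 1
        total_risk = sum(at_risk.values())
        total_deaths = sum(deaths.values())
        for g in labels:
            observed[g] += deaths[g]                                  # A_i
            expected[g] += total_deaths * at_risk[g] / total_risk     # E(A_i)
    stat = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in labels)
    return stat, observed, expected

# Illustrative data (made up): group "a" defaults earlier than group "b".
stat, observed, expected = logrank_statistic(
    [1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1], ["a", "a", "a", "b", "b", "b"]
)
```

A larger statistic indicates a larger gap between observed and expected events, i.e., stronger evidence against H₀. Two groups with identical data give a statistic of zero.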

2.2.5 Model fit test

When a model is applied in practice, we generally cannot be sure in advance that it suits the problem, because one or more features of the model may not be appropriate for the particular data available. Checks should therefore be made to confirm that the model fits the data before the model is used.

A common method used in the industry is the log-log plot, which applies a double logarithmic transformation to the estimated survival function. Assume that all samples are divided into n non-overlapping strata according to a variable x. Under the risk function $h(t,x) = h_{0}(t) e^{\beta x}$, the regression coefficients are estimated without including x as a covariate, and the corresponding survival function S(t,x) is calculated for each stratum. The n curves are then plotted with the survival time t on the horizontal axis and log(–log S(t,x)) on the vertical axis. Under this risk function, $\log(-\log S(t,x)) = \beta x + \log(-\log S_{0}(t))$, where S₀(t) is the survival function at x = 0, so the strata differ only by a constant vertical shift. If the n curves are approximately parallel, the model is a good match.
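The parallelism check can be illustrated numerically. The sketch below builds two synthetic survival curves that satisfy the proportional risk form exactly and shows that the vertical gap between their log(–log S) transforms is constant. The baseline curve, the risk ratio of 2, and the helper names are assumptions for illustration only.

```python
import math

def log_minus_log(surv):
    """Transform a survival probability 0 < S < 1 into log(-log S)."""
    return math.log(-math.log(surv))

def loglog_gaps(curve_a, curve_b):
    """Vertical gaps between two log(-log S) curves on a shared time grid."""
    return [log_minus_log(a) - log_minus_log(b) for a, b in zip(curve_a, curve_b)]

# Synthetic example: S0(t) = exp(-0.1 t) and a stratum with exp(beta * x) = 2,
# so that S(t, x) = S0(t) ** 2 (both choices are assumptions).
time_grid = [1, 2, 4, 8]
baseline = [math.exp(-0.1 * t) for t in time_grid]
stratum = [s ** 2.0 for s in baseline]
gaps = loglog_gaps(stratum, baseline)  # every gap is close to log(2)
```

With real data the curves would come from an estimated survival function for each stratum, and gaps that drift with t would suggest that the model does not match.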

2.2.6 Cox proportional risk models

The Cox proportional risk model, which was first applied to the pricing of bonds and some financial products, has been gradually used to measure credit risk in recent years with the continuous development of the commercial bank credit risk measurement model [36]. This subsection mainly introduces the basic form of the Cox proportional risk model and the estimation and testing of relevant parameters.

(1) The basic form of the Cox model

Let the density function of survival time T be f(t|X) and the survival function be S(t|X) when the covariate is X. The hazard rate function is: $\lambda(t | X) = \frac{f(t | X)}{S(t | X)}$ (17)

If, for any X₁ ≠ X₂, the ratio $\frac{\lambda(t | X_{1})}{\lambda(t | X_{2})}$ is independent of t, the relationship between survival time T and covariate X is said to fit the proportional risk rate model. For this model, the hazard rate of survival time T has the following form: $\lambda(t | X) = \lambda_{0}(t)\, g(X)$ (18)

Here λ₀(t) is called the benchmark risk rate and g(X) is a positive function of the covariate. Since the survival function of T is: $S(t | X) = \exp\left\{- \int_{0}^{t} \lambda(\mu | X)\, d\mu\right\}$ (19)

it follows from equation (18) that: $S(t | X) = {[S_{0}(t)]}^{g(X)}$ (20)

where $S_{0}(t) = \exp\left\{- \int_{0}^{t} \lambda_{0}(\mu)\, d\mu\right\}$ .

It follows from Eq. (20) that for any X₁ and X₂, either S(t|X₁) ≥ S(t|X₂) for all t or S(t|X₁) ≤ S(t|X₂) for all t, which indicates that the survival times under different values of the covariates are stochastically ordered.

In practical problems, g(X) in Eq. (20) is often chosen to be in parametric form, i.e., g(X) = g₀(X,β), where g₀ is a known function, X = (x₁,...,x_p)^T is the covariate, and β = (β₁,...,β_p)^T is the unknown parameter. In this case, Eq. (20) becomes: $S(t | X) = {[S_{0}(t)]}^{g_{0}(X, \beta)}$ (21)

Equation (21) is called the generalized Cox model. When g₀(X,β) = exp{X^T β}, Eq. (21) is called the Cox proportional risk model, which was proposed by the British statistician Cox in 1972. There are two unknown components in Eq. (21): the parameter β and the baseline survival function S₀(t). The model is therefore a semiparametric model, and both components need to be estimated from observed data.
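As a small illustration of Eq. (21) with g₀(X,β) = exp{X^T β}, the sketch below turns a baseline survival probability into borrower-specific survival probabilities. The coefficients, covariate vectors, and baseline value are made up and are not estimates from the paper.

```python
import math

def cox_survival(baseline_survival, x, beta):
    """S(t|X) = S0(t) ** exp(X^T beta), i.e., Eq. (21) with g0 = exp."""
    linear = sum(xi * bi for xi, bi in zip(x, beta))
    return baseline_survival ** math.exp(linear)

# Hypothetical values (assumptions for illustration only).
beta = [0.8, -0.5]   # a risk-increasing and a risk-reducing covariate
s0 = 0.90            # baseline survival probability at some time t

higher_risk = cox_survival(s0, [1.0, 0.0], beta)  # below s0
lower_risk = cox_survival(s0, [0.0, 1.0], beta)   # above s0
reference = cox_survival(s0, [0.0, 0.0], beta)    # equals s0
```

A positive linear term X^T β lowers the survival probability relative to the baseline, and a negative one raises it, which is the ordering property that follows from Eq. (20).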

(2) Proportional risk hypothesis testing

The proportional risk assumption, or PH assumption, means that the effect of the covariates in the model does not change with time. That is to say, the risk functions of different individuals in the model are proportional to each other [37]. Testing the PH assumption is necessary before building the Cox proportional risk model, because the model is valid only if the PH assumption holds. If the PH assumption is not satisfied, on the one hand, the significance of the variables that violate the assumption is affected. On the other hand, if the risk ratio increases with time, the relative hazard ratio is overestimated, and if the opposite occurs, the relative hazard ratio is underestimated.

In academic research, common methods for testing PH assumptions include the graphical method and the test method.

(1) Graphical method. The graphical method determines whether the PH assumption is satisfied by observing the distribution in a scatterplot. It is simple and easy to operate and can be applied to many kinds of variables, including continuous variables, binary variables, hierarchical variables, etc., which enables the researcher to judge intuitively, and with a certain degree of credibility, whether the PH assumption is satisfied. However, human judgment is subjective, and it is sometimes difficult to determine how large a deviation must be before it invalidates the model. Therefore, it is necessary to judge whether a departure from the PH assumption is statistically significant with the help of statistical tests.

(2) Testing method. The testing method constructs a statistic for the established model and uses the P value derived from it to determine whether the data meet the assumptions set by the model. It mainly includes the time-covariate method, the generalized linear regression method, etc. These tests are based on the null hypothesis that the risk ratio is zero and do not require the stratification of time and covariates, and they are the more common way of testing the PH assumption.

(3) Parameter estimation

For the parameter part of the Cox proportional risk model, this paper adopts the partial likelihood estimation method. Assuming a sample size of n, i.e., the survival time of n listed firms is observed, three variables are included in the sample data, namely, X_i, δ_i, and T_i, where X_i is the covariate of the ith firm and T_i is its survival time. When the true survival time of a listed firm is greater than t_i, it means that the firm did not experience financial distress during the study period, which is denoted as δ_i = 0;. On the contrary, when the true survival time of a listed firm is t_i, then it means that the listed firm did not experience financial distress during the study period, which is denoted as δ_i = 1. Therefore, the observations of interest are represented by a set, i.e.: 22 $(t_{i}, δ_{i}, X_{i}), i = 1, 2, \dots, n$

Let t₁ < t₂ < … < t_n denote the ordered survival times. Define the risk set at time t_i as R(t_i) = {j : t_j ≥ t_i}, which is the set of all individuals still under study just before t_i. The principle of estimating the covariate parameters by partial likelihood is briefly described below.

If one individual in R(t_i) dies at time t_i, i.e., financial distress occurs, then the conditional probability that it is the individual with covariate X_i is:

$$\begin{aligned} & P[\text{individual } i \text{ dies at } t_{i} \mid \text{one individual in } R(t_{i}) \text{ dies at } t_{i}] \\ &= \frac{h[t_{i} \mid X_{i}]}{\sum_{j \in R(t_{i})} h[t_{i} \mid X_{j}]} = \frac{h_{0}(t_{i}) \exp[\beta' X_{i}]}{\sum_{j \in R(t_{i})} h_{0}(t_{i}) \exp[\beta' X_{j}]} = \frac{\exp[\beta' X_{i}]}{\sum_{j \in R(t_{i})} \exp[\beta' X_{j}]} \end{aligned} \tag{23}$$

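Multiplying these conditional probabilities over all time points at which a death occurs gives the partial likelihood function in Eq. (24). The exponent δ_i means that censored observations (δ_i = 0) contribute only through the risk sets:

$$L(\beta) = \prod_{i = 1}^{n} \left[ \frac{\exp[\beta' X_{i}]}{\sum_{j \in R(t_{i})} \exp[\beta' X_{j}]} \right]^{\delta_{i}} \tag{24}$$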

Taking the logarithm of Eq. (24) and setting its derivative with respect to β to zero gives:

$$\frac{\partial \ln L(\beta)}{\partial \beta} = 0 \tag{25}$$

Solving Eq. (25) yields the maximum partial likelihood estimate of the parameter β.
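
The estimation step can be illustrated with a minimal NumPy sketch. It assumes no tied event times, uses only observed events (δ_i = 1) in the product of Eq. (24), and solves Eq. (25) by Newton-Raphson. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def neg_log_partial_likelihood(beta, times, events, X):
    """Negative log of Eq. (24), summed over observed events (no ties assumed)."""
    eta = X @ beta
    total = 0.0
    for i in np.flatnonzero(events):
        at_risk = times >= times[i]              # risk set R(t_i)
        total += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return -total

def fit_cox(times, events, X, n_iter=25, tol=1e-8):
    """Solve the score equation, Eq. (25), by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = np.exp(X @ beta)
        score = np.zeros_like(beta)
        info = np.zeros((beta.size, beta.size))
        for i in np.flatnonzero(events):
            at_risk = times >= times[i]
            wr, Xr = w[at_risk], X[at_risk]
            mean = wr @ Xr / wr.sum()                        # weighted covariate mean
            second = (Xr * wr[:, None]).T @ Xr / wr.sum()    # weighted second moment
            score += X[i] - mean
            info += second - np.outer(mean, mean)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical toy data: 8 subjects, one covariate, two censored observations.
times = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0])
events = np.array([1, 1, 0, 1, 1, 0, 1, 1])
X = np.array([[1.0], [-0.5], [0.3], [0.8], [-1.0], [0.2], [0.5], [-0.2]])

beta_hat = fit_cox(times, events, X)
print(beta_hat)
```

Plain Newton-Raphson is used here only because the log partial likelihood is concave; production software typically adds step halving and tie handling, which this sketch omits.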

(4) Estimation of the baseline survival function

Once the estimate of β has been obtained by partial likelihood, the next step is to estimate the baseline survival function. In the Cox proportional risk model, the baseline survival function S₀(t) is left in nonparametric form, and there are two main methods for estimating it:

(1) Non-parametric method

The nonparametric method estimates the baseline cumulative hazard rate function as:

$$H_{0}(t) = \sum_{t_{i} < t} \left[ 1 - \left( 1 - \frac{n_{i} \exp[\beta' X_{i}]}{\sum_{j \in R(t_{i})} \exp[\beta' X_{j}]} \right)^{\exp[-\beta' X_{i}]} \right] \tag{26}$$

where n_i denotes the number of listed companies that experience a financial crisis at time t_i. From the definitions of the baseline survival function S₀(t) and the baseline cumulative hazard rate function H₀(t), the baseline survival function follows as:

$$S_{0}(t) = \exp[-H_{0}(t)] \tag{27}$$

(2) Breslow’s method

This method defines the baseline cumulative hazard rate function H₀(t) as:

$$H_{0}(t) = \sum_{t_{i} < t} \left[ \frac{n_{i}}{\sum_{j \in R(t_{i})} \exp[\beta' X_{j}]} \right] \tag{28}$$

The baseline survival function is then obtained as:

$$S_{0}(t) = \exp\left\{ -\sum_{t_{i} < t} \left[ \frac{n_{i}}{\sum_{j \in R(t_{i})} \exp[\beta' X_{j}]} \right] \right\} \tag{29}$$
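
Breslow's estimator in Eqs. (28) and (29) can be sketched as follows. The function name and the toy data are hypothetical. The returned values are evaluated just after each event time, i.e., the increment at t_i is already included.

```python
import numpy as np

def breslow_baseline(times, events, X, beta):
    """Baseline cumulative hazard H0 and survival S0 at each distinct event time."""
    w = np.exp(X @ beta)                               # exp(beta' X_j) per subject
    event_times = np.unique(times[events == 1])
    increments = []
    for t in event_times:
        n_i = np.sum((times == t) & (events == 1))     # events at this time
        denom = w[times >= t].sum()                    # sum over the risk set R(t_i)
        increments.append(n_i / denom)
    H0 = np.cumsum(increments)                         # Eq. (28), including the jump at t
    S0 = np.exp(-H0)                                   # Eq. (29)
    return event_times, H0, S0

# Hypothetical check: with beta = 0 every subject has weight 1,
# so the increments are simply (events) / (number at risk).
t_demo = np.array([1.0, 2.0, 3.0, 4.0])
e_demo = np.array([1, 0, 1, 1])
X_demo = np.zeros((4, 1))
event_times, H0, S0 = breslow_baseline(t_demo, e_demo, X_demo, np.zeros(1))
print(event_times, H0, S0)
```

With β = 0 the increments are 1/4, 1/2, and 1/1, which gives an easy hand check of the risk-set bookkeeping.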

(3) Predictive model performance and empirical analysis

3.1 Model Performance Analysis

The Cox proportional risk model in this paper is compared with three other risk prediction models, random forest (RF), XGBoost (XGB), and LightGBM (LGB), to analyze the prediction effect of each model.

3.1.1 Comparison of ROC curves (AUC values)

Each algorithm is trained on the training set (Train) with the above parameters and then evaluated on the test set (Test). The ROC curves of the algorithms are compared in Figure 1.

In Figure 1, the best result is obtained by the Cox proportional risk model proposed in this paper, whose AUC value reaches 0.7295. The worst model in this group of experiments is RF, whose AUC value is 0.6978. On the same dataset and with the same input features, the Cox model therefore improves the AUC value by 3.17 percentage points.
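
For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses a hypothetical four-sample example and then reproduces the 3.17-point gap from the two AUC values reported above.

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # all positive-negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (pos.size * neg.size)

# Hypothetical labels and scores, for illustration only.
demo_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Gap between the reported Cox and RF AUC values, in percentage points.
auc_gain_points = (0.7295 - 0.6978) * 100
print(demo_auc, round(auc_gain_points, 2))
```

The pairwise form costs O(n_pos × n_neg) memory, which is fine for a sketch but not for large test sets, where a rank-based formula is preferable.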

3.1.2 Comparison of KS values

The KS curves and KS values of the algorithms are compared in Fig. 2. Panels (a)~(d) show the KS curves and KS values of the RF, XGB, LGB, and Cox models, respectively, together with the optimal dividing line of the sample at which the KS value is maximal. The cut-off point at which the maximum KS value is reached differs between algorithms. For the RF model, the strongest differentiation is obtained at the top 36.5% of samples ranked by predicted probability, where the model attains its maximum KS value. Among the four algorithms, the Cox model proposed in this paper has the largest KS value and the best prediction result: its maximum KS value (0.2846) is reached at the top 27.53% of samples ranked by predicted probability.
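
The KS value and its cut-off can be sketched as follows: samples are ranked by predicted probability in descending order, and KS is the largest gap between the cumulative share of bad samples and the cumulative share of good samples. The function name and the eight-sample example are hypothetical, and the sketch assumes distinct scores.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Return (max KS, fraction of top-ranked samples at which it is reached)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")       # highest probability first
    y = y_true[order]
    cum_bad = np.cumsum(y == 1) / np.sum(y == 1)      # captured share of positives
    cum_good = np.cumsum(y == 0) / np.sum(y == 0)     # captured share of negatives
    gaps = np.abs(cum_bad - cum_good)
    k = int(np.argmax(gaps))
    return float(gaps[k]), (k + 1) / len(y)

# Hypothetical example: 3 bad and 5 good samples.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
ks_value, top_fraction = ks_statistic(labels, scores)
print(ks_value, top_fraction)
```

The second return value corresponds to statements such as "the top 27.53% of samples" in the text: it is the share of the ranked sample at which the two cumulative curves are furthest apart.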

3.1.3 Comparison of probability values

In the final model results, the probabilities are converted into scores by a linear transformation. Note that a larger probability corresponds to a lower score and a higher likelihood of a positive sample (a bad customer in risk control), whereas a higher score corresponds to a higher likelihood of a negative sample (a good customer in risk control). The scores are binned to compare the distribution of good and bad customers across models. The distribution of scoring results of each model is shown in Fig. 3, where panels (a)~(d) show the distribution of prediction results of the RF, XGBoost, LightGBM, and Cox models, respectively.
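
A minimal sketch of such a probability-to-score mapping is given below. The paper does not state the transformation parameters, so the score range of 300 to 850 and the bin edges are assumptions made purely for illustration, as are the function names.

```python
import numpy as np

SCORE_MIN, SCORE_MAX = 300.0, 850.0   # hypothetical score range, not from the paper

def probability_to_score(p):
    """Linear map: probability 0 -> SCORE_MAX, probability 1 -> SCORE_MIN."""
    p = np.clip(np.asarray(p, dtype=float), 0.0, 1.0)
    return SCORE_MAX - (SCORE_MAX - SCORE_MIN) * p

def counts_by_bin(scores, labels, edges):
    """Count bad (label=1) and good (label=0) samples in each score bin."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    bad, _ = np.histogram(scores[labels == 1], bins=edges)
    good, _ = np.histogram(scores[labels == 0], bins=edges)
    return bad, good

# Hypothetical predictions for four customers.
probs = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 0]
edges = [300, 500, 700, 850]          # hypothetical bins; 300-500 is the low band
scores = probability_to_score(probs)
bad_counts, good_counts = counts_by_bin(scores, labels, edges)
print(scores, bad_counts, good_counts)
```

A model that separates well places most label=1 samples in the lowest bin, which is the pattern the comparison of score distributions looks for.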

Under the same feature conditions, the Cox model proposed in this paper captures more positive samples (label=1). In the low score band of 300-500, the results of the RF, XGBoost, and LightGBM models contain almost no positive samples, so these models fail to catch positive samples there and instead pass positive samples that should have been rejected. In risk control, a bad customer usually causes an asset loss, so missing them is costly. The Cox model, by contrast, captures a certain proportion of the positive samples in the low score band, which is where its improvement in effectiveness shows.

In short, the results again confirm the effectiveness of the proposed method, in terms of both the algorithm and the feature construction.

3.2 Consumer Credit Risk Forecast Analysis

3.2.1 Characterization

The paper uses the Boruta feature selection method to screen the key influencing features. Eight features remain after screening: the interest rate; the maximum credit limit of valid RMB credit cards; the number of times a single credit card has been overdue at M1 or above in the past 24 months; the total mortgage repayment due in the current month; the number of credit card inquiries in 6 months; whether the first repayment is overdue; the first overdue amount; and the current credit limit. These eight features are numbered X1~X8, where X1 belongs to the macro environment, X2~X5 belong to the pre-lending credit information, and X6~X8 belong to the post-lending behavior information.

The Pearson correlation coefficients of the screened features are then calculated and shown in Table 1. The correlations between the features are low (the largest off-diagonal coefficient is 0.452, between X2 and X3), so all eight can be used as feature variables in the subsequent regression analysis.

Table 1. Correlation coefficients of the feature variables

	X1	X2	X3	X4	X5	X6	X7	X8
X1	1	-0.055	-0.042	0.018	-0.042	0.035	-0.088	-0.065
X2	-0.055	1	0.452	0.033	0.105	-0.052	0.164	0.273
X3	-0.042	0.452	1	0.138	0.046	-0.037	0.153	0.242
X4	0.018	0.033	0.138	1	-0.114	-0.009	0.033	0.092
X5	-0.042	0.105	0.046	-0.114	1	0.010	0.003	-0.094
X6	0.035	-0.052	-0.037	-0.009	0.010	1	0.221	-0.119
X7	-0.088	0.164	0.153	0.033	0.003	0.221	1	0.379
X8	-0.065	0.273	0.242	0.092	-0.094	-0.119	0.379	1
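
The low-correlation reading of Table 1 can be checked mechanically. The sketch below copies the coefficients from Table 1, verifies that the matrix is symmetric, and locates the largest off-diagonal coefficient in absolute value. The variable names are illustrative only.

```python
import numpy as np

names = [f"X{i}" for i in range(1, 9)]

# Pearson correlation coefficients copied from Table 1.
corr = np.array([
    [1, -0.055, -0.042, 0.018, -0.042, 0.035, -0.088, -0.065],
    [-0.055, 1, 0.452, 0.033, 0.105, -0.052, 0.164, 0.273],
    [-0.042, 0.452, 1, 0.138, 0.046, -0.037, 0.153, 0.242],
    [0.018, 0.033, 0.138, 1, -0.114, -0.009, 0.033, 0.092],
    [-0.042, 0.105, 0.046, -0.114, 1, 0.010, 0.003, -0.094],
    [0.035, -0.052, -0.037, -0.009, 0.010, 1, 0.221, -0.119],
    [-0.088, 0.164, 0.153, 0.033, 0.003, 0.221, 1, 0.379],
    [-0.065, 0.273, 0.242, 0.092, -0.094, -0.119, 0.379, 1],
])

is_symmetric = bool(np.allclose(corr, corr.T))

# Zero out the diagonal, then find the strongest pairwise correlation.
off_diag = np.abs(corr - np.eye(len(names)))
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
strongest_pair = (names[i], names[j])
strongest_value = float(off_diag[i, j])
print(is_symmetric, strongest_pair, strongest_value)
```

Given a raw feature matrix with one column per feature, the same table could be produced with `np.corrcoef(features, rowvar=False)`; the raw data are not available here, so only the published coefficients are used.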

3.2.2 PH Hypothesis Testing

The results of the PH assumption test are shown in Table 2. A covariate is taken to satisfy the PH assumption when P ≥ 0.05. All eight P values are greater than 0.05, so the null hypothesis is not rejected, i.e., the model meets the conditions of the PH assumption, and the eight covariates are only weakly related to time t.

Table 2. PH test results for the covariates

Code	Covariate	P
X1	Interest rate	0.21
X2	Maximum credit limit of valid RMB credit cards	0.88
X3	Number of times overdue at M1 or above	0.32
X4	Mortgage repayment due this month	0.59
X5	Number of credit inquiries in 6 months	0.23
X6	First repayment overdue	0.27
X7	First overdue amount	0.24
X8	Current credit limit	0.69

3.2.3 Modeling and Significance Tests

A proportional risk model is built on the above eight covariates, with a significance level of 0.05 for the entering variables. The model parameters and significance test results are shown in Table 3. X4 is not significant at this level (P = 0.088, with an estimated coefficient of 0) and therefore does not appear in the fitted expression. From the Cox proportional risk model expression, the survival function for default risk after the first borrowing of a personal loan is:

$$S(t) = S_{0}(t)^{\exp(0.05 X_{1} - 0.08 X_{2} + 0.14 X_{3} + 0.08 X_{5} + 0.42 X_{6} + 0.03 X_{7} - 0.78 X_{8})} \tag{30}$$

Table 3. Model parameters and significance test results

Code	Covariate	coef	exp(coef)	se(coef)	P
X1	Interest rate	0.05	1.16	0.05	<0.005
X2	Maximum credit limit of valid RMB credit cards	-0.08	0.97	0.07	0.015
X3	Number of times overdue at M1 or above	0.14	1.18	0.06	<0.005
X4	Mortgage repayment due this month	0	1	0.04	0.088
X5	Number of credit inquiries in 6 months	0.08	1.09	0.03	<0.005
X6	First repayment overdue	0.42	1.52	0.03	<0.005
X7	First overdue amount	0.03	1.04	0.04	<0.005
X8	Current credit limit	-0.78	0.55	0.09	<0.005
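
To show how Eq. (30) is applied, the sketch below raises a baseline survival curve to the power exp(β′X) for one borrower. The coefficients are those of Eq. (30). The baseline survival values and the covariate values are hypothetical, since the paper reports neither S₀(t) nor the units and scaling of the covariates.

```python
import numpy as np

# Coefficients from Eq. (30); X4 is absent from the fitted expression.
COEF = {"X1": 0.05, "X2": -0.08, "X3": 0.14, "X5": 0.08,
        "X6": 0.42, "X7": 0.03, "X8": -0.78}

def linear_predictor(x):
    """beta' X for a borrower given as a dict of covariate values."""
    return sum(COEF[k] * x[k] for k in COEF)

def survival(s0, x):
    """S(t) = S0(t) ** exp(beta' X), evaluated elementwise over the curve s0."""
    return np.asarray(s0, dtype=float) ** np.exp(linear_predictor(x))

# Hypothetical baseline survival at three time points.
s0 = np.array([0.99, 0.95, 0.90])

# Hypothetical borrowers: all covariates at zero, and one with X6 = 1
# (first repayment overdue) and every other covariate at zero.
reference = dict.fromkeys(COEF, 0.0)
late_first_payment = dict(reference, X6=1.0)

s_ref = survival(s0, reference)
s_late = survival(s0, late_first_payment)
print(s_ref, s_late)
```

A positive linear predictor pushes the whole survival curve down, and a negative one pushes it up, which is how the risk and protective factors act on the predicted survival probability.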

The likelihood ratio test (LRT) is used to test the overall significance of the model. It gives p < 0.05, which indicates that the model as a whole is significant, i.e., at least one covariate has a non-zero coefficient.
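
For reference, the likelihood ratio statistic in its standard form is shown below. The symbol Λ and the degrees of freedom p (the number of covariates in the fitted model) are notation introduced here. The paper reports only the resulting significance level.

```latex
\Lambda = 2\left[\ln L(\hat{\beta}) - \ln L(0)\right] \ \overset{\text{approx.}}{\sim}\ \chi^{2}_{p} \quad \text{under } H_{0}:\ \beta = 0
```

Here L is the partial likelihood of Eq. (24), evaluated at the estimate and at β = 0. A large Λ relative to the χ² distribution leads to rejecting H₀, which matches the conclusion that at least one coefficient is non-zero.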

3.2.4 Analysis of Results

1) Interest rate (X1): the regression coefficient coef = 0.05 > 0 indicates that the interest rate is a risk factor, and exp(coef) = 1.16 > 1 indicates that each one-unit increase in the interest rate raises the risk to 1.16 times its original level. Generally speaking, the higher the borrowing interest rate, the higher the loan cost and the repayment pressure, and thus the higher the default risk.

2) Maximum credit limit of valid RMB credit cards (X2): the regression coefficient coef = -0.08 < 0 indicates that this variable is a protective factor, and exp(coef) = 0.97 < 1 indicates that each one-unit increase reduces the risk to 0.97 times its original level. The higher the borrower’s maximum credit limit on valid RMB credit cards, the better the borrower’s credit status and the lower the default risk. Generally speaking, a higher credit limit reflects higher creditworthiness as recognized by the bank and a higher repayment ability of the borrower.

3) Number of times a single credit card has been overdue at M1 or above in the past 24 months (X3): the regression coefficient coef = 0.14 > 0 indicates that this count is a risk factor, and exp(coef) = 1.18 > 1 indicates that each one-unit increase raises the risk to 1.18 times its original level. The more often a single credit card has been overdue at M1 or above in the past 24 months, the higher the default risk.

4) Number of inquiries within 6 months (X5): the regression coefficient coef = 0.08 > 0 indicates that the number of inquiries within 6 months is a risk factor, and exp(coef) = 1.09 > 1 indicates that each additional inquiry raises the risk to 1.09 times its original level. The greater the number of inquiries within 6 months, the greater the default risk.

5) Whether the first repayment is overdue (X6): the regression coefficient coef = 0.42 > 0 indicates that an overdue first repayment is a risk factor, and exp(coef) = 1.52 > 1 indicates that when the first repayment is overdue, the risk rises to 1.52 times its original level, i.e., the default risk increases.

6) First overdue amount (X7): the regression coefficient coef = 0.03 > 0 indicates that the first overdue amount is a risk factor, and exp(coef) = 1.04 > 1 indicates that each one-unit increase in the first overdue amount raises the risk to 1.04 times its original level. The higher the first overdue amount, the higher the default risk.

7) Current credit limit (X8): the regression coefficient coef = -0.78 < 0 indicates that the current credit limit is a protective factor, and exp(coef) = 0.55 < 1 indicates that each one-unit increase in the current credit limit reduces the risk to 0.55 times its original level. The higher the borrower’s current credit limit, the better the borrower’s credit condition and the lower the default risk.
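
The "times the original" reading of exp(coef) used in the items above follows from the form of the Cox model. As a short worked derivation, with the other covariates held fixed and β_k denoting the coefficient of X_k:

```latex
\frac{h(t \mid X_{k} + 1)}{h(t \mid X_{k})}
= \frac{h_{0}(t) \exp[\beta_{k} (X_{k} + 1)]}{h_{0}(t) \exp[\beta_{k} X_{k}]}
= \exp(\beta_{k}),
\qquad \text{e.g. } \exp(0.42) \approx 1.52 \text{ for } X_{6}
```

The baseline hazard cancels, so the ratio does not depend on t. A positive coefficient therefore gives a ratio above 1 (risk factor) and a negative coefficient gives a ratio below 1 (protective factor).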

3.2.5 Model predictions and evaluation

Predictions are made from the fitted Cox survival function. The paper treats a loan that is overdue for more than 180 days and written off as a default, and there is a legitimate repayment period of 30 days after the borrowing, so a borrower cannot be written off within 210 days of the first borrowing. The prediction therefore targets defaults that occur more than 210 days after the first borrowing but within 1 year. The result is shown in Fig. 4, where the horizontal axis is the time in days and the vertical axis is the survival probability. Each curve gives the survival probability of a borrower after the first borrowing, and a borrower is considered to be in default once the value falls below 0.5.

As can be seen in Figure 4, the borrowers predicted to default more than 210 days after the first borrowing and within 1 year are borrowers 3 and 9, who are predicted to default on days 286 and 357, respectively. The other borrowers are not predicted to default during the performance period.

In reality, borrowers 3 and 9 did default, after surviving 286 and 357 days, respectively, so the predicted default times were neither early nor late. The other eight clients were predicted not to default. The Cox proportional risk model performed well in its predictions.
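
The decision rule described above can be sketched as a small function: scan the predicted survival curve over the window from day 211 to day 365 and return the first day on which it falls below 0.5. The function name and the two survival curves are hypothetical. The paper's fitted curves are shown only in Fig. 4.

```python
import numpy as np

def predicted_default_day(days, surv, start=210, end=365, threshold=0.5):
    """First day in (start, end] with survival below threshold, else None."""
    days = np.asarray(days)
    surv = np.asarray(surv, dtype=float)
    hit = (days > start) & (days <= end) & (surv < threshold)
    return int(days[hit][0]) if hit.any() else None

days = np.arange(1, 366)

# Hypothetical borrower whose survival probability drops below 0.5 on day 286.
curve_a = np.where(days < 286, 0.9, 0.4)
# Hypothetical borrower whose survival probability stays at 0.8 all year.
curve_b = np.full(days.shape, 0.8)

day_a = predicted_default_day(days, curve_a)
day_b = predicted_default_day(days, curve_b)
print(day_a, day_b)
```

Returning `None` for a curve that never crosses the threshold corresponds to a borrower who is not predicted to default during the performance period.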

(4) Conclusion

This paper reviews consumer credit and its risks in the context of financial technology and adopts the Cox proportional risk model to predict consumer credit risk dynamically. The Cox model is compared with other models in terms of prediction performance, and its validity is further verified through empirical research.

The AUC value of the Cox proportional risk model is 0.7295, and its maximum KS value (0.2846) is reached at the top 27.53% of samples ranked by predicted probability. Its predictive performance is the best among all the models compared.

The regression coefficients of the interest rate, the maximum credit limit of valid RMB credit cards, the number of times a single credit card has been overdue at M1 or above in the past 24 months, the number of credit card inquiries in 6 months, whether the first repayment is overdue, the first overdue amount, and the current credit limit are 0.05, -0.08, 0.14, 0.08, 0.42, 0.03, and -0.78, respectively.

Among the 10 borrowers, the Cox model predicts that borrowers #3 and #9 will default, on day 286 and day 357, respectively, and that the other borrowers will not default during the performance period. The prediction matches the actual situation, so the empirical results of the Cox model are good.

Funding:

This research was supported by the provincial soft science key project of Hunan Science and Technology Department: Science and Technology Financial Innovation supporting the Development of Hunan Science and Medium-sized Enterprises (No. 2013ZK2024).

Research on Dynamic Prediction Model of Consumer Credit Risk under Fintech Innovation

Yangyudongnanxin Guo

Published online: 17 March 2025

Received: 04 Oct. 2024

Accepted: 26 Jan. 2025

DOI: https://doi.org/10.2478/amns-2025-0341

Keywords: Survival analysis, Cox proportional risk model, Fintech, Consumer credit, Risk prediction

© 2025 Yangyudongnanxin Guo, published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
