Open Access

Feature selection for high-dimensional data based on scaled cross operator threshold filtering specific memory algorithm

Mar 26, 2025


Introduction

With the rapid development of the information age, big data has become one of the most representative symbols of this era. How to mine the most valuable information from high-dimensional data of huge scale and complex structure is a pressing concern for researchers and practitioners [1-3]. Feature selection is an effective preprocessing method for high-dimensional data: it removes redundant and irrelevant information from the data, reducing both the data dimensionality and the computational complexity of learning algorithms, and it is a hot research topic in the basic theory of artificial intelligence [4-6].

Feature selection methods can be categorized as embedded, wrapper (encapsulated), and filter methods. Embedded feature selection integrates feature selection with the learner and performs selection automatically during training, but it lacks interpretability [7-8]. Wrapper feature selection repeatedly selects a candidate subset from the initial feature set, trains the learner on it, and evaluates the subset by the learner's performance until the best subset is found; the repeated training makes these methods computationally expensive [9-10]. Filter methods select representative features by designing feature evaluation metrics or correlation statistics and applying thresholding or heuristic search [11]. Commonly used feature evaluation metrics include entropy, mutual information, information gain, and the Gini coefficient. Feature selection methods based on rough sets, neighborhood rough sets, and fuzzy neighborhood rough sets form an important class of heuristic feature selection methods; they are widely studied because, by designing feature evaluation functions with good properties, they ensure that the selected feature subset maintains or even exceeds the classification ability of the original features [12-14].

In the era of big data, as data sizes grow it becomes increasingly difficult to construct machine learning models for classification tasks. Feature selection is an important data preprocessing technique that aims to reduce the error rate of the classification task and decrease the number of features required for training [15-16]. High-dimensional datasets in particular contain a large number of redundant features that tend to degrade classifier performance. Therefore, in high-dimensional feature selection for classification, it is especially important to retain features that are highly relevant to the class labels and not redundant with one another [17-18]. In recent years, scholars have proposed many feature selection algorithms for the classification problem; although they can achieve good classification results, their feature selection performance still needs improvement when facing high-dimensional data [19-20].

This paper analyzes the coherence of features, identifies the application value of feature selection, and lays out its general process as the basis for studying feature selection in high-dimensional data. The basic principle of the wavelet threshold filtering algorithm, the selection of its threshold parameter and threshold function, and the limitations of this algorithm are then described. The feature selection process is optimized with a genetic algorithm based on scaling-crossover-operator threshold filtering, and a weighted decaying memory factor is introduced to reduce the classification error rate of feature selection. Finally, comparative experiments examine the differences between the proposed optimization method and other advanced algorithms in causal feature selection, compare the classification error rates of several algorithms, and verify the effectiveness of the scaling crossover operator in reducing the classification error rate of the proposed method.

Feature selection problem analysis

The evaluation criterion for feature selection focuses on the direction of assessing the coherence and redundancy between the features and the target variable, i.e., the ultimate goal of the research on feature selection methods is to remove the features that are incoherent or redundant with the target variable in the original dataset. The coherence of features is broadly categorized into three types, i.e., strong coherence, weak coherence, and incoherence.

Coherence of features

Assume that the entire set of features in the training data is denoted by Ω, the i-th feature is denoted by Fi, U = Ω − Fi denotes the remaining set of features after the removal of feature Fi, Y denotes the target information, and $$P\left( \cdot \right)$$ denotes the probability of occurrence of an event. Then:

The necessary and sufficient condition for Fi to be strongly correlated with the target information Y is: $$P(Y|U) \ne P\left( {Y|{F_i},U} \right)$$

The necessary and sufficient condition for Fi to be weakly correlated with the target information Y is: $$P(Y|U) = P\left( {Y|{F_i},U} \right)$$

and there exists U′ ⊂ U for which the following condition holds: $$P\left( {Y|U^\prime } \right) \ne P\left( {Y|{F_i},U^\prime } \right)$$

The necessary and sufficient condition for Fi to be an incoherent feature with respect to the target information Y is that, for ∀U′ ⊆ U, the following equation holds: $$P\left( {Y|U^\prime } \right) = P\left( {Y|{F_i},U^\prime } \right)$$

From the above definition of feature coherence, it can be seen that features with strong coherence to target information Y are necessary to form the optimal feature subset, weakly coherent features will only have an effect on target information Y under certain circumstances, and incoherent features will not have an effect on Y under any conditions. Based on the above definition, the ultimate goal of feature selection is to retain the features that are strongly coherent to the target information Y and remove the features that exhibit incoherence to the target information Y in the selected feature subset as much as possible. The above definition does not give an answer to the question of whether to retain weakly coherent features in the final feature subset.
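As a concrete illustration of these definitions (not taken from the paper), the short Python sketch below builds a synthetic binary dataset and empirically compares P(Y | U) with P(Y | Fi, U): the conditional distribution changes for a strongly coherent feature and stays essentially unchanged for an incoherent one. The data and variable names are invented purely for illustration.

```python
# Toy illustration: empirically compare P(Y | U) with P(Y | F_i, U) on
# synthetic binary data to see whether feature F_i changes the conditional
# distribution of the target Y.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20000
u = rng.integers(0, 2, n)                 # remaining feature set U (one feature here)
f_strong = u ^ rng.integers(0, 2, n)      # carries information beyond U
f_irrelevant = rng.integers(0, 2, n)      # pure noise
y = ((u & f_strong) | (rng.random(n) < 0.05)).astype(int)   # target depends on U and f_strong

df = pd.DataFrame({"U": u, "F_strong": f_strong, "F_irr": f_irrelevant, "Y": y})

def cond_dist(data, cond_cols):
    """Empirical P(Y = 1 | cond_cols) for every configuration of cond_cols."""
    return data.groupby(cond_cols)["Y"].mean()

print(cond_dist(df, ["U"]))               # P(Y | U)
print(cond_dist(df, ["U", "F_strong"]))   # differs from P(Y | U): strong coherence
print(cond_dist(df, ["U", "F_irr"]))      # ~equal to P(Y | U): incoherence
```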

Application value of feature selection

Feature selection treats all possible feature combinations as a search space; the goal is to remove features that are irrelevant to the target variable or contain redundant information and to find the subset of features with the best discriminative ability for modeling the empirical data. The simplified dataset can effectively shorten the model training time and improve the accuracy of the constructed model. Additionally, by using simplified data to create models, researchers can gain greater assistance in comprehending and analyzing the original data.

The theoretical value of the definition of feature selection lies in three aspects: 1) the definition is formulated from the perspective of coherence and redundancy between features and the target variable, which points out the direction of research on the feature selection problem; 2) it clarifies the purpose of that research, i.e., searching for the best-performing feature combinations in the original feature space, which unifies researchers' lines of thinking; 3) the definition does not restrict the optimization objective of the problem but is guided by discovering the knowledge hidden behind the original dataset, which gives researchers more room for algorithmic choices when solving specific problems and greatly expands the scope of application of the definition of feature selection.

General process of feature selection

Figure 1 shows the process of the feature selection algorithm, which contains five basic steps:

1) The feature selection algorithm starts with the initialization process.

The initialization process is the first step of the feature selection algorithm, which is based on all the original features in the problem. For example, in PSO-based feature selection algorithms, the search space dimension is usually set to the total number of all available features for this process.

2) The process of generating a subset of candidate features.

This search process can start with no features, all features, or a random subset of all features. Many search techniques are applied in this feature subset search step to search for the best subset of features, including conventional methods and evolutionary computational techniques.

3) The evaluation process, which is used to test the merit of the feature subset.

The subset of features generated by the search process will be examined by the evaluation function to determine their merit. The evaluation function plays an important role in the feature selection algorithm as it helps guide the algorithm in searching for the optimal subset of features.

4) Deciding when to stop based on given criteria.

The stopping criteria can be based either on the search process or on the evaluation function. Stopping criteria based on the search process include whether a predetermined number of features has been selected or whether a predetermined maximum number of iterations has been reached. Stopping criteria determined by the evaluation process include whether adding or removing any feature fails to produce a better subset, or whether the best subset according to a certain evaluation function has been obtained. The loop continues until the stopping criterion is satisfied.

5) The validation process, for testing whether a subset of features is valid.

The validation process is not part of feature selection itself, but it is needed to confirm the effectiveness of the feature selection algorithm, so the selected feature subset is validated on a test set.

Figure 1.

General process of feature selection

The two key factors in a feature selection algorithm are the search strategy and the evaluation criteria. Feature selection entails a search space of 2^n candidate solutions, where n is the number of available features. Feature selection algorithms explore this space over different combinations of features to find the optimal subset. When the number of features is large, the search space is enormous, which is one of the main reasons feature selection is a difficult task. In most cases it is therefore impractical to search the entire space exhaustively, and an effective global search technique is needed to find the optimal feature subset.

Evaluation criteria are used to form an evaluation function, which determines the merit of the selected features and guides the algorithm’s search. The optimal subset of features is always associated with an evaluation function. For the feature selection problem, the optimal subset of features selected using one evaluation function may be different from the optimal subset of features selected using another evaluation function. Due to the problem of feature interaction, the subset of features that can achieve the highest classification performance is usually a set of complementary features.

A good evaluation criterion will preferentially guide the algorithm toward complementary feature subsets. Depending on whether a learning (classification) algorithm is used inside the evaluation function, existing feature selection algorithms can be broadly categorized into two groups: filter methods and wrapper methods. Filter feature selection algorithms are independent of any classification algorithm, whereas wrapper methods use a classification algorithm in the evaluation function. The wavelet threshold filtering algorithm described in the following section belongs to the filter methods.
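To make the interplay between the search strategy and the evaluation function concrete, the following hedged sketch implements a simple wrapper-style sequential forward selection: it initializes with no features, generates candidate subsets by adding one feature at a time, evaluates them with a 1-NN classifier (the classifier used later in the experiments), stops when no addition improves the score, and validates on held-out data. It assumes scikit-learn and uses a standard benchmark dataset purely as a stand-in; it is not the algorithm proposed in this paper.

```python
# Sketch of a wrapper-style sequential forward search (one of many possible
# search strategies); uses cross-validated 1-NN accuracy as the evaluation function.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def evaluate(subset):
    """Evaluation function: cross-validated accuracy of 1-NN on the feature subset."""
    clf = KNeighborsClassifier(n_neighbors=1)
    return cross_val_score(clf, X_train[:, subset], y_train, cv=5).mean()

selected, best_score = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:                      # stopping criterion: no feature improves the score
    score, f = max((evaluate(selected + [f]), f) for f in remaining)
    if score <= best_score:
        break
    selected.append(f)
    remaining.remove(f)
    best_score = score

# Validation step on held-out data
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train[:, selected], y_train)
print("selected features:", selected,
      "test accuracy:", clf.score(X_test[:, selected], y_test))
```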

Algorithm selection and optimization

This part describes the wavelet threshold algorithm among the filter methods, including its basic principle, threshold selection, and threshold function selection. Based on the limitations of the wavelet threshold algorithm, a genetic algorithm is then introduced for optimization, and the scaling crossover operator and the threshold filtering part of the genetic algorithm are designed.

Wavelet Thresholding (WT) Denoising Methods
Wavelet Threshold Filtering Fundamentals

The wavelet threshold filtering algorithm is a denoising method that is simple to implement, computationally light, and effective. Its basic principle is as follows: after the wavelet transform, a noisy image is decomposed into wavelet coefficients of different amplitudes. The energy of the image signal is concentrated in a few large-amplitude wavelet coefficients, while the noise energy is spread over coefficients of smaller amplitude. A threshold is therefore set to distinguish these coefficients according to their magnitudes: wavelet coefficients whose magnitude is below the threshold are eliminated, and coefficients whose magnitude exceeds the threshold are retained or shrunk accordingly. Finally, the remaining wavelet coefficients are used to reconstruct the image, which denoises the image while preserving its detail information well.

Using the threshold method of filtering, there are two key points: the first is the selection of the threshold parameters; the second is the selection of the threshold function.

Wavelet threshold selection

The wavelet threshold filtering algorithm relies on threshold selection, whose size directly affects image denoising. If the threshold is too large, useful high-frequency information (such as edges and texture details) is lost, resulting in a distorted, blurred image; if the threshold is too small, too much noise remains and the filtering effect is poor.

A commonly used image denoising threshold is the VisuShrink threshold, also called the unified or universal threshold, whose value λ is expressed as: $$\lambda = \sigma \sqrt {2\ln N}$$

In equation (5), σ is the standard deviation of the noise and N is the size of the noisy image or the length of the signal. A further condition for using this threshold is that the noise level must be estimated first. In practice, the noise standard deviation $$\hat \sigma$$ is unknown; since the noise is mainly concentrated in the finest-scale HH1 subband, it can be estimated with the median estimation method from the wavelet coefficients of the HH1 subband: $$\hat \sigma = \frac{{Median\left( {\left| {W\left[ {{x_{ij}}} \right]} \right|} \right)}}{{0.6745}},\quad W\left[ {{x_{ij}}} \right] \in H{H_1}$$

In Eq. (6), $$\hat \sigma$$ is the estimate of the noise standard deviation, and $$Median\left( {\left| {W\left[ {{x_{ij}}} \right]} \right|} \right)$$ denotes the median of the absolute values of the wavelet coefficients of the first-level high-frequency subband HH1 after wavelet decomposition.

Selection of wavelet threshold function

The threshold function, also called the threshold shrinkage strategy, thresholds the noisy high-frequency wavelet coefficients, applying different processing strategies to coefficients whose magnitudes are greater than or less than the threshold. The aim is to suppress the wavelet coefficients corresponding to noise as much as possible while retaining the coefficients corresponding to the image signal.

Let wj,k be the unprocessed coefficients of the noisy image or signal after wavelet decomposition, $${\hat w_{j,k}}$$ denote the wavelet coefficient estimates output after thresholding, and λ be the image denoising threshold. A commonly used threshold function is the hard threshold function. It compares the absolute value of each wavelet transform coefficient wj,k with the selected threshold λ, retains only the larger coefficients, and sets the smaller ones to zero: $${\hat w_{j,k}} = \left\{ {\begin{array}{c} 0&{\left| {{w_{j,k}}} \right| < \lambda } \\ {{w_{j,k}}}&{\left| {{w_{j,k}}} \right| \geq \lambda } \end{array}} \right.$$

However, the hard threshold function is not continuous on the interval (−λ, λ); it has break points at ±λ. In practical applications the threshold function often needs to be differentiated, but because of these discontinuities the derivative does not exist at the break points, which imposes some limitations. In addition, the hard threshold function keeps wavelet coefficients larger than the threshold unchanged and only processes coefficients below the threshold, yet coefficients above the threshold may also contain noise interference, so this thresholding approach has inherent defects.
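The following minimal sketch, assuming the PyWavelets package, ties the pieces of this section together: the noise standard deviation is estimated from the finest diagonal (HH1) subband by the median rule of Eq. (6), the universal threshold of Eq. (5) is computed, and hard thresholding is applied to the detail coefficients before reconstruction. It is an illustrative implementation, not the exact code used in the paper.

```python
# Minimal wavelet hard-threshold denoising sketch using PyWavelets.
import numpy as np
import pywt

def wavelet_hard_denoise(image, wavelet="db4", level=2):
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    hh1 = coeffs[-1][2]                                # finest diagonal detail subband (HH1)
    sigma = np.median(np.abs(hh1)) / 0.6745            # Eq. (6): median-based noise estimate
    lam = sigma * np.sqrt(2.0 * np.log(image.size))    # Eq. (5): universal (VisuShrink) threshold
    # Hard-threshold all detail subbands, keep the approximation coefficients untouched
    new_coeffs = [coeffs[0]] + [
        tuple(pywt.threshold(c, lam, mode="hard") for c in detail)
        for detail in coeffs[1:]
    ]
    return pywt.waverec2(new_coeffs, wavelet)

noisy = np.random.default_rng(0).normal(0, 25, (256, 256)) + 128.0
print(wavelet_hard_denoise(noisy).shape)
```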

AKF structure based on genetic algorithm (GA) optimization
Basic Flow of Genetic Algorithm

Select the output measurements yk of the data segment, the input measurements uk, and the filter parameter values ρ to be optimized. After the time update and state update of the improved adaptive threshold filter, the a posteriori estimate of the system state $$\hat x\left( k \right)$$ is obtained. The objective function needed for optimization is then computed, and the genetic algorithm performs optimal retention, replication, crossover, and mutation operations over the m candidate solutions in the solution space to obtain a new, corrected population, which is fed back into the AKF for parameter estimation. These steps are repeated on the measurement information until the objective function meets the design requirements or the population updates of the genetic algorithm end.

The genetic algorithm is a global search and optimization algorithm that is independent of the domain of the problem under study, with good parallel computing performance and global search capability. Its basic flow is shown in Fig. 2. By analogy with natural selection, gene crossover, and mutation in biology, the solution objects are first encoded; a suitable fitness function is then set to select good individuals for replication, and the fitter group is inherited by the next generation. By analogy with biological mutation, random mutation operations are performed on individual genes to expand the search space. Reproduction continues in this way until the individual with the optimal fitness is obtained or the preset loop termination condition is satisfied.

Figure 2.

Basic flow of genetic algorithm

The computational steps generally include:

Initialization: the initialization operation generally consists of choosing a binary, floating-point, or symbolic encoding of the parameters to be solved, setting the population size, and generating the initial population and the number of evolutionary generations.

Selection: using fitness as the criterion, genetically superior individuals are selected from the current population and replicated for inheritance by the next generation.

Crossover: individuals in the selected population are paired according to some probability, and parts of the information of better-adapted individuals are exchanged in order to produce individuals of the new population.

Mutation: for individuals selected with a certain probability, the genes at certain positions are changed according to the mutation probability, yielding new individuals with broadened diversity.
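For concreteness, a minimal genetic-algorithm loop covering the four steps above is sketched below; the fitness function, parameter values, and arithmetic blend crossover are illustrative placeholders rather than the configuration used in this paper.

```python
# Minimal sketch of the basic genetic-algorithm loop
# (initialization, selection, crossover, mutation).
import numpy as np

rng = np.random.default_rng(0)

def fitness(ind):                       # placeholder objective: maximize -sum(x^2)
    return -np.sum(ind ** 2)

pop_size, n_vars, n_gen, pc, pm = 40, 10, 100, 0.8, 0.05
pop = rng.uniform(-5, 5, (pop_size, n_vars))          # initialization

for _ in range(n_gen):
    fit = np.array([fitness(ind) for ind in pop])
    # selection: fitness-proportional (shifted to be positive)
    probs = fit - fit.min() + 1e-9
    probs /= probs.sum()
    parents = pop[rng.choice(pop_size, pop_size, p=probs)]
    # crossover: arithmetic blend of adjacent parent pairs with probability pc
    children = parents.copy()
    for i in range(0, pop_size - 1, 2):
        if rng.random() < pc:
            a = rng.random()
            children[i]     = a * parents[i] + (1 - a) * parents[i + 1]
            children[i + 1] = a * parents[i + 1] + (1 - a) * parents[i]
    # mutation: perturb individual genes with probability pm
    mask = rng.random(children.shape) < pm
    children[mask] += rng.normal(0, 0.5, mask.sum())
    pop = children

print("best fitness:", max(fitness(ind) for ind in pop))
```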

Scaling cross operator design

In genetic algorithms, the crossover operator is the main feature that distinguishes them from other evolutionary algorithms. By simulating biological hybrid reproduction, gene crossover operations on individuals increase the diversity of the population, enlarge the range of the search, and help the search reach the global optimal solution in the solution space.

The specific operation is as follows: two individuals are selected from the population according to the design pattern, and parts of their genes are exchanged in a designed way according to the crossover probability Pc, so as to produce two new individuals. When the crossover probability Pc is large, the crossover operator has a strong ability to produce new individuals and thus a better ability to explore the global optimal solution in the solution space; conversely, when Pc is small, good individuals in the population are less likely to be destroyed, and the ability to refine the search locally around the current solutions is better.

In this paper, floating-point encoding is used: the gene values in each individual are expressed as floating-point numbers in the corresponding ranges, and the encoding length of an individual equals the number of variables. Single-point, two-point, and uniform crossover, which apply to binary encoding, can also be applied to floating-point encoding, as can the arithmetic crossover operator, the scaling crossover operator, and others. Considering the large length of the individuals, this paper uses the scaling crossover operator. After the parameter values of the initial population are randomly dispersed in the solution space, the population is generally large; after the selection operator is applied, the size of the existing population is less than or equal to the initial population size, and the crossover operation is then carried out on the existing population.

The scaling crossover probability is set as follows: after sorting the individuals of the existing population by fitness, two individuals are selected and new individuals are generated by a non-uniform linear combination weighted by their relative fitness. Assuming a crossover between individual A and individual B, the crossover probability and the two new individuals produced are: $${P_c} = \frac{{{f_A}}}{{{f_A} + {f_B}}}$$ $$\begin{array}{rcl} A^\prime &=& {P_c}B + \left( {1 - {P_c}} \right)A \\ B^\prime &=& {P_c}A + \left( {1 - {P_c}} \right)B \\ \end{array}$$

The selection strategy for the two crossover individuals is as follows. First, the largest and smallest fitness values are found after sorting the existing population by fitness; their shares of the total fitness are fpmax and fpmin, respectively, and their probabilities of being selected for crossover are 1 and 0. The probability of the remaining individuals being selected by the crossover operator is then given by Equation (10): $${P_i} = f{p_{\min }} + \left( {f{p_{\max }} - f{p_{\min }}} \right)\frac{{i - 1}}{{n^\prime - 1}}$$

Two individuals are selected in the form of a roulette wheel, as shown in Fig. 3; the scaling crossover operator is then applied according to Eq. (9), and at the end of the crossover step the new individuals are added to the population, so that the size of the existing population becomes n″.

Figure 3.

Roulette selects cross individuals
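A hedged sketch of the scaling crossover step described above follows: parents are drawn with selection probabilities that interpolate linearly across the fitness-sorted population as in Equation (10), and offspring are generated by the fitness-scaled linear combination of Eq. (9). The toy population and fitness values are invented for illustration.

```python
# Sketch of the scaling crossover operator with fitness-based roulette selection.
import numpy as np

rng = np.random.default_rng(0)

def scaling_crossover(pop, fit):
    order = np.argsort(fit)                    # sort individuals by fitness
    pop, fit = pop[order], fit[order]
    n = len(pop)
    fp = fit / fit.sum()
    fp_min, fp_max = fp.min(), fp.max()
    # selection probability of the i-th sorted individual (Eq. (10)), used as roulette weights
    p = fp_min + (fp_max - fp_min) * np.arange(n) / (n - 1)
    p = p / p.sum()
    i, j = rng.choice(n, size=2, replace=False, p=p)
    f_a, f_b = fit[i], fit[j]
    pc = f_a / (f_a + f_b)                     # fitness-scaled crossover probability
    a_new = pc * pop[j] + (1 - pc) * pop[i]    # Eq. (9)
    b_new = pc * pop[i] + (1 - pc) * pop[j]
    return a_new, b_new

pop = rng.uniform(0, 1, (20, 8))               # toy floating-point population
fit = rng.uniform(1, 10, 20)                   # toy (positive) fitness values
child1, child2 = scaling_crossover(pop, fit)
print(child1.round(3), child2.round(3))
```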

Exponentially weighted decay memory scaling cross-filtering

Since $${P_{k/k - 1}}$$, Kk, and Pk contain only prior information, the corrective effect of new measurements declines during filtering while that of old measurements rises relatively; errors introduced in the calculation then make the estimation error larger than expected and cause filter divergence. The fundamental measure to prevent divergence is therefore to increase the weight of new measurements in the current filtering step. To this end, an adaptive threshold filtering algorithm with exponentially weighted decaying memory is proposed. Its main idea is that, when the model error is large, the memory length of the threshold filter is limited at each moment by an exponentially weighted decaying memory factor, which makes full use of the new measurements and suppresses filter divergence.

Taking k = N, the optimal gain matrix of the threshold filter is: $${K_N} = {P_{N/N - 1}}H_N^{\prime T}{\left( {H_N^\prime {P_{N/N - 1}}H_N^{\prime T} + R_N^\prime } \right)^{ - 1}}$$

To overcome divergence, the values of Kk before moment N must be reduced so that KN at moment N becomes relatively prominent; the corrective effect of the new measurements then rises while that of the obsolete measurements declines. Because Kk is inversely related to $${{R_k'}}$$, reducing Kk before moment N can be achieved by making $${{R_k'}}$$ progressively larger the further moment k is from moment N.

Using an exponentially weighted approach, P0; $$R_1^\prime,R_2^\prime, \cdots ,R_N^\prime$$; $$Q_1^\prime,Q_2^\prime, \cdots ,Q_N^\prime$$ can be changed to $${e^{\sum\limits_{i = 1}^{N - 1} {{\lambda _i}} }}{P_0};\;\;{e^{\sum\limits_{i = 1}^{N - 1} {{\lambda _i}} }}R_1^\prime ,\;{e^{\sum\limits_{i = 2}^{N - 1} {{\lambda _i}} }}R_2^\prime , \cdots ,\;{e^{{\lambda _{N - 1}}}}R_{N - 1}^\prime ,\;R_N^\prime ;\;\;{e^{\sum\limits_{i = 1}^{N - 1} {{\lambda _i}} }}Q_1^\prime ,\;{e^{\sum\limits_{i = 2}^{N - 1} {{\lambda _i}} }}Q_2^\prime , \cdots ,\;{e^{{\lambda _{N - 1}}}}Q_{N - 1}^\prime ,\;Q_N^\prime$$

Replacing the data prior to moment N according to Eq. (12), the corresponding threshold filtering recursion becomes: $${\widehat X^*}_{k/k - 1} = {\Phi _k}\widehat X_{k - 1}^*$$ $${\widehat X^*}_k = {\widehat X^*}_{k/k - 1} + {K_k}\left( {Z_k^\prime - H_k^\prime \widehat X_{k/k - 1}^*} \right)$$ $$K_k^* = P_{k/k - 1}^*H_k^{\prime T}{\left( {H_k^\prime P_{k/k - 1}^*H_k^{\prime T} + R_k^{\prime *}} \right)^{ - 1}}$$ $$P_{k/k - 1}^* = {e^{{\lambda _k}}}{\Phi _{k/k - 1}}P_{k - 1}^*\Phi _{k/k - 1}^T + Q_{k - 1}^*$$ $$P_k^* = \left[ {I - K_k^*H_k^\prime } \right]P_{k/k - 1}^*$$

The initial values are $$X_0^* = E\left[ {{X_0}} \right];\quad P_0^* = \operatorname{var} \left[ {{X_0}} \right] = {P_0}$$

In these formulas, the variables marked with “*” are the values of optimal filtering at moment N. Compared with standard threshold filtering, the first term in the expression for the exponentially weighted one-step prediction covariance $$P_{k/k - 1}^*$$ is multiplied by a factor $${e^{{\lambda _k}}}$$ greater than 1, which makes $$P_{k/k - 1}^*$$ larger and hence, as can be seen from Eq. (14), makes $$K_k^*$$ larger. The increase in $$K_k^*$$ raises the weight of the new measurement at moment k, which effectively prevents filter divergence.

The threshold filtering divergence criterion is: $$\widehat e_k^T{\widehat e_k} \leq \gamma \cdot trE\left( {{{\widehat e}_k}\widehat e_k^T} \right)$$

where $${\hat e_k} = Z_k^\prime - H_k^\prime \hat X_{k/k - 1}^*$$ is the residual sequence, γ ≥ 1 is the reserve coefficient, and γ = 1 gives the strictest convergence condition. We then have: $${e^{{\lambda _k}}} = \frac{{\hat e_k^T{{\hat e}_k} - tr\left[ {H_k^\prime Q_{k - 1}^*H_k^{\prime T} + R_k^{\prime *}} \right]}}{{tr\left[ {H_k^\prime {{{\Phi }}_{k/k - 1}}P_{k - 1}^*{{\Phi }}_{k/k - 1}^TH_k^{\prime T}} \right]}}$$

To suppress divergence and increase the weight of the new measurement, $${e^{{\lambda _k}}}$$ is generally taken to be greater than 1 and is set to 1 when its computed value is less than 1.

In practice, different data channels require different attenuation factors to suppress divergence, so multiple decaying memory factors are introduced to apply fading at different rates to different channels, which yields a multiple-decaying-memory-factor filter with higher performance.

Equation (16) can then be improved as: $$P_{k/k - 1}^* = {S_k}{\Phi _{k/k - 1}}P_{k - 1}^*\Phi _{k/k - 1}^T + Q_{k - 1}^*$$

where Sk is the exponentially weighted matrix of multiple decaying memory factors, $${S_k} = diag\left( {{e^{{\lambda _{1k}}}},{e^{{\lambda _{2k}}}}, \cdots ,{e^{{\lambda _{(n - 1)k}}}},{e^{{\lambda _{nk}}}}} \right)$$.
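The single-step sketch below illustrates, under an assumed linear state-space model, how the exponentially weighted decaying memory factor enters the filter: the fading factor is computed from the residual-based criterion, floored at 1, and used to inflate the one-step prediction covariance before the usual gain and update equations. The matrices and numbers are toy values, not the system studied in the paper.

```python
# Illustrative single step of a decaying-memory (fading) Kalman-style filter update.
import numpy as np

def fading_memory_step(x, P, z, Phi, H, Q, R):
    # residual on the predicted state
    x_pred = Phi @ x
    e = z - H @ x_pred
    # fading factor e^{lambda_k} from the residual-based criterion, floored at 1
    num = float(e.T @ e) - np.trace(H @ Q @ H.T + R)
    den = np.trace(H @ Phi @ P @ Phi.T @ H.T)
    fade = max(1.0, num / den)
    # one-step prediction covariance inflated by the fading factor
    P_pred = fade * (Phi @ P @ Phi.T) + Q
    # standard gain and measurement update
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ e
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# toy 2-state / 1-measurement example
x = np.zeros((2, 1)); P = np.eye(2)
Phi = np.array([[1.0, 1.0], [0.0, 1.0]]); H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2); R = np.array([[0.5]])
z = np.array([[3.0]])
x, P = fading_memory_step(x, P, z, Phi, H, Q, R)
print(x.ravel(), np.diag(P))
```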

Causal feature selection for high-dimensional stationary data

This part focuses on causal feature selection methods for high-dimensional stationary data. Feature selection is typically used to identify a subset of relevant features from a large number of features in order to construct better predictive models. In bioinformatics, feature (e.g., gene) selection identifies a small number of informative genes for disease prediction. In the era of big data, feature selection is more critical than ever because high-dimensional data are ubiquitous across applications; for example, a gene expression dataset in cancer genomics may easily exceed 10,000 features. In recent years, causal feature selection has gradually attracted more attention in machine learning and causal discovery, with main application areas in bioinformatics, neuroscience, and intelligent systems; it aims to identify the Markov blanket (MB), i.e., the potential causal relationships, of the target feature T in order to construct accurate predictive models. Under the faithfulness assumption, the MB of the target feature T in a Bayesian network (BN) is unique. MB discovery is crucial for scalable BN structure learning and optimal feature selection. For example, by finding the MB of each feature in the dataset, the discovered MBs can be used as constraints to develop efficient local-to-global structure learning methods. Meanwhile, given the MB of the target feature T, all remaining features are conditionally independent of T; the MB of the class attribute is therefore theoretically optimal for feature selection. The goal of this paper is thus efficient and accurate MB discovery.
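As a rough illustration of what MB discovery involves, the sketch below follows a simplified grow-shrink scheme in the spirit of IAMB-style algorithms: features are greedily added while they remain dependent on the target given the current candidate MB, and spuriously added features are then pruned. The ci_test and assoc callbacks (e.g., a G2 or Fisher z test and an association measure) are assumed to be supplied by the caller; this is not the exact procedure proposed in the paper.

```python
# Simplified grow-shrink Markov blanket (MB) discovery sketch.
# ci_test(data, x, t, cond) is assumed to return True when x is independent
# of t given cond; assoc(data, x, t, cond) returns an association score.
def discover_mb(data, target, features, ci_test, assoc):
    mb = []
    # growing phase: add the most associated feature while it stays dependent on the target
    changed = True
    while changed:
        changed = False
        candidates = [f for f in features if f != target and f not in mb]
        if not candidates:
            break
        best = max(candidates, key=lambda f: assoc(data, f, target, mb))
        if not ci_test(data, best, target, mb):
            mb.append(best)
            changed = True
    # shrinking phase: remove features that become conditionally independent of the target
    for f in list(mb):
        rest = [g for g in mb if g != f]
        if ci_test(data, f, target, rest):
            mb.remove(f)
    return mb
```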

Experiments on feature selection for the optimization method in this paper

In this section, the efficiency and efficacy of the proposed optimization method are compared with four MB discovery algorithms, namely IAMB, MMMB, STMB, and HITON-MB. There are two sets of experiments: the first measures the accuracy of the MB features selected, and the second measures the running time. Accuracy is assessed using the G2 test (discrete data) and the Fisher z-test (continuous data).

Implementation details and data sets

All experiments were conducted on Windows 11 with an Intel Core i7-8750H CPU at 4.10 GHz and 8 GB of memory. Two statistical conditional independence (CI) tests were used: the G2 test for discrete datasets and the Fisher z test for continuous datasets, with significance levels of 0.02 and 0.06, respectively. Table 1 describes five benchmark BNs with a sample size of 500. These BNs have between 30 and 802 features (variables/nodes); for each network, the table also lists the number of edges, the maximum in/out degree, and the minimum/maximum size of the parent-child (PC) set.

Table 1. Benchmark BNs (synthetic datasets)

Bayesian Network #Features #Edges Max In/Out Degree Min/Max |PC set|
Child 30 45 3/7 2/8
Child10 204 128 2/7 2/9
Pig 445 591 4/39 2/41
Gene 802 973 3/10 1/11
Hailfinder 58 65 5/16 2/17
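For reference, a hedged sketch of the Fisher z conditional-independence test used for continuous data is shown below: the partial correlation is computed from the inverse covariance, Fisher-transformed, and compared against the normal quantile at the chosen significance level (0.06 here). The helper name and the synthetic data are illustrative.

```python
# Fisher z conditional-independence test sketch for continuous data.
import numpy as np
from scipy.stats import norm

def fisher_z_ci_test(data, x, y, cond, alpha=0.06):
    """data: (n, d) array; x, y: column indices; cond: list of column indices."""
    idx = [x, y] + list(cond)
    sub = data[:, idx]
    prec = np.linalg.pinv(np.cov(sub, rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation of x, y given cond
    r = np.clip(r, -0.999999, 0.999999)
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return p_value > alpha        # True -> independence not rejected

rng = np.random.default_rng(0)
z0 = rng.normal(size=1000)
x0 = z0 + 0.1 * rng.normal(size=1000)
y0 = z0 + 0.1 * rng.normal(size=1000)
data = np.column_stack([x0, y0, z0])
print(fisher_z_ci_test(data, 0, 1, [2]))   # expected True: independent given z
print(fisher_z_ci_test(data, 0, 1, []))    # expected False: dependent marginally
```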
Analysis of experimental results

Table 2 shows the experimental results of the optimization method in this paper with IAMB, MMMB, STMB, and HITON-MB on five synthetic datasets with a sample size of 500. Figure 4 shows the runtime comparison of the optimization method in this paper with IAMB, MMMB, STMB, and HITON-MB on five synthetic datasets with a sample size of 500. The experiments use the F1, Precision, and Recall metrics, as well as classifiers from MATLAB’s built-in functions or tools, to evaluate the accuracy of the generated MBs with respect to the target MBs and to determine their similarity to real MBs.

Table 2. Evaluation measures on different benchmark BNs

Network Algorithm F1 Precision Recall Run-Time (s)
Child IAMB 0.79/0.68 0.87/0.68 0.78/0.79 0.09
MMMB 0.75/0.69 0.88/0.78 0.72/0.71 0.25
STMB 0.74/0.71 0.92/0.86 0.66/0.66 0.15
HITON-MB 0.83/0.82 0.95/0.85 0.76/0.76 0.25
WTMB 0.88/0.86 0.98/0.95 0.78/0.78 0.13
Child10 IAMB 0.63/0.63 0.61/0.62 0.66/0.65 1.18
MMMB 0.68/0.67 0.71/0.68 0.64/0.65 2.41
STMB 0.57/0.51 0.71/0.57 0.61/0.56 2.26
HITON-MB 0.78/0.73 0.91/0.76 0.72/0.74 2.31
WTMB 0.73/0.71 0.84/0.76 0.65/0.61 2.22
Pig IAMB 0.73/0.74 0.77/0.81 0.71/0.73 11.06
MMMB 0.82/0.81 0.81/0.81 0.82/0.82 13.01
STMB 0.76/0.72 0.76/0.82 0.74/0.70 11.06
HITON-MB 0.92/0.74 0.86/0.62 0.91/0.91 14.77
WTMB 0.93/0.92 0.92/0.92 0.93/0.90 11.76
Gene IAMB 0.62/0.67 0.61/0.63 0.67/0.67 20.08
MMMB 0.71/0.71 0.67/0.67 0.77/0.77 23.16
STMB 0.55/0.17 0.43/0.11 0.72/0.72 30.23
HITON-MB 0.81/0.79 0.78/0.67 0.91/0.90 23.33
WTMB 0.91/0.91 0.91/0.88 0.94/0.94 21.86
Hailfinder IAMB 0.21/0.22 0.24/0.23 0.16/0.17 0.21
MMMB 0.22/0.23 0.28/0.27 0.16/0.17 0.31
STMB 0.17/0.16 0.27/0.24 0.13/0.12 1.52
HITON-MB 0.21/0.21 0.31/0.27 0.16/0.16 0.76
WTMB 0.22/0.22 0.32/0.30 0.17/0.18 0.22
Figure 4.

Running time (seconds) of the algorithms on five benchmark BNs

Analysis of Table 2 and Fig. 4 reveals that on the Child network, the proposed optimization method is more accurate than IAMB, MMMB, STMB, and HITON-MB in terms of F1, Precision, and Recall, while its running time is shorter than those of MMMB, STMB, and HITON-MB but longer than that of IAMB. On the Child10 network, the proposed method achieves higher F1 and Precision than IAMB, MMMB, and STMB, and it also outperforms IAMB, MMMB, and STMB in Recall. HITON-MB is more accurate than the proposed method in F1 and Recall; however, at the 0.06 significance level the accuracy of the proposed method is comparable to that of HITON-MB. On Child10, the running time of the proposed method is lower than those of MMMB, STMB, and HITON-MB and higher than that of IAMB.

The optimization method in this paper is more accurate and runs faster than MMMB, STMB, and HITON-MB on large networks like Pig and Gene networks with sample size of 500. In addition, the optimization method in this paper is more accurate than IAMB but runs slower. On dense networks like Hailfinder, the optimization method in this paper is faster than the other three algorithms but slower than IAMB. The optimization method in this paper outperforms all tested methods in terms of F1, Precision and Recall.

The running time of the proposed optimization method is generally shorter than those of MMMB, STMB, and HITON-MB but longer than that of IAMB, because the proposed method reduces the number of independence tests by conditioning on the entire currently selected candidate MB of the target T at each estimation. IAMB, however, requires more data samples for each test, because the number of samples required grows exponentially with the size of the conditioning set. Thus, the IAMB learning technique is time-efficient but not data-efficient. When the sample size of the dataset is insufficient, the IAMB algorithm cannot reliably find MBs; with a limited number of samples, the quality of the MBs learned by IAMB is greatly affected in practice, and overall it is not as effective as the proposed optimization method.

Comprehensive experimental results show that compared with other existing algorithms, the optimization method in this paper generally has better performance in terms of F1, Precision, Recall, and Run-Time, and is able to find the MB of the target feature T more reliably.

Comparison of Algorithm Effectiveness

To verify the superiority of the proposed optimization method, we experimentally compare the feature classification error rates of several algorithms on the datasets, and further compare the key "scaling crossover operator" of the proposed method with other operators in terms of selected feature subset size, so as to verify the effectiveness of the scaling crossover operator.

Comparison of algorithmic classification error rates

In terms of parameter settings, the proposed optimization method uses the default parameters of Weka. Table 3 shows the parameter settings of the other four algorithms. All algorithms use the performance of the classifier as the evaluation criterion for the feature subset. For fairness, KNN (k=1) is used as the classifier, and the maximum number of iterations and the population size of all algorithms are set to 60 and 350, respectively. Note that the compared algorithms include both single-objective and multi-objective optimization algorithms; since the classification error rate is a common concern of all feature selection algorithms, this paper selects, from the final solution set of each algorithm at the last iteration, the individual with the lowest training error rate for comparison.

Table 3. Experimental parameter settings

Algorithm Parameters
IAMB Attenuation factor γ=0.2
MMMB Mutation probability=0.2, Mutation retry number=2
STMB c1=c2=3, ω=1.0124, θ=0.5, α=k/WM_rate
HITON-MB pmin=0.16, CR=0.3, F=0.4, α=0.4, β=0.3, σ=0.2

After setting up the experimental parameters, this paper compares the classification error rates of the five algorithms in the datasets. In a total of 20 comparisons in 5 datasets, the optimization method in this paper won 16 times, drew 1 time, and lost 3 times. In terms of classification error rate, the optimization method of this paper ranks first in 4 out of 5 datasets. Therefore, the experimental results verify that the optimization method presented in this paper is effective in dealing with high-dimensional datasets. The advantage of this paper’s optimization method over other more advanced feature selection algorithms is mainly the combination of two search operators with causal relationships using the scaling crossover operator.

To further validate the effectiveness of the proposed optimization method, Table 4 compares the nonparametric effect sizes of the proposed method against the other algorithms using Cliff's delta, which measures the degree of overlap between two sets of samples and ranges from -1 to 1. In general, when |δ| is less than 0.146, the difference between the two groups of samples is negligible; when |δ| is between 0.146 and 0.32, the difference is small; when |δ| is between 0.32 and 0.473, the difference is moderate; and when |δ| is greater than 0.473, the difference is large. The results in Table 4 show that the proposed optimization method differs moderately or largely from the more advanced algorithms on most of the datasets.

Table 4. Comparison of nonparametric effect sizes using Cliff's delta

Dataset Algorithm δ
Child IAMB 0.76
MMMB 0.39
STMB 0.61
HITON-MB 0.75
WTMB 0.80
Child10 IAMB 0.75
MMMB 0.43
STMB 0.58
HITON-MB 0.71
WTMB 0.75
Pig IAMB 0.17
MMMB 0.39
STMB 0.42
HITON-MB 0.51
WTMB 0.49
Gene IAMB 0.33
MMMB 0.54
STMB 0.41
HITON-MB 0.19
WTMB 0.40
Hailfinder IAMB 0.61
MMMB 0.42
STMB 0.37
HITON-MB 0.55
WTMB 0.76
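For readers who wish to reproduce this kind of comparison, a small Cliff's delta helper is sketched below; the function name and the toy error-rate lists are illustrative, not taken from the paper's experiments.

```python
# Cliff's delta: (#{x_i > y_j} - #{x_i < y_j}) / (m * n); |delta| thresholds of
# 0.146 / 0.32 / 0.473 separate negligible, small, moderate, and large effects.
import numpy as np

def cliffs_delta(x, y):
    x, y = np.asarray(x), np.asarray(y)
    diffs = x[:, None] - y[None, :]
    return (np.sum(diffs > 0) - np.sum(diffs < 0)) / (len(x) * len(y))

a = [0.12, 0.10, 0.11, 0.09]   # e.g., error rates of one algorithm over repeated runs
b = [0.16, 0.15, 0.14, 0.17]   # error rates of a competing algorithm
print(cliffs_delta(a, b))      # -1.0: every value in a is lower than every value in b
```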
Validation of the effectiveness of the scaling crossover operator

To further validate the effectiveness of the scaling crossover operator proposed in our optimization method, three other genetic operators are used for performance comparison in this part, namely the arithmetic crossover operator, the selection operator, and the mutation operator. Figure 5 shows the performance of the scaling crossover operator, the arithmetic crossover operator, the selection operator, and the mutation operator on the “Child10”, “Pig”, “Gene”, and “Hailfinder” datasets, respectively.

Figure 5.

Box plots of the feature dimensions obtained by the four operators

As shown in Fig. 5, the scaling crossover operator in the proposed optimization method shows a clear advantage in the number of selected features compared with the other three operators; specifically, it obtains a smaller feature subset on each dataset. The experimental results also show that the different selection probabilities of the operators lead to different performance on different datasets: the higher the selection probability, the smaller the selected feature subset and the higher the error rate, and vice versa. This shows that the selection probability of the scaling crossover operator has a key impact on the performance of the proposed method on different datasets. With a lower selection probability, the scaling crossover operator selects fewer features and obtains a smaller feature subset, which also improves the rate at which the proposed method correctly identifies the MB of the target feature T.

Conclusion

This paper investigates available algorithms for the feature selection problem in high-dimensional data. The scaling crossover operator of the genetic algorithm is used to optimize the wavelet threshold filtering algorithm, a weighted decaying memory factor is introduced, and the optimized method and the scaling crossover operator are verified through comparative experiments on feature selection over several datasets, confirming the applicability of the proposed method. The proposed method and the other four algorithms are applied to five datasets for identifying the potential causal relationships (MB) of target features, with significance levels of 0.02 and 0.06, respectively. The experimental results show that the proposed method generally achieves higher identification accuracy than the other four algorithms in terms of F1, Precision, and Recall, and is relatively fast in terms of running time. The proposed method can thus identify the potential causal relationships of target features more reliably.

The classification error rate of the proposed optimization method was evaluated against the other algorithms. In a total of 20 comparisons over 5 datasets, the proposed method won 16 times, tied 1 time, and lost 3 times, an 80% winning rate, and it ranked first in classification error rate on 4 of the 5 datasets. Using Cliff's delta to compare the nonparametric effect sizes of the five algorithms, the |δ| of the proposed method is generally greater than 0.473, indicating moderate or large differences from the other algorithms on most datasets. These comparison results verify that the proposed optimization method is effective for high-dimensional datasets.

Comparing the scaling crossover operator with several other operators in terms of feature subset size confirms that it has a clear advantage in the number of selected features. With a lower feature selection probability, the scaling crossover operator obtains a smaller feature subset, which plays a key role in improving the accuracy with which the proposed method identifies the potential causal relationships of features.
