Open Access

Research on image processing based on machine learning feature fusion and sparse representation

  
Mar 17, 2025


Introduction

In recent years, with the rapid development of Internet technology and the emergence of a large number of video websites and social platforms, image and video data traffic has surged. Under the wave of big data, the importance of image data has become increasingly prominent, users' demand for interacting with video and images has grown rapidly, and the complexity and diversity of image processing place extremely high demands on both techniques and results [1-3]. With the increasing maturity of artificial intelligence, especially breakthroughs in machine learning, the field of image processing is presented with new opportunities [4].

Machine learning, as a branch of artificial intelligence, enables computer systems to learn and optimize themselves without explicit, case-by-case programming. The key to machine learning lies in the combination of data and statistical techniques, which allows computers to predict and make decisions from experience in unknown environments [5-6]. Machine learning relies on large amounts of data to train models, which are essentially mathematical functions or algorithms, and achieves goals such as prediction, classification, and clustering by appropriately selecting models such as linear regression, decision trees, and neural networks. In machine learning, features describe data attributes, and their selection directly affects model performance [7-8]. Effective feature engineering can significantly improve model accuracy and efficiency and provides strong support for solving practical problems [9].

Graphic and image processing technology has flourished since the 1980s. With leaps in hardware technology and growing social demand, it has gradually split into two core areas, drawing and processing, and various types of graphics and images are finely optimized to meet the diverse needs of people's lives and work [10-11]. In practical applications, users can personalize graphics and images with advanced equipment and techniques, and high-performance hardware can improve processing efficiency; two-dimensional software such as Photoshop (PS) and Illustrator (AI) and three-dimensional software such as C4D and MAYA all provide strong image processing capabilities, improving image quality and promoting cross-sector data sharing and collaboration [12-13].

Sparse representation expresses a signal with a smaller amount of data under an appropriate transformation and plays a very important role in many aspects of signal processing [14]. It can be traced back to traditional methods such as representing signals with the Fourier transform. Image sparse representation is a model that represents an image as a linear combination of a few elementary images; the collection of these elementary images is called a dictionary, and the vectors in a dictionary are called atoms. Dictionaries can be categorized into complete and overcomplete dictionaries; an overcomplete dictionary is one in which the number of basis vectors is greater than the dimension of the signal, so that the signal can be represented by a linear combination of multiple atoms [15-16]. Since the number of basis vectors in an overcomplete dictionary is larger than the dimension of the original signal, it provides more options for representing the signal and better accommodates its various variations and complexities. Compared with dictionaries based on the wavelet or Fourier transform, overcomplete dictionaries are more adaptable, can represent signals more sparsely, and can capture more detailed features, thus better representing nonlinear structure and texture information and improving reconstruction performance. Sparse representation based on overcomplete dictionaries provides feature representations for signals and has been successfully applied in many signal processing fields, such as image denoising, image compression, image super-resolution reconstruction, and image fusion [17-18].

Machine learning, and deep learning built on it, have achieved promising results in the field of image processing and perform well in image feature extraction, segmentation, denoising, and classification. Literature [19] notes that the massive accumulation of digital histopathology images has led to a surge in demand for corresponding analysis, so computer-aided diagnosis empowered by machine learning algorithms is necessary; it surveys the potential applications of machine learning algorithms in digital pathology image analysis and proposes some solutions. Literature [20] compares current practice in image denoising based on deep learning algorithms, analyzes the underlying logic and operating mechanisms of different deep learning algorithms, and finally performs quantitative and qualitative analysis on public denoising datasets. Literature [21] takes SVM and CNN as research cases and conducts experiments comparing image classification strategies based on traditional machine learning algorithms and deep learning algorithms; the results show that traditional machine learning algorithms perform better on small-sample datasets, while deep learning algorithms achieve higher recognition accuracy on large-sample datasets. Literature [22] observes that the UQ method has been commonly used in image processing, medical image analysis, and other fields since it was proposed, systematically reviews cutting-edge research and practice on deep-learning-enabled UQ strategies, and finally looks ahead to potential application areas and future directions of the UQ method. Literature [23] uses convolutional neural networks to classify plant leaf disease images and obtains good detection and classification results in both training and practical tests, confirming the feasibility of plant disease image classification based on deep learning. Literature [24] shows that deep learning techniques exhibit excellent performance in image segmentation, critically evaluates deep-learning-based medical image segmentation techniques, summarizes common obstacles, and proposes corresponding solutions. Literature [25] summarizes the research on and contribution of machine learning algorithms to intelligence and informatization in fields such as image processing and analysis, cybersecurity, and healthcare, promoting the understanding and awareness of the value of machine learning algorithms among researchers and practitioners in various fields.

Research on image processing based on sparse representation focuses mainly on image fusion. In addition, breakthroughs have been made in recent years in image texture analysis, enhancement, and the classification of hyperspectral images using sparse representation. Literature [26] systematically reviewed studies on multisensor image fusion that use sparse representation as the underlying logic, covering sparse representation models, dictionary learning methods, and related components, and conducted experiments to evaluate the impact of three algorithmic components on the performance of multisensor image fusion in different applications. Literature [27] conceived an image fusion method that combines image texture decomposition techniques with a sparse representation strategy and tested it in corresponding simulation experiments, confirming that the proposed method outperforms state-of-the-art image fusion methods in both visual quality and objective evaluation. Literature [28] integrated analytical sparse representation (ASR) and synthesis sparse representation (SSR) and designed a joint convolutional analysis and synthesis (JCAS) sparse representation model to describe the texture patterns of different single-image layer separation tasks; multi-application evaluations showed that the JCAS method provides superior quantitative measurements and visual perception. Literature [29] analyzed two hyperspectral image restoration algorithms, fast hyperspectral denoising and fast hyperspectral inpainting, both of which take full advantage of the extremely compact and sparse HSI representation associated with its low-rank and self-similarity properties, effectively simplifying complex computation. Literature [30] describes a morphological component analysis model built on convolutional sparsity theory and uses it as a tool for pixel-level medical image fusion, realizing multi-component and globally sparse representations of the source images and demonstrating better visual perception and objective evaluation in experiments. Literature [31] proposed an image fusion technique based on coupled sparse tensor decomposition, transformed the image fusion problem into the estimation of a core tensor and three mode dictionaries, and introduced a regularizer to drive the sparse core tensor; demonstration experiments on remote sensing HSI corroborated the superiority of the proposed method over current HSI-MSI fusion methods.

In this paper, the ITFSAE image fusion model based on feature fusion and sparse representation is constructed on the basis of a sparse autoencoder, combined with the orthogonal matching pursuit (OMP) algorithm, a maximum-selection algorithm, and other methods. The model chunks the original images, composes the union matrix that is fed into the sparse autoencoder, and uses the OMP algorithm and the maximum-selection algorithm to iterate to convergence on the trained feature dictionary, so as to obtain and output the final fused image. At the same time, the ITFSAE model is compared with the sparse representation (SR), convolutional sparse representation (CSR), non-subsampled contourlet transform (NSCT), and non-subsampled shearlet transform (NSST) models, and the fusion performance of the model is verified by comparing five commonly used image fusion evaluation indexes: standard deviation, mutual information, entropy, average gradient, and spatial frequency.

Image processing techniques based on feature fusion and sparse representation
Image processing techniques

Image processing technology mainly refers to the use of computers to process images, encompassing both image processing techniques and image processing systems; it enables the analysis and understanding of digital images so as to improve image quality, extract useful information, or achieve specific visual effects. It mainly includes image transformation, image enhancement, image restoration, image segmentation, image description, image analysis (recognition), image compression, image fusion, and color analysis; this paper focuses on image fusion technology.

Image fusion based on machine learning and sparse representation
Image fusion and its classification

Image fusion combines different information about the same scene by means of a specific algorithm, so as to synthesize an image that describes the scene more accurately.

According to the fusion scheme adopted, image fusion is broadly categorized into three main types: pixel-level fusion, feature-level fusion, and decision-level fusion. These three types of fusion have their respective advantages and disadvantages in terms of flexibility, information loss, complexity, interference resistance, and other aspects of performance.

1) Pixel-level image fusion: Pixel-level fusion operates directly on the pixels of the images. By analyzing the pixels and their neighborhoods, the features of the source images are fused at the pixel level, the correlations between pixels are fully preserved, and the visual quality of the fused image is enhanced, making it easy for the computer to carry out further processing.

2) Feature-level fusion: Feature-level fusion extracts the feature information of targets or regions of interest in the source images, such as edges, buildings, and people, and then applies a suitably designed fusion strategy to combine the extracted features, yielding composite image features for the next stage of data processing.

3) Decision-level image fusion: Fusion at this level first identifies and classifies the feature information of each source image to form the corresponding decision information; then, according to the specific needs of the practical problem, the decision information is used to extract and identify features of the source images based on the existence probability of each target and certain criteria, ultimately forming the globally optimal decision.

Sparse autoencoder based image fusion

Sparse representation (SR) is an image modeling technique that exploits the sparse prior of natural image signals. One of the most critical issues in SR-based image fusion is the choice of SR model. Most current SR-based fusion methods use the standard sparse coding model based on individual image components and local blocks, which divides the source image into a set of overlapping image blocks in the original spatial domain and sparsely codes them to obtain the corresponding sparse representation coefficients. The fundamental idea is that a natural signal can be effectively approximated by a linear combination of a small number of atoms from an overcomplete dictionary.

Solving the sparse representation problem is equivalent to solving an optimization problem, and the basic algorithms commonly used to obtain the sparse coefficients are basis pursuit (BP), matching pursuit (MP), and orthogonal matching pursuit (OMP). Sparsity is quantified by different norms; the smaller the norm value, the sparser the representation.
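For reference, the underlying optimization that these algorithms approximately solve can be written in a standard textbook form (here $D$ is the overcomplete dictionary, $\alpha$ the sparse coefficient vector, and $\varepsilon$ an error tolerance; this is a generic formulation rather than a model specific to this paper):

$$\min_{\alpha}\ \|\alpha\|_{0} \quad \text{s.t.} \quad \|x - D\alpha\|_{2} \leq \varepsilon$$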

In this paper, based on machine learning feature fusion and sparse representation, a sparse autoencoder-based image training and fusion model (ITFSAE) is constructed. The sparse autoencoder is used to extract features and train on the images to obtain the feature dictionary, the orthogonal matching pursuit algorithm is applied to obtain the sparse coefficient matrices of the images to be fused, and the source images are reconstructed by combining the feature dictionary with the sparse coefficient matrices.

Sparse autoencoder-based image training and fusion models
Sparse autoencoder

An autoencoder (AE) is a neural-network-based feature representation model with a three-layer network structure consisting of two parts: an encoder and a decoder [32]. The encoder maps the input signal to the hidden representation, and the decoder maps this representation back to the input space to obtain the reconstructed input. The structure of the autoencoder is shown schematically in Fig. 1.

Figure 1.

Structure diagram of auto-encoder network

In Fig. 1: $[x_1, x_2, \cdots, x_n]$, $[h_1, h_2, \cdots, h_m]$ and $[y_1, y_2, \cdots, y_n]$ are the input, hidden and output layer neurons of the AE, respectively, and $n$ and $m$ are the numbers of neurons in the corresponding layers. $W \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m \times 1}$ are the weight matrix and bias vector of the encoding process, and $W' \in \mathbb{R}^{n \times m}$ and $b' \in \mathbb{R}^{n \times 1}$ are the weight matrix and bias vector of the decoding process.

For $p$ input samples $X = [x^{(1)}, x^{(2)}, \cdots, x^{(p)}]$, the corresponding encoder output $H = [h^{(1)}, h^{(2)}, \cdots, h^{(p)}]$ and decoded reconstruction $Y = [y^{(1)}, y^{(2)}, \cdots, y^{(p)}]$ are, respectively:

$$H = f_{\theta}(X) = s(WX + b) \quad (1)$$

$$Y = g_{\theta'}(H) = s(W'H + b') \quad (2)$$

In Eqs. (1)~(2): $\theta = \{W, b\}$ and $\theta' = \{W', b'\}$ are the model parameters, and $s$ is the Sigmoid activation function:

$$s(z) = \frac{1}{1 + e^{-z}}, \quad z \in (-\infty, +\infty) \quad (3)$$

In general, the reconstructed output $Y$ is not an exact reconstruction of the input sample $X$, but is only maximally close to $X$ under the conditional probability of a certain distribution, i.e., $Y \approx X$. Therefore, training the AE amounts to minimizing the reconstruction error function $J$, which is expressed as:

$$J = \frac{1}{p}\sum_{i=1}^{p}\left(\frac{1}{2}\left\| y^{(i)} - x^{(i)} \right\|^{2}\right) + \frac{\lambda}{2}\left\| W \right\|^{2} \quad (4)$$

In Eq. (4): $p$ is the number of input samples, and $\lambda$ is the L2 regularization coefficient, used to shrink the weights and prevent overfitting. The first term is the mean of the squared reconstruction errors, and the second term is the regularization (weight decay) term.
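As an illustration, the forward mapping of Eqs. (1)–(2) and the reconstruction error of Eq. (4) can be sketched in a few lines of NumPy; the layer sizes and random data below are hypothetical and are not the configuration used in the paper (W2 and b2 stand in for the decoding parameters W' and b').

import numpy as np

def sigmoid(z):
    # Sigmoid activation of Eq. (3)
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward_loss(X, W, b, W2, b2, lam=1e-4):
    # X: (n, p) matrix holding p input samples as columns
    H = sigmoid(W @ X + b)       # hidden representation, Eq. (1)
    Y = sigmoid(W2 @ H + b2)     # reconstruction, Eq. (2)
    recon = np.mean(0.5 * np.sum((Y - X) ** 2, axis=0))   # error term of Eq. (4)
    decay = 0.5 * lam * np.sum(W ** 2)                     # weight decay term of Eq. (4)
    return recon + decay, H, Y

# hypothetical sizes: 64-dimensional patches, 25 hidden units, 100 samples
n, m, p = 64, 25, 100
X = np.random.rand(n, p)
W, b = 0.1 * np.random.randn(m, n), np.zeros((m, 1))
W2, b2 = 0.1 * np.random.randn(n, m), np.zeros((n, 1))
J, H, Y = autoencoder_forward_loss(X, W, b, W2, b2)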

In order to obtain a more efficient feature representation of the input samples, the sparse autoencoder (SAE) adds a sparsity penalty term to the training of the AE network so that the hidden layer satisfies a certain degree of sparsity, which improves the efficiency of feature extraction [33]. The reconstruction error function $J_{Sparse}$ can then be expressed as:

$$J_{Sparse} = J + \beta\sum_{j=1}^{m} KL(\rho \| \hat{\rho}_{j}) \quad (5)$$

$$KL(\rho \| \hat{\rho}_{j}) = \rho \lg\frac{\rho}{\hat{\rho}_{j}} + (1-\rho)\lg\frac{1-\rho}{1-\hat{\rho}_{j}} \quad (6)$$

$$\hat{\rho}_{j} = \frac{1}{p}\sum_{i=1}^{p} h_{j}\left(x^{(i)}\right) \quad (7)$$

In Eqs. (5) to (7): $\beta$ is the weight of the sparsity constraint term, $\rho$ is the sparsity parameter, $\hat{\rho}_{j}$ is the average activation of the $j$th hidden neuron, and $h_{j}(x^{(i)})$ is the output of the $j$th hidden-layer neuron for the $i$th input sample.

The second term in Eq. (5) is the sparsity penalty term, the KL divergence, which constrains the average hidden-layer activation $\hat{\rho}_{j}$. The SAE thus achieves the effect of sparse coding by driving the average activation of the hidden layer toward the preset sparsity parameter through the sparsity penalty term.

After the overall reconstruction error of the network is obtained by forward propagation, this reconstruction error is back-propagated from the output layer, and gradient descent is used to update the weight matrix $W$ and bias vector $b$ [34]:

$$W = W - \alpha\frac{\partial J_{Sparse}(W,b)}{\partial W} \quad (8)$$

$$b = b - \alpha\frac{\partial J_{Sparse}(W,b)}{\partial b} \quad (9)$$

In Eqs. (8)~(9), α is the learning rate.
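A minimal sketch of the sparsity penalty of Eqs. (5)–(7) and the gradient-descent update of Eqs. (8)–(9) follows; it assumes the partial derivatives have already been obtained by backpropagation (they are passed in as grad_W and grad_b), and the natural logarithm is used in place of lg, which only rescales the penalty.

import numpy as np

def kl_sparsity_penalty(H, rho=0.5, beta=3.0):
    # H: hidden activations, shape (m, p); rho: target sparsity; beta: penalty weight
    rho_hat = np.mean(H, axis=1)                                    # Eq. (7)
    kl = rho * np.log(rho / rho_hat) \
         + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))      # Eq. (6)
    return beta * np.sum(kl)                                        # second term of Eq. (5)

def gradient_descent_step(W, b, grad_W, grad_b, alpha=0.1):
    # One update of Eqs. (8)-(9); grad_W and grad_b come from backpropagation
    return W - alpha * grad_W, b - alpha * grad_b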

In order to accelerate convergence of the model and prevent vanishing or exploding gradients, the learning rate decreases as the number of iterations increases and is updated as follows:

$$\alpha_{n} = \frac{\alpha_{1}}{1 + \gamma n}, \quad n = 2,3,\ldots,N \quad (10)$$

In Eq. (10): $\gamma$ is a scalar set in advance, $N$ is the total number of iterations, $\alpha_{1}$ is the initial learning rate, and $\alpha_{n}$ is the learning rate at the $n$th iteration.
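For completeness, the decay schedule of Eq. (10) is a one-liner; the initial rate and γ below are illustrative values only.

def learning_rate(n, alpha1=0.1, gamma=1e-3):
    # Learning rate at iteration n, decayed as in Eq. (10)
    return alpha1 / (1.0 + gamma * n)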

Dictionary and feature training based on the OMP algorithm
Orthogonal matching pursuit algorithm

The MP algorithm is one of the earliest greedy iterative algorithms, but since the result of each iteration may be suboptimal rather than optimal, several iterations are needed to reach the optimal convergence result. The orthogonal matching pursuit (OMP) algorithm, an improvement on the MP algorithm, effectively solves this problem.

The OMP algorithm is based on the idea of a greedy algorithm, gradually approximating the original signal by selecting a locally optimal solution at each iteration. It follows the atom selection criterion of the MP algorithm: one atom is added to the support set at each iteration during reconstruction, and the optimality of each iteration is guaranteed by recursively orthogonalizing the set of selected atoms, which accelerates convergence and reduces the number of iterations [35].

The basic idea of the OMP algorithm is to determine the columns of the sensing matrix in a greedy, iterative way, ensuring that the column selected at each iteration is as correlated as possible with the current residual vector and that the corresponding redundancy is removed from the sampled vector. At each iteration, the inner products of the current residual with the columns of the observation matrix are computed, the atom with the largest correlation is selected and added to the index set, the residual is updated, and the iteration count is checked. This process is repeated until the number of iterations equals the sparsity level, at which point the iteration stops.

The basic steps of the OMP algorithm are as follows:

Input: sensing matrix $\Phi \in \mathbb{R}^{m \times n}$, sampling vector $y \in \mathbb{R}^{m}$, sparsity $s$.

Output: the $s$-sparse approximation $\hat{x}$ of $x$.

Initialization: residual $r_{0} = y$, index set $\Lambda_{0} = \emptyset$, iteration count $t = 1$.

Step 1: Find the index $\lambda_{t}$ corresponding to the maximum absolute inner product between the residual $r_{t-1}$ and the columns $\varphi_{j}$ of the sensing matrix, i.e., $\lambda_{t} = \arg\max_{j=1,\ldots,N} \left|\langle r_{t-1}, \varphi_{j}\rangle\right|$.

Step 2: Update the support set $\Lambda_{t} = \Lambda_{t-1}\cup \{\lambda_{t}\}$ and record the set of selected atoms of the sensing matrix $\Phi_{t} = [\Phi_{t-1}, \varphi_{\lambda_{t}}]$.

Step 3: Obtain $\hat{x}_{t} = \arg\min_{\hat{x}} \left\| y - \Phi_{t}\hat{x} \right\|_{2}$ by the least squares method.

Step 4: Update the residual $r_{t} = y - \Phi_{t}\hat{x}_{t}$.

Step 5: If $t < s$, set $t = t + 1$ and return to Step 1.

Step 6: If $t = s$, stop the iteration and output $\hat{x} = \hat{x}_{t}$.

The OMP algorithm is less accurate than the BP algorithm, but it requires fewer iterations and has lower computational complexity, making it a widely used reconstruction algorithm; it does, however, need the sparsity level to be known in advance.
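A compact NumPy sketch of the steps above is given below; the sensing matrix, measurement vector, and sparsity level in the usage example are hypothetical, and this is not the paper's exact implementation.

import numpy as np

def omp(Phi, y, s):
    # Greedily select s atoms of Phi to approximate y (Steps 1-6 above)
    m, n = Phi.shape
    r = y.copy()            # residual r_0 = y
    support = []            # index set Lambda
    coeffs = np.zeros(0)
    for _ in range(s):
        lam = int(np.argmax(np.abs(Phi.T @ r)))        # Step 1: most correlated atom
        if lam not in support:
            support.append(lam)                        # Step 2: update the support set
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)   # Step 3
        r = y - Phi[:, support] @ coeffs               # Step 4: update the residual
    x_hat = np.zeros(n)
    x_hat[support] = coeffs                            # Step 6: s-sparse approximation
    return x_hat

# hypothetical usage: recover a 3-sparse vector from 32 random measurements
Phi = np.random.randn(32, 64)
x_true = np.zeros(64)
x_true[[3, 17, 40]] = [1.0, -2.0, 0.5]
x_rec = omp(Phi, Phi @ x_true, s=3)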

Dictionary and feature training

Image fusion is the process of combining two or more source images into a single image using various methods, and the result is required to retain 95% of the basic information of the source images. In this chapter, several groups of images are selected; each group consists of two images of the same scene shot under different conditions, and layer-by-layer deep training is used to obtain the corresponding high-quality image features, as shown in Figure 2.

Figure 2.

The dictionary and image feature training based on the sparse auto-encoder

Assume the two images are m and n, both of size 128×128. The two images are chunked with a sliding window (block size 8×8, sliding step 1), the blocks obtained by sliding are vectorized and assembled as column vectors to form the new matrices aa1 and bb1, and aa1 and bb1 are combined by columns and rows in sequence to form a new union matrix cc1 of size 64×14641. The test images are selected from the training dataset. What is obtained after training is the edge feature information of the two images after simple merging; that is, after feature extraction by the sparse autoencoder, the hidden-layer weights for which the output is closest to the original input embody the salient features of the images to be fused and are presented in the form of a matrix. A significant difference between the sparse autoencoder and sparse representation lies in the source of the dictionary used for reconstruction: the sparse autoencoder obtains it from its own training, whereas sparse representation requires it to be constructed separately. In this method, the weight matrix of the hidden layer, which contains the features of the images to be fused, is the automatically generated dictionary matrix w1. After obtaining the dictionary w1, the orthogonal matching pursuit algorithm is used to obtain the sparse coefficient matrices x1 and x2 of the images to be fused.
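A minimal sketch of the chunking step just described; the column-stacking convention used for the union matrix is an assumption based on the description above, and the random images stand in for the source images m and n.

import numpy as np

def image_to_patches(img, patch=8, step=1):
    # Slide a patch x patch window over the image and stack each block as a column vector
    h, w = img.shape
    cols = [img[i:i + patch, j:j + patch].reshape(-1)
            for i in range(0, h - patch + 1, step)
            for j in range(0, w - patch + 1, step)]
    return np.stack(cols, axis=1)   # shape: (patch*patch, number of blocks)

# two 128x128 source images each give a 64 x 14641 patch matrix
m_img = np.random.rand(128, 128)
n_img = np.random.rand(128, 128)
aa1 = image_to_patches(m_img)                 # 64 x 14641
bb1 = image_to_patches(n_img)                 # 64 x 14641
cc1 = np.concatenate([aa1, bb1], axis=1)      # union matrix fed to the sparse autoencoder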

Specifically, the weights and biases are first randomly initialized according to the layer sizes, with parameters $\omega_1$, $\omega_2$, $b_1$, and $b_2$, and converted into vector form, while the error term, weight penalty term, and sparsity penalty term are initialized to 0. Then the linear combination value and activation value of each network node are calculated by forward propagation, as shown in Eqs. (11) to (12):

$$z^{(i)} = \omega^{(i)} x^{(i)} + b^{(i)} \quad (11)$$

$$a^{(i)} = \mathrm{sigmoid}\left(z^{(i)}\right), \quad i = 1,2 \quad (12)$$

where the sigmoid function is given by Eq. (13):

$$f(z) = \frac{1}{1 + \exp(-z)} \quad (13)$$

Next, the weight parameters and bias term parameters are updated with a back propagation algorithm, and the gradient descent method is used to minimize the error [36].

Image Reconstruction

From the previous section we obtain the feature dictionary W1, which represents the input signal well after deep training, as well as the sparse coefficient matrices x1 and x2 of the individual images to be fused; the method used in this chapter makes all image blocks correspond to the same dictionary. The fusion rule is based on a maximum-selection algorithm [37]. Specifically, x1 and x2 are the sparse coefficient matrices corresponding to the original images, and maximum selection is used to obtain the joint sparse coefficient matrix A. The reconstructed image matrix $\hat{x}$ is then obtained from Equation (14):

$$\hat{x} = W_{1} A \quad (14)$$

Each column vector of $\hat{x}$ is reshaped into a square block, the block is placed at its corresponding position in the image, and the pixel value at each position in the fused image is the average of the pixel values of the overlapping blocks covering that position.
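A hedged sketch of the fusion and reconstruction step follows; the per-block l1-activity form of the maximum-selection rule and the overlap averaging are one plausible reading of the description above, not necessarily the paper's exact rule.

import numpy as np

def fuse_and_reconstruct(W1, x1, x2, img_shape=(128, 128), patch=8, step=1):
    # Maximum selection between the two coefficient matrices, column by column
    choose_first = np.abs(x1).sum(axis=0) >= np.abs(x2).sum(axis=0)
    A = np.where(choose_first, x1, x2)      # joint sparse coefficient matrix
    X_hat = W1 @ A                          # reconstructed block matrix, Eq. (14)

    # Place each column back as a patch x patch block and average overlapping pixels
    fused = np.zeros(img_shape)
    counts = np.zeros(img_shape)
    k = 0
    for i in range(0, img_shape[0] - patch + 1, step):
        for j in range(0, img_shape[1] - patch + 1, step):
            fused[i:i + patch, j:j + patch] += X_hat[:, k].reshape(patch, patch)
            counts[i:i + patch, j:j + patch] += 1
            k += 1
    return fused / counts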

Overall flow of the algorithm

The overall flow of the algorithm in this paper is as follows:

Input: original images m, n.

Output: fused image x^ .

1) Perform sliding chunking on the original images, and compile each image block into a column vector.

2) Combine the new matrices into a union matrix.

3) The new union matrix is fed as an input signal into the sparse autoencoder, which is trained to obtain the dictionary W1.

4) The corresponding sparse coefficient matrices x1, x2 of the original images are obtained from the dictionary W1 by the orthogonal matching pursuit algorithm, and the joint sparse coefficient matrix A is obtained by the maximum-selection algorithm.

5) The final fused image is obtained from Eq. (14), $\hat{x} = W_{1}A$.
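Putting these steps together, a compact driver sketch that reuses the hypothetical helpers sketched in the preceding sections is shown below; the SAE training call and the sparsity level are placeholders, since the actual network in this paper was trained with the Caffe/MATLAB setup described later.

# 1)-2) chunk both source images and build the union matrix
aa1, bb1 = image_to_patches(m_img), image_to_patches(n_img)
cc1 = np.concatenate([aa1, bb1], axis=1)

# 3) train the sparse autoencoder on cc1; its hidden-layer weights act as the dictionary W1
W1 = train_sparse_autoencoder(cc1)   # placeholder for the SAE training described above

# 4) sparse-code each source image against W1 with OMP (sparsity s=5 is illustrative),
#    then fuse the coefficients by maximum selection inside fuse_and_reconstruct
x1 = np.column_stack([omp(W1, aa1[:, k], s=5) for k in range(aa1.shape[1])])
x2 = np.column_stack([omp(W1, bb1[:, k], s=5) for k in range(bb1.shape[1])])

# 5) reconstruct the fused image via Eq. (14) and overlap averaging
fused = fuse_and_reconstruct(W1, x1, x2)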

Experimentation and analysis of model applications
Experimental setup
Experimental environment

In this paper, the software environment for ITFSAE model training is a port of the Caffe deep learning framework to Windows 11, with code written in MATLAB 2017b. The hardware configuration is: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, GeForce RTX 2060 graphics card, ST1000DM010-2EP102 hard disk, and 16 GB of RAM.

Selection of evaluation indicators and comparison algorithms

In order to verify the general validity of the ITFSAE model proposed in this paper, medical images, infrared-visible images, and multi-focus images are selected as test images, and the designed ITFSAE model is compared in experiments with four commonly used image fusion methods: sparse representation (SR), convolutional sparse representation (CSR), the non-subsampled contourlet transform (NSCT), and the non-subsampled shearlet transform (NSST).

In order to objectively evaluate the performance of each method, five commonly used evaluation indexes, i.e., standard deviation (SD), mutual information (MI), entropy (EN), average gradient (AG), and spatial frequency (SF), are used to assess the fusion performance of the fused images. The judgment criteria and calculation methods of the five evaluation indexes are as follows:

1) Standard deviation (SD) is the arithmetic square root of the variance and reflects the dispersion of image pixel values; it is also known as the mean square deviation. The standard deviation indicates the contrast of the image: a larger value means the pixel values of the evaluated image are more dispersed, and a smaller value means they are more concentrated, so the standard deviation is critical when evaluating fused images. Its formula is:

$$SD = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(x_{i,j} - \bar{f}\right)^{2}}$$

where $\bar{f}$ is the mean value of image F.

2) Mutual information (MI) indicates the total amount of information the fused image F acquires from the input images A and B. A larger mutual information value means the fused image acquires more information from the input images and the fusion effect is better; conversely, the fusion effect is poor. Its formula is:

$$MI = \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} P_{FA}(i,j)\log_{2}\frac{P_{FA}(i,j)}{P_{F}(i)P_{A}(j)} + \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} P_{FB}(i,j)\log_{2}\frac{P_{FB}(i,j)}{P_{F}(i)P_{B}(j)}$$

where $P_{F}(i)$, $P_{A}(j)$ and $P_{B}(j)$ are the marginal distributions of images F, A and B, and $P_{FA}(i,j)$ and $P_{FB}(i,j)$ are their joint distributions.

3) Entropy (EN) is the average amount of information contained in each message received by the receiver, also known as information entropy or average self-information. The larger the EN value of the fusion result, the more information the fused image contains. For color RGB images, the entropy is usually calculated after converting the color image to grayscale. Its formula is:

$$EN = -\sum_{x} P(x)\log_{2}P(x)$$

where $P(x)$ is the probability of pixel value $x$ appearing in image F.

4) Average gradient (AG) reflects the texture information of the image being evaluated and is used to assess image blur. A larger AG value means the rate of change of pixel values is larger and the image is sharper; conversely, the image is blurrier. The average gradient is calculated as:

$$AG = \frac{1}{(M-1)(N-1)}\sum_{i=2}^{M}\sum_{j=2}^{N}\sqrt{\frac{1}{2}\left[\left(x_{i,j}-x_{i-1,j}\right)^{2}+\left(x_{i,j}-x_{i,j-1}\right)^{2}\right]}

5) Spatial frequency (SF) is computed from the gradients of the image being evaluated; it reflects the rate of change of the gray values and thereby measures how much detail and texture the image contains. In general, a higher SF value indicates that the evaluated image is richer in texture and edge information, and vice versa. To calculate the spatial frequency SF, the row frequency RF, column frequency CF, main-diagonal frequency MDF, and secondary-diagonal frequency SDF are first computed, with the distance weight $\omega_{d}$ set to $1/\sqrt{2}$:

$$RF = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=2}^{N}\left[F(i,j)-F(i,j-1)\right]^{2}}$$

$$CF = \sqrt{\frac{1}{MN}\sum_{j=1}^{N}\sum_{i=2}^{M}\left[F(i,j)-F(i-1,j)\right]^{2}}$$

$$MDF = \sqrt{\frac{\omega_{d}}{MN}\sum_{i=2}^{M}\sum_{j=2}^{N}\left[F(i,j)-F(i-1,j-1)\right]^{2}}$$

$$SDF = \sqrt{\frac{\omega_{d}}{MN}\sum_{j=1}^{N-1}\sum_{i=2}^{M}\left[F(i,j)-F(i-1,j+1)\right]^{2}}$$

The spatial frequency SF of the image is then calculated from the row frequency RF, column frequency CF, main-diagonal frequency MDF, and secondary-diagonal frequency SDF:

$$SF = \sqrt{(RF)^{2}+(CF)^{2}+(MDF)^{2}+(SDF)^{2}}$$
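For concreteness, the five indexes can be sketched as follows for 8-bit grayscale images; the histogram ranges and the value of the distance weight follow the reconstruction above and are assumptions where the source is ambiguous.

import numpy as np

def sd(F):
    # Standard deviation: dispersion of pixel values around the image mean
    return np.sqrt(np.mean((F - F.mean()) ** 2))

def entropy(F, bins=256):
    # Information entropy of the grayscale histogram (assumes values in [0, 255])
    p, _ = np.histogram(F, bins=bins, range=(0, 256))
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def avg_gradient(F):
    # Average gradient from horizontal and vertical pixel differences
    dx = F[1:, 1:] - F[:-1, 1:]
    dy = F[1:, 1:] - F[1:, :-1]
    return np.mean(np.sqrt(0.5 * (dx ** 2 + dy ** 2)))

def spatial_frequency(F, wd=1.0 / np.sqrt(2)):
    # Row, column and diagonal frequencies combined into the spatial frequency
    rf = np.sqrt(np.mean((F[:, 1:] - F[:, :-1]) ** 2))
    cf = np.sqrt(np.mean((F[1:, :] - F[:-1, :]) ** 2))
    mdf = np.sqrt(wd * np.mean((F[1:, 1:] - F[:-1, :-1]) ** 2))
    sdf = np.sqrt(wd * np.mean((F[1:, :-1] - F[:-1, 1:]) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2 + mdf ** 2 + sdf ** 2)

def mutual_information(F, A, bins=256):
    # Mutual information between the fused image F and one source image A
    joint, _, _ = np.histogram2d(F.ravel(), A.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

# the MI index of the paper sums the information from both source images:
# MI = mutual_information(F, A) + mutual_information(F, B)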

Model training

In this paper, the Windows port of the Caffe deep learning framework is used to train on the dataset, the sparsity parameter is set to ρ = 0.5, and the training log is exported to plot the Accuracy and Loss curves so as to visualize network training. The Loss curve is shown in Fig. 3.

Figure 3.

Loss curve

As can be seen from Fig. 3, in the training-set loss curve the loss value falls to its minimum after about 2500 iterations and then fluctuates within a narrow range in subsequent iterations, indicating that the training loss has been minimized.

The Accuracy curve for ITFSAE model training is shown in Figure 4. As can be seen from Fig. 4, the accuracy reaches 0.92 at around 2500 iterations and afterwards also fluctuates within a narrow range, indicating that the test-set accuracy has reached its highest level.

Figure 4.

Accuracy curve

In summary, the model reaches the set sparsity after about 2500 iterations and is able to output a fused image that meets the requirements.

Model comparison experiments

In order to verify the fusion effect of the fusion method in this paper, two groups of pre-calibrated multi-focus and infrared images, the "airplane" images and the "grass" images, are used for the fusion experiments. The objective fusion evaluation results of this paper's ITFSAE model and the SR, CSR, NSCT and NSST image fusion models on the "airplane" fused image are shown in Figure 5.

Figure 5.

Evaluation results of “airplane” images

As can be seen in Figure 5, for the "airplane" fused image, the evaluation indexes SD, MI, EN, AG and SF obtained by the ITFSAE model constructed in this paper are 41.96, 7.46, 9.22, 6.14 and 16.73, respectively, all larger than those of the other comparison models, verifying the validity of the model.

The objective fusion evaluation results of each image fusion model on the "grass" fused image are shown in Figure 6.

Figure 6.

Evaluation results of “bush” images

As can be seen in Fig. 6, for the "grass" fused image, the values of the five evaluation indexes of the ITFSAE model constructed in this paper, namely SD, MI, EN, AG and SF, are also larger than those of the SR, CSR, NSCT and NSST image fusion models, with corresponding values of 50.37, 9.28, 9.04, 6.21 and 15.89, respectively. It can be seen that the final fusion result of the model in this paper performs well on objective performance indicators, which demonstrates that the method achieves a good fusion effect.

Conclusion

In this paper, we combine machine learning feature fusion and sparse representation to construct a sparse autoencoder-based image fusion model, ITFSAE, and evaluate its image fusion effect. The specific experimental conclusions are as follows:

1) First, the ITFSAE model is trained with the sparsity parameter set to ρ = 0.5. The training results show that, in the training-set loss curve, the loss value falls to its minimum after about 2500 iterations and fluctuates within a narrow range in subsequent iterations, indicating that the training loss has been minimized. The accuracy is 0.92 at this point and then fluctuates within a narrow range, indicating that the test-set accuracy is at its highest. It follows that the ITFSAE model reaches the set sparsity after about 2500 iterations and is able to output fused images that meet the requirements.

2) Second, the ITFSAE model is compared with four commonly used image fusion models, namely sparse representation (SR), convolutional sparse representation (CSR), the non-subsampled contourlet transform (NSCT), and the non-subsampled shearlet transform (NSST), with standard deviation (SD), mutual information (MI), entropy (EN), average gradient (AG), and spatial frequency (SF) chosen as the evaluation metrics. In the two groups of image fusion experiments, "airplane" and "grass", the SD, MI, EN, AG and SF values measured for the ITFSAE model are larger than those of the other comparison models, indicating that the model in this paper achieves a better fusion effect and verifying its validity.

Funding:

The research is supported by the Research Foundation of the Natural Science Foundation of Hunan Province, (Grant No. 2024JJ7189); the Social Science Project of Hunan Provincial Achievement Review Association, (Grant No. XSP24YBC319); Hunan Province General Higher Education Teaching Reform Research Project (Grant No. HNJG-20231101); Hunan Province General Higher Education Teaching Reform Research Project (Grant No. HNJG-20231094).
