Open Access

Research on the optimization method of image classification model based on deep learning technology and its improvement of data processing efficiency

  
19 March 2025


Introduction

With the rapid development of computer vision technology, image big data has become an essential path for scientific and technological development. Large volumes of image data can be used to effectively construct image classification models, target detection and tracking models, semantic segmentation models, and so on, enhancing the credibility of applications while improving model accuracy [1-2]. However, existing algorithms suffer from low accuracy and slow efficiency in big data image classification, and how to effectively improve classification accuracy and efficiency has become a hot spot of current research [3-4].

In the current field of industrial automation and intelligent manufacturing, with the wide application of industrial images, concern for data privacy and security is increasing [5]. Industrial image data often contains sensitive content about production processes, equipment configuration, and employees' personal information, which may face the risk of leakage and misuse during outsourced processing and cloud storage. Deep learning has become a key technology for recognizing and classifying defects in industrial images [6-8]. Thanks to its outstanding performance in image recognition and classification, deep learning has become the method of choice for analyzing industrial images, especially in critical areas such as defect detection, where it shows irreplaceable importance [9].

Deep learning is widely used in cloud computing, but privacy protection in cloud computing has long been a technical challenge. Accurately accomplishing the image classification task while ensuring privacy has been a persistent difficulty in privacy protection; in this context, homomorphic encryption techniques have emerged to provide a new solution for image classification in cloud computing environments [10-12]. Homomorphic encryption allows complex computational tasks, such as defect detection and classification of images, to be performed directly on encrypted data without decrypting it, which fundamentally protects the privacy and security of the data. Homomorphic encryption is not only suited to the efficient utilization of cloud computing resources, but also provides a strong guarantee for data privacy protection [13-15]. However, the practical application of homomorphic encryption faces many challenges; in particular, its high computational complexity and low processing efficiency are the main obstacles limiting its wide deployment in industrial image classification. Therefore, new approaches are needed to balance the computational complexity and accuracy of homomorphic encryption [16-18].

For the application of homomorphic encryption in industrial image processing, there is an urgent need to optimize its computational efficiency and to explore new algorithms and technical frameworks that overcome the existing limitations. Through technological innovation, both the security protection of data privacy and the demand for high efficiency in industrial image processing can be satisfied, achieving a double optimization of data security and processing efficiency and promoting the further development of industrial automation and intelligent manufacturing [19-20]. At the same time, advances in homomorphic encryption for industrial image privacy protection help to enhance public trust in industrial data processing, improve social acceptance of data utilization, and lay a solid foundation for the sustainable development of big data and artificial intelligence technology [21-22].

In this paper, traditional image classification methods are first compared with a deep neural network classification model to study the advantages of deep learning technology in image classification. Orthogonal optimization is then introduced into the particle swarm algorithm, providing a new way to improve population-based algorithms, and the orthogonally optimized particle swarm algorithm is used in conjunction with VGG to overcome the hyperparameter optimization problem in VGG. On the other hand, to address the long-tailed distribution of image data, this paper proposes a balanced complementary (BACL) loss, derived by revisiting the traditional Softmax cross-entropy loss, to mitigate the inhibitory gradients imposed by complementary classes on the tail classes in a long-tailed distribution. Finally, the feasibility of the proposed method is investigated through performance experiments, and the classification effects of different network models are compared.

Deep learning-based optimization model for image classification
Deep learning based image classification model

Image classification is a significant component of computer vision and has been applied in many domains. Early traditional image classification mostly relied on manual operations and could not adapt to the present era of large-scale data, so deep learning image classification techniques came into being. Traditional image classification differs greatly from deep learning image classification, and there is also a significant difference in performance.

The core of traditional image classification relies on manual feature extraction and machine learning algorithms, and it is mainly divided into four basic steps: feature extraction, feature coding, spatial constraints, and image classification. Although traditional image classification methods have achieved some results, they still have obvious disadvantages compared with deep neural networks: feature extraction relies on manual work, model generalization ability is limited, and high-dimensional data demands large amounts of computational resources and execution time and may run into the "curse of dimensionality".

The achievements of traditional image classification methods in early image tasks are certainly undeniable. However, their performance is often unsatisfactory when confronted with more complex and varied real-world image scenes. As a result, researchers have turned to a more powerful and flexible technique: deep learning image classification. In this research field, deep neural network models have achieved remarkable success in image classification tasks.

When dealing with the image classification task, deep neural networks process pixel data through a multilayer structure and perform computations step by step, extracting features successively from low level to high level. A deep neural network is a highly complex function that nonlinearly maps high-dimensional data points to low-dimensional spaces, making it a crucial tool for processing large-scale high-dimensional data. Convolutional Neural Networks (CNNs) are one way to build such a hierarchical deep learning model, implemented through a stack of convolutional, pooling, and fully connected layers. The input layer holds the pixel data of the sample image, and the convolutional layer performs a convolution operation on the input features using learnable weights, followed by a nonlinear transformation with bias values to extract features. The pooling layer downsamples the input features, gradually reducing their dimensionality to reduce the number of parameters and the computational complexity of the model. Finally, the fully connected layer maps the learned representations into the label space of the samples to classify them. CNNs have demonstrated superior performance in image processing tasks and have become one of the mainstays of today's image processing models.

The basic structure of the convolutional neural network is shown in Fig. 1. The convolutional neural network is a feed-forward deep learning model inspired by the neural connections of the human visual center; it was first realized in the LeNet network proposed by LeCun et al. in 1998. Convolutional neural networks are layer-by-layer hierarchically connected network structures whose basic structure generally consists of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer.

Figure 1.

The basic structure of the convolutional neural network
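To make the layer roles above concrete, the following is a minimal PyTorch sketch of the input-convolution-activation-pooling-fully-connected-output pipeline just described; the channel counts, the 32×32 input size, and the class count are illustrative assumptions, not the networks evaluated later in this paper.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN with the five-part structure of Fig. 1 (illustrative sizes)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.ReLU(),                                    # nonlinear activation
            nn.MaxPool2d(2),                              # pooling (downsampling)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                              # FC-stage regularization
            nn.Linear(32 * 8 * 8, num_classes),           # output: one neuron per class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# For a 32x32 RGB input, two 2x2 poolings reduce 32 -> 16 -> 8, hence 32*8*8 features.
logits = SimpleCNN()(torch.randn(1, 3, 32, 32))
```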

The input layer is where data enters the network and is the first place the CNN "feeds" on data. Because the content, format, and value ranges of different types of data vary, the input data usually needs to be preprocessed so that each dimension of the input is controlled within a certain range, which helps the network extract features and accelerates training.

The convolutional layer is the basic operational level and the core part of the CNN, consisting of multiple convolutional kernels (filters) with trainable parameters. The convolution kernel computes a weighted sum with the input data over the spatial dimensions, sliding channel by channel, row by row, and column by column; the output matrix obtained from this calculation is the feature map.

After obtaining the output feature map, a nonlinear mapping using an activation function is often required to improve the nonlinear fitting ability of the network. Here, the most widely used ReLU activation function is generally selected. The mathematical expression of the ReLU function is: $\mathrm{ReLU}(x)=\begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases}$

The pooling operation involved in the pooling layer is a downsampling process that reduces the feature map size exponentially while maintaining feature invariance, thus reducing the network parameters and computation, preventing overfitting of the network, and improving the generalization ability of the neural network.

The fully connected layer (FC) is a two-dimensional planar structure formed by the arrangement of many neurons and generally makes up the last layers of the CNN. It is characterized by each neuron being connected to all neurons in the preceding layer, forming an extremely dense mesh structure. In addition, to further enhance the generalization ability of the neural network, regularization methods such as Dropout are often used in the fully connected layers.

The output layer is the last layer of the CNN, and different designs are often used for it due to different tasks. In image classification tasks, the number of neurons in this layer is equal to the number of classification categories.

The commonly used deep convolutional neural network models are LeNet, AlexNet, VGGNet, GoogleNet, ResNet, and DenseNet.

LeNet is a feed-forward neural network that consists of 7 layers in total, starting with 5 alternating convolution and pooling layers, followed closely by 2 fully connected layers. Traditional multilayer fully connected neural networks take each pixel as an independent input, which ignores the correlation between pixels and increases the computational effort. LeNet uses convolutions with learnable parameters to extract similar features at multiple locations with fewer parameters, which not only reduces the number of parameters but also automatically learns features from the raw pixels.

The AlexNet model improves learning performance by increasing the depth of the convolutional neural network and using parameter optimization strategies. It is an 8-layer network comprising 5 convolutional layers and 3 fully connected layers, with max pooling performed after the 1st, 2nd, and 5th convolutional layers to reduce the amount of data. After the five rounds of convolution and pooling, the fully connected stage applies Dropout, and the more efficient ReLU function is used to alleviate the gradient vanishing problem to a certain extent, thus improving convergence speed.

VGGNet has achieved good results in both image classification and localization problems, and is famous for its simplicity, homogeneous topology, and increased depth. The use of 138 million parameters is a major limitation, making it computationally expensive and challenging to deploy on systems with limited resources.

The GoogleNet framework is designed to achieve high accuracy while minimizing computational cost. It introduces inception block in CNN, which utilizes the idea of splitting, transforming, and merging to incorporate multi-scale convolutional transforms, replacing the traditional convolutional layer with a small block that encapsulates filters at different scales to capture spatial information at different scales. In addition, the connection density is reduced by using global average pooling in the last layer instead of using a fully connected layer. However, the main drawbacks of GoogleNet are its heterogeneous topology, which requires customization from one block to another, and the representation bottleneck, which greatly reduces the feature space of the next layer and can lead to the loss of useful information.

ResNet includes ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152, and became one of the most pioneering works in computer vision and deep learning. To counter the degradation phenomenon, the residual structure transforms the original mapping H(x) into H(x) = F(x) + x, where F(x) is the residual obtained by a series of nonlinear transformations and x is the original input. This approach mitigates gradient vanishing and improves the model's fitting ability.
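As a sketch of the residual mapping H(x) = F(x) + x just described, the block below assumes equal input and output channel counts so the identity skip connection needs no projection; the channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """H(x) = F(x) + x, where F is a small stack of nonlinear transformations."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.f(x) + x)   # gradient can flow through the identity path

y = ResidualBlock(16)(torch.randn(1, 16, 8, 8))
```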

DenseNet includes DenseNet121, DenseNet169, DenseNet201, and DenseNet264. Short connections (residual or skip connections) between input and output layers allow deep convolutional networks to be deeper, more accurate, and more efficient, so the network connects each layer to every other layer in a feed-forward fashion. A traditional L-layer convolutional neural network has L connections, whereas an L-layer DenseNet has L(L+1)/2 connections. For each layer in the network, the feature maps of all preceding layers are used as its inputs, and its own feature maps are used as inputs to all subsequent layers. DenseNet mitigates gradient vanishing through dense connectivity and enhances feature propagation by exploiting feature reuse.
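A minimal sketch of this dense connectivity is shown below: each layer takes the concatenation of all earlier feature maps as input, which is what produces the L(L+1)/2 connections. The growth rate and layer count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenated feature maps of all preceding layers."""
    def __init__(self, in_channels: int, growth: int = 12, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels + i * growth, growth, 3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for conv in self.layers:
            # input = concatenation of the input and every earlier layer's output
            features.append(torch.relu(conv(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)  # feature reuse: all maps are passed on

out = DenseBlock(16)(torch.randn(1, 16, 8, 8))  # 16 + 4*12 = 64 output channels
```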

Subsequently, other improved network models such as ResNeXt, as well as convolutional neural network models based on the attention mechanism (e.g., the Residual Attention Network (RAN) and SENet), have emerged.

Currently, the attention mechanism has become an effective means of improving performance in deep learning-based computer vision. When facing a complex scene, humans can quickly pick out the regions of interest; adding an attention mechanism to a model simulates the human visual system so that the model can quickly focus on salient features. The attention mechanism can thus be abstracted as: $\mathrm{Attention} = f(g(x), x)$, where $x$ represents the input feature, $g(x)$ represents the process of generating attention weights from the input feature, and $f(\cdot)$ is a function that applies the attention to the input feature.
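As one concrete instance of Attention = f(g(x), x), the sketch below implements SE-style channel attention in the spirit of SENet (mentioned above): g(x) squeezes the input into per-channel weights and f rescales x by them. The reduction ratio is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Channel attention: g(x) yields per-channel weights; f(g(x), x) rescales x."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.g = nn.Sequential(                      # g(x): generate attention weights
            nn.AdaptiveAvgPool2d(1),                 # squeeze spatial information
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.g(x).unsqueeze(-1).unsqueeze(-1)    # (B, C) -> (B, C, 1, 1)
        return x * w                                 # f(g(x), x): reweight the input

y = SEAttention(16)(torch.randn(1, 16, 8, 8))
```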

Classification optimization based on particle swarm algorithm

The particle swarm optimization (PSO) algorithm is a swarm intelligence optimization algorithm, simplified and used to solve optimization problems. It simulates a flock of birds by designing massless particles that have only two attributes, velocity and position, where velocity represents the speed of movement and position represents the direction of movement. Each particle searches the solution space individually for the optimal solution and records it as its current individual extremum; this extremum is shared with the other particles of the swarm, and the best of them becomes the current global optimum of the whole swarm. All particles then adjust their velocities and positions according to their own current individual extremum and the shared global optimum.

In the classical particle swarm algorithm, as particles search the solution space, the velocity term in the position update formula causes the inertia weight to become too small to explore new regions in the late stage of the search, which leads the algorithm to fall into a local optimum. This chapter proposes a new strategy by studying the oscillation process of particles in the PSO algorithm and improving PSO based on the combination of local optimal vectors and global optimal vectors. The orthogonally improved particle swarm optimization (OPSO) algorithm proposed in this chapter not only overcomes the drawbacks of global PSO but also improves PSO's performance.

In the active group, particles are updated through an orthogonal diagonalization process in which their position vectors are orthogonally diagonalized. The improved particle swarm algorithm (OPSO) relies on this orthogonal diagonalization (OD) process, through which orthogonal guidance vectors for the active group are obtained. The OD process yields a diagonal matrix DM expressed as a product of three matrices; DM is used to update the velocity and position vectors of all particles of the swarm, so the i-th velocity and position vectors can be updated through the diagonal element $d_i$ of DM. This OD strategy can provide better solutions and improve convergence in the search space. As mentioned above, DM is obtained by converting a square matrix $L$ of size $n \times n$ into a diagonal matrix of size $n \times n$ as follows: $L = N \, DM \, N^{-1}$, where $N$ is the matrix whose columns are the eigenvectors of $L$ and DM is the $n \times n$ diagonal matrix of eigenvalues. The matrix $N$ is invertible and can be normalized so that the decomposition can be written as $L = C \, DM \, C^{-1}$, where the columns of $C$ are orthogonal to each other, giving $DM = C^{-1} L C$; and since $C$ is an orthogonal matrix, $DM = C^{T} L C$.
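For a symmetric matrix, this OD step is exactly an eigendecomposition; a minimal numpy sketch (the input matrix is an illustrative assumption):

```python
# Orthogonal diagonalization (OD): for symmetric L, eigh returns an orthogonal C
# with L = C DM C^T, equivalently DM = C^T L C.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
L = (A + A.T) / 2                       # symmetrize an arbitrary matrix (illustrative)

eigvals, C = np.linalg.eigh(L)          # columns of C: orthonormal eigenvectors
DM = np.diag(eigvals)                   # diagonal matrix of eigenvalues

assert np.allclose(C @ DM @ C.T, L)     # L = C DM C^{-1}, with C^{-1} = C^T
assert np.allclose(C.T @ L @ C, DM)     # DM = C^T L C
```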

In this paper, the orthogonally optimized particle swarm (OPSO) algorithm is used with VGG to overcome the hyperparameter optimization problem in VGG; OPSO provides a new idea for improving population-based algorithms. In this algorithm, there is a population of m particles, each of dimension d. Based on the OD process, the m particles are divided into two groups, active and passive.

The Orthogonal Optimization of Particle Swarms (OPSO) algorithm is described as follows:

Assume that h(x) represents the function to be optimized and that the number of iterations in the search process is defined by NIt. The steps of the improved particle swarm algorithm are described as follows:

Randomly initialize each particle i with a position vector $P_i(0)$ and a velocity vector $V_i(0)$, i = 1, 2, …, m.

Compute the objective function h(x) using the position vectors $P_i(0)$.

Update the position vectors of the particles using the standard PSO update equations.

Based on the fitness value of h(x), the m individual position vectors can be sorted in ascending order.

Construct a matrix B of size (m × d), where each row holds one of the m individual position vectors, in the same order as obtained in the previous sorting step.

Convert the matrix B into a symmetric matrix A of size (d × d).

Apply OD to matrix A to obtain a diagonal matrix DM of size d × d.

Update the position and velocity vectors of the particles of the active group by the following equations: $V_i(t) = V_i(t-1) + c \, r(t)\left[ DM_i(t) - X_i(t-1) \right]$, $X_i(t) = X_i(t-1) + V_i(t)$

Determine $G_{pers,i}(t)$ for the m particles as follows:

$G_{pers,i}(t)=\begin{cases} X_i(t) & \text{if } h(X_i(t)) \le h(G_{pers,i}(t-1)) \\ G_{pers,i}(t-1) & \text{otherwise} \end{cases}$

Then evaluate $h(G_{pers,i}(t))$, where i = 1, 2, …, m.

Determine the optimal position by selecting the $G_{pers,i}(t)$ that corresponds to the minimum value of $h(G_{pers,i}(t))$, where i = 1, 2, …, m.

Then evaluate h(x) to determine the optimal position $G_{best}(t)$, where: $G_{best}(t)=\min_{i}\left(G_{pers,i}(t)\right)$
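The following numpy sketch ties steps 4-8 above together for one iteration. Two details are not fixed by the text and are assumptions here: the m × d matrix B is converted to a symmetric d × d matrix via A = BᵀB/m, the guidance value $DM_i(t)$ for particle i is taken as a diagonal element of DM, and the active group is taken to be the better half of the swarm.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, c = 8, 4, 2.0                        # swarm size, dimension, acceleration constant
X = rng.uniform(-5, 5, (m, d))             # position vectors
V = np.zeros((m, d))                       # velocity vectors

def h(x):                                  # illustrative objective (sphere function)
    return np.sum(x ** 2, axis=-1)

order = np.argsort(h(X))                   # step 4: sort particles by fitness
B = X[order]                               # step 5: one position vector per row
A = B.T @ B / m                            # step 6: symmetric d x d matrix (assumed form)
eigvals, _ = np.linalg.eigh(A)             # step 7: OD gives DM = diag(eigvals)

active = order[: m // 2]                   # active group (assumed: best half)
for k, i in enumerate(active):             # step 8: velocity/position update
    r = rng.uniform()
    V[i] = V[i] + c * r * (eigvals[k % d] - X[i])   # V_i(t) = V_i(t-1) + c r(t)[DM_i - X_i(t-1)]
    X[i] = X[i] + V[i]                              # X_i(t) = X_i(t-1) + V_i(t)
```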

The multi-objective particle swarm optimization (MOPSO) algorithm is also an evolutionary algorithm based on swarm intelligence. MOPSO encodes the independent variables of the optimization problem as the coordinates of particles in the multidimensional hypervolume of the solution space; the search of the solution space is accomplished by the motion of the particles through that space to find the optimal solution.

The basic process of the MOPSO algorithm is as follows:

Encoding: encode the independent variables of the optimization problem as particle coordinates, and rewrite the objective function of the optimization problem as the fitness function.

Population initialization: randomly generate a population of particles satisfying the boundary conditions of the independent variables and assign each particle an initial velocity. The upper limit on evolutionary generations or the optimization target must also be set.

Individual evaluation and optimal position recording: each particle in the swarm is substituted into the fitness function for evaluation. The historical optimal position pbest of each particle is recorded in set Ap, and the global optimal position gbest over all particles is recorded in set Ag. The two sets are then updated by removing old positions that are dominated by new positions.

Velocity and position update: calculate the new velocity and update the position using the velocity update formula: $v_i = w v_i + c_1 r_1 (p_{best} - p_i) + c_2 r_2 (g_{best} - p_i)$

The position update formula is: $p_i = p_i + v_i$

At this point, it is also necessary to check whether the particle's movement crosses the boundary; if it does, the particle can either be bound to the boundary or allowed to bounce off it.

Termination judgment: determine whether to terminate the algorithm based on the termination conditions.
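The velocity and position updates of steps 4-5 can be written directly; below is a minimal numpy sketch with the boundary-binding option described above. The inertia weight w, acceleration coefficients c1 and c2, and the bounds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
w, c1, c2 = 0.7, 1.5, 1.5               # inertia weight and acceleration coefficients
lo, hi = -5.0, 5.0                      # boundary of the independent variables

p = rng.uniform(lo, hi, 4)              # current position p_i
v = np.zeros(4)                         # current velocity v_i
pbest = p.copy()                        # historical optimal position of this particle
gbest = rng.uniform(lo, hi, 4)          # global optimal position (placeholder value)

r1, r2 = rng.uniform(size=2)            # random factors in [0, 1)
v = w * v + c1 * r1 * (pbest - p) + c2 * r2 * (gbest - p)   # velocity update
p = np.clip(p + v, lo, hi)              # position update, bound to the boundary
```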

Data processing design based on sample gradient optimization

Many real-world image datasets exhibit distinct long-tailed distributions. In a long-tailed distribution, categories with many samples are called head classes, and categories with few samples are called tail classes. Since the number of samples in the head classes is far larger than in the tail classes, the long-tailed distribution intensifies the inter-class competition in the Softmax cross-entropy loss during model training. This increased competition causes a large bias in the learned classifier parameters of different classes, resulting in poor classification performance on the tail classes. In this paper, the proposed BACL loss function is described in detail by revisiting the traditional Softmax cross-entropy loss. A double-angle sinusoidal decay strategy is then used to integrate the BACL and NCE losses into a new joint training framework that improves the performance of long-tailed classification models.

Softmax cross-entropy loss is a common and straightforward classification loss function widely used in various image classification tasks. It uses the Softmax activation function to convert the network output z (the logits) into predicted probabilities, yielding a multinomial estimated distribution p, and the loss value is obtained by taking the cross-entropy between p and the one-hot label matrix y. The expression is: $L_{CE} = -\sum_{i=1}^{C} y_i^{g}\log(p_i), \quad p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$ where C is the total number of categories, z = Wf + b denotes the output of the fully connected layer of the convolutional neural network, W is the weight matrix of the classifier, and f denotes the features of the sample learned by the network. p = [p1, p2, …, pC] is the estimated distribution matrix, computed as $p_i = \mathrm{softmax}(z_i)$; $y_i^{g} \in \{0,1\}$ is the ground-truth label of the current sample.

Further, taking the partial derivative of the Softmax cross-entropy loss with respect to the network output z gives the gradients of the different categories on the output of the fully connected layer of the convolutional neural network. The gradient for the different categories is computed as: $\frac{\partial L_{CE}}{\partial z_j} = p_j - y_j^{g} = \begin{cases} p_i - 1 & \text{if } j = i \text{ and } y_j^{g} = 1 \\ p_j & \text{if } j \ne i \text{ and } y_j^{g} = 0 \end{cases}$
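A minimal numpy check of this gradient: for Softmax cross-entropy, dL/dz = p − y, so the ground-truth class receives $p_i - 1$ and every other class receives $p_j$. The logits and label are illustrative values.

```python
import numpy as np

z = np.array([2.0, 0.5, -1.0])          # logits for C = 3 classes (illustrative)
y = np.array([1.0, 0.0, 0.0])           # one-hot label, ground-truth class i = 0

p = np.exp(z) / np.exp(z).sum()         # softmax probabilities
grad = p - y                            # [p_0 - 1, p_1, p_2]

# grad[0] < 0: encouraging gradient on the ground-truth class;
# grad[1], grad[2] > 0: inhibitory gradients on the complementary classes.
print(grad)
```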

It is worth noting that when category i is the ground-truth class, category j is considered a complementary class of category i, and a sample of category j is called a complementary sample of category i. During backpropagation, any sample of category i generates an encouraging gradient $p_i - 1$ on the ground-truth class i. At the same time, this sample generates an inhibitory gradient $p_j$ on each complementary class j. This inhibitory gradient $p_j$ is essentially a classification penalty imposed by category i on category j in the inter-class competition and has a negative effect on the prediction of category j. Therefore, in a long-tailed dataset, as the head classes bring a large number of complementary samples to the tail classes, these complementary samples generate a large cumulative inhibitory complementary-class gradient $\sum_{j \ne i}^{C} p_j$ on each tail class that overwhelms the encouraging ground-truth gradients $(p_i - 1)$ accumulated over the tail class's own samples. These cumulative inhibitory gradients strongly affect the updating of the network parameters, biasing the weight updates of the tail classes in the classifier, which results in poorer prediction of the tail classes and classification results biased toward the head classes.

To solve the above problems, this chapter proposes a balanced complementary (BACL) loss to mitigate the inhibitory gradients from complementary classes on the tail classes in the long-tailed distribution, providing a fairer training method for the tail-class samples. The expression of the proposed BACL loss function is: $L_{BACL} = -\sum_{i=1}^{C} y_i^{g}\log(p_i), \quad p_i = \frac{e^{z_i}}{\sum_{j \ne i}^{C}\frac{n_j}{n_{max}} e^{z_j} + e^{z_i}}$ where category i is the ground-truth class, category j is one of the complementary classes of category i in the long-tailed data with sample size $n_j$, and $n_{max}$ denotes the sample size of the category with the most training samples.

Taking the partial derivatives of the BACL loss with respect to the network outputs gives the gradients of the BACL loss with respect to the logits $z_i$ and $z_j$. For any sample of class i, the gradient of the BACL loss on the ground-truth class i and on a complementary class j are computed as follows: $\frac{\partial L_{BACL}}{\partial z_i} = p_i - 1, \quad \frac{\partial L_{BACL}}{\partial z_j} = \frac{n_j}{n_{max}}\frac{e^{z_j}}{e^{z_i}} p_i$ where $\partial L_{BACL}/\partial z_i$ denotes the ground-truth class gradient of category i, an encouraging gradient; $\partial L_{BACL}/\partial z_j$ denotes the complementary-class gradient, an inhibitory gradient imposed by category i on category j; and $n_j/n_{max}$ is the adaptive weighting coefficient devised to reduce the inhibitory gradient imposed by category i on category j and thus to reduce the extent to which the complementary-class gradients overwhelm the ground-truth class gradient.
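A hedged numpy sketch of the BACL probability and both gradients, following the formulas above; the logits and the long-tailed per-class sample counts are illustrative assumptions.

```python
import numpy as np

z = np.array([2.0, 1.0, 0.5])           # logits for C = 3 classes (illustrative)
n = np.array([50, 5000, 500])           # per-class sample counts, n_max = 5000
i = 1                                   # ground-truth class of this sample (a head class)
w = n / n.max()                         # adaptive weights n_j / n_max

# BACL prediction probability: complementary terms are down-weighted by n_j / n_max.
denom = np.sum(np.delete(w * np.exp(z), i)) + np.exp(z[i])
p_i = np.exp(z[i]) / denom

grad_i = p_i - 1                                     # ground-truth gradient, unchanged in form
j = 0                                                # a tail complementary class (n_j = 50)
grad_j = w[j] * np.exp(z[j]) / np.exp(z[i]) * p_i    # inhibitory gradient, scaled by 50/5000
```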

The designed adaptive weighting coefficient $n_j/n_{max}$ has a direct impact on the Softmax prediction probability $p_i$ and on the value of the complementary-class gradient $\partial L_{BACL}/\partial z_j$; this section therefore analyzes the role of $n_j/n_{max}$ both in the complementary-class gradient $\partial L_{BACL}/\partial z_j$ and in the Softmax prediction probability $p_i$.

$n_j/n_{max}$ adaptively reduces the complementary-class gradient $\frac{\partial L_{BACL}}{\partial z_j} = \frac{n_j}{n_{max}}\frac{e^{z_j}}{e^{z_i}} p_i$ while leaving the ground-truth class gradient $\frac{\partial L_{BACL}}{\partial z_i} = p_i - 1$ unchanged.

This adaptive weighting coefficient greatly reduces the gradient imposed on the tail classes by their complementary classes, which in turn reduces the gradient ratio between the complementary classes and the ground-truth class. In addition, $n_j/n_{max}$ can adjust the contribution of the complementary classes to the Softmax prediction probability $p_i$, optimizing the prediction probabilities of the ground-truth and complementary targets, which is conducive to improving the overall classification performance of the model.

The balanced complementary loss above works mainly from the perspective of the complementary-class samples, reducing the unfavorable impact of complementary-class gradients on the tail classes by introducing an adaptive weighting coefficient into the complementary-class term of the Softmax cross-entropy loss. To better utilize the information in the complementary samples, the complementary entropy loss is further employed to supplement training on the complementary samples and improve model performance. Complementary entropy loss is used in addition to the cross-entropy loss to guide the model's feature learning on complementary samples. Since the complementary entropy loss only considers training on the complementary classes other than the ground-truth class, it is defined as the average information entropy of the complementary predictions over a mini-batch: $\hat{C}(\hat{p}^z) = \frac{1}{N}\sum_{k=1}^{N} H(\hat{p}_k^z) = -\frac{1}{N}\sum_{k=1}^{N}\sum_{j=1, j \ne i}^{C} \frac{\hat{p}_{kj}}{1-\hat{p}_{ki}} \log\left(\frac{\hat{p}_{kj}}{1-\hat{p}_{ki}}\right)$ where N denotes the number of samples in the batch, C is the number of categories, $\hat{p}_{ki}$ is the predicted probability of the k-th sample on its ground-truth class i, and $\hat{p}_{kj}$ is the probability that the sample is misclassified into complementary class j. It is worth noting that in the original work the authors adopt an iterative, alternating training scheme for the cross-entropy and complementary entropy losses: the cross-entropy loss is first computed for an initial network parameter update, and the complementary entropy loss is then computed for a second update. Although this alternating scheme can even out the prediction probabilities of the complementary classes and improve the Softmax score of the ground-truth class, the training and time complexity of the model increase significantly because each training step requires two backpropagation passes.

BACL loss and complementary entropy loss can be combined to achieve better training results; they are merged into a new loss function for unified training through a joint training parameter. To balance the magnitudes of the BACL loss and the complementary entropy loss, this paper normalizes the complementary entropy; the normalized complementary entropy, abbreviated NCE, is expressed as: $L_{NCE} = \frac{1}{C-1}\hat{C}(\hat{p}^z)$

Finally, the joint loss (BACL+NCE) of the joint training strategy can be expressed as: $L = \alpha L_{NCE} + (1-\alpha) L_{BACL}$ where α is the joint parameter of the joint loss and takes values in [0, 1]. Inspired by the cumulative learning strategy proposed by Zhou et al., this chapter designs a double-angle sinusoidal decay strategy for the joint training parameter α. The strategy takes the current training progress $T_{epoch}$ as a variable and adjusts the weights of the BACL loss and the NCE loss at different training stages. The double-angle sinusoidal decay strategy is expressed as: $\alpha = 1 - \sin\left(\frac{2\,T_{epoch}}{T_{epochs}}\right)$
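A small sketch of the joint loss with this decay schedule; the exact argument of the sine follows the reconstructed formula above and should be read as an assumption, and the clipping to [0, 1] is added for safety.

```python
import math

def alpha_schedule(t_epoch: int, t_epochs: int) -> float:
    """Double-angle sinusoidal decay: alpha = 1 - sin(2 * T_epoch / T_epochs)."""
    return min(1.0, max(0.0, 1.0 - math.sin(2.0 * t_epoch / t_epochs)))

def joint_loss(l_nce: float, l_bacl: float, t_epoch: int, t_epochs: int) -> float:
    """L = alpha * L_NCE + (1 - alpha) * L_BACL."""
    a = alpha_schedule(t_epoch, t_epochs)
    return a * l_nce + (1.0 - a) * l_bacl

# At the start of training alpha is 1 (the NCE term dominates); as training
# progresses the weight shifts toward the BACL loss.
```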

Performance analysis of deep learning-based image analysis models
Comparative analysis of different network models

In order to verify the performance of different network structures, the preprocessed RGB information and the depth (D) information are jointly used as the input of the convolutional neural network, and four-channel-input convolutional neural networks are constructed using VGG16, ResNet18, MobileNetV2, and InceptionV3 as the backbone networks. Because one channel is added to the original data, the model parameters of the fourth channel are initialized in the same way as those of the first three channels when constructing the models. The structure and parameters of the improved networks also change somewhat; the number of model parameters (in millions) and the number of floating-point operations (FLOPs) are used to measure model size, and the parameters of the different network models are shown in Table 1. The lightweight network MobileNetV2 has the fewest parameters at 3.2M, owing to the bottleneck structure used in MobileNet. The VGG16 network model is the largest, with the maximum parameter count and FLOPs value, which is related to the absence of a residual structure in the network. ResNet18 has fewer parameters thanks to its residual structure, while MobileNetV2, although it has more layers, has fewer parameters thanks to its bottleneck structure.

Table 1. Parameters of different network models

Network model Layers Input size Parameters (M) FLOPs (G)
VGG16 16 224×224×4 135.5 16.15
ResNet18 18 224×224×4 11.5 2.05
MobileNetV2 54 224×224×4 3.2 0.54
InceptionV3 48 299×299×4 22.1 6.84

In this paper, experiments are first performed without depth information, using only RGB images for the maturity state detection experiments. Then the RGB images and depth information in the dataset are combined to generate four-channel RGBD data. To effectively evaluate classification performance, the experiment uses the comprehensive classification accuracy (Acc) on the test set as the evaluation index; the performance of the different network models is shown in Figure 2. By adding depth information, the accuracies of VGG16, ResNet18, MobileNetV2, and InceptionV3 improved by 4.2%, 2.6%, 1.6%, and 3.1%, respectively. Among the four network models, InceptionV3 is the most effective, with an accuracy of 95.2% using RGB information and 98.3% using RGBD information, and the highest average accuracy of 96.75% across the two input types.

Figure 2.

Performance of different network models

The training of the different models is analyzed in Table 2. The training time for VGG16 is much longer, at 43min37s. The average time required per sample for the four convolutional neural networks tested is very close, at around 0.03 seconds, requiring very little computation time. ResNet18 takes about 26 milliseconds per sample, the least of the four networks, whereas InceptionV3 takes about 30 milliseconds per sample. Of the times required to train the convolutional neural networks for 500 generations, MobileNetV2 takes the least, only 27min53s, while VGG16 takes about 43 minutes, which is related to model size and parameter count. Overall, the InceptionV3 model still has a certain advantage in accuracy and training time, and performs well.

Table 2. Training comparison of different models

Network model Layers FLOPs (G) Accuracy Training time Test time
VGG16 16 16.15 0.945 43min37s 0.0295s
ResNet18 18 2.05 0.932 29min45s 0.0262s
MobileNetV2 54 0.54 0.948 27min53s 0.0299s
InceptionV3 48 6.84 0.966 30min43s 0.0301s

Therefore, this paper compares the cross-entropy loss curves of the different models on the test set, based on the improved InceptionV3 model. The error curves of the different network models are compared in Fig. 3, where panels (a)~(d) show VGG16, ResNet18, MobileNetV2, and InceptionV3, respectively. The errors of the VGG16 and ResNet18 models are relatively large, with Loss values staying at about 0.5 and 0.3, respectively, after 100 epochs. MobileNetV2 performs relatively well, at roughly 0.15, while the improved InceptionV3 network in this paper has the smallest loss, staying around 0.10 after 100 epochs. The error when depth information is used is lower than without it, which indicates that the method in this paper benefits model training and can effectively improve the convergence of the InceptionV3 network model.

Figure 3.

Error comparison of different network models

Comparative analysis of model classification effects

To further analyze the effect of the classification optimization model, the HAULP31 dataset is used, and DenseNet and EfficientNetB are added to the above network models for comparison. Seven categories of everyday images (trees, flowers and plants, vehicles, daily necessities, people, buildings, and animals) are recognized and classified, and the average classification accuracy and average standard deviation are calculated over three experiments. A larger standard deviation indicates greater deviation among the data and a smaller one indicates less, reflecting to some extent the stability of the network. The loss value is also used for a comprehensive comparison. The classification performance of the different models is compared in Table 3: the weighted accuracy of the model in this paper reaches 98.23%, giving better classification and recognition performance and greater stability than the other networks, while its training time is also the shortest at only 135 minutes, roughly half that of the other models. From this analysis and comparison, the model in this paper has better classification and recognition performance, its generalization ability and robustness exceed those of the other networks, and it is also superior in training efficiency.

Table 3. Classification comparison of different models

Model Weighted accuracy Loss Mean standard deviation Training time (min)
VGG16 0.9683 0.7596 0.0254 223
ResNet18 0.9714 0.7651 0.0198 215
MobileNetV2 0.9723 0.8123 0.188 322
DenseNet 0.9713 0.7652 0.175 215
EfficientNetB 0.9821 0.7625 0.115 315
Ours 0.9823 0.7023 0.008 135

The learning rate, i.e., the step size in gradient descent, indicates how far the parameters move in each iteration and is an important parameter affecting model performance. Therefore, in this paper the learning rate is adjusted under different gradient-descent optimization algorithms to compare and analyze its impact on the network and to seek the best overall performance. The classical RMSprop and SGD optimizers and this paper's method are selected for comparison, and the impact of different learning rates on model performance is shown in Table 4. The two learning rates tested have little impact on this paper's model: at a learning rate of 0.008, the weighted accuracy of the proposed optimization algorithm is 0.9992, slightly higher than at 0.002. However, the effect of the learning rate on SGD and RMSprop is more pronounced, with better weighted accuracy at a learning rate of 0.008. This further illustrates the accuracy and stability of the proposed model's classification.

Table 4. Impact of different learning rates on model performance

Optimizer learning_rate batch_size smooth_val epoch Weighted accuracy Training time
RMSprop 0.002 16 0.1 100 0.9547 1h55m
RMSprop 0.008 16 0.1 100 0.9658 1h56m
SGD 0.002 16 0.1 100 0.9352 1h48m
SGD 0.008 16 0.1 100 0.9763 1h47m
Ours 0.002 16 0.1 100 0.9985 1h25m
Ours 0.008 16 0.1 100 0.9992 1h31m

In this paper, seven types of everyday images (trees, flowers, vehicles, daily necessities, people, buildings, and animals) are recognized and classified, and the average classification accuracy is calculated over three experiments. The classification results of the model are shown in Fig. 4. The classification accuracy for daily necessities is relatively low at 86.75%, with 8.15% of samples misclassified as vehicles and 3.20% as buildings. Apart from that, the classification accuracy for the other image types is above 90%, with trees and buildings highest at 95.23% and 93.44%, respectively. Overall, after the optimization in this paper, the image classification model achieves a classification accuracy above 85% on all seven types of everyday images, and its overall classification performance is excellent.

Figure 4.

Classification of models

Conclusion

In this paper, we optimize the deep learning image classification model based on particle swarm algorithm and sample gradient optimization, and draw the following conclusions through experimental studies:

After adding depth information, the accuracies of VGG16, ResNet18, MobileNetV2, and InceptionV3 improved by between 1.6% and 4.2%. Among the four network models, InceptionV3 is the most effective, with an accuracy of 95.2% using RGB information and 98.3% using RGBD information, and the highest average accuracy of 96.75% across the two kinds of input. The improved network model's loss stays around 0.10, which indicates that the method in this paper benefits model training and can effectively improve the convergence of the network model.

The weighted accuracy and training time of the model in this paper are 98.23% and 135 minutes, respectively; it has better classification and recognition performance than the other networks, and its training time is the shortest. At a learning rate of 0.008, the weighted accuracy of this paper's optimized algorithm is 0.9992, slightly higher than at 0.002; overall, the two learning rates have little impact on the proposed model. The experimental results show that the model in this paper has better classification and recognition performance, its generalization ability and robustness are better than those of the other networks, and it is also superior in training efficiency and stability.

On the other hand, the classification model in this paper has a relatively low classification accuracy of 86.75% for daily necessities; the classification accuracy for all other image types is above 90%. This further verifies the excellent performance of the proposed classification model.
