Research on the optimization method of image classification model based on deep learning technology and its improvement of data processing efficiency
Published online: 19 March 2025
Submitted: 11 November 2024
Accepted: 20 February 2025
DOI: https://doi.org/10.2478/amns-2025-0395
© 2025 Yi Zhang, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
With the rapid development of computer vision technology, image big data has become an essential foundation for scientific and technological progress. Large volumes of image data make it possible to build effective image classification, target detection and tracking, and semantic segmentation models, improving model accuracy while enhancing the credibility of their applications [1–2]. However, existing algorithms classify big-data images with low accuracy and poor efficiency, so how to effectively improve classification accuracy and efficiency has become a focus of current research [3–4].
In the field of industrial automation and intelligent manufacturing, the wide application of industrial images has brought growing concern for data privacy and security [5]. Industrial image data often contains sensitive content about production processes, equipment configuration, and employees' personal information, and may face the risk of leakage and misuse during outsourced processing and cloud storage. Deep learning has become a key technology for recognizing and classifying defects in industrial images [6–8]. Thanks to its outstanding performance in image recognition and classification, deep learning is the method of choice for analyzing industrial images, especially in critical areas such as defect detection, where its importance is irreplaceable [9].
Deep learning is widely used in cloud computing, but privacy protection in the cloud remains a technical challenge. Accurately accomplishing image classification while guaranteeing privacy has long been difficult; homomorphic encryption techniques offer a new solution for image classification in cloud computing environments [10–12]. Homomorphic encryption allows complex computational tasks, such as defect detection and classification of images, to be performed directly on encrypted data without decryption, which fundamentally protects data privacy and security. It not only supports the efficient utilization of cloud computing resources but also provides a strong guarantee for data privacy protection [13–15]. However, the practical application of homomorphic encryption faces many challenges: its high computational complexity and low processing efficiency are the main obstacles limiting its wide deployment in industrial image classification. New approaches are therefore needed to balance the computational complexity and accuracy of homomorphic encryption [16–18].
For the application of homomorphic encryption in industrial image processing, there is an urgent need to optimize its computational efficiency and to explore new algorithms and technical frameworks that overcome the existing limitations. Technological innovation can satisfy both the security protection of data privacy and the demand for high efficiency in industrial image processing, achieving a double optimization of data security and processing efficiency and advancing industrial automation and intelligent manufacturing [19–20]. At the same time, progress in homomorphic encryption for industrial image privacy protection helps to enhance public trust in industrial data processing, improve social acceptance of data utilization, and lay a solid foundation for the sustainable development of big data and artificial intelligence technology [21–22].
In this paper, traditional image classification methods are first compared with a deep neural network classification model to examine the advantages of deep learning in image classification. Next, orthogonal optimization is introduced into the particle swarm algorithm, providing a new idea for improving population-based algorithms, and the orthogonally optimized particle swarm algorithm is combined with VGG to address the hyperparameter optimization problem in VGG. To handle the long-tailed distribution of image data, this paper then proposes a balanced complementary (BACL) loss that, by revisiting the traditional Softmax cross-entropy loss, mitigates the suppressive gradient that complementary classes exert on tail classes. Finally, the feasibility of the proposed method is investigated through performance experiments, and the classification effects of different network models are compared.
Image classification is a significant component of computer vision and has been applied in many domains. Early, traditional image classification relied largely on manual operations and could not adapt to today's era of large-scale data, so deep learning image classification techniques emerged. Traditional image classification differs greatly from deep learning image classification, and there is also a significant gap in performance.
Traditional image classification relies on manual feature extraction and machine learning algorithms, and typically comprises four basic steps: feature extraction, feature coding, spatial constraints, and classification. Although traditional methods have achieved some success, they have obvious disadvantages compared with deep neural networks: feature extraction depends on manual work, model generalization is limited, and high-dimensional data demands large amounts of computational resources and execution time, potentially running into the "curse of dimensionality".
The results of traditional image classification methods are certainly undeniable in the initial processing of image tasks. However, their performance is often unsatisfactory when confronted with more complex and varied real-world image scenarios. As a result, researchers have turned to a more powerful and flexible technique known as deep learning image classification. In the research field of deep learning, deep neural network models have achieved remarkable success in image classification tasks.
When dealing with the task of image classification, deep neural networks process pixel data through a multilayer structure and compute step by step, progressively extracting features from low level to high level. A deep neural network is a highly complex function that nonlinearly maps high-dimensional data points into low-dimensional spaces, making it a crucial tool for processing large-scale high-dimensional data. Convolutional neural networks (CNNs) are one way to build such a hierarchical deep learning model, implemented through a stack of convolutional, pooling, and fully connected layers. The input layer holds the pixel data of the sample image; the convolutional layer performs a convolution on the input features using learnable weights and then applies a nonlinear transformation with bias values to extract features; the pooling layer downsamples the input features, gradually reducing their dimensionality to cut the model's parameters and computational complexity; finally, the fully connected layer maps the learned representations into the label space of the samples to classify them. CNNs have demonstrated superior performance in image processing tasks and have become one of the mainstays of today's image processing models.
The basic structure of the convolutional neural network is shown in Fig. 1. A convolutional neural network is a deep learning model inspired by the neural connections of the human visual center. It is a kind of feed-forward neural network that originated with the LeNet network proposed by LeCun et al. in 1998. Convolutional neural networks are a layer-by-layer, hierarchically connected network structure, generally consisting of an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer.

The basic structure of the convolutional neural network
The input layer is where data enters the network and is the first place the CNN "feeds" on data. Because the content, format, and value ranges of different types of data vary, the input data usually needs to be preprocessed so that every dimension falls within a controlled range, which helps the network extract features and accelerates training.
The convolutional layer is the basic operational level and the core of the CNN, consisting of multiple convolution kernels (filters) with trainable parameters. Each kernel slides over the input in the spatial dimensions as a window, computing a weighted sum with the input data channel by channel, row by row, and column by column; the resulting output matrix is the feature map.
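As a concrete illustration of this sliding-window weighted sum, here is a minimal single-channel 2D convolution in NumPy; the toy input, kernel values, unit stride, and "valid" padding are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def conv2d_single_channel(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Slide kernel k over input x ('valid' padding, stride 1) and
    compute the weighted sum at each position, yielding the feature map."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input
k = np.array([[1., 0.], [0., -1.]])            # toy 2x2 kernel
print(conv2d_single_channel(x, k))             # 3x3 feature map
```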
After obtaining the output feature map, a nonlinear mapping via an activation function is usually applied to improve the nonlinear fitting ability of the network. The most widely used choice is the ReLU activation function, whose mathematical expression is:

$$f(x) = \max(0, x)$$
The pooling layer performs downsampling, sharply reducing the feature map size while maintaining feature invariance, thereby reducing the network's parameters and computation, preventing overfitting, and improving the generalization ability of the neural network.
The fully connected layer (FC) is a two-dimensional planar structure formed by an arrangement of neurons and generally makes up the last layers of a CNN. Each neuron in a fully connected layer is connected to every neuron in the preceding layer, forming an extremely dense mesh structure. To further enhance the generalization ability of the network, regularization methods such as Dropout are often applied in the fully connected layers.
The output layer is the last layer of the CNN, and different designs are often used for it due to different tasks. In image classification tasks, the number of neurons in this layer is equal to the number of classification categories.
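Putting these five layer types together, the following is a minimal PyTorch sketch of such a CNN; the channel counts, kernel sizes, 32×32 input, and ten-class output are illustrative assumptions rather than any network from this paper.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Input -> convolution + ReLU -> pooling -> fully connected -> output."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),                                   # nonlinear activation
            nn.MaxPool2d(2),                             # pooling: halves H and W
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                             # regularization in the FC stage
            nn.Linear(32 * 8 * 8, num_classes),          # output layer: one neuron per class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = SimpleCNN()(torch.randn(1, 3, 32, 32))  # preprocessed 32x32 RGB input
print(logits.shape)  # torch.Size([1, 10])
```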
The commonly used deep convolutional neural network models are LeNet, AlexNet, VGGNet, GoogleNet, ResNet, and DenseNet.
LeNet is a feed-forward neural network consisting of 7 layers: 5 alternating convolutional and pooling layers followed by 2 fully connected layers. Traditional multilayer fully connected networks treat each pixel as an independent input, which ignores the correlation between pixels and increases the computational effort. LeNet instead uses convolutions with learnable parameters to extract similar features at multiple locations, which both reduces the number of parameters and automatically learns features from the raw pixels.
AlexNet improves learning performance by increasing the depth of the convolutional neural network and applying parameter optimization strategies. It is an 8-layer network with 5 convolutional layers and 3 fully connected layers, with max pooling performed after the 1st, 2nd, and 5th convolutional layers to reduce the amount of data. After the five rounds of convolution and pooling, Dropout is applied in the fully connected stage, and the more efficient ReLU activation alleviates the vanishing-gradient problem to a certain extent, improving convergence speed.
VGGNet has achieved good results in both image classification and localization problems, and is famous for its simplicity, homogeneous topology, and increased depth. The use of 138 million parameters is a major limitation, making it computationally expensive and challenging to deploy on systems with limited resources.
The GoogleNet framework is designed to achieve high accuracy while minimizing computational cost. It introduces the Inception block, which uses the idea of split, transform, and merge to incorporate multi-scale convolutional transforms: the traditional convolutional layer is replaced by a small block encapsulating filters at different scales so as to capture spatial information at multiple scales. In addition, connection density is reduced by using global average pooling in the last layer instead of a fully connected layer. The main drawbacks of GoogleNet are its heterogeneous topology, which requires customization from one block to another, and a representation bottleneck that greatly reduces the feature space of the next layer and can lead to the loss of useful information.
ResNet includes ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152, and became one of the most pioneering works in computer vision and deep learning. To counter the degradation phenomenon, ResNet introduces the residual structure, which recasts the original mapping $H(x)$ as

$$H(x) = F(x) + x,$$

where $F(x)$ is the residual mapping learned by the stacked layers and $x$ is carried by an identity shortcut connection.
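A minimal PyTorch sketch of this residual structure follows; the two-convolution basic block with matching input and output dimensions is an illustrative assumption.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Computes H(x) = F(x) + x, where F is two 3x3 convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.f(x) + x)  # identity shortcut added to the residual

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```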
DenseNet includes DenseNet121, DenseNet169, DenseNet201, and DenseNet264. Short connections (residual or skip connections) between input and output layers allow deep convolutional networks to be deeper, more accurate, and more efficient, so DenseNet connects each layer to every subsequent layer in feed-forward fashion. Whereas a traditional $L$-layer network has $L$ direct connections, DenseNet has $L(L+1)/2$.
Later, further improved network models such as ResNeXt, as well as convolutional neural network models based on the attention mechanism (e.g., the Residual Attention Network (RAN) and SENet), have emerged.
Currently, the attention mechanism has become an effective means of performance improvement in deep learning-based computer vision. When facing a complex scene, humans can quickly fix on the key area; adding an attention mechanism to a model imitates this behavior of the human visual system so that the model quickly focuses on the salient features. The attention mechanism can be abstracted as a learned weighting of features; in its widely used scaled dot-product form,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension.
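A minimal PyTorch sketch of this scaled dot-product attention follows; the tensor shapes are illustrative assumptions.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(QK^T / sqrt(d_k)) V over the last two dimensions."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # similarity of queries to keys
    weights = torch.softmax(scores, dim=-1)            # attention weights sum to 1 per query
    return weights @ v                                 # weighted sum of the values

q = torch.randn(2, 8, 64)   # (batch, queries, d_k)
k = torch.randn(2, 10, 64)  # (batch, keys, d_k)
v = torch.randn(2, 10, 64)  # (batch, keys, d_v)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 8, 64])
```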
The particle swarm optimization (PSO) algorithm in machine learning is a swarm intelligence optimization algorithm that is simplified and used to solve optimization problems. PSO simulates a bird in a flock as a massless particle with only two attributes: velocity, representing how fast it moves, and position, representing the direction of movement. Each particle searches the solution space individually for the optimal solution and records it as its current individual extreme value; the individual extreme values are shared across the swarm, and the best of them becomes the current global optimal solution. All particles then adjust their velocity and position according to their own current individual extreme value and the shared global optimal solution.
In the classical particle swarm algorithm, because of the influence of the velocity term in the position update formula, the inertia weight becomes too small in the late stage of the search to explore new regions, and the algorithm falls into local optima. This chapter proposes a new strategy by studying the oscillation process of the particles and improving PSO based on the combination vector of the local optimal vectors and the global optimal vector. The orthogonally improved particle swarm algorithm (OPSO) proposed in this chapter not only overcomes the drawbacks of global PSO but also improves the overall performance of PSO.
In the active group, particles are updated through an orthogonal diagonalization process in which their position vectors are orthogonally diagonalized. The improved particle swarm algorithm (OPSO) relies on the application of this orthogonal diagonalization (OD) process, from which the orthogonal guidance vectors of the active group are obtained. The OD process also yields a diagonal matrix DM as the product of three matrices, and DM is used to update the velocity and position vectors of all particles of the swarm. Thus the update can be performed with the orthogonal guidance supplied by DM.
In this paper, the orthogonally optimized particle swarm (OPSO) algorithm is combined with VGG to overcome the hyperparameter optimization problem in VGG, with OPSO providing a new idea for improving population-based algorithms. The algorithm maintains a population of particles from which, according to fitness, an active group is selected for the orthogonal update.
The Orthogonal Optimization of Particle Swarms (OPSO) algorithm is described as follows:
1. Randomly initialize each particle and compute the objective (fitness) function value of each particle.
2. Initialize the position vectors of the particles using the standard PSO update equations.
3. Based on the fitness values, select the particles of the active group and construct a matrix from their position vectors.
4. Convert this matrix to a symmetric matrix.
5. Apply the orthogonal diagonalization (OD) process to obtain the diagonal matrix DM; the position and velocity vectors of the four particles of the active group are then updated under the guidance of DM.
6. Determine the current global optimum, then evaluate the fitness of the updated particles, determine the optimal position by selecting the best candidate, and repeat from step 3 until the termination condition is satisfied.

Recognition and classification are then carried out based on the hyperparameter optimization algorithm and the improved network, as sketched below.
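The OD step can be sketched as follows. This is a speculative minimal illustration in NumPy: it assumes the active-group positions are stacked into a matrix, symmetrized, and eigendecomposed so that the diagonal matrix DM and the orthogonal eigenvectors guide the update; the specific matrix construction and update rule here are assumptions, not the paper's exact equations.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
active_positions = rng.normal(size=(dim, dim))  # positions of the active-group particles

# Symmetrize, then orthogonally diagonalize: A = Q diag(dm) Q^T.
a = active_positions @ active_positions.T       # symmetric by construction
eigvals, q = np.linalg.eigh(a)                  # orthogonal guidance vectors in q
dm = np.diag(eigvals)                           # the diagonal matrix DM

# Illustrative update: nudge each active particle along the guidance vectors,
# scaled by the (normalized) diagonal entries of DM.
step = 0.1
scale = np.diag(dm) / (np.abs(np.diag(dm)).max() + 1e-12)
active_positions += step * (q * scale).T
print(active_positions.shape)  # (4, 4)
```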
The multi-objective particle swarm optimization (MOPSO) algorithm is also an evolutionary algorithm based on swarm intelligence. MOPSO encodes the independent variables of the optimization problem as the coordinates of particles in the multidimensional hypervolume of the solution space, and the search is accomplished by the motion of the particles through that space toward the optimal solution.
The basic process of MOPSO algorithm is as follows:
1. Encoding: encode the independent variables of the optimization problem as particle coordinates, and rewrite the objective function of the optimization problem as the fitness function.
2. Population initialization: randomly generate a population of particles satisfying the boundary conditions of the independent variables, give each particle an initial velocity, and set the upper limit of evolutionary generations or the optimization objective.
3. Individual evaluation and optimal position recording: substitute each particle into the fitness function for evaluation, and record its historical optimal position.
4. Velocity and position update: compute the new velocity by the velocity update formula

$$v_{id}^{t+1} = w\, v_{id}^{t} + c_1 r_1 \left(p_{id} - x_{id}^{t}\right) + c_2 r_2 \left(p_{gd} - x_{id}^{t}\right),$$

where $w$ is the inertia weight, $c_1$ and $c_2$ are the learning factors, $r_1, r_2 \in [0,1]$ are random numbers, $p_{id}$ is the individual historical optimal position, and $p_{gd}$ is the global optimal position. The position update formula is:

$$x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}.$$
At this point, it is also necessary to check whether a particle has crossed the boundary; if so, the particle can be clamped to the boundary, or allowed to bounce off it. Finally, termination judgment: decide whether to stop the algorithm according to the termination conditions.
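The core of steps 3 and 4, including the boundary handling just described, can be illustrated with a minimal single-objective PSO sketch in NumPy; the sphere objective, swarm size, coefficient values, and boundary clamping are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 20, 5                       # swarm size and problem dimension
lo, hi = -5.0, 5.0                   # boundary conditions of the variables
w, c1, c2 = 0.7, 1.5, 1.5            # inertia weight and learning factors

f = lambda x: np.sum(x**2, axis=-1)  # toy fitness: sphere function (minimize)

x = rng.uniform(lo, hi, (n, dim))    # initialize positions within bounds
v = rng.uniform(-1, 1, (n, dim))     # initial velocities
pbest, pbest_val = x.copy(), f(x)    # individual historical optima
g = pbest[pbest_val.argmin()]        # global optimum

for _ in range(200):
    r1, r2 = rng.random((n, dim)), rng.random((n, dim))
    v = w*v + c1*r1*(pbest - x) + c2*r2*(g - x)  # velocity update formula
    x = np.clip(x + v, lo, hi)                   # position update, clamped to the boundary
    val = f(x)
    better = val < pbest_val
    pbest[better], pbest_val[better] = x[better], val[better]
    g = pbest[pbest_val.argmin()]

print(f(g))  # approaches 0
```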
Many real-world image datasets exhibit distinct long-tailed distributions. In a long-tailed distribution, categories with many samples are called head classes and categories with few samples are called tail classes. Because the number of head-class samples vastly exceeds the number of tail-class samples, the long-tailed distribution intensifies the inter-class competition between different classes in the Softmax cross-entropy loss during training. This increased competition biases the parameter learning of the classifier across classes, resulting in poor classification performance on the tail classes. In this paper, the proposed BACL loss function is derived by revisiting the traditional Softmax cross-entropy loss, and a double-angle sinusoidal decay strategy is used to integrate the BACL and NCE losses into a new joint training framework that improves the performance of long-tailed classification models.
Softmax cross-entropy loss is a common and straightforward classification loss function widely used in image classification tasks. It applies the Softmax activation function to the network outputs (logits) $z$ to obtain the predicted probability of each class,

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}},$$

and the loss for a sample with ground-truth class $y$ is $L_{CE} = -\log p_y$, where $C$ is the number of classes.
Taking the partial derivative of the Softmax cross-entropy loss with respect to the network output $z_i$ gives

$$\frac{\partial L_{CE}}{\partial z_i} = \begin{cases} p_i - 1, & i = y \\ p_i, & i \neq y \end{cases}$$

so the ground-truth class receives a negative (encouraging) gradient while every other class receives a positive (suppressive) gradient.
It is worth noting that when category $i$ is not the ground-truth class of a sample, it acts as a complementary class and receives the suppressive gradient $p_i > 0$. Under a long-tailed distribution, the tail classes serve as complementary classes for the overwhelming majority of training samples, so they accumulate a disproportionately large suppressive gradient relative to the few encouraging gradients they receive from their own samples.
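This gradient form is easy to verify numerically; the following PyTorch check with random logits and an arbitrary target confirms that the logit gradient equals $p_i - 1$ for the ground-truth class and $p_i$ otherwise.

```python
import torch

logits = torch.randn(1, 5, requires_grad=True)        # one sample, five classes
target = torch.tensor([2])                            # ground-truth class y = 2
torch.nn.functional.cross_entropy(logits, target).backward()

p = torch.softmax(logits.detach(), dim=1)             # predicted probabilities
expected = p.clone()
expected[0, target] -= 1.0                            # p_i - 1 for the ground-truth class
print(torch.allclose(logits.grad, expected, atol=1e-6))  # True
```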
In order to solve the above problems, this chapter proposes a balanced complementary (BACL) loss to mitigate the suppressive gradient exerted by the complementary classes on the tail classes under a long-tailed distribution, providing a fairer training signal for the tail-class samples. Concretely, BACL weights the complementary-class terms of the Softmax cross-entropy loss with an adaptive coefficient.
Taking the partial derivative of the BACL loss with respect to the network outputs yields the gradient of the BACL loss with respect to the logits; compared with the Softmax cross-entropy gradient above, the suppressive gradient on each complementary class is scaled by the adaptive weighting coefficient.
The designed adaptive weighting coefficient greatly reduces the gradient that the complementary classes exert on the tail classes, which in turn reduces the gradient ratio between the complementary classes and the ground-truth class, so that the parameter learning of tail-class classifiers is no longer dominated by suppressive gradients.
The balanced complementary loss above works mainly from the perspective of the complementary-class samples, reducing the unfavorable impact of complementary-class gradients on the tail classes by introducing an adaptive weighting coefficient into the complementary-class terms of the Softmax cross-entropy loss. To better exploit the information in the complementary samples, the complementary entropy loss is further used to supplement their training and improve model performance. Complementary entropy loss serves, alongside cross-entropy loss, to guide the model's feature learning on complementary samples. Since it considers only the complementary classes other than the ground-truth class, the complementary entropy loss is defined as the average information entropy of the complementary predicted distribution over a mini-batch:

$$L_{COM} = \frac{1}{N}\sum_{i=1}^{N} H(\tilde{p}_i), \qquad H(\tilde{p}_i) = -\sum_{j \neq y_i} \tilde{p}_{ij} \log \tilde{p}_{ij}, \qquad \tilde{p}_{ij} = \frac{p_{ij}}{1 - p_{iy_i}},$$

where $N$ is the mini-batch size and $\tilde{p}_{ij}$ is the predicted probability of class $j$ renormalized over the complementary classes of sample $i$.
The BACL loss and the complementary entropy loss can be combined through a joint training parameter into a new loss function for unified training, achieving more desirable results. To balance the magnitudes of the BACL loss and the complementary entropy loss, this paper normalizes the complementary entropy; the normalized complementary entropy, abbreviated NCE, is expressed as:

$$L_{NCE} = \frac{1}{\log(C-1)}\, L_{COM},$$

where $C$ is the number of classes, so that the entropy term is scaled by its maximum attainable value $\log(C-1)$.
Finally, the joint loss (BACL+NCE) of the joint training strategy can be expressed as

$$L_{joint} = L_{BACL} + \lambda\, L_{NCE},$$

where $\lambda$ is the joint training parameter, decayed during training according to the double-angle sinusoidal strategy.
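A minimal sketch of this joint objective in PyTorch, assuming the standard complement-entropy definition above; since the exact BACL expression depends on its adaptive coefficients, plain cross-entropy stands in for the BACL term purely for illustration, and the sign convention of the NCE loss is likewise an assumption of this sketch.

```python
import math
import torch
import torch.nn.functional as F

def nce_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Normalized complementary entropy loss: negative mean entropy of the
    predicted distribution renormalized over the non-ground-truth classes,
    divided by log(C-1). Minimizing it spreads confusion over wrong classes."""
    n, c = logits.shape
    p = F.softmax(logits, dim=1)
    p_y = p.gather(1, target.unsqueeze(1))                   # probability of the true class
    p_comp = p / (1.0 - p_y + 1e-12)                         # renormalize complementary classes
    p_comp = p_comp.masked_fill(F.one_hot(target, c).bool(), 0.0)
    entropy = -(p_comp * torch.log(p_comp + 1e-12)).sum(1)   # H of complementary distribution
    return -entropy.mean() / math.log(c - 1)

def joint_loss(logits, target, lam=1.0):
    bacl_stand_in = F.cross_entropy(logits, target)  # placeholder for the BACL term
    return bacl_stand_in + lam * nce_loss(logits, target)

logits = torch.randn(8, 10, requires_grad=True)
target = torch.randint(0, 10, (8,))
print(joint_loss(logits, target, lam=0.5).item())
```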
In order to verify the effect of different network structures, the preprocessed RGB information and the depth (D) information are jointly used as the input of the convolutional neural network. Four-channel input networks are constructed using VGG16, ResNet18, MobileNetV2, and InceptionV3 as backbones; since one channel is added to the original data, the model parameters of the fourth channel are initialized in the same way as those of the first three channels. The structure and parameters of the improved networks change accordingly. Using the number of parameters (millions) and the number of floating-point operations (FLOPs) as measures of model size, the parameters of the different network models are shown in Table 1. The lightweight network MobileNetV2 has the fewest parameters at 3.2M, owing to its bottleneck structure. The VGG16 model is the largest, with the maximum parameter count and FLOPs, which is related to the absence of a residual structure in the network. ResNet18 has fewer parameters thanks to its residual structure, while MobileNetV2, although deeper, keeps its parameter count small through the bottleneck structure.
Different network model parameters
| Network model | Layers | Input size | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|
| VGG16 | 16 | 224×224×4 | 135.5 | 16.15 |
| ResNet18 | 18 | 224×224×4 | 11.5 | 2.05 |
| MobileNetV2 | 54 | 224×224×4 | 3.2 | 0.54 |
| InceptionV3 | 48 | 299×299×4 | 22.1 | 6.84 |
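The fourth-channel initialization described above can be sketched in PyTorch as follows; ResNet18 as the backbone and copying the red-channel filters into the depth channel are one plausible, assumed reading of initializing the fourth channel "in the same way" as the first three.

```python
import torch
import torchvision.models as models

def make_rgbd_resnet18(num_classes: int) -> torch.nn.Module:
    """Extend ResNet18 to a 4-channel (RGB-D) input network."""
    net = models.resnet18(weights=None)
    old = net.conv1  # original 3-channel stem convolution
    new = torch.nn.Conv2d(4, old.out_channels, kernel_size=old.kernel_size,
                          stride=old.stride, padding=old.padding, bias=False)
    with torch.no_grad():
        new.weight[:, :3] = old.weight            # keep the RGB filters
        new.weight[:, 3:] = old.weight[:, :1]     # init the depth channel like a colour channel
    net.conv1 = new
    net.fc = torch.nn.Linear(net.fc.in_features, num_classes)
    return net

x = torch.randn(1, 4, 224, 224)                   # RGB-D input, as in Table 1
print(make_rgbd_resnet18(7)(x).shape)             # torch.Size([1, 7])
```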
In this paper, experiments are first performed without depth information, using only RGB images for the maturity state detection experiments. Then the RGB images and depth information in the dataset are combined to generate four-channel RGBD data. To evaluate the classification effect, the combined classification accuracy (Acc) on the test set is used as the evaluation index; the performance of the different network models is shown in Figure 2. By adding depth information, the accuracy of VGG16, ResNet18, MobileNetV2, and InceptionV3 improved by 4.2%, 2.6%, 1.6%, and 3.1%, respectively. Among the four models, InceptionV3 is the most effective, with an accuracy of 95.2% using RGB information and 98.3% using RGBD information, and the highest average accuracy of 96.75% across the two input types.

Performance of different network models
The training of the different models is analyzed in Table 2. The training time of VGG16 is by far the longest at 43min37s. The average time required per sample for the four convolutional neural networks is very close, around 0.03 seconds, requiring very little computational time. ResNet18 takes about 26 milliseconds per sample, the least of the four networks, whereas InceptionV3 takes about 30 milliseconds per sample. Of the times required to train the convolutional neural networks for 500 generations, MobileNetV2 takes the least at only 27min53s, while VGG16 takes over 43 minutes, which is related to the size of the model and its parameters. Overall, the InceptionV3 model retains an advantage in accuracy versus training time and performs well.
The training situation of different models is compared
| Network model | Layers | FLOPs (G) | Accuracy rate | Training time | Test time |
|---|---|---|---|---|---|
| VGG16 | 16 | 16.15 | 0.945 | 43min37s | 0.0295s |
| ResNet18 | 18 | 2.05 | 0.932 | 29min45s | 0.0262s |
| MobileNetV2 | 54 | 0.54 | 0.948 | 27min53s | 0.0299s |
| InceptionV3 | 48 | 6.84 | 0.966 | 30min43s | 0.0301s |
Therefore, this paper compares the cross-entropy loss curves of the different models on the test set, based on the improved InceptionV3 model. The error curves of the different network models are compared in Fig. 3, where panels (a)~(d) show VGG16, ResNet18, MobileNetV2, and InceptionV3, respectively. The errors of VGG16 and ResNet18 are relatively large, with their loss values settling at about 0.5 and 0.3, respectively, after 100 epochs. MobileNetV2 performs relatively well at roughly 0.15, while the improved InceptionV3 network in this paper has the smallest loss, staying around 0.10 after 100 epochs. The error with depth information is lower than without, indicating that the method in this paper benefits model training and can effectively improve the convergence ability of the InceptionV3 network model.

Error comparison of different network models
To further analyze the effect of the classification optimization model, the HAULP31 dataset is used, and DenseNet and EfficientNetB are added to the networks above for comparative analysis. Seven categories of everyday images (trees, flowers, vehicles, daily necessities, people, buildings, and animals) are recognized and classified, and the average classification accuracy and the average standard deviation are calculated over three experiments. A larger standard deviation means greater deviation between runs and a smaller one means less, so it reflects the stability of the network to a certain extent. The loss value is also used for a comprehensive comparison. The classification effects of the different models are compared in Table 3. The weighted accuracy of the model in this paper reaches 98.23%, a better classification and recognition effect than the other networks, and it is also the most stable; at the same time its training time is the shortest at only 135 minutes, roughly half that of the other models. After analysis and comparison, the model in this paper achieves a better classification and recognition effect, its generalization ability and robustness exceed the other networks, and its training efficiency is also superior.
The classification effect of different models is compared
| Model | Weighted accuracy | Loss | Mean standard deviation | Training time (min) |
|---|---|---|---|---|
| VGG16 | 0.9683 | 0.7596 | 0.0254 | 223 |
| ResNet18 | 0.9714 | 0.7651 | 0.0198 | 215 |
| MobileNetV2 | 0.9723 | 0.8123 | 0.188 | 322 |
| DenseNet | 0.9713 | 0.7652 | 0.175 | 215 |
| EfficientNetB | 0.9821 | 0.7625 | 0.115 | 315 |
| Ours | 0.9823 | 0.7023 | 0.008 | 135 |
The learning rate, i.e., the step size in gradient descent, determines the distance moved in each iteration and is an important parameter affecting model performance. Therefore, the learning rate is adjusted under different gradient descent optimization algorithms to compare its impact on the network and to seek the best overall performance. The classical RMSprop and SGD optimizers are compared with this paper's method, and the impact of different learning rates on model performance is shown in Table 4. The two learning rates tested have little impact on this paper's model: at a learning rate of 0.008, the accuracy of the optimized algorithm is 0.9992, slightly higher than at 0.002. The effect of the learning rate on SGD and RMSprop is more pronounced, with better weighted accuracy at a learning rate of 0.008. This further illustrates the accuracy and stability of the classification model in this paper.
The impact of different learning rates on model performance
| Optimizer | learning_rate | batch_size | smooth_val | epoch | Weighted accuracy | Training time |
|---|---|---|---|---|---|---|
| RMSprop | 0.002 | 16 | 0.1 | 100 | 0.9547 | 1h55m |
| RMSprop | 0.008 | 16 | 0.1 | 100 | 0.9658 | 1h56m |
| SGD | 0.002 | 16 | 0.1 | 100 | 0.9352 | 1h48m |
| SGD | 0.008 | 16 | 0.1 | 100 | 0.9763 | 1h47m |
| Ours | 0.002 | 16 | 0.1 | 100 | 0.9985 | 1h25m |
| Ours | 0.008 | 16 | 0.1 | 100 | 0.9992 | 1h31m |
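A minimal sketch of how such an optimizer and learning-rate comparison can be scripted in PyTorch; the synthetic data and placeholder linear classifier are assumptions, while the learning rates, batch size of 16, and label smoothing of 0.1 mirror Table 4. This paper's own optimizer is not reproduced here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Tiny synthetic stand-in for the image data (assumption, for illustration only).
x = torch.randn(256, 20)
y = torch.randint(0, 7, (256,))
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)  # batch_size as in Table 4

def run_trial(opt_name: str, lr: float, epochs: int = 20) -> float:
    torch.manual_seed(0)
    model = torch.nn.Linear(20, 7)                     # placeholder classifier
    opt_cls = {"RMSprop": torch.optim.RMSprop, "SGD": torch.optim.SGD}[opt_name]
    opt = opt_cls(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # smooth_val = 0.1
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    with torch.no_grad():  # accuracy on the synthetic data, purely illustrative
        return (model(x).argmax(1) == y).float().mean().item()

for name in ("RMSprop", "SGD"):
    for lr in (0.002, 0.008):                          # the two rates in Table 4
        print(name, lr, round(run_trial(name, lr), 4))
```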
In this paper, seven categories of everyday images (trees, flowers, vehicles, daily necessities, people, buildings, and animals) are recognized and classified, and the average classification accuracy is calculated over three experiments. The classification results of the model are shown in Fig. 4. The classification accuracy for daily necessities is relatively low at 86.75%, with 8.15% of samples mistaken for vehicles and 3.20% classified as buildings. Apart from that, the classification accuracy for the other image categories exceeds 90%, with trees and buildings the highest at 95.23% and 93.44%, respectively. Overall, after the optimization in this paper, the image classification model's accuracy exceeds 85% on all seven categories of everyday images, and the overall classification performance is excellent.

Classification results of the model
In this paper, we optimize the deep learning image classification model based on particle swarm algorithm and sample gradient optimization, and draw the following conclusions through experimental studies:
After adding depth information, the accuracy of VGG16, ResNet18, MobileNetV2, and InceptionV3 improved by 1.6% to 4.2%. Among the four network models, InceptionV3 is the most effective, with an accuracy of 95.2% using RGB information and 98.3% using RGBD information, and the highest average accuracy of 96.75% across the two input types. The improved network model's loss stays around 0.10, indicating that this paper's method benefits model training and can effectively improve the convergence ability of the network model.
The weighted accuracy and training time of the model in this paper are 98.23% and 135 minutes, respectively, giving better classification and recognition effects than the other networks with the shortest training time. At a learning rate of 0.008, the accuracy of this paper's optimized algorithm is 0.9992, only slightly higher than at 0.002, showing that the two learning rates have little impact on this paper's model. The experimental results show that the model in this paper has a better classification and recognition effect, its generalization ability and robustness exceed the other networks, and it is also superior in training efficiency and stability.
On the other hand, the classification model in this paper has a relatively low classification accuracy of 86.75% for daily necessities; for all other image categories the accuracy exceeds 90%. This further verifies the excellent performance of the classification model in this paper.
