
Research on Deep Learning-based Image Processing and Classification Techniques for Complex Networks

17 March 2025


Introduction

Image processing, a common technical problem in production and daily life, refers to converting image information acquired by various technical means into mathematical information and then processing that information with computer programs or software [1]. Typical computer-based image processing tasks include classifying images according to given criteria, compressing images, enhancing image quality, and extracting image features [2]. Current image processing technology can improve image clarity and recognize and extract features from image content, which distinguishes it substantially from traditional image processing technology [3]. The development of deep learning stems from the proposal and development of artificial neural network models, which allow deep learning to reduce the dimensionality of complex problems during processing. Deep learning is a bionic learning algorithm that mimics the workings of the neuronal network of the human brain to extract and learn features from images [4]. With large-scale training datasets, deep learning can accurately recognize targets in images and extract useful information, so applying deep learning to complex network image processing can greatly improve processing efficiency [5]. Image classification, another major field of image processing, relies on classification algorithms to identify and arrange the regions in an image, extract the features they contain, and finally perform classifier recognition. The key to the entire pipeline is feature extraction, whose quality directly affects the subsequent classification results. Deep learning enables high-performance feature extraction in this step, laying a solid foundation for subsequent image classification [6-7].

This study proposes a new encoder structure consisting mainly of a DCNN, an ECANet module, and a parallel DSA_ASPP module. On top of this encoder, an image classification algorithm based on lightweight and multi-scale attention fusion is proposed. To further explore the properties of the image feature network, the network features are organized using common network statistics, including the number of nodes N, the degree of dispersion y, the clustering coefficient C, and the maximum and minimum weights Qmax and Qmin, to complete the extraction of image feature network information. The segmentation effect is compared on two large-scale datasets, CamVid and Cityscapes, and finally the performance of PreactResNet is analyzed through comparison experiments on two fine-grained benchmark datasets, CUB-200-2011 and Stanford Dogs.

Overview
Image processing technology

Image processing technology uses computers to analyze and process images and convert them into a form suitable for the intended application. Literature [8] highlights the advantages of deep learning in image processing, systematically reviews the application of deep learning techniques in the field of image mapping over the last 15 years, and provides insights into existing image restoration methods based on different neural network structures and their information fusion strategies. Literature [9] compares three deep learning image processing methods, namely Single Shot Detection, Faster Region-Based Convolutional Neural Networks, and You Only Look Once, and finds that YOLO-v3 performs best among the three algorithms. Literature [10] emphasizes the importance of biomedical image processing in the medical field, points out that medical image processing based on deep convolutional neural networks is a current research hotspot, and reviews the limitations and development directions of deep learning-based medical image segmentation methods to help researchers address the problems that currently exist in the medical field. Literature [11] finds that convolutional neural networks combined with nondestructive testing technology and computer vision systems can efficiently extract deep image features, and that this technology can be used for the detection and analysis of complex food matrices, which is of great significance for food quality and safety in the food industry. Literature [12] systematically reviews the literature on pixel-level image fusion based on deep learning, summarizes existing deep learning-based image fusion methods, including convolutional neural networks, convolutional sparse representations, and stacked autoencoders, into several general frameworks, and discusses the key issues and challenges in each framework. Literature [13] comprehensively surveys deep learning-based medical image segmentation methods, categorizing and comparing the current popular literature according to a coarse-to-fine multi-level structure, so as to help readers understand the relevant principles and guide them toward possible improvements. Literature [14] proposes a self-configuring deep learning method for image segmentation, applies it to the biomedical field, and verifies its effectiveness through experiments, which has certain application significance. Literature [15] systematically reviews deep learning techniques for solving inverse problems in imaging, especially popular neural network architectures for imaging tasks, and discusses how to combine deep learning with analytical methods to solve imaging inverse problems effectively.

Image classification techniques

Image classification is a computer vision task that analyzes digital image data and automatically categorizes images by extracting specific information from them. Literature [16] proposed four new deep learning models, a 2D convolutional neural network, a 3D CNN, a recurrent 2D CNN and a recurrent 3D CNN, applied them to hyperspectral image classification, and verified their effectiveness and feasibility through evaluation experiments. Literature [17] designed experiments to compare traditional machine learning and deep learning image classification algorithms, and the results show that deep learning algorithms achieve higher recognition accuracy than traditional machine learning algorithms on large sample datasets. Literature [18] verified the effectiveness and reliability of deep learning and transfer learning methods for image classification by testing on the large ImageNet dataset, helping readers understand the application of deep learning techniques in image classification more deeply. Literature [19] points out the practicality of data augmentation: by comparing multiple solutions to the data augmentation problem in image classification, it finds that traditional transformations are among the more successful augmentation strategies, and it also proposes a neural augmentation technique that uses neural network learning to improve classifier augmentation. Literature [20] focuses on the application of convolutional neural networks to image classification tasks, analyzes the trend from their predecessors to recent state-of-the-art deep learning systems, and points out the challenges that remain in applying convolutional neural networks to image classification. Literature [21] proposes a remote sensing image classification method based on three-dimensional deep learning that jointly processes spectral and spatial information; experimental analysis verifies the effectiveness and feasibility of the method, which achieves better classification rates than state-of-the-art methods at lower computational cost. Literature [22], after discussing the application areas of deep learning and commonly used models, highlights that convolutional neural networks show excellent performance in image classification; it also builds a simple convolutional neural network for image classification and analyzes the effects of learning rates set by different methods and of the optimal parameters of different optimization algorithms on classification. Literature [23] developed a data augmentation method based on image style transfer that can generate new images of high perceptual quality; its superior performance in image classification is verified by three case studies, and it has promising application prospects.

Complex network image processing and classification
Complex network-based image description
Complex network-based texture description

This paper establishes the degree matrix of an image under different thresholds according to static statistics of complex networks and completes the texture description of the image by counting the degree distribution of the network nodes in each state. A method for building a complex network model of an image is proposed: each pixel of the image is regarded as a node of a complex network, every pair of nodes is initially connected by an edge, and the weight of an edge is determined by the weighted sum of the distance between the two pixels and their grey-level difference. The initial complete-graph model is then thresholded dynamically: a series of thresholds is applied to the edge weights and edges whose weights exceed the threshold are deleted, so that only edges between pixels that are close together and have similar grey values remain. To simplify the complex network model, this paper selects the 28 nodes within a distance of 3 of a node as its neighborhood; only nodes within this neighborhood can be connected by an edge. The weight w(v(x,y), v(x′,y′)) of the edge between two nodes i(x,y) and j(x′,y′) is the weighted sum of the distance between the nodes and the difference between their pixel grey values. To make the degree distribution of the nodes more uniform, this paper takes w as the direct sum of these two terms, as shown in formula (1): \[\begin{align} & w\left( v(x,y),v\left( x',y' \right) \right)=\sqrt{{{\left( x-x' \right)}^{2}}+{{\left( y-y' \right)}^{2}}} \\ & +\left| I(x,y)-I\left( x',y' \right) \right| \end{align}\]

After normalizing the obtained edge weights, a series of thresholds t is set, and edges whose weights exceed the threshold are deleted, giving the neighborhood θ(vt) and degree deg(vt) of each node at each threshold, as shown in Eqs. (2) and (3): \[\theta \left( {{v}_{t}} \right)=\left\{ v'\in V\mid \left( v,v' \right)\in E\ \And\ w\left( v,v' \right)\le t \right\}\] \[\deg \left( {{v}_{t}} \right)=\left| \theta \left( {{v}_{t}} \right) \right|\]

The proposed method is based on the degree matrix, where the degree of a node serves as its point weight. By counting the number of elements with the same degree in each degree matrix, feature vectors are constructed to describe the texture of the image. Controlling the number of thresholds controls the number of degree matrices and hence the length of the feature vector. The flow of the algorithm is shown in Figure 1.

Figure 1.

The flow chart of image texture feature extraction algorithm
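As an illustration of the procedure in Eqs. (1)-(3) and Figure 1, the following sketch computes the texture descriptor for a grayscale image stored as a NumPy array. The neighborhood radius, the weight normalization, and the threshold grid are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def pixel_offsets(radius=3):
    """Offsets (dx, dy) with 0 < distance <= radius (28 offsets for radius 3)."""
    return [(dx, dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)
            if (dx, dy) != (0, 0) and dx * dx + dy * dy <= radius * radius]

def texture_features(img, thresholds=np.linspace(0.1, 0.9, 9)):
    """Texture descriptor V_T: degree histograms of the thresholded pixel network."""
    img = img.astype(float)
    h, w = img.shape
    offs = pixel_offsets()
    L = len(offs)                                   # maximum possible degree
    weights = np.full((h, w, L), np.inf)            # edge weight to each neighbour
    for k, (dx, dy) in enumerate(offs):
        dist = np.hypot(dx, dy)
        r0, r1 = max(0, -dx), h - max(0, dx)        # rows whose neighbour stays inside
        c0, c1 = max(0, -dy), w - max(0, dy)
        diff = np.abs(img[r0:r1, c0:c1] - img[r0 + dx:r1 + dx, c0 + dy:c1 + dy])
        weights[r0:r1, c0:c1, k] = dist + diff      # Eq. (1): distance + grey difference
    weights /= weights[np.isfinite(weights)].max()  # normalise the finite weights to [0, 1]
    feats = []
    for t in thresholds:
        deg = (weights <= t).sum(axis=2)            # degree matrix deg(v_t), Eqs. (2)-(3)
        feats.append(np.bincount(deg.ravel(), minlength=L + 1))
    return np.concatenate(feats)                    # texture feature vector V_T

img = np.random.randint(0, 256, (64, 64))           # stand-in for a grayscale image
print(texture_features(img).shape)                  # (9 * 29,) = (261,)
```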

Complex network-based shape description

In this paper, the Harris corner points of the image are regarded as the nodes of a complex network, and an initial complete-graph network model is established from them. In the complete-graph model every pair of nodes is connected by an edge, so it is a regular network and cannot by itself serve as a distinguishing topological feature of the image. Using the dynamic evolution process of the complex network, a series of sub-networks can be generated, and the shape features of the image are extracted from the static statistical descriptions of each sub-network, such as its degree, connectivity, shortest path length, average path length, and clustering coefficient.

Dynamic evolution, an important feature of complex network models, can be driven either by the distribution of edge attributes or by the distribution of node attributes. General complex network evolution models are based on edge properties, such as the threshold evolution method, the minimum spanning tree evolution method, and the K-nearest-neighbor evolution method. The minimum spanning tree evolution method chosen in this paper guarantees that the sub-networks generated by evolution are all connected, and therefore avoids the difficulty of computing network statistics for disconnected sub-networks that arises in the threshold evolution method. The process is given in Eq. (4), where G0 is the initial complete-graph network, MST is the minimum spanning tree algorithm, Tn (n = 1, 2, 3, …) is the minimum spanning tree generated in the n-th evolution step, and Gn is the sub-network generated in the n-th step. That is, the n-th minimum spanning tree Tn is obtained from the difference between the initial network G0 and the sub-network Gn−1 generated in the (n−1)-th step, and the n-th sub-network Gn is the union of Gn−1 and Tn.

\[\left\{ \begin{array}{l} {{T}_{1}}=MST\left( {{G}_{0}} \right) \\ {{G}_{1}}={{T}_{1}} \\ \ldots \\ {{T}_{n}}=MST\left( {{G}_{0}}-{{G}_{n-1}} \right) \\ {{G}_{n}}={{G}_{n-1}}+{{T}_{n}} \\ \end{array} \right.\]

Shape feature extraction is then realized by calculating the average path length, network diameter, clustering coefficient, maximum degree and maximum core number of each sub-network Gi. Controlling the number of evolution steps controls the length of the feature vector and the running time of the procedure.
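Assuming the Harris corner coordinates have already been detected (e.g. with OpenCV), the sketch below uses networkx to carry out the minimum-spanning-tree evolution of Eq. (4). The Euclidean-distance edge weight is an assumption, since the text does not fix the weight used for the corner network.

```python
import itertools
import networkx as nx
import numpy as np

def evolve_subnetworks(corners, n_steps=4):
    """Sub-networks G_1..G_n generated by repeated MST extraction, as in Eq. (4)."""
    pts = np.asarray(corners, dtype=float)
    # G0: complete graph on the corner points, weighted by Euclidean distance.
    G0 = nx.Graph()
    G0.add_nodes_from(range(len(pts)))
    for i, j in itertools.combinations(range(len(pts)), 2):
        G0.add_edge(i, j, weight=float(np.linalg.norm(pts[i] - pts[j])))

    subnets = []
    G = nx.Graph()
    G.add_nodes_from(G0)                 # G starts with the nodes only (no edges)
    for _ in range(n_steps):
        remaining = [e for e in G0.edges if not G.has_edge(*e)]   # edges of G0 - G_{n-1}
        if not remaining:
            break
        T = nx.minimum_spanning_tree(G0.edge_subgraph(remaining), weight="weight")
        G = nx.compose(G, T)             # G_n = G_{n-1} + T_n
        subnets.append(G.copy())
    return subnets

corners = np.random.rand(30, 2) * 100    # stand-in for detected Harris corners
print(len(evolve_subnetworks(corners, n_steps=3)))   # 3 evolved sub-networks
```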

Complex network based feature extraction for image description

In this paper, we propose a complex network image descriptor whose feature vector consists of the texture feature vector VT and the shape feature vector VS, namely: \[V=\left[ {{V}_{T}},{{V}_{S}} \right]\]

The texture feature vector consists of the distributions of the degree matrices at the different thresholds: \[{{V}_{T}}=\left[ ds{{t}_{1}},ds{{t}_{2}},ds{{t}_{3}}\cdots ds{{t}_{N}} \right]\] \[ds{{t}_{i}}=[{{n}_{1}},{{n}_{2}},{{n}_{3}}\cdots {{n}_{L}}],\quad i\in \left[ 1,N \right]\]

Here dsti is the distribution of the degree matrix under the i-th threshold, N is the number of selected thresholds, nk is the number of elements with degree k in the current degree matrix, and L is the maximum possible degree, i.e. the number of other nodes in the selected neighborhood.

The shape feature vector is constructed from the Harris corner points: the network is dynamically evolved with the minimum spanning tree algorithm to obtain the sub-networks Gi at different steps, and the average path length, network diameter, clustering coefficient, maximum degree and maximum core number of each sub-network are calculated to extract the shape features of the image. The mathematical expression is shown below: \[{{V}_{S}}=\left[ {{F}_{1}},{{F}_{2}},{{F}_{3}}\ldots {{F}_{M}} \right]\] \[{{F}_{i}}=\left[ {{L}_{i}},{{D}_{i}},{{C}_{i}},\deg _{\max }^{i},K_{\max }^{i} \right],\quad i\in \left[ 1,M \right]\]

Fi is the vector of static statistics of sub-network Gi at step i; Li, Di, Ci, \(\deg _{\max }^{i}\) and \(K_{\max }^{i}\) are the average path length, network diameter, clustering coefficient, maximum degree and maximum core number of the sub-network at step i, and M is the selected number of evolution steps.
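The full descriptor of the two subsections above can then be assembled as follows. This is a sketch that reuses the hypothetical texture_features and evolve_subnetworks helpers from the earlier snippets; the networkx statistics stand in for the paper's definitions, and the maximum core number is used as an interpretation of the "maximum kernel" statistic.

```python
import networkx as nx
import numpy as np

def shape_features(subnets):
    """F_i = [L_i, D_i, C_i, deg_max, K_max] for each evolved sub-network G_i."""
    feats = []
    for G in subnets:                                # connected by MST construction
        feats += [
            nx.average_shortest_path_length(G),      # average path length L_i
            nx.diameter(G),                          # network diameter D_i
            nx.average_clustering(G),                # clustering coefficient C_i
            max(d for _, d in G.degree),             # maximum degree
            max(nx.core_number(G).values()),         # maximum core number ("max kernel")
        ]
    return np.array(feats)

def image_descriptor(texture_vec, subnets):
    """Full descriptor V = [V_T, V_S] fed to the classifier."""
    return np.concatenate([texture_vec, shape_features(subnets)])

# Example usage (with the earlier sketches):
# V = image_descriptor(texture_features(img), evolve_subnetworks(corners))
```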

Coder-decoder based image segmentation processing
General structure of the DECANet network model

Image semantic segmentation plays a critical role in parsing image content, and the quality of the extracted feature information has a direct impact on the performance of subsequent methods. In practice, some of the collected images suffer from unbalanced illumination, and the lighting conditions cause a loss of texture, color and other feature information of the objects in the image to a greater or lesser extent [24].

In this paper, we propose an image semantic segmentation model, DECANet, which aims to address invalid information and the loss of local detail during semantic segmentation and to improve segmentation accuracy. The overall architecture of DECANet is shown in Fig. 2.

Figure 2.

Structure of the DECANet network model

Encoders
Attention Mechanism

The attention mechanism is a technique that allows a network model to autonomously learn and fully exploit the feature information of the regions it focuses on. It is mainly inspired by the visually selective attention mechanism of the human brain, which scans all available information, ignores irrelevant information and highlights the key information, so that the corresponding region of the image receives more attention [25].

The matrix Wk used in ECANet to learn channel attention is shown below; it involves k×C parameters: \[\left[ \begin{matrix} {{w}^{1,1}} & \ldots & {{w}^{1,k}} & \ldots & \ldots & 0 \\ 0 & {{w}^{2,2}} & \cdots & \cdots & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & \cdots & 0 & {{w}^{c,c-k+1}} & \cdots & {{w}^{c,c}} \\ \end{matrix} \right]\]

\[y=\frac{1}{H\times W}\sum\limits_{i=1}^{W}{\sum\limits_{j=1}^{H}{{{x}_{ij}}}}\]

where y ∈ R^{1×1×C} is the result of global average pooling of the input features x ∈ R^{H×W×C}, C1D denotes one-dimensional convolution, the scale of the channel interaction is determined by the size of the convolution kernel k, and σ and ω are the sigmoid function and the channel weights, respectively.

\[{{\omega }_{i}}=\sigma \left( \sum\limits_{j=1}^{k}{{{w}^{j}}y_{i}^{j}} \right),\quad y_{i}^{j}\in \Omega _{i}^{k}\]

where \(\Omega _{i}^{k}\) denotes the set of k neighboring channels of yi and σ is the activation function.

An adaptively selected kernel size is used for the 1D convolution; that is, the convolution shares weights across channels, so that each group of weights has the same size and the number of parameters is reduced from k×C (where C is the number of channels) to k:

\[\omega =\sigma \left( C1{{D}_{k}}\left( y \right) \right)\]

C and k are related through a nonlinear mapping ϕ, with the number of channels taken as a power of two: \[C=\phi \left( k \right)={{2}^{\left( \gamma \times k-b \right)}}\]

Thus, given the channel dimension C, the convolution kernel size k is: \[k=\psi \left( C \right)={{\left| \frac{{{\log }_{2}}\left( C \right)}{\gamma }+\frac{b}{\gamma } \right|}_{odd}}\]

where |t|odd denotes the odd number closest to t, and γ = 2, b = 1.

In this paper, the attention mechanism module is introduced at the encoder stage. The amount of information carried by each feature channel is used to judge its contribution to target segmentation accuracy, and weight coefficients are attached to each feature channel to strengthen feature learning in a targeted way. The main aim is to highlight the feature information that is important to the segmentation result and to suppress redundant channel information, thereby improving the overall learning and generalization ability of the model.
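A minimal PyTorch sketch of an ECA-style channel attention module consistent with the formulas above (global average pooling, a 1-D convolution whose kernel size k = ψ(C) adapts to the channel dimension, and a sigmoid gate). It is a generic implementation, not the exact DECANet code.

```python
import math
import torch
import torch.nn as nn

class ECAAttention(nn.Module):
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1                    # force an odd kernel size, |.|_odd
        self.pool = nn.AdaptiveAvgPool2d(1)          # y = global average pooling of x
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                            # x: (B, C, H, W)
        y = self.pool(x)                             # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(1, 2)            # (B, 1, C): channels as a 1-D sequence
        w = self.sigmoid(self.conv(y))               # w = sigma(C1D_k(y))
        w = w.transpose(1, 2).unsqueeze(-1)          # (B, C, 1, 1)
        return x * w                                 # reweight the channels

x = torch.randn(2, 64, 32, 32)
print(ECAAttention(64)(x).shape)                     # torch.Size([2, 64, 32, 32])
```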

Dilated convolution

Dilated convolution, also called atrous or expansion convolution, expands the convolution kernel by inserting spaces (zeros) between its elements. The receptive field is the area of the original image that a pixel in the output feature map of a given convolutional layer is mapped to. Under a particular structure each position of the receptive field receives the same attention weight; kernels with a larger receptive field attend more to large target objects, while kernels with a smaller receptive field attend to smaller targets. In FCN, the receptive field can be enlarged by pooling operations that reduce the image size, after which up-sampling operations restore the original image size [26].
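For intuition (an illustrative snippet, not from the paper): a kernel of size k with dilation rate d covers an effective area of k + (k−1)(d−1), so a 3×3 kernel with dilation 2 behaves like a 5×5 kernel while keeping the same nine parameters.

```python
import torch
import torch.nn as nn

k, d = 3, 2
k_eff = k + (k - 1) * (d - 1)                                    # effective kernel size: 5
conv = nn.Conv2d(64, 64, kernel_size=k, dilation=d, padding=d)   # padding=d keeps the spatial size
x = torch.randn(1, 64, 32, 32)
print(k_eff, conv(x).shape)                                      # 5 torch.Size([1, 64, 32, 32])
```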

DSA_ASPP module

Atrous spatial pyramid pooling (ASPP) is widely used in the various versions of DeepLab. Its operation is simple: dilated convolutions with different dilation rates are applied to the same feature map, which alleviates the gridding effect produced by dilated convolution; all the resulting feature maps are concatenated, expanding the number of channels, and a 1 × 1 convolutional layer finally reduces the number of channels to the desired value.

Depthwise separable convolution and dilated convolution are combined into depthwise separable dilated convolution, and replacing all the standard convolutions in the ASPP module with depthwise separable dilated convolutions greatly reduces the number of parameters produced by the model during training while improving the segmentation accuracy of the network, which raises training efficiency to a certain extent. In addition, the ASPP module is fine-tuned: the ReLU function is replaced by Leaky ReLU, BatchNorm is added to optimize the model, and so on. The improved ASPP module is called the DSA_ASPP module and is shown in Fig. 3.

Figure 3.

DSA_ASPP module
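The following is a rough sketch of an ASPP variant built from depthwise-separable dilated convolutions with BatchNorm and Leaky ReLU, in the spirit of the DSA_ASPP module described above. The number of branches, the dilation rates (1, 6, 12, 18) and the channel widths are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DSConvBranch(nn.Module):
    """Depthwise-separable dilated convolution branch: depthwise 3x3 + pointwise 1x1."""
    def __init__(self, in_ch, out_ch, rate):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=rate, dilation=rate,
                      groups=in_ch, bias=False),          # depthwise, dilated
            nn.Conv2d(in_ch, out_ch, 1, bias=False),       # pointwise
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DSAASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [DSConvBranch(in_ch, out_ch, r) for r in rates])
        self.project = nn.Sequential(                      # concatenate, then reduce channels
            nn.Conv2d(out_ch * len(rates), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

print(DSAASPP(256, 256)(torch.randn(1, 256, 32, 32)).shape)   # torch.Size([1, 256, 32, 32])
```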

Bilinear interpolation

The value at an unknown point (x, y) is inferred from a straight line through known points. In linear interpolation, as shown in Figure 4, the known points (x0, y0) and (x1, y1) are connected by a straight line, and the value y of a point x on that line in the interval [x0, x1] is sought. Solving for this value is the process of linear interpolation.

Figure 4.

Linear interpolation

As can be seen from the figure: \[\frac{y-{{y}_{0}}}{x-{{x}_{0}}}=\frac{{{y}_{1}}-{{y}_{0}}}{{{x}_{1}}-{{x}_{0}}}\]

Since the value of x is known, the value of y can be obtained from the equation: \[y={{y}_{0}}+\left( x-{{x}_{0}} \right)\frac{{{y}_{1}}-{{y}_{0}}}{{{x}_{1}}-{{x}_{0}}}={{y}_{0}}+\frac{\left( x-{{x}_{0}} \right){{y}_{1}}-\left( x-{{x}_{0}} \right){{y}_{0}}}{{{x}_{1}}-{{x}_{0}}}\]

If y is known, the procedure for finding x is the same, with the roles of x and y swapped.

First, linear interpolation is performed in the x direction to obtain f(R1) and f(R2): \[f\left( {{R}_{1}} \right)\approx \frac{{{x}_{2}}-x}{{{x}_{2}}-{{x}_{1}}}f\left( {{Q}_{11}} \right)+\frac{x-{{x}_{1}}}{{{x}_{2}}-{{x}_{1}}}f\left( {{Q}_{21}} \right),\quad {{R}_{1}}=\left( x,{{y}_{1}} \right)\] \[f\left( {{R}_{2}} \right)\approx \frac{{{x}_{2}}-x}{{{x}_{2}}-{{x}_{1}}}f\left( {{Q}_{12}} \right)+\frac{x-{{x}_{1}}}{{{x}_{2}}-{{x}_{1}}}f\left( {{Q}_{22}} \right),\quad {{R}_{2}}=\left( x,{{y}_{2}} \right)\]

Then linear interpolation is performed in the y direction to obtain f(P): \[f\left( P \right)\approx \frac{{{y}_{2}}-y}{{{y}_{2}}-{{y}_{1}}}f\left( {{R}_{1}} \right)+\frac{y-{{y}_{1}}}{{{y}_{2}}-{{y}_{1}}}f\left( {{R}_{2}} \right)\]

This gives the desired result f(x, y): \[\begin{align} & f\left( x,y \right)\approx \frac{\left( {{x}_{2}}-x \right)\left( {{y}_{2}}-y \right)}{\left( {{x}_{2}}-{{x}_{1}} \right)\left( {{y}_{2}}-{{y}_{1}} \right)}f\left( {{Q}_{11}} \right) \\ & +\frac{\left( x-{{x}_{1}} \right)\left( {{y}_{2}}-y \right)}{\left( {{x}_{2}}-{{x}_{1}} \right)\left( {{y}_{2}}-{{y}_{1}} \right)}f\left( {{Q}_{21}} \right) \\ & +\frac{\left( {{x}_{2}}-x \right)\left( y-{{y}_{1}} \right)}{\left( {{x}_{2}}-{{x}_{1}} \right)\left( {{y}_{2}}-{{y}_{1}} \right)}f\left( {{Q}_{12}} \right) \\ & +\frac{\left( x-{{x}_{1}} \right)\left( y-{{y}_{1}} \right)}{\left( {{x}_{2}}-{{x}_{1}} \right)\left( {{y}_{2}}-{{y}_{1}} \right)}f\left( {{Q}_{22}} \right) \end{align}\]

In this paper, after the features are extracted in the encoder stage, bilinear interpolation is used to perform a 4-fold upsampling of the output feature map; the features are then fused, and the fused feature map is upsampled by another factor of 4, so that the feature map is restored to the same size as the input image.
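A small sketch of the bilinear formula above and of a 4-fold bilinear upsampling step such as the one used in the decoder (here via torch.nn.functional.interpolate, as one possible implementation).

```python
import torch
import torch.nn.functional as F

def bilinear(x, y, x1, x2, y1, y2, q11, q21, q12, q22):
    """Bilinear interpolation from the corner values at Q11=(x1,y1), Q21=(x2,y1),
    Q12=(x1,y2), Q22=(x2,y2)."""
    denom = (x2 - x1) * (y2 - y1)
    return (q11 * (x2 - x) * (y2 - y) + q21 * (x - x1) * (y2 - y) +
            q12 * (x2 - x) * (y - y1) + q22 * (x - x1) * (y - y1)) / denom

print(bilinear(0.5, 0.5, 0, 1, 0, 1, 0.0, 1.0, 1.0, 2.0))     # 1.0 for the surface f(x,y)=x+y

feat = torch.randn(1, 256, 32, 64)                             # an encoder feature map
up4 = F.interpolate(feat, scale_factor=4, mode="bilinear", align_corners=False)
print(up4.shape)                                               # torch.Size([1, 256, 128, 256])
```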

Lightweight Segmentation Convolution Based Image Classification
Group Convolution

Group convolution can improve model performance and reduce the number of parameters. The input feature maps are processed in separate groups, and the different outputs are then merged again.

Group convolution reduces the number of parameters compared with standard convolution by grouping the features. The following example calculates the number of parameters for the two types of convolution. Suppose the input is C*H*W and the number of convolution kernels is N, so the number of output channels is also N. If each convolution kernel has size C*K*K, the total number of parameters of the standard convolution is N*C*K*K.

Group convolution improves on the standard convolution above; the basic input and output parameters are the same, and a parameter G is introduced to indicate the number of groups. If the input feature map is divided into G groups, the parameters of each group change accordingly. Denoting the number of parameters of the group convolution by P, it can be expressed as: \[P=N*\frac{C}{G}*K*K\]

where C/G is the number of input channels per group and N/G the number of output channels per group; each convolution kernel is of size (C/G)*K*K, the total number of kernels is N, and the number of kernels per group is N/G. As can be seen from this example, the total number of parameters of the convolution is reduced to 1/G of the original, which shows that group convolution can reduce the number of parameters.
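The parameter counts can be checked directly with PyTorch's grouped convolution; the concrete numbers C = 64, N = 128, K = 3, G = 4 are arbitrary examples, not values from the paper.

```python
import torch.nn as nn

C, N, K, G = 64, 128, 3, 4
standard = nn.Conv2d(C, N, K, bias=False)            # N*C*K*K parameters
grouped = nn.Conv2d(C, N, K, groups=G, bias=False)   # N*(C/G)*K*K parameters

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(standard))    # 73728  = 128*64*3*3
print(count(grouped))     # 18432  = 128*(64/4)*3*3 = standard / G
```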

Lightweight split convolution

Suppose the input feature map is C*h*w and the output feature map is N*h*w. Different image features are extracted by convolving the input feature map; this process can usually be written as: \[Y=WX+b\]

where the input is X ∈ R^{C×h×w}, the output is Y ∈ R^{N×h×w}, C is the number of input channels, N the number of output channels, W ∈ R^{C×h×w×N} denotes the weights, and b denotes the bias term.

To simplify the notation the bias term is usually ignored, and the convolution between complete feature maps can be expressed as: \[{{y}_{i}}=\sum\limits_{j}{{{w}_{ij}}}{{x}_{j}}\]

Expanding the convolution equation (23) yields the following matrix expression: \[\left[ \begin{matrix} {{y}_{1}} \\ {{y}_{2}} \\ \vdots \\ {{y}_{N}} \\ \end{matrix} \right]=\left[ \begin{matrix} {{W}_{11}} & {{W}_{12}} & \cdots & {{W}_{1C}} \\ {{W}_{21}} & {{W}_{22}} & \cdots & {{W}_{2C}} \\ \vdots & \vdots & \ddots & \vdots \\ {{W}_{N1}} & {{W}_{N2}} & \cdots & {{W}_{NC}} \\ \end{matrix} \right]\left[ \begin{matrix} {{x}_{1}} \\ {{x}_{2}} \\ \vdots \\ {{x}_{C}} \\ \end{matrix} \right]\]

where yi is the i-th output value, i = 1, 2, ⋯, N, wij are the elements of the weight matrix (each representing one parameter, with convolution kernel size K*K), and xj is the j-th input value, j = 1, 2, ⋯, C.

To reduce feature redundancy and feature loss and to extract more effective features, the lightweight split convolution divides the input feature map channels into two parts according to a ratio α: one part uses the more complex 3*3 convolution to extract typical feature information, and the other part uses the simpler 1*1 convolution to extract self-calibrating feature information that supplements small hidden details. The whole process is computed as follows: \[\left[ \begin{matrix} {{y}_{1}} \\ {{y}_{2}} \\ \vdots \\ {{y}_{N}} \\ \end{matrix} \right]=\left[ \begin{matrix} {{W}_{1,1}} & \cdots & {{W}_{1,\alpha C}} \\ \vdots & \ddots & \vdots \\ {{W}_{N,1}} & \cdots & {{W}_{N,\alpha C}} \\ \end{matrix} \right]\left[ \begin{matrix} {{x}_{1}} \\ \vdots \\ {{x}_{\alpha C}} \\ \end{matrix} \right]+\left[ \begin{matrix} {{W}_{1,\alpha C+1}} & \cdots & {{W}_{1,C}} \\ \vdots & \ddots & \vdots \\ {{W}_{N,\alpha C+1}} & \cdots & {{W}_{N,C}} \\ \end{matrix} \right]\left[ \begin{matrix} {{x}_{\alpha C+1}} \\ \vdots \\ {{x}_{C}} \\ \end{matrix} \right]\]

where α is the proportion of the channel split and takes the value α = 0.5. The left-hand matrix wij is applied to the first αC channels with a 3*3 convolution kernel, j = 1, …, αC, and the right-hand matrix wij is applied to the remaining (1−α)C channels with a 1*1 convolution kernel, j = αC+1, …, C.
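A possible PyTorch realization of the lightweight split convolution described above, with α = 0.5 as in the text; the exact layer configuration is an assumption rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class LightweightSplitConv(nn.Module):
    def __init__(self, in_ch, out_ch, alpha=0.5):
        super().__init__()
        self.c1 = int(in_ch * alpha)                 # channels sent to the 3x3 branch
        self.conv3 = nn.Conv2d(self.c1, out_ch, 3, padding=1, bias=False)
        self.conv1 = nn.Conv2d(in_ch - self.c1, out_ch, 1, bias=False)

    def forward(self, x):
        xa, xb = x[:, :self.c1], x[:, self.c1:]      # split along the channel axis
        return self.conv3(xa) + self.conv1(xb)       # sum of the two partial products

x = torch.randn(1, 64, 56, 56)
print(LightweightSplitConv(64, 128)(x).shape)        # torch.Size([1, 128, 56, 56])
```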

Lightweight Segmentation Convolution and Deep Neural Network Fusion
Lightweight Segmentation Convolutional and Residual Network Modeling

ResNet brings a great performance improvement and is therefore widely used as the basis of many network structures. ResNet proposes a residual learning module to simplify the training of networks that previously had to be trained simply by increasing depth, solving the performance degradation caused by increasing network depth. In this subsection, we construct a new network model that embeds the lightweight segmentation convolution into the residual network structure to perform the image classification task.

The residual module fits a residual mapping over the convolutional layers. Denote the desired underlying mapping by H(x); the output fitted by several stacked convolutional blocks is the residual F(x) = H(x) − x, so the underlying mapping can be recovered as H(x) = F(x) + x. The whole process is realized through a "shortcut" connection, which fuses the output feature layers with the earlier input layer. In network structures it is common to skip several layers and then add a residual module so as to fuse the features of different layers effectively. The identity shortcut mapping is defined as: \[y=F\left( x,{{W}_{i}} \right)+x\]

The shortcut connection is simple to implement and adds no extra computational complexity or parameters; it brings a large benefit compared with plain networks and a clear performance improvement over models of the same parameter size, depth, and width. The dimensions of x and F in Eq. (26) are assumed equal. If they are not, the dimensions can be matched by an additional operation such as a linear projection Ws: \[y=F\left( x,{{W}_{i}} \right)+{{W}_{s}}x\]

where Ws is used only to match dimensions.

Lightweight Segmentation Convolution with ResNeXt Network Model Construction

The ResNeXt network is constructed by repeating a module that aggregates several identical structures. The whole module is split into several branches that extract features separately, and only a few hyperparameters need to be set. A new hyperparameter called cardinality is introduced; like the depth and width of the network, it acts as an additional dimension of the model.

Lightweight Segmentation Convolutional and Width Network Modeling

The performance degradation that occurs as the depth of the network model increases is mitigated by a wider deep residual network, which also accelerates convergence. In addition, using the dropout method inside the deep residual block is proposed to optimize training and reduce overfitting. The residual block with identity mapping can be represented as: \[{{x}_{l+1}}={{x}_{l}}+F\left( {{x}_{l}},{{W}_{l}} \right)\]

where xl is the input of the l-th unit, xl+1 is the output of the l-th unit, F is the residual function, and Wl are the parameters of the module. The residual network consists of residual modules stacked in order.

Lightweight split convolution with PreactResNet model construction

The PreactResNet network, also called the pre-activation network, mainly changes the order of the convolution, activation, and BN layers, so that there is a path that goes directly from the first ResNet module to the last without passing through the nonlinear ReLU function in between, which can improve the accuracy of the model.

Placing the activation function before the convolutional layer further enhances the shortcut-connection property of the residual network structure, and the residual unit can be expressed as Eq. (29) and Eq. (30): \[{{y}_{l}}=h\left( {{x}_{l}} \right)+F\left( {{x}_{l}},{{W}_{l}} \right)\] \[{{x}_{l+1}}=f\left( {{y}_{l}} \right)\]

where h(x) and f(y) are identity mappings, i.e., h(xl) = xl and f(yl) = yl; xl is the input of the l-th residual unit, Wl is the weight of the l-th residual unit, F is the residual function, and f is the operation applied after the element-wise summation (the ReLU activation in the original formulation).
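A sketch of a pre-activation residual block matching Eqs. (29)-(30) with identity mappings h and f; standard 3×3 convolutions are used here for simplicity, and swapping them for the lightweight split convolution would follow the construction in this section. The projection shortcut Ws is used only when the dimensions differ.

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        # projection shortcut W_s when the input and output dimensions differ
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
                         if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))     # BN and ReLU placed before the conv
        out = self.conv2(self.relu(self.bn2(out)))
        return out + self.shortcut(x)                # x_{l+1} = x_l + F(x_l, W_l)

print(PreActBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56)).shape)
# torch.Size([1, 128, 28, 28])
```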

Experimental results and analysis
Validity analysis of complex network feature statistics

The previous two sections described how network features are extracted from images. To verify the effectiveness of these network features, this section carries out comparison experiments on three groups of images. Since Qmin and Qmax both describe the network weights, only Qmin is reported in the experiments; the validity analysis therefore covers the common statistics consisting of the number of nodes N, the degree of dispersion y, the clustering coefficient C and the minimum weight Qmin.

The three groups of images contain three pictures belonging to the two categories of people and flowers. For these images, the number of nodes N, the degree of dispersion y and the clustering coefficient C are obtained from the corresponding formulas, and the minimum weight Qmin is obtained by calculation. The histogram statistics of the obtained data are shown in Fig. 5.

Figure 5.

Network parameter statistics profile

In the figure, the number of nodes N, the degree of dispersion y, the clustering coefficient C and the minimum weight Qmin show obvious differences between image 1 on the one hand and images 2 and 3 on the other, while the distributions of the statistics of images 2 and 3 are clearly similar. For the number of nodes N, image 1 is mainly concentrated at 0 and 1, whereas images 2 and 3 are mainly concentrated at 0; for the dispersion y, image 1 is consistently and significantly higher than images 2 and 3; for the clustering coefficient C, image 1 is mainly concentrated in 0.3-0.7 while images 2 and 3 are mainly concentrated in 0.3-0.9; and for the weight Qmin, image 1 is mainly concentrated in the part below 10 while images 2 and 3 show a more uniform distribution. The network parameters of the images thus exhibit obvious distributional variability, which provides the necessary prerequisite for image classification.

Since the pictures in these three groups differ markedly and the result could be coincidental, three images from the building and street classes of the Scene 15 dataset, building, street1 and street2, are chosen to make the evidence for the network features more convincing.

For these three images, the number of nodes N, the dispersion y and the clustering coefficient C are again obtained from the corresponding formulas, and the minimum weight Qmin is obtained by calculation. The histogram statistics of the obtained data are shown in Fig. 6. Analyzing the distributions of the number of nodes N, dispersion y, clustering coefficient C and minimum weight Qmin of the three images shows that the common network statistics still exhibit clear distributional differences.

Figure 6.

Network parameter statistics profile

In summary, the network features of the image formed by extracting commonly used statistics for the network are distinguishable, laying the foundation for image classification.

Image segmentation effect
Data sets

Cityscapes: The Cityscapes dataset is a large-scale dataset for urban scene segmentation from the perspective of a vehicle. It contains multiple stereoscopic video sequences from 50 different city street scenes with 4000 high-quality finely annotated images, divided into 2555, 400 and 1045 images for training, validation and testing respectively. The image resolution is 2048 × 1024, which poses a great challenge for real-time semantic segmentation methods. The labeled images cover 30 classes, of which 19 are used for the semantic segmentation task. To ensure a fair comparison with other methods, only these 19 classes are used in the experiments of this paper.

CamVid: The CamVid dataset is a small-scale road scene segmentation dataset, also captured from the perspective of a car. It contains a total of 621 images with high-quality pixel-level annotations extracted from video sequences, of which 314 are used for training, 89 for validation and 218 for testing. All images have the same resolution of 960 × 720. The annotated images provide ground-truth labels for 32 categories, of which a subset of 11 categories is used for the experiments in this paper.

Experimental effects

The experiments were first conducted on the Cityscapes dataset, comparing the network PreactResNet of this chapter with mainstream real-time lightweight semantic segmentation models for urban scenes: ENet, ICNet, BiSeNetV1, BiSeNetV1-L, BiSeNetV2, BiSeNetV2-L, BiSeNetV3-1 and BiSeNetV3-2. The comparison results are shown in Table 1. As mentioned in the previous subsection, the image resolution of the Cityscapes dataset is very high, so the images are cropped at different scales according to the settings of each model. γ denotes the downsampling ratio: a ratio of 0.5 corresponds to a cropped size of 512 × 1024, 0.75 corresponds to 768 × 1536, and 1.0 means the original image is used for training without downsampling. Backbone denotes the backbone network with which each model is trained. For fair comparison, separate experiments were also conducted using STDC1 and STDC2 as the backbone networks.

Comparison of experimental results on the Cityscapes dataset

Model γ Backbone MIoU-val (%) MIoU-test (%) FPS
ENet 0.4 - - 56 74.5
ICNet 0.9 PSPNet50 66.4 63.8 25.9
BiSeNetV1 0.65 Xception-39 67.9 67.1 106.9
BiSeNetV1-L 0.65 ResNet-18 71.6 69 64.7
BiSeNetV2 0.4 - 71.8 70.4 144.1
BiSeNetV2-L 0.4 - 72.7 70.3 43.4
BiSeNetV3-1 0.4 STDC1 69.6 67.9 243.8
BiSeNetV3-2 0.4 STDC2 71.4 70.3 167.3
PreactResNet1 0.4 STDC1 71.1 69.6 255.8
PreactResNet2 0.4 STDC2 74.7 73.6 184.3

The evaluation shows that the proposed method achieves a better balance between accuracy and speed than the other methods: with STDC1 as the backbone network, the segmentation accuracy reaches 69.6% at 255.8 FPS, which means the model achieves a very high inference speed. With STDC2 as the backbone network, the segmentation accuracy reaches 73.6% at 184.3 FPS, the highest accuracy among the compared models.

To further validate the performance of the proposed network, comparison experiments were also conducted on the CamVid dataset. To ensure a fair comparison with other methods, the experiments used an input resolution of 940 × 710 for training and prediction. The experimental results are shown in Table 2.

Comparison of experimental results on the CamVid dataset

Model Backbone Resolution MIoU (%) FPS
ENet - 940×710 48 55
ICNet PSPNet50 940×710 65.8 27.1
BiSeNetV1 Xception-39 940×710 62.1 177.1
BiSeNetV1-L ResNet-18 940×710 66.8 113.9
BiSeNetV2 - 940×710 70.5 122.7
BiSeNetV2-L - 940×710 70.8 41.9
BiSeNetV3-1 STDC1 940×710 70.6 196
BiSeNetV3-2 STDC2 940×710 71.2 153.4
PreactResNet STDC1 940×710 69.6 221.5
PreactResNet STDC2 940×710 75.4 143.2

The experimental results show that the method in this paper achieves an inference speed of 221.5 FPS and a segmentation accuracy of 69.6% when using STDC1 as the backbone network, while with STDC2 as the backbone the segmentation accuracy reaches up to 75.4% at an inference speed of 143.2 FPS. Taken together, the method achieves a good balance between speed and accuracy on the CamVid dataset.

Image classification effect

In this section, two datasets, CUB-200-2011 and Stanford Dogs, are selected for the experiments, and the proposed method is compared with other network models; the results are shown in Table 3. The PreactResNet model outperforms all the other methods on both the CUB and Dogs datasets.

Comparison results of model performance

Model Underlying network 1-Stage Stanford Dogs (%) CUB-200-2011 (%)
ResNet50 ResNet50 88.7 85.7
GP-256 VGG16 × 89.1 87
MaxEnt DenseNet161 89.6 87.8
DFL-CNN ResNet50 93.7 88.6
NTS-Net ResNet50 94.2 88.7
Cross-X ResNet50 × 94.9 88.9
CIN ResNet101 93.6 89.3
ACNet ResNet50 93.4 89.3
S3N ResNet50 93.1 89.7
FDL ResNet161 90.9 90.3
PMG ResNet50 3.5 90.8
FBSD ResNet161 94.1 91
API-Net ResNet161 96.3 91.2
StackedLSTM GoogleNet 3.5 91.6
CAL ResNet101 94.7 91.8
HDML GoogleNet 95.3 92.4
DCML ResNet50 95.9 92.8
ViT ViT-B_16 15.8 91.6
TransFG ViT-B_16 96.4 92.6
FFVT ViT-B_16 96.4 92.6
RAMS-Trans ViT-B_16 96.7 92.7
AFTrans ViT-B_16 6.6 92.8
PreactResNet ViT-B_16 97 93

Specifically, the fourth column of the table shows the comparison results of PreactResNet on Stanford Dogs and the fifth column shows the results on CUB-200-2011. Compared with the best published results on the CUB-200-2011 dataset, PreactResNet achieves a 0.2% improvement in the Top-1 metric, and a 1.4% improvement over the original ViT framework. On the Stanford Dogs dataset, PreactResNet achieves a 0.4% improvement in the Top-1 metric over the best published results and a 1.2% improvement over the original ViT framework. Compared with other mainstream CNNs, PreactResNet shows a substantial performance improvement on both datasets. Compared with models that use ViT as the underlying network, the PreactResNet proposed in this paper focuses more on extracting features between Transformer levels. Overall, PreactResNet outperforms the other algorithms in classification.

The loss and accuracy curves of PreactResNet on the CUB-200-2011 and Stanford Dogs datasets are shown in Fig. 7, where (a) to (d) show the loss values and accuracies on the two datasets, respectively. The orange curves indicate the trends of loss and accuracy generated by TensorBoard, and the light-colored parts represent the raw data. For better presentation, dark curves obtained by adjusting the smoothing coefficient are used to represent the accuracy and loss trends. As shown in Fig. 7(a) and Fig. 7(b), the training loss of the PreactResNet model decreases steadily on both datasets. Because pre-trained ViT-B/16 weights are used to initialize the network, the test accuracy improves rapidly in the first 2000 iterations, as shown in Fig. 7(c) and Fig. 7(d). Meanwhile, the test accuracy curve shows no decreasing trend, indicating that no overfitting occurs. It can be seen from the figure that the overall classification performance of PreactResNet is good, the probability of misclassification is low, and very good classification accuracy is achieved in several categories.

Figure 7.

Training loss and test accuracy curve

Conclusion

This study proposes a deep neural network image classification method based on lightweight segmentation convolution, which uses the number of nodes N, the degree of dispersion y, the clustering coefficient C and the network weights Qmax and Qmin as common network statistics to characterize the basic visual features of a digital image, reducing the differences between similar features.

The experimental results and visualization analysis verify the effectiveness of the lightweight semantic segmentation method proposed in this paper, which reaches a segmentation accuracy of 69.6% at an inference speed of 255.8 FPS while effectively reducing computational cost and maintaining high accuracy. Finally, experiments on two benchmark fine-grained image classification datasets, CUB-200-2011 and Stanford Dogs, show that the method achieves the best classification performance among all models that use ViT as the underlying network.
