Research on deep learning image segmentation method based on attention mechanism

Attention mechanism refers to the fact that when human beings receive external information, after certain filtering, processing and selection, they focus their attention on the important information, process and memorize it, while ignoring the irrelevant information. It is one of the bases for human beings to acquire knowledge and perceive the world, and it is also one of the important research directions in the field of artificial intelligence [1-4]. Image segmentation refers to dividing the pixels in a digitized image into different regions and assigning a label to each region, such as foreground, background, object, and so on. Currently, image segmentation techniques have been widely used in various fields, such as medical imaging, natural image processing, face recognition, intelligent transportation systems and robotics [5-8]. Traditional image segmentation methods are mainly based on features such as pixel color information, texture information and edge information, and these methods will fail in complex situations. And deep learning based image segmentation algorithms are increasingly used for their excellent performance and high accuracy [9-12].

In recent years, the application of deep learning techniques in the field of image processing has made great progress, especially in image segmentation and analysis. Deep learning-based image segmentation techniques can automatically divide a digitized image into several non-overlapping regions and assign each region with a corresponding semantic label [13-16]. It is robust and adaptable and can be used for a variety of different types of images, such as medical images and nature images. Currently, there are three main approaches for deep learning based image segmentation techniques, namely FCN, U-Net and MaskR-CNN [17-20].

This paper provides a brief overview of deep learning techniques and attention mechanism based on the image segmentation technology framework of deep learning. In order to further improve the quality and efficiency of image segmentation, the encoder-decoder idea is utilized to construct the feature fusion network (TCNet) with large convolutional kernel and attention mechanism. An image segmentation network incorporating four parts: large convolutional kernel module (LKM), enhanced transformer module (ETM), multi-scale feature fusion module (MFM) and decoder is designed. Set up the experimental environment as well as the loss function, and analyze the values of the loss function parameters. Select several datasets to compare the performance indexes of classical image segmentation algorithms and obtain the image segmentation performance of the TCNet method designed in this paper.

2

Theoretical foundations of image segmentation

2.1

Image Segmentation

Image segmentation can be divided into three main categories, semantic segmentation, instance segmentation and panoramic segmentation. As a common sense image segmentation generally means semantic segmentation, which provides not only category prediction but also spatial location information of these categories by categorizing all pixel points on a picture. Image semantic segmentation can be viewed as a pixel-level classification task where each pixel is labeled and classified into a specific category by densely predicting and inferring labels for each pixel. Instance segmentation is a combination of target detection and semantic segmentation to achieve edge segmentation of an object, resulting in a single segmentation instance. As opposed to semantic segmentation, instance segmentation is segmenting a single object and labeling it. Panoramic segmentation is a recently developed segmentation task that can be represented as a combination of semantic and instance segmentation.

2.2

Deep learning based image segmentation

2.2.1

Deep learning techniques

In recent years, deep learning technology has been rapidly developed and has been widely used in computer vision, image processing, natural language processing and other fields. Different from traditional machine learning algorithms in large-scale data and complex scenes, deep learning algorithms are able to directly use the raw data as input to the model, and utilize the multi-layer network structure in the network model to transform the input data into high-level abstract feature expressions. This enables the extraction of data features. The advantage of this algorithm is that it can realize the extraction of data features and there is no human intervention in the learning process of the model in order to improve the fitting ability of the model and improve the processing effect of the algorithm.

2.2.2

Attention mechanisms

The nature of the attention mechanism is a weighted summation operation that aims to make the model more focused on the more important parts of it [21-22]. In general, the attention mechanism consists of three main steps. The first is the calculation of attention weights, by which the attention weights of different positions are calculated in some way to express the degree of importance between them. The second is weighted summation, which weights and sums the information according to the attentional weights. The third is updating the model, where the weighted summation results are federated with the output of the original model in order to update the model. Common attention mechanisms include global attention, local attention, self-attention, etc. Different attention mechanisms are applicable to different scenarios and tasks.

The attentional mechanism works to amplify the weights of useful information and minimize the weights of useless information. The elements in the data are constructed into a series of 〈key,value〉 data pairs, and given an element Query in the target, the weight coefficients corresponding to value for each key are obtained by calculating the similarity between Query and each key, and then the value is weighted and summed up, which results in the final Attention value. So essentially the Attention mechanism is a weighted summation of the value values of the elements in the data, while Query and key are used to calculate the weight coefficients of the corresponding value s. Equation (1) expresses the essential idea of the attention mechanism:

(1)

A t t e n t i o n (Q u e r y, S o u r c e) = \sum_{i = 1}^{L_{x}} S i m i l a r i t y (Q u e r y, K e y_{i}) * V a l u e_{i}

\[Attention(Query,Source)=\underset{i=1}{\overset{{{L}_{x}}}{\mathop \sum }}\,Similarity(Query,Ke{{y}_{i}})*Valu{{e}_{i}}\]

where L_x = ||Source|| represents the length of Source. Conceptually, Attention is understood as selectively filtering out a small amount of important information from a large amount of information and focusing on these important information, ignoring mostly unimportant information. The process of focusing is reflected in the calculation of weight coefficients, the larger the weight, the more focused on its corresponding Value value, i.e., the larger the weight value, the more important the signal total.

Self-attentive mechanisms are a recent scientific advance in obtaining long-range interactivity, but are still mainly used only for sequence modeling and generative modeling tasks. The key idea behind the self-attention mechanism is to obtain a weighted average of the values computed by the hidden units. Unlike pooling or convolution operators, the weights used in the weighted average operation are obtained dynamically through a similarity function between the hidden units. As a result, the interaction between the input signals depends on the signals themselves, rather than being predetermined by their relative positions. It is worth mentioning that this allows the self-attentive mechanism to obtain long-term interactivity without increasing the parameters.

The self-attention mechanism is based on an improvement of the attention mechanism, which is better at capturing the internal relevance of the data or features. A larger sense field and more contextual information is obtained by capturing the global information. The self-attention mechanism provides a modeling way to efficiently capture global contextual information through the triad of (key,query,value). In the self-attention mechanism, Query, Key, and Value are the same, i.e., Query and Key represent the same object, e.g., the same sequence or the same network layer output. A self-attention modeling formula is shown below, and its computational flow can be described as: (2) $H = V \cdot S o f t \max (\frac{K^{T} Q}{\sqrt{d}})$ \[H=V\cdot Soft\max (\frac{{{K}^{T}}Q}{\sqrt{d}})\]

where H represents the attention value matrix, Q, K, and V correspond to the Query matrix, Key matrix, and Value matrix, respectively, and the similarity or correlation measure used in Eq. is the scaled dot product.

3

Image segmentation algorithms incorporating attention mechanisms

In this study, we construct a large convolutional kernel and attention mechanism feature fusion network (TCNet) based on the idea of encoder-decoder, which consists of four parts: a large convolutional kernel module (LKM), an augmented Transformer module (ETM), a multiscale feature fusion module (MFM), and a decoder.

3.1

Large Convolutional Kernel Module

The receptive field is a key factor that affects the network to extract the complete shape information of the image or region of interest, in order to expand the effective receptive field (ERF) of the network more effectively, this study has done an in-depth exploration on the extra-large convolutional kernel. Equation $E R F = λ K \sqrt{L}$ $ERF=\lambda K\sqrt{L}$ shows that the ERF is linearly correlated with the size of convolution kernel K and quadratically linearly correlated with the depth of the network L , and λ is the scaling factor, so the effective receptive field of the network can be enlarged by increasing the size of the convolution kernel. Based on this fact, this study designs the Large Convolutional Kernel Module (LKM), which uses an oversized convolutional kernel as the core of the module, combined with deep convolution, to effectively expand the receptive field and capture more complete shape information.

The large convolutional kernel module, i.e., LKM, contains two BN-Conv sequences, two 1×1 dense convolutional layers and one 31×31DW supersized convolutional layer. Specifically, the LKM takes as input the feature X_pre output from the previous convolutional block and models the primary features first. The formula can be expressed as: (3) $x_{n o r m}^{(i)} = \frac{x^{(i)} - μ}{\sqrt{σ^{2} + ε}}$ \[x_{norm}^{(i)}=\frac{{{x}^{(i)}}-\mu }{\sqrt{{{\sigma }^{2}}+\varepsilon }}\] (4) ${\tilde{x}}^{(i)} = γ \times x_{n o r m}^{(i)} + β$ $${\tilde \chi ^{(i)}} = \gamma \times x_{norm}^{(i)} + \beta $$

In Eq. (3), μ represents the mean of the feature map X_pre, σ² represents the variance of X_pre, and ε is a parameter that prevents the denominator from becoming zero. In Eq. (4), γ is a trainable parameter and β is a trainable bias.

After that, the feature map X_norm is fed into the 1×1 set convolutional layer to realize the information exchange between the channels and output the feature map ${\tilde{X}}_{n o r m}$ \[{{\tilde{X}}_{norm}}\]. subsequently, ${\tilde{X}}_{n o r m}$ \[{{\tilde{X}}_{norm}}\] passes through the 31×31DW extra-large convolutional layer and the 1×1 dense convolutional layer, and the output feature map is connected to X_pre after hopping to generate the feature map X_mid: (5) $X_{m i d} = C o n v (D W ({\tilde{X}}_{n o r m})) \oplus X_{p r e}$ $${X_{mid}} = Conv\left( {DW\left( {{{\tilde X}_{norm}}} \right)} \right) \oplus {X_{pre}}$$

In Eq. (5), Conv denotes the 1×1 dense convolution operation, DW is the very large convolutional layer, and the symbol ⊕ denotes the channel splicing operation.

Subsequently feature map X_mid produces feature map ${\tilde{X}}_{m i d}$ ${\tilde X_{mid}}$ after passing through the same BN-Conv sequence, LKM uses GELU as the activation function of the network and this process can be expressed as Eq: (6) $G E L U ({\bar{X}}_{m i d}) = x P (X \leq x) = x φ (x)$ \[GELU({{\bar{X}}_{mid}})=xP(X\le x)=x\varphi (x)\]

where φ(x) is a normally distributed probability function that can be simply denoted by N(0,1).

The activation function processed feature maps are fed into the last 1×1 dense convolutional layer to produce the feature map ${\tilde{X}}_{p o s t}$ ${\tilde X_{post}}$, and finally, ${\tilde{X}}_{p o s t}$ ${\tilde X_{post}}$ and X_mid are channel spliced to output the final feature map ${\tilde{X}}_{p o s t}$ ${\tilde X_{post}}$: (7) $X_{p o s t} = {\tilde{X}}_{p o s t} \oplus X_{m i d}$ $${X_{post}} = {\tilde X_{post}} \oplus {X_{mid}}$$

3.2

Enhancements to the Transformer module

Inspired by the outstanding performance of the visual Transformer in modeling distant pixel relationships, the Enhanced Transformer Module (ETM) is designed in this paper. It uses a DW mega-convolution kernel as the core component to perform convolution operations, which, combined with a multi-head self-attention mechanism, can further address the problem of incomplete labeling in images due to sensory field constraints.

The ETM contains a similar structure for the upper and lower parts, each of which consists of three layer normalization (LN) layers, one multinomial self-attention layer, one 31×31DW mega-convolutional layer, and the GELU activation function, with the difference that the upper part of the multinomial self-attention layer employs windowed multinomial self-attention blocks (W-MSAs), whereas the lower part employs sliding-windowed multinomial self-attention blocks (SW-MSAs).

3.3

Multi-scale fusion module

In order to effectively fuse the multi-scale feature maps produced by the large convolutional kernel CNN branch and the augmented Transformer branch, the multi-scale fusion module (MFM) is designed in this study. The feature maps generated by the large convolutional kernel CNN branch contain complete shape information, while the feature maps output from the augmented Transformer branch are rich in remote pixel dependencies. Therefore, the MFM, which combines the multi-scale fusion mechanism and the multi-head self-attention mechanism, can be used as a bridge to effectively fuse the two feature maps.

The MFM processes the feature maps to be fused in the upper, middle and lower branches. For the middle branch, the output feature maps E from the augmented Transformer branch and L from the large convolutional kernel CNN branch are firstly flattened into a long sequence, which is copied and resized into feature marker blocks of special size (1, 1, C). Then, the two feature marker blocks are crosswise spliced to the two ends of the two long sequences, and after the Transformer computes the respective attention maps, the fusion feature map $\hat{T}$ $\hat{T}$ is produced by splicing the two parts of the attention maps and performing the convolution operation: (8) $\hat{T} = A t t e n t i o n (L N (E^{i}, L^{i}))$ \[\hat{T}=Attention(LN({{E}^{i}},{{L}^{i}}))\]

The two methods, average pooling and maximum pooling, generate two different spatial context descriptors, E_avg and E_max . The two descriptors are then forwarded to the shared network to generate the channel attention graph. The shared network consists of a multilayer perceptron (MLP) and a hidden layer. After the shared network is applied to each descriptor, the output feature vectors are finally merged using element summation to output the feature map $\hat{E}$ $\hat{E}$:

(9)

\begin{matrix} E = σ (M L P (A v g P o o l (E^{i})) + M L P (M a x P o o l (E^{i}))) \\ = σ (W_{1} (W_{0} (E_{a v g})) + W_{1} (W_{0} (E_{\max})) \end{matrix}

\[\begin{matrix} E=\sigma (MLP(AvgPool({{E}^{i}}))+MLP(MaxPool({{E}^{i}}))) \\ =\sigma ({{W}_{1}}({{W}_{0}}({{E}_{avg}}))+{{W}_{1}}({{W}_{0}}({{E}_{\max }})) \\ \end{matrix}\]

where σ denotes the Sigmoid function and W₁ and W₀ are the two shared weights of the shared network.

For the downward branching, the maximum pooling and convolution operations are first performed on the feature map L to increase the feature channels. Then the feature map is fed into the spatial attention block to generate a spatial attention map using the spatial relationships of the features. Finally, the feature map $\hat{L}$ $\hat{L}$ is generated: (10) $\begin{matrix} \hat{L} = σ (C o n v ([A v g P o o l (L^{i}); M a x P o o l (L^{i})])) \\ = σ (C o n v (L_{a v g}; L_{\max})) \end{matrix}$ \[\begin{matrix} \hat{L}=\sigma (Conv([AvgPool({{L}^{i}});MaxPool({{L}^{i}})])) \\ =\sigma (Conv({{L}_{avg}};{{L}_{\max }})) \\ \end{matrix}\]

Eventually, the three types of feature maps $\hat{E}, \hat{L}, \hat{T}$ $\hat{E},\hat{L},\hat{T}$ are feature spliced to generate the final fused feature map. The formula can be expressed as: (11) $O u t p u t = C o n c a t (\hat{E}, \hat{L}, \hat{T})$ \[Output=Concat(\hat{E},\hat{L},\hat{T})\]

4

Experimentation and analysis of image segmentation techniques

In this paper, the proposed deep learning image segmentation technique with the addition of an attention mechanism is applied to the field of medical images to test the feasibility and applicability of the technique.

4.1

Experimental environment and parameter settings

4.1.1

Experimental environment

The hardware and software environments used in the experiments are shown in Table 1.

Table 1.

The hardware and software environment of the experiment

Central processor	Intel(R)Core i7-8600K CPU@3.60 GHz
Graphics card	Nvidia TITAN Xp 24GB
Memory	64GB
Operating system	Windows 10
Python	3.9.15
CUDA	11.6
torch	1.13.0+cu116
torchvision	0.14.0+cu116
Simulation platform	PyCharm

4.1.2

Experimental parameters

In order to improve the computational efficiency, all the images are resized to 448×448, in addition to vertical flip, horizontal flip, diagonal flip, random shift and random scale change data enhancement operations are done.

In this paper, the optimal training parameters are obtained based on multiple training. This includes epoch set to 120, training batch set to 15, and model training optimized using Adam optimizer. Since too large a learning rate in the late stage of training will lead to instability of the model, while too small a learning rate will lead to a decrease in learning efficiency. Therefore, in this paper, a cosine annealing strategy is used to adjust the learning rate of the network. It essentially adjusts the learning rate curve to a cosine function model that decreases with the number of training sessions.

(12)

l r = \frac{1}{2} l r_{b e g i n} (1 + \cos (\frac{T_{n o w}}{T_{t o t a t}} π))

\[lr=\frac{1}{2}l{{r}_{begin}}(1+\cos (\frac{{{T}_{now}}}{{{T}_{totat}}}\pi ))\]

where lr_begin is the initial learning rate setting, which is set to 0.001 in this paper, T_brat is the total number of training rounds, and T_now is the current number of training rounds.

4.1.3

Evaluation indicators

In order to fully evaluate the effectiveness of the model, five parameters, accuracy (AC), sensitivity (SE), specificity (SP), Jaccard’s coefficient (JA), and Dice’s coefficient (DI), are used in this paper for comparison. Of these, accuracy is the most intuitive metric, indicating the percentage of correctly classified pixels over all pixels. Sensitivity (also known as recall) indicates the percentage of recognized correct samples to the actual correct samples, through which you can view the segmentation results of positive samples. Specificity, on the other hand, indicates the percentage of negative samples identified versus the overall negative samples. Jaccard’s coefficient and Dice’s coefficient are used to compare the difference between the model segmented samples and the actual labeled samples, with higher scores indicating that the results of the two are more similar.

It is worth noting that in the ablation experiments, all five metrics are used in this paper. However, in the comparison experiments, only AC, JA, and DI are compared in this paper because some of the models compared do not give the used parameters used in this paper. The expressions for the five metrics are as follows: (13) $A C = \frac{T P + T N}{T P + T N + F P + F N}$ \[AC=\frac{TP+TN}{TP+TN+FP+FN}\] (14) $S E = \frac{T P}{T P + F N}$ \[SE=\frac{TP}{TP+FN}\] (15) $S P = \frac{T N}{T N + F P}$ \[SP=\frac{TN}{TN+FP}\] (16) $J A = \frac{T P}{T P + F N + F P}$ \[JA=\frac{TP}{TP+FN+FP}\] (17) $D I = \frac{2 \times T P}{2 \times T P + F N + F P}$ \[DI=\frac{2\times TP}{2\times TP+FN+FP}\]

where TP, FP, TN and FN represent true cases, false positive cases, true negative cases, and false negative cases, respectively. They represent the cases in which the data of the true example is judged to be true, the data of the false example is judged to be true, the data of the false example is judged to be false, and the data of the true example is judged to be false, respectively.

4.1.4

Loss function

The cross-entropy loss function is generally used as the loss function used for training in usual image segmentation networks, which has the advantage of being simple and easy to understand, and is suitable for the case of sample averaging. However, for datasets with positive and negative sample inhomogeneity, it may lead to poor training of the network. A similar problem exists in the dermoscopy dataset, both in the ISIC-2017 dataset as well as in the PH2 dataset, where the percentage of category regions is different. Statistically, more than 75% of the pixels in the training set of the ISIC-2017 dataset are labeled as background. Therefore, the Dice loss function is introduced in some segmentation networks, which can effectively solve the data inhomogeneity problem. However, the limitation of the Dice loss function is that it will adversely affect the backpropagation during training. Therefore, in order to balance the issues of sample uniformity and training stability, the weighted cross-entropy loss function and Dice loss function are used in this paper. The formulas are as follows: (18) $L_{t o t a l} = α \cdot L_{B C E} + L_{D i c e}$ \[{{L}_{total}}=\alpha \cdot {{L}_{BCE}}+{{L}_{Dice}}\] (19) $L_{B C E} = - \frac{1}{n} \sum_{i = 1}^{n} [{\hat{x}}_{i} \log x_{i} + (1 - {\hat{x}}_{i}) \log (1 - x_{i})]$ \[{{L}_{BCE}}=-\frac{1}{n}\underset{i=1}{\overset{n}{\mathop \sum }}\,[{{\hat{x}}_{i}}\log {{x}_{i}}+(1-{{\hat{x}}_{i}})\log (1-{{x}_{i}})]\] (20) $L_{D i c e} = 1 - \frac{2 \sum_{i = 1}^{n} x_{i} {\hat{x}}_{i}}{\sum_{i = 1}^{n} (x_{i} + {\hat{x}}_{i}) + ε}$ \[{{L}_{Dice}}=1-\frac{2\underset{i=1}{\overset{n}{\mathop \sum }}\,{{x}_{i}}{{{\hat{x}}}_{i}}}{\underset{i=1}{\overset{n}{\mathop \sum }}\,({{x}_{i}}+{{{\hat{x}}}_{i}})+\varepsilon }\]

The results of the loss function parameter comparison experiment are shown in Figure 1. Comparing the values of the loss function parameter α, the overall segmentation of the network is best when the parameter α takes the value of 0.5.

4.2

Data sources

In this paper, the proposed TCNet model is evaluated using public image datasets published by ISIC, which are ISIC 2017 dataset, PH2 dataset and ISIC 2018 dataset.

1)

ISIC 2017 dataset and PH2 dataset

The ISIC 2017 dataset contains 800 skin lesion images for training and 350 skin lesion images for validation.

The PH2 dataset contains a total of 220 skin lesion images. In this paper, the ISIC 2017 training set is used to train the model and the PH2 dataset is used to test the model.

2)

ISIC 2018 dataset

The ISIC 2018 dataset contains 2500 skin lesion images for training. Since the public test set has not been released yet, this paper validates the model performance using cross-validation for a fair comparison.

In this paper, the format, label type, and data division of these three datasets are described in the table. The dataset information is shown in Table 2.

Table 2.

Data set information

Name	Format	Label type	Training	Data partitioning validation	Testing
ISIC 2017			800	350	0
PH2	jpg	Pixel level	0	0	220
ISIC 2018			1865	635	0

4.3

Attention Mechanism Accession Strategy

The attention mechanism is widely used as an important network optimization module in medical image segmentation tasks, which can improve the network’s ability to learn features in regions of interest. In this paper, we investigate the effect of adding the attention module at different locations in the TCNet network in order to further improve the performance of the network.

Hybrid Attention Module is an attention mechanism module that contains both channel attention and spatial attention. The channel attention mechanism is used to adjust the weights between different channels in the feature map to enhance the weights of useful features. The spatial attention mechanism is used to adjust the weight of each pixel in the feature map to suppress features that are not related to the target region. The hybrid attention module can effectively improve the feature learning ability of the network and the sensitivity of the pixels in the region of interest, thus improving the segmentation performance of the network. In the experiments, the hybrid attention module is added to different positions of the TCNet network, including each basic convolutional unit of the encoder and decoder, cross-layer connection, and up-sampling module, respectively. The experimental results show that the segmentation performance of the network is all improved to some extent after adding the attention module.

This chapter further analyzes the effect of adding the hybrid attention module to the network for segmenting different classes of pixels. The training curves for different joining strategies are shown in Fig. 2. When adding the three attention mechanisms of all-convolution Attention, input-single-output Attention, and input-four-output Attention respectively, it can be found that there are large differences in the fluctuations of the training curves of the three attention mechanisms. The TCNet curve with input four output Attention has a loss value of 0.5 after 1000 times of training, and after 2500 times of training, the training curve tends to flatten out, but the pixel segmentation effect of the overall training curve is more general.

The loss curves of different joining strategies on the test set are shown in Figure 3. After experimental comparison, it can be clearly seen that different strategies for joining the attention mechanism have a huge impact on the performance of the algorithm. From the comparison results in the figure, all the convolutional units are added to the attention mechanism with the best effect. In addition, it is also found in the experimental process that canceling the deep supervision mechanism of TCNet will improve the segmentation performance of the algorithm to a certain extent, and the reason is analyzed to be related to the characteristics of the medical image dataset.

4.4

Segmentation results

In order to verify the segmentation performance of the TCNet model, this paper analyzes the model in comparison with advanced segmentation methods. These methods include FCN, UNet, ISL-MSCA, SSLS, mFCN-PI, SBPSF and UNet3+.

Segmentation results obtained by TCNet model and other methods trained on ISIC 2017 dataset and tested on PH2 dataset. The segmentation performance of each model on the PH2 test set is shown in Fig. 4. The TCNet model outperforms the other compared methods in all three metrics, JA, pixel-AC, and DI. Compared with the UNet method, the TCNet model is 2.475% and 3.96% higher in both JA and DI metrics. And compared with SSLS method, TCNet model has significantly improved in pixel-AC index, and the difference between the two in pixel-AC index reaches 11.153%.

The segmentation results obtained after cross-validation of the model and other methods on the dataset ISIC 2018 are shown in Figure 5.

The TCNet model was experimentally compared with several state-of-the-art segmentation networks on the ISIC 2018 dataset, which include UNet++, Deeplabv3+, CE-Net, Deeplabv3, UTNet, MedT, and CA-Net.

UTNet, MedT and CA-Net and TCNet models all use the attention mechanism, the difference is that the TCNet model achieves the extraction of global contextual information and local detail information of the image by using multiple attention mechanisms. The TCNet model proposed in this paper achieves 90.334% JA, which realizes a performance improvement of 7.64% compared to the UNet++ method. On the remaining two indexes, the pixel-AC method and the DI method outperform the other methods, with performance indices of 93.457% and 91.773%, respectively, indicating that the model has a better segmentation performance.

4.5

Ablation experiments

In order to explore and analyze the effectiveness of the TCNet model in segmentation tasks, this chapter conducts ablation experiments on the Large Convolutional Kernel Module (LKM), the Enhanced Transformer Module (ETM), and the Multiscale Feature Fusion Module (MFM) in the decoder. Ablation experiments were performed on two classical datasets.

The DRIVE data were obtained from the Diabetic Retina Screening Program in the Netherlands. It is a classical dataset for evaluating retinal vessel segmentation methods.

The CHASEDB dataset contains 28 retinal images captured from the right and left eyes of 14 school-aged children.

The results of the ablation study on DRIVE and CHASEDB datasets are shown in Table 3. After adding Enhanced Transformer Module (ETM) and Multi-scale Feature Fusion Module (MFM) module to the Large Convolutional Kernel Module (LKM), this image segmentation framework improves the Acc, Sp, and AUC performance indices on the DRIVE dataset. Also on CHASEDB dataset, the increase in Acc, Sen, Sp, and AUC after the addition of LKM module, ETM module, and MFM module in the decoder indicates that the method successfully captures more contextual and advanced information.

Table 3.

Ablation research results in DRIVE and the CHASEDB data set

LKM	ETM	MFM	DRIVE				CHASEDB
			Acc	Sen	Sp	AUC	Acc	Sen	Sp	AUC
✓			0.925	0.836	0.972	0.966	0.958	0.875	0.975	0.982
✓	✓		0.964	0.841	0.966	0.973	0.944	0.869	0.963	0.991
✓		✓	0.952	0.869	0.979	0.987	0.957	0.873	0.967	0.985
✓	✓	✓	0.973	0.824	0.983	0.989	0.970	0.889	0.986	0.992

5

Conclusion

This paper explores the processing performance of an image segmentation algorithm that incorporates an attention mechanism in a deep learning framework. A large convolutional kernel and attention mechanism feature fusion network (TCNet) is constructed, and extensive test experiments of this method are conducted on multiple datasets, and multiple metrics verify the applicability of the method.

In the loss function test of the algorithm, five parameters, accuracy (AC), sensitivity (SE), specificity (SP), Jaccard’s coefficient (JA), and Dice’s coefficient (DI), are used for comparison. Combining the data in the figure, the TCNet method obtains the best performance when the loss function is 0.5. Substituting different hybrid attention mechanisms, the best results are obtained when all convolutional units are added to the attention mechanism. In the classical dataset of medical images, the method TCNet in this paper achieves better performance. And the results of the ablation study on DRIVE and CHASEDB datasets show that the LKM module, ETM module, and MFM module designed in the encoder of this paper can effectively improve the Acc, Sen, Sp, and AUC index values of the image segmentation model. It shows the reasonableness of constructing the large convolutional kernel and attention mechanism feature fusion network in this paper. Compared with the traditional image segmentation algorithm, this method has better performance in segmentation effect and stronger generalization ability.

Idioma:: Inglés

Calendario de la edición:: 1 veces al año
Temas de la revista:: Ciencias de la vida, Ciencias de la vida, otros, Matemáticas, Matemáticas aplicadas, Matemáticas generales, Física, Física, otros

RSS Feed de revista

Research on deep learning image segmentation method based on attention mechanism

Haibo Li

Publicado en línea: 17 mar 2025

Recibido: 24 oct 2024

Aceptado: 12 feb 2025

DOI: https://doi.org/10.2478/amns-2025-0210

Palabras claveAttentional mechanism, Deep convolutional neural network, Multi-scale fusion, Image segmentation

© 2025 Haibo Li, published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.

Palabras clave
Attentional mechanism, Deep convolutional neural network, Multi-scale fusion, Image segmentation