Multi-task learning based feature extraction method in signal processing of high resolution remote sensing video images
Published Online: Mar 21, 2025
Received: Oct 28, 2024
Accepted: Feb 06, 2025
DOI: https://doi.org/10.2478/amns-2025-0695
© 2025 Xinming Fan, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Remote sensing (RS) refers to detection technology that perceives targets or natural phenomena over long distances without direct contact. In the narrow sense, it refers to the use of various sensors (e.g., photogrammetric cameras, scanners, and radars) mounted on high-altitude and space platforms to obtain information on the Earth's surface and, by means of data transmission and processing, to study the shapes, sizes, locations, and properties of ground objects as well as their interrelationships with the environment; it is a modern technical science [1-4]. Remote sensing is usually categorized into spaceborne remote sensing and airborne remote sensing according to the platform carrying the sensors. Video and images are the information most commonly available from remote sensing systems. In order to process and analyze this information to explore the parts of interest, well-designed image signal processing methods are a necessary technical guarantee [5-8].
With the rapid development of modern science and technology, geographic information systems (GIS) play an increasingly important role in the national economy, while remote sensing has its own unique advantages in information acquisition; the combination of the two is bound to have broad prospects in both military and civilian fields [9-12]. Over the past decades, many feature extraction methods have been proposed, the most classic being principal component analysis (PCA) and linear discriminant analysis (LDA). However, in practice the number of training samples is sometimes very small, in which case most traditional pattern recognition methods are not very effective; it has been found that multi-task learning methods can help solve this problem [13-17]. This paper introduces the idea of multi-task learning into feature extraction, using relevant and useful information from other databases to help extract better features, thus improving the efficiency of signal processing for high-resolution remote sensing video images [18-22].
This article introduces the principles of multi-task learning and applies the multi-task learning mechanism to feature extraction in deep learning models: a multi-class feature extraction and recognition problem is transformed into multiple binary feature extraction and recognition problems, an edge extraction task is additionally introduced, and the loss function of edge extraction is optimized to improve overall accuracy in edge regions, establishing a multi-task, multi-decoder triple attention model. Taking the Vaihingen, WHDLD, and DLRSD datasets as experimental objects, multiple semantic segmentation models are selected as comparison methods; the fitting effect of the feature extraction methods is explored through the loss curves of model training, and the accuracy, memory usage, and frame rate of the different models are compared. On the three datasets, different feature extraction methods are then applied to extract multiple categories of objects, exploring the extraction accuracy of the MD TANet model for specific ground features and verifying the generalization ability of the model.
Inspired by the fact that humans often transfer experience from similar tasks when learning new knowledge, multi-task learning is a machine learning method that obtains additional information from multiple related tasks to influence the outcome of the main task. It aims to improve prediction efficiency and accuracy through inductive transfer and shared representations, and has been widely used in natural language processing, machine vision, and speech recognition.
According to how hidden-layer parameters are shared during CNN training, multi-task learning is categorized into soft parameter sharing and hard parameter sharing; the two sharing methods are shown in Fig. 1. Soft parameter sharing builds independent models with similar structure for different tasks: the parameters of the different models are independent of each other, and parameter constraints and task associations are realized through regularization. When task similarity is low, soft parameter sharing can instead improve the diversity of the extracted features. Hard parameter sharing shares the underlying parameters among different tasks and then designs different model outputs to meet the requirements of each task. Compared with the looser constraints of soft parameter sharing, hard parameter sharing usually lets different tasks use some of the same parameters directly, so it requires a higher degree of task correlation or similarity, and the underlying features of each task should be similar. Common hard-parameter-sharing multi-task neural networks are mostly built on an encoder-decoder structure: the models share an encoder and design a corresponding decoder for each task. The multi-decoder triple attention model designed in this paper adopts this common encoder-decoder structure, performing feature extraction and decoding respectively.

Figure 1. The parameter sharing methods of multi-task learning
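To make the hard-parameter-sharing structure concrete, the following is a minimal PyTorch sketch, not the paper's architecture: the layer sizes and number of task heads are illustrative assumptions. A single encoder receives gradients from every task, while each task keeps its own head.

```python
import torch
import torch.nn as nn

class HardSharingNet(nn.Module):
    """Minimal hard parameter sharing: one shared encoder, one head per task."""
    def __init__(self, in_channels=3, num_tasks=2):
        super().__init__()
        # Shared bottom layers: all tasks backpropagate into these weights.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Task-specific heads: these parameters are not shared across tasks.
        self.heads = nn.ModuleList(
            [nn.Conv2d(64, 1, 1) for _ in range(num_tasks)]
        )

    def forward(self, x):
        shared = self.encoder(x)                       # common representation
        return [head(shared) for head in self.heads]   # one output per task

outputs = HardSharingNet()(torch.randn(2, 3, 256, 256))
```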
The biggest advantage of multi-task learning is that, compared with a single task, it helps reduce the generalization error of the model and improves overall performance. The reason is that, when multiple tasks are trained simultaneously, the model's representation is implicitly regularized and the effective sample size of the data is implicitly enlarged relative to a single task, both of which reduce the risk of overfitting the original model. Because the tasks jointly learn to suppress the noise specific to each task, joint training helps improve overall accuracy. A single task is prone to falling into a local optimum, whereas multiple related but different tasks tend to have different locally optimal solutions; when the gradients of all tasks are backpropagated at the same time, joint learning makes it more likely that the gradient approaches the globally optimal solution.
Current mainstream multi-task deep learning models are dominated by the hard parameter sharing method, and PAD-Net, a hard-parameter-sharing multi-task model, further optimizes the multi-task model on the basis of the attention mechanism. In view of this, this paper combines the attention mechanism to construct a multi-task learning model. For high-resolution remote sensing video image signal processing, the triple attention model effectively alleviates the problem that insufficient global features make it difficult to extract large-scale targets, but misjudgment of similar ground features remains; the multi-task learning mechanism is therefore used to improve the extraction accuracy of remote sensing image features in the deep learning model.
The attention mechanism (AM) is a technique that has gained widespread use in deep learning, especially in natural language processing (NLP) and computer vision (CV). It draws on the principles of human visual attention, i.e., the ability to focus on the key parts that matter most for the task at hand when confronted with a large amount of information. When viewing a new image, people's eyes tend to be drawn to areas of interest or obvious color contrast, and while extracting this key information they temporarily ignore other details of the image. Introducing the attention mechanism into deep learning models helps them identify the most informative parts of the data being processed. Attention mechanisms can be divided into channel attention mechanisms and spatial attention mechanisms.
The core idea of the channel attention mechanism is that not all channels are equally important in a given feature response (i.e., the feature map or feature representation generated by a convolutional layer).
A pooling operation, typically global average pooling and global max pooling, is first performed on each channel to obtain that channel's global statistical features. This operation reduces each 2D feature map to a single scalar, which provides the basis for subsequently learning the channel weights. Next, a small multilayer perceptron (typically containing one or two hidden layers) learns the weight of each channel from the pooled features; the number of hidden layers and their dimensions depend on the specific application. Finally, a Sigmoid activation function normalizes these weights so that each lies between 0 and 1. The weights are then used to reweight the original feature channels, improving the network's sensitivity to important features. Based on the above description, channel attention can be computed as:
$$M_c(F)=\sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F))+\mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$

where $F$ is the input feature map, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ denote the global pooling operations, $\mathrm{MLP}$ is the shared multilayer perceptron, and $\sigma$ is the Sigmoid function.
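A minimal PyTorch sketch of this computation follows; the reduction ratio of 16 and the single hidden layer are common illustrative choices, not values specified by the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP with one hidden layer, applied to both pooled vectors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))           # global average pooling -> (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))            # global max pooling -> (B, C)
        w = self.sigmoid(avg + mx).view(b, c, 1, 1)  # per-channel weights in (0, 1)
        return x * w                                 # reweight the original channels
```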
The spatial attention mechanism is an algorithm used to enhance the attention of a neural network to features at specific spatial locations in the input data. In visual tasks, this mechanism recognizes that pixels at different locations in an image contain varying degrees of informativeness, and that certain regions may be more critical to the completion of a particular task. The introduction of spatial attention allows the model to dynamically adjust the focus of its processing, similar to how the human visual system prioritizes those parts of a scene that are meaningful when observing it. In standard CNNs, the convolutional operations at each layer give equal importance to each spatial location of the input feature map. However, in numerous visual tasks, different spatial locations contribute differently to the final output decision. Spatial attention mechanisms are proposed to enable the model to automatically identify and focus on the most useful regions, thereby improving performance. A typical spatial attention module usually takes the following steps:
First, global average pooling and max pooling are performed along the channel dimension, and the two resulting maps are concatenated to obtain a feature map of dimension 2×H×W. A convolutional layer is then applied to this concatenated map to produce a single-channel response, and a Sigmoid activation normalizes it into a 1×H×W spatial attention map whose values lie between 0 and 1; this map is used to weight each spatial location of the original features.
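A matching sketch of such a spatial attention module is shown below; the 7×7 convolution kernel is a common choice, assumed here rather than taken from the paper.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # 2 input channels: channel-wise average map and max map, concatenated.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)          # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)           # (B, 1, H, W)
        attn = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)
        return x * attn                            # emphasize informative locations
```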
Aiming at the multi-category feature extraction problem in high-resolution remote sensing video images, this paper constructs a multi-task-based multi-decoder triple attention model (MD TANet), which transforms a multi-class semantic segmentation problem into multiple binary semantic segmentation problems. A separate decoder, composed of multiple attention modules, is constructed for each category; each decoder attends only to its corresponding category, without considering the features of other categories, thus reducing inter-category competition.
In order to strengthen the connections within the features, two self-attention modules from DANet are introduced.
In the positional attention module, connections are established among the input local features along the positional dimension. The features are first transformed into the forms Q, K, and V by convolution; to model positional relationships, Q, K, and V are reshaped so that the matrix product of Q and K yields an (H×W)×(H×W) attention weight map, where H and W denote the height and width of the feature map. After Softmax normalization, this map weights V, and the result is scaled and summed with the original input to produce the output.
The channel attention module swaps the order of the dot product used in the position attention module, so the generated attention weight map has shape C×C, where C is the number of channels; it models the interdependencies between channels in the same way.
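A minimal sketch of such a position attention module in PyTorch follows; the C/8 reduction for Q and K is a common DANet choice, assumed here. The channel attention variant simply transposes the flattened features so that the affinity matrix is C×C instead.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """DANet-style position attention: an (HW x HW) affinity over spatial locations."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable residual scale

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)    # (B, HW, C/8)
        k = self.k(x).flatten(2)                    # (B, C/8, HW)
        attn = torch.softmax(q @ k, dim=-1)         # (B, HW, HW) position affinities
        v = self.v(x).flatten(2)                    # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                 # residual connection
```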
Combined with the attention mechanism, the structure of the label attention module is shown in Fig. 2. Because the number of channels of the attention probability map increases, the map is no longer obtained by the dot product of Q and K but by a direct convolution of the input features. Similarly, the attention output is convolved with the original input and then summed to obtain the final output. Equations (7) and (8) describe the overall structure of the label attention module.

Figure 2. The structure of the label attention module
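As one possible reading of this description, a hedged sketch is given below; the convolution sizes are assumptions, and the paper's exact Eqs. (7) and (8) are not reproduced here. The key points it illustrates are that the attention map comes from a convolution of the input rather than a Q-K product, and that it is also returned so an auxiliary attention loss can supervise it against the label map.

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    """Sketch of the label attention idea (illustrative, not the paper's exact module)."""
    def __init__(self, channels, num_maps=1):
        super().__init__()
        self.attn_conv = nn.Conv2d(channels, num_maps, 1)       # probability map(s)
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                          # x: (B, C, H, W)
        attn = torch.sigmoid(self.attn_conv(x))    # (B, num_maps, H, W)
        out = self.out_conv(x * attn) + x          # weight, convolve, residual sum
        return out, attn                           # attn also feeds the attention loss
```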
Because an additional loss function is introduced in the label attention module, the way parameters are updated during backpropagation changes: in the label attention module, the parameters are constrained by two loss functions at the same time, the segmentation loss function and the attention loss function, which jointly optimize the parameters, as shown in Eqs. (9) and (10).
The model employs an encoder-decoder architecture; the encoder is essentially the same VGG-style architecture adopted by UNet, and each decoder contains three attention modules: two self-attention modules and one label attention module, as shown in Equations (11) and (12).
The dataset contains a total of $n$ categories, and a separate decoder is constructed for each category.
In the label attention module, since each decoder corresponds to only one category, the number of attention channels of label attention changes accordingly, from the original multi-channel map to a single-channel probability map. After each decoder produces the predicted probability of its corresponding category, the probability maps of all categories are combined directly, and at each pixel the category with the highest probability is selected as the final prediction. Since the label attention model has already hinted at the feature extraction region in the attention module, which is equivalent to a layer of weak constraints, no further fusion of the overall results is needed, avoiding the loss of feature information. As for the loss function, because the task is split, the loss function of each category is calculated separately, as shown in Eqs. (13), (14) and (15).
The final overall loss function is the sum of all the attention loss functions and the semantic segmentation loss function.
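A sketch of how the per-decoder losses and probability maps might be combined under these definitions is shown below; the binary cross-entropy losses and unit weighting are assumptions, and the paper's exact Eqs. (13)-(15) are not reproduced.

```python
import torch
import torch.nn.functional as F

def multi_decoder_loss(cat_logits, cat_attns, labels, lam=1.0):
    """Sum one binary segmentation loss per category-specific decoder,
    plus the attention losses (weighting lam is an assumption)."""
    total = 0.0
    for c, (logit, attn) in enumerate(zip(cat_logits, cat_attns)):
        target = (labels == c).float().unsqueeze(1)          # binary mask of class c
        total = total + F.binary_cross_entropy_with_logits(logit, target)
        total = total + lam * F.binary_cross_entropy(attn, target)
    return total

def fuse_predictions(cat_logits):
    # Stack per-category probability maps; pick the most probable class per pixel.
    probs = torch.sigmoid(torch.cat(cat_logits, dim=1))      # (B, num_classes, H, W)
    return probs.argmax(dim=1)                               # (B, H, W) label map
```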
In order to increase attention to edge regions, the boundary information between different ground features can easily be obtained from the labels in the training samples. The structure of the label attention is therefore reused, with the label information replaced by boundary information, so that the model's attention is focused on the boundaries between different feature types.
The loss function part of the model adds a distance transform loss function (for the edge extraction task) and a root mean square error loss function (for the distance map extraction task) on top of MD TANet.
Computing the distance transform loss function first requires calculating the distance map of the ground-truth boundaries, in which each pixel stores its distance to the nearest boundary point.
The distance transform loss function is the sum of the distances at the points where the prediction map differs from the ground-truth map, as shown in equations (16) and (17):
$$L_{dt}=\sum_{p:\,\hat{y}(p)\neq y(p)} D(p)$$

wherein $\hat{y}(p)$ denotes the predicted value at pixel $p$, $y(p)$ the ground-truth value, and $D(p)$ the value of the distance map at $p$.
In the distance map extraction branch, since the values in the label's distance map are continuous, the common root mean square error loss function can be used to measure the distance between the prediction results and the labels, as shown in equation (18). Combining all the above, the final loss function consists of four parts: the attention loss function, the semantic segmentation loss function, the distance transform loss function, and the root mean square error loss function.
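A sketch of the distance map and the distance transform loss under the definitions above is given next; the use of SciPy's Euclidean distance transform is an implementation assumption.

```python
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def distance_map(edge_mask):
    """Distance map of the ground-truth edges: each pixel stores its Euclidean
    distance to the nearest boundary pixel (0 on the boundary itself)."""
    return torch.from_numpy(
        distance_transform_edt(1 - edge_mask.numpy()).astype(np.float32)
    )

def distance_transform_loss(pred_edges, gt_edges, dist_map):
    # Sum the distance values only at pixels where prediction and label disagree,
    # so mistakes far from any true boundary are penalized more heavily.
    wrong = (pred_edges != gt_edges).float()
    return (wrong * dist_map).sum()
```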
Overall, in order to improve feature extraction accuracy in edge regions, this paper additionally introduces the edge extraction task, which involves the following three optimizations.
(1) Add an edge attention module, increasing each decoder from three attention modules to four, so that the model pays more attention to extraction accuracy in the edge region.
(2) Add an edge extraction branch, which uses the distance transform loss function to sum the distance values of misjudged points and optimize the model specifically on the mispredicted points.
(3) Add a distance map generation branch, which transforms the edge detection task into a distance map generation task and directly adopts the root mean square error loss function, thereby improving extraction accuracy in the edge region.
In order to validate the effectiveness of the proposed multi-task learning based feature extraction method, test and comparison experiments are conducted on the Vaihingen, WHDLD, and DLRSD datasets in this chapter. This section describes the experiments in detail.
The Vaihingen dataset is a remote sensing image dataset for semantic segmentation tasks, including orthorectified remote sensing images and corresponding digital surface models. Each image is manually labeled with six ground object categories: impervious surfaces, buildings, low vegetation, trees, cars, and others.
The Wuhan Dense Labeling Dataset (WHDLD) is a remote sensing image dataset for semantic segmentation tasks, obtained by cropping a very large remote sensing image of Wuhan. The dataset is manually labeled with six ground object categories: buildings, roads, pavement, vegetation, bare soil, and water. It contains a total of 4940 RGB images with a ground sampling distance of 2 m and their corresponding label data; all images are 256×256.
The Densely Labeled Remote Sensing Dataset (DLRSD) is another remote sensing image dataset for semantic segmentation tasks. It contains 17 label categories: airplane, bare soil, buildings, cars, chaparral, court, dock, field, grass, mobile home, pavement, sand, sea, ship, tanks, trees, and water. The dataset contains a total of 2100 RGB images with a ground sampling distance of 0.3 m and their corresponding label data; all images are 256×256.
In order to verify the performance of the multi-task network model, the experiments are carried out on the servers of a high-performance computing platform; the configuration of the experimental platform is shown in Table 1.
Table 1. Experimental platform configuration
| | Name | Configuration |
|---|---|---|
| Hardware | CPU | Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz |
| | Memory | 128 GB |
| | GPU | Tesla P100 |
| | GPU memory | 16 GB |
| Software | Programming language | Python 3.7.15 |
| | Development environment | PyCharm |
| | Anaconda version | Conda 4.10.3 |
| | Deep learning framework | PyTorch 1.9.0 |
| | Operating system | CentOS 7.7 |
When evaluating the performance of deep learning models, the metrics generally fall into three aspects: accuracy, memory footprint, and frame rate.
AP and mAP. The average precision (AP) is associated with two metrics: precision and recall. True positives (TP) are samples with positive labels that are predicted as positive, and true negatives (TN) are negative samples that are also predicted as negative; the larger the proportion of TP and TN, the better. False positives (FP) are false detections, i.e., predicted positives that are actually negative samples, and false negatives (FN) are missed detections, i.e., predicted negatives that are actually positive samples. Precision is the proportion of true positives among all samples predicted as positive, as shown in Equation (20), and reflects the reliability of the model's detections:

$$\mathrm{Precision}=\frac{TP}{TP+FP} \quad (20)$$
Recall is the proportion of true positives among all actually positive samples, as shown in Equation (21), and reflects the model's completeness of detection:

$$\mathrm{Recall}=\frac{TP}{TP+FN} \quad (21)$$
For model performance, the higher the Precision and Recall values the better, but the two are often in tension with each other. To better evaluate an algorithm, the PR curve is drawn with Recall on the horizontal axis and Precision on the vertical axis, integrating the results of Precision and Recall. The mean precision over the curve is the average precision (AP), which is commonly used to evaluate the detection of a single category; the area under the PR curve is given by Equation (22):

$$AP=\int_0^1 P(R)\,dR \quad (22)$$
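A minimal numerical sketch of this integration is given below; the trapezoidal rule is an assumption here, and detection benchmarks often use interpolated variants instead.

```python
import numpy as np

def average_precision(precision, recall):
    """AP as the area under the PR curve, integrated over recall (Eq. (22))."""
    order = np.argsort(recall)                       # sort points by recall
    return float(np.trapz(np.asarray(precision)[order],
                          np.asarray(recall)[order]))
```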
Parameters and parameter size. Parameters refers to the number of parameters included in the model, i.e., the number of values the model must learn during training. Taking a convolution used in the model as an example, suppose the number of channels of the input feature map is $C_{in}$, the number of output channels is $C_{out}$, and the kernel size is $k\times k$; then the convolution contains $C_{in}\times k\times k\times C_{out}$ weights (plus $C_{out}$ bias terms). The number of parameters directly determines the model size: since models typically have very many parameters, the count is usually expressed in millions (M), and the corresponding model size on disk is measured in MB.
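A small sketch of counting parameters and estimating model size in PyTorch, assuming float32 storage (4 bytes per parameter):

```python
import torch.nn as nn

def parameter_count(model: nn.Module):
    """Number of learnable parameters, and the model size in MB (float32 assumed)."""
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n, n * 4 / 1024 ** 2

conv = nn.Conv2d(64, 128, kernel_size=3)   # 64*3*3*128 weights + 128 biases
print(parameter_count(conv))               # (73856, ~0.28 MB)
```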
FLOPs and FPS. FLOPs denotes the number of floating-point operations a model performs and measures its computational complexity; it is often used as an indirect measure of model speed (distinct from FLOPS, floating-point operations per second, which measures hardware throughput). The frame rate FPS indicates the number of frames processed per second, i.e., how many images can be processed per second or the time required to process one image, and evaluates the execution speed of the model: the shorter the time, the faster the model. MIoU. By counting the ratio of the intersection and union of the true and predicted regions, IoU measures the degree of overlap between the prediction results and the true labels:

$$IoU=\frac{TP}{TP+FP+FN}$$
In feature extraction, IoU denotes the ratio between the number of pixels in the intersection of the predicted target region and the corresponding labeled target region, and the number of pixels in the union of the two.
The mean intersection-over-union, mIoU, is the average of the IoU values over all categories: $\mathrm{mIoU}=\frac{1}{k}\sum_{i=1}^{k} IoU_i$, where $k$ is the number of categories.
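As an illustrative NumPy sketch, mIoU can be computed from a confusion matrix, where the diagonal gives per-class intersections and row/column sums give the unions:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU: per-class intersection-over-union, averaged over classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)       # confusion matrix
    inter = np.diag(cm).astype(np.float64)
    union = cm.sum(0) + cm.sum(1) - np.diag(cm)        # TP + FP + FN per class
    iou = inter / np.maximum(union, 1)                 # guard against empty classes
    return iou.mean(), iou
```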
In order to investigate the impact of multi-task learning, comparative experiments are conducted on the Vaihingen dataset against different feature extraction methods: U-Net, EANet, PSPNet, BiseNetv2, CCNet, RefineNet, EMANet, and Deeplabv3+.
The above models and the MD TANet model of this paper are trained under the experimental environment described in Section 4.2.1. During training, a multi-threaded approach is used to read data in order to speed up data loading. The batch size is set to 64 and the initial learning rate to 0.01; all models use the cosine strategy to adjust the learning rate, the SGD optimizer to update parameters, and 250 training epochs.
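This training setup can be summarized in a PyTorch sketch; here `model`, `train_set`, and `compute_loss` are assumed placeholders, and the momentum value and number of worker threads are assumptions not given in the paper.

```python
import torch
from torch.utils.data import DataLoader

# Hyperparameters from this section: batch size 64, initial lr 0.01,
# cosine learning-rate schedule, SGD optimizer, 250 epochs.
loader = DataLoader(train_set, batch_size=64, shuffle=True,
                    num_workers=8, pin_memory=True)    # multi-threaded data reading
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=250)

for epoch in range(250):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = compute_loss(model(images), labels)     # overall multi-task loss
        loss.backward()
        optimizer.step()
    scheduler.step()                                   # cosine learning-rate decay
```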
The difference between the model's predictions and the label ground truth is evaluated by the loss value: the smaller the loss, the better the model fits the ground truth. After training for 250 epochs, the loss curves during training are shown in Fig. 3; the loss values of all models converge, and the MD TANet model converges to the lowest loss value of 0.17, giving the best fit.

The loss value of different model training
In order to better evaluate network performance, five metrics are chosen to assess the detection and segmentation results: mAP, mIoU, FPS, Parameters, and FLOPs. With 0.5 used as the confidence threshold, the corresponding evaluation criteria are written mAP@50 and mIoU@50.
The evaluation results of the different models on the validation set are shown in Table 2. The experiments show that the MD TANet model has the best recognition accuracy on the Vaihingen validation set, with mAP@50 and mIoU@50 reaching 78.10% and 75.98%, respectively; the former is 1.16%~5.96% higher than the comparison methods and the latter 2.43%~6.38% higher. In terms of speed, the multi-task model MD TANet achieves high real-time performance of 90.72 FPS with only a small increase in the number of parameters. The comparative analysis of these results shows that the improved multi-task model MD TANet achieves a better balance between accuracy and speed and better performance in remote sensing video image feature extraction.
Table 2. Evaluation results of different models on the validation set
| Model | Input size | mAP@50 | mIoU@50 | FPS | Parameters (M) | FLOPs |
|---|---|---|---|---|---|---|
| U-Net | 256×256 | 73.86 | 70.62 | 82.42 | 10.57 | 25.23 |
| EANet | 256×256 | 75.88 | 72.31 | 94.22 | 7.93 | 28.13 |
| PSPNet | 256×256 | 72.14 | 69.60 | 73.35 | 11.49 | 22.12 |
| BiseNetv2 | 256×256 | 74.81 | 72.82 | 68.32 | 13.31 | 33.49 |
| CCNet | 256×256 | 76.94 | 73.55 | 76.22 | 10.77 | 53.35 |
| RefineNet | 256×256 | 76.42 | 72.48 | 88.65 | 9.42 | 40.44 |
| EMANet | 256×256 | 75.78 | 70.69 | 73.76 | 12.07 | 29.65 |
| Deeplabv3+ | 256×256 | 76.49 | 73.54 | 76.21 | 11.25 | 35.37 |
| MD TANet | 256×256 | 78.10 | 75.98 | 90.72 | 10.38 | 26.54 |
In order to further validate the effectiveness of the proposed multi-task model MD TANet, feature extraction experiments for specific ground features are conducted on the Vaihingen, WHDLD, and DLRSD datasets; for the DLRSD dataset, the comparison is restricted to the algorithms that performed better on the other datasets.
Comparison experiment results on the Vaihingen dataset. Using the Vaihingen dataset, the performance of this paper's method is compared with the comparison algorithms. The test results are shown in Fig. 4, and Table 3 lists the results of the different models. The mIoU of this paper's algorithm on the Vaihingen dataset reaches 75.98%; except for the car category (68.89%), whose extraction accuracy is slightly lower than that of RefineNet (69.02%) and CCNet (70.47%), all other categories and the overall accuracy achieve some improvement. Among the comparison methods, PSPNet has the lowest accuracy, 69.60%, mainly due to differences in the accuracy of the other categories, while Deeplabv3+ has a relatively high accuracy of 73.54%. Since the method in this chapter optimizes the edges of category objects, a consistent optimization across all categories should bring improvements in most of them, which the experimental results support. For the MD TANet method, accuracy on impervious surfaces and buildings is better, exceeding 80%. The lowest accuracy among the five main categories is for cars, at only 68.89%: cars have relatively low segmentation accuracy because the objects are significantly smaller and their features are not easily captured.

Comparison experiment results on the WHDLD dataset. The test results for the WHDLD dataset are shown in Fig. 5, and Table 4 lists the results of the different models. The mIoU of the MD TANet model on the WHDLD dataset improves on every comparison algorithm, reaching 67.55%, and the MD TANet method leads the other algorithms in the per-category IoU metrics. The multi-task learning based feature extraction method in this paper achieves an improvement of 3.95% to 6.61% over the comparison methods. For this paper's method, accuracy is significantly better in the vegetation and water categories, exceeding 82%; as water is easier to distinguish, it even exceeds 90%. Accuracy in the bare soil category is relatively low, with an IoU of only 48.96%.

Comparison experiment results on the DLRSD dataset. Using the DLRSD dataset, the performance of this paper's method is compared with some cutting-edge feature extraction algorithms. The test results are shown in Fig. 6, and Table 5 lists the results of the different models. The MD TANet model remains superior to the comparison methods on most categories of the DLRSD dataset, achieving an mIoU of 72.20%. Despite the large number of categories in this dataset, the MD TANet model achieves the highest feature extraction results on more than half of them; in particular, the chaparral (59.33%), court (84.99%), and ship (77.07%) categories achieve significant improvements over the other methods.

Test results on the Vaihingen data set
Table 3. Test results of different models on the Vaihingen dataset
| Models | Impervious surfaces | Buildings | Low vegetation | Trees | Cars | Other | Average (%) |
|---|---|---|---|---|---|---|---|
| MD TANet | 85.92 | 87.81 | 75.71 | 78.34 | 68.89 | 59.22 | 75.98 |
| U-Net | 80.25 | 80.34 | 67.55 | 72.26 | 65.75 | 57.55 | 70.62 |
| EANet | 81.49 | 83.84 | 71.53 | 74.96 | 66.86 | 55.16 | 72.31 |
| PSPNet | 80.56 | 80.15 | 67.71 | 69.71 | 67.62 | 51.85 | 69.60 |
| BiseNetv2 | 82.91 | 82.51 | 73.74 | 73.83 | 68.51 | 55.41 | 72.82 |
| CCNet | 82.42 | 83.57 | 73.15 | 73.84 | 70.47 | 57.82 | 73.55 |
| RefineNet | 83.14 | 82.21 | 71.87 | 70.28 | 69.02 | 58.37 | 72.48 |
| EMANet | 81.51 | 81.84 | 69.88 | 69.37 | 66.18 | 55.33 | 70.69 |
| Deeplabv3+ | 83.93 | 83.54 | 73.57 | 74.21 | 69.52 | 56.48 | 73.54 |

Test results on the WHDLD data set
Table 4. Test results of different models on the WHDLD dataset
| Models | Buildings | Roads | Pavement | Vegetation | Bare soil | Water | Average (%) |
|---|---|---|---|---|---|---|---|
| MD TANet | 62.61 | 68.74 | 49.54 | 82.01 | 48.96 | 93.44 | 67.55 |
| U-Net | 60.41 | 59.42 | 42.01 | 78.71 | 36.97 | 90.79 | 61.39 |
| EANet | 53.11 | 63.63 | 47.76 | 79.21 | 35.15 | 90.76 | 61.60 |
| PSPNet | 60.12 | 62.75 | 46.89 | 73.29 | 32.99 | 92.86 | 61.54 |
| BiseNetv2 | 58.35 | 64.14 | 48.18 | 71.91 | 32.57 | 91.09 | 61.04 |
| CCNet | 61.51 | 64.25 | 45.86 | 77.95 | 39.03 | 93.02 | 63.60 |
| RefineNet | 61.14 | 63.02 | 42.01 | 79.39 | 37.72 | 93.29 | 62.76 |
| EMANet | 59.69 | 62.73 | 47.87 | 73.11 | 37.36 | 92.18 | 62.16 |
| Deeplabv3+ | 61.57 | 62.93 | 46.98 | 79.62 | 35.05 | 92.19 | 63.06 |

Test results on the DLRSD data set
Table 5. Test results of different models on the DLRSD dataset
| Models | MD TANet | EANet | CCNet | RefineNet | Deeplabv3+ |
|---|---|---|---|---|---|
| Airplane | 68.30 | 64.65 | 62.58 | 62.89 | 64.76 |
| Bare soil | 59.05 | 53.16 | 54.51 | 53.89 | 53.49 |
| Buildings | 71.44 | 74.51 | 73.38 | 72.40 | 72.39 |
| Cars | 67.53 | 62.20 | 65.23 | 64.38 | 65.20 |
| Chaparral | 59.33 | 53.02 | 51.38 | 53.11 | 53.55 |
| Court | 84.99 | 81.28 | 80.53 | 81.72 | 80.22 |
| Dock | 54.60 | 42.82 | 46.30 | 46.88 | 45.73 |
| Field | 91.83 | 93.60 | 90.93 | 94.66 | 93.25 |
| Grass | 64.28 | 66.18 | 65.11 | 64.78 | 64.88 |
| Mobile home | 61.07 | 63.96 | 62.10 | 61.06 | 60.84 |
| Pavement | 77.61 | 76.94 | 73.11 | 75.71 | 76.83 |
| Sand | 65.71 | 71.78 | 62.75 | 68.89 | 68.18 |
| Sea | 96.73 | 93.25 | 90.79 | 92.56 | 91.86 |
| Ship | 77.07 | 65.04 | 70.23 | 72.28 | 70.15 |
| Tanks | 71.69 | 76.44 | 75.07 | 73.61 | 68.31 |
| Trees | 74.76 | 72.60 | 73.35 | 69.93 | 71.01 |
| Water | 83.44 | 82.18 | 81.75 | 80.98 | 80.04 |
| Average (%) | 72.20 | 70.21 | 69.36 | 69.98 | 69.45 |
In this study, a multi-task learning feature extraction method, MD TANet, is developed specifically for the problem of multi-category feature extraction in high-resolution remote sensing video images, effectively improving feature extraction accuracy. Through the introduced multi-task learning framework, a multi-class semantic segmentation problem is transformed into multiple binary semantic segmentation problems, which avoids competition for parameters between different classes and reduces misclassification among similar features.
The experimental results confirm that MD TANet, combined with the multi-task learning framework, outperforms the other feature extraction methods in overall accuracy, with average precision and mIoU improving by 1.16% to 5.96% and 2.43% to 6.38%, respectively, while exceeding 90 FPS in frame rate. The overall mIoU values of the MD TANet model for feature extraction on the Vaihingen, WHDLD, and DLRSD datasets are 75.98%, 67.55%, and 72.20%, respectively, higher than the comparison methods for most categories of ground features. The MD TANet network structure proposed in this paper can meet the demands of daily high-resolution remote sensing video image processing tasks, with an emphasis on innovation and practicality.
This paper proposes a multi-decoder triple attention model based on multi-task learning and makes some progress on the task of recognizing multiple types of ground features in remote sensing images, but some shortcomings remain and deserve more in-depth follow-up research: (1) The study areas of this paper have relatively simple topographic environments and rich remote sensing resources; follow-up research can address areas with complex terrain and areas where cloud and rain make it difficult to obtain high-quality remote sensing data. (2) The feature extraction accuracy of this method still has some room for improvement.
